
Audio Analysis Agent: Music Classification, Speaker Identification, and Sound Events

Build an audio analysis agent in Python that classifies music genres, identifies speakers through diarization, and detects sound events. Covers audio feature extraction, classification models, and structured audio understanding.

Audio as a First-Class Modality

While text and images dominate AI agent discussions, audio carries information that other modalities cannot. Tone of voice reveals sentiment. Background sounds provide context. Speaker identity matters for meeting transcription. An audio analysis agent goes beyond simple speech-to-text — it understands the full audio landscape.

Core Dependencies

pip install openai librosa numpy soundfile torch torchaudio
pip install pyannote.audio  # for speaker diarization

Audio Feature Extraction

Before classification, extract meaningful features from raw audio. Librosa provides the standard toolkit for audio feature analysis:

import librosa
import numpy as np
from dataclasses import dataclass


@dataclass
class AudioFeatures:
    duration_seconds: float
    sample_rate: int
    tempo: float
    spectral_centroid_mean: float
    mfcc_means: list[float]
    rms_energy_mean: float
    zero_crossing_rate: float
    is_speech_likely: bool


def extract_features(audio_path: str) -> AudioFeatures:
    """Extract audio features for classification."""
    y, sr = librosa.load(audio_path, sr=22050)
    duration = librosa.get_duration(y=y, sr=sr)

    # Tempo estimation
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    tempo_value = float(tempo) if np.isscalar(tempo) else float(tempo[0])

    # Spectral features
    spectral_centroid = librosa.feature.spectral_centroid(
        y=y, sr=sr
    )

    # MFCCs — standard for audio classification
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfcc_means = [float(np.mean(mfccs[i])) for i in range(13)]

    # Energy
    rms = librosa.feature.rms(y=y)

    # Zero crossing rate — high for speech, low for music
    zcr = librosa.feature.zero_crossing_rate(y)

    zcr_mean = float(np.mean(zcr))
    # Heuristic thresholds — rough starting points; tune on your own corpus
    is_speech = zcr_mean > 0.05 and float(np.mean(rms)) < 0.1

    return AudioFeatures(
        duration_seconds=duration,
        sample_rate=sr,
        tempo=tempo_value,
        spectral_centroid_mean=float(np.mean(spectral_centroid)),
        mfcc_means=mfcc_means,
        rms_energy_mean=float(np.mean(rms)),
        zero_crossing_rate=zcr_mean,
        is_speech_likely=is_speech,
    )
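To build intuition for the zero-crossing-rate heuristic above: a low-pitched tone crosses zero only a few hundred times per second, while noise-like signals (and unvoiced speech sounds) cross almost every other sample. This standalone numpy sketch computes the rate directly — librosa's `zero_crossing_rate` works frame-by-frame but measures the same thing:

```python
import numpy as np


def zero_crossing_rate(signal: np.ndarray) -> float:
    """Fraction of adjacent-sample pairs whose signs differ."""
    signs = np.sign(signal)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return float(np.mean(signs[:-1] != signs[1:]))


sr = 22050
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 220 * t)            # 220 Hz sine: 440 crossings/s
noisy = np.random.default_rng(0).normal(size=sr)  # white noise: crosses constantly

print(zero_crossing_rate(low_tone))  # ≈ 2 * 220 / 22050 ≈ 0.02
print(zero_crossing_rate(noisy))     # ≈ 0.5 for white noise
```

A pure tone sits well below the 0.05 speech threshold used above, while noise-like material lands far above it.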

Speaker Diarization

Speaker diarization answers the question "who spoke when" — essential for meeting transcription and multi-party audio analysis:

from pyannote.audio import Pipeline
import torch


@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float
    duration: float


def diarize_speakers(
    audio_path: str, hf_token: str
) -> list[SpeakerSegment]:
    """Identify different speakers and their time segments."""
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token,
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    pipeline = pipeline.to(device)

    diarization = pipeline(audio_path)

    segments = []
    for turn, _, speaker in diarization.itertracks(
        yield_label=True
    ):
        segments.append(SpeakerSegment(
            speaker=speaker,
            start=round(turn.start, 2),
            end=round(turn.end, 2),
            duration=round(turn.end - turn.start, 2),
        ))

    return segments
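With segments in hand, downstream questions like "who dominated the meeting?" reduce to a simple aggregation over the dataclass. A self-contained sketch (the dataclass is repeated here so the snippet runs on its own):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float
    duration: float


def talk_time_by_speaker(segments: list[SpeakerSegment]) -> dict[str, float]:
    """Total seconds each diarized speaker held the floor."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.duration
    return dict(totals)


segments = [
    SpeakerSegment("SPEAKER_00", 0.0, 4.0, 4.0),
    SpeakerSegment("SPEAKER_01", 4.0, 6.0, 2.0),
    SpeakerSegment("SPEAKER_00", 6.0, 9.5, 3.5),
]
print(talk_time_by_speaker(segments))  # {'SPEAKER_00': 7.5, 'SPEAKER_01': 2.0}
```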

Transcription with Speaker Labels

Combine diarization with Whisper transcription to produce speaker-labeled transcripts:


import openai


async def transcribe_with_speakers(
    audio_path: str,
    segments: list[SpeakerSegment],
    client: openai.AsyncOpenAI,
) -> list[dict]:
    """Transcribe audio and align with speaker diarization."""
    # First, get the full transcript with timestamps
    with open(audio_path, "rb") as f:
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )

    # Align transcript segments with speaker labels.
    # With openai>=1.x, verbose_json segments come back as objects,
    # so use attribute access rather than dict indexing.
    labeled_segments = []
    for t_seg in transcript.segments:
        seg_mid = (t_seg.start + t_seg.end) / 2
        speaker = "Unknown"
        for s_seg in segments:
            if s_seg.start <= seg_mid <= s_seg.end:
                speaker = s_seg.speaker
                break

        labeled_segments.append({
            "speaker": speaker,
            "start": t_seg.start,
            "end": t_seg.end,
            "text": t_seg.text.strip(),
        })

    return labeled_segments
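The aligned segments are plain dicts, so rendering a human-readable transcript is a small formatting pass. A sketch that merges consecutive turns by the same speaker — the `[mm:ss] SPEAKER: text` convention here is illustrative, not a library API:

```python
def format_transcript(labeled_segments: list[dict]) -> str:
    """Render aligned segments as '[mm:ss] SPEAKER: text' lines,
    merging consecutive turns by the same speaker."""
    lines: list[str] = []
    last_speaker = None
    for seg in labeled_segments:
        if seg["speaker"] == last_speaker and lines:
            lines[-1] += " " + seg["text"]  # same speaker: extend the turn
            continue
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"[{m:02d}:{s:02d}] {seg['speaker']}: {seg['text']}")
        last_speaker = seg["speaker"]
    return "\n".join(lines)


demo = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 3.1, "text": "Hi, thanks for joining."},
    {"speaker": "SPEAKER_00", "start": 3.1, "end": 5.0, "text": "Let's get started."},
    {"speaker": "SPEAKER_01", "start": 65.4, "end": 68.0, "text": "Sounds good."},
]
print(format_transcript(demo))
```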

Sound Event Detection

Beyond speech, detect environmental sounds and events:

async def detect_sound_events(
    audio_path: str, client: openai.AsyncOpenAI
) -> list[dict]:
    """Use GPT-4o audio capabilities to detect sound events."""
    # Encode audio for API
    import base64

    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    b64_audio = base64.b64encode(audio_bytes).decode()

    # GPT-4o with audio understanding
    response = await client.chat.completions.create(
        model="gpt-4o-audio-preview",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Listen to this audio and identify all "
                        "distinct sound events. For each event, "
                        "provide the approximate timestamp, type "
                        "of sound, and confidence level. Return a "
                        "JSON object with an \"events\" array."
                    ),
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": b64_audio,
                        "format": "wav",
                    },
                },
            ],
        }],
        response_format={"type": "json_object"},
    )

    import json
    result = json.loads(response.choices[0].message.content)
    return result.get("events", [])
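Model-produced JSON should be treated as untrusted: keys and value types can drift between runs. A defensive normalizer — the field names (`timestamp`, `type`, `confidence`) mirror the prompt above but are assumptions about what the model will actually emit:

```python
import json


def normalize_events(raw: str) -> list[dict]:
    """Keep only well-formed events: a timestamp, a sound type, and a
    confidence clamped to [0, 1]. Everything else is dropped."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    events = payload.get("events", []) if isinstance(payload, dict) else []
    cleaned = []
    for ev in events:
        if not isinstance(ev, dict):
            continue
        if "timestamp" not in ev or "type" not in ev:
            continue  # missing required fields
        try:
            conf = min(1.0, max(0.0, float(ev.get("confidence", 0.0))))
        except (TypeError, ValueError):
            conf = 0.0
        cleaned.append({
            "timestamp": ev["timestamp"],
            "type": ev["type"],
            "confidence": conf,
        })
    return cleaned


raw = ('{"events": [{"timestamp": "0:04", "type": "door slam",'
       ' "confidence": 1.3}, {"type": "incomplete"}]}')
print(normalize_events(raw))
```

The out-of-range confidence is clamped to 1.0 and the incomplete event is discarded.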

The Audio Analysis Agent

class AudioAnalysisAgent:
    def __init__(self, hf_token: str | None = None):
        self.client = openai.AsyncOpenAI()
        self.hf_token = hf_token

    async def analyze(self, audio_path: str) -> dict:
        features = extract_features(audio_path)

        result = {
            "duration": features.duration_seconds,
            "tempo": features.tempo,
            "is_speech": features.is_speech_likely,
        }

        if features.is_speech_likely and self.hf_token:
            segments = diarize_speakers(audio_path, self.hf_token)
            transcript = await transcribe_with_speakers(
                audio_path, segments, self.client
            )
            unique_speakers = {s.speaker for s in segments}
            result["speaker_count"] = len(unique_speakers)
            result["transcript"] = transcript
        else:
            events = await detect_sound_events(
                audio_path, self.client
            )
            result["sound_events"] = events

        return result

FAQ

What is the difference between speaker diarization and speaker identification?

Speaker diarization answers "who spoke when" by segmenting audio into speaker turns and labeling them as Speaker 1, Speaker 2, and so on — without knowing who those speakers are. Speaker identification matches voice segments against a known database of speaker voiceprints to determine the actual identity. Diarization is unsupervised and works on any audio, while identification requires pre-enrolled speaker profiles.
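To turn anonymous diarization labels into identities, compare a segment's voice embedding against enrolled voiceprints. The sketch below uses random vectors as stand-in embeddings (real ones would come from an embedding model such as pyannote's); the 192 dimensions and the 0.7 threshold are illustrative assumptions:

```python
import numpy as np


def identify_speaker(
    embedding: np.ndarray,
    enrolled: dict[str, np.ndarray],
    threshold: float = 0.7,
) -> str:
    """Match a voice embedding against enrolled voiceprints by cosine
    similarity; if nothing clears the threshold, report 'unknown'."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_name, best_score = "unknown", threshold
    for name, ref in enrolled.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


rng = np.random.default_rng(1)
alice = rng.normal(size=192)  # pretend 192-dim voiceprints
bob = rng.normal(size=192)
enrolled = {"alice": alice, "bob": bob}

probe = alice + 0.1 * rng.normal(size=192)  # noisy sample of alice's voice
print(identify_speaker(probe, enrolled))    # alice
print(identify_speaker(rng.normal(size=192), enrolled))  # unknown
```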

How accurate is pyannote for speaker diarization in noisy environments?

Pyannote 3.1 achieves strong results in clean recordings (under 5% diarization error rate) but degrades in noisy environments, overlapping speech, and phone-quality audio. For noisy recordings, preprocess with noise reduction (using libraries like noisereduce) before diarization. Also consider increasing the minimum segment duration to avoid spurious speaker switches caused by noise.
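noisereduce implements spectral gating; the underlying idea can be sketched in plain numpy as a cruder time-domain gate that silences frames whose energy falls below a fraction of the median frame energy. This illustrates the gating concept only — it is not a substitute for the real library:

```python
import numpy as np


def noise_gate(y: np.ndarray, sr: int, frame_ms: int = 25,
               threshold_ratio: float = 0.5) -> np.ndarray:
    """Zero out frames whose RMS falls below a fraction of the median
    frame RMS — a crude time-domain stand-in for spectral gating."""
    frame = int(sr * frame_ms / 1000)
    out = y.copy()
    n_frames = len(y) // frame
    rms = np.array([
        np.sqrt(np.mean(out[i * frame:(i + 1) * frame] ** 2))
        for i in range(n_frames)
    ])
    gate = threshold_ratio * np.median(rms)
    for i in range(n_frames):
        if rms[i] < gate:
            out[i * frame:(i + 1) * frame] = 0.0  # silence quiet frames
    return out


sr = 16000
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(sr) / sr)  # loud "speech" second
hiss = 0.01 * rng.normal(size=sr)                      # quiet noise-floor second
y = np.concatenate([speech, hiss])
cleaned = noise_gate(y, sr)
print(np.abs(cleaned[sr:]).max())  # 0.0 — the noise half is gated out
```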

Can I classify music genres using the extracted audio features?

Yes. The MFCC features, spectral centroid, tempo, and zero crossing rate are the classic features used for genre classification. Train a simple classifier (random forest or small neural network) on a labeled dataset like GTZAN. Alternatively, skip manual feature engineering and use a pretrained audio classification model like those from Hugging Face's audio transformers, which accept raw waveforms and output genre labels directly.
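As a minimal illustration of feature-based classification, here is a nearest-centroid rule over hand-picked genre centroids. The centroid values are entirely made up for the example; real work would fit a random forest on labeled GTZAN features instead:

```python
import numpy as np


def nearest_centroid_genre(
    features: np.ndarray,
    centroids: dict[str, np.ndarray],
) -> str:
    """Assign the genre whose mean feature vector is closest in
    Euclidean distance — the simplest feature-based classifier."""
    return min(
        centroids,
        key=lambda g: float(np.linalg.norm(features - centroids[g])),
    )


# Toy 3-dim vectors: [tempo/200, spectral_centroid/5000, zcr]
centroids = {
    "classical": np.array([0.35, 0.30, 0.04]),
    "metal":     np.array([0.75, 0.80, 0.15]),
}
track = np.array([0.40, 0.35, 0.05])  # slow tempo, dark timbre, low ZCR
print(nearest_centroid_genre(track, centroids))  # classical
```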


#AudioAnalysis #SpeakerDiarization #SoundClassification #AudioFeatures #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
