
Audio Analysis Agent: Music Classification, Speaker Identification, and Sound Events

Build an audio analysis agent in Python that classifies music genres, identifies speakers through diarization, and detects sound events. Covers audio feature extraction, classification models, and structured audio understanding.

Audio as a First-Class Modality

While text and images dominate AI agent discussions, audio carries information that other modalities cannot. Tone of voice reveals sentiment. Background sounds provide context. Speaker identity matters for meeting transcription. An audio analysis agent goes beyond simple speech-to-text — it understands the full audio landscape.

Core Dependencies

pip install openai librosa numpy soundfile torch torchaudio
pip install pyannote.audio  # for speaker diarization

Audio Feature Extraction

Before classification, extract meaningful features from raw audio. Librosa provides the standard toolkit for audio feature analysis:

import librosa
import numpy as np
from dataclasses import dataclass


@dataclass
class AudioFeatures:
    duration_seconds: float
    sample_rate: int
    tempo: float
    spectral_centroid_mean: float
    mfcc_means: list[float]
    rms_energy_mean: float
    zero_crossing_rate: float
    is_speech_likely: bool


def extract_features(audio_path: str) -> AudioFeatures:
    """Extract audio features for classification."""
    y, sr = librosa.load(audio_path, sr=22050)
    duration = librosa.get_duration(y=y, sr=sr)

    # Tempo estimation
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    tempo_value = float(tempo) if np.isscalar(tempo) else float(tempo[0])

    # Spectral features
    spectral_centroid = librosa.feature.spectral_centroid(
        y=y, sr=sr
    )

    # MFCCs — standard for audio classification
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfcc_means = [float(np.mean(mfccs[i])) for i in range(13)]

    # Energy
    rms = librosa.feature.rms(y=y)

    # Zero crossing rate — high for speech, low for music
    zcr = librosa.feature.zero_crossing_rate(y)

    zcr_mean = float(np.mean(zcr))
    # Heuristic thresholds — rough starting points; tune on your own corpus
    is_speech = zcr_mean > 0.05 and float(np.mean(rms)) < 0.1

    return AudioFeatures(
        duration_seconds=duration,
        sample_rate=sr,
        tempo=tempo_value,
        spectral_centroid_mean=float(np.mean(spectral_centroid)),
        mfcc_means=mfcc_means,
        rms_energy_mean=float(np.mean(rms)),
        zero_crossing_rate=zcr_mean,
        is_speech_likely=is_speech,
    )
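To build intuition for the zero-crossing-rate heuristic above: a low-pitched tone crosses zero only a few hundred times per second, while noise-like signals (and unvoiced speech sounds) cross almost every other sample. This standalone numpy sketch computes the rate directly — librosa's `zero_crossing_rate` works frame-by-frame but measures the same thing:

```python
import numpy as np


def zero_crossing_rate(signal: np.ndarray) -> float:
    """Fraction of adjacent-sample pairs whose signs differ."""
    signs = np.sign(signal)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return float(np.mean(signs[:-1] != signs[1:]))


sr = 22050
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 220 * t)            # 220 Hz sine: 440 crossings/s
noisy = np.random.default_rng(0).normal(size=sr)  # white noise: crosses constantly

print(zero_crossing_rate(low_tone))  # ≈ 2 * 220 / 22050 ≈ 0.02
print(zero_crossing_rate(noisy))     # ≈ 0.5 for white noise
```

A pure tone sits well below the 0.05 speech threshold used above, while noise-like material lands far above it.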

Speaker Diarization

Speaker diarization answers the question "who spoke when" — essential for meeting transcription and multi-party audio analysis:

from pyannote.audio import Pipeline
import torch


@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float
    duration: float


def diarize_speakers(
    audio_path: str, hf_token: str
) -> list[SpeakerSegment]:
    """Identify different speakers and their time segments."""
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token,
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    pipeline = pipeline.to(device)

    diarization = pipeline(audio_path)

    segments = []
    for turn, _, speaker in diarization.itertracks(
        yield_label=True
    ):
        segments.append(SpeakerSegment(
            speaker=speaker,
            start=round(turn.start, 2),
            end=round(turn.end, 2),
            duration=round(turn.end - turn.start, 2),
        ))

    return segments
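With segments in hand, downstream questions like "who dominated the meeting?" reduce to a simple aggregation over the dataclass. A self-contained sketch (the dataclass is repeated here so the snippet runs on its own):

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float
    duration: float


def talk_time_by_speaker(segments: list[SpeakerSegment]) -> dict[str, float]:
    """Total seconds each diarized speaker held the floor."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.duration
    return dict(totals)


segments = [
    SpeakerSegment("SPEAKER_00", 0.0, 4.0, 4.0),
    SpeakerSegment("SPEAKER_01", 4.0, 6.0, 2.0),
    SpeakerSegment("SPEAKER_00", 6.0, 9.5, 3.5),
]
print(talk_time_by_speaker(segments))  # {'SPEAKER_00': 7.5, 'SPEAKER_01': 2.0}
```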

Transcription with Speaker Labels

Combine diarization with Whisper transcription to produce speaker-labeled transcripts:


import openai


async def transcribe_with_speakers(
    audio_path: str,
    segments: list[SpeakerSegment],
    client: openai.AsyncOpenAI,
) -> list[dict]:
    """Transcribe audio and align with speaker diarization."""
    # First, get the full transcript with timestamps
    with open(audio_path, "rb") as f:
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )

    # Align transcript segments with speaker labels.
    # With openai>=1.x, verbose_json segments come back as objects,
    # so use attribute access rather than dict indexing.
    labeled_segments = []
    for t_seg in transcript.segments:
        seg_mid = (t_seg.start + t_seg.end) / 2
        speaker = "Unknown"
        for s_seg in segments:
            if s_seg.start <= seg_mid <= s_seg.end:
                speaker = s_seg.speaker
                break

        labeled_segments.append({
            "speaker": speaker,
            "start": t_seg.start,
            "end": t_seg.end,
            "text": t_seg.text.strip(),
        })

    return labeled_segments
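The aligned segments are plain dicts, so rendering a human-readable transcript is a small formatting pass. A sketch that merges consecutive turns by the same speaker — the `[mm:ss] SPEAKER: text` convention here is illustrative, not a library API:

```python
def format_transcript(labeled_segments: list[dict]) -> str:
    """Render aligned segments as '[mm:ss] SPEAKER: text' lines,
    merging consecutive turns by the same speaker."""
    lines: list[str] = []
    last_speaker = None
    for seg in labeled_segments:
        if seg["speaker"] == last_speaker and lines:
            lines[-1] += " " + seg["text"]  # same speaker: extend the turn
            continue
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"[{m:02d}:{s:02d}] {seg['speaker']}: {seg['text']}")
        last_speaker = seg["speaker"]
    return "\n".join(lines)


demo = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 3.1, "text": "Hi, thanks for joining."},
    {"speaker": "SPEAKER_00", "start": 3.1, "end": 5.0, "text": "Let's get started."},
    {"speaker": "SPEAKER_01", "start": 65.4, "end": 68.0, "text": "Sounds good."},
]
print(format_transcript(demo))
```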

Sound Event Detection

Beyond speech, detect environmental sounds and events:

async def detect_sound_events(
    audio_path: str, client: openai.AsyncOpenAI
) -> list[dict]:
    """Use GPT-4o audio capabilities to detect sound events."""
    # Encode audio for API
    import base64

    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    b64_audio = base64.b64encode(audio_bytes).decode()

    # GPT-4o with audio understanding
    response = await client.chat.completions.create(
        model="gpt-4o-audio-preview",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Listen to this audio and identify all "
                        "distinct sound events. For each event, "
                        "provide the approximate timestamp, type "
                        "of sound, and confidence level. Return a "
                        "JSON object with an \"events\" array."
                    ),
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": b64_audio,
                        "format": "wav",
                    },
                },
            ],
        }],
        response_format={"type": "json_object"},
    )

    import json
    result = json.loads(response.choices[0].message.content)
    return result.get("events", [])
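Model-produced JSON should be treated as untrusted: keys and value types can drift between runs. A defensive normalizer — the field names (`timestamp`, `type`, `confidence`) mirror the prompt above but are assumptions about what the model will actually emit:

```python
import json


def normalize_events(raw: str) -> list[dict]:
    """Keep only well-formed events: a timestamp, a sound type, and a
    confidence clamped to [0, 1]. Everything else is dropped."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    events = payload.get("events", []) if isinstance(payload, dict) else []
    cleaned = []
    for ev in events:
        if not isinstance(ev, dict):
            continue
        if "timestamp" not in ev or "type" not in ev:
            continue  # missing required fields
        try:
            conf = min(1.0, max(0.0, float(ev.get("confidence", 0.0))))
        except (TypeError, ValueError):
            conf = 0.0
        cleaned.append({
            "timestamp": ev["timestamp"],
            "type": ev["type"],
            "confidence": conf,
        })
    return cleaned


raw = ('{"events": [{"timestamp": "0:04", "type": "door slam",'
       ' "confidence": 1.3}, {"type": "incomplete"}]}')
print(normalize_events(raw))
```

The out-of-range confidence is clamped to 1.0 and the incomplete event is discarded.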

The Audio Analysis Agent

class AudioAnalysisAgent:
    def __init__(self, hf_token: str | None = None):
        self.client = openai.AsyncOpenAI()
        self.hf_token = hf_token

    async def analyze(self, audio_path: str) -> dict:
        features = extract_features(audio_path)

        result = {
            "duration": features.duration_seconds,
            "tempo": features.tempo,
            "is_speech": features.is_speech_likely,
        }

        if features.is_speech_likely and self.hf_token:
            segments = diarize_speakers(audio_path, self.hf_token)
            transcript = await transcribe_with_speakers(
                audio_path, segments, self.client
            )
            unique_speakers = {s.speaker for s in segments}
            result["speaker_count"] = len(unique_speakers)
            result["transcript"] = transcript
        else:
            events = await detect_sound_events(
                audio_path, self.client
            )
            result["sound_events"] = events

        return result

FAQ

What is the difference between speaker diarization and speaker identification?

Speaker diarization answers "who spoke when" by segmenting audio into speaker turns and labeling them as Speaker 1, Speaker 2, and so on — without knowing who those speakers are. Speaker identification matches voice segments against a known database of speaker voiceprints to determine the actual identity. Diarization is unsupervised and works on any audio, while identification requires pre-enrolled speaker profiles.
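To turn anonymous diarization labels into identities, compare a segment's voice embedding against enrolled voiceprints. The sketch below uses random vectors as stand-in embeddings (real ones would come from an embedding model such as pyannote's); the 192 dimensions and the 0.7 threshold are illustrative assumptions:

```python
import numpy as np


def identify_speaker(
    embedding: np.ndarray,
    enrolled: dict[str, np.ndarray],
    threshold: float = 0.7,
) -> str:
    """Match a voice embedding against enrolled voiceprints by cosine
    similarity; if nothing clears the threshold, report 'unknown'."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_name, best_score = "unknown", threshold
    for name, ref in enrolled.items():
        score = cosine(embedding, ref)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


rng = np.random.default_rng(1)
alice = rng.normal(size=192)  # pretend 192-dim voiceprints
bob = rng.normal(size=192)
enrolled = {"alice": alice, "bob": bob}

probe = alice + 0.1 * rng.normal(size=192)  # noisy sample of alice's voice
print(identify_speaker(probe, enrolled))    # alice
print(identify_speaker(rng.normal(size=192), enrolled))  # unknown
```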

How accurate is pyannote for speaker diarization in noisy environments?

Pyannote 3.1 achieves strong results in clean recordings (under 5% diarization error rate) but degrades in noisy environments, overlapping speech, and phone-quality audio. For noisy recordings, preprocess with noise reduction (using libraries like noisereduce) before diarization. Also consider increasing the minimum segment duration to avoid spurious speaker switches caused by noise.
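noisereduce implements spectral gating; the underlying idea can be sketched in plain numpy as a cruder time-domain gate that silences frames whose energy falls below a fraction of the median frame energy. This illustrates the gating concept only — it is not a substitute for the real library:

```python
import numpy as np


def noise_gate(y: np.ndarray, sr: int, frame_ms: int = 25,
               threshold_ratio: float = 0.5) -> np.ndarray:
    """Zero out frames whose RMS falls below a fraction of the median
    frame RMS — a crude time-domain stand-in for spectral gating."""
    frame = int(sr * frame_ms / 1000)
    out = y.copy()
    n_frames = len(y) // frame
    rms = np.array([
        np.sqrt(np.mean(out[i * frame:(i + 1) * frame] ** 2))
        for i in range(n_frames)
    ])
    gate = threshold_ratio * np.median(rms)
    for i in range(n_frames):
        if rms[i] < gate:
            out[i * frame:(i + 1) * frame] = 0.0  # silence quiet frames
    return out


sr = 16000
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(sr) / sr)  # loud "speech" second
hiss = 0.01 * rng.normal(size=sr)                      # quiet noise-floor second
y = np.concatenate([speech, hiss])
cleaned = noise_gate(y, sr)
print(np.abs(cleaned[sr:]).max())  # 0.0 — the noise half is gated out
```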

Can I classify music genres using the extracted audio features?

Yes. The MFCC features, spectral centroid, tempo, and zero crossing rate are the classic features used for genre classification. Train a simple classifier (random forest or small neural network) on a labeled dataset like GTZAN. Alternatively, skip manual feature engineering and use a pretrained audio classification model like those from Hugging Face's audio transformers, which accept raw waveforms and output genre labels directly.
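As a minimal illustration of feature-based classification, here is a nearest-centroid rule over hand-picked genre centroids. The centroid values are entirely made up for the example; real work would fit a random forest on labeled GTZAN features instead:

```python
import numpy as np


def nearest_centroid_genre(
    features: np.ndarray,
    centroids: dict[str, np.ndarray],
) -> str:
    """Assign the genre whose mean feature vector is closest in
    Euclidean distance — the simplest feature-based classifier."""
    return min(
        centroids,
        key=lambda g: float(np.linalg.norm(features - centroids[g])),
    )


# Toy 3-dim vectors: [tempo/200, spectral_centroid/5000, zcr]
centroids = {
    "classical": np.array([0.35, 0.30, 0.04]),
    "metal":     np.array([0.75, 0.80, 0.15]),
}
track = np.array([0.40, 0.35, 0.05])  # slow tempo, dark timbre, low ZCR
print(nearest_centroid_genre(track, centroids))  # classical
```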


#AudioAnalysis #SpeakerDiarization #SoundClassification #AudioFeatures #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
