---
title: "Audio Analysis Agent: Music Classification, Speaker Identification, and Sound Events"
description: "Build an audio analysis agent in Python that classifies music genres, identifies speakers through diarization, and detects sound events. Covers audio feature extraction, classification models, and structured audio understanding."
canonical: https://callsphere.ai/blog/audio-analysis-agent-music-classification-speaker-identification-sound-events
category: "Learn Agentic AI"
tags: ["Audio Analysis", "Speaker Diarization", "Sound Classification", "Audio Features", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:43.728Z
---

# Audio Analysis Agent: Music Classification, Speaker Identification, and Sound Events

> Build an audio analysis agent in Python that classifies music genres, identifies speakers through diarization, and detects sound events. Covers audio feature extraction, classification models, and structured audio understanding.

## Audio as a First-Class Modality

While text and images dominate AI agent discussions, audio carries information that other modalities cannot. Tone of voice reveals sentiment. Background sounds provide context. Speaker identity matters for meeting transcription. An audio analysis agent goes beyond simple speech-to-text — it understands the full audio landscape.

## Core Dependencies

```bash
pip install openai librosa numpy soundfile torch torchaudio
pip install pyannote.audio  # for speaker diarization
```

## Audio Feature Extraction

Before classification, extract meaningful features from raw audio. Librosa provides the standard toolkit for audio feature analysis:

```python
import librosa
import numpy as np
from dataclasses import dataclass

@dataclass
class AudioFeatures:
    duration_seconds: float
    sample_rate: int
    tempo: float
    spectral_centroid_mean: float
    mfcc_means: list[float]
    rms_energy_mean: float
    zero_crossing_rate: float
    is_speech_likely: bool

def extract_features(audio_path: str) -> AudioFeatures:
    """Extract audio features for classification."""
    y, sr = librosa.load(audio_path, sr=22050)
    duration = librosa.get_duration(y=y, sr=sr)

    # Tempo estimation
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)
    tempo_value = float(tempo) if np.isscalar(tempo) else float(tempo[0])

    # Spectral features
    spectral_centroid = librosa.feature.spectral_centroid(
        y=y, sr=sr
    )

    # MFCCs — standard for audio classification
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfcc_means = [float(np.mean(mfccs[i])) for i in range(13)]

    # Energy
    rms = librosa.feature.rms(y=y)

    # Zero crossing rate — high for speech, low for music
    zcr = librosa.feature.zero_crossing_rate(y)

    zcr_mean = float(np.mean(zcr))
    # Heuristic: speech tends to have a high zero crossing rate and
    # non-trivial energy. The 0.05 and 0.01 thresholds are rough
    # starting points; tune them for your recordings.
    is_speech = zcr_mean > 0.05 and float(np.mean(rms)) > 0.01

    return AudioFeatures(
        duration_seconds=round(duration, 2),
        sample_rate=sr,
        tempo=tempo_value,
        spectral_centroid_mean=float(np.mean(spectral_centroid)),
        mfcc_means=mfcc_means,
        rms_energy_mean=float(np.mean(rms)),
        zero_crossing_rate=zcr_mean,
        is_speech_likely=is_speech,
    )
```

## Speaker Diarization

Speaker diarization answers "who spoke when." Pyannote's pretrained pipeline segments the audio into speaker turns; downloading the model requires a Hugging Face access token with the pyannote model terms accepted:

```python
import torch
from dataclasses import dataclass
from pyannote.audio import Pipeline

@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float
    duration: float

def diarize_speakers(
    audio_path: str,
    hf_token: str,
) -> list[SpeakerSegment]:
    """Identify different speakers and their time segments."""
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=hf_token,
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    pipeline = pipeline.to(device)

    diarization = pipeline(audio_path)

    segments = []
    for turn, _, speaker in diarization.itertracks(
        yield_label=True
    ):
        segments.append(SpeakerSegment(
            speaker=speaker,
            start=round(turn.start, 2),
            end=round(turn.end, 2),
            duration=round(turn.end - turn.start, 2),
        ))

    return segments
```
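The segment list is easy to post-process. As a quick sanity check on a meeting recording, a small helper can total each speaker's talk time. This is a standalone sketch that restates a minimal `SpeakerSegment` so it runs on its own; the labels mimic pyannote's `SPEAKER_00` naming convention:

```python
from collections import defaultdict
from dataclasses import dataclass

# Minimal restatement of the SpeakerSegment shape used above,
# so this snippet is self-contained.
@dataclass
class SpeakerSegment:
    speaker: str
    start: float
    end: float
    duration: float

def talk_time(segments: list[SpeakerSegment]) -> dict[str, float]:
    """Total seconds each speaker held the floor."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.duration
    return dict(totals)

segments = [
    SpeakerSegment("SPEAKER_00", 0.0, 4.5, 4.5),
    SpeakerSegment("SPEAKER_01", 4.5, 6.0, 1.5),
    SpeakerSegment("SPEAKER_00", 6.0, 8.0, 2.0),
]
print(talk_time(segments))  # {'SPEAKER_00': 6.5, 'SPEAKER_01': 1.5}
```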

## Transcription with Speaker Labels

Combine diarization with Whisper transcription to produce speaker-labeled transcripts:

```python
import openai

async def transcribe_with_speakers(
    audio_path: str,
    segments: list[SpeakerSegment],
    client: openai.AsyncOpenAI,
) -> list[dict]:
    """Transcribe audio and align with speaker diarization."""
    # First, get the full transcript with timestamps
    with open(audio_path, "rb") as f:
        transcript = await client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )

    # Align transcript segments with speaker labels
    labeled_segments = []
    for t_seg in transcript.segments:
        seg_mid = (t_seg.start + t_seg.end) / 2
        speaker = "Unknown"
        for s_seg in segments:
            if s_seg.start <= seg_mid <= s_seg.end:
                speaker = s_seg.speaker
                break
        labeled_segments.append({
            "speaker": speaker,
            "start": t_seg.start,
            "end": t_seg.end,
            "text": t_seg.text.strip(),
        })

    return labeled_segments
```

## Sound Event Detection

For non-speech audio, a multimodal model can describe what it hears. GPT-4o's audio-capable variants accept base64-encoded audio directly in chat messages:

```python
async def detect_sound_events(
    audio_path: str,
    client: openai.AsyncOpenAI,
) -> list[dict]:
    """Use GPT-4o audio capabilities to detect sound events."""
    # Encode audio for API
    import base64

    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    b64_audio = base64.b64encode(audio_bytes).decode()

    # GPT-4o with audio understanding
    response = await client.chat.completions.create(
        model="gpt-4o-audio-preview",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "Listen to this audio and identify all "
                        "distinct sound events. For each event, "
                        "provide the approximate timestamp, type "
                        "of sound, and confidence level. Return a "
                        "JSON object with an 'events' array."
                    ),
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": b64_audio,
                        "format": "wav",
                    },
                },
            ],
        }],
        response_format={"type": "json_object"},
    )

    import json
    result = json.loads(response.choices[0].message.content)
    return result.get("events", [])
```
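The model's JSON output is untyped and can be messy. A small normalization pass keeps downstream code defensive; this is a sketch in which the `timestamp`, `type`, and `confidence` keys are assumptions based on what the prompt asks for, not a guaranteed schema:

```python
def normalize_events(
    raw_events: list[dict],
    min_confidence: float = 0.5,
) -> list[dict]:
    """Drop malformed or low-confidence events and sort by timestamp."""
    cleaned = []
    for event in raw_events:
        try:
            ts = float(event["timestamp"])
            conf = float(event["confidence"])
        except (KeyError, TypeError, ValueError):
            continue  # skip events missing required fields
        if conf >= min_confidence:
            cleaned.append({
                "timestamp": ts,
                "type": str(event.get("type", "unknown")),
                "confidence": conf,
            })
    return sorted(cleaned, key=lambda e: e["timestamp"])

raw = [
    {"timestamp": "7.5", "type": "siren", "confidence": 0.9},
    {"timestamp": 2.0, "type": "dog bark", "confidence": 0.8},
    {"confidence": 0.99},  # malformed: no timestamp
    {"timestamp": 4.0, "type": "hum", "confidence": 0.2},  # low confidence
]
print(normalize_events(raw))  # dog bark at 2.0, then siren at 7.5
```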

## The Audio Analysis Agent

```python
class AudioAnalysisAgent:
    def __init__(self, hf_token: str | None = None):
        self.client = openai.AsyncOpenAI()
        self.hf_token = hf_token

    async def analyze(self, audio_path: str) -> dict:
        features = extract_features(audio_path)

        result = {
            "duration": features.duration_seconds,
            "tempo": features.tempo,
            "is_speech": features.is_speech_likely,
        }

        if features.is_speech_likely and self.hf_token:
            segments = diarize_speakers(audio_path, self.hf_token)
            transcript = await transcribe_with_speakers(
                audio_path, segments, self.client
            )
            unique_speakers = set(s.speaker for s in segments)
            result["speaker_count"] = len(unique_speakers)
            result["transcript"] = transcript
        else:
            events = await detect_sound_events(
                audio_path, self.client
            )
            result["sound_events"] = events

        return result
```

## FAQ

### What is the difference between speaker diarization and speaker identification?

Speaker diarization answers "who spoke when" by segmenting audio into speaker turns and labeling them as Speaker 1, Speaker 2, and so on — without knowing who those speakers are. Speaker identification matches voice segments against a known database of speaker voiceprints to determine the actual identity. Diarization is unsupervised and works on any audio, while identification requires pre-enrolled speaker profiles.
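The two tasks also differ mechanically: identification typically reduces to comparing a speaker embedding against enrolled voiceprints. Here is a toy sketch with made-up three-dimensional vectors; real embeddings (e.g. from pyannote's embedding models) have hundreds of dimensions, but the cosine-similarity matching is the same:

```python
import numpy as np

def identify_speaker(
    query: np.ndarray,
    enrolled: dict[str, np.ndarray],
    threshold: float = 0.7,
) -> str:
    """Match a voice embedding against enrolled voiceprints
    via cosine similarity; below threshold, report unknown."""
    best_name, best_score = "unknown", threshold
    for name, ref in enrolled.items():
        score = float(
            np.dot(query, ref)
            / (np.linalg.norm(query) * np.linalg.norm(ref))
        )
        if score > best_score:
            best_name, best_score = name, score
    return best_name

enrolled = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
}
print(identify_speaker(np.array([0.9, 0.1, 0.0]), enrolled))  # alice
```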

### How accurate is pyannote for speaker diarization in noisy environments?

Pyannote 3.1 achieves strong results in clean recordings (under 5% diarization error rate) but degrades in noisy environments, overlapping speech, and phone-quality audio. For noisy recordings, preprocess with noise reduction (using libraries like noisereduce) before diarization. Also consider increasing the minimum segment duration to avoid spurious speaker switches caused by noise.
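The minimum-duration idea amounts to a one-line filter over the diarization output. This sketch uses plain dicts, and the 0.5-second floor is an arbitrary starting point rather than a pyannote parameter:

```python
def filter_short_segments(
    segments: list[dict],
    min_duration: float = 0.5,
) -> list[dict]:
    """Drop diarization turns shorter than min_duration seconds,
    which in noisy audio are often spurious speaker switches."""
    return [s for s in segments if s["end"] - s["start"] >= min_duration]

turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 3.0},
    {"speaker": "SPEAKER_01", "start": 3.0, "end": 3.2},  # likely noise
    {"speaker": "SPEAKER_00", "start": 3.2, "end": 7.0},
]
print(filter_short_segments(turns))  # keeps the two SPEAKER_00 turns
```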

### Can I classify music genres using the extracted audio features?

Yes. The MFCC features, spectral centroid, tempo, and zero crossing rate are the classic features used for genre classification. Train a simple classifier (random forest or small neural network) on a labeled dataset like GTZAN. Alternatively, skip manual feature engineering and use a pretrained audio classification model like those from Hugging Face's audio transformers, which accept raw waveforms and output genre labels directly.
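To illustrate the shape of that classifier, here is a random forest trained on synthetic vectors standing in for per-clip feature vectors (e.g. 13 MFCC means plus tempo, spectral centroid, and energy statistics). The cluster positions and 16-dimensional layout are invented for the demo; real training data would come from running `extract_features` over a labeled corpus such as GTZAN:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for extracted per-clip feature vectors:
# two well-separated 16-dimensional clusters, one per genre.
X_rock = rng.normal(loc=0.0, scale=1.0, size=(50, 16))
X_jazz = rng.normal(loc=3.0, scale=1.0, size=(50, 16))
X = np.vstack([X_rock, X_jazz])
y = ["rock"] * 50 + ["jazz"] * 50

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

# A new clip whose features sit in the "jazz" cluster
print(clf.predict(np.full((1, 16), 3.0))[0])  # jazz
```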

---

#AudioAnalysis #SpeakerDiarization #SoundClassification #AudioFeatures #Python #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/audio-analysis-agent-music-classification-speaker-identification-sound-events
