Learn Agentic AI · 11 min read

Classification with Structured Outputs: Sentiment, Intent, and Category Detection

Implement text classification systems using structured outputs. Build sentiment analysis, intent detection, and hierarchical category classification with enum constraints, confidence scores, and multi-label support.

Classification as Structured Extraction

Text classification is a special case of structured output: instead of extracting free-form entities, you are constraining the model to pick from a fixed set of labels. Structured outputs turn this into a reliable, repeatable process by using enums to restrict possible values and numeric fields for confidence scores.

This approach gives you three things that prompt-only classification cannot: guaranteed valid labels, per-label confidence scores, and multi-label support, all in a single API call.

Simple Sentiment Analysis

Start with the most basic classification task:

from pydantic import BaseModel, Field
from typing import Literal
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(description="Brief explanation for the classification")

def classify_sentiment(text: str) -> SentimentResult:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=SentimentResult,
        messages=[
            {
                "role": "system",
                "content": "Classify the sentiment of the given text. Be precise with confidence scores."
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_sentiment("The product works great but shipping took forever.")
print(result)
# SentimentResult(sentiment='neutral', confidence=0.65, reasoning='Mixed sentiment...')

The Literal type ensures the model can return only one of the three valid labels, so no post-processing is needed to handle misspellings or invented label names.
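Because the schema is plain Pydantic, you can see this guarantee without an API call. A minimal sketch (redeclaring the model locally) showing that any label outside the Literal set is rejected:

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

# A valid label passes straight through.
ok = SentimentResult(sentiment="positive", confidence=0.9, reasoning="Clear praise.")

# Anything outside the Literal set raises a ValidationError. instructor relies
# on this same validation to re-ask the model when retries are enabled.
try:
    SentimentResult(sentiment="very positive", confidence=0.9, reasoning="...")
except ValidationError as exc:
    print(type(exc).__name__)  # ValidationError
```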

Intent Detection for Chatbots

Customer support systems need to route messages by intent. Define a comprehensive intent taxonomy:

from enum import Enum
from typing import List

class CustomerIntent(str, Enum):
    billing_inquiry = "billing_inquiry"
    technical_support = "technical_support"
    account_management = "account_management"
    product_information = "product_information"
    complaint = "complaint"
    cancellation = "cancellation"
    feedback = "feedback"
    general_question = "general_question"

class IntentClassification(BaseModel):
    primary_intent: CustomerIntent
    secondary_intent: CustomerIntent | None = None
    confidence: float = Field(ge=0.0, le=1.0)
    urgency: Literal["low", "medium", "high", "critical"]
    suggested_department: str

def classify_intent(message: str) -> IntentClassification:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=IntentClassification,
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the customer message intent. "
                    "Assign urgency based on customer frustration level "
                    "and business impact. Suggest the right department to handle it."
                )
            },
            {"role": "user", "content": message}
        ],
    )

result = classify_intent("I've been charged twice for my subscription and nobody is responding!")
print(result.primary_intent)      # CustomerIntent.billing_inquiry
print(result.urgency)             # "high"
print(result.suggested_department) # "Billing Support"
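The structured result slots straight into deterministic routing code. A hypothetical sketch of that handoff — the queue names, the 0.6 cutoff, and the escalation rule are illustrative assumptions, not part of the article:

```python
# Hypothetical routing table: intent value -> support queue name.
QUEUE_BY_INTENT = {
    "billing_inquiry": "billing",
    "technical_support": "tech",
    "cancellation": "retention",
}

def route(primary_intent: str, urgency: str, confidence: float) -> str:
    """Pick a queue from the classification; escalate uncertain or critical cases."""
    if confidence < 0.6:
        return "human_triage"   # too uncertain to auto-route
    if urgency == "critical":
        return "escalations"    # critical cases skip the normal queues
    return QUEUE_BY_INTENT.get(primary_intent, "general")

print(route("billing_inquiry", "high", 0.92))  # billing
print(route("complaint", "critical", 0.95))    # escalations
print(route("feedback", "low", 0.40))          # human_triage
```

Because the label set is a closed enum, this lookup table can never receive an unexpected key from the model.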

Multi-Label Classification

Some texts belong to multiple categories. Use a list of labels with confidence per label:


class CategoryScore(BaseModel):
    category: str
    confidence: float = Field(ge=0.0, le=1.0)

class MultiLabelResult(BaseModel):
    labels: List[CategoryScore]
    primary_category: str

    @property
    def above_threshold(self) -> List[CategoryScore]:
        return [label for label in self.labels if label.confidence >= 0.5]

CATEGORIES = [
    "technology", "business", "health", "sports",
    "politics", "entertainment", "science", "education"
]

def classify_article(text: str) -> MultiLabelResult:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=MultiLabelResult,
        messages=[
            {
                "role": "system",
                "content": (
                    f"Classify the article into these categories: {CATEGORIES}. "
                    "An article can belong to multiple categories. "
                    "Assign a confidence score to each relevant category. "
                    "Only include categories with confidence > 0.2."
                )
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_article("AI startup raises $50M to revolutionize medical imaging")
for label in result.above_threshold:
    print(f"{label.category}: {label.confidence:.2f}")
# technology: 0.90
# health: 0.85
# business: 0.75
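The above_threshold property is ordinary Pydantic code, so you can exercise it without an API call. A quick local check (the scores here are made up to mirror the example output):

```python
from typing import List

from pydantic import BaseModel, Field

class CategoryScore(BaseModel):
    category: str
    confidence: float = Field(ge=0.0, le=1.0)

class MultiLabelResult(BaseModel):
    labels: List[CategoryScore]
    primary_category: str

    @property
    def above_threshold(self) -> List[CategoryScore]:
        return [label for label in self.labels if label.confidence >= 0.5]

result = MultiLabelResult(
    primary_category="technology",
    labels=[
        CategoryScore(category="technology", confidence=0.90),
        CategoryScore(category="health", confidence=0.85),
        CategoryScore(category="sports", confidence=0.10),  # below 0.5, filtered out
    ],
)
print([label.category for label in result.above_threshold])  # ['technology', 'health']
```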

Hierarchical Classification

Some taxonomies have parent-child relationships. Model them explicitly:

class HierarchicalCategory(BaseModel):
    level_1: str = Field(description="Top-level category")
    level_2: str = Field(description="Sub-category")
    level_3: str | None = Field(default=None, description="Specific topic")
    confidence: float = Field(ge=0.0, le=1.0)

TAXONOMY = {
    "Technology": {
        "Software": ["Web Development", "Mobile Apps", "DevOps", "AI/ML"],
        "Hardware": ["Processors", "Storage", "Networking"],
    },
    "Business": {
        "Finance": ["Investing", "Banking", "Cryptocurrency"],
        "Management": ["Leadership", "Strategy", "Operations"],
    },
}

import json

def classify_hierarchical(text: str) -> HierarchicalCategory:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=HierarchicalCategory,
        messages=[
            {
                "role": "system",
                "content": (
                    f"Classify the text using this taxonomy:\n"
                    f"{json.dumps(TAXONOMY, indent=2)}\n"
                    "Each level must be a valid entry from the taxonomy."
                )
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_hierarchical("How to deploy FastAPI with Kubernetes")
print(f"{result.level_1} > {result.level_2} > {result.level_3}")
# Technology > Software > DevOps

Batch Classification for Efficiency

When classifying hundreds of items, batch them to reduce overhead:

import asyncio
from openai import AsyncOpenAI

async_client = instructor.from_openai(AsyncOpenAI())

async def classify_batch(texts: List[str]) -> List[SentimentResult]:
    tasks = [
        async_client.chat.completions.create(
            model="gpt-4o-mini",  # Use mini for high-volume classification
            response_model=SentimentResult,
            messages=[
                {"role": "system", "content": "Classify sentiment precisely."},
                {"role": "user", "content": text}
            ],
        )
        for text in texts
    ]
    return await asyncio.gather(*tasks)

# Process 100 reviews
reviews = ["Great product!", "Terrible experience.", "It was okay."] * 33 + ["Meh."]
results = asyncio.run(classify_batch(reviews))

Use gpt-4o-mini for classification tasks — it is 10-20x cheaper than gpt-4o and performs comparably on classification because the task is constrained by the schema.
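One caveat: firing hundreds of requests with gather at once can trip provider rate limits. A common refinement, sketched here with a stub classifier instead of a real API call (the limit of 10 is an arbitrary assumption), is to cap in-flight requests with a semaphore:

```python
import asyncio

async def classify_with_limit(items, classify_one, max_concurrency: int = 10):
    """Run classify_one over items with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with sem:  # blocks when max_concurrency tasks are already running
            return await classify_one(item)

    return await asyncio.gather(*(bounded(item) for item in items))

# Demo with a stub classifier standing in for the instructor client call.
async def fake_classify(text: str) -> str:
    await asyncio.sleep(0.01)
    return "positive" if "great" in text.lower() else "neutral"

results = asyncio.run(classify_with_limit(["Great!", "Okay."], fake_classify))
print(results)  # ['positive', 'neutral']
```

In the batch example above, you would pass a small wrapper around async_client.chat.completions.create as classify_one.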

FAQ

How do confidence scores from LLMs compare to traditional ML classifiers?

LLM confidence scores are self-reported and not calibrated the same way as logistic regression or softmax probabilities. They tend to be overconfident. Treat them as relative rankings rather than absolute probabilities. If you need calibrated scores, collect labeled data and fit a calibration curve on top of the LLM's raw scores.
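One simple way to fit such a calibration curve is binned reliability: bucket raw scores, measure accuracy per bucket on labeled data, and map future scores to their bucket's accuracy. A sketch under that assumption (bin edges and the labeled data are illustrative):

```python
from bisect import bisect_right

def fit_bin_calibration(scores, labels, edges=(0.0, 0.5, 0.7, 0.9, 1.0)):
    """Map raw confidence bins to observed accuracy on labeled data."""
    bins = [[] for _ in range(len(edges) - 1)]
    for score, correct in zip(scores, labels):
        i = min(bisect_right(edges, score) - 1, len(bins) - 1)
        bins[i].append(correct)
    # Per-bin empirical accuracy; fall back to the bin midpoint if a bin is empty.
    return [sum(b) / len(b) if b else (edges[i] + edges[i + 1]) / 2
            for i, b in enumerate(bins)]

def calibrate(score, edges, bin_accuracy):
    i = min(bisect_right(edges, score) - 1, len(bin_accuracy) - 1)
    return bin_accuracy[i]

# Illustrative labeled data: raw LLM confidences vs. whether the label was correct.
raw = [0.95, 0.92, 0.91, 0.85, 0.80, 0.60]
correct = [1, 1, 0, 1, 0, 0]
edges = (0.0, 0.5, 0.7, 0.9, 1.0)
acc = fit_bin_calibration(raw, correct, edges)
# A raw 0.93 maps to the top bin's observed accuracy of 2/3, not 0.93.
print(round(calibrate(0.93, edges, acc), 2))  # 0.67
```

With real volumes you would use many more labeled examples per bin, or a proper method such as isotonic regression.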

Should I fine-tune a model for classification tasks?

For fewer than 20 categories with clear boundaries, prompting with structured outputs works well. For 50+ categories, domain-specific labels, or very high throughput needs, fine-tuning a smaller model is more cost-effective. The structured output approach is ideal for rapid prototyping and medium-volume production use.

How do I handle classification edge cases where the text does not fit any category?

Add an "other" or "unknown" option to your enum/Literal type and instruct the model to use it when confidence in all specific categories is below a threshold. Check the confidence score in your application code and route uncertain classifications to human review.


#Classification #SentimentAnalysis #IntentDetection #StructuredOutputs #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
