Learn Agentic AI · 11 min read

Classification with Structured Outputs: Sentiment, Intent, and Category Detection

Implement text classification systems using structured outputs. Build sentiment analysis, intent detection, and hierarchical category classification with enum constraints, confidence scores, and multi-label support.

Classification as Structured Extraction

Text classification is a special case of structured output: instead of extracting free-form entities, you are constraining the model to pick from a fixed set of labels. Structured outputs turn this into a reliable, repeatable process by using enums to restrict possible values and numeric fields for confidence scores.

This approach gives you three things that prompt-only classification cannot: guaranteed valid labels, per-label confidence scores, and multi-label support, all in a single API call.

Simple Sentiment Analysis

Start with the most basic classification task:

from pydantic import BaseModel, Field
from typing import Literal
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str = Field(description="Brief explanation for the classification")

def classify_sentiment(text: str) -> SentimentResult:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=SentimentResult,
        messages=[
            {
                "role": "system",
                "content": "Classify the sentiment of the given text. Be precise with confidence scores."
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_sentiment("The product works great but shipping took forever.")
print(result)
# SentimentResult(sentiment='neutral', confidence=0.65, reasoning='Mixed sentiment...')

The Literal type ensures the model can return only one of the three valid labels, so no post-processing is needed to handle misspellings or invented label names.
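Because the schema is plain Pydantic, you can see this guarantee without an API call. A minimal sketch (redeclaring the model locally) showing that any label outside the Literal set is rejected:

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

class SentimentResult(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    confidence: float = Field(ge=0.0, le=1.0)
    reasoning: str

# A valid label passes straight through.
ok = SentimentResult(sentiment="positive", confidence=0.9, reasoning="Clear praise.")

# Anything outside the Literal set raises a ValidationError. instructor relies
# on this same validation to re-ask the model when retries are enabled.
try:
    SentimentResult(sentiment="very positive", confidence=0.9, reasoning="...")
except ValidationError as exc:
    print(type(exc).__name__)  # ValidationError
```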

Intent Detection for Chatbots

Customer support systems need to route messages by intent. Define a comprehensive intent taxonomy:

from enum import Enum
from typing import List

class CustomerIntent(str, Enum):
    billing_inquiry = "billing_inquiry"
    technical_support = "technical_support"
    account_management = "account_management"
    product_information = "product_information"
    complaint = "complaint"
    cancellation = "cancellation"
    feedback = "feedback"
    general_question = "general_question"

class IntentClassification(BaseModel):
    primary_intent: CustomerIntent
    secondary_intent: CustomerIntent | None = None
    confidence: float = Field(ge=0.0, le=1.0)
    urgency: Literal["low", "medium", "high", "critical"]
    suggested_department: str

def classify_intent(message: str) -> IntentClassification:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=IntentClassification,
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the customer message intent. "
                    "Assign urgency based on customer frustration level "
                    "and business impact. Suggest the right department to handle it."
                )
            },
            {"role": "user", "content": message}
        ],
    )

result = classify_intent("I've been charged twice for my subscription and nobody is responding!")
print(result.primary_intent)      # CustomerIntent.billing_inquiry
print(result.urgency)             # "high"
print(result.suggested_department) # "Billing Support"
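The structured result slots straight into deterministic routing code. A hypothetical sketch of that handoff — the queue names, the 0.6 cutoff, and the escalation rule are illustrative assumptions, not part of the article:

```python
# Hypothetical routing table: intent value -> support queue name.
QUEUE_BY_INTENT = {
    "billing_inquiry": "billing",
    "technical_support": "tech",
    "cancellation": "retention",
}

def route(primary_intent: str, urgency: str, confidence: float) -> str:
    """Pick a queue from the classification; escalate uncertain or critical cases."""
    if confidence < 0.6:
        return "human_triage"   # too uncertain to auto-route
    if urgency == "critical":
        return "escalations"    # critical cases skip the normal queues
    return QUEUE_BY_INTENT.get(primary_intent, "general")

print(route("billing_inquiry", "high", 0.92))  # billing
print(route("complaint", "critical", 0.95))    # escalations
print(route("feedback", "low", 0.40))          # human_triage
```

Because the label set is a closed enum, this lookup table can never receive an unexpected key from the model.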

Multi-Label Classification

Some texts belong to multiple categories. Use a list of labels with confidence per label:


class CategoryScore(BaseModel):
    category: str
    confidence: float = Field(ge=0.0, le=1.0)

class MultiLabelResult(BaseModel):
    labels: List[CategoryScore]
    primary_category: str

    @property
    def above_threshold(self) -> List[CategoryScore]:
        return [label for label in self.labels if label.confidence >= 0.5]

CATEGORIES = [
    "technology", "business", "health", "sports",
    "politics", "entertainment", "science", "education"
]

def classify_article(text: str) -> MultiLabelResult:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=MultiLabelResult,
        messages=[
            {
                "role": "system",
                "content": (
                    f"Classify the article into these categories: {CATEGORIES}. "
                    "An article can belong to multiple categories. "
                    "Assign a confidence score to each relevant category. "
                    "Only include categories with confidence > 0.2."
                )
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_article("AI startup raises $50M to revolutionize medical imaging")
for label in result.above_threshold:
    print(f"{label.category}: {label.confidence:.2f}")
# technology: 0.90
# health: 0.85
# business: 0.75
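The above_threshold property is ordinary Pydantic code, so you can exercise it without an API call. A quick local check (the scores here are made up to mirror the example output):

```python
from typing import List

from pydantic import BaseModel, Field

class CategoryScore(BaseModel):
    category: str
    confidence: float = Field(ge=0.0, le=1.0)

class MultiLabelResult(BaseModel):
    labels: List[CategoryScore]
    primary_category: str

    @property
    def above_threshold(self) -> List[CategoryScore]:
        return [label for label in self.labels if label.confidence >= 0.5]

result = MultiLabelResult(
    primary_category="technology",
    labels=[
        CategoryScore(category="technology", confidence=0.90),
        CategoryScore(category="health", confidence=0.85),
        CategoryScore(category="sports", confidence=0.10),  # below 0.5, filtered out
    ],
)
print([label.category for label in result.above_threshold])  # ['technology', 'health']
```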

Hierarchical Classification

Some taxonomies have parent-child relationships. Model them explicitly:

class HierarchicalCategory(BaseModel):
    level_1: str = Field(description="Top-level category")
    level_2: str = Field(description="Sub-category")
    level_3: str | None = Field(default=None, description="Specific topic")
    confidence: float = Field(ge=0.0, le=1.0)

TAXONOMY = {
    "Technology": {
        "Software": ["Web Development", "Mobile Apps", "DevOps", "AI/ML"],
        "Hardware": ["Processors", "Storage", "Networking"],
    },
    "Business": {
        "Finance": ["Investing", "Banking", "Cryptocurrency"],
        "Management": ["Leadership", "Strategy", "Operations"],
    },
}

import json

def classify_hierarchical(text: str) -> HierarchicalCategory:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=HierarchicalCategory,
        messages=[
            {
                "role": "system",
                "content": (
                    f"Classify the text using this taxonomy:\n"
                    f"{json.dumps(TAXONOMY, indent=2)}\n"
                    "Each level must be a valid entry from the taxonomy."
                )
            },
            {"role": "user", "content": text}
        ],
    )

result = classify_hierarchical("How to deploy FastAPI with Kubernetes")
print(f"{result.level_1} > {result.level_2} > {result.level_3}")
# Technology > Software > DevOps

Batch Classification for Efficiency

When classifying hundreds of items, batch them to reduce overhead:

import asyncio
from openai import AsyncOpenAI

async_client = instructor.from_openai(AsyncOpenAI())

async def classify_batch(texts: List[str]) -> List[SentimentResult]:
    tasks = [
        async_client.chat.completions.create(
            model="gpt-4o-mini",  # Use mini for high-volume classification
            response_model=SentimentResult,
            messages=[
                {"role": "system", "content": "Classify sentiment precisely."},
                {"role": "user", "content": text}
            ],
        )
        for text in texts
    ]
    return await asyncio.gather(*tasks)

# Process 100 reviews
reviews = ["Great product!", "Terrible experience.", "It was okay."] * 33 + ["Meh."]
results = asyncio.run(classify_batch(reviews))

Use gpt-4o-mini for classification tasks — it is 10-20x cheaper than gpt-4o and performs comparably on classification because the task is constrained by the schema.
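One caveat: firing hundreds of requests with gather at once can trip provider rate limits. A common refinement, sketched here with a stub classifier instead of a real API call (the limit of 10 is an arbitrary assumption), is to cap in-flight requests with a semaphore:

```python
import asyncio

async def classify_with_limit(items, classify_one, max_concurrency: int = 10):
    """Run classify_one over items with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with sem:  # blocks when max_concurrency tasks are already running
            return await classify_one(item)

    return await asyncio.gather(*(bounded(item) for item in items))

# Demo with a stub classifier standing in for the instructor client call.
async def fake_classify(text: str) -> str:
    await asyncio.sleep(0.01)
    return "positive" if "great" in text.lower() else "neutral"

results = asyncio.run(classify_with_limit(["Great!", "Okay."], fake_classify))
print(results)  # ['positive', 'neutral']
```

In the batch example above, you would pass a small wrapper around async_client.chat.completions.create as classify_one.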

FAQ

How do confidence scores from LLMs compare to traditional ML classifiers?

LLM confidence scores are self-reported and not calibrated the same way as logistic regression or softmax probabilities. They tend to be overconfident. Treat them as relative rankings rather than absolute probabilities. If you need calibrated scores, collect labeled data and fit a calibration curve on top of the LLM's raw scores.
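One simple way to fit such a calibration curve is binned reliability: bucket raw scores, measure accuracy per bucket on labeled data, and map future scores to their bucket's accuracy. A sketch under that assumption (bin edges and the labeled data are illustrative):

```python
from bisect import bisect_right

def fit_bin_calibration(scores, labels, edges=(0.0, 0.5, 0.7, 0.9, 1.0)):
    """Map raw confidence bins to observed accuracy on labeled data."""
    bins = [[] for _ in range(len(edges) - 1)]
    for score, correct in zip(scores, labels):
        i = min(bisect_right(edges, score) - 1, len(bins) - 1)
        bins[i].append(correct)
    # Per-bin empirical accuracy; fall back to the bin midpoint if a bin is empty.
    return [sum(b) / len(b) if b else (edges[i] + edges[i + 1]) / 2
            for i, b in enumerate(bins)]

def calibrate(score, edges, bin_accuracy):
    i = min(bisect_right(edges, score) - 1, len(bin_accuracy) - 1)
    return bin_accuracy[i]

# Illustrative labeled data: raw LLM confidences vs. whether the label was correct.
raw = [0.95, 0.92, 0.91, 0.85, 0.80, 0.60]
correct = [1, 1, 0, 1, 0, 0]
edges = (0.0, 0.5, 0.7, 0.9, 1.0)
acc = fit_bin_calibration(raw, correct, edges)
# A raw 0.93 maps to the top bin's observed accuracy of 2/3, not 0.93.
print(round(calibrate(0.93, edges, acc), 2))  # 0.67
```

With real volumes you would use many more labeled examples per bin, or a proper method such as isotonic regression.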

Should I fine-tune a model for classification tasks?

For fewer than 20 categories with clear boundaries, prompting with structured outputs works well. For 50+ categories, domain-specific labels, or very high throughput needs, fine-tuning a smaller model is more cost-effective. The structured output approach is ideal for rapid prototyping and medium-volume production use.

How do I handle classification edge cases where the text does not fit any category?

Add an "other" or "unknown" option to your enum/Literal type and instruct the model to use it when confidence in all specific categories is below a threshold. Check the confidence score in your application code and route uncertain classifications to human review.


#Classification #SentimentAnalysis #IntentDetection #StructuredOutputs #Python #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
