
Building a Unified AI Agent API: One API for Chat, Voice, and Task Agents

Design a single unified API that serves chat, voice, and task-based AI agents through a common interface. Learn channel abstraction, response normalization, and how to handle the unique requirements of each modality without code duplication.

The Problem with Separate Agent APIs

Many organizations start with one API for their chatbot, another for their voice agent, and yet another for task automation. Each API has its own authentication, session management, error handling, and data models. Within months, you are maintaining three codebases that do fundamentally the same thing — send user input to an AI agent and return a response — but with incompatible interfaces.

A unified API consolidates these into a single interface with channel-specific adapters. The core logic — agent routing, conversation management, tool execution — lives in one place. Channel-specific concerns like voice transcription or chat formatting are handled at the edges.

The Unified Request Model

Design a request model that accommodates all channels through a common structure with channel-specific extensions:

from pydantic import BaseModel, Field
from typing import Any, Optional, Literal
from enum import Enum

class Channel(str, Enum):
    CHAT = "chat"
    VOICE = "voice"
    TASK = "task"
    EMAIL = "email"

class InputContent(BaseModel):
    text: Optional[str] = None
    audio_url: Optional[str] = None
    audio_base64: Optional[str] = None
    attachments: list[dict] = Field(default_factory=list)

class UnifiedRequest(BaseModel):
    channel: Channel
    session_id: str
    agent_id: str
    input: InputContent
    context: dict[str, Any] = Field(default_factory=dict)
    response_format: Literal["text", "ssml", "audio", "structured"] = "text"
    stream: bool = False

class ToolCallOutput(BaseModel):
    call_id: str
    tool_name: str
    arguments: dict[str, Any]

class UnifiedResponse(BaseModel):
    session_id: str
    agent_id: str
    channel: Channel
    text: Optional[str] = None
    ssml: Optional[str] = None
    audio_url: Optional[str] = None
    tool_calls: list[ToolCallOutput] = Field(default_factory=list)
    metadata: dict[str, Any] = Field(default_factory=dict)
    usage: dict[str, int] = Field(default_factory=dict)

A chat client sends {"channel": "chat", "input": {"text": "Hello"}}. A voice client sends {"channel": "voice", "input": {"audio_base64": "..."}}. A task agent sends {"channel": "task", "input": {"text": "Analyze this dataset"}}. The same endpoint handles all three.
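As written, the Pydantic models accept an empty input for any channel; the per-channel requirement (chat and task need text, voice needs audio or text) has to be enforced by a cross-field check. A minimal stand-in using plain dicts, with a hypothetical `validate_input` helper that is not part of the API above:

```python
# Hypothetical helper mirroring the per-channel input rules:
# chat and task require text; voice accepts audio_base64, audio_url, or text.
def validate_input(channel: str, input_content: dict) -> bool:
    if channel == "voice":
        return bool(
            input_content.get("audio_base64")
            or input_content.get("audio_url")
            or input_content.get("text")
        )
    return bool(input_content.get("text"))

print(validate_input("chat", {"text": "Hello"}))                 # True
print(validate_input("voice", {"audio_base64": "..."}))          # True
print(validate_input("task", {"attachments": [{"name": "a"}]}))  # False: tasks still need text
```

In the real models this check would live in a Pydantic model validator so invalid requests are rejected before they reach an adapter.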

Channel Adapters

Each channel has preprocessing and postprocessing needs. Adapters handle these transformations:


from abc import ABC, abstractmethod

class ChannelAdapter(ABC):
    @abstractmethod
    async def preprocess(self, request: UnifiedRequest) -> str:
        """Convert channel-specific input to plain text for the agent."""
        pass

    @abstractmethod
    async def postprocess(
        self, text: str, request: UnifiedRequest
    ) -> dict:
        """Convert agent text output to channel-specific format."""
        pass

class ChatAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        return request.input.text or ""

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        return {"text": text}

class VoiceAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        if request.input.audio_base64:
            return await transcribe_audio(request.input.audio_base64)
        return request.input.text or ""

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        if request.response_format == "ssml":
            return {"ssml": text_to_ssml(text)}
        if request.response_format == "audio":
            audio_url = await synthesize_speech(text)
            return {"audio_url": audio_url, "text": text}
        return {"text": text}

class TaskAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        # Tasks may include structured instructions
        parts = [request.input.text or ""]
        for attachment in request.input.attachments:
            parts.append(f"[Attachment: {attachment.get('name', 'file')}]")
        return "\n".join(parts)

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        if request.response_format == "structured":
            return {"text": text, "metadata": {"structured": True}}
        return {"text": text}

ADAPTERS: dict[Channel, ChannelAdapter] = {
    Channel.CHAT: ChatAdapter(),
    Channel.VOICE: VoiceAdapter(),
    Channel.TASK: TaskAdapter(),
}
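The task adapter's attachment flattening is easy to verify in isolation. A dependency-free copy of the `TaskAdapter.preprocess` logic, runnable on its own:

```python
import asyncio

# Simplified, dependency-free copy of TaskAdapter.preprocess, showing how
# attachments are flattened into the prompt text (illustrative only).
async def task_preprocess(text, attachments):
    parts = [text or ""]
    for attachment in attachments:
        parts.append(f"[Attachment: {attachment.get('name', 'file')}]")
    return "\n".join(parts)

result = asyncio.run(task_preprocess(
    "Analyze this dataset",
    [{"name": "sales.csv"}, {}],
))
print(result)
# Analyze this dataset
# [Attachment: sales.csv]
# [Attachment: file]
```

Attachments without a name fall back to the generic label "file", so the agent always sees a consistent marker per attachment.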

The Unified Endpoint

The main endpoint delegates to the appropriate adapter, runs the agent, and normalizes the response:

from fastapi import FastAPI, HTTPException

app = FastAPI(title="Unified Agent API")

@app.post("/v1/agent/invoke")
async def invoke_agent(request: UnifiedRequest) -> UnifiedResponse:
    # Channel.EMAIL has no registered adapter yet, so guard the lookup
    adapter = ADAPTERS.get(request.channel)
    if adapter is None:
        raise HTTPException(status_code=400, detail=f"No adapter for channel: {request.channel.value}")

    # Preprocess: convert channel input to text
    user_text = await adapter.preprocess(request)

    # Load conversation history
    history = await get_session_messages(request.session_id)

    # Run the agent
    agent_result = await run_agent(
        agent_id=request.agent_id,
        user_message=user_text,
        history=history,
        context=request.context,
    )

    # Postprocess: convert text to channel-appropriate format
    output = await adapter.postprocess(agent_result["text"], request)

    # Save to session history
    await save_message(request.session_id, "user", user_text)
    await save_message(request.session_id, "assistant", agent_result["text"])

    return UnifiedResponse(
        session_id=request.session_id,
        agent_id=request.agent_id,
        channel=request.channel,
        tool_calls=[
            ToolCallOutput(**tc) for tc in agent_result.get("tool_calls", [])
        ],
        usage=agent_result.get("usage", {}),
        **output,
    )

Streaming Across Channels

Streaming works differently per channel. Chat needs Server-Sent Events. Voice needs audio chunks. Tasks may not need streaming at all:

from fastapi import HTTPException
from fastapi.responses import StreamingResponse
import json

@app.post("/v1/agent/stream")
async def stream_agent(request: UnifiedRequest):
    # Same guard as the invoke endpoint for channels without adapters
    adapter = ADAPTERS.get(request.channel)
    if adapter is None:
        raise HTTPException(status_code=400, detail=f"No adapter for channel: {request.channel.value}")
    user_text = await adapter.preprocess(request)
    history = await get_session_messages(request.session_id)

    async def event_stream():
        full_text = ""
        async for chunk in stream_agent_response(
            agent_id=request.agent_id,
            user_message=user_text,
            history=history,
        ):
            full_text += chunk["text"]
            output = await adapter.postprocess(chunk["text"], request)
            event_data = json.dumps({
                "session_id": request.session_id,
                "chunk": output,
                "done": chunk.get("done", False),
            })
            yield f"data: {event_data}\n\n"

        await save_message(request.session_id, "user", user_text)
        await save_message(request.session_id, "assistant", full_text)

    return StreamingResponse(event_stream(), media_type="text/event-stream")
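On the client side, each SSE event is a "data: <json>" line followed by a blank line, matching what the stream endpoint emits. A minimal parser sketch (the payloads below are illustrative):

```python
import json

# Minimal SSE parser: split on blank lines, strip the "data: " prefix,
# and decode each event payload as JSON.
def parse_sse(raw: str) -> list[dict]:
    events = []
    for block in raw.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data: "):
                events.append(json.loads(line[len("data: "):]))
    return events

raw = (
    'data: {"session_id": "s1", "chunk": {"text": "Hel"}, "done": false}\n\n'
    'data: {"session_id": "s1", "chunk": {"text": "lo"}, "done": true}\n\n'
)
chunks = parse_sse(raw)
full_text = "".join(e["chunk"]["text"] for e in chunks)
print(full_text)  # Hello
```

A production client would read the response incrementally rather than buffering the whole stream, but the framing logic is the same.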

FAQ

How do I handle channel-specific features like voice barge-in or chat typing indicators?

Add channel-specific metadata to the context field of the request and response. For voice barge-in, the client sends {"context": {"voice_barge_in": true}}. The voice adapter checks this flag and adjusts response behavior. Keep these features in the adapter layer, not in core agent logic.
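One way this can look inside the voice adapter, sketched with a hypothetical `voice_response_policy` helper (the flag name matches the example above; the policy fields are assumptions):

```python
# Hypothetical sketch: the voice adapter reads a channel-specific flag from
# request.context, so core agent logic never needs to know about barge-in.
def voice_response_policy(context: dict) -> dict:
    if context.get("voice_barge_in"):
        # Interruptible playback: emit short sentences the TTS can stop between
        return {"interruptible": True, "max_sentence_chars": 120}
    return {"interruptible": False, "max_sentence_chars": 400}

print(voice_response_policy({"voice_barge_in": True}))
# {'interruptible': True, 'max_sentence_chars': 120}
```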

Should the unified API normalize all responses to text, or preserve rich formats?

Always generate text as the canonical format, then let adapters transform it. The agent produces text. The chat adapter returns it as-is. The voice adapter converts it to SSML or audio. The task adapter may parse it into structured JSON. This keeps agent logic channel-agnostic.

How do I route to different agent implementations based on channel?

Add routing logic in the endpoint that selects the agent based on both agent_id and channel. A customer service agent might use a faster model for chat and a more capable model for complex task requests. Store this mapping in configuration rather than code.
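A minimal sketch of such a mapping, keyed by (agent_id, channel) with a fallback default; the model names are placeholders:

```python
# Illustrative routing table: model selection by (agent_id, channel),
# loaded from configuration in a real deployment.
MODEL_ROUTING = {
    ("customer-service", "chat"): "fast-model",
    ("customer-service", "task"): "capable-model",
}
DEFAULT_MODEL = "fast-model"

def select_model(agent_id: str, channel: str) -> str:
    return MODEL_ROUTING.get((agent_id, channel), DEFAULT_MODEL)

print(select_model("customer-service", "task"))   # capable-model
print(select_model("customer-service", "voice"))  # fast-model
```

Keeping this table in configuration means a model swap for one channel is a config change, not a deploy.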


#UnifiedAPI #AIAgents #APIDesign #FastAPI #MultiChannel #AgenticAI #LearnAI #AIEngineering


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
