---
title: "Building a Unified AI Agent API: One API for Chat, Voice, and Task Agents"
description: "Design a single unified API that serves chat, voice, and task-based AI agents through a common interface. Learn channel abstraction, response normalization, and how to handle the unique requirements of each modality without code duplication."
canonical: https://callsphere.ai/blog/unified-ai-agent-api-chat-voice-task
category: "Learn Agentic AI"
tags: ["Unified API", "AI Agents", "API Design", "FastAPI", "Multi-Channel"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:43.541Z
---

# Building a Unified AI Agent API: One API for Chat, Voice, and Task Agents

> Design a single unified API that serves chat, voice, and task-based AI agents through a common interface. Learn channel abstraction, response normalization, and how to handle the unique requirements of each modality without code duplication.

## The Problem with Separate Agent APIs

Many organizations start with one API for their chatbot, another for their voice agent, and yet another for task automation. Each API has its own authentication, session management, error handling, and data models. Within months, you are maintaining three codebases that do fundamentally the same thing — send user input to an AI agent and return a response — but with incompatible interfaces.

A unified API consolidates these into a single interface with channel-specific adapters. The core logic — agent routing, conversation management, tool execution — lives in one place. Channel-specific concerns like voice transcription or chat formatting are handled at the edges.

## The Unified Request Model

Requests from every channel travel the same path through the stack, so start with a request model that accommodates all channels through a common structure with channel-specific extensions:

```mermaid
flowchart LR
    CLIENT(["Client SDK"])
    GW["API Gateway
auth plus rate limit"]
    APP["FastAPI app
handlers and DI"]
    VAL["Pydantic validation"]
    SVC["Service layer
business logic"]
    DB[(Database)]
    QUEUE[(Background queue)]
    OBS[(Tracing)]
    CLIENT --> GW --> APP --> VAL --> SVC
    SVC --> DB
    SVC --> QUEUE
    SVC --> OBS
    SVC --> CLIENT
    style GW fill:#4f46e5,stroke:#4338ca,color:#fff
    style APP fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
```

```python
from pydantic import BaseModel, Field
from typing import Any, Optional, Literal
from enum import Enum

class Channel(str, Enum):
    CHAT = "chat"
    VOICE = "voice"
    TASK = "task"
    EMAIL = "email"

class InputContent(BaseModel):
    text: Optional[str] = None
    audio_url: Optional[str] = None
    audio_base64: Optional[str] = None
    attachments: list[dict] = Field(default_factory=list)

class UnifiedRequest(BaseModel):
    channel: Channel
    session_id: str
    agent_id: str
    input: InputContent
    context: dict[str, Any] = Field(default_factory=dict)
    response_format: Literal["text", "ssml", "audio", "structured"] = "text"
    stream: bool = False

class ToolCallOutput(BaseModel):
    call_id: str
    tool_name: str
    arguments: dict[str, Any]

class UnifiedResponse(BaseModel):
    session_id: str
    agent_id: str
    channel: Channel
    text: Optional[str] = None
    ssml: Optional[str] = None
    audio_url: Optional[str] = None
    tool_calls: list[ToolCallOutput] = Field(default_factory=list)
    metadata: dict[str, Any] = Field(default_factory=dict)
    usage: dict[str, int] = Field(default_factory=dict)
```

A chat client sends `{"channel": "chat", "input": {"text": "Hello"}}`. A voice client sends `{"channel": "voice", "input": {"audio_base64": "..."}}`. A task agent sends `{"channel": "task", "input": {"text": "Analyze this dataset"}}`. The same endpoint handles all three.
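To make those payload shapes concrete, here they are as plain dictionaries alongside a cheap shape check a client SDK might run before POSTing. The `validate` helper and the session/agent IDs are illustrative, not part of the API:

```python
import json

# Illustrative client payloads; field names mirror UnifiedRequest.
chat = {"channel": "chat", "session_id": "s-1", "agent_id": "support",
        "input": {"text": "Hello"}}
voice = {"channel": "voice", "session_id": "s-2", "agent_id": "support",
         "input": {"audio_base64": "UklGRiQ="}, "response_format": "audio"}
task = {"channel": "task", "session_id": "s-3", "agent_id": "analyst",
        "input": {"text": "Analyze this dataset"}}

def validate(payload: dict) -> bool:
    """Reject payloads missing required fields or any usable input."""
    required = {"channel", "session_id", "agent_id", "input"}
    inp = payload.get("input", {})
    has_input = any(inp.get(k) for k in ("text", "audio_base64", "audio_url"))
    return required <= payload.keys() and has_input

# All three serialize into bodies for the same POST /v1/agent/invoke call.
bodies = [json.dumps(p) for p in (chat, voice, task)]
```

Server-side, Pydantic enforces the full schema; a client-side check like this just catches obviously malformed requests before a round trip.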

## Channel Adapters

Each channel has preprocessing and postprocessing needs. Adapters handle these transformations:

```python
from abc import ABC, abstractmethod

class ChannelAdapter(ABC):
    @abstractmethod
    async def preprocess(self, request: UnifiedRequest) -> str:
        """Convert channel-specific input to plain text for the agent."""
        pass

    @abstractmethod
    async def postprocess(
        self, text: str, request: UnifiedRequest
    ) -> dict:
        """Convert agent text output to channel-specific format."""
        pass

class ChatAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        return request.input.text or ""

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        return {"text": text}

class VoiceAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        if request.input.audio_base64:
            return await transcribe_audio(request.input.audio_base64)
        return request.input.text or ""

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        if request.response_format == "ssml":
            return {"ssml": text_to_ssml(text)}
        if request.response_format == "audio":
            audio_url = await synthesize_speech(text)
            return {"audio_url": audio_url, "text": text}
        return {"text": text}

class TaskAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        # Tasks may include structured instructions
        parts = [request.input.text or ""]
        for attachment in request.input.attachments:
            parts.append(f"[Attachment: {attachment.get('name', 'file')}]")
        return "\n".join(parts)

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        if request.response_format == "structured":
            return {"text": text, "metadata": {"structured": True}}
        return {"text": text}

# EMAIL appears in the Channel enum but has no adapter yet, so email
# requests must be rejected before this lookup succeeds.
ADAPTERS: dict[Channel, ChannelAdapter] = {
    Channel.CHAT: ChatAdapter(),
    Channel.VOICE: VoiceAdapter(),
    Channel.TASK: TaskAdapter(),
}
```

## The Unified Endpoint

The main endpoint delegates to the appropriate adapter, runs the agent, and normalizes the response:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI(title="Unified Agent API")

@app.post("/v1/agent/invoke")
async def invoke_agent(request: UnifiedRequest) -> UnifiedResponse:
    adapter = ADAPTERS.get(request.channel)
    if adapter is None:
        # Channels declared in the enum but not yet wired (e.g. email)
        # should fail with a 400, not a KeyError-driven 500.
        raise HTTPException(
            status_code=400,
            detail=f"Channel '{request.channel.value}' is not supported yet",
        )

    # Preprocess: convert channel input to text
    user_text = await adapter.preprocess(request)

    # Load conversation history
    history = await get_session_messages(request.session_id)

    # Run the agent
    agent_result = await run_agent(
        agent_id=request.agent_id,
        user_message=user_text,
        history=history,
        context=request.context,
    )

    # Postprocess: convert text to channel-appropriate format
    output = await adapter.postprocess(agent_result["text"], request)

    # Save to session history
    await save_message(request.session_id, "user", user_text)
    await save_message(request.session_id, "assistant", agent_result["text"])

    return UnifiedResponse(
        session_id=request.session_id,
        agent_id=request.agent_id,
        channel=request.channel,
        tool_calls=[
            ToolCallOutput(**tc) for tc in agent_result.get("tool_calls", [])
        ],
        usage=agent_result.get("usage", {}),
        **output,
    )
```

## Streaming Across Channels

Streaming works differently per channel. Chat needs Server-Sent Events. Voice needs audio chunks. Tasks may not need streaming at all:

```python
from fastapi import HTTPException
from fastapi.responses import StreamingResponse
import json

@app.post("/v1/agent/stream")
async def stream_agent(request: UnifiedRequest):
    adapter = ADAPTERS.get(request.channel)
    if adapter is None:
        raise HTTPException(
            status_code=400,
            detail=f"Channel '{request.channel.value}' is not supported yet",
        )
    user_text = await adapter.preprocess(request)
    history = await get_session_messages(request.session_id)

    async def event_stream():
        full_text = ""
        async for chunk in stream_agent_response(
            agent_id=request.agent_id,
            user_message=user_text,
            history=history,
        ):
            full_text += chunk["text"]
            output = await adapter.postprocess(chunk["text"], request)
            event_data = json.dumps({
                "session_id": request.session_id,
                "chunk": output,
                "done": chunk.get("done", False),
            })
            yield f"data: {event_data}\n\n"

        await save_message(request.session_id, "user", user_text)
        await save_message(request.session_id, "assistant", full_text)

    return StreamingResponse(event_stream(), media_type="text/event-stream")
```
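On the client side, consuming this stream means splitting the `text/event-stream` body into `data:` frames and reassembling the chunks. A minimal parser for the frames the endpoint above emits (the sample frames below are illustrative):

```python
import json

def parse_sse_events(raw_stream: str) -> list[dict]:
    """Parse data-only frames from a text/event-stream body."""
    events = []
    for block in raw_stream.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data: "):
                events.append(json.loads(line[len("data: "):]))
    return events

raw = (
    'data: {"session_id": "s-1", "chunk": {"text": "Hel"}, "done": false}\n\n'
    'data: {"session_id": "s-1", "chunk": {"text": "lo"}, "done": true}\n\n'
)
full_text = "".join(e["chunk"]["text"] for e in parse_sse_events(raw))
```

A real client would read the response incrementally rather than buffering the whole body, but the framing logic is the same.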

## FAQ

### How do I handle channel-specific features like voice barge-in or chat typing indicators?

Add channel-specific metadata to the context field of the request and response. For voice barge-in, the client sends `{"context": {"voice_barge_in": true}}`. The voice adapter checks this flag and adjusts response behavior. Keep these features in the adapter layer, not in core agent logic.
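As a sketch of what that adapter-layer check might look like: a hypothetical policy helper the voice adapter could apply in `postprocess`, keeping spoken responses short enough to interrupt cleanly when the client advertises barge-in support. The 280-character budget is an arbitrary choice:

```python
def apply_barge_in_policy(text: str, context: dict) -> str:
    """Shorten spoken output when the client signals barge-in support."""
    if not context.get("voice_barge_in") or len(text) <= 280:
        return text
    # Prefer cutting at a sentence boundary inside the budget.
    cut = text.rfind(". ", 0, 280)
    return text[: cut + 1] if cut != -1 else text[:280]
```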

### Should the unified API normalize all responses to text, or preserve rich formats?

Always generate text as the canonical format, then let adapters transform it. The agent produces text. The chat adapter returns it as-is. The voice adapter converts it to SSML or audio. The task adapter may parse it into structured JSON. This keeps agent logic channel-agnostic.

### How do I route to different agent implementations based on channel?

Add routing logic in the endpoint that selects the agent based on both `agent_id` and `channel`. A customer service agent might use a faster model for chat and a more capable model for complex task requests. Store this mapping in configuration rather than code.
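One way to express that mapping as configuration: a routing table keyed by `(agent_id, channel)` with per-agent and global wildcard fallbacks. The model names here are illustrative placeholders, and in practice the table would load from a config file rather than live in code:

```python
MODEL_ROUTES: dict[tuple[str, str], str] = {
    ("support", "chat"): "fast-small-model",
    ("support", "task"): "large-reasoning-model",
    ("support", "*"): "default-support-model",
    ("*", "*"): "default-model",
}

def resolve_model(agent_id: str, channel: str) -> str:
    # Most-specific match wins, then per-agent wildcard, then global.
    for key in ((agent_id, channel), (agent_id, "*"), ("*", "*")):
        if key in MODEL_ROUTES:
            return MODEL_ROUTES[key]
    return "default-model"
```

Changing which model backs a channel then becomes a config edit and a reload, not a deploy.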

---

#UnifiedAPI #AIAgents #APIDesign #FastAPI #MultiChannel #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/unified-ai-agent-api-chat-voice-task
