---
title: "Generating Multimodal Outputs: AI Agents That Create Images, Audio, and Documents"
description: "Build AI agents that generate rich multimodal outputs including images with DALL-E, speech with TTS, PDF documents, and formatted reports. Learn how to orchestrate multiple generation APIs into cohesive, multi-format responses."
canonical: https://callsphere.ai/blog/generating-multimodal-outputs-ai-agents-create-images-audio-documents
category: "Learn Agentic AI"
tags: ["Multimodal Output", "Image Generation", "Text-to-Speech", "Document Generation", "DALL-E"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T01:02:43.719Z
---

# Generating Multimodal Outputs: AI Agents That Create Images, Audio, and Documents

> Build AI agents that generate rich multimodal outputs including images with DALL-E, speech with TTS, PDF documents, and formatted reports. Learn how to orchestrate multiple generation APIs into cohesive, multi-format responses.

## Beyond Text Responses

Most AI agents return plain text. But many real tasks require richer outputs: a marketing agent should deliver copy alongside generated images, a report agent should produce formatted PDFs, and an accessibility agent should provide audio narrations. This guide builds an agent that generates images, audio, and documents as part of its response.

## Image Generation with DALL-E

Start with the most common multimodal output: generating images from text descriptions. The flowchart below shows where generation fits in the broader agent pipeline; the function that follows implements the image step.

```mermaid
flowchart LR
    INPUT(["User intent"])
    PARSE["Parse plus
classify"]
    PLAN["Plan and tool
selection"]
    AGENT["Agent loop
LLM plus tools"]
    GUARD{"Guardrails
and policy"}
    EXEC["Execute and
verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome plus
next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
import openai
import httpx
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class GeneratedImage:
    url: str
    local_path: str | None = None
    prompt: str = ""
    revised_prompt: str = ""

async def generate_image(
    prompt: str,
    client: openai.AsyncOpenAI,
    size: str = "1024x1024",
    quality: str = "standard",
    save_dir: str = "./outputs",
) -> GeneratedImage:
    """Generate an image using DALL-E 3."""
    response = await client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,
        n=1,
    )

    image_url = response.data[0].url
    # DALL-E 3 may rewrite the prompt; revised_prompt can be None
    revised = response.data[0].revised_prompt or ""

    # Download and save locally
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = prompt[:50].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.png"

    async with httpx.AsyncClient() as http:
        img_response = await http.get(image_url)
        img_response.raise_for_status()  # generated URLs expire quickly
        with open(local_path, "wb") as f:
            f.write(img_response.content)

    return GeneratedImage(
        url=image_url,
        local_path=local_path,
        prompt=prompt,
        revised_prompt=revised,
    )
```
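Here is a quick standalone usage sketch, assuming `OPENAI_API_KEY` is set in the environment and `generate_image` from above is in scope:

```python
# Usage sketch; assumes OPENAI_API_KEY is set in the environment.
import asyncio

async def demo():
    client = openai.AsyncOpenAI()
    image = await generate_image(
        "A watercolor skyline of a coastal city at dawn",
        client,
        quality="hd",  # DALL-E 3 accepts "standard" or "hd"
    )
    print("Saved to:", image.local_path)
    print("Revised prompt:", image.revised_prompt)

asyncio.run(demo())
```

Logging the revised prompt is worth the extra line: DALL-E 3 frequently rewrites inputs, and the revision explains why the output may differ from what was asked for.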

## Text-to-Speech Generation

For audio output, use OpenAI's TTS API to convert text to natural speech:

```python
@dataclass
class GeneratedAudio:
    local_path: str
    duration_estimate: float
    voice: str
    text: str

async def generate_speech(
    text: str,
    client: openai.AsyncOpenAI,
    voice: str = "alloy",
    save_dir: str = "./outputs",
) -> GeneratedAudio:
    """Generate speech audio from text using OpenAI TTS."""
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = text[:30].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.mp3"

    response = await client.audio.speech.create(
        model="tts-1-hd",
        voice=voice,
        input=text,
    )

    with open(local_path, "wb") as f:
        f.write(response.content)

    # Rough duration estimate: ~150 words per minute
    word_count = len(text.split())
    duration = word_count / 150 * 60

    return GeneratedAudio(
        local_path=local_path,
        duration_estimate=duration,
        voice=voice,
        text=text,
    )
```
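One practical constraint worth handling: the TTS endpoint caps input at 4,096 characters, so long scripts need to be split before synthesis. A minimal chunking sketch follows; the sentence-splitting heuristic is deliberately naive:

```python
# Sketch: split long text at sentence boundaries before synthesis,
# since the TTS endpoint caps input at 4,096 characters.
def chunk_text(text: str, limit: int = 4000) -> list[str]:
    chunks: list[str] = []
    current = ""
    for sentence in text.replace("\n", " ").split(". "):
        candidate = f"{current}{sentence}. "
        if len(candidate) > limit and current:
            chunks.append(current.strip())
            current = f"{sentence}. "
        else:
            current = candidate
    if current.strip():
        chunks.append(current.strip())
    return chunks

async def generate_long_speech(
    text: str,
    client: openai.AsyncOpenAI,
) -> list[GeneratedAudio]:
    # One audio file per chunk; concatenate afterwards if needed.
    return [
        await generate_speech(chunk, client)
        for chunk in chunk_text(text)
    ]
```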

## PDF Document Generation

For structured document output, generate PDFs with formatted text, tables, and embedded images:

```python
from reportlab.lib.pagesizes import letter
from reportlab.platypus import (
    SimpleDocTemplate, Paragraph, Spacer, Table,
    TableStyle, Image as RLImage,
)
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors

@dataclass
class GeneratedDocument:
    local_path: str
    page_count: int
    title: str

def generate_pdf_report(
    title: str,
    sections: list[dict],
    save_dir: str = "./outputs",
    images: list[str] | None = None,
) -> GeneratedDocument:
    """Generate a formatted PDF report.

    Each section: {"heading": str, "body": str, "table": list | None}
    """
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = title[:40].replace(" ", "_")
    path = f"{save_dir}/{safe_name}.pdf"

    doc = SimpleDocTemplate(path, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []

    # Title
    story.append(Paragraph(title, styles["Title"]))
    story.append(Spacer(1, 20))

    for section in sections:
        # Section heading
        story.append(
            Paragraph(section["heading"], styles["Heading2"])
        )
        story.append(Spacer(1, 10))

        # Body text
        story.append(
            Paragraph(section["body"], styles["BodyText"])
        )
        story.append(Spacer(1, 10))

        # Optional table
        if section.get("table"):
            table_data = section["table"]
            t = Table(table_data)
            t.setStyle(TableStyle([
                ("BACKGROUND", (0, 0), (-1, 0), colors.grey),
                ("TEXTCOLOR", (0, 0), (-1, 0), colors.whitesmoke),
                ("GRID", (0, 0), (-1, -1), 1, colors.black),
                ("FONTSIZE", (0, 0), (-1, -1), 9),
            ]))
            story.append(t)
            story.append(Spacer(1, 15))

    # Embed images if provided
    for img_path in (images or []):
        if Path(img_path).exists():
            story.append(RLImage(img_path, width=400, height=300))
            story.append(Spacer(1, 15))

    doc.build(story)

    # reportlab tracks the current page number during build,
    # so after build() it equals the total page count
    page_count = doc.page

    return GeneratedDocument(
        local_path=path,
        page_count=page_count,
        title=title,
    )
```
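A standalone usage sketch with illustrative content; note that the first table row receives the header styling applied inside `generate_pdf_report`:

```python
# Usage sketch with illustrative content.
doc = generate_pdf_report(
    title="Q1 Metrics Summary",
    sections=[
        {
            "heading": "Overview",
            "body": "Latency and cost both improved quarter over quarter.",
            "table": [
                ["Metric", "Q4", "Q1"],  # row 0 gets the header styling
                ["p95 latency (ms)", "820", "640"],
                ["Cost per request ($)", "0.031", "0.024"],
            ],
        },
    ],
)
print(doc.local_path, doc.page_count)
```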

## The Multimodal Output Agent

Bring all generators together into an agent that decides which output formats are appropriate for each request:

```python
import asyncio
import json

@dataclass
class MultimodalResponse:
    text: str
    images: list[GeneratedImage] = field(default_factory=list)
    audio: list[GeneratedAudio] = field(default_factory=list)
    documents: list[GeneratedDocument] = field(default_factory=list)

class MultimodalOutputAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def _plan_outputs(self, query: str) -> dict:
        """Ask the LLM what output formats are appropriate."""
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Decide what output formats to generate. "
                        "Return a JSON object with boolean fields: "
                        "needs_image, needs_audio, needs_document, "
                        "and string fields: image_prompt, "
                        "audio_text, document_title, text_response."
                    ),
                },
                {"role": "user", "content": query},
            ],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)

    async def respond(self, query: str) -> MultimodalResponse:
        plan = await self._plan_outputs(query)

        result = MultimodalResponse(
            text=plan.get("text_response", "")
        )

        # Generate outputs in parallel where possible
        tasks = []

        if plan.get("needs_image") and plan.get("image_prompt"):
            tasks.append(self._gen_image(plan["image_prompt"]))

        if plan.get("needs_audio") and plan.get("audio_text"):
            tasks.append(self._gen_audio(plan["audio_text"]))

        outputs = await asyncio.gather(*tasks, return_exceptions=True)

        for output in outputs:
            if isinstance(output, GeneratedImage):
                result.images.append(output)
            elif isinstance(output, GeneratedAudio):
                result.audio.append(output)
            elif isinstance(output, Exception):
                result.text += (
                    f"\n\n[Generation error: {output}]"
                )

        if plan.get("needs_document"):
            doc = generate_pdf_report(
                title=plan.get("document_title", "Report"),
                sections=[{
                    "heading": "Content",
                    "body": result.text,
                }],
                images=[
                    img.local_path
                    for img in result.images
                    if img.local_path
                ],
            )
            result.documents.append(doc)

        return result

    async def _gen_image(self, prompt: str) -> GeneratedImage:
        return await generate_image(prompt, self.client)

    async def _gen_audio(self, text: str) -> GeneratedAudio:
        return await generate_speech(text, self.client)
```

## Usage Example

```python
import asyncio

async def main():
    agent = MultimodalOutputAgent()

    response = await agent.respond(
        "Create a brief market analysis report for the AI "
        "industry in 2026, with a cover image and an audio "
        "executive summary."
    )

    print("Text:", response.text[:200])
    print("Images:", [img.local_path for img in response.images])
    print("Audio:", [a.local_path for a in response.audio])
    print("Docs:", [d.local_path for d in response.documents])

asyncio.run(main())
```

## FAQ

### How do I control the cost of generating multiple output types per request?

Implement a budget system that tracks estimated costs per generation type. DALL-E 3 costs approximately $0.04 per standard 1024x1024 image, TTS costs about $0.015 per 1,000 characters for `tts-1` (roughly double for the `tts-1-hd` model used above), and GPT-4o planning costs standard token rates. Set per-request spending limits and skip optional outputs (like images) when the budget is tight. Also cache generated outputs: if the same image prompt appears twice, serve the cached version instead of regenerating.
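A minimal sketch of that idea; the class name and per-unit prices below are illustrative and should be checked against current API pricing:

```python
# Sketch of a per-request generation budget with naive prompt caching.
# Prices are illustrative estimates; check current API pricing.
class GenerationBudget:
    PRICES = {"image": 0.04, "audio_per_1k_chars": 0.015}

    def __init__(self, limit_usd: float = 0.25):
        self.limit = limit_usd
        self.spent = 0.0
        self.image_cache: dict[str, GeneratedImage] = {}

    def estimate(self, kind: str, chars: int = 0) -> float:
        if kind == "image":
            return self.PRICES["image"]
        return self.PRICES["audio_per_1k_chars"] * chars / 1000

    def can_afford(self, kind: str, chars: int = 0) -> bool:
        return self.spent + self.estimate(kind, chars) <= self.limit

    def record(self, kind: str, chars: int = 0) -> None:
        self.spent += self.estimate(kind, chars)
```

Inside `respond`, the agent would check `budget.can_afford("image")` before appending the image task, and consult `budget.image_cache` keyed on the prompt before regenerating.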

### Can I use open-source alternatives instead of OpenAI APIs for generation?

Yes. For image generation, use Stable Diffusion via a local ComfyUI or A1111 server. For TTS, Coqui TTS and Bark provide open-source speech synthesis. For document generation, reportlab (shown above) is already open-source and runs locally with no API calls. Replace the API calls in each generator function with calls to your local model servers while keeping the same return types.
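As an example, a drop-in replacement for `generate_image` against a local AUTOMATIC1111 server might look like the sketch below. It assumes the webui is running with the `--api` flag on its default port and keeps the same `GeneratedImage` return type:

```python
# Sketch: Stable Diffusion via a local AUTOMATIC1111 server, keeping
# the same GeneratedImage return type as generate_image() above.
# Assumes the webui was started with --api on localhost:7860.
import base64
import httpx
from pathlib import Path

async def generate_image_sd(
    prompt: str,
    save_dir: str = "./outputs",
) -> GeneratedImage:
    async with httpx.AsyncClient(timeout=120) as http:
        resp = await http.post(
            "http://localhost:7860/sdapi/v1/txt2img",
            json={"prompt": prompt, "width": 1024, "height": 1024},
        )
        resp.raise_for_status()
        image_b64 = resp.json()["images"][0]  # base64-encoded PNG

    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = prompt[:50].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.png"
    with open(local_path, "wb") as f:
        f.write(base64.b64decode(image_b64))

    return GeneratedImage(url="", local_path=local_path, prompt=prompt)
```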

### How do I serve multimodal outputs through a web API?

Return a JSON response with the text content inline and URLs or file paths for binary outputs. For a FastAPI endpoint, upload generated images and audio to cloud storage (S3, GCS) and return signed URLs. Alternatively, serve files directly from the local output directory using FastAPI's `StaticFiles` mount. For documents, return a download URL that streams the PDF directly to the client.
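A minimal FastAPI sketch of the local-files approach; the route and response field names here are illustrative, not part of any standard:

```python
# Sketch: serve generated files from the local output directory.
# Route and response field names are illustrative.
from pathlib import Path

from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

app = FastAPI()
app.mount("/outputs", StaticFiles(directory="outputs"), name="outputs")

agent = MultimodalOutputAgent()

@app.post("/respond")
async def respond(query: str):
    result = await agent.respond(query)

    def to_url(p: str) -> str:
        return f"/outputs/{Path(p).name}"

    return {
        "text": result.text,
        "images": [to_url(i.local_path) for i in result.images if i.local_path],
        "audio": [to_url(a.local_path) for a in result.audio],
        "documents": [to_url(d.local_path) for d in result.documents],
    }
```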

