
Gemini Multi-Modal Agents: Processing Images, Video, and Audio Together

Build agents that see, hear, and understand multiple media types simultaneously. Learn Gemini's media upload API, inline data handling, video analysis, and audio transcription capabilities.

Why Multi-Modal Agents Matter

Text-only agents miss most of the information in the real world. Documents contain charts and diagrams. Customer support involves screenshots. Security systems produce video feeds. Call centers generate hours of audio. Gemini processes all of these natively in a single model — no separate OCR, speech-to-text, or vision pipelines required.

This unified approach means your agent can reason across modalities. It can look at a screenshot of an error, read the stack trace in the image, correlate it with code you provide as text, and explain the fix — all in one inference call.
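Under the hood, such a mixed call is just a list of parts handed to one request. A minimal sketch of assembling it — `build_debug_request` is an illustrative helper of ours, not an SDK function, and the screenshot bytes are placeholders:

```python
def build_debug_request(prompt: str, code: str, screenshot: bytes) -> list:
    """Assemble the parts list for a single generate_content() call
    that mixes an instruction, source code as text, and an image."""
    return [
        prompt,
        f"Relevant source code:\n{code}",
        {"mime_type": "image/png", "data": screenshot},
    ]

# The resulting list is passed straight to model.generate_content(parts).
parts = build_debug_request(
    "Explain the error shown in this screenshot and suggest a fix.",
    "def divide(a, b):\n    return a / b",
    b"\x89PNG...",  # placeholder; normally Path("error.png").read_bytes()
)
```

Because every part is just a list element, adding another modality to the same inference call is a one-line change.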

Processing Images

The simplest multi-modal interaction sends an image with a text prompt:

import google.generativeai as genai
from pathlib import Path
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.0-flash")

# Load image from file
image_path = Path("screenshot.png")
image_data = image_path.read_bytes()

response = model.generate_content([
    "Analyze this UI screenshot. Identify any usability issues and suggest improvements.",
    {"mime_type": "image/png", "data": image_data},
])

print(response.text)

You can also pass multiple images in a single request for comparison tasks:

before = Path("ui_before.png").read_bytes()
after = Path("ui_after.png").read_bytes()

response = model.generate_content([
    "Compare these two UI designs. What changed? Which is better for accessibility?",
    {"mime_type": "image/png", "data": before},
    {"mime_type": "image/png", "data": after},
])
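When the images being compared come in mixed formats (PNG screenshots, JPEG photos), you can derive the `mime_type` field from the filename with the standard library. `image_part` here is our own convenience helper, not part of the SDK:

```python
import mimetypes

def image_part(filename: str, data: bytes) -> dict:
    """Build an inline-data part, guessing the MIME type from the extension."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Unsupported or unknown image type: {filename}")
    return {"mime_type": mime, "data": data}

part = image_part("ui_before.png", b"...")  # {'mime_type': 'image/png', ...}
```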

Uploading Large Files with the Files API

For files larger than 20MB, or when you want to reuse media across multiple requests, use the Files API:

# Upload a video file
video_file = genai.upload_file(
    path="meeting_recording.mp4",
    display_name="Team standup March 17",
)

# Wait for processing to complete
import time
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError(f"File processing failed: {video_file.state.name}")

print(f"File ready: {video_file.uri}")
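This poll-until-ready loop recurs for every upload, so it is worth factoring out. Below is a sketch with the fetch function injected (so the logic can be exercised without live API calls) and a timeout added; `wait_for_active` is our own helper, not an SDK function:

```python
import time

def wait_for_active(file_obj, get_file, interval: float = 5.0,
                    timeout: float = 600.0):
    """Poll an uploaded file until it leaves PROCESSING.

    get_file: callable taking the file name and returning a fresh file
    object (in real use, pass genai.get_file). Raises on FAILED or timeout.
    """
    deadline = time.monotonic() + timeout
    while file_obj.state.name == "PROCESSING":
        if time.monotonic() > deadline:
            raise TimeoutError(f"{file_obj.name} still processing after {timeout}s")
        time.sleep(interval)
        file_obj = get_file(file_obj.name)
    if file_obj.state.name == "FAILED":
        raise ValueError(f"File processing failed: {file_obj.name}")
    return file_obj
```

In real code this becomes `video_file = wait_for_active(video_file, genai.get_file)`.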

Once uploaded, reference the file in your requests:

response = model.generate_content([
    video_file,
    "Summarize this meeting. List action items with the person responsible for each.",
])

print(response.text)

Video Analysis with Timestamps

Gemini can analyze video content and reference specific timestamps:

model = genai.GenerativeModel(
    "gemini-2.0-flash",
    system_instruction="""You are a video analysis agent. When referencing
    moments in the video, always include the timestamp in MM:SS format.""",
)

response = model.generate_content([
    video_file,
    "Identify all the key moments in this product demo. "
    "For each moment, provide the timestamp, what is shown, and why it matters.",
])

print(response.text)
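Because the system instruction pins timestamps to MM:SS, you can extract them from the response text with a regex for downstream use (for example, seeking a video player). This parser is our own addition, not part of the SDK:

```python
import re

def extract_timestamps(text: str) -> list[tuple[str, int]]:
    """Return (timestamp, offset_in_seconds) pairs for every MM:SS mention."""
    pairs = []
    for m in re.finditer(r"\b(\d{1,2}):([0-5]\d)\b", text):
        minutes, seconds = int(m.group(1)), int(m.group(2))
        pairs.append((m.group(0), minutes * 60 + seconds))
    return pairs

summary = "At 0:45 the dashboard loads; the key demo starts at 12:30."
print(extract_timestamps(summary))  # [('0:45', 45), ('12:30', 750)]
```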

Gemini samples video at approximately 1 frame per second, so it captures visual changes effectively. Each sampled frame costs roughly 258 tokens, so a 1-hour video (about 3,600 frames) uses on the order of 900K tokens for video frames alone, plus additional tokens (roughly 32 per second) for any audio track.
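A back-of-envelope budget helps decide whether a clip fits your context window. The sketch below assumes roughly 258 tokens per sampled frame at 1 fps and about 32 tokens per second of audio — commonly cited approximations for Gemini, not exact billing figures:

```python
FRAME_TOKENS = 258        # approx. tokens per sampled video frame (1 fps)
AUDIO_TOKENS_PER_S = 32   # approx. tokens per second of audio

def estimate_video_tokens(duration_s: int, with_audio: bool = True) -> int:
    """Rough token cost of a video of the given duration in seconds."""
    tokens = duration_s * FRAME_TOKENS
    if with_audio:
        tokens += duration_s * AUDIO_TOKENS_PER_S
    return tokens

print(estimate_video_tokens(3600))        # 1-hour video: 1044000
print(estimate_video_tokens(600, False))  # 10-min silent clip: 154800
```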

Audio Transcription and Analysis

Gemini handles audio natively — no separate speech-to-text step required:

audio_file = genai.upload_file(path="customer_call.wav")

# Wait for processing
import time
while audio_file.state.name == "PROCESSING":
    time.sleep(3)
    audio_file = genai.get_file(audio_file.name)

response = model.generate_content([
    audio_file,
    "Transcribe this customer call. Then analyze the sentiment, "
    "identify the main issue, and rate the agent's performance.",
])

print(response.text)

Supported audio formats include WAV, MP3, AIFF, AAC, OGG, and FLAC. Audio is processed at a rate of approximately 32 tokens per second.
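To budget tokens before uploading, you can read a WAV file's duration with the standard library and apply the ~32 tokens/second rate (an approximation; helper names are ours):

```python
import io
import wave

def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of an in-memory WAV file, via the stdlib wave module."""
    with wave.open(io.BytesIO(wav_bytes)) as w:
        return w.getnframes() / w.getframerate()

def estimate_audio_tokens(duration_s: float, tokens_per_second: int = 32) -> int:
    """Rough token cost for audio of the given duration."""
    return int(duration_s * tokens_per_second)

# Usage sketch:
# data = Path("customer_call.wav").read_bytes()
# tokens = estimate_audio_tokens(wav_duration_seconds(data))
```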

Building a Multi-Modal Agent

Here is a complete agent that processes mixed media inputs:

import google.generativeai as genai
from pathlib import Path
import os
import time

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

class MultiModalAgent:
    def __init__(self):
        self.model = genai.GenerativeModel(
            "gemini-2.0-flash",
            system_instruction=(
                "You are a helpful assistant that can analyze text, "
                "images, audio, and video. Always describe what you "
                "observe in each media type before answering questions."
            ),
        )
        self.chat = self.model.start_chat()

    def send(self, text: str, media_paths: list[str] | None = None) -> str:
        parts = []
        for path in media_paths or []:
            file_obj = genai.upload_file(path=path)
            # Poll until the uploaded file finishes server-side processing
            while file_obj.state.name == "PROCESSING":
                time.sleep(2)
                file_obj = genai.get_file(file_obj.name)
            if file_obj.state.name == "FAILED":
                raise ValueError(f"Processing failed for {path}")
            parts.append(file_obj)
        parts.append(text)

        response = self.chat.send_message(parts)
        return response.text

agent = MultiModalAgent()

# Analyze an image and audio together
result = agent.send(
    "The image shows our server dashboard and the audio is an alert notification. "
    "What is the server status and is the alert critical?",
    media_paths=["dashboard.png", "alert.wav"],
)
print(result)
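Since inline data is capped at 20MB while the Files API accepts up to 2GB, an agent like this can route each file by size: small screenshots go inline (no upload round-trip), large recordings go through `genai.upload_file`. `choose_transport` is an illustrative helper of ours:

```python
INLINE_LIMIT = 20 * 1024 * 1024  # 20MB cap for inline request data

def choose_transport(size_bytes: int) -> str:
    """Decide how to attach a media file: inline bytes vs. Files API upload."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    return "inline" if size_bytes <= INLINE_LIMIT else "upload"

# In send() you would branch on choose_transport(os.path.getsize(path)).
print(choose_transport(5 * 1024 * 1024))    # inline
print(choose_transport(300 * 1024 * 1024))  # upload
```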

FAQ

What are the file size limits for Gemini media uploads?

Inline data (passed directly in the request) is limited to 20MB. The Files API supports uploads up to 2GB per file. Uploaded files are stored for 48 hours and then automatically deleted.

Can Gemini process live video streams?

Gemini's standard API processes pre-recorded media. For real-time processing, the Gemini Live API supports streaming audio and video input with low-latency responses, and is available through both the Gemini API and Vertex AI.

How many images can I include in a single request?

Gemini supports up to 3,600 image files in a single request, though practical limits depend on total token count. Each image consumes approximately 258 tokens. For most agent applications, sending 5-20 images per request is the practical sweet spot.


#GoogleGemini #MultiModalAI #ComputerVision #AudioProcessing #Python #AgenticAI #LearnAI #AIEngineering

Written by the CallSphere Team.