
Gemini Multi-Modal Agents: Processing Images, Video, and Audio Together

Build agents that see, hear, and understand multiple media types simultaneously. Learn Gemini's media upload API, inline data handling, video analysis, and audio transcription capabilities.

Why Multi-Modal Agents Matter

Text-only agents miss most of the information in the real world. Documents contain charts and diagrams. Customer support involves screenshots. Security systems produce video feeds. Call centers generate hours of audio. Gemini processes all of these natively in a single model — no separate OCR, speech-to-text, or vision pipelines required.

This unified approach means your agent can reason across modalities. It can look at a screenshot of an error, read the stack trace in the image, correlate it with code you provide as text, and explain the fix — all in one inference call.
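Under the hood, such a mixed call is just a list of parts handed to one request. A minimal sketch of assembling it — `build_debug_request` is an illustrative helper of ours, not an SDK function, and the screenshot bytes are placeholders:

```python
def build_debug_request(prompt: str, code: str, screenshot: bytes) -> list:
    """Assemble the parts list for a single generate_content() call
    that mixes an instruction, source code as text, and an image."""
    return [
        prompt,
        f"Relevant source code:\n{code}",
        {"mime_type": "image/png", "data": screenshot},
    ]

# The resulting list is passed straight to model.generate_content(parts).
parts = build_debug_request(
    "Explain the error shown in this screenshot and suggest a fix.",
    "def divide(a, b):\n    return a / b",
    b"\x89PNG...",  # placeholder; normally Path("error.png").read_bytes()
)
```

Because every part is just a list element, adding another modality to the same inference call is a one-line change.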

Processing Images

The simplest multi-modal interaction sends an image with a text prompt:

import google.generativeai as genai
from pathlib import Path
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.0-flash")

# Load image from file
image_path = Path("screenshot.png")
image_data = image_path.read_bytes()

response = model.generate_content([
    "Analyze this UI screenshot. Identify any usability issues and suggest improvements.",
    {"mime_type": "image/png", "data": image_data},
])

print(response.text)

You can also pass multiple images in a single request for comparison tasks:

before = Path("ui_before.png").read_bytes()
after = Path("ui_after.png").read_bytes()

response = model.generate_content([
    "Compare these two UI designs. What changed? Which is better for accessibility?",
    {"mime_type": "image/png", "data": before},
    {"mime_type": "image/png", "data": after},
])
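When the images being compared come in mixed formats (PNG screenshots, JPEG photos), you can derive the `mime_type` field from the filename with the standard library. `image_part` here is our own convenience helper, not part of the SDK:

```python
import mimetypes

def image_part(filename: str, data: bytes) -> dict:
    """Build an inline-data part, guessing the MIME type from the extension."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Unsupported or unknown image type: {filename}")
    return {"mime_type": mime, "data": data}

part = image_part("ui_before.png", b"...")  # {'mime_type': 'image/png', ...}
```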

Uploading Large Files with the Files API

For files larger than 20MB, or when you want to reuse media across multiple requests, use the Files API:

# Upload a video file
video_file = genai.upload_file(
    path="meeting_recording.mp4",
    display_name="Team standup March 17",
)

# Wait for processing to complete
import time
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError(f"File processing failed: {video_file.state.name}")

print(f"File ready: {video_file.uri}")
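This poll-until-ready loop recurs for every upload, so it is worth factoring out. Below is a sketch with the fetch function injected (so the logic can be exercised without live API calls) and a timeout added; `wait_for_active` is our own helper, not an SDK function:

```python
import time

def wait_for_active(file_obj, get_file, interval: float = 5.0,
                    timeout: float = 600.0):
    """Poll an uploaded file until it leaves PROCESSING.

    get_file: callable taking the file name and returning a fresh file
    object (in real use, pass genai.get_file). Raises on FAILED or timeout.
    """
    deadline = time.monotonic() + timeout
    while file_obj.state.name == "PROCESSING":
        if time.monotonic() > deadline:
            raise TimeoutError(f"{file_obj.name} still processing after {timeout}s")
        time.sleep(interval)
        file_obj = get_file(file_obj.name)
    if file_obj.state.name == "FAILED":
        raise ValueError(f"File processing failed: {file_obj.name}")
    return file_obj
```

In real code this becomes `video_file = wait_for_active(video_file, genai.get_file)`.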

Once uploaded, reference the file in your requests:

response = model.generate_content([
    video_file,
    "Summarize this meeting. List action items with the person responsible for each.",
])

print(response.text)

Video Analysis with Timestamps

Gemini can analyze video content and reference specific timestamps:

model = genai.GenerativeModel(
    "gemini-2.0-flash",
    system_instruction="""You are a video analysis agent. When referencing
    moments in the video, always include the timestamp in MM:SS format.""",
)

response = model.generate_content([
    video_file,
    "Identify all the key moments in this product demo. "
    "For each moment, provide the timestamp, what is shown, and why it matters.",
])

print(response.text)
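Because the system instruction pins timestamps to MM:SS, you can extract them from the response text with a regex for downstream use (for example, seeking a video player). This parser is our own addition, not part of the SDK:

```python
import re

def extract_timestamps(text: str) -> list[tuple[str, int]]:
    """Return (timestamp, offset_in_seconds) pairs for every MM:SS mention."""
    pairs = []
    for m in re.finditer(r"\b(\d{1,2}):([0-5]\d)\b", text):
        minutes, seconds = int(m.group(1)), int(m.group(2))
        pairs.append((m.group(0), minutes * 60 + seconds))
    return pairs

summary = "At 0:45 the dashboard loads; the key demo starts at 12:30."
print(extract_timestamps(summary))  # [('0:45', 45), ('12:30', 750)]
```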

Gemini samples video at approximately 1 frame per second, so it captures visual changes effectively. Each sampled frame costs roughly 258 tokens, so a 1-hour video (about 3,600 frames) uses on the order of 900K tokens for video frames alone, plus additional tokens (roughly 32 per second) for any audio track.
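A back-of-envelope budget helps decide whether a clip fits your context window. The sketch below assumes roughly 258 tokens per sampled frame at 1 fps and about 32 tokens per second of audio — commonly cited approximations for Gemini, not exact billing figures:

```python
FRAME_TOKENS = 258        # approx. tokens per sampled video frame (1 fps)
AUDIO_TOKENS_PER_S = 32   # approx. tokens per second of audio

def estimate_video_tokens(duration_s: int, with_audio: bool = True) -> int:
    """Rough token cost of a video of the given duration in seconds."""
    tokens = duration_s * FRAME_TOKENS
    if with_audio:
        tokens += duration_s * AUDIO_TOKENS_PER_S
    return tokens

print(estimate_video_tokens(3600))        # 1-hour video: 1044000
print(estimate_video_tokens(600, False))  # 10-min silent clip: 154800
```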

Audio Transcription and Analysis

Gemini handles audio natively — no separate speech-to-text step required:

audio_file = genai.upload_file(path="customer_call.wav")

# Wait for processing
import time
while audio_file.state.name == "PROCESSING":
    time.sleep(3)
    audio_file = genai.get_file(audio_file.name)

response = model.generate_content([
    audio_file,
    "Transcribe this customer call. Then analyze the sentiment, "
    "identify the main issue, and rate the agent's performance.",
])

print(response.text)

Supported audio formats include WAV, MP3, AIFF, AAC, OGG, and FLAC. Audio is processed at a rate of approximately 32 tokens per second.
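To budget tokens before uploading, you can read a WAV file's duration with the standard library and apply the ~32 tokens/second rate (an approximation; helper names are ours):

```python
import io
import wave

def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of an in-memory WAV file, via the stdlib wave module."""
    with wave.open(io.BytesIO(wav_bytes)) as w:
        return w.getnframes() / w.getframerate()

def estimate_audio_tokens(duration_s: float, tokens_per_second: int = 32) -> int:
    """Rough token cost for audio of the given duration."""
    return int(duration_s * tokens_per_second)

# Usage sketch:
# data = Path("customer_call.wav").read_bytes()
# tokens = estimate_audio_tokens(wav_duration_seconds(data))
```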

Building a Multi-Modal Agent

Here is a complete agent that processes mixed media inputs:

import google.generativeai as genai
from pathlib import Path
import os
import time

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

class MultiModalAgent:
    def __init__(self):
        self.model = genai.GenerativeModel(
            "gemini-2.0-flash",
            system_instruction=(
                "You are a helpful assistant that can analyze text, "
                "images, audio, and video. Always describe what you "
                "observe in each media type before answering questions."
            ),
        )
        self.chat = self.model.start_chat()

    def send(self, text: str, media_paths: list[str] | None = None) -> str:
        parts = []
        for path in media_paths or []:
            file_obj = genai.upload_file(path=path)
            # Poll until the uploaded file finishes server-side processing
            while file_obj.state.name == "PROCESSING":
                time.sleep(2)
                file_obj = genai.get_file(file_obj.name)
            if file_obj.state.name == "FAILED":
                raise ValueError(f"Processing failed for {path}")
            parts.append(file_obj)
        parts.append(text)

        response = self.chat.send_message(parts)
        return response.text

agent = MultiModalAgent()

# Analyze an image and audio together
result = agent.send(
    "The image shows our server dashboard and the audio is an alert notification. "
    "What is the server status and is the alert critical?",
    media_paths=["dashboard.png", "alert.wav"],
)
print(result)
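Since inline data is capped at 20MB while the Files API accepts up to 2GB, an agent like this can route each file by size: small screenshots go inline (no upload round-trip), large recordings go through `genai.upload_file`. `choose_transport` is an illustrative helper of ours:

```python
INLINE_LIMIT = 20 * 1024 * 1024  # 20MB cap for inline request data

def choose_transport(size_bytes: int) -> str:
    """Decide how to attach a media file: inline bytes vs. Files API upload."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    return "inline" if size_bytes <= INLINE_LIMIT else "upload"

# In send() you would branch on choose_transport(os.path.getsize(path)).
print(choose_transport(5 * 1024 * 1024))    # inline
print(choose_transport(300 * 1024 * 1024))  # upload
```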

FAQ

What are the file size limits for Gemini media uploads?

Inline data (passed directly in the request) is limited to 20MB. The Files API supports uploads up to 2GB per file. Uploaded files are stored for 48 hours and then automatically deleted.

Can Gemini process live video streams?

Gemini's standard API processes pre-recorded media. For real-time processing, the Gemini Live API supports streaming audio and video input with low-latency responses, and is available through both the Gemini API and Vertex AI.

How many images can I include in a single request?

Gemini supports up to 3,600 image files in a single request, though practical limits depend on total token count. Each image consumes approximately 258 tokens. For most agent applications, sending 5-20 images per request is the practical sweet spot.


#GoogleGemini #MultiModalAI #ComputerVision #AudioProcessing #Python #AgenticAI #LearnAI #AIEngineering

Written by the CallSphere Team.