
Multi-Modal Prompting: Combining Text, Images, and Code in Single Prompts

Master multi-modal prompting techniques that combine text, images, and code inputs in a single prompt to unlock more capable and context-rich LLM interactions.

Beyond Text-Only Interactions

Multi-modal models like GPT-4o, Claude, and Gemini accept not just text but images, documents, and structured data in a single prompt. This opens up use cases that were impossible with text-only prompting — analyzing screenshots, interpreting charts, debugging UI layouts, and reasoning over diagrams alongside natural language instructions.

The challenge is learning how to structure these mixed-modality prompts effectively. A poorly structured multi-modal prompt wastes the model's attention on irrelevant visual details or fails to connect the image content to the text instructions.

Vision Plus Text: The Basics

The most common multi-modal pattern combines an image with a text instruction. The key is being specific about what the model should focus on in the image:

import openai
import base64
from pathlib import Path

client = openai.OpenAI()

def encode_image(image_path: str) -> str:
    """Encode an image to base64 for the API."""
    image_data = Path(image_path).read_bytes()
    return base64.b64encode(image_data).decode("utf-8")

def analyze_image(
    image_path: str,
    instruction: str,
    detail: str = "high",
) -> str:
    """Analyze an image with a specific text instruction."""
    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                        "detail": detail,
                    },
                },
            ]},
        ],
    )
    return response.choices[0].message.content

# Specific instruction beats generic "describe this image"
result = analyze_image(
    "dashboard_screenshot.png",
    "Identify all error states visible in this dashboard screenshot. "
    "For each error, note the component name, the error message, "
    "and suggest a likely root cause based on the displayed data."
)
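The encode_image helper above hardcodes image/png in the data URL, which misreports JPEG or WebP inputs. A small variant (a sketch — encode_image_with_mime and image_data_url are hypothetical helper names) can infer the MIME type from the file extension with the standard-library mimetypes module:

```python
import base64
import mimetypes
from pathlib import Path

def encode_image_with_mime(image_path: str) -> tuple[str, str]:
    """Base64-encode an image and infer its MIME type from the extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None:
        mime = "image/png"  # reasonable default for screenshots
    data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return data, mime

def image_data_url(image_path: str) -> str:
    """Build a data: URL suitable for the image_url field."""
    data, mime = encode_image_with_mime(image_path)
    return f"data:{mime};base64,{data}"
```

Passing the resulting URL into the image_url field works the same way for .png, .jpg, or .webp files without touching the rest of the prompt-building code.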

The detail parameter matters for cost and quality. Use "high" when the image contains small text, code, or fine details. Use "low" for simple diagrams or when you only need a general understanding.

Multi-Image Comparison Prompts

You can include multiple images in a single prompt for comparison tasks:


def compare_designs(
    before_path: str,
    after_path: str,
    focus_areas: list[str],
) -> str:
    """Compare two UI designs and identify differences."""
    before_b64 = encode_image(before_path)
    after_b64 = encode_image(after_path)
    focus = ", ".join(focus_areas)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": (
                    "Compare these two UI designs. The first image is "
                    "the BEFORE state and the second is the AFTER state. "
                    f"Focus specifically on: {focus}. "
                    "List every visual difference you find."
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{before_b64}",
                    "detail": "high",
                }},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{after_b64}",
                    "detail": "high",
                }},
            ]},
        ],
    )
    return response.choices[0].message.content

Code Plus Text: Structured Analysis

Combining code snippets with natural language context produces better analysis than either alone:

def review_code_with_context(
    code: str,
    language: str,
    architecture_description: str,
    review_focus: list[str],
) -> str:
    """Review code with architectural context."""
    focus_items = "\n".join(f"- {f}" for f in review_focus)

    prompt = (
        f"## Architecture Context\n\n{architecture_description}\n\n"
        f"## Code to Review\n\n"
        f"~~~{language}\n{code}\n~~~\n\n"
        f"## Review Focus Areas\n\n{focus_items}\n\n"
        "Provide a structured review addressing each focus area. "
        "Reference specific line numbers and suggest concrete fixes."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Structured Multi-Modal Inputs

For complex tasks, structure your multi-modal prompt with clear sections:

def structured_multimodal_prompt(
    text_context: str,
    image_paths: list[str],
    code_snippet: str,
    task: str,
) -> str:
    """Build a structured multi-modal prompt."""
    content = [
        {"type": "text", "text": (
            f"## Task\n\n{task}\n\n"
            f"## Context\n\n{text_context}\n\n"
            f"## Relevant Code\n\n~~~python\n{code_snippet}\n~~~\n\n"
            "## Visual References\n\n"
            "Analyze the following images in order:"
        )},
    ]

    for i, path in enumerate(image_paths):
        content.append(
            {"type": "text", "text": f"\nImage {i + 1}:"}
        )
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{encode_image(path)}",
                "detail": "high",
            },
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

The pattern of labeling images ("Image 1:", "Image 2:") and providing context before the images helps the model understand the relationship between modalities. Without this structure, the model may describe each image independently rather than integrating information across all inputs.

FAQ

Do all models support multi-modal prompts the same way?

No. The API format varies by provider. OpenAI uses content arrays with type: "text" and type: "image_url" objects. Anthropic uses type: "image" with base64 data in a source block. Google Gemini uses inline_data with mime_type. Always check the provider's documentation for the exact format.
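As a concrete illustration of those format differences, here is a sketch of two small builders — anthropic_image_block and openai_image_block are hypothetical names — that produce an Anthropic-style base64 image block and the OpenAI data-URL equivalent from the same file:

```python
import base64
from pathlib import Path

def anthropic_image_block(image_path: str, media_type: str = "image/png") -> dict:
    """Build an image content block in Anthropic's Messages API shape
    (type "image" with a base64 source block)."""
    data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": data,
        },
    }

def openai_image_block(image_path: str, detail: str = "high") -> dict:
    """Build the equivalent block in OpenAI's shape
    (type "image_url" with a data: URL)."""
    data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{data}",
            "detail": detail,
        },
    }
```

Isolating the provider-specific shape in one function per provider keeps the rest of your prompt-assembly code portable.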

How does image resolution affect quality and cost?

Higher-resolution images consume more tokens. GPT-4o's detail: "high" mode scales the image so its shortest side is at most 768px, tiles it into 512x512 patches, and charges roughly 170 tokens per tile plus an 85-token base. A 2048x2048 image therefore scales down to 768x768 and splits into four tiles, costing about 765 tokens. Use detail: "low" (85 tokens flat) when fine detail is not needed to save significantly on cost.
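The tile arithmetic can be sketched as a small estimator. estimate_image_tokens is a hypothetical helper; the per-tile and base figures default to OpenAI's published GPT-4o vision rates at the time of writing, and are exposed as parameters in case pricing changes:

```python
import math

def estimate_image_tokens(
    width: int,
    height: int,
    detail: str = "high",
    per_tile: int = 170,
    base: int = 85,
) -> int:
    """Estimate vision token cost with the tile-based formula:
    fit within 2048x2048, scale the shortest side to at most 768px,
    then count 512x512 tiles."""
    if detail == "low":
        return base  # flat cost, no tiling
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return tiles * per_tile + base

estimate_image_tokens(2048, 2048)                 # 4 tiles -> 765 tokens
estimate_image_tokens(4096, 4096, detail="low")   # flat 85 tokens
```

Running the estimator over a batch of screenshots before sending them is a cheap way to decide which images justify detail: "high".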

Can I combine images with tool-use in a single interaction?

Yes. Multi-modal inputs work alongside function calling and tool use. A practical example is an agent that receives a screenshot, uses vision to understand the UI state, calls a tool to interact with the application, and then takes another screenshot to verify the result — all within a single agent loop.
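A minimal sketch of that see-act-verify loop, with the vision call and the UI tools injected as plain callables — analyze, act, and screenshot here are stand-ins for your model call and tool implementations, not real APIs:

```python
from typing import Callable

def vision_tool_loop(
    analyze: Callable[[bytes, str], str],  # vision model call (stubbed here)
    act: Callable[[str], None],            # tool that performs a UI action
    screenshot: Callable[[], bytes],       # tool that captures the screen
    goal: str,
    max_steps: int = 5,
) -> bool:
    """Alternate between seeing (screenshot + vision) and acting (tool call)
    until the model reports the goal is met or the step budget runs out."""
    for _ in range(max_steps):
        image = screenshot()
        verdict = analyze(
            image,
            f"Goal: {goal}. Reply DONE if the goal is met, "
            "otherwise NEXT:<action to take>.",
        )
        if verdict.strip().startswith("DONE"):
            return True
        act(verdict.split("NEXT:", 1)[-1].strip())
    return False
```

In a real agent, analyze would wrap a multi-modal chat completion and act would dispatch a function-calling tool; the loop structure stays the same.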


#PromptEngineering #MultiModal #Vision #GPT4o #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
