
Multi-Modal Prompting: Combining Text, Images, and Code in Single Prompts

Master multi-modal prompting techniques that combine text, images, and code inputs in a single prompt to unlock more capable and context-rich LLM interactions.

Beyond Text-Only Interactions

Multi-modal models like GPT-4o, Claude, and Gemini accept not just text but images, documents, and structured data in a single prompt. This opens up use cases that were impossible with text-only prompting — analyzing screenshots, interpreting charts, debugging UI layouts, and reasoning over diagrams alongside natural language instructions.

The challenge is learning how to structure these mixed-modality prompts effectively. A poorly structured multi-modal prompt wastes the model's attention on irrelevant visual details or fails to connect the image content to the text instructions.

Vision Plus Text: The Basics

The most common multi-modal pattern combines an image with a text instruction. The key is being specific about what the model should focus on in the image:

import openai
import base64
from pathlib import Path

client = openai.OpenAI()

def encode_image(image_path: str) -> str:
    """Encode an image to base64 for the API."""
    image_data = Path(image_path).read_bytes()
    return base64.b64encode(image_data).decode("utf-8")

def analyze_image(
    image_path: str,
    instruction: str,
    detail: str = "high",
) -> str:
    """Analyze an image with a specific text instruction."""
    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                        "detail": detail,
                    },
                },
            ]},
        ],
    )
    return response.choices[0].message.content

# Specific instruction beats generic "describe this image"
result = analyze_image(
    "dashboard_screenshot.png",
    "Identify all error states visible in this dashboard screenshot. "
    "For each error, note the component name, the error message, "
    "and suggest a likely root cause based on the displayed data."
)
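The encode_image helper above hardcodes image/png in the data URL, which misreports JPEG or WebP inputs. A small variant (a sketch — encode_image_with_mime and image_data_url are hypothetical helper names) can infer the MIME type from the file extension with the standard-library mimetypes module:

```python
import base64
import mimetypes
from pathlib import Path

def encode_image_with_mime(image_path: str) -> tuple[str, str]:
    """Base64-encode an image and infer its MIME type from the extension."""
    mime, _ = mimetypes.guess_type(image_path)
    if mime is None:
        mime = "image/png"  # reasonable default for screenshots
    data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return data, mime

def image_data_url(image_path: str) -> str:
    """Build a data: URL suitable for the image_url field."""
    data, mime = encode_image_with_mime(image_path)
    return f"data:{mime};base64,{data}"
```

Passing the resulting URL into the image_url field works the same way for .png, .jpg, or .webp files without touching the rest of the prompt-building code.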

The detail parameter matters for cost and quality. Use "high" when the image contains small text, code, or fine details. Use "low" for simple diagrams or when you only need a general understanding.

Multi-Image Comparison Prompts

You can include multiple images in a single prompt for comparison tasks:


def compare_designs(
    before_path: str,
    after_path: str,
    focus_areas: list[str],
) -> str:
    """Compare two UI designs and identify differences."""
    before_b64 = encode_image(before_path)
    after_b64 = encode_image(after_path)
    focus = ", ".join(focus_areas)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": (
                    "Compare these two UI designs. The first image is "
                    "the BEFORE state and the second is the AFTER state. "
                    f"Focus specifically on: {focus}. "
                    "List every visual difference you find."
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{before_b64}",
                    "detail": "high",
                }},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{after_b64}",
                    "detail": "high",
                }},
            ]},
        ],
    )
    return response.choices[0].message.content

Code Plus Text: Structured Analysis

Combining code snippets with natural language context produces better analysis than either alone:

def review_code_with_context(
    code: str,
    language: str,
    architecture_description: str,
    review_focus: list[str],
) -> str:
    """Review code with architectural context."""
    focus_items = "\n".join(f"- {f}" for f in review_focus)

    prompt = (
        f"## Architecture Context\n\n{architecture_description}\n\n"
        f"## Code to Review\n\n"
        f"~~~{language}\n{code}\n~~~\n\n"
        f"## Review Focus Areas\n\n{focus_items}\n\n"
        "Provide a structured review addressing each focus area. "
        "Reference specific line numbers and suggest concrete fixes."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Structured Multi-Modal Inputs

For complex tasks, structure your multi-modal prompt with clear sections:

def structured_multimodal_prompt(
    text_context: str,
    image_paths: list[str],
    code_snippet: str,
    task: str,
) -> str:
    """Build a structured multi-modal prompt."""
    content = [
        {"type": "text", "text": (
            f"## Task\n\n{task}\n\n"
            f"## Context\n\n{text_context}\n\n"
            f"## Relevant Code\n\n~~~python\n{code_snippet}\n~~~\n\n"
            "## Visual References\n\n"
            "Analyze the following images in order:"
        )},
    ]

    for i, path in enumerate(image_paths):
        content.append(
            {"type": "text", "text": f"\nImage {i + 1}:"}
        )
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{encode_image(path)}",
                "detail": "high",
            },
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content

The pattern of labeling images ("Image 1:", "Image 2:") and providing context before the images helps the model understand the relationship between modalities. Without this structure, the model may describe each image independently rather than integrating information across all inputs.

FAQ

Do all models support multi-modal prompts the same way?

No. The API format varies by provider. OpenAI uses content arrays with type: "text" and type: "image_url" objects. Anthropic uses type: "image" with base64 data in a source block. Google Gemini uses inline_data with mime_type. Always check the provider's documentation for the exact format.
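As a concrete illustration of those format differences, here is a sketch of two small builders — anthropic_image_block and openai_image_block are hypothetical names — that produce an Anthropic-style base64 image block and the OpenAI data-URL equivalent from the same file:

```python
import base64
from pathlib import Path

def anthropic_image_block(image_path: str, media_type: str = "image/png") -> dict:
    """Build an image content block in Anthropic's Messages API shape
    (type "image" with a base64 source block)."""
    data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": data,
        },
    }

def openai_image_block(image_path: str, detail: str = "high") -> dict:
    """Build the equivalent block in OpenAI's shape
    (type "image_url" with a data: URL)."""
    data = base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/png;base64,{data}",
            "detail": detail,
        },
    }
```

Isolating the provider-specific shape in one function per provider keeps the rest of your prompt-assembly code portable.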

How does image resolution affect quality and cost?

Higher-resolution images consume more tokens. GPT-4o's detail: "high" mode scales the image so its shortest side is at most 768px, tiles it into 512x512 patches, and charges roughly 170 tokens per tile plus an 85-token base. A 2048x2048 image therefore scales down to 768x768 and splits into four tiles, costing about 765 tokens. Use detail: "low" (85 tokens flat) when fine detail is not needed to save significantly on cost.
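The tile arithmetic can be sketched as a small estimator. estimate_image_tokens is a hypothetical helper; the per-tile and base figures default to OpenAI's published GPT-4o vision rates at the time of writing, and are exposed as parameters in case pricing changes:

```python
import math

def estimate_image_tokens(
    width: int,
    height: int,
    detail: str = "high",
    per_tile: int = 170,
    base: int = 85,
) -> int:
    """Estimate vision token cost with the tile-based formula:
    fit within 2048x2048, scale the shortest side to at most 768px,
    then count 512x512 tiles."""
    if detail == "low":
        return base  # flat cost, no tiling
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return tiles * per_tile + base

estimate_image_tokens(2048, 2048)                 # 4 tiles -> 765 tokens
estimate_image_tokens(4096, 4096, detail="low")   # flat 85 tokens
```

Running the estimator over a batch of screenshots before sending them is a cheap way to decide which images justify detail: "high".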

Can I combine images with tool-use in a single interaction?

Yes. Multi-modal inputs work alongside function calling and tool use. A practical example is an agent that receives a screenshot, uses vision to understand the UI state, calls a tool to interact with the application, and then takes another screenshot to verify the result — all within a single agent loop.
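A minimal sketch of that see-act-verify loop, with the vision call and the UI tools injected as plain callables — analyze, act, and screenshot here are stand-ins for your model call and tool implementations, not real APIs:

```python
from typing import Callable

def vision_tool_loop(
    analyze: Callable[[bytes, str], str],  # vision model call (stubbed here)
    act: Callable[[str], None],            # tool that performs a UI action
    screenshot: Callable[[], bytes],       # tool that captures the screen
    goal: str,
    max_steps: int = 5,
) -> bool:
    """Alternate between seeing (screenshot + vision) and acting (tool call)
    until the model reports the goal is met or the step budget runs out."""
    for _ in range(max_steps):
        image = screenshot()
        verdict = analyze(
            image,
            f"Goal: {goal}. Reply DONE if the goal is met, "
            "otherwise NEXT:<action to take>.",
        )
        if verdict.strip().startswith("DONE"):
            return True
        act(verdict.split("NEXT:", 1)[-1].strip())
    return False
```

In a real agent, analyze would wrap a multi-modal chat completion and act would dispatch a function-calling tool; the loop structure stays the same.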


#PromptEngineering #MultiModal #Vision #GPT4o #Python #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
