---
title: "Multi-Modal Prompting: Combining Text, Images, and Code in Single Prompts"
description: "Master multi-modal prompting techniques that combine text, images, and code inputs in a single prompt to unlock more capable and context-rich LLM interactions."
canonical: https://callsphere.ai/blog/multi-modal-prompting-text-images-code-single-prompts
category: "Learn Agentic AI"
tags: ["Prompt Engineering", "Multi-Modal", "Vision", "GPT-4o", "Python"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-07T13:51:44.715Z
---

# Multi-Modal Prompting: Combining Text, Images, and Code in Single Prompts

> Master multi-modal prompting techniques that combine text, images, and code inputs in a single prompt to unlock more capable and context-rich LLM interactions.

## Beyond Text-Only Interactions

Multi-modal models like GPT-4o, Claude, and Gemini accept not just text but images, documents, and structured data in a single prompt. This opens up use cases that were impossible with text-only prompting — analyzing screenshots, interpreting charts, debugging UI layouts, and reasoning over diagrams alongside natural language instructions.

The challenge is learning how to structure these mixed-modality prompts effectively. A poorly structured multi-modal prompt wastes the model's attention on irrelevant visual details or fails to connect the image content to the text instructions.

## Vision Plus Text: The Basics

The most common multi-modal pattern combines an image with a text instruction. The key is being specific about what the model should focus on in the image:

```mermaid
flowchart TD
    INSTR["Text instruction
specific focus"]
    IMG(["Image file"])
    ENC["Base64 encode
data URL"]
    DETAIL{"Small text or
fine detail?"}
    HIGH["detail: high"]
    LOW["detail: low"]
    CONTENT["Content array
text plus image_url"]
    LLM["GPT-4o call"]
    OUT(["Grounded analysis"])
    IMG --> ENC --> DETAIL
    DETAIL -->|Yes| HIGH --> CONTENT
    DETAIL -->|No| LOW --> CONTENT
    INSTR --> CONTENT --> LLM --> OUT
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style CONTENT fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
import openai
import base64
from pathlib import Path

client = openai.OpenAI()

def encode_image(image_path: str) -> str:
    """Encode an image to base64 for the API."""
    image_data = Path(image_path).read_bytes()
    return base64.b64encode(image_data).decode("utf-8")

def analyze_image(
    image_path: str,
    instruction: str,
    detail: str = "high",
) -> str:
    """Analyze an image with a specific text instruction."""
    base64_image = encode_image(image_path)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                        "detail": detail,
                    },
                },
            ]},
        ],
    )
    return response.choices[0].message.content

# Specific instruction beats generic "describe this image"
result = analyze_image(
    "dashboard_screenshot.png",
    "Identify all error states visible in this dashboard screenshot. "
    "For each error, note the component name, the error message, "
    "and suggest a likely root cause based on the displayed data."
)
```

The `detail` parameter matters for cost and quality. Use `"high"` when the image contains small text, code, or fine details. Use `"low"` for simple diagrams or when you only need a general understanding.
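For instance, a low-detail call is usually enough for a coarse layout question (the file name here is illustrative):

```python
# Low detail: the model sees a downscaled image, at a flat token cost
summary = analyze_image(
    "architecture_diagram.png",
    "Summarize the data flow between the services shown in this diagram.",
    detail="low",
)
```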

## Multi-Image Comparison Prompts

You can include multiple images in a single prompt for comparison tasks:

```python
def compare_designs(
    before_path: str,
    after_path: str,
    focus_areas: list[str],
) -> str:
    """Compare two UI designs and identify differences."""
    before_b64 = encode_image(before_path)
    after_b64 = encode_image(after_path)
    focus = ", ".join(focus_areas)

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": (
                    "Compare these two UI designs. The first image is "
                    "the BEFORE state and the second is the AFTER state. "
                    f"Focus specifically on: {focus}. "
                    "List every visual difference you find."
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{before_b64}",
                    "detail": "high",
                }},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{after_b64}",
                    "detail": "high",
                }},
            ]},
        ],
    )
    return response.choices[0].message.content
```
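A typical invocation, assuming two local screenshots of a checkout flow (the file names and focus areas are illustrative):

```python
diff_report = compare_designs(
    "checkout_before.png",
    "checkout_after.png",
    focus_areas=["button placement", "color contrast", "form field labels"],
)
print(diff_report)
```

Listing the images in the same order the prompt describes them ("the first image is the BEFORE state") is what lets the model map each image to its role.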

## Code Plus Text: Structured Analysis

Combining code snippets with natural language context produces better analysis than either alone:

```python
def review_code_with_context(
    code: str,
    language: str,
    architecture_description: str,
    review_focus: list[str],
) -> str:
    """Review code with architectural context."""
    focus_items = "\n".join(f"- {f}" for f in review_focus)

    prompt = (
        f"## Architecture Context\n\n{architecture_description}\n\n"
        f"## Code to Review\n\n"
        f"~~~{language}\n{code}\n~~~\n\n"
        f"## Review Focus Areas\n\n{focus_items}\n\n"
        "Provide a structured review addressing each focus area. "
        "Reference specific line numbers and suggest concrete fixes."
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
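A sketch of how this might be called in practice; the file path, architecture summary, and focus areas are hypothetical stand-ins for your own project:

```python
review = review_code_with_context(
    code=Path("payments/webhook_handler.py").read_text(),
    language="python",
    architecture_description=(
        "Payment webhooks arrive from the provider, are signature-verified, "
        "and enqueue jobs on a Redis-backed worker. Handlers must be idempotent."
    ),
    review_focus=["idempotency", "error handling", "input validation"],
)
```

Note the tilde fences (`~~~`) in the prompt builder: they avoid colliding with any backtick fences inside the code under review.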

## Structured Multi-Modal Inputs

For complex tasks, structure your multi-modal prompt with clear sections:

```python
def structured_multimodal_prompt(
    text_context: str,
    image_paths: list[str],
    code_snippet: str,
    task: str,
) -> str:
    """Build a structured multi-modal prompt."""
    content = [
        {"type": "text", "text": (
            f"## Task\n\n{task}\n\n"
            f"## Context\n\n{text_context}\n\n"
            f"## Relevant Code\n\n~~~python\n{code_snippet}\n~~~\n\n"
            "## Visual References\n\nAnalyze the following images in order:"
        )},
    ]

    for i, path in enumerate(image_paths):
        content.append(
            {"type": "text", "text": f"\nImage {i + 1}:"}
        )
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{encode_image(path)}",
                "detail": "high",
            },
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```

The pattern of labeling images ("Image 1:", "Image 2:") and providing context before the images helps the model understand the relationship between modalities. Without this structure, the model may describe each image independently rather than integrating information across all inputs.
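Putting it together, a hypothetical debugging session might pass a bug report, two screenshots, and the relevant rendering code in one call (the names are illustrative):

```python
answer = structured_multimodal_prompt(
    text_context=(
        "Users report that the chart legend overlaps the axis labels "
        "on narrow screens."
    ),
    image_paths=["chart_desktop.png", "chart_mobile.png"],
    code_snippet=Path("charts/legend_layout.py").read_text(),
    task="Explain why the overlap happens and propose a fix in the layout code.",
)
```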

## FAQ

### Do all models support multi-modal prompts the same way?

No. The API format varies by provider. OpenAI uses content arrays with `type: "text"` and `type: "image_url"` objects. Anthropic uses `type: "image"` with base64 data in a `source` block. Google Gemini uses `inline_data` with `mime_type`. Always check the provider's documentation for the exact format.
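For comparison, here is a minimal sketch of the same image-plus-text message in Anthropic's Messages API shape (the model name and token limit are illustrative; check the current docs before relying on them):

```python
import anthropic

anthropic_client = anthropic.Anthropic()

response = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": encode_image("dashboard_screenshot.png"),
                },
            },
            {"type": "text", "text": "Identify all error states in this screenshot."},
        ]},
    ],
)
print(response.content[0].text)
```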

### How does image resolution affect quality and cost?

Higher-resolution images consume more tokens. GPT-4o's `detail: "high"` mode first scales the image down (longest side capped at 2048 px, then shortest side at 768 px) and tiles it into 512x512 patches, costing roughly 170 tokens per tile plus a flat 85-token base. A 2048x2048 image is scaled to 768x768, which is four tiles, or about 765 tokens. Use `detail: "low"` (85 tokens flat) when fine detail is not needed to save significantly on cost.
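A rough estimator for the high-detail token count, based on OpenAI's published scaling rules (treat the constants as subject to change and verify against current pricing docs):

```python
import math

def estimate_high_detail_tokens(width: int, height: int) -> int:
    """Estimate vision tokens for detail='high' per OpenAI's documented rules."""
    # First scale so the longest side is at most 2048 px
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Then scale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # 170 tokens per 512x512 tile, plus a flat 85-token base
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(estimate_high_detail_tokens(2048, 2048))  # 765: scaled to 768x768, 4 tiles
```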

### Can I combine images with tool-use in a single interaction?

Yes. Multi-modal inputs work alongside function calling and tool use. A practical example is an agent that receives a screenshot, uses vision to understand the UI state, calls a tool to interact with the application, and then takes another screenshot to verify the result — all within a single agent loop.
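A compressed sketch of one step of that loop, reusing `encode_image` from earlier and assuming a hypothetical `click_element` tool (the schema, file name, and labels are illustrative):

```python
tools = [{
    "type": "function",
    "function": {
        "name": "click_element",
        "description": "Click a UI element identified by its visible label.",
        "parameters": {
            "type": "object",
            "properties": {"label": {"type": "string"}},
            "required": ["label"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    tools=tools,
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "Dismiss the error dialog in this screenshot."},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{encode_image('app_state.png')}",
                "detail": "high",
            }},
        ]},
    ],
)

# If the model chose a tool, execute it, take a fresh screenshot, and loop
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```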

---

#PromptEngineering #MultiModal #Vision #GPT4o #Python #AgenticAI #LearnAI #AIEngineering

---

Source: https://callsphere.ai/blog/multi-modal-prompting-text-images-code-single-prompts
