
OpenAI JSON Mode and Structured Outputs: Reliable Data Extraction

Master OpenAI's JSON mode and structured outputs to extract reliable, schema-validated data from LLMs with guaranteed format compliance and Pydantic integration.

The Problem with Unstructured LLM Output

By default, LLMs return free-form text. When you need structured data — a JSON object with specific fields, types, and constraints — you are relying on the model to follow your prompt instructions perfectly. It usually works, but sometimes the model wraps JSON in markdown code fences, adds extra commentary, omits fields, or returns invalid JSON.

OpenAI provides two mechanisms to solve this: JSON mode and structured outputs. Both guarantee valid JSON, but structured outputs go further by enforcing a specific schema.

JSON Mode: Guaranteed Valid JSON

JSON mode ensures the model outputs valid JSON, but does not enforce a specific structure:

import json

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract the person's details as JSON with name, age, and city fields."},
        {"role": "user", "content": "John Smith is 34 years old and lives in Chicago."},
    ],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)
print(data)
# {"name": "John Smith", "age": 34, "city": "Chicago"}

Important: when using JSON mode, you must mention JSON somewhere in your system or user message. The API requires this and returns an error if you do not.

Structured Outputs: Schema-Enforced JSON

Structured outputs go beyond JSON mode by enforcing a specific JSON schema:

import json

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract product information from the text."},
        {"role": "user", "content": "The MacBook Pro 16-inch costs $2499, weighs 4.8 lbs, and has an M3 Max chip."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "product_info",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "product_name": {"type": "string"},
                    "price_usd": {"type": "number"},
                    "weight_lbs": {"type": "number"},
                    "processor": {"type": "string"},
                },
                "required": ["product_name", "price_usd", "weight_lbs", "processor"],
                "additionalProperties": False,
            },
        },
    },
)

data = json.loads(response.choices[0].message.content)
print(data)

With strict: True, the model is constrained to output JSON that conforms exactly to your schema. Every required field will be present, types will match, and no extra fields will appear.
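Strict mode imposes requirements on the schema itself: every property must appear in required, and additionalProperties must be false. A minimal sketch of a pre-flight check (check_strict_schema is an illustrative helper, not part of the OpenAI SDK) that catches the most common violations before the API rejects the request:

```python
def check_strict_schema(schema: dict) -> list[str]:
    """Flag common strict-mode violations in an object schema."""
    problems = []
    # Strict mode requires additionalProperties to be explicitly false.
    if schema.get("additionalProperties") is not False:
        problems.append("additionalProperties must be false")
    # Strict mode requires every property to be listed in "required".
    props = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    if props != required:
        problems.append(
            f"every property must be required; missing: {sorted(props - required)}"
        )
    return problems

product_schema = {
    "type": "object",
    "properties": {
        "product_name": {"type": "string"},
        "price_usd": {"type": "number"},
    },
    "required": ["product_name"],
    "additionalProperties": False,
}

print(check_strict_schema(product_schema))  # flags price_usd as missing from "required"
```

Running this check locally gives a clearer error message than a 400 response from the API.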

Pydantic Integration

The SDK integrates with Pydantic models for a cleaner developer experience:


from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class ContactInfo(BaseModel):
    name: str
    email: str
    phone: str
    company: str

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract contact information from the text."},
        {"role": "user", "content": "Reach out to Sarah Connor at [email protected] or 555-0199. She works at Cyberdyne Systems."},
    ],
    response_format=ContactInfo,
)

contact = response.choices[0].message.parsed
print(f"Name: {contact.name}")
print(f"Email: {contact.email}")
print(f"Phone: {contact.phone}")
print(f"Company: {contact.company}")

The .parse() method automatically converts the Pydantic model into a JSON schema, sends it to the API, and parses the response back into a typed Pydantic instance.
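You can inspect the schema Pydantic generates for yourself with model_json_schema(); the SDK derives its request payload from this representation. A trimmed-down sketch (two fields instead of four, for brevity):

```python
from pydantic import BaseModel

class ContactInfo(BaseModel):
    name: str
    email: str

# The JSON schema the SDK's .parse() path is built on.
schema = ContactInfo.model_json_schema()

print(schema["type"])                # object
print(sorted(schema["properties"]))  # ['email', 'name']
```

Inspecting this output is a quick way to debug why a model definition is rejected by strict mode.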

Nested and Complex Schemas

Structured outputs support nested objects, arrays, and enums:

from pydantic import BaseModel
from enum import Enum

class Severity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class Step(BaseModel):
    description: str
    estimated_hours: float

class BugReport(BaseModel):
    title: str
    severity: Severity
    affected_component: str
    steps_to_reproduce: list[Step]
    expected_behavior: str
    actual_behavior: str

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Parse the bug report into structured format."},
        {"role": "user", "content": "Critical bug in the payment module. When a user clicks 'Pay Now' with an expired card (takes 2 seconds), the system shows a success message instead of an error. Expected: error message. Actual: success confirmation."},
    ],
    response_format=BugReport,
)

bug = response.choices[0].message.parsed
print(f"Title: {bug.title}")
print(f"Severity: {bug.severity}")
print(f"Steps: {len(bug.steps_to_reproduce)}")
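Parsed results are ordinary Pydantic instances, so they serialize back to JSON and reload with full validation. A self-contained sketch using trimmed-down versions of the models above:

```python
from enum import Enum
from pydantic import BaseModel

class Severity(str, Enum):
    low = "low"
    critical = "critical"

class Step(BaseModel):
    description: str
    estimated_hours: float

class BugReport(BaseModel):
    title: str
    severity: Severity
    steps_to_reproduce: list[Step]

bug = BugReport(
    title="Expired card shows success",
    severity=Severity.critical,
    steps_to_reproduce=[
        Step(description="Click 'Pay Now' with an expired card", estimated_hours=0.1),
    ],
)

# Serialize for storage or an API response...
payload = bug.model_dump_json(indent=2)
# ...and reload it later, with the same validation applied.
restored = BugReport.model_validate_json(payload)
```

This round-trip makes structured outputs easy to persist in a database or pass between services.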

Handling Refusals

Sometimes the model refuses to fill the schema (e.g., for safety reasons). Check for this:

response = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Extract the information."},
        {"role": "user", "content": "Some input text here."},
    ],
    response_format=ContactInfo,
)

message = response.choices[0].message
if message.refusal:
    print(f"Model refused: {message.refusal}")
else:
    contact = message.parsed
    print(contact)
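In a pipeline you often want to retry rather than fail on the first refusal. A sketch of a refusal-aware retry loop — parse_with_retry and its call argument are illustrative names, not SDK API; call is any zero-argument function returning an object with .refusal and .parsed attributes, like the message objects above:

```python
import time

def parse_with_retry(call, attempts: int = 3, backoff: float = 1.0):
    """Retry a parse call when the model refuses, with exponential backoff."""
    message = None
    for attempt in range(attempts):
        # e.g. call = lambda: client.beta.chat.completions.parse(...).choices[0].message
        message = call()
        if not message.refusal:
            return message.parsed
        if attempt < attempts - 1:
            time.sleep(backoff * (2 ** attempt))
    raise RuntimeError(f"Model refused after {attempts} attempts: {message.refusal}")
```

Persistent refusals usually indicate a problem with the input rather than bad luck, so cap the attempts and surface the refusal text.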

Practical Example: Invoice Parsing

Here is a realistic data extraction pipeline:

from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    vendor_name: str
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float

def parse_invoice(raw_text: str) -> Invoice:
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Parse the invoice text into structured data. Calculate totals if not explicitly stated."},
            {"role": "user", "content": raw_text},
        ],
        response_format=Invoice,
    )
    return response.choices[0].message.parsed
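Schema enforcement guarantees shape, not arithmetic — the model can still compute a wrong total. A hedged sketch of a cross-check to run on the parsed result (validate_invoice is an illustrative helper; the models here are trimmed-down versions of the ones above):

```python
import math

from pydantic import BaseModel

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    line_items: list[LineItem]
    subtotal: float
    tax: float
    total: float

def validate_invoice(inv: Invoice) -> bool:
    """Cross-check the extracted arithmetic before trusting it downstream."""
    items_ok = all(
        math.isclose(li.quantity * li.unit_price, li.total, abs_tol=0.01)
        for li in inv.line_items
    )
    subtotal_ok = math.isclose(
        sum(li.total for li in inv.line_items), inv.subtotal, abs_tol=0.01
    )
    total_ok = math.isclose(inv.subtotal + inv.tax, inv.total, abs_tol=0.01)
    return items_ok and subtotal_ok and total_ok
```

A failed check is a good trigger for a retry or for routing the invoice to human review.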

FAQ

What is the difference between JSON mode and structured outputs?

JSON mode guarantees the output is valid JSON but does not enforce a specific structure. Structured outputs enforce a specific JSON schema with exact field names, types, and constraints. Use JSON mode for flexibility, structured outputs for reliability.

Do structured outputs work with all OpenAI models?

Structured outputs with json_schema require GPT-4o or later models. JSON mode (json_object) is supported by GPT-4o, GPT-4o-mini, and GPT-3.5-turbo. Check the API documentation for the latest model compatibility.

Can I use optional fields in structured output schemas?

With strict: True, all properties must be listed in required. To make a field optional, use a union type with null: {"type": ["string", "null"]}. In Pydantic, use Optional[str] with a default of None.
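A quick sketch of how an optional Pydantic field looks at the schema level (the Contact model here is illustrative; the SDK is responsible for massaging this into strict-mode form when you pass the model to .parse()):

```python
from typing import Optional

from pydantic import BaseModel

class Contact(BaseModel):
    name: str
    phone: Optional[str] = None  # nullable field with a default

schema = Contact.model_json_schema()
# The optional field becomes a union with null at the schema level.
print(schema["properties"]["phone"])
```

Inspecting the anyOf branch in the output shows the string/null union the FAQ answer describes.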


#OpenAI #JSONMode #StructuredOutputs #Pydantic #DataExtraction #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

