Skip to content
Learn Agentic AI
Learn Agentic AI11 min read2 views

Schema Representation for Text-to-SQL: How to Describe Your Database to LLMs

Master the art of schema representation for text-to-SQL systems. Learn how to format CREATE TABLE statements, add column descriptions, encode foreign key relationships, and provide sample data for maximum query accuracy.

Schema Representation Is the Highest-Leverage Optimization

When building text-to-SQL systems, teams typically focus on model selection and prompt engineering. But the single biggest factor in query accuracy is how you represent the database schema to the LLM. A well-formatted schema description can improve accuracy by 10-20% without changing anything else.

The reason is straightforward: the LLM can only reference columns and tables it knows about. Ambiguous column names, missing relationships, and absent context about what data each column actually contains are the root cause of most query failures.

Format 1: Raw CREATE TABLE Statements

The simplest approach is to dump the DDL directly. This works well for small schemas with self-descriptive column names.

flowchart TD
    START["Schema Representation for Text-to-SQL: How to Des…"] --> A
    A["Schema Representation Is the Highest-Le…"]
    A --> B
    B["Format 1: Raw CREATE TABLE Statements"]
    B --> C
    C["Format 2: Annotated Schema with Column …"]
    C --> D
    D["Format 3: Schema with Sample Data"]
    D --> E
    E["Format 4: Relationship-Focused Represen…"]
    E --> F
    F["Choosing the Right Format"]
    F --> G
    G["FAQ"]
    G --> DONE["Key Takeaways"]
    style START fill:#4f46e5,stroke:#4338ca,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
CREATE TABLE customers (
  id SERIAL PRIMARY KEY,
  full_name VARCHAR(100) NOT NULL,
  email VARCHAR(255) UNIQUE NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE orders (
  id SERIAL PRIMARY KEY,
  customer_id INTEGER NOT NULL REFERENCES customers(id),
  total_amount DECIMAL(10,2) NOT NULL,
  status VARCHAR(20) DEFAULT 'pending',
  ordered_at TIMESTAMP DEFAULT NOW()
);

This format is effective because LLMs have been trained on millions of SQL DDL examples and can parse it natively.

Format 2: Annotated Schema with Column Descriptions

For real-world databases where column names are cryptic or ambiguous, add inline comments explaining what each column means.

def format_annotated_schema(tables: list[dict]) -> str:
    """Format schema with column descriptions as SQL comments."""
    output = []
    for table in tables:
        lines = [f"CREATE TABLE {table['name']} ("]
        for col in table["columns"]:
            line = f"  {col['name']} {col['type']}"
            if col.get("constraints"):
                line += f" {col['constraints']}"
            if col.get("description"):
                line += f"  -- {col['description']}"
            lines.append(line + ",")
        # Remove trailing comma from last column
        lines[-1] = lines[-1].rstrip(",")
        lines.append(");")
        output.append("\n".join(lines))
    return "\n\n".join(output)

# Example usage
tables = [
    {
        "name": "transactions",
        "columns": [
            {"name": "id", "type": "SERIAL", "constraints": "PRIMARY KEY", "description": "Auto-incremented transaction ID"},
            {"name": "acct_no", "type": "VARCHAR(20)", "constraints": "NOT NULL", "description": "Customer account number (format: ACC-XXXXX)"},
            {"name": "txn_amt", "type": "DECIMAL(12,2)", "constraints": "NOT NULL", "description": "Transaction amount in USD, negative for debits"},
            {"name": "txn_type", "type": "VARCHAR(10)", "constraints": "", "description": "One of: credit, debit, transfer, fee"},
            {"name": "posted_dt", "type": "DATE", "constraints": "NOT NULL", "description": "Date the transaction was posted, not initiated"},
        ],
    }
]
print(format_annotated_schema(tables))

The output includes comments like -- Transaction amount in USD, negative for debits that tell the LLM exactly how to interpret the data. Without this, a question like "total deposits" might incorrectly include negative values.

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

Format 3: Schema with Sample Data

Including a few sample rows gives the LLM concrete examples of what the data looks like. This is especially valuable for columns with encoded values or non-obvious formats.

def format_schema_with_samples(db_path: str, table_name: str, n_rows: int = 3) -> str:
    """Generate schema + sample rows for a table."""
    import sqlite3
    conn = sqlite3.connect(db_path)

    # Get CREATE TABLE
    ddl = conn.execute(
        "SELECT sql FROM sqlite_master WHERE name = ?", (table_name,)
    ).fetchone()[0]

    # Get sample rows
    cursor = conn.execute(f"SELECT * FROM {table_name} LIMIT {n_rows}")
    columns = [desc[0] for desc in cursor.description]
    rows = cursor.fetchall()
    conn.close()

    # Format as a readable table
    sample = f"\n/* Sample data from {table_name}:\n"
    sample += " | ".join(columns) + "\n"
    sample += "-" * 60 + "\n"
    for row in rows:
        sample += " | ".join(str(v) for v in row) + "\n"
    sample += "*/"

    return f"{ddl}\n{sample}"

This produces output like:

CREATE TABLE products (
  id INTEGER PRIMARY KEY,
  sku VARCHAR(20) NOT NULL,
  name TEXT NOT NULL,
  category VARCHAR(50),
  price DECIMAL(8,2)
);
/* Sample data from products:
id | sku | name | category | price
------------------------------------------------------------
1 | SKU-A100 | Wireless Keyboard | Electronics | 49.99
2 | SKU-B200 | Ergonomic Chair | Furniture | 299.00
3 | SKU-C300 | USB-C Hub | Electronics | 34.99
*/

Format 4: Relationship-Focused Representation

For schemas with many tables, explicitly stating relationships prevents the LLM from guessing wrong JOIN paths.

def format_relationships(tables: list[dict]) -> str:
    """Generate a relationship summary for multi-table schemas."""
    lines = ["## Table Relationships\n"]
    for table in tables:
        for fk in table.get("foreign_keys", []):
            lines.append(
                f"- {table['name']}.{fk['column']} references "
                f"{fk['ref_table']}.{fk['ref_column']} "
                f"({fk.get('relationship', 'many-to-one')})"
            )
    return "\n".join(lines)

Place this relationship summary before the CREATE TABLE statements in your prompt. This gives the LLM a high-level map before it dives into column details.

Choosing the Right Format

Schema Size Best Format Why
Under 10 tables Raw DDL + sample data Full context fits easily in the prompt
10-50 tables Annotated DDL + relationships Comments resolve ambiguity
50+ tables Two-stage (select relevant tables first) Prevents context window overflow

FAQ

Should I include indexes in the schema representation?

Generally no. Index definitions add noise without helping the LLM generate correct queries. The exception is if you want the LLM to generate performance-optimized queries — in that case, including index information helps it choose covered queries and avoid full table scans.

How do I handle views and materialized views?

Include views in your schema representation if users might ask questions about them. Format them as CREATE VIEW view_name AS ... so the LLM knows they are queryable. For materialized views, add a comment noting that data may be stale.

What if column names conflict across tables?

Explicitly note conflicts in your schema context: "Both orders.status and shipments.status exist but have different meanings. orders.status is one of pending/confirmed/cancelled. shipments.status is one of preparing/shipped/delivered." This prevents the LLM from confusing them in JOIN queries.


#SchemaDesign #TextToSQL #PromptEngineering #DatabaseContext #LLM #AgenticAI #SQLAccuracy #DataModeling

Share
C

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.

Related Articles You May Like

Technical Guides

How to Train an AI Voice Agent on Your Business: Prompts, RAG, and Fine-Tuning

A practical guide to training an AI voice agent on your specific business — system prompts, RAG over knowledge bases, and when to fine-tune.

AI Interview Prep

8 LLM & RAG Interview Questions That OpenAI, Anthropic & Google Actually Ask

Real LLM and RAG interview questions from top AI labs in 2026. Covers fine-tuning vs RAG decisions, production RAG pipelines, evaluation, PEFT methods, positional embeddings, and safety guardrails with expert answers.

Learn Agentic AI

Prompt Engineering for AI Agents: System Prompts, Tool Descriptions, and Few-Shot Patterns

Agent-specific prompt engineering techniques: crafting effective system prompts, writing clear tool descriptions for function calling, and few-shot examples that improve complex task performance.

Learn Agentic AI

Text-to-SQL Fundamentals: Converting Natural Language Questions to Database Queries

Learn what text-to-SQL is, how the architecture works from schema understanding to query generation, and why it is one of the most practical applications of large language models in enterprise software.

Learn Agentic AI

Text-to-SQL with Open-Source Models: SQLCoder, NSQL, and DeFog SQLCoder

Compare open-source text-to-SQL models including SQLCoder, NSQL, and DeFog SQLCoder. Learn how to deploy them locally, fine-tune on your schema, and evaluate accuracy against commercial alternatives.

Learn Agentic AI

Multi-Table Text-to-SQL: Handling JOINs, Subqueries, and Complex Relationships

Master multi-table text-to-SQL challenges including JOIN inference, ambiguous column resolution, query planning for complex questions, and techniques that help LLMs reason across table relationships.