Skip to content
Learn Agentic AI
Learn Agentic AI12 min read9 views

Multi-Table Text-to-SQL: Handling JOINs, Subqueries, and Complex Relationships

Master multi-table text-to-SQL challenges including JOIN inference, ambiguous column resolution, query planning for complex questions, and techniques that help LLMs reason across table relationships.

Why Multi-Table Queries Are Hard for LLMs

Single-table text-to-SQL is largely a solved problem — modern LLMs handle it with 95%+ accuracy. But when questions span multiple tables, accuracy drops to 70-80%. The challenge is not SQL syntax; it is relational reasoning. The LLM must determine which tables to join, through which columns, and how to handle one-to-many versus many-to-many relationships.

Consider this question: "What are the top 5 customers by total spending who have also left a review?" This requires joining customers, orders, and reviews, aggregating order totals, filtering by review existence, and sorting. Each of these steps introduces a potential error.

Schema Design for Multi-Table Reasoning

The way you present your schema directly affects multi-table accuracy. Explicitly state relationships and join paths.

flowchart TD
    START["Multi-Table Text-to-SQL: Handling JOINs, Subqueri…"] --> A
    A["Why Multi-Table Queries Are Hard for LL…"]
    A --> B
    B["Schema Design for Multi-Table Reasoning"]
    B --> C
    C["Join Path Inference"]
    C --> D
    D["Handling Ambiguous Column References"]
    D --> E
    E["Query Planning with Chain of Thought"]
    E --> F
    F["Common Multi-Table Pitfalls"]
    F --> G
    G["FAQ"]
    G --> DONE["Key Takeaways"]
    style START fill:#4f46e5,stroke:#4338ca,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
SCHEMA_WITH_RELATIONSHIPS = """
-- Table Relationships:
-- customers.id -> orders.customer_id (one-to-many: a customer has many orders)
-- orders.id -> order_items.order_id (one-to-many: an order has many items)
-- products.id -> order_items.product_id (one-to-many: a product appears in many order items)
-- customers.id -> reviews.customer_id (one-to-many: a customer writes many reviews)
-- products.id -> reviews.product_id (one-to-many: a product has many reviews)

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    tier VARCHAR(20) DEFAULT 'standard'  -- standard, premium, enterprise
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    order_date DATE NOT NULL,
    status VARCHAR(20) NOT NULL  -- pending, shipped, delivered, cancelled
);

CREATE TABLE order_items (
    id SERIAL PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL,
    category VARCHAR(50) NOT NULL,
    price DECIMAL(10,2) NOT NULL
);

CREATE TABLE reviews (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    rating INTEGER CHECK (rating BETWEEN 1 AND 5),
    review_text TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);
"""

Join Path Inference

When the LLM needs to connect two tables that are not directly related, it must find an intermediate path. For example, connecting products to customers requires going through order_items and orders.

Build a helper that computes join paths and include them in the prompt:

from collections import defaultdict, deque

class SchemaGraph:
    """Graph representation of table relationships for join path finding."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add_relationship(self, table_a: str, col_a: str,
                         table_b: str, col_b: str):
        self.edges[table_a].append((table_b, col_a, col_b))
        self.edges[table_b].append((table_a, col_b, col_a))

    def find_join_path(self, start: str, end: str) -> list[dict]:
        """BFS to find shortest join path between two tables."""
        if start == end:
            return []
        visited = {start}
        queue = deque([(start, [])])

        while queue:
            current, path = queue.popleft()
            for neighbor, from_col, to_col in self.edges[current]:
                if neighbor == end:
                    return path + [{
                        "from_table": current,
                        "from_col": from_col,
                        "to_table": neighbor,
                        "to_col": to_col,
                    }]
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append((neighbor, path + [{
                        "from_table": current,
                        "from_col": from_col,
                        "to_table": neighbor,
                        "to_col": to_col,
                    }]))
        return []

# Build graph from schema
graph = SchemaGraph()
graph.add_relationship("customers", "id", "orders", "customer_id")
graph.add_relationship("orders", "id", "order_items", "order_id")
graph.add_relationship("products", "id", "order_items", "product_id")
graph.add_relationship("customers", "id", "reviews", "customer_id")
graph.add_relationship("products", "id", "reviews", "product_id")

# Find path from products to customers
path = graph.find_join_path("products", "customers")
for step in path:
    print(f"JOIN {step['to_table']} ON {step['from_table']}.{step['from_col']} = {step['to_table']}.{step['to_col']}")

Output:

JOIN order_items ON products.id = order_items.product_id
JOIN orders ON order_items.order_id = orders.id
JOIN customers ON orders.customer_id = customers.id

Handling Ambiguous Column References

When multiple tables have columns with the same name, the LLM must use table aliases. Add disambiguation instructions to your prompt:

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

DISAMBIGUATION_RULES = """
Important disambiguation notes:
- Both orders and reviews have a 'created_at' column. Use o.created_at for
  order dates and r.created_at for review dates.
- Both order_items.unit_price and products.price exist. Use order_items.unit_price
  for actual transaction prices (may include discounts). Use products.price for
  current catalog price.
- Always use table aliases (c for customers, o for orders, oi for order_items,
  p for products, r for reviews).
"""

Query Planning with Chain of Thought

For complex multi-table questions, ask the LLM to plan before writing SQL. This decomposition step significantly improves accuracy.

PLANNING_PROMPT = """Given this question: "{question}"

Before writing SQL, plan your approach:
1. Which tables are needed?
2. What are the join conditions?
3. What aggregations are required?
4. What filters should be applied?
5. What should the result be sorted by?

Then write the SQL query.
"""

def text_to_sql_with_planning(question: str, schema: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Schema:\n{schema}\n{DISAMBIGUATION_RULES}"},
            {"role": "user", "content": PLANNING_PROMPT.format(question=question)},
        ],
        temperature=0,
    )
    text = response.choices[0].message.content
    # Extract SQL from the response (after the planning section)
    if "SELECT" in text.upper():
        sql_start = text.upper().index("SELECT")
        sql = text[sql_start:].split("\n\n")[0].strip().rstrip(";") + ";"
        return sql
    return text

Common Multi-Table Pitfalls

Double counting. When joining a one-to-many relationship, aggregations get inflated. If a customer has 3 orders and 2 reviews, a naive JOIN produces 6 rows. Use DISTINCT or subqueries to avoid this.

Missing GROUP BY columns. LLMs sometimes include non-aggregated columns in SELECT without adding them to GROUP BY. This is valid in MySQL but fails in PostgreSQL and SQLite.

Wrong join type. "Customers who have NOT placed orders" requires LEFT JOIN ... WHERE orders.id IS NULL, not INNER JOIN. The word "not" signals an anti-join pattern.

FAQ

How do I handle many-to-many relationships?

Many-to-many relationships require joining through a junction table. Include the junction table in your schema and add a comment like "students_courses is a junction table linking students and courses (many-to-many)." This explicit annotation helps the LLM choose the correct join path instead of trying to join the two main tables directly.

Should I decompose complex questions into multiple simpler queries?

Yes, for questions requiring data from unrelated table groups. A question like "Show me the top 5 products and the total number of customers" involves independent aggregations. Running two separate queries and combining results is often more accurate than forcing everything into a single query with multiple subqueries.

How do I handle self-joins?

Self-joins (e.g., employee-manager relationships) are challenging for LLMs because the same table appears twice. Add an explicit note like "employees.manager_id references employees.id (self-referencing: each employee has a manager who is also an employee)." Without this, the LLM may create a separate JOIN against a non-existent "managers" table.


#MultiTableSQL #JOINs #TextToSQL #QueryPlanning #DatabaseRelationships #AgenticAI #SQLReasoning #ComplexQueries

Share
C

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.

Related Articles You May Like

Learn Agentic AI

Text-to-SQL Fundamentals: Converting Natural Language Questions to Database Queries

Learn what text-to-SQL is, how the architecture works from schema understanding to query generation, and why it is one of the most practical applications of large language models in enterprise software.

Learn Agentic AI

SQL Query Validation and Safety: Preventing Dangerous Queries from AI Agents

Learn how to build a robust SQL validation layer that prevents injection attacks, enforces read-only access, limits query complexity, and ensures AI-generated queries cannot damage your database.

Learn Agentic AI

Schema Representation for Text-to-SQL: How to Describe Your Database to LLMs

Master the art of schema representation for text-to-SQL systems. Learn how to format CREATE TABLE statements, add column descriptions, encode foreign key relationships, and provide sample data for maximum query accuracy.

Learn Agentic AI

Text-to-SQL with Open-Source Models: SQLCoder, NSQL, and DeFog SQLCoder

Compare open-source text-to-SQL models including SQLCoder, NSQL, and DeFog SQLCoder. Learn how to deploy them locally, fine-tune on your schema, and evaluate accuracy against commercial alternatives.

Learn Agentic AI

Text-to-SQL with Claude: Using Anthropic's API for Database Question Answering

Implement a text-to-SQL system using Anthropic's Claude API with tool use for SQL execution, multi-turn conversations, and structured output parsing for reliable database question answering.

Learn Agentic AI

Building a Text-to-SQL Agent with GPT-4: Schema-Aware Query Generation

Build a complete text-to-SQL agent using GPT-4 that extracts database schemas, generates SQL queries from natural language, executes them safely, and formats results for end users.