---
title: "Text-to-SQL Evaluation: Spider, BIRD, and Custom Benchmarks for Accuracy Testing"
description: "Understand how to evaluate text-to-SQL systems using the Spider and BIRD benchmarks, implement execution accuracy metrics, and build custom evaluation datasets for your specific database schema."
canonical: https://callsphere.ai/blog/text-to-sql-evaluation-spider-bird-benchmarks-accuracy-testing
category: "Learn Agentic AI"
tags: ["Evaluation", "Spider Benchmark", "BIRD Benchmark", "Text-to-SQL", "Accuracy Testing"]
author: "CallSphere Team"
published: 2026-03-18T00:00:00.000Z
updated: 2026-05-07T05:16:05.728Z
---

# Text-to-SQL Evaluation: Spider, BIRD, and Custom Benchmarks for Accuracy Testing

> Understand how to evaluate text-to-SQL systems using the Spider and BIRD benchmarks, implement execution accuracy metrics, and build custom evaluation datasets for your specific database schema.

## Why Standard Evaluation Matters

Claiming your text-to-SQL system "works well" without rigorous evaluation is meaningless. Two systems that feel similar in casual testing might differ by 20% in accuracy on edge cases. Benchmarks give you objective measurements to compare models, track improvements, and identify weaknesses.

The text-to-SQL community has developed standardized benchmarks that test across hundreds of databases and thousands of question-query pairs. Understanding these benchmarks — and knowing when to build your own — is essential for production systems.

## The Spider Benchmark

Spider is the most widely used text-to-SQL benchmark. It contains 10,181 questions across 200 databases covering 138 domains. Questions are categorized by difficulty: easy (single table, no aggregation), medium (joins, grouping), hard (subqueries, set operations), and extra hard (nested queries, multiple conditions).

```mermaid
flowchart LR
    PR(["PR opened"])
    UNIT["Unit tests"]
    EVAL["Eval harness
PromptFoo or Braintrust"]
    GOLD[("Golden set
200 tagged cases")]
    JUDGE["LLM as judge
plus regex graders"]
    SCORE["Aggregate score
and per slice"]
    GATE{"Score regress
more than 2 percent?"}
    BLOCK(["Block merge"])
    MERGE(["Merge to main"])
    PR --> UNIT --> EVAL --> GOLD --> JUDGE --> SCORE --> GATE
    GATE -->|Yes| BLOCK
    GATE -->|No| MERGE
    style EVAL fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
    style MERGE fill:#059669,stroke:#047857,color:#fff
```

Key characteristics:

- **Cross-database evaluation** — the test set uses databases not seen during training
- **SQL complexity levels** — from simple SELECT to multi-level nested queries
- **Multiple valid SQL representations** — the same question might have several correct SQL formulations

```python
# Example Spider dataset entry
spider_example = {
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
    "difficulty": "easy",
}
```

## The BIRD Benchmark

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) addresses limitations in Spider by using real-world databases with messy data, requiring external knowledge, and including value-based questions.

Key differences from Spider:

- **Dirty data** — databases contain NULLs, inconsistent formats, and realistic noise
- **External knowledge** — some questions require understanding domain conventions (e.g., "fiscal year starts in April")
- **Larger databases** — tables with millions of rows where query efficiency matters
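
For contrast with the Spider entry above, here is an illustrative BIRD-style record. BIRD pairs each question with an evidence string carrying the external knowledge it depends on; the field names below follow that convention, but the database and values are invented for illustration.

```python
# Illustrative BIRD-style entry (database and values are made up;
# the "evidence" field carries the external knowledge the question needs)
bird_example = {
    "db_id": "retail_finance",
    "question": "What was total revenue in the last fiscal year?",
    "evidence": "The fiscal year starts in April and ends in March.",
    "SQL": (
        "SELECT SUM(amount) FROM transactions "
        "WHERE txn_date >= '2025-04-01' AND txn_date < '2026-04-01'"
    ),
    "difficulty": "moderate",
}
```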

## Evaluation Metrics

**Exact Match Accuracy (EM)** compares the predicted SQL string to the reference SQL. This is too strict — `SELECT name FROM users` and `SELECT users.name FROM users` are both correct but do not match.

**Execution Accuracy (EX)** runs both the predicted and reference SQL against the database and compares results. This is the standard metric because it correctly handles multiple valid SQL formulations.

```python
import sqlite3
from typing import Callable

def execution_accuracy(predicted_sql: str, reference_sql: str,
                       db_path: str) -> bool:
    """Check if predicted and reference SQL return the same results."""
    conn = sqlite3.connect(db_path)

    try:
        # Compare result rows as sets so row order doesn't matter
        # (note: this also collapses duplicate rows)
        pred_results = set(
            tuple(row) for row in conn.execute(predicted_sql).fetchall()
        )
        ref_results = set(
            tuple(row) for row in conn.execute(reference_sql).fetchall()
        )
        return pred_results == ref_results
    except Exception:
        return False
    finally:
        conn.close()

def evaluate_batch(test_cases: list[dict], model_fn: Callable[[str, str], str],
                   db_dir: str) -> dict:
    """Evaluate a text-to-SQL model on a batch of test cases."""
    results = {"total": 0, "correct": 0, "errors": 0, "by_difficulty": {}}

    for case in test_cases:
        results["total"] += 1
        db_path = f"{db_dir}/{case['db_id']}/{case['db_id']}.sqlite"

        try:
            predicted = model_fn(case["question"], db_path)
            is_correct = execution_accuracy(predicted, case["query"], db_path)

            if is_correct:
                results["correct"] += 1

            # Track by difficulty
            diff = case.get("difficulty", "unknown")
            if diff not in results["by_difficulty"]:
                results["by_difficulty"][diff] = {"total": 0, "correct": 0}
            results["by_difficulty"][diff]["total"] += 1
            if is_correct:
                results["by_difficulty"][diff]["correct"] += 1

        except Exception:
            # The model call itself failed; invalid predicted SQL is already
            # handled inside execution_accuracy and counted as incorrect
            results["errors"] += 1

    results["accuracy"] = results["correct"] / results["total"] if results["total"] > 0 else 0
    return results
```
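
Here is a minimal sketch of wiring the harness above to a model. `generate_sql` is a hypothetical stand-in for your actual text-to-SQL call, and `spider_dev.json` and `spider/database` are assumed paths to cases in the `db_id` / `question` / `query` format shown earlier.

```python
import json

def generate_sql(question: str, db_path: str) -> str:
    """Hypothetical model call: build a prompt from the schema at db_path,
    send it to your LLM, and return the generated SQL string."""
    raise NotImplementedError("plug in your text-to-SQL model here")

# Cases are expected in the db_id / question / query format shown above
with open("spider_dev.json") as f:
    test_cases = json.load(f)

report = evaluate_batch(test_cases, generate_sql, db_dir="spider/database")
print(f"Execution accuracy: {report['accuracy']:.1%}")
for difficulty, stats in report["by_difficulty"].items():
    print(f"  {difficulty}: {stats['correct']}/{stats['total']}")
```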

## Building a Custom Evaluation Dataset

Standard benchmarks tell you how your model performs in general. But production accuracy depends on your specific schema, your users' question patterns, and your data characteristics. Build a custom evaluation set.

```python
import json
import sqlite3
from dataclasses import dataclass, asdict

@dataclass
class EvalCase:
    question: str
    reference_sql: str
    difficulty: str  # easy, medium, hard
    category: str    # e.g., "aggregation", "join", "filter", "date_range"
    notes: str = ""  # Why this case is interesting

class EvalDatasetBuilder:
    """Build and manage a custom text-to-SQL evaluation dataset."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self.cases: list[EvalCase] = []

    def add_case(self, question: str, reference_sql: str,
                 difficulty: str, category: str, notes: str = ""):
        # Verify the reference SQL actually works
        conn = sqlite3.connect(self.db_path)
        try:
            conn.execute(reference_sql)
        except Exception as e:
            raise ValueError(
                f"Reference SQL is invalid: {e}\nSQL: {reference_sql}"
            )
        finally:
            conn.close()

        self.cases.append(EvalCase(
            question=question,
            reference_sql=reference_sql,
            difficulty=difficulty,
            category=category,
            notes=notes,
        ))

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump([asdict(c) for c in self.cases], f, indent=2)

    def load(self, path: str):
        with open(path) as f:
            self.cases = [EvalCase(**c) for c in json.load(f)]

    def summary(self) -> dict:
        from collections import Counter
        return {
            "total_cases": len(self.cases),
            "by_difficulty": dict(Counter(c.difficulty for c in self.cases)),
            "by_category": dict(Counter(c.category for c in self.cases)),
        }

# Build your dataset
builder = EvalDatasetBuilder("production_analytics.db")
builder.add_case(
    question="How many orders were placed in January 2026?",
    reference_sql="SELECT COUNT(*) FROM orders WHERE order_date >= '2026-01-01' AND order_date < '2026-02-01'",
    difficulty="easy",
    category="date_range",
    notes="Tests date range filtering with boundary conditions",
)
builder.add_case(
    question="Which product category has the highest average order value?",
    reference_sql="""
        SELECT p.category, AVG(oi.unit_price * oi.quantity) as avg_value
        FROM order_items oi
        JOIN products p ON oi.product_id = p.id
        GROUP BY p.category
        ORDER BY avg_value DESC
        LIMIT 1
    """,
    difficulty="medium",
    category="aggregation",
    notes="Requires JOIN and aggregation with sorting",
)
builder.save("eval_dataset.json")
```

## Recommended Evaluation Set Composition

For a production system, aim for at least 100 test cases with this distribution:

- **30% easy** — single-table filters, counts, simple aggregations
- **40% medium** — two-table JOINs, GROUP BY with HAVING, date ranges
- **20% hard** — three+ table JOINs, subqueries, window functions
- **10% adversarial** — ambiguous questions, questions with no valid answer, domain-specific terminology

Cover every table and relationship in your schema. Ensure you have cases for each common question pattern your users ask.
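
To keep the mix honest as the dataset grows, you can check the builder's `summary()` output against these targets. The sketch below assumes adversarial cases are tagged as their own difficulty level and that a five-point drift per bucket is acceptable; adjust both to your own policy.

```python
# Target mix from the guidance above; "adversarial" is treated here as a
# difficulty label, which is an assumption about how you tag such cases
TARGET_MIX = {"easy": 0.30, "medium": 0.40, "hard": 0.20, "adversarial": 0.10}
TOLERANCE = 0.05  # allow each bucket to drift by five percentage points

def check_composition(builder: EvalDatasetBuilder) -> list[str]:
    """Return warnings where the dataset drifts from the target mix."""
    stats = builder.summary()
    total = stats["total_cases"]
    warnings = []
    if total < 100:
        warnings.append(f"Only {total} cases; aim for at least 100.")
    for bucket, target in TARGET_MIX.items():
        actual = stats["by_difficulty"].get(bucket, 0) / total if total else 0.0
        if abs(actual - target) > TOLERANCE:
            warnings.append(f"{bucket}: {actual:.0%} of cases vs. target {target:.0%}")
    return warnings

for warning in check_composition(builder):
    print("WARNING:", warning)
```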

## FAQ

### How often should I re-evaluate my text-to-SQL system?

Re-evaluate whenever you change the model, update the schema, modify the prompt, or add new tables. At minimum, run your evaluation suite weekly in CI/CD. Schema changes are the most common cause of accuracy regression — a renamed column can silently break queries the model previously got right.
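
As a minimal sketch of that gate, in the spirit of the merge check in the diagram earlier: the script below compares a fresh run against a stored baseline and fails the build when accuracy drops by more than two points. The `baseline.json` file and the threshold are assumptions, not part of any standard tooling.

```python
import json
import sys

REGRESSION_THRESHOLD = 0.02  # block the change if accuracy drops by more than 2 points

def regression_gate(new_accuracy: float, baseline_path: str = "baseline.json") -> None:
    """Compare a fresh eval run against the stored baseline accuracy."""
    with open(baseline_path) as f:
        baseline = json.load(f)["accuracy"]

    if new_accuracy < baseline - REGRESSION_THRESHOLD:
        print(f"FAIL: accuracy {new_accuracy:.1%} vs. baseline {baseline:.1%}")
        sys.exit(1)

    # Record improvements so they become the new bar for future runs
    with open(baseline_path, "w") as f:
        json.dump({"accuracy": new_accuracy}, f)
    print(f"PASS: accuracy {new_accuracy:.1%} (baseline {baseline:.1%})")
```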

### Is 80% accuracy good enough for production?

It depends on the use case. For exploratory analytics where users can verify results, 80% is workable with good error messaging. For automated reporting or dashboards where results are consumed without review, you need 95%+ accuracy. Most production systems use error correction loops to bridge this gap.

### Can I use Spider or BIRD results to predict production accuracy?

Benchmark accuracy provides a ceiling estimate, not a prediction. Your production accuracy will typically be 5-15% lower than benchmark scores because real users ask messier questions, your schema has domain-specific quirks, and benchmark questions are carefully written to be unambiguous. Always supplement benchmarks with custom evaluation on your own data.

---

#Evaluation #SpiderBenchmark #BIRDBenchmark #TextToSQL #AccuracyTesting #AgenticAI #MLOps #BenchmarkDriven

---

Source: https://callsphere.ai/blog/text-to-sql-evaluation-spider-bird-benchmarks-accuracy-testing
