
LLM-as-Judge: Why Pairwise Evaluation Beats Reference-Based Scoring for Agents

Pairwise (A vs B) LLM-as-judge evaluation produces sharper, more reliable signal than absolute scoring for non-deterministic agent outputs. Here is why and how.

TL;DR

For non-deterministic agent outputs, pairwise LLM-as-judge evaluation — show the judge two candidates A and B, ask which is better — produces dramatically sharper signal than absolute scoring against a rubric or a reference answer. I've watched teams chase phantom 0.03 average-score improvements for months under absolute scoring, only to discover the judge model was randomly drifting; the same teams flipped to pairwise and saw real preferences emerge in a single afternoon. This post explains the statistical reason pairwise wins, the failure modes of reference-based scoring on agents, when to still use reference-based eval (it has its place), and how to actually wire pairwise into LangSmith with code you can run today.

The Core Problem: Agents Don't Have One Right Answer

Reference-based evaluation works when there is a golden output. "What is 17 * 23?" → 391. Easy. "Write a Python function that reverses a string" → assert reverse_string("hello") == "olleh" against the candidate's own function. Easy.

Now: "Help this customer who called in upset because their last invoice was higher than expected." There are at least a dozen acceptable responses — empathic acknowledgment first vs. solution first, offering a credit vs. explaining usage, escalating vs. resolving. None of them are "right." All of them are evaluable on dimensions like empathy, accuracy, latency, and resolution. Reference-based scoring has no model for this. Pairwise scoring does.

The deeper problem is that absolute scoring asks the judge to do something humans are bad at: assign calibrated numbers on a continuous scale. Ask 10 people to rate a coffee shop 1-10 and you'll get a mean around 7.4 with high variance. Ask the same 10 people to compare two coffee shops side-by-side and you'll get >85% agreement on which is better. LLM judges have the same property — and worse, their absolute calibration drifts when you swap to a new model version. Pairwise sidesteps both problems.

Reference-Based vs LLM-as-Judge vs Pairwise: The Spectrum

flowchart LR
  A[Agent output] --> B{Is there a golden answer?}
  B -->|Yes, narrow space| C[Reference-based<br/>exact match / embedding sim]
  B -->|No, open-ended| D{Single output or compare?}
  D -->|Single output| E[Absolute LLM-as-judge<br/>1-5 rubric score]
  D -->|Two candidates| F[Pairwise LLM-as-judge<br/>A vs B preference]
  C --> G[Score: 0/1 or 0-1 sim]
  E --> H[Score: noisy, miscalibrated]
  F --> I[Score: win-rate, low variance]
  H -->|drifts on model swap| J[Hard to compare across runs]
  I -->|stable across model swaps| K[Direct comparison of versions]

Three takeaways from this diagram. First, reference-based evaluation is still the right tool when the answer space is narrow — math problems, structured extraction, code with deterministic output, JSON schema validation. Don't throw it away. Second, absolute LLM-as-judge is the worst of both worlds for open-ended outputs: it pretends to give you a calibrated number while actually giving you a noisy one. Third, pairwise LLM-as-judge is what you reach for when comparing two agent versions, two prompts, or a candidate vs an incumbent — which is most of what you do day-to-day.

Why Pairwise Wins, Statistically

The cleanest way to see why pairwise wins is to think about what each method is asking the judge to estimate. Absolute scoring asks: "What is the true quality Q of this output, on a fixed scale, in absolute terms?" Pairwise asks: "Is output A better than output B?" The first is a regression problem with no anchor. The second is a binary classification problem with a clear decision boundary.
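
To make the anchor problem concrete, here is a toy simulation (all numbers invented for illustration): in absolute scoring, the baseline and the candidate are typically scored in different judge sessions, so session-level drift lands inside your delta; in pairwise scoring both candidates share one judge call, so the drift term cancels.

import numpy as np

rng = np.random.default_rng(0)
n = 400
q_base, q_cand = 3.40, 3.48    # true qualities on a 5-pt scale, candidate +0.08
noise = 0.5                    # per-rating judge noise

# Absolute: each version scored in its own session, each with its own drift.
drift_then, drift_now = rng.normal(0, 0.2, 2)
abs_base = q_base + drift_then + rng.normal(0, noise, n)
abs_cand = q_cand + drift_now + rng.normal(0, noise, n)
print(f"absolute delta: {abs_cand.mean() - abs_base.mean():+.3f} (true: +0.080)")

# Pairwise: both candidates inside the same judge call, shared drift cancels.
shared = rng.normal(0, 0.2, n)
pair_base = q_base + shared + rng.normal(0, noise, n)
pair_cand = q_cand + shared + rng.normal(0, noise, n)
print(f"pairwise win rate: {(pair_cand > pair_base).mean():.1%} (null: 50.0%)")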

Three concrete failure modes of absolute scoring you can stop running into immediately:

  1. Scale compression. LLM judges asked for 1-5 ratings cluster outputs in the 3-4 band, with vanishingly few 1s and 5s. Effective dynamic range collapses to ~1.5 points, and the noise floor (run-to-run variance on the same input) is often 0.4-0.6. Your "improvement" is rarely above the noise.
  2. Cross-version drift. Run the same evaluator with GPT-4o today and GPT-4o-2025-08 next month and absolute scores shift by 0.2-0.4 points across the board. You can't tell whether your agent improved or the judge changed. Pairwise is much more stable because both candidates are scored by the same judge in the same call.
  3. Position and verbosity bias. LLM judges have well-documented biases — they prefer the first option, prefer longer responses, prefer responses that flatter the user. In absolute scoring these biases are baked into the score. In pairwise you can mitigate them with position swapping (run A-vs-B and B-vs-A, only count agreed wins) and length-normalized rubrics.

A practical comparison from a real CallSphere experiment last quarter:

| Metric | Absolute LLM-judge | Pairwise LLM-judge |
|---|---|---|
| Run-to-run variance (same data) | ±0.31 (on 5-pt scale) | ±4.2% win rate |
| Effective dynamic range | 1.5 points (compressed) | 0-100% win rate |
| Significance of 5% prompt change | Not detectable | p < 0.01, n=400 |
| Cost per 800-example run (GPT-4o) | ~$3.20 | ~$4.80 |
| Stable across judge model swap | No (Δ ~0.25) | Yes (Δ ~2% win rate) |

The 50% cost overhead of pairwise is real but trivial relative to the signal it produces.

When You Should Still Use Reference-Based

Pairwise is not a universal hammer. Reference-based evaluation remains the right call when:

  • The output space is structured. JSON conformance, function-call argument correctness, SQL query equivalence, classification labels. Just write the assert (see the sketch after this list).
  • Latency or cost dominates. Heuristic checks run in milliseconds and cost zero. If a check is reliably automatable, automate it; don't burn judge tokens on it.
  • Regulatory traceability matters. "Our system passed 1,240 of 1,250 reference test cases" is a sentence auditors understand. "Our system has a 73% pairwise win rate against the previous version" is not.
  • You're doing red-team or safety eval. "Did the agent produce a banned phrase?" is a hard constraint, not a preference comparison.
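
For the structured cases the evaluator really can be a few lines. A minimal sketch (the field names intent and account_id are hypothetical, and the 1.0/0.0 scoring convention is chosen to match LangSmith-style evaluator scores):

import json

# Reference-based check for structured extraction: JSON conformance plus
# exact match on the fields that matter. No judge tokens involved.
def check_extraction(output_text: str, expected: dict) -> float:
    try:
        parsed = json.loads(output_text)
    except json.JSONDecodeError:
        return 0.0                                  # not even valid JSON
    return 1.0 if all(parsed.get(k) == v for k, v in expected.items()) else 0.0

assert check_extraction('{"intent": "refund", "account_id": "A-17"}',
                        {"intent": "refund", "account_id": "A-17"}) == 1.0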

The mature stack uses both. In CallSphere's eval suite, roughly 40% of evaluators are reference-based or heuristic (the structural and safety ones), and 50% are pairwise LLM-as-judge (the quality ones). The remaining 10% is single-output absolute scoring, used only where pairwise is awkward — like rating a single trace for a known antipattern that doesn't have a meaningful "alternative" to compare against.

How to Wire Pairwise Eval into LangSmith

LangSmith's evaluate_comparative API is built specifically for this. You hand it two experiments (typically your candidate vs your baseline, both already run on the same dataset), define a pairwise evaluator function, and it returns a per-example preference plus aggregate win rates.

from langsmith import Client
from langsmith.evaluation import evaluate, evaluate_comparative
from openai import OpenAI
import json, random

client = Client()
oai = OpenAI()

def pairwise_judge(runs: list, example) -> dict:
    """Compare two candidate runs. Position-swap to mitigate bias."""
    a, b = runs[0], runs[1]
    swap = random.random() < 0.5
    first, second = (b, a) if swap else (a, b)

    prompt = f"""You are an evaluator for customer-support agent replies.
Pick the BETTER reply on (1) factual correctness, (2) empathy, (3) action-clarity.
Ignore length unless one is clearly padded.

User asked: {example.inputs['user_query']}

Reply A: {first.outputs['final_answer']}

Reply B: {second.outputs['final_answer']}

Return JSON: {{"winner": "A" | "B" | "tie", "reason": str}}"""

    resp = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    parsed = json.loads(resp.choices[0].message.content)
    winner = parsed["winner"]
    if swap and winner in ("A", "B"):
        winner = "B" if winner == "A" else "A"

    # Score runs[0] (the candidate, by convention: it is listed first below)
    candidate_score = 1.0 if winner == "A" else 0.0 if winner == "B" else 0.5
    return {
        "key": "pairwise_quality",
        # evaluate_comparative expects scores keyed by run id
        "scores": {a.id: candidate_score, b.id: 1.0 - candidate_score},
        "comment": parsed["reason"],
    }

# Run both experiments on the same dataset first. baseline_agent and
# candidate_agent are your target functions (dataset inputs in,
# {"final_answer": ...} out).
exp_baseline = evaluate(
    baseline_agent, data="support-agent-eval-v3",
    experiment_prefix="baseline-prompt-v16",
)
exp_candidate = evaluate(
    candidate_agent, data="support-agent-eval-v3",
    experiment_prefix="candidate-prompt-v17",
)

# Then run the pairwise comparison
comp = evaluate_comparative(
    [exp_candidate.experiment_name, exp_baseline.experiment_name],
    evaluators=[pairwise_judge],
    max_concurrency=8,
)
print(f"Candidate win rate: {comp.aggregate_score('pairwise_quality'):.1%}")

A few production-grade details that aren't obvious from the API surface:

  • Position swap is non-negotiable. Without it you'll see a 4-8% bias toward whichever candidate gets shown first. The code above swaps with 50% probability and inverts the winner, which produces unbiased preferences in expectation; the stricter variant sketched after this list (and shown in the loop diagram below) runs every pair in both orders and counts disagreements as ties.
  • Temperature 0 on the judge. Judges should be as deterministic as possible. Save the creativity for the agent under test, not the grader.
  • Run the comparison as a separate LangSmith run-type. That way the pairwise score lives next to the absolute scores in the experiment-diff UI and your team can see both views.
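
The both-orders variant, as a sketch (judge_once is a hypothetical helper that shows its first argument in position A and returns "first", "second", or "tie"; it is not part of the LangSmith API):

# Strict pairwise judging: a pair counts as a win only when the judge prefers
# the same candidate in both presentation orders; disagreements become ties.
# Double the judge cost per pair, noticeably lower variance.
def both_orders_judge(runs: list, example) -> dict:
    a, b = runs[0], runs[1]
    # judge_once: hypothetical single-order judge (see lead-in above)
    v_ab = judge_once(a, b, example)   # a shown first
    v_ba = judge_once(b, a, example)   # b shown first

    if v_ab == "first" and v_ba == "second":
        score_a = 1.0                  # agreed win for a
    elif v_ab == "second" and v_ba == "first":
        score_a = 0.0                  # agreed win for b
    else:
        score_a = 0.5                  # tie, or order-dependent disagreement
    return {
        "key": "pairwise_quality_strict",
        "scores": {a.id: score_a, b.id: 1.0 - score_a},
        "comment": f"A-first: {v_ab}; B-first: {v_ba}",
    }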

In TypeScript, the same pattern with the JS SDK:

import { evaluateComparative } from "langsmith/evaluation";
import OpenAI from "openai";

const oai = new OpenAI();

const pairwiseJudge = async ({ runs, example }: any) => {
  const [a, b] = runs;
  const swap = Math.random() < 0.5;
  const [first, second] = swap ? [b, a] : [a, b];

  const resp = await oai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [{
      role: "user",
      content: `Pick the better reply on correctness, empathy, action-clarity.
User: ${example.inputs.user_query}
A: ${first.outputs.final_answer}
B: ${second.outputs.final_answer}
JSON: {"winner":"A"|"B"|"tie","reason":string}`,
    }],
  });

  const parsed = JSON.parse(resp.choices[0].message.content!);
  let winner = parsed.winner;
  if (swap && winner !== "tie") winner = winner === "A" ? "B" : "A";

  const candidateScore = winner === "A" ? 1 : winner === "B" ? 0 : 0.5;
  return {
    key: "pairwise_quality",
    // scores keyed by run id, mirroring the Python comparative evaluator shape
    scores: { [a.id]: candidateScore, [b.id]: 1 - candidateScore },
    comment: parsed.reason,
  };
};

const comp = await evaluateComparative(
  ["candidate-prompt-v17", "baseline-prompt-v16"],
  { evaluators: [pairwiseJudge], maxConcurrency: 8 }
);

The Pairwise Eval Loop, Visualized

graph TD
  A[Candidate agent vN+1] --> C[Run on dataset]
  B[Baseline agent vN] --> C
  C --> D[Two experiment objects]
  D --> E[Pairwise judge call]
  E --> E1[Position A first]
  E --> E2[Position B first]
  E1 --> F{Agree?}
  E2 --> F
  F -->|Yes| G[Confident preference]
  F -->|No| H[Mark as tie]
  G --> I[Aggregate win rate]
  H --> I
  I --> J{Win rate >= 55%?}
  J -->|Yes, n>=400| K[Ship candidate]
  J -->|No| L[Iterate prompt/model]
  L --> A

The 55% / n=400 threshold is a rule of thumb, not gospel. Statistically a 55% win rate on 400 paired observations is significant at p < 0.05 against the null of 50/50; tighten or loosen based on how risk-tolerant your deploy is. Safety-critical changes I want at 60%+ on n=800. Cosmetic prompt tweaks I'll ship at 53% on n=200 if it costs nothing.
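
If you'd rather compute the check than memorize it, an exact binomial test does the job (a sketch; splitting ties evenly between the two sides is one common convention):

from scipy.stats import binomtest

# Two-sided exact test of a pairwise win rate against the 50/50 null.
def win_rate_pvalue(wins: int, losses: int, ties: int = 0) -> float:
    n = wins + losses + ties
    effective_wins = round(wins + ties / 2)   # split ties evenly
    return binomtest(effective_wins, n, p=0.5).pvalue

print(win_rate_pvalue(wins=220, losses=180))  # 55% of 400 pairs -> p ~= 0.05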


Pitfalls That Will Bite You

Three traps even senior teams fall into:

  • Self-preference bias. If the judge model is the same family as the agent under test, the judge slightly prefers outputs from its own family. Mitigation: use a judge from a different provider (GPT judges Claude agents, Claude judges GPT agents) for cross-family comparisons.
  • Distribution drift in the dataset. Pairwise wins on dataset v3 don't transfer to dataset v4 if v4 has a different intent distribution. Always re-baseline when the dataset changes; don't compare experiments across dataset versions.
  • Over-reliance on win rate alone. A 55% win rate could mean "candidate is uniformly slightly better" or "candidate is much better on 30% of cases and slightly worse on 70%." Always inspect per-slice win rates, as in the sketch after this list. The slice view is where production regressions hide.
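
Slice inspection is a one-liner once per-pair scores are exported (a sketch; the intent labels and scores are made up, with 1 = candidate win, 0.5 = tie, 0 = loss):

import pandas as pd

# A healthy-looking overall win rate can hide a losing slice: here the overall
# mean is 0.5, but "billing" is winning while "cancel" is losing badly.
df = pd.DataFrame({
    "intent": ["billing", "billing", "billing", "cancel", "cancel"],
    "candidate_score": [1.0, 1.0, 0.5, 0.0, 0.0],
})
print(df.groupby("intent")["candidate_score"].agg(["mean", "count"]))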

Where Pairwise Fits in the Broader Stack

Pairwise LLM-as-judge is one stage in the full agent eval stack — instrument, trace, dataset, evaluator, score, CI gate — described in the agent evaluation stack post. It's the highest-signal single evaluator most teams have, but it's not the whole picture. You still want heuristic gates for hard constraints, reference-based for structured outputs, and human review for the calibration set that keeps the LLM judge honest. The art is composing them.

How CallSphere Uses Pairwise Eval

Across our voice and chat agents, pairwise LLM-as-judge is the primary metric on every prompt-change PR. Each vertical has its own dataset of 400-1,200 paired comparisons against the prior production version. We run position-swapped pairs with GPT-4o as judge for healthcare and real-estate intents (where empathy matters most), and Claude as judge for technical IT helpdesk intents (where reasoning depth matters most). The cross-family judge choice came from a quarter-long calibration study: GPT-4o agreed with humans 87% on empathy-heavy intents, Claude agreed 89% on reasoning-heavy intents. We rotate judges quarterly to catch judge drift.

FAQ

Q: Is pairwise eval just RLHF reward modeling? The judging signal is similar — both are preference comparisons — but pairwise eval is for offline experiment scoring, while reward modeling trains a model. Same input shape, different downstream use. You can absolutely train a small reward model on the pairwise data you collect.

Q: How many pairs do I need for a credible result? Rule of thumb: at 55% win rate, you need ~400 paired comparisons for p<0.05. At 60% win rate, 100 is enough. At 51-52%, 1500+. If you're chasing margins under 53%, the eval is probably noise.
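
The arithmetic behind those numbers, under a normal approximation (a sketch, two-sided alpha = 0.05):

import math

# Pairs needed for win rate w to clear a two-sided z-test at alpha = 0.05:
# n ~ (z * 0.5 / (w - 0.5))^2, using the binomial SE sqrt(0.25 / n) under the null.
def pairs_needed(win_rate: float, z: float = 1.96) -> int:
    return math.ceil((z * 0.5 / (win_rate - 0.5)) ** 2)

print(pairs_needed(0.55))  # ~385, matching the ~400 rule of thumb
print(pairs_needed(0.60))  # ~97
print(pairs_needed(0.52))  # ~2401 -- sub-53% margins get expensive fast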

Q: Should I use a frontier judge or a cheap one? Cheap judges (GPT-4o-mini, Claude Haiku) agree with humans roughly 70-80% as often as frontier judges do on simple rubrics, at 5-10x lower cost. For PR-blocking eval, use frontier. For nightly bulk re-eval, cheap is fine. Always calibrate against humans first.

Q: What about MT-Bench / Chatbot Arena style multi-turn pairwise? Same principle, more scaffolding. You wrap the entire conversation, not just one reply. LangSmith supports this — pass the full message thread as the judge input. Arena-style ELO ratings are useful when you have many candidates; for two-candidate A/B, raw win rate is simpler.
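
For the many-candidate case, the ELO-style update is a few lines (a sketch; K = 32 is an arbitrary but common choice):

# ELO-style rating update from one pairwise result (1 = A won, 0 = B won,
# 0.5 = tie). With many candidates, run this over all judged pairs and rank.
def elo_update(r_a: float, r_b: float, outcome_a: float, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (outcome_a - expected_a)
    return r_a + delta, r_b - delta

print(elo_update(1500.0, 1500.0, 1.0))  # winner gains 16 points at K=32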

Q: Can I skip the judge entirely with embedding similarity? Embedding similarity is a reference-based metric in disguise — it requires a reference. It's also surprisingly weak on agents because it scores surface-level lexical overlap, not correctness or empathy. Use it for retrieval relevance, not for agent quality.

Bottom Line

Stop optimizing absolute LLM-judge scores you can't trust. Switch to pairwise. Position-swap. Use a cross-family judge. Calibrate against humans monthly. The signal-to-noise ratio of your eval program will go up by an order of magnitude in the first week — and your team will stop arguing about whether 0.78 is meaningfully better than 0.76.
