By Sagar Shankaran, Founder of CallSphere
Pairwise (A vs B) LLM-as-judge evaluation produces sharper, more reliable signal than absolute scoring for non-deterministic agent outputs. Here is why and how.
Key takeaways
For non-deterministic agent outputs, pairwise LLM-as-judge evaluation — show the judge two candidates A and B, ask which is better — produces dramatically sharper signal than absolute scoring against a rubric or a reference answer. I've watched teams chase phantom 0.03 average-score improvements for months under absolute scoring, only to discover the judge model was randomly drifting; the same teams flipped to pairwise and saw real preferences emerge in a single afternoon. This post explains the statistical reason pairwise wins, the failure modes of reference-based scoring on agents, when to still use reference-based eval (it has its place), and how to actually wire pairwise into LangSmith with code you can run today.
Reference-based evaluation works when there is a golden output. "What is 17 * 23?" → 391. Easy. "Write a Python function that reverses a string" → assert reverse_string("hello") == "olleh". Easy.
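To make the reference-based case concrete, here is a minimal sketch of those two checks in plain Python. The names `reverse_string` and `exact_match` are hypothetical, picked only for illustration; a real suite would likely add more normalization.

```python
def reverse_string(s: str) -> str:
    """Candidate implementation under test (illustrative)."""
    return s[::-1]


def exact_match(output: str, reference: str) -> float:
    """Reference-based score: 1.0 if the whitespace-trimmed strings match, else 0.0."""
    return 1.0 if output.strip() == reference.strip() else 0.0


# A golden answer exists for both tasks, so scoring is a direct comparison.
math_score = exact_match(str(17 * 23), "391")
code_score = exact_match(reverse_string("hello"), "olleh")
print(math_score, code_score)
```

Nothing here needs a judge model, which is exactly why this style of check only works when the answer space is narrow.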
Now: "Help this customer who called in upset because their last invoice was higher than expected." There are at least a dozen acceptable responses — empathic acknowledgment first vs. solution first, offering a credit vs. explaining usage, escalating vs. resolving. None of them are "right." All of them are evaluable on dimensions like empathy, accuracy, latency, and resolution. Reference-based scoring has no model for this. Pairwise scoring does.
The deeper problem is that absolute scoring asks the judge to do something humans are bad at: assign calibrated numbers on a continuous scale. Ask 10 people to rate a coffee shop 1-10 and you'll get a mean around 7.4 with high variance. Ask the same 10 people to compare two coffee shops side-by-side and you'll get >85% agreement on which is better. LLM judges have the same property — and worse, their absolute calibration drifts when you swap to a new model version. Pairwise sidesteps both problems.
```mermaid
flowchart LR
    A[Agent output] --> B{Is there a golden answer?}
    B -->|Yes, narrow space| C[Reference-based<br/>exact match / embedding sim]
    B -->|No, open-ended| D{Single output or compare?}
    D -->|Single output| E[Absolute LLM-as-judge<br/>1-5 rubric score]
    D -->|Two candidates| F[Pairwise LLM-as-judge<br/>A vs B preference]
    C --> G[Score: 0/1 or 0-1 sim]
    E --> H[Score: noisy, miscalibrated]
    F --> I[Score: win-rate, low variance]
    H -->|drifts on model swap| J[Hard to compare across runs]
    I -->|stable across model swaps| K[Direct comparison of versions]
```
Three takeaways from this diagram. First, reference-based evaluation is still the right tool when the answer space is narrow — math problems, structured extraction, code with deterministic output, JSON schema validation. Don't throw it away. Second, absolute LLM-as-judge is the worst of both worlds for open-ended outputs: it pretends to give you a calibrated number while actually giving you a noisy one. Third, pairwise LLM-as-judge is what you reach for when comparing two agent versions, two prompts, or a candidate vs an incumbent — which is most of what you do day-to-day.
The cleanest way to see why pairwise wins is to think about what each method is asking the judge to estimate. Absolute scoring asks: "What is the true quality Q of this output, on a fixed scale, in absolute terms?" Pairwise asks: "Is output A better than output B?" The first is a regression problem with no anchor. The second is a binary classification problem with a clear decision boundary.
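A toy simulation can illustrate the anchoring argument. It is a sketch under loud assumptions, not a measurement: the judge is modeled as latent quality plus a run-level calibration shift plus Gaussian noise, and every constant below (`TRUE_GAIN`, `JUDGE_NOISE`, `DRIFT`, `N`) is made up. The point is only that a shift shared by both outputs inside one pairwise call cancels, while a shift between two absolute-scoring runs does not.

```python
import random

random.seed(0)

N = 2000            # number of eval examples (assumed)
TRUE_GAIN = 0.2     # candidate is truly better by this much (assumed)
JUDGE_NOISE = 0.5   # per-call judge noise, standard deviation (assumed)
DRIFT = -0.4        # calibration shift between the two absolute runs (assumed)


def absolute_score(quality: float, calibration_shift: float) -> float:
    """One absolute judge call: latent quality + run-level shift + noise."""
    return quality + calibration_shift + random.gauss(0, JUDGE_NOISE)


def pairwise_prefers_first(q_first: float, q_second: float, calibration_shift: float) -> bool:
    """One pairwise judge call: both outputs share the same shift, so it cancels."""
    seen_first = q_first + calibration_shift + random.gauss(0, JUDGE_NOISE)
    seen_second = q_second + calibration_shift + random.gauss(0, JUDGE_NOISE)
    return seen_first > seen_second


baseline_quality = [random.gauss(3.0, 0.5) for _ in range(N)]
candidate_quality = [q + TRUE_GAIN for q in baseline_quality]

# Absolute scoring: baseline scored before the drift, candidate scored after it.
baseline_mean = sum(absolute_score(q, 0.0) for q in baseline_quality) / N
candidate_mean = sum(absolute_score(q, DRIFT) for q in candidate_quality) / N
absolute_gap = candidate_mean - baseline_mean

# Pairwise scoring: each comparison happens inside one (drifted) judge call.
wins = sum(
    pairwise_prefers_first(c, b, DRIFT)
    for c, b in zip(candidate_quality, baseline_quality)
)
win_rate = wins / N

print(f"absolute gap: {absolute_gap:+.2f}   pairwise win rate: {win_rate:.1%}")
```

Under these assumptions the absolute gap comes out negative even though the candidate is better, while the pairwise win rate stays above 50%. Real judges are messier than this model, so treat it as intuition rather than proof.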
Three concrete failure modes of absolute scoring you can stop running into immediately:
A practical comparison from a real CallSphere experiment last quarter:
| Metric | Absolute LLM-judge | Pairwise LLM-judge |
|---|---|---|
| Run-to-run variance (same data) | ±0.31 (on 5-pt scale) | ±4.2% win rate |
| Effective dynamic range | 1.5 points (compressed) | 0-100% win rate |
| Significance of 5% prompt change | Not detectable | p < 0.01, n=400 |
| Cost per 800-example run (GPT-4o) | ~$3.20 | ~$4.80 |
| Stable across judge model swap | No (Δ ~0.25) | Yes (Δ ~2% win rate) |
The 50% cost overhead of pairwise is real but trivial relative to the signal it produces.
Pairwise is not a universal hammer. Reference-based evaluation remains the right call when:
The mature stack uses both. In CallSphere's eval suite, roughly 40% of evaluators are reference-based or heuristic (the structural and safety ones), and 50% are pairwise LLM-as-judge (the quality ones). The remaining 10% is single-output absolute scoring, used only where pairwise is awkward — like rating a single trace for a known antipattern that doesn't have a meaningful "alternative" to compare against.
LangSmith's evaluate_comparative API is built specifically for this. You hand it two experiments (typically your candidate vs your baseline, both already run on the same dataset), define a pairwise evaluator function, and it returns a per-example preference plus aggregate win rates.
```python
import json
import random

from langsmith import Client
from langsmith.evaluation import evaluate, evaluate_comparative
from openai import OpenAI

client = Client()
oai = OpenAI()


def pairwise_judge(runs: list, example) -> dict:
    """Compare two candidate runs. Position-swap to mitigate bias."""
    a, b = runs[0], runs[1]
    # Randomize which run the judge sees first.
    swap = random.random() < 0.5
    first, second = (b, a) if swap else (a, b)
    prompt = f"""You are an evaluator for customer-support agent replies.
Pick the BETTER reply on (1) factual correctness, (2) empathy, (3) action-clarity.
Ignore length unless one is clearly padded.
User asked: {example.inputs['user_query']}
Reply A: {first.outputs['final_answer']}
Reply B: {second.outputs['final_answer']}
Return JSON: {{"winner": "A" | "B" | "tie", "reason": str}}"""
    resp = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    parsed = json.loads(resp.choices[0].message.content)
    winner = parsed["winner"]
    # Map the verdict back to the original (unswapped) run order.
    if swap and winner in ("A", "B"):
        winner = "B" if winner == "A" else "A"
    # Score for the FIRST run (the candidate, by convention)
    candidate_score = 1.0 if winner == "A" else 0.0 if winner == "B" else 0.5
    return {
        "key": "pairwise_quality",
        "scores": [candidate_score, 1.0 - candidate_score if winner != "tie" else 0.5],
        "comment": parsed["reason"],
    }


# Run experiments first.
# baseline_agent and candidate_agent are your own target functions, defined elsewhere.
exp_baseline = evaluate(
    baseline_agent, data="support-agent-eval-v3",
    experiment_prefix="baseline-prompt-v16",
)
exp_candidate = evaluate(
    candidate_agent, data="support-agent-eval-v3",
    experiment_prefix="candidate-prompt-v17",
)

# Then run the pairwise comparison (candidate listed first, matching the score order).
comp = evaluate_comparative(
    [exp_candidate.experiment_name, exp_baseline.experiment_name],
    evaluators=[pairwise_judge],
    max_concurrency=8,
)

# Assumes the result object exposes aggregate_score(); if your langsmith version
# does not, read the win rate from the comparison view in the LangSmith UI.
print(f"Candidate win rate: {comp.aggregate_score('pairwise_quality'):.1%}")
```
A few production-grade details that aren't obvious from the API surface:
In TypeScript, the same pattern with the JS SDK:
```typescript
import { Client } from "langsmith";
import { evaluate, evaluateComparative } from "langsmith/evaluation";
import OpenAI from "openai";

const ls = new Client();
const oai = new OpenAI();

const pairwiseJudge = async ({ runs, example }: any) => {
  const [a, b] = runs;
  const swap = Math.random() < 0.5;
  const [first, second] = swap ? [b, a] : [a, b];
  const resp = await oai.chat.completions.create({
    model: "gpt-4o",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [{
      role: "user",
      content: `Pick the better reply on correctness, empathy, action-clarity.
User: ${example.inputs.user_query}
A: ${first.outputs.final_answer}
B: ${second.outputs.final_answer}
JSON: {"winner":"A"|"B"|"tie","reason":string}`,
    }],
  });
  const parsed = JSON.parse(resp.choices[0].message.content!);
  let winner = parsed.winner;
  if (swap && winner !== "tie") winner = winner === "A" ? "B" : "A";
  const candidateScore = winner === "A" ? 1 : winner === "B" ? 0 : 0.5;
  return {
    key: "pairwise_quality",
    scores: [candidateScore, winner === "tie" ? 0.5 : 1 - candidateScore],
    comment: parsed.reason,
  };
};

const comp = await evaluateComparative(
  ["candidate-prompt-v17", "baseline-prompt-v16"],
  { evaluators: [pairwiseJudge], maxConcurrency: 8 }
);
```
```mermaid
graph TD
    A[Candidate agent vN+1] --> C[Run on dataset]
    B[Baseline agent vN] --> C
    C --> D[Two experiment objects]
    D --> E[Pairwise judge call]
    E --> E1[Position A first]
    E --> E2[Position B first]
    E1 --> F{Agree?}
    E2 --> F
    F -->|Yes| G[Confident preference]
    F -->|No| H[Mark as tie]
    G --> I[Aggregate win rate]
    H --> I
    I --> J{Win rate >= 55%?}
    J -->|Yes, n>=400| K[Ship candidate]
    J -->|No| L[Iterate prompt/model]
    L --> A
```
The 55% / n=400 threshold is a rule of thumb, not gospel. A 55% win rate on 400 paired observations sits right at the edge of significance against the null of 50/50: z = 2.0, so p ≈ 0.046 two-sided under the normal approximation, and an exact binomial test comes out at roughly 0.05. Tighten or loosen based on how risk-tolerant your deploy is. Safety-critical changes I want at 60%+ on n=800. Cosmetic prompt tweaks I'll ship at 53% on n=200 if it costs nothing.
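The flow in the diagram (judge both orderings, keep a preference only when they agree, then gate on win rate) can be sketched in plain Python. This is a minimal sketch, not LangSmith API: `consistent_verdict`, `binomial_two_sided_p`, `ship_decision`, and the two stub judges are hypothetical names. It also assumes ties count as half a win for the win rate and that significance is an exact sign test on decisive pairs only, which is one reasonable choice among several.

```python
from math import comb
from typing import Callable


def consistent_verdict(judge: Callable[[str, str], str], candidate: str, baseline: str) -> str:
    """Judge both orderings; keep a preference only when the two calls agree."""
    candidate_first = judge(candidate, baseline)   # candidate is shown as reply A
    candidate_second = judge(baseline, candidate)  # candidate is shown as reply B
    if candidate_first == "A" and candidate_second == "B":
        return "candidate"
    if candidate_first == "B" and candidate_second == "A":
        return "baseline"
    return "tie"  # explicit tie, or the verdict flipped with position


def binomial_two_sided_p(wins: int, decisive: int) -> float:
    """Exact two-sided p-value for `wins` out of `decisive` pairs under a 50/50 null."""
    k = max(wins, decisive - wins)
    tail = sum(comb(decisive, i) for i in range(k, decisive + 1)) / 2 ** decisive
    return min(1.0, 2 * tail)


def ship_decision(wins: int, losses: int, ties: int,
                  min_win_rate: float = 0.55, alpha: float = 0.05) -> bool:
    """Gate: win rate (ties count as half) must clear the bar and be significant."""
    total = wins + losses + ties
    if total == 0:
        return False
    win_rate = (wins + 0.5 * ties) / total
    return win_rate >= min_win_rate and binomial_two_sided_p(wins, wins + losses) < alpha


# Stub judges standing in for an LLM call (hypothetical, for illustration only).
def prefers_credit(reply_a: str, reply_b: str) -> str:
    a, b = "credit" in reply_a, "credit" in reply_b
    return "tie" if a == b else ("A" if a else "B")


def always_first(reply_a: str, reply_b: str) -> str:
    return "A"  # pure position bias


print(consistent_verdict(prefers_credit, "I've applied a credit.", "Please check your usage."))
print(consistent_verdict(always_first, "I've applied a credit.", "Please check your usage."))
print(ship_decision(wins=240, losses=160, ties=0))
```

The position-biased stub collapses to a tie, which is the behavior the diagram asks for: a verdict that flips with ordering carries no preference signal. Judging both orderings doubles judge cost compared with a single randomized swap.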
Three traps even senior teams fall into:
Pairwise LLM-as-judge is one stage in the full agent eval stack — instrument, trace, dataset, evaluator, score, CI gate — described in the agent evaluation stack post. It's the highest-signal single evaluator most teams have, but it's not the whole picture. You still want heuristic gates for hard constraints, reference-based for structured outputs, and human review for the calibration set that keeps the LLM judge honest. The art is composing them.
Across our voice and chat agents, pairwise LLM-as-judge is the primary metric on every prompt-change PR. Each vertical has its own dataset of 400-1,200 paired comparisons against the prior production version. We run position-swapped pairs with GPT-4o as judge for healthcare and real-estate intents (where empathy matters most), and Claude as judge for technical IT helpdesk intents (where reasoning depth matters most). The cross-family judge choice came from a quarter-long calibration study: GPT-4o agreed with humans 87% on empathy-heavy intents, Claude agreed 89% on reasoning-heavy intents. We rotate judges quarterly to catch judge drift.
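Judge-versus-human agreement of the kind quoted above is simple to compute once you have a labeled calibration set. The following is a minimal sketch assuming each row carries an intent category, the judge's verdict, and a human verdict; `agreement_by_intent` and the sample rows are hypothetical and are not data from the calibration study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def agreement_by_intent(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Fraction of pairs where the judge verdict equals the human verdict, per intent.

    Each row is (intent_category, judge_verdict, human_verdict).
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for intent, judge, human in rows:
        totals[intent] += 1
        hits[intent] += int(judge == human)
    return {intent: hits[intent] / totals[intent] for intent in totals}


# Made-up labels for illustration; a real calibration set is far larger.
calibration_rows = [
    ("empathy", "A", "A"),
    ("empathy", "B", "B"),
    ("empathy", "A", "B"),
    ("empathy", "tie", "tie"),
    ("reasoning", "B", "B"),
    ("reasoning", "A", "A"),
]
rates = agreement_by_intent(calibration_rows)
print(rates)
```

Running the same function once per judge model gives a per-intent table you can use to pick a judge for each vertical.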
Q: Is pairwise eval just RLHF reward modeling? The judging signal is similar — both are preference comparisons — but pairwise eval is for offline experiment scoring, while reward modeling trains a model. Same input shape, different downstream use. You can absolutely train a small reward model on the pairwise data you collect.
Q: How many pairs do I need for a credible result? Rule of thumb, using a two-sided normal approximation at p<0.05: at a 55% win rate you need ~400 paired comparisons. At 60%, about 100 is enough. At 52% you need roughly 2,400, and at 51% roughly 9,600. If you're chasing margins under 53%, the eval is probably noise.
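The arithmetic behind that rule of thumb fits in a few lines. This sketch assumes a two-sided z-test against a 50/50 null with ties excluded, so the z-score of an observed win rate p on n pairs is 2(p − 0.5)√n; `pairs_needed` is a hypothetical helper name. An exact binomial test is slightly more conservative near the boundary.

```python
from math import ceil
from statistics import NormalDist


def pairs_needed(win_rate: float, alpha: float = 0.05) -> int:
    """Smallest n at which an observed win rate is significant (normal approximation).

    Solves 2 * |win_rate - 0.5| * sqrt(n) >= z_(1 - alpha/2) for n.
    """
    if win_rate == 0.5:
        raise ValueError("a 50% win rate is never distinguishable from the null")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z / (2 * abs(win_rate - 0.5))) ** 2)


for rate in (0.60, 0.55, 0.52, 0.51):
    print(f"{rate:.0%} win rate -> {pairs_needed(rate)} pairs")
```

The required sample grows with the inverse square of the margin, which is why margins of one or two points are rarely worth chasing.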
Q: Should I use a frontier judge or a cheap one? Cheap judges (GPT-4o-mini, Claude Haiku) reach roughly 70-80% of the human-agreement rate of frontier models on simple rubrics, at 5-10x lower cost. For PR-blocking eval, use frontier. For nightly bulk re-eval, cheap is fine. Always calibrate against humans first.
Q: What about MT-Bench / Chatbot Arena style multi-turn pairwise? Same principle, more scaffolding. You wrap the entire conversation, not just one reply. LangSmith supports this: pass the full message thread as the judge input. Arena-style Elo ratings are useful when you have many candidates; for two-candidate A/B, raw win rate is simpler.
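For the many-candidates case, the standard Elo update is enough to turn a stream of pairwise verdicts into a ranking. This is a minimal sketch: the K-factor of 32, the starting rating of 1000, and the prompt version names are assumptions, and arena-style leaderboards typically use more careful fitting than sequential updates.

```python
from typing import Dict, List, Tuple


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> Tuple[float, float]:
    """Standard Elo update. score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b


def rate_candidates(matches: List[Tuple[str, str, float]], start: float = 1000.0) -> Dict[str, float]:
    """Apply Elo updates in order over (candidate_a, candidate_b, score_a) verdicts."""
    ratings: Dict[str, float] = {}
    for a, b, score_a in matches:
        r_a, r_b = ratings.get(a, start), ratings.get(b, start)
        ratings[a], ratings[b] = elo_update(r_a, r_b, score_a)
    return ratings


# Hypothetical pairwise verdicts between three prompt versions.
matches = [
    ("prompt-v17", "prompt-v16", 1.0),
    ("prompt-v17", "prompt-v15", 1.0),
    ("prompt-v16", "prompt-v15", 0.5),
]
ratings = rate_candidates(matches)
print(sorted(ratings, key=ratings.get, reverse=True))
```

Sequential Elo depends on match order, so shuffle the verdicts or average over several orderings before reading much into small rating gaps.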
Q: Can I skip the judge entirely with embedding similarity? Embedding similarity is a reference-based metric in disguise: it requires a reference. It's also surprisingly weak on agents because it rewards replies that resemble the reference in topic and phrasing, not replies that are correct or empathetic. Use it for retrieval relevance, not for agent quality.
Stop optimizing absolute LLM-judge scores you can't trust. Switch to pairwise. Position-swap. Use a cross-family judge. Calibrate against humans monthly. The signal-to-noise ratio of your eval program will go up by an order of magnitude in the first week — and your team will stop arguing about whether 0.78 is meaningfully better than 0.76.
Written by
Sagar Shankaran · Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.