---
title: "LLM-as-Judge: Using AI to Evaluate AI Agent Responses Automatically"
description: "Learn how to use LLMs as automated judges to evaluate AI agent responses with scoring rubrics, calibration techniques, and multi-criteria evaluation frameworks in Python."
canonical: https://callsphere.ai/blog/llm-as-judge-ai-evaluate-agent-responses-automatically
category: "Learn Agentic AI"
tags: ["LLM-as-Judge", "Evaluation", "AI Agents", "Automated Testing", "Python", "Scoring Rubrics"]
author: "CallSphere Team"
published: 2026-03-17T00:00:00.000Z
updated: 2026-05-06T12:16:34.368Z
---

# LLM-as-Judge: Using AI to Evaluate AI Agent Responses Automatically

> Learn how to use LLMs as automated judges to evaluate AI agent responses with scoring rubrics, calibration techniques, and multi-criteria evaluation frameworks in Python.

## Why Use an LLM to Judge Another LLM

Human evaluation is the gold standard for assessing agent quality, but it does not scale. Reviewing 500 agent responses manually takes days. LLM-as-Judge is a technique where you use a strong language model to score the outputs of your agent automatically, giving you scalable evaluation that correlates well with human judgment when calibrated correctly.

Research from teams at Google, Anthropic, and OpenAI shows that GPT-4-class models achieve 80-90% agreement with human raters on well-defined criteria. The key is writing precise rubrics and calibrating the judge against human labels.

## Basic Judge Implementation

A judge is simply an LLM call with a structured prompt that asks for a score and justification. In a CI setting, the judge slots into the eval harness that gates merges when quality regresses:

```mermaid
flowchart LR
    PR(["PR opened"])
    UNIT["Unit tests"]
    EVAL["Eval harness<br/>PromptFoo or Braintrust"]
    GOLD[("Golden set<br/>200 tagged cases")]
    JUDGE["LLM as judge<br/>plus regex graders"]
    SCORE["Aggregate score<br/>and per slice"]
    GATE{"Score regressed<br/>more than 2 percent?"}
    BLOCK(["Block merge"])
    MERGE(["Merge to main"])
    PR --> UNIT --> EVAL --> GOLD --> JUDGE --> SCORE --> GATE
    GATE -->|Yes| BLOCK
    GATE -->|No| MERGE
    style EVAL fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
    style MERGE fill:#059669,stroke:#047857,color:#fff
```

```python
import openai
import json
from dataclasses import dataclass

@dataclass
class JudgeResult:
    score: int           # 1-5
    reasoning: str
    criteria_scores: dict[str, int]

def evaluate_response(
    question: str,
    agent_response: str,
    reference_answer: str,
    client: openai.OpenAI,
    model: str = "gpt-4o",
) -> JudgeResult:
    prompt = f"""You are an expert evaluator. Score the following agent response.

Question: {question}
Reference Answer: {reference_answer}
Agent Response: {agent_response}

Score each criterion from 1 (poor) to 5 (excellent):
1. Correctness: Is the information accurate?
2. Completeness: Does it address all parts of the question?
3. Clarity: Is the response well-organized and easy to understand?

Return JSON:
{{"correctness": <1-5>, "completeness": <1-5>, "clarity": <1-5>, "overall": <1-5>, "reasoning": "<brief justification>"}}
"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,  # deterministic scoring makes evals reproducible
    )
    data = json.loads(response.choices[0].message.content)
    return JudgeResult(
        score=data["overall"],
        reasoning=data["reasoning"],
        criteria_scores={
            k: data[k] for k in ["correctness", "completeness", "clarity"]
        },
    )
```
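Judge models occasionally emit scores outside the 1-5 range or omit a field, and a single malformed payload can silently corrupt aggregate metrics. A minimal validation sketch; the `normalize_judge_payload` helper and its worst-case defaults are our own convention, not part of the OpenAI SDK:

```python
def normalize_judge_payload(data: dict) -> dict:
    """Clamp judge scores to the 1-5 scale and fill in missing fields."""
    criteria = ["correctness", "completeness", "clarity", "overall"]
    cleaned = {}
    for key in criteria:
        raw = data.get(key, 1)  # treat a missing score as the worst case
        try:
            score = int(raw)
        except (TypeError, ValueError):
            score = 1
        cleaned[key] = max(1, min(5, score))  # clamp into [1, 5]
    cleaned["reasoning"] = str(data.get("reasoning", ""))
    return cleaned
```

Running it on the parsed JSON before constructing `JudgeResult` turns a malformed judge response into a soft failure instead of a `KeyError`.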

## Designing Scoring Rubrics

Vague rubrics produce inconsistent scores. Anchor each score level to concrete behaviors.

```python
RUBRIC = """
## Correctness Rubric
- 5: All facts are accurate, no hallucinations
- 4: Minor inaccuracy that does not affect the main answer
- 3: One significant error, but the core answer is correct
- 2: Multiple errors or a critical factual mistake
- 1: The answer is fundamentally wrong or fabricated

## Completeness Rubric
- 5: Addresses every part of the question with sufficient detail
- 4: Addresses all parts but one lacks detail
- 3: Misses one part of a multi-part question
- 2: Only partially addresses the question
- 1: Fails to address the question at all
"""

def build_judge_prompt(question: str, response: str, rubric: str = RUBRIC) -> str:
    return f"""Evaluate this agent response using the rubric below.

{rubric}

Question: {question}
Response: {response}

Return JSON with scores and reasoning for each criterion."""
```
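Rubrics drift as teams edit them, so it is worth asserting that every criterion still anchors all five levels before the rubric ships. A cheap structural guard, assuming the `## <Criterion> Rubric` heading and `- N:` bullet format shown above (the `check_rubric_anchors` helper is our own, not a library function):

```python
import re

def check_rubric_anchors(rubric: str) -> dict[str, set[int]]:
    """Map each '## <Criterion> Rubric' heading to the score levels it anchors."""
    anchors: dict[str, set[int]] = {}
    current = None
    for line in rubric.splitlines():
        heading = re.match(r"##\s+(.*)\s+Rubric", line)
        if heading:
            current = heading.group(1)
            anchors[current] = set()
            continue
        level = re.match(r"-\s*(\d):", line.strip())
        if level and current is not None:
            anchors[current].add(int(level.group(1)))
    return anchors

def assert_complete(rubric: str) -> None:
    """Fail fast if any criterion is missing an anchor level."""
    for criterion, levels in check_rubric_anchors(rubric).items():
        assert levels == {1, 2, 3, 4, 5}, f"{criterion} is missing anchor levels"
```

Wiring `assert_complete(RUBRIC)` into your test suite catches a deleted anchor before it reaches the judge.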

## Calibration Against Human Labels

Before trusting LLM-as-Judge scores, calibrate by comparing judge scores to human ratings on a labeled subset.

```python
import numpy as np
from scipy import stats

def calibrate_judge(
    human_scores: list[int],
    judge_scores: list[int],
) -> dict:
    """Compare judge scores against human ground truth."""
    correlation, p_value = stats.spearmanr(human_scores, judge_scores)

    exact_match = sum(h == j for h, j in zip(human_scores, judge_scores))
    within_one = sum(
        abs(h - j) <= 1 for h, j in zip(human_scores, judge_scores)
    )
    n = len(human_scores)

    return {
        "spearman_correlation": round(correlation, 3),
        "p_value": round(p_value, 4),
        "exact_match_rate": round(exact_match / n, 3),
        "within_one_rate": round(within_one / n, 3),
    }
```

A Spearman correlation above 0.8 and a within-one agreement rate above 0.9 are reasonable bars to clear before trusting judge scores without human review.

## Multi-Criteria Evaluation with Weights

Different applications weight criteria differently. Collapse per-criterion scores into a single number with weights that reflect what matters for your use case.

```python
def weighted_score(
    criteria_scores: dict[str, int],
    weights: dict[str, float],
) -> float:
    total = sum(criteria_scores[k] * weights[k] for k in criteria_scores)
    weight_sum = sum(weights[k] for k in criteria_scores)
    return round(total / weight_sum, 2)

# For a customer support agent, correctness matters most
SUPPORT_WEIGHTS = {"correctness": 0.5, "completeness": 0.3, "clarity": 0.2}

# For a creative writing agent, clarity matters most
CREATIVE_WEIGHTS = {"correctness": 0.2, "completeness": 0.2, "clarity": 0.6}
```
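As a sanity check, a support response scoring 5/4/3 on correctness/completeness/clarity lands at 4.3 under the support weights, because correctness dominates. A self-contained restatement of the weighted-average helper from this section:

```python
def weighted_score(
    criteria_scores: dict[str, int],
    weights: dict[str, float],
) -> float:
    """Weighted average of per-criterion scores, rounded to 2 decimals."""
    total = sum(criteria_scores[k] * weights[k] for k in criteria_scores)
    weight_sum = sum(weights[k] for k in criteria_scores)
    return round(total / weight_sum, 2)

SUPPORT_WEIGHTS = {"correctness": 0.5, "completeness": 0.3, "clarity": 0.2}

score = weighted_score(
    {"correctness": 5, "completeness": 4, "clarity": 3}, SUPPORT_WEIGHTS
)
# (5*0.5 + 4*0.3 + 3*0.2) / 1.0 = 4.3
```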

## Running Batch Evaluations

Evaluate your full dataset efficiently with concurrent judge calls.

```python
import asyncio

async def batch_evaluate(
    eval_cases: list[dict],
    agent_fn,
    judge_fn,
    concurrency: int = 5,
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def process_case(case):
        async with semaphore:
            agent_output = await agent_fn(case["input"])
            judge_result = await judge_fn(
                case["input"], agent_output, case["expected"]
            )
            return {**case, "output": agent_output, "judge": judge_result}

    tasks = [process_case(c) for c in eval_cases]
    return await asyncio.gather(*tasks)
```
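To exercise the harness without network calls, stub the agent and judge with trivial async functions; `fake_agent` and `fake_judge` below are placeholders for your real callables, and `batch_evaluate` is restated compactly so the sketch runs on its own:

```python
import asyncio

async def batch_evaluate(eval_cases, agent_fn, judge_fn, concurrency=5):
    """Run agent then judge over every case with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def process_case(case):
        async with semaphore:
            agent_output = await agent_fn(case["input"])
            judge_result = await judge_fn(
                case["input"], agent_output, case["expected"]
            )
            return {**case, "output": agent_output, "judge": judge_result}

    # gather preserves input order, so results line up with eval_cases
    return await asyncio.gather(*(process_case(c) for c in eval_cases))

async def fake_agent(question):
    return f"answer to: {question}"

async def fake_judge(question, output, expected):
    # trivial grader: full marks only when the expected phrase appears verbatim
    return {"overall": 5 if expected in output else 2}

cases = [
    {"input": "reset password", "expected": "reset password"},
    {"input": "billing cycle", "expected": "invoice date"},
]
results = asyncio.run(batch_evaluate(cases, fake_agent, fake_judge))
```

Swapping the stubs for your production agent and an async wrapper around `evaluate_response` turns this into the real batch run.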

## FAQ

### Does the judge model need to be stronger than the agent model?

Yes, generally. A GPT-4o judge evaluating GPT-3.5 agent outputs works well. Judging a model with an equally capable or weaker model produces unreliable scores because the judge may share the same blind spots.

### How do I prevent position bias in the judge?

When comparing two responses (A vs B), run the evaluation twice — once with A first, once with B first — and average the results. This counteracts the tendency for LLMs to prefer whichever response appears first.
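Sketched in code, with `judge_pair` standing in for your real pairwise judge call (assumed here to return the probability that the first response shown wins):

```python
def debiased_preference(judge_pair, question: str, resp_a: str, resp_b: str) -> float:
    """Average the preference for A across both orderings to cancel position bias."""
    a_first = judge_pair(question, resp_a, resp_b)         # A shown first
    a_second = 1.0 - judge_pair(question, resp_b, resp_a)  # A shown second
    return (a_first + a_second) / 2

def position_biased_judge(question: str, first: str, second: str) -> float:
    """Stub judge with pure position bias: always leans 0.6 toward the first response."""
    return 0.6

pref = debiased_preference(position_biased_judge, "q", "response A", "response B")
# pure position bias averages out to a 0.5 tie
```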

### How much does LLM-as-Judge cost compared to human evaluation?

Evaluating 1,000 responses with GPT-4o costs roughly two to five dollars depending on response length. The same volume of human evaluation typically costs hundreds of dollars and takes days. LLM-as-Judge is roughly 50-100x cheaper.
