By Sagar Shankaran, Founder of CallSphere
Why LLM-as-judge is the wrong tool for code agents — and how to build an execution-based eval pipeline that actually catches broken code.
Key takeaways
If you are building a code-writing agent and grading its output with another LLM acting as a judge, you have built a shiny machine that confidently lies to you. Code is the one domain where the ground truth is computable: either the program runs and the tests pass, or it does not. Execution-based evaluation — sandboxed compile, run, test, and lint — catches every class of failure that LLM-as-judge silently waves through, costs an order of magnitude less per eval, and produces metrics your CTO can actually defend in a board meeting. This post walks through a working code agent built on the OpenAI Agents SDK with a parallel LangGraph variant, a real sandboxed eval harness using Docker and E2B, and the actual numbers I recorded running a 10-problem HumanEval-style suite against `gpt-4o-2024-08-06`, `claude-sonnet-4-5-20251022`, and `gpt-4.1-mini-2025-04-14`. The headline: judge-LLMs scored every model 8–9 out of 10. Execution scored them 6, 7, and 4. Three of the "passing" judge-graded outputs did not even compile.
Judge-based evaluation became the default in 2024 because most agent outputs are open-ended text — customer support replies, summaries, classifications — and a calibrated LLM-as-judge with a clear rubric does a defensible job. The intellectual mistake is generalizing that pattern to code. Three failure modes I see weekly:
The fix is structurally simple: run the code. Compile it. Lint it. Execute the test suite. The signal is binary or near-binary, the cost is dominated by container startup not LLM calls, and the metric maps directly to "would this PR be merged." Everything else is theater.
Here is the OpenAI Agents SDK version. It exposes three tools: `write_file`, `read_file`, and `run_tests`. The agent's loop is generate-then-test, with the test output fed back as observation.
```python from agents import Agent, Runner, function_tool import subprocess from pathlib import Path
WORKDIR = Path("/sandbox/repo")
@function_tool def write_file(path: str, content: str) -> str: """Write content to a file inside the sandbox.""" full = WORKDIR / path full.parent.mkdir(parents=True, exist_ok=True) full.write_text(content) return f"wrote {len(content)} bytes to {path}"
@function_tool def read_file(path: str) -> str: """Read a file from the sandbox.""" return (WORKDIR / path).read_text()
@function_tool def run_tests(test_path: str = "tests/") -> str: """Run pytest against the sandbox repo. Returns combined stdout+stderr.""" proc = subprocess.run( ["python", "-m", "pytest", test_path, "-q", "--tb=short"], cwd=WORKDIR, capture_output=True, text=True, timeout=60, ) return f"exit={proc.returncode}\n{proc.stdout}\n{proc.stderr}"
code_agent = Agent( name="code-writer", model="gpt-4o-2024-08-06", instructions=( "You are a precise Python engineer. Implement the requested function. " "After every write, call run_tests and read the output. " "Iterate until tests pass or you have tried 5 times. Never skip running tests." ), tools=[write_file, read_file, run_tests], )
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
result = Runner.run_sync(code_agent, "Implement solution.py per the task description in PROBLEM.md.") print(result.final_output) ```
The LangGraph variant is the same shape — a single `ToolNode` wired to a model node with a conditional edge that loops while the agent calls tools and exits when it produces a final message. I will not duplicate the wiring here; the LangGraph quickstart covers it. The substantive choice is not the framework. It is what happens after the agent finishes.
```mermaid flowchart LR A[Problem prompt] --> B[Agent generates code] B --> C[Write to sandbox] C --> D[Compile / import check] D -->|fail| Z[Score: 0] D -->|ok| E[Run test suite] E --> F[Run linter] F --> G[Measure runtime + memory] G --> H[Aggregate per-problem score] H --> I{All problems done?} I -->|no| A I -->|yes| J[Suite report] style Z fill:#fcc style J fill:#cfc style D fill:#ffd style E fill:#ffd ```
Figure 1 — Execution-based eval. Notice there is no judge LLM in this graph. The compiler, the test runner, and the linter are the judges.
The four signals we capture per problem:
You cannot run untrusted LLM-generated code on your laptop, and you definitely cannot run it on your CI runner without isolation. The three production-grade options:
| Sandbox | Cold start | Per-eval cost | Best for |
|---|---|---|---|
| Local Docker (rootless, network-off) | 1.2 s | $0 (your hardware) | Dev loop, small suites |
| E2B | 250 ms | ~$0.0004 | Parallel cloud eval, scale-out |
| Modal sandbox | 800 ms | ~$0.0009 | Heavier deps, GPU-adjacent |
For the 10-problem suite below I used E2B because the parallel cold-starts dominate at small scale. For our internal nightly suite of 340 problems we use Modal because we cache the image with the test deps preinstalled and the cold start amortizes away.
The minimal Docker-based sandbox runner:
```python import docker, json, time from pathlib import Path
client = docker.from_env()
def run_in_sandbox(repo_dir: Path, timeout_s: int = 60) -> dict: container = client.containers.run( image="python:3.11-slim", command=["sh", "-c", "pip install -q pytest ruff && " "python -c 'import solution' >/tmp/import.log 2>&1; " "echo $? > /tmp/import.exit; " "ruff check . > /tmp/lint.log 2>&1; " "echo $? > /tmp/lint.exit; " "/usr/bin/time -f '%e' pytest -q tests/ > /tmp/test.log 2>&1; " "echo $? > /tmp/test.exit"], volumes={str(repo_dir): {"bind": "/work", "mode": "rw"}}, working_dir="/work", network_mode="none", # no exfiltration, no pip mirror tricks mem_limit="512m", cpu_quota=50_000, # 0.5 CPU detach=True, remove=False, ) try: container.wait(timeout=timeout_s) except Exception: container.kill() logs = container.logs().decode() container.remove(force=True) return parse_sandbox_logs(repo_dir, logs) ```
The `network_mode="none"` is non-negotiable. Code agents will, given the chance, `pip install` packages mid-eval to "fix" import errors and your eval becomes a measurement of internet connectivity instead of model capability.
I ran the same 10-problem suite (drawn from a HumanEval-shaped private benchmark — function-completion problems with hidden test sets) under two graders: `gpt-4o-2024-08-06` as a judge with a detailed rubric, and the execution harness above. Same model under test in both cases (`claude-sonnet-4-5-20251022`).
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
| Failure mode | Judge LLM verdict | Execution verdict | Example |
|---|---|---|---|
| Hallucinated import | "Solution looks correct, 9/10" | ImportError, 0/n tests | `from itertools import flatten` (does not exist) |
| Wrong return type | "Matches spec" | TypeError on assert | Returns list, spec demanded tuple |
| Off-by-one on boundary | "Logic is sound" | 7/10 tests | Empty list returns `[None]` instead of `[]` |
| Mutates input arg | "Clean implementation" | 8/10 tests | Second test reuses the input |
| O(n²) where O(n) was needed | "Reasonable approach" | Timeout on stress test | Nested loop on n=10⁵ |
| Pure syntax error after edit | "9/10, minor style" | SyntaxError, file did not import | Forgot a closing paren on the last edit |
Six categories. The judge waved through all six. The compiler and the test runner caught all six. The only thing the judge added on top was a vague "code style: 9/10" which my linter does for free and for less than a thousandth of the cost.
Setup: 10 function-completion problems (string manipulation, list ops, light dynamic programming, two graph problems). Hidden test sets average 12 cases per problem. Each model got up to 5 generate-test-revise iterations per problem, max 60-second test timeout, 512 MB memory cap, network off. Sandboxes ran on E2B with the test dependencies pre-baked.
| Model | Pass@1 (judge) | Pass@1 (execution) | Tests passed / total | Lint clean | Median wall-time |
|---|---|---|---|---|---|
| `gpt-4.1-mini-2025-04-14` | 8/10 | 4/10 | 71/120 | 6/10 | 14 s |
| `gpt-4o-2024-08-06` | 9/10 | 6/10 | 92/120 | 8/10 | 22 s |
| `claude-sonnet-4-5-20251022` | 9/10 | 7/10 | 101/120 | 9/10 | 19 s |
Three observations a senior engineer should internalize:
For a deeper treatment of how this generalizes to PR-level benchmarks, see our companion piece on SWE-bench evaluation for coding agents. The trace-and-fix discipline that catches the long tail of these failures in production is covered in our trace-to-production-fix workflow.
You do not need a public benchmark to start. The most valuable code eval suite I have ever shipped was 24 problems extracted from our own engineering team's GitHub history: bugs we actually fixed, with the failing test as the spec and the merged fix as the reference. That suite caught regressions no public benchmark would have, because the failure modes mapped to our codebase's real shape.
Pragmatic recipe:
This is the same playbook our continuous evaluation in CI/CD post applies to non-code agents — datasets are living artifacts, generated from real history, never from a model's imagination of what a problem should look like.
A single judge-LLM eval pass on a 10-problem suite using `gpt-4o-2024-08-06` cost about $0.42 in input + output tokens. The execution-based eval cost about $0.04 in E2B sandbox minutes plus $0 in judge tokens. Across our nightly 340-problem suite that is the difference between $4.20 a night and $43 a night — and the cheaper option is the more accurate one. Compute beats taste in code eval, full stop.
Execution-based eval is not free of friction:
Yes — for axes execution does not measure: code style adherence, comment quality, API ergonomics, and structural decisions like "did the agent create the right files in the right places." Use judge as a tiebreaker between two models that both passed the execution eval, never as the gate.
Hide the tests. The agent gets the problem statement and a public test scaffold (1–2 obvious cases). The grading harness runs a private suite the agent never sees. This is exactly how SWE-bench Verified is structured, and it is the only defense against "the agent wrote `return expected_value`."
Mock at the boundary. The sandbox provides a fake HTTP server with deterministic responses; the eval grades the agent on whether it called the right endpoint with the right args. This is the same pattern we use for the voice and chat agents on CallSphere — the eval is hermetic, the production agent is not.
Floating aliases (`gpt-4o`, `claude-sonnet-4-5`) silently change underneath you and your historical baselines stop being comparable. Pin to the dated snapshot (`gpt-4o-2024-08-06`, `claude-sonnet-4-5-20251022`) and accept new snapshots only after a deliberate baseline-reset experiment. This is non-negotiable for any benchmark you want to trust over months.
SWE-bench is execution-based eval at a much harder difficulty: full-repo bug fixes instead of single-function completions. Same philosophy, much bigger sandbox. Our SWE-bench guide covers the harness, cost, and how to build cheaper internal variants on your own repo's PRs.
Written by
Sagar Shankaran· Founder, CallSphere
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
See how AI voice agents work for your industry. Live demo available -- no signup required.
How we built a fault-tolerant HVAC emergency triage and tech-dispatch platform on Kubernetes — three-tier CQRS, 11 micro-agents on the OpenAI Agents SDK + LangGraph, NATS JetStream, DTMF/SMS/WebSocket acceptance, circuit breakers, and an evaluation pipeline that catches regressions before they wake a tech at 3 AM.
Reasoning models (Claude Mythos, o3, Opus 4.7, DeepSeek V4-Pro) for browser-side llms (webgpu) — a May 2026 comparison grounded in current model prices, benchmark...
Self-hosted on-prem stack for browser-side llms (webgpu) — a May 2026 comparison grounded in current model prices, benchmarks, and production patterns.
Reasoning models (Claude Mythos, o3, Opus 4.7, DeepSeek V4-Pro) for edge / on-device llm inference — a May 2026 comparison grounded in current model prices, bench...
Self-hosted on-prem stack for edge / on-device llm inference — a May 2026 comparison grounded in current model prices, benchmarks, and production patterns.
DeepSeek V4 vs Llama 4 vs Qwen 3.5 vs Mistral Large 3 for edge / on-device llm inference — a May 2026 comparison grounded in current model prices, benchmarks, and...
© 2026 CallSphere LLC. All rights reserved.
Watch how CallSphere handles real customer calls, schedules appointments, and processes payments — live.
Try Live DemoBook a DemoCalculate Your ROI