Code-Writing Agents in 2026: Execution-Based Evaluation Beats Everything Else

Why LLM-as-judge is the wrong tool for code agents — and how to build an execution-based eval pipeline that actually catches broken code.

TL;DR

If you are building a code-writing agent and grading its output with another LLM acting as a judge, you have built a shiny machine that confidently lies to you. Code is the one domain where the ground truth is computable: either the program runs and the tests pass, or it does not. Execution-based evaluation — sandboxed compile, run, test, and lint — catches every class of failure that LLM-as-judge silently waves through, costs an order of magnitude less per eval, and produces metrics your CTO can actually defend in a board meeting. This post walks through a working code agent built on the OpenAI Agents SDK with a parallel LangGraph variant, a real sandboxed eval harness using Docker and E2B, and the actual numbers I recorded running a 10-problem HumanEval-style suite against `gpt-4o-2024-08-06`, `claude-sonnet-4-5-20251022`, and `gpt-4.1-mini-2025-04-14`. The headline: judge-LLMs scored every model 8–9 out of 10. Execution scored them 6, 7, and 4. Three of the "passing" judge-graded outputs did not even compile.

Why LLM-as-Judge Is the Wrong Tool for Code

Judge-based evaluation became the default in 2024 because most agent outputs are open-ended text — customer support replies, summaries, classifications — and a calibrated LLM-as-judge with a clear rubric does a defensible job. The intellectual mistake is generalizing that pattern to code. Three failure modes I see weekly:

  1. Plausible-but-broken syntax. A judge LLM will happily approve a Python function that uses `await` outside an `async def` or imports a library that does not exist. The output looks like code; it reads fine to a human skimming it; the judge gives it 9/10. It breaks the moment it is imported or called.
  2. Off-by-one and edge-case logic errors. A judge reads "this function reverses a linked list" and grades the code on whether it plausibly matches that description. It does not execute the test cases. The function fails on `None` or single-element input. The judge does not notice because it is grading vibes.
  3. Hallucinated APIs. The agent invents a method that almost-but-not-quite exists in pandas or numpy. The judge, also fluent in pandas vibes, approves. The code raises `AttributeError` in production.

The fix is structurally simple: run the code. Compile it. Lint it. Execute the test suite. The signal is binary or near-binary, the cost is dominated by container startup not LLM calls, and the metric maps directly to "would this PR be merged." Everything else is theater.
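
Before any sandboxing, the core of that signal fits in a dozen lines. A minimal sketch — the `quick_verdict` helper, the paths, and the 60-second timeout are illustrative, not part of the harness later in this post:

```python
import py_compile
import subprocess

def quick_verdict(solution_path: str, test_dir: str = "tests/") -> dict:
    # Signal 1: does the file even compile?
    try:
        py_compile.compile(solution_path, doraise=True)
    except py_compile.PyCompileError as exc:
        return {"compiles": False, "tests_pass": False, "detail": str(exc)}
    # Signal 2: does the suite pass? pytest exit code 0 means all tests green.
    proc = subprocess.run(
        ["python", "-m", "pytest", test_dir, "-q"],
        capture_output=True, text=True, timeout=60,
    )
    return {"compiles": True, "tests_pass": proc.returncode == 0,
            "detail": proc.stdout[-2000:]}
```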

Anatomy of the Code Agent

Here is the OpenAI Agents SDK version. It exposes three tools: `write_file`, `read_file`, and `run_tests`. The agent's loop is generate-then-test, with the test output fed back as observation.

```python
from agents import Agent, Runner, function_tool
import subprocess
from pathlib import Path

WORKDIR = Path("/sandbox/repo")

@function_tool
def write_file(path: str, content: str) -> str:
    """Write content to a file inside the sandbox."""
    full = WORKDIR / path
    full.parent.mkdir(parents=True, exist_ok=True)
    full.write_text(content)
    return f"wrote {len(content)} bytes to {path}"

@function_tool
def read_file(path: str) -> str:
    """Read a file from the sandbox."""
    return (WORKDIR / path).read_text()

@function_tool
def run_tests(test_path: str = "tests/") -> str:
    """Run pytest against the sandbox repo. Returns combined stdout+stderr."""
    proc = subprocess.run(
        ["python", "-m", "pytest", test_path, "-q", "--tb=short"],
        cwd=WORKDIR,
        capture_output=True,
        text=True,
        timeout=60,
    )
    return f"exit={proc.returncode}\n{proc.stdout}\n{proc.stderr}"

code_agent = Agent(
    name="code-writer",
    model="gpt-4o-2024-08-06",
    instructions=(
        "You are a precise Python engineer. Implement the requested function. "
        "After every write, call run_tests and read the output. "
        "Iterate until tests pass or you have tried 5 times. Never skip running tests."
    ),
    tools=[write_file, read_file, run_tests],
)

result = Runner.run_sync(
    code_agent,
    "Implement solution.py per the task description in PROBLEM.md.",
)
print(result.final_output)
```

The LangGraph variant is the same shape — a single `ToolNode` wired to a model node with a conditional edge that loops while the agent calls tools and exits when it produces a final message. I will not duplicate the wiring here; the LangGraph quickstart covers it. The substantive choice is not the framework. It is what happens after the agent finishes.
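
If you do want a starting point for that variant, here is a minimal sketch using LangGraph's prebuilt ReAct helper instead of hand-wired nodes — the tool bodies mirror the Agents SDK version above, re-wrapped with LangChain's `@tool` decorator. Treat it as a sketch, not the harness used for the numbers below.

```python
from pathlib import Path
import subprocess

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

WORKDIR = Path("/sandbox/repo")

@tool
def write_file(path: str, content: str) -> str:
    """Write content to a file inside the sandbox."""
    full = WORKDIR / path
    full.parent.mkdir(parents=True, exist_ok=True)
    full.write_text(content)
    return f"wrote {len(content)} bytes to {path}"

@tool
def read_file(path: str) -> str:
    """Read a file from the sandbox."""
    return (WORKDIR / path).read_text()

@tool
def run_tests(test_path: str = "tests/") -> str:
    """Run pytest against the sandbox repo and return combined output."""
    proc = subprocess.run(
        ["python", "-m", "pytest", test_path, "-q", "--tb=short"],
        cwd=WORKDIR, capture_output=True, text=True, timeout=60,
    )
    return f"exit={proc.returncode}\n{proc.stdout}\n{proc.stderr}"

# Model node + ToolNode with a conditional edge that loops while the model
# keeps calling tools and exits when it produces a plain final message.
langgraph_agent = create_react_agent(
    ChatOpenAI(model="gpt-4o-2024-08-06"),
    tools=[write_file, read_file, run_tests],
)
result = langgraph_agent.invoke(
    {"messages": [("user", "Implement solution.py per the task description in PROBLEM.md.")]}
)
```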

The Eval Loop, in Pictures

```mermaid
flowchart LR
    A[Problem prompt] --> B[Agent generates code]
    B --> C[Write to sandbox]
    C --> D[Compile / import check]
    D -->|fail| Z[Score: 0]
    D -->|ok| E[Run test suite]
    E --> F[Run linter]
    F --> G[Measure runtime + memory]
    G --> H[Aggregate per-problem score]
    H --> I{All problems done?}
    I -->|no| A
    I -->|yes| J[Suite report]
    style Z fill:#fcc
    style J fill:#cfc
    style D fill:#ffd
    style E fill:#ffd
```

Figure 1 — Execution-based eval. Notice there is no judge LLM in this graph. The compiler, the test runner, and the linter are the judges.

The four signals we capture per problem (a small aggregation sketch follows the list):

  • Compile success — does the file even import?
  • Tests passed / total — the dominant signal. Partial credit is allowed; some problems have multiple test cases.
  • Lint score — `ruff` clean rate. Not a gate, but a tiebreaker between models that pass equally many tests.
  • Runtime — wall-clock to complete tests. Catches O(n²) solutions that pass small cases but blow the 60-second timeout on the real input.
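
A sketch of how those four signals can be folded into one per-problem record — the dataclass and the sort-key scoring are illustrative assumptions, not the harness's exact schema:

```python
from dataclasses import dataclass

@dataclass
class ProblemResult:
    compiled: bool
    tests_passed: int
    tests_total: int
    lint_clean: bool
    runtime_s: float

    @property
    def passed(self) -> bool:
        # pass@1 as reported below: the file imports and every hidden test is green
        return self.compiled and self.tests_passed == self.tests_total

    def score(self) -> tuple:
        # Sort key for model comparisons: tests dominate, lint breaks ties,
        # faster runtime is the last-resort tiebreaker.
        frac = self.tests_passed / self.tests_total if self.compiled else 0.0
        return (frac, self.lint_clean, -self.runtime_s)
```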

Sandboxing: Docker, E2B, or Modal

You cannot run untrusted LLM-generated code on your laptop, and you definitely cannot run it on your CI runner without isolation. The three production-grade options:

| Sandbox | Cold start | Per-eval cost | Best for |
|---|---|---|---|
| Local Docker (rootless, network off) | 1.2 s | $0 (your hardware) | Dev loop, small suites |
| E2B | 250 ms | ~$0.0004 | Parallel cloud eval, scale-out |
| Modal sandbox | 800 ms | ~$0.0009 | Heavier deps, GPU-adjacent |

For the 10-problem suite below I used E2B: at that scale cold-start time dominates the wall clock, and E2B's is the shortest. For our internal nightly suite of 340 problems we use Modal, because we cache the image with the test deps preinstalled and the cold start amortizes away.

The minimal Docker-based sandbox runner (it assumes an image with pytest and ruff pre-baked, since the network is off):

```python
import docker
from pathlib import Path

client = docker.from_env()

# Each step writes its exit code and log into .eval/ inside the mounted repo,
# so the host can still read the results after the container is removed.
# GNU time is not present in slim images, so wall-clock comes from `date`.
SANDBOX_CMD = (
    "mkdir -p .eval && "
    "python -c 'import solution' > .eval/import.log 2>&1; echo $? > .eval/import.exit; "
    "ruff check . > .eval/lint.log 2>&1; echo $? > .eval/lint.exit; "
    "START=$(date +%s); "
    "python -m pytest -q tests/ > .eval/test.log 2>&1; echo $? > .eval/test.exit; "
    "echo $(($(date +%s) - START)) > .eval/runtime.s"
)

def run_in_sandbox(repo_dir: Path, timeout_s: int = 60) -> dict:
    container = client.containers.run(
        image="code-eval:py311",  # image with pytest + ruff baked in; with the
                                  # network off, nothing can be installed at runtime
        command=["sh", "-c", SANDBOX_CMD],
        volumes={str(repo_dir): {"bind": "/work", "mode": "rw"}},
        working_dir="/work",
        network_mode="none",      # no exfiltration, no pip mirror tricks
        mem_limit="512m",
        cpu_quota=50_000,         # 0.5 CPU
        detach=True,
        remove=False,
    )
    try:
        container.wait(timeout=timeout_s)
    except Exception:
        container.kill()
    logs = container.logs().decode()
    container.remove(force=True)
    return parse_sandbox_logs(repo_dir, logs)
```

The `network_mode="none"` is non-negotiable. Code agents will, given the chance, `pip install` packages mid-eval to "fix" import errors and your eval becomes a measurement of internet connectivity instead of model capability.
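
`parse_sandbox_logs` is elided above; one possible implementation, assuming the runner writes its marker files into `.eval/` inside the mounted repo as in the snippet:

```python
def parse_sandbox_logs(repo_dir: Path, logs: str) -> dict:
    # Reads the exit codes and logs the sandbox command left behind in the
    # mounted repo. Field names are illustrative, not a required schema.
    eval_dir = repo_dir / ".eval"

    def exit_code(name: str) -> int:
        f = eval_dir / name
        return int(f.read_text().strip()) if f.exists() else 1  # missing file = failure

    def text(name: str) -> str:
        f = eval_dir / name
        return f.read_text() if f.exists() else ""

    return {
        "compiled": exit_code("import.exit") == 0,
        "lint_clean": exit_code("lint.exit") == 0,
        "tests_ok": exit_code("test.exit") == 0,
        "runtime_s": float(text("runtime.s") or 0),
        "test_output": text("test.log"),
        "container_logs": logs,  # anything the container printed directly
    }
```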

What LLM-as-Judge Misses, Concretely

I ran the same 10-problem suite (drawn from a HumanEval-shaped private benchmark — function-completion problems with hidden test sets) under two graders: `gpt-4o-2024-08-06` as a judge with a detailed rubric, and the execution harness above. Same model under test in both cases (`claude-sonnet-4-5-20251022`).

| Failure mode | Judge LLM verdict | Execution verdict | Example |
|---|---|---|---|
| Hallucinated import | "Solution looks correct, 9/10" | ImportError, 0/n tests | `from itertools import flatten` (does not exist) |
| Wrong return type | "Matches spec" | TypeError on assert | Returns list, spec demanded tuple |
| Off-by-one on boundary | "Logic is sound" | 7/10 tests | Empty list returns `[None]` instead of `[]` |
| Mutates input arg | "Clean implementation" | 8/10 tests | Second test reuses the input |
| O(n²) where O(n) was needed | "Reasonable approach" | Timeout on stress test | Nested loop on n=10⁵ |
| Pure syntax error after edit | "9/10, minor style" | SyntaxError, file did not import | Forgot a closing paren on the last edit |

Six categories. The judge waved through all six. The compiler and the test runner caught all six. The only thing the judge added on top was a vague "code style: 9/10" which my linter does for free and for less than a thousandth of the cost.

Real Numbers From a 10-Problem Suite

Setup: 10 function-completion problems (string manipulation, list ops, light dynamic programming, two graph problems). Hidden test sets average 12 cases per problem. Each model got up to 5 generate-test-revise iterations per problem, max 60-second test timeout, 512 MB memory cap, network off. Sandboxes ran on E2B with the test dependencies pre-baked.

| Model | Pass@1 (judge) | Pass@1 (execution) | Tests passed / total | Lint clean | Median wall-time |
|---|---|---|---|---|---|
| `gpt-4.1-mini-2025-04-14` | 8/10 | 4/10 | 71/120 | 6/10 | 14 s |
| `gpt-4o-2024-08-06` | 9/10 | 6/10 | 92/120 | 8/10 | 22 s |
| `claude-sonnet-4-5-20251022` | 9/10 | 7/10 | 101/120 | 9/10 | 19 s |

Three observations a senior engineer should internalize:

  1. The judge had 20–40 percentage points of false-positive headroom, depending on the model. It rated mini at 80% when it was actually 40%. That is not "noisy"; that is wrong in a way that would let bad code ship.
  2. The judge collapsed a real gap between 4o and Sonnet. It tied them at 9/10; execution put Sonnet a full problem ahead. If you were choosing a model based on judge eval, you would be picking on noise.
  3. The lint signal correlated with execution pass rate, not judge score. The cheapest tool in the stack — `ruff` — was a better proxy for "does the code work" than the most expensive — a frontier model judge. That is a durable lesson.

For a deeper treatment of how this generalizes to PR-level benchmarks, see our companion piece on SWE-bench evaluation for coding agents. The trace-and-fix discipline that catches the long tail of these failures in production is covered in our trace-to-production-fix workflow.

Building Your Own Suite

You do not need a public benchmark to start. The most valuable code eval suite I have ever shipped was 24 problems extracted from our own engineering team's GitHub history: bugs we actually fixed, with the failing test as the spec and the merged fix as the reference. That suite caught regressions no public benchmark would have, because the failure modes mapped to our codebase's real shape.

Pragmatic recipe (the per-problem manifest it produces is sketched after the list):

  1. Mine 20–40 PRs that closed bugs and had tests added in the same diff.
  2. Strip the fix; keep the failing test plus a one-paragraph problem statement.
  3. Pin each problem with a Docker image that has the right dependencies pre-installed.
  4. Run the full suite weekly; gate model upgrades on it.
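
The per-problem manifest that recipe produces can stay tiny. An illustrative shape — field names, paths, and the image tag are assumptions, not a required schema:

```python
PROBLEM = {
    "id": "billing-rounding-017",                  # hypothetical mined bug
    "statement": "problems/017/PROBLEM.md",        # one-paragraph spec from the PR
    "public_tests": "problems/017/tests_public/",  # scaffold the agent sees
    "hidden_tests": "problems/017/tests_hidden/",  # suite the harness grades
    "image": "code-eval:py311-pandas",             # deps pinned per problem
    "timeout_s": 60,
    "mem_limit": "512m",
}
```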

This is the same playbook our continuous evaluation in CI/CD post applies to non-code agents — datasets are living artifacts, generated from real history, never from a model's imagination of what a problem should look like.

Cost Reality

A single judge-LLM eval pass on a 10-problem suite using `gpt-4o-2024-08-06` cost about $0.42 in input + output tokens. The execution-based eval cost about $0.04 in E2B sandbox minutes plus $0 in judge tokens. Extrapolated to our nightly 340-problem suite, that is roughly $14 a night for the judge versus roughly $1.40 for execution — and the cheaper option is the more accurate one. Compute beats taste in code eval, full stop.

Honest Tradeoffs

Execution-based eval is not free of friction:

  • Test authoring is real work. Hidden test sets need to be hand-written or mined. There is no shortcut to a high-quality test corpus.
  • Sandboxes are an ops surface. You will, at some point, debug why E2B images drift from your local Docker image. Pin everything.
  • Some problems do not have a clean execution answer. "Refactor this 800-line file for readability" has no test. For those, you fall back to LLM-as-judge plus structural metrics (cyclomatic complexity delta, etc.). Use the right tool.
  • Flaky tests poison the signal. Quarantine flaky problems aggressively; one flaky problem will dominate weeks of model-comparison reports.

Frequently Asked Questions

Should I ever use LLM-as-judge for code?

Yes — for axes execution does not measure: code style adherence, comment quality, API ergonomics, and structural decisions like "did the agent create the right files in the right places." Use judge as a tiebreaker between two models that both passed the execution eval, never as the gate.

How do I prevent the agent from gaming the test suite?

Hide the tests. The agent gets the problem statement and a public test scaffold (1–2 obvious cases). The grading harness runs a private suite the agent never sees. This is exactly how SWE-bench Verified is structured, and it is the only defense against "the agent wrote `return expected_value`."
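
A sketch of that split in harness code — the agent workspace only ever receives the public scaffold, and the private suite is swapped in after the agent finishes. Paths and helper names are illustrative; `run_in_sandbox` is the Docker runner from earlier.

```python
import shutil
from pathlib import Path

def prepare_agent_workspace(problem_dir: Path, workdir: Path) -> None:
    # The agent sees the problem statement plus the 1-2 obvious public cases.
    shutil.copy(problem_dir / "PROBLEM.md", workdir / "PROBLEM.md")
    shutil.copytree(problem_dir / "tests_public", workdir / "tests", dirs_exist_ok=True)

def grade(problem_dir: Path, workdir: Path) -> dict:
    # Replace the agent-visible tests with the private suite, then execute.
    shutil.rmtree(workdir / "tests", ignore_errors=True)
    shutil.copytree(problem_dir / "tests_hidden", workdir / "tests")
    return run_in_sandbox(workdir)
```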

What about agents that need to call out to external services?

Mock at the boundary. The sandbox provides a fake HTTP server with deterministic responses; the eval grades the agent on whether it called the right endpoint with the right args. This is the same pattern we use for the voice and chat agents on CallSphere — the eval is hermetic, the production agent is not.
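
A minimal sketch of that boundary mock — a canned HTTP endpoint started alongside the code under test. The port, route, and payload are illustrative assumptions.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

CANNED = {"/v1/orders/42": {"status": "shipped", "eta_days": 2}}  # deterministic fixture
REQUESTED_PATHS: list[str] = []  # the eval asserts on these, not on live responses

class FakeAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        REQUESTED_PATHS.append(self.path)
        body = json.dumps(CANNED.get(self.path, {"error": "not found"})).encode()
        self.send_response(200 if self.path in CANNED else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep eval output quiet
        pass

server = HTTPServer(("127.0.0.1", 8089), FakeAPI)
threading.Thread(target=server.serve_forever, daemon=True).start()
```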

Why pin model snapshots?

Floating aliases (`gpt-4o`, `claude-sonnet-4-5`) silently change underneath you and your historical baselines stop being comparable. Pin to the dated snapshot (`gpt-4o-2024-08-06`, `claude-sonnet-4-5-20251022`) and accept new snapshots only after a deliberate baseline-reset experiment. This is non-negotiable for any benchmark you want to trust over months.

How does this compare to running SWE-bench?

SWE-bench is execution-based eval at a much harder difficulty: full-repo bug fixes instead of single-function completions. Same philosophy, much bigger sandbox. Our SWE-bench guide covers the harness, cost, and how to build cheaper internal variants on your own repo's PRs.
