SWE-bench in 2026: How to Evaluate Your Coding Agent Like Anthropic and OpenAI Do
A practical guide to running SWE-bench (and SWE-bench Verified / Lite) on your own coding agent, plus the cheaper internal benchmarks that actually move the needle.
TL;DR
Every frontier-lab announcement in the last 18 months — Claude 3.5 Sonnet, GPT-4.1, GPT-5, Claude Sonnet 4.5 — quotes a SWE-bench number. Most engineering teams I talk to have never actually run it. The bar is lower than you think: a working SWE-bench-Lite run on a custom agent fits in an afternoon and a few hundred dollars of compute. The bar to do it responsibly — with the right harness, the right cost ceiling, and the right interpretation — is higher. This post walks through what SWE-bench actually measures, how Lite differs from Verified differs from full, the working harness invocation, a custom-agent submission shape using the OpenAI Agents SDK and a LangGraph variant, and — critically — why you should also be running a private internal benchmark mined from your own repo's PRs. The internal benchmark is free to run, more representative of your production code, and what frontier labs almost certainly run alongside the public number.
What SWE-bench Actually Is
SWE-bench is a benchmark of real GitHub issues and PRs from large open-source Python projects (django, sympy, sphinx, scikit-learn, flask, requests, matplotlib, astropy, xarray, pylint, pytest, seaborn). Each instance gives the agent:
- A repository at a specific pre-fix commit.
- An issue text describing the bug or feature.
- A hidden test suite that the merged fix made pass.
The agent must produce a patch that, when applied, makes the hidden tests pass without breaking the existing tests. The grading is execution-based — same philosophy as our execution-based code eval guide, just at PR scale instead of function scale.
There are three variants you will see quoted:
| Variant | Instances | Difficulty | Typical full-suite cost | When to use |
|---|---|---|---|---|
| SWE-bench (full) | 2,294 | Mixed; some unsolvable | $400–$2000 | Public leaderboard chasing |
| SWE-bench Verified | 500 | Human-verified solvable | $80–$400 | Real model comparison |
| SWE-bench Lite | 300 | Filtered to file-localized fixes | $40–$200 | Internal iteration |
Use Verified, not full. The original SWE-bench has roughly 8–12% of instances that the dataset authors later flagged as ambiguous, broken, or actually unsolvable from the issue text alone (e.g., the "ground truth" PR depended on context not in the issue). SWE-bench Verified is the OpenAI-led human-curated subset that strips those out. Every credible 2025+ frontier-lab announcement quotes Verified for this reason. Quoting full-suite numbers in 2026 is a tell that someone has not been paying attention.
Lite is for development, not for marketing. Lite filters to instances where the fix touches one file. That makes it cheaper and faster, but it systematically rewards agents that are good at single-file edits and underweights agents that are good at cross-file reasoning. Use it as your inner-loop eval; never compare across labs on it.
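All three variants are published on the Hugging Face Hub, so you can sanity-check the instance counts and poke at individual issues before committing to a run. A minimal sketch (dataset names as published by the SWE-bench team; the evaluation split is `test` for all three):

```python
from datasets import load_dataset

# The three public variants on the Hugging Face Hub.
for name in (
    "princeton-nlp/SWE-bench",           # full: mixed difficulty, some unsolvable
    "princeton-nlp/SWE-bench_Verified",  # human-verified solvable subset
    "princeton-nlp/SWE-bench_Lite",      # single-file fixes, fast inner loop
):
    ds = load_dataset(name, split="test")
    print(f"{name}: {len(ds)} instances")
```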
The Harness, Demystified
```mermaid
flowchart TD
    A[Pick instances: Verified or Lite] --> B[For each instance: clone repo at base commit]
    B --> C[Spin Docker container with repo + deps]
    C --> D[Pass issue text to your agent]
    D --> E[Agent reads files, edits files, writes patch]
    E --> F[Apply patch to repo]
    F --> G[Run hidden test suite]
    G --> H{Did target tests pass?}
    H -->|yes + no regressions| I[Mark RESOLVED]
    H -->|tests still fail| J[Mark FAIL]
    H -->|broke other tests| K[Mark REGRESSION]
    I --> L[Aggregate: % resolved]
    J --> L
    K --> L
    style I fill:#cfc
    style J fill:#fcc
    style K fill:#fcc
```
Figure 1 — The official harness. Each instance is a sealed Docker container. The agent never sees the hidden tests. The grade is binary per instance.
The official harness is the `SWE-bench/SWE-bench` repository. It does three things you should not try to reimplement:
- Maintains a registry of pre-built Docker images per instance, so you do not spend a week getting sympy 1.7's test suite to install.
- Applies your patch and grades it deterministically.
- Distinguishes "your agent solved it" from "your agent broke other tests."
You implement one thing: a function that, given an instance, produces a unified diff patch.
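Concretely, each instance is a flat record, and your side of the contract reduces to one function over it. A sketch of the shape, with field names as they appear in the dataset (the hidden tests live in fields like `FAIL_TO_PASS`, which the harness consumes and your agent should never read):

```python
def generate_patch(instance: dict) -> str:
    """Given one SWE-bench instance, return a unified diff (the model_patch)."""
    repo = instance["repo"]                  # e.g. "django/django"
    base_commit = instance["base_commit"]    # pre-fix commit the repo is checked out at
    issue = instance["problem_statement"]    # the issue text your agent gets to see
    # ... run your agent against the checked-out repo here ...
    return ""  # your agent's `git diff` output goes here
```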
Running SWE-bench-Lite Against a Custom Agent
The submission contract is simple: produce a JSONL where each line is `{"instance_id": "...", "model_patch": "diff --git ..."}`. Then the harness scores it.
Step one — install and pull instance images:
```bash
pip install swebench

# Scoring the gold (reference) patches on a couple of instances forces the
# prebuilt Docker images to be pulled and sanity-checks grading end to end.
# Drop --instance_ids to pull the full Lite set.
python -m swebench.harness.run_evaluation \
    --predictions_path gold \
    --max_workers 8 \
    --run_id setup-pull \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --split test \
    --instance_ids astropy__astropy-12907 django__django-11099
```
That run fetches the prebuilt Docker images and, because the predictions are the gold patches, doubles as a sanity check that grading works end to end. On a first full pull this is the longest step (~40 minutes for the whole Lite set on a fast pipe).
Step two — your agent. Here is the OpenAI Agents SDK shape; the LangGraph shape is structurally identical (one tool node, one model node, one conditional edge). The agent runs inside the instance's Docker container so file edits land where the harness expects them.
```python
import json
import subprocess
from pathlib import Path

from agents import Agent, Runner, function_tool

REPO = Path("/testbed")  # the harness mounts the repo here


@function_tool
def list_files(directory: str = ".") -> str:
    """List files in a directory, relative to repo root."""
    # Naive recursive listing; on large repos you may want to cap or filter it.
    p = REPO / directory
    return "\n".join(
        sorted(str(f.relative_to(REPO)) for f in p.rglob("*") if f.is_file())
    )


@function_tool
def read_file(path: str) -> str:
    """Read a file relative to repo root."""
    return (REPO / path).read_text()


@function_tool
def write_file(path: str, content: str) -> str:
    """Overwrite a file relative to repo root."""
    (REPO / path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"


@function_tool
def run_tests(test_path: str = "") -> str:
    """Run the project's test suite. Empty test_path = run all."""
    cmd = ["python", "-m", "pytest", "-x", "-q"]
    if test_path:
        cmd.append(test_path)
    p = subprocess.run(cmd, cwd=REPO, capture_output=True, text=True, timeout=300)
    return f"exit={p.returncode}\n{p.stdout[-4000:]}\n{p.stderr[-2000:]}"


swe_agent = Agent(
    name="swe-agent",
    # Non-OpenAI model names need the Agents SDK's LiteLLM integration configured.
    model="claude-sonnet-4-5-20251022",
    instructions=(
        "You are fixing a bug in a real Python repository. "
        "Read the issue, explore the relevant files, make a minimal patch, "
        "run the test suite, iterate up to 8 times. Do not edit tests. "
        "When done, return the final summary."
    ),
    tools=[list_files, read_file, write_file, run_tests],
)


def solve_instance(instance: dict) -> str:
    prompt = f"Repo: {instance['repo']}\nIssue:\n{instance['problem_statement']}"
    Runner.run_sync(swe_agent, prompt, max_turns=20)
    # The patch is whatever the agent left in the working tree.
    diff = subprocess.check_output(["git", "diff"], cwd=REPO, text=True)
    return diff
```
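For reference, here is the LangGraph variant of the same loop, using the prebuilt ReAct helper. It is a minimal sketch: only two of the four tools are shown (the other two are rebound the same way), the `prompt=` argument assumes a recent langgraph release, and the model wiring assumes `langchain-anthropic` is installed and configured.

```python
import subprocess
from pathlib import Path

from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent

REPO = Path("/testbed")


@tool
def read_file(path: str) -> str:
    """Read a file relative to the repo root."""
    return (REPO / path).read_text()


@tool
def write_file(path: str, content: str) -> str:
    """Overwrite a file relative to the repo root."""
    (REPO / path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"


# list_files and run_tests are rebound the same way; the bodies are unchanged.
graph = create_react_agent(
    model=ChatAnthropic(model="claude-sonnet-4-5-20251022"),
    tools=[read_file, write_file],
    prompt=(
        "You are fixing a bug in a real Python repository. "
        "Make a minimal patch, run the tests, do not edit tests."
    ),
)


def solve_instance_langgraph(instance: dict) -> str:
    prompt = f"Repo: {instance['repo']}\nIssue:\n{instance['problem_statement']}"
    graph.invoke({"messages": [("user", prompt)]}, config={"recursion_limit": 40})
    return subprocess.check_output(["git", "diff"], cwd=REPO, text=True)
```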
Step three — the prediction file:
```python
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

preds = []
for instance in ds:
    try:
        patch = solve_instance(instance)
    except Exception:
        patch = ""  # shows up as an empty patch in the report, not as resolved
    preds.append({
        "instance_id": instance["instance_id"],
        "model_patch": patch,
        "model_name_or_path": "swe-agent-sonnet45",
    })

with open("preds.jsonl", "w") as f:
    for p in preds:
        f.write(json.dumps(p) + "\n")
```
Step four — score:
```bash
python -m swebench.harness.run_evaluation \
    --predictions_path preds.jsonl \
    --max_workers 8 \
    --run_id sonnet45-lite-2026-05-06 \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --split test
```
The harness writes a report JSON with `resolved`, `unresolved`, `error`, and (critically) `empty_patch` counts. Quote the resolved-rate, but read all four in the post-mortem; a high empty-patch rate usually means your agent is timing out or hitting tool-call limits before producing a diff.
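A sketch of that post-mortem read, assuming the report landed in the working directory with the harness's usual `<model_name>.<run_id>.json` naming (key names as of recent harness versions; check the file if your pinned tag differs):

```python
import json
from pathlib import Path

# Adjust the glob to your own run_id.
report_path = next(Path(".").glob("*.sonnet45-lite-2026-05-06.json"))
report = json.loads(report_path.read_text())

total = report["total_instances"]
resolved = report["resolved_instances"]
empty = report["empty_patch_instances"]
errors = report["error_instances"]

print(f"resolved:    {resolved}/{total} ({resolved / total:.1%})")
print(f"empty patch: {empty}/{total}  <- timeouts and turn-limit hits live here")
print(f"errors:      {errors}/{total}")
```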
Cost Reality
Numbers from a real run last month on Lite (300 instances, `claude-sonnet-4-5-20251022`, 8 parallel workers, a 20-turn budget per instance, a median of 8 tool calls per instance):
| Item | Cost |
|---|---|
| Anthropic API (input + output tokens, ~3.4M tokens) | $54.20 |
| Compute (Modal, 300 × ~4 min sandbox time × 0.5 vCPU) | $11.80 |
| Storage + image pulls (one-time amortized) | $2.00 |
| Total for one Lite pass | ~$68 |
Verified at the same settings clocks in at roughly $130–$160. For full SWE-bench at the same settings I have seen totals anywhere from $480 to $1,100, depending on agent loop length. If your agent has a 50-turn budget instead of 20, multiply by ~2.
If you are submitting to the public leaderboard, do Verified. If you are choosing a model for production, run Verified once, then run Lite weekly as your tracking eval. Running full SWE-bench is mostly defensible only as a marketing exercise.
Why You Also Need an Internal Benchmark
This is the part most teams skip and then regret. SWE-bench's test repos are a specific slice of Python OSS — heavy on scientific and web frameworks, light on proprietary patterns, and entirely missing your codebase's idioms. A model that scores 55% on Verified might score 30% on your repo because:
- Your repo uses internal libraries the model has never seen.
- Your repo's test patterns differ (custom test runners, fixtures, mocks).
- Your bugs are more about config and integration than algorithmic correctness.
The fix is to mine your own. Recipe (a minimal mining sketch follows the list):
- Pick a 6-month window of merged PRs in your repo.
- Filter to PRs that closed an issue and added at least one new test in the same diff.
- For each, snapshot the parent commit and isolate the bug-fix diff and the new tests.
- Hide the fix; treat the issue text + new tests as the eval instance.
- Wrap each instance in a Docker image with your repo's dependencies pinned.
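Here is that mining sketch, over a local clone. It assumes your PRs land as merge commits whose messages reference an issue ("closes #" / "fixes #"); squash-merge workflows need the GitHub API instead, and the repo path and filename filters are placeholders for your own conventions.

```python
import subprocess
from pathlib import Path

REPO = Path("/path/to/your/repo")  # hypothetical local clone


def git(*args: str) -> str:
    return subprocess.check_output(["git", *args], cwd=REPO, text=True)


# Merge commits from the last 6 months whose message references an issue.
candidates = git(
    "log", "--since=6 months ago", "--merges",
    "--grep=[Cc]loses #", "--grep=[Ff]ixes #", "--format=%H",
).split()

instances = []
for sha in candidates:
    files = git("diff", "--name-only", f"{sha}^1", sha).splitlines()
    test_files = [f for f in files if "test" in f.lower()]
    src_files = [f for f in files if f not in test_files]
    if test_files and src_files:  # fix + new/changed tests in the same PR
        instances.append({
            "base_commit": git("rev-parse", f"{sha}^1").strip(),  # pre-fix snapshot
            "fix_commit": sha,
            "test_files": test_files,    # become the hidden tests for this instance
            "source_files": src_files,   # what the agent is expected to touch
        })

print(f"mined {len(instances)} candidate instances")
```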
We did this for our voice and chat agent platform backend and ended up with 84 instances. Cost to run it: roughly $14 per pass on Sonnet 4.5. Signal: an order of magnitude more relevant than SWE-bench for our actual model selection. The same pattern is what we use across the healthcare, real estate, sales, salon, IT helpdesk, and after-hours verticals — each vertical's agent gets its own benchmark mined from its own commit history.
This is also what frontier labs do internally. You will not see it in announcements because it is proprietary, but every serious model team I have spoken to has a private code benchmark larger than the public ones they quote.
How SWE-bench Compares to Other Public Code Benchmarks
For context, here is the landscape as of May 2026. Numbers are approximate state-of-the-art pass rates for the strongest frontier models; treat them as order-of-magnitude.
| Benchmark | Format | Size | SOTA pass-rate (frontier model) | What it measures |
|---|---|---|---|---|
| HumanEval | Function completion | 164 problems | ~95% | Basic syntax + algorithmic reasoning |
| MBPP | Function completion | ~1,000 problems | ~92% | Python idioms, broader vocab than HumanEval |
| LiveCodeBench | Competitive programming | Continuously updated | ~50–60% | Algorithmic depth, contamination-resistant |
| BigCodeBench | API-heavy function tasks | ~1,140 problems | ~50–55% | Library usage breadth |
| SWE-bench Lite | Single-file PR fix | 300 instances | ~50–65% | Codebase navigation + minimal patching |
| SWE-bench Verified | Real PR fix (verified solvable) | 500 instances | ~55–70% | Realistic OSS bug-fixing |
| Internal mined benchmark | Real PR fix in your repo | 50–500 instances | varies | What actually predicts your prod quality |
HumanEval and MBPP are saturated; in 2026 they tell you nothing about a frontier model's coding ability that you did not already know. LiveCodeBench is useful because it is contamination-resistant (problems are pulled from contests after the model's training cutoff). BigCodeBench is the one to watch if your agent is library-heavy. SWE-bench Verified is the public number that maps closest to "can this model do real engineering work." Your internal benchmark is the only one that maps to your engineering work.
Operational Lessons
Things I learned the hard way running this monthly for a year:
- Pin the harness commit. SWE-bench's harness gets updates that occasionally re-grade existing instances. Pin to a specific tag so your historical numbers stay comparable.
- Pin the model snapshot. `claude-sonnet-4-5-20251022`, not `claude-sonnet-4-5`. Same reason as our continuous evaluation gate for non-code agents.
- Cap turns and tokens per instance. Without a cap, a misbehaving agent will blow your API budget on one stuck instance. We cap at 20 turns and 200K tokens per instance.
- Run with `--max_workers 8` minimum. Serial runs take 6+ hours on Lite. Eight parallel workers brings it to ~45 minutes and the harness handles isolation cleanly.
- Save the full trajectory, not just the patch. When an instance fails, you want to know whether the agent never looked at the right file or looked at it and made a wrong edit. The trajectory is your post-mortem material. The principles in our trace-to-production-fix workflow apply; a saving sketch follows this list.
- Decompose the failure modes. "Resolved 50%" is a number. "Of the 50% unresolved, 18% never produced a patch, 22% produced a patch that broke other tests, 10% produced a patch that did nothing" is a roadmap.
- Do not iterate the agent on the eval set. If you tune your prompts and tools to fit Verified, you have overfit to a specific 500 instances. Iterate on Lite or on a held-out portion of your internal benchmark; report Verified once per release candidate.
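For the trajectory point above, here is what we persist per instance — a sketch that swaps in for `solve_instance` from step two and reuses `swe_agent` and `REPO` from that block; the `to_input_list()` accessor is the Agents SDK's, the file layout is ours.

```python
import json
import subprocess
from pathlib import Path

TRAJ_DIR = Path("trajectories")  # one JSON file per instance
TRAJ_DIR.mkdir(exist_ok=True)


def solve_and_log(instance: dict) -> str:
    """Drop-in for solve_instance that also saves the full agent trajectory."""
    prompt = f"Repo: {instance['repo']}\nIssue:\n{instance['problem_statement']}"
    result = Runner.run_sync(swe_agent, prompt, max_turns=20)

    # Persist every message and tool call, not just the final diff, so a failed
    # instance can be replayed during the post-mortem.
    (TRAJ_DIR / f"{instance['instance_id']}.json").write_text(
        json.dumps(
            {
                "trajectory": result.to_input_list(),
                "final_output": str(result.final_output),
            },
            default=str,  # tolerate non-JSON-native items in the trajectory
        )
    )
    return subprocess.check_output(["git", "diff"], cwd=REPO, text=True)
```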
Honest Tradeoffs
SWE-bench is the best public coding-agent benchmark we have, and it has known limits:
- Python-only. If your agent works on TypeScript, Go, or Rust, SWE-bench is silent. SWE-bench-multilingual exists but is much smaller and less mature.
- OSS-shaped problems. Real proprietary work has more config drift, more cross-service reasoning, more "the bug is actually in someone else's repo." None of that is in SWE-bench.
- Heavily contaminated by 2026. Every frontier model has likely seen these repos and these issues during training. The pass-rate is a real signal but it is an upper bound on what a fresh-eyes agent could do.
- The harness is heavy. First-time setup takes a day. Image pulls are bandwidth-intensive. Plan capacity.
The right disposition: SWE-bench Verified is a credibility floor. If a vendor cannot quote a Verified number for their coding agent, ask why. But the number you should make decisions on is the one you ran yourself, on your own repo, at your own cost.
Frequently Asked Questions
Should I report Lite numbers externally?
Generally no. Use Lite as a fast inner loop. Report Verified for any external claim. The community has converged on Verified as the credible benchmark, and a Lite-only number reads as either uninformed or evasive.
How long does a Verified run take?
Roughly 90–150 minutes wall-clock with 8 workers and a 20-turn budget per instance. If you are getting under 60 minutes, your agent is probably timing out a lot — check the empty-patch rate.
Can I use SWE-bench to compare framework choices (Agents SDK vs LangGraph)?
You can, but the framework rarely matters at this granularity. The dominant variables are: model, agent loop design (turn budget, tool set), and prompt. We have run the same agent under both OpenAI Agents SDK and LangGraph and seen Verified scores within 1.5 points of each other. Pick the framework that matches your team's existing patterns; do not pick it based on benchmark theater.
What about agentic harnesses like Aider, OpenHands, or SWE-agent?
Those are full-featured open-source coding-agent harnesses with their own scaffolding (multi-step planning, repo maps, custom prompts). They post the strongest public Verified scores in part because they encode a lot of inductive bias about the task. If you are building a coding agent, study their prompts and trajectories — they are some of the best free design references in the field.
Do I need a GPU?
No. SWE-bench harness containers run on CPU. The model lives behind the API. The bottleneck is API throughput and Docker startup, not local compute.
How does this connect to the rest of agent eval?
Coding-agent eval is a special case of the broader execution-based-eval philosophy in our code-writing agent guide, which itself is a special case of the closed-loop trace-driven workflow we run for all CallSphere agents in our trace-to-fix piece. Same instinct, three different scopes: function, PR, and live conversation. The ones that ship use all three.