By Sagar Shankaran, Founder of CallSphere
A practical guide to running SWE-bench (and its Verified and Lite variants) on your own coding agent, plus the cheaper internal benchmarks that actually move the needle.
Key takeaways
Every frontier-lab announcement in the last 18 months (Claude 3.5 Sonnet, GPT-4.1, GPT-5, Claude Sonnet 4.5) quotes a SWE-bench number. Most engineering teams I talk to have never actually run it. The bar is lower than you think: a working SWE-bench Lite run on a custom agent fits in an afternoon and a few hundred dollars of compute. The bar to do it responsibly, with the right harness, the right cost ceiling, and the right interpretation, is higher. This post walks through what SWE-bench actually measures, how the Lite, Verified, and full variants differ, the working harness invocation, a custom-agent submission shape using the OpenAI Agents SDK and a LangGraph variant, and, critically, why you should also be running a private internal benchmark mined from your own repo's PRs. The internal benchmark is cheap to run, more representative of your production code, and what frontier labs almost certainly run alongside the public number.
SWE-bench is a benchmark of real GitHub issues and PRs from large open-source Python projects (django, sympy, sphinx, scikit-learn, flask, requests, matplotlib, astropy, xarray, pylint, pytest, seaborn). Each instance gives the agent the repository checked out at the PR's base commit and the text of the issue.
The agent must produce a patch that, when applied, makes the hidden tests pass without breaking the existing tests. The grading is execution-based — same philosophy as our execution-based code eval guide, just at PR scale instead of function scale.
There are three variants you will see quoted:
| Variant | Instances | Difficulty | Typical full-suite cost | When to use |
|---|---|---|---|---|
| SWE-bench (full) | 2,294 | Mixed; some unsolvable | $400–$2000 | Public leaderboard chasing |
| SWE-bench Verified | 500 | Human-verified solvable | $80–$400 | Real model comparison |
| SWE-bench Lite | 300 | Filtered to file-localized fixes | $40–$200 | Internal iteration |
Use Verified, not full. The original SWE-bench has roughly 8–12% of instances that the dataset authors later flagged as ambiguous, broken, or actually unsolvable from the issue text alone (e.g., the "ground truth" PR depended on context not in the issue). SWE-bench Verified is the OpenAI-led human-curated subset that strips those out. Every credible 2025+ frontier-lab announcement quotes Verified for this reason. Quoting full-suite numbers in 2026 is a tell that someone has not been paying attention.
Lite is for development, not for marketing. Lite filters to instances where the fix touches one file. That makes it cheaper and faster, but it systematically rewards agents that are good at single-file edits and underweights agents that are good at cross-file reasoning. Use it as your inner-loop eval; never compare across labs on it.
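To see whether a given patch would fall on the single-file side of that line, you can count the files a unified diff touches. This is a minimal sketch, not the filter the dataset authors used; `files_touched` and `is_single_file` are hypothetical helper names, and the regex assumes ordinary paths without embedded ` b/` sequences.

```python
import re

# Matches the header line git emits for each file in a unified diff.
_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def files_touched(patch: str) -> set[str]:
    """Return the post-image paths named in the patch's `diff --git` headers."""
    return {m.group(2) for m in _DIFF_HEADER.finditer(patch)}


def is_single_file(patch: str) -> bool:
    """True when the patch edits exactly one file."""
    return len(files_touched(patch)) == 1
```

Running this over your agent's predictions gives a quick read on how often it attempts cross-file edits, which is the behavior Lite underweights.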
```mermaid
flowchart TD
    A[Pick instances: Verified or Lite] --> B[For each instance: clone repo at base commit]
    B --> C[Spin Docker container with repo + deps]
    C --> D[Pass issue text to your agent]
    D --> E[Agent reads files, edits files, writes patch]
    E --> F[Apply patch to repo]
    F --> G[Run hidden test suite]
    G --> H{Did target tests pass?}
    H -->|yes + no regressions| I[Mark RESOLVED]
    H -->|tests still fail| J[Mark FAIL]
    H -->|broke other tests| K[Mark REGRESSION]
    I --> L[Aggregate: % resolved]
    J --> L
    K --> L
    style I fill:#cfc
    style J fill:#fcc
    style K fill:#fcc
```
Figure 1 — The official harness. Each instance is a sealed Docker container. The agent never sees the hidden tests. The grade is binary per instance.
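The decision at the bottom of Figure 1 can be sketched as a small function. It assumes per-test outcomes are available as a name-to-passed mapping. `FAIL_TO_PASS` and `PASS_TO_PASS` are the dataset's field names for the target tests and the must-not-break tests; `grade`, `resolved_rate`, and the three labels mirror the figure rather than the harness's own report format, and checking regressions first is an arbitrary choice for the case where both conditions fail.

```python
def grade(fail_to_pass: list[str], pass_to_pass: list[str],
          results: dict[str, bool]) -> str:
    """Classify one instance from per-test outcomes after the patch is applied.

    A test that is missing from `results` is treated as failed.
    """
    targets_ok = all(results.get(t, False) for t in fail_to_pass)
    no_regressions = all(results.get(t, False) for t in pass_to_pass)
    if not no_regressions:
        return "REGRESSION"
    if not targets_ok:
        return "FAIL"
    return "RESOLVED"


def resolved_rate(labels: list[str]) -> float:
    """Fraction of instances labelled RESOLVED (0.0 for an empty run)."""
    return labels.count("RESOLVED") / len(labels) if labels else 0.0
```

Only the resolved fraction reaches the headline number; the other two labels are useful for diagnosing why an instance was lost.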
The official harness is the `SWE-bench/SWE-bench` repository. It does three things you should not try to reimplement:
You implement one thing: a function that, given an instance, produces a unified diff patch.
The submission contract is simple: produce a JSONL where each line is `{"instance_id": "...", "model_patch": "diff --git ..."}`. Then the harness scores it.
Step one — install and pull instance images:
```bash
pip install swebench

python -m swebench.harness.run_evaluation \
  --predictions_path /dev/null \
  --max_workers 8 \
  --run_id setup-pull \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --split test \
  --instance_ids astropy__astropy-12907 django__django-11099
```
That triggers the harness to fetch the prebuilt Docker images. On a first run this is the longest step (~40 minutes for Lite on a fast pipe).
Step two — your agent. Here is the OpenAI Agents SDK shape; the LangGraph shape is structurally identical (one tool node, one model node, one conditional edge). The agent runs inside the instance's Docker container so file edits land where the harness expects them.
```python
import json
import subprocess
from pathlib import Path

from agents import Agent, Runner, function_tool

REPO = Path("/testbed")  # the harness mounts the repo here


@function_tool
def list_files(directory: str = ".") -> str:
    """List files in a directory, relative to repo root."""
    p = REPO / directory
    return "\n".join(sorted(str(f.relative_to(REPO)) for f in p.rglob("*") if f.is_file()))


@function_tool
def read_file(path: str) -> str:
    """Read a file relative to repo root."""
    return (REPO / path).read_text()


@function_tool
def write_file(path: str, content: str) -> str:
    """Overwrite a file relative to repo root."""
    (REPO / path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"


@function_tool
def run_tests(test_path: str = "") -> str:
    """Run the project's test suite. Empty test_path = run all."""
    cmd = ["python", "-m", "pytest", "-x", "-q"]
    if test_path:
        cmd.append(test_path)
    p = subprocess.run(cmd, cwd=REPO, capture_output=True, text=True, timeout=300)
    # Return only the tail of each stream so long test logs do not flood the context.
    return f"exit={p.returncode}\n{p.stdout[-4000:]}\n{p.stderr[-2000:]}"


# Assumes the SDK is configured with a provider that can serve this model id.
swe_agent = Agent(
    name="swe-agent",
    model="claude-sonnet-4-5-20251022",
    instructions=(
        "You are fixing a bug in a real Python repository. "
        "Read the issue, explore the relevant files, make a minimal patch, "
        "run the test suite, iterate up to 8 times. Do not edit tests. "
        "When done, return the final summary."
    ),
    tools=[list_files, read_file, write_file, run_tests],
)


def solve_instance(instance: dict) -> str:
    prompt = f"Repo: {instance['repo']}\nIssue:\n{instance['problem_statement']}"
    Runner.run_sync(swe_agent, prompt, max_turns=20)
    # `git diff` covers edits to tracked files only; newly created files are not included.
    diff = subprocess.check_output(["git", "diff"], cwd=REPO, text=True)
    return diff
```
Step three — the prediction file:
```python
import json

from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
preds = []
for instance in ds:
    # solve_instance (step two) must run inside that instance's own container.
    try:
        patch = solve_instance(instance)
    except Exception:
        # Record an empty patch so the instance still appears in the report.
        patch = ""
    preds.append({
        "instance_id": instance["instance_id"],
        "model_patch": patch,
        "model_name_or_path": "swe-agent-sonnet45",
    })

with open("preds.jsonl", "w") as f:
    for p in preds:
        f.write(json.dumps(p) + "\n")
```
Step four — score:
```bash
python -m swebench.harness.run_evaluation \
  --predictions_path preds.jsonl \
  --max_workers 8 \
  --run_id sonnet45-lite-2026-05-06 \
  --dataset_name princeton-nlp/SWE-bench_Lite \
  --split test
```
The harness writes a report JSON with `resolved`, `unresolved`, `error`, and (critically) `empty_patch` counts. Quote the resolved-rate, but read all four in the post-mortem; a high empty-patch rate usually means your agent is timing out or hitting tool-call limits before producing a diff.
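You can catch a high empty-patch rate before paying for a scoring run by checking the predictions file directly. A minimal sketch, assuming the JSONL shape described earlier (`instance_id` and `model_patch` per line); `empty_patch_rate` is a hypothetical helper, not part of the harness.

```python
import json


def empty_patch_rate(lines: list[str]) -> float:
    """Fraction of JSONL prediction lines whose `model_patch` is blank."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return 0.0
    empty = sum(1 for r in records if not (r.get("model_patch") or "").strip())
    return empty / len(records)


# Usage, once preds.jsonl exists:
#     with open("preds.jsonl") as f:
#         print(f"empty patches: {empty_patch_rate(f.readlines()):.1%}")
```

If this number is high, look at turn budgets and tool timeouts before looking at the model.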
Numbers from a real run, last month, on Lite (300 instances, `claude-sonnet-4-5-20251022`, 8 parallel workers, 20-turn budget per instance, 8-tool-call median):
| Item | Cost |
|---|---|
| Anthropic API (input + output tokens, ~3.4M tokens) | $54.20 |
| Compute (Modal, 300 × ~4 min sandbox time × 0.5 vCPU) | $11.80 |
| Storage + image pulls (one-time amortized) | $2.00 |
| Total for one Lite pass | ~$68 |
Verified at the same settings comes to roughly $130–$160. For full SWE-bench with `claude-sonnet-4-5-20251022`, I have seen between $480 and $1,100 depending on agent loop length. If your agent has a 50-turn budget instead of 20, multiply by ~2.
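The arithmetic behind these figures is easy to reuse for budgeting. The sketch below takes the line items from the Lite cost table and extrapolates linearly; `per_instance_cost` and `extrapolate` are hypothetical helpers, and linear scaling is an assumption. Applied to 500 instances it lands near $113, below the $130–$160 quoted for Verified, so treat the linear figure as a floor rather than a forecast.

```python
# Line items from the Lite cost table (USD, one pass over 300 instances).
LITE_COSTS = {"api": 54.20, "compute": 11.80, "storage": 2.00}
LITE_INSTANCES = 300


def per_instance_cost(costs: dict[str, float], n_instances: int) -> float:
    """Average all-in cost of one instance for a completed pass."""
    return sum(costs.values()) / n_instances


def extrapolate(n_instances: int, per_instance: float,
                turn_multiplier: float = 1.0) -> float:
    """Naive linear estimate; turn_multiplier scales for longer agent loops."""
    return n_instances * per_instance * turn_multiplier


lite_unit = per_instance_cost(LITE_COSTS, LITE_INSTANCES)
print(f"per instance: ${lite_unit:.3f}")
print(f"500 instances, linear: ${extrapolate(500, lite_unit):.0f}")
print(f"Lite at a 50-turn budget (~2x): ${extrapolate(300, lite_unit, 2.0):.0f}")
```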
If you are submitting to the public leaderboard, do Verified. If you are choosing a model for production, run Verified once, then run Lite weekly as your tracking eval. Running full SWE-bench is mostly defensible only as a marketing exercise.
This is the part most teams skip and then regret. SWE-bench's test repos are a specific slice of Python OSS: heavy on scientific and web frameworks, light on proprietary patterns, and entirely lacking your codebase's idioms. A model that scores 55% on Verified might score 30% on your repo because:
The fix is to mine your own. Recipe:
We did this for the voice and chat agent platform backend and ended up with 84 instances. Cost to run it: roughly $14 per pass on Sonnet 4.5. Signal: an order of magnitude more relevant than SWE-bench for our actual model selection. The same pattern is what we use across the healthcare, real estate, sales, salon, IT helpdesk, and after-hours verticals — each vertical's agent gets its own benchmark mined from its own commit history.
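One plausible shape for the mining step is a filter over merged PRs that keeps only those usable as benchmark instances: the PR has a written problem description, changes source files, and adds or changes tests that can serve as the hidden tests. This is a sketch under assumptions, not our production pipeline. The input dict keys (`number`, `base_commit`, `issue_text`, `files`), the `Candidate` type, and the test-file heuristic are all hypothetical and would need adapting to your repo layout.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    instance_id: str
    base_commit: str
    problem_statement: str
    source_files: list[str]
    test_files: list[str]


def is_test_file(path: str) -> bool:
    """Heuristic: anything under a tests/ directory or named like a pytest module."""
    name = path.rsplit("/", 1)[-1]
    return "/tests/" in f"/{path}" or name.startswith("test_") or name.endswith("_test.py")


def mine_candidates(prs: list[dict]) -> list[Candidate]:
    """Keep PRs that have issue text, a source change, and a test change."""
    out = []
    for pr in prs:
        tests = [f for f in pr["files"] if is_test_file(f)]
        source = [f for f in pr["files"] if not is_test_file(f)]
        issue = (pr.get("issue_text") or "").strip()
        if not tests or not source or not issue:
            continue
        out.append(Candidate(
            instance_id=f"internal-{pr['number']}",
            base_commit=pr["base_commit"],
            problem_statement=issue,
            source_files=source,
            test_files=tests,
        ))
    return out
```

The test files from each kept PR become the hidden tests, and the source-file changes become the reference patch the agent never sees.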
This is also what frontier labs do internally. You will not see it in announcements because it is proprietary, but every serious model team I have spoken to has a private code benchmark larger than the public ones they quote.
For context, here is the landscape as of May 2026. Numbers are approximate state-of-the-art for the better-positioned frontier models; treat them as order-of-magnitude.
| Benchmark | Format | Size | SOTA pass-rate (frontier model) | What it measures |
|---|---|---|---|---|
| HumanEval | Function completion | ~150 problems | ~95% | Basic syntax + algorithmic reasoning |
| MBPP | Function completion | ~1,000 problems | ~92% | Python idioms, broader vocab than HumanEval |
| LiveCodeBench | Competitive programming | Continuously updated | ~50–60% | Algorithmic depth, contamination-resistant |
| BigCodeBench | API-heavy function tasks | ~1,140 problems | ~50–55% | Library usage breadth |
| SWE-bench Lite | Single-file PR fix | 300 instances | ~50–65% | Codebase navigation + minimal patching |
| SWE-bench Verified | Real PR fix (verified solvable) | 500 instances | ~55–70% | Realistic OSS bug-fixing |
| Internal mined benchmark | Real PR fix in your repo | 50–500 instances | varies | What actually predicts your prod quality |
HumanEval and MBPP are saturated; in 2026 they tell you nothing about a frontier model's coding ability that you did not already know. LiveCodeBench is useful because it is contamination-resistant (problems are pulled from contests after the model's training cutoff). BigCodeBench is the one to watch if your agent is library-heavy. SWE-bench Verified is the public number that maps closest to "can this model do real engineering work." Your internal benchmark is the only one that maps to your engineering work.
Things I learned the hard way running this monthly for a year:
SWE-bench is the best public coding-agent benchmark we have, and it has known limits:
The right disposition: SWE-bench Verified is a credibility floor. If a vendor cannot quote a Verified number for their coding agent, ask why. But the number you should make decisions on is the one you ran yourself, on your own repo, at your own cost.
Generally no. Use Lite as a fast inner loop. Report Verified for any external claim. The community has converged on Verified as the credible benchmark, and a Lite-only number reads as either uninformed or evasive.
Roughly 90–150 minutes wall-clock with 8 workers and a 20-turn budget per instance. If you are getting under 60 minutes, your agent is probably timing out a lot — check the empty-patch rate.
You can, but the framework rarely matters at this granularity. The dominant variables are: model, agent loop design (turn budget, tool set), and prompt. We have run the same agent under both OpenAI Agents SDK and LangGraph and seen Verified scores within 1.5 points of each other. Pick the framework that matches your team's existing patterns; do not pick it based on benchmark theater.
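A quick way to judge whether a gap of that size means anything is to compare it with the sampling noise of a resolved rate on 500 instances. The sketch below uses the binomial standard error and treats the two runs as independent, which is a simplification since both runs share the same instances. The function names are hypothetical, and the 60.0% and 61.5% rates in the usage line are illustrative, not measured.

```python
import math


def resolved_rate_stderr(rate: float, n_instances: int) -> float:
    """Binomial standard error of a resolved rate measured on n instances."""
    return math.sqrt(rate * (1.0 - rate) / n_instances)


def gap_in_stderrs(rate_a: float, rate_b: float, n_instances: int) -> float:
    """Gap between two runs, in units of the standard error of their difference."""
    pooled = (rate_a + rate_b) / 2.0
    se_diff = math.sqrt(2.0 * pooled * (1.0 - pooled) / n_instances)
    return abs(rate_a - rate_b) / se_diff


# Illustrative rates 1.5 points apart on Verified (500 instances).
print(f"{gap_in_stderrs(0.600, 0.615, 500):.2f} standard errors apart")
```

A gap well under one standard error is not evidence that one framework is better than the other.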
Those are full-featured open-source coding-agent harnesses with their own scaffolding (multi-step planning, repo maps, custom prompts). They post the strongest public Verified scores in part because they encode a lot of inductive bias about the task. If you are building a coding agent, study their prompts and trajectories — they are some of the best free design references in the field.
No. SWE-bench harness containers run on CPU. The model lives behind the API. The bottleneck is API throughput and Docker startup, not local compute.
Coding-agent eval is a special case of the broader execution-based-eval philosophy in our code-writing agent guide, which itself is a special case of the closed-loop trace-driven workflow we run for all CallSphere agents in our trace-to-fix piece. Same instinct, three different scopes: function, PR, and live conversation. Teams that ship use all three.
Sagar Shankaran is the founder of CallSphere, where he builds production AI voice and chat agents deployed across healthcare, hospitality, real estate, and home services. He writes about agentic AI, LLM engineering, and shipping voice agents that handle real calls in production.
© 2026 CallSphere LLC. All rights reserved.