A real Claude agent build: from problem to shipped
A realistic end-to-end Claude agent build — from a messy triage problem through evals, code, and shadow mode to a gated production rollout. Real tradeoffs.
Most agent writing lives in the abstract: principles, patterns, diagrams. This post does the opposite. It walks one realistic build end to end — a support-triage agent — from the messy original problem through the design decisions, the code, the evals, the things that broke, and what "shipped" actually meant. The product is invented, but every decision is the kind you really face when you build an agent on Claude in 2026.
Key takeaways
- Start from a narrow, measurable problem, not "build a support agent" — scope is the difference between shipping and stalling.
- Build the simplest single-agent version first; only a demonstrated limit justifies more complexity.
- Write the eval set before the agent so "done" is defined and regressions are caught automatically.
- Expose tools via MCP with tight schemas; the agent is only as reliable as the tools you give it.
- "Shipped" means gated rollout — shadow mode, then a small traffic slice, then full — never a big-bang launch.
The problem: a drowning triage queue
The team runs a support desk where every incoming ticket has to be read, categorized, tagged with priority, and routed to the right queue. A human does this today, and it takes the first 90 seconds of every ticket. That is the whole problem — not "answer support tickets," just triage them. Narrowing to triage is the most important decision in the entire build, because it makes success measurable: did the agent assign the same category and priority a senior agent would?
The target is concrete: match a human's category on most tickets, never silently downgrade an urgent one, and hand off anything it is unsure about. That target becomes the eval.
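One way to make that target concrete is to pin down the shape of a single triage decision before writing any agent code. The sketch below is a minimal illustration, not the team's actual schema: the category names, the "normal" priority level, and the `escalate` field are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical label sets: the post names only "urgent" and "low" priorities
# and does not list the real categories.
CATEGORIES = {"billing", "bug", "account", "other"}
PRIORITIES = {"low", "normal", "urgent"}


@dataclass(frozen=True)
class TriageDecision:
    category: Optional[str]  # None when the agent escalates
    priority: Optional[str]  # None when the agent escalates
    escalate: bool = False   # True means "hand off to a human"

    def __post_init__(self):
        if self.escalate:
            return  # an escalation carries no guess to validate
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {self.priority!r}")


decision = TriageDecision(category="billing", priority="urgent")
handoff = TriageDecision(category=None, priority=None, escalate=True)
```

Rejecting unknown labels at construction time means a malformed agent output fails loudly instead of being routed to a queue that does not exist.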
How does the build flow from problem to production?
Before any code, it helps to see the full arc. The build is not "write a prompt and deploy" — it is a loop that ends at a gated rollout.
```mermaid
flowchart TD
    A["Narrow the problem: triage only"] --> B["Write eval set: 30 labeled tickets"]
    B --> C["Build single-agent v1 + 2 tools"]
    C --> D{"Eval pass rate OK?"}
    D -->|No| E["Fix context / tools / prompt"]
    E --> D
    D -->|Yes| F["Shadow mode on live traffic"]
    F --> G{"Matches humans in shadow?"}
    G -->|Yes| H["Roll out to 10% then 100%"]
    G -->|No| E
```
Notice the eval loop sits in the middle and the agent only reaches real users after it has matched humans in shadow mode. That ordering is what separates a shipped agent from a demo.
The eval set, written first
We pulled thirty real tickets, had a senior agent label each with category and priority, and froze that as the gold set. The eval runs the agent over all thirty and scores agreement. Critically, it singles out one failure: assigning an urgent ticket any lower priority is flagged as a hard fail, because that mistake is far worse than the reverse.
```python
def score(predicted, gold):
    """Score one agent prediction against a senior agent's gold label."""
    cat_ok = predicted["category"] == gold["category"]
    # Asymmetric: downgrading an urgent ticket is a hard fail.
    if gold["priority"] == "urgent" and predicted["priority"] != "urgent":
        return {"pass": False, "reason": "downgraded_urgent"}
    pri_ok = predicted["priority"] == gold["priority"]
    return {"pass": cat_ok and pri_ok}


# run_agent and gold_set are defined elsewhere in the project.
results = [score(run_agent(t["text"]), t["label"]) for t in gold_set]
pass_rate = sum(r["pass"] for r in results) / len(results)
print(f"pass rate: {pass_rate:.0%}")
```
The asymmetry encodes a real business rule into the eval. Before we wrote a line of agent code, we knew exactly what "good enough to ship" meant: a high overall pass rate and zero downgraded-urgent failures.
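That ship criterion can be written down as a single check over the eval results. The following is a minimal sketch, assuming results shaped like the output of `score` above; the `ready_to_ship` name and the 0.9 threshold are illustrative assumptions, not figures from this build.

```python
def ready_to_ship(results, min_pass_rate=0.9):
    """Return True only if the pass rate clears the bar and no urgent
    ticket was downgraded.

    min_pass_rate=0.9 is an assumed threshold, not a number from the post.
    """
    if not results:
        return False  # an empty eval proves nothing
    pass_rate = sum(r["pass"] for r in results) / len(results)
    downgraded = [r for r in results if r.get("reason") == "downgraded_urgent"]
    return pass_rate >= min_pass_rate and not downgraded


results = [{"pass": True}] * 29 + [{"pass": False, "reason": "downgraded_urgent"}]
print(ready_to_ship(results))  # False: 97% pass rate, but one urgent ticket was downgraded
```

Keeping the two conditions separate matters: a high average cannot buy back a single downgraded urgent ticket.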
v1: one agent, two tools
The first version was deliberately boring — a single Claude Sonnet agent with two MCP tools: get_customer_tier (so a paying customer's ticket gets weighted up) and search_similar_tickets (to ground category choice in precedent). No subagents, no orchestration. The system prompt described the categories, the priority rubric, and the rule to escalate uncertainty rather than guess.
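To illustrate what tight schemas might look like for these two tools, here is a sketch of tool definitions in the shape MCP uses when listing tools (a name, a description, and a JSON Schema under `inputSchema`). The parameter names `customer_id`, `query`, and `limit` are hypothetical, and `check_args` is a tiny stand-in for a real JSON Schema validator, not part of any SDK.

```python
# Tool definitions: name, description, and a JSON Schema for the arguments.
# Parameter names are hypothetical; the post does not specify them.
TOOLS = [
    {
        "name": "get_customer_tier",
        "description": "Return the support tier for one customer.",
        "inputSchema": {
            "type": "object",
            "properties": {"customer_id": {"type": "string"}},
            "required": ["customer_id"],
            "additionalProperties": False,
        },
    },
    {
        "name": "search_similar_tickets",
        "description": "Return past tickets similar to the given text.",
        "inputSchema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer"},
            },
            "required": ["query"],
            "additionalProperties": False,
        },
    },
]

_PY_TYPES = {"string": str, "integer": int}


def check_args(tool, args):
    """Minimal argument check; returns a list of problems (empty means valid)."""
    schema = tool["inputSchema"]
    problems = [f"missing: {k}" for k in schema["required"] if k not in args]
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            problems.append(f"unexpected: {key}")
        elif not isinstance(value, _PY_TYPES[spec["type"]]):
            problems.append(f"wrong type: {key}")
    return problems
```

Disallowing extra properties and marking required fields gives the model less room to call a tool in a shape the backend cannot serve.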
v1 hit a respectable pass rate but failed two of the urgent cases — it didn't reliably pull customer tier before deciding priority. The fix was not a better model; it was context engineering. We restructured the prompt so the agent was instructed to call get_customer_tier first and treat tier as a priority input. Pass rate climbed and the urgent failures went to zero. This is the typical rhythm: the model is rarely the bottleneck; the surrounding wiring is.
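A prompt instruction like "call get_customer_tier first" is easier to trust when the eval also checks that it happened. The sketch below assumes a hypothetical trace format, a list of step dicts with a `type` of `"tool_call"` or `"final"`; real traces will look different depending on how the agent loop is logged.

```python
def tier_checked_before_decision(trace):
    """True if get_customer_tier was called before the final answer.

    `trace` is a hypothetical list of step dicts, for example
    {"type": "tool_call", "name": "get_customer_tier"} or {"type": "final"}.
    """
    for step in trace:
        if step["type"] == "tool_call" and step["name"] == "get_customer_tier":
            return True
        if step["type"] == "final":
            return False  # decided without ever looking up the tier
    return False  # trace ended without a tier call or a final answer


good = [
    {"type": "tool_call", "name": "get_customer_tier"},
    {"type": "tool_call", "name": "search_similar_tickets"},
    {"type": "final"},
]
bad = [
    {"type": "tool_call", "name": "search_similar_tickets"},
    {"type": "final"},
]
```

Running this over every eval trace turns "the agent usually checks tier" into a countable failure mode.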
Common pitfalls we hit (so you can skip them)
- Scope creep into answering. We were tempted to have the agent also draft replies. That would have doubled the eval surface. Fix: ship triage, prove it, then expand.
- Reaching for multi-agent too early. An early instinct was an orchestrator with a "category agent" and a "priority agent." It was slower, costlier, and no more accurate. Fix: single agent until it provably can't cope.
- Evaluating on the data we built the prompt against. Early pass rates looked great because we'd tuned to those tickets. Fix: hold out a fresh test slice the prompt never saw.
- Big-bang launch. The plan was to flip it on for all tickets. Fix: shadow mode first — run the agent silently alongside humans and compare before it touches routing.
- No uncertainty path. v1 always guessed. Fix: give the agent an explicit "escalate to human" output and reward using it in the eval.
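The last pitfall, scoring an explicit escalation, could be handled by tracking escalations as their own outcome rather than lumping them in with wrong guesses. This is a sketch under assumptions: the `escalate` field name and the four outcome labels are invented for the example, and how much credit an escalation earns is a policy choice the post does not specify.

```python
def score_with_escalation(predicted, gold):
    """Variant of score() in which an explicit escalation is never a hard fail.

    Assumes the agent signals a hand-off with predicted["escalate"] == True
    (a hypothetical field name).
    """
    if predicted.get("escalate"):
        return {"pass": False, "outcome": "escalated"}
    if gold["priority"] == "urgent" and predicted["priority"] != "urgent":
        return {"pass": False, "outcome": "downgraded_urgent"}
    ok = (predicted["category"] == gold["category"]
          and predicted["priority"] == gold["priority"])
    return {"pass": ok, "outcome": "correct" if ok else "wrong"}


def summarize(results):
    """Count each outcome so escalations and wrong guesses stay separate."""
    counts = {"correct": 0, "escalated": 0, "wrong": 0, "downgraded_urgent": 0}
    for r in results:
        counts[r["outcome"]] += 1
    return counts
```

With the counts separated, a team can prefer an escalation over a wrong guess while still watching that the agent does not escalate everything.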
Ship it in 8 steps (the path we took)
- Narrow the problem to one measurable task — here, triage, not answering.
- Collect and human-label a gold set; freeze it as the eval.
- Encode business rules (urgent never downgraded) directly into the scoring.
- Build the simplest single-agent version with the minimum tools.
- Iterate on context and tool wiring until the eval passes, including the hard cases.
- Run in shadow mode against live traffic and compare to humans.
- Roll out to a small traffic slice, watch real traces daily, then expand.
- Keep the eval as a CI gate so future changes can't regress.
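The final step, keeping the eval as a CI gate, might look like the sketch below: a function that compares the current run against a stored baseline and returns a process exit code. The function name and the idea of a stored `baseline_pass_rate` are assumptions for illustration; the post does not describe how the gate was wired.

```python
def ci_exit_code(results, baseline_pass_rate):
    """Return 0 if the change may merge, 1 otherwise.

    `results` are dicts shaped like the output of the post's score() function;
    `baseline_pass_rate` is whatever the last accepted run achieved.
    """
    pass_rate = sum(r["pass"] for r in results) / len(results) if results else 0.0
    downgraded = sum(r.get("reason") == "downgraded_urgent" for r in results)
    if downgraded:
        print(f"FAIL: {downgraded} urgent ticket(s) downgraded")
        return 1
    if pass_rate < baseline_pass_rate:
        print(f"FAIL: pass rate {pass_rate:.0%} below baseline {baseline_pass_rate:.0%}")
        return 1
    print(f"OK: pass rate {pass_rate:.0%}")
    return 0


# Stand-in results; a real job would run the agent over the frozen gold set.
demo = [{"pass": True}] * 28 + [{"pass": False}] * 2
code = ci_exit_code(demo, baseline_pass_rate=0.90)
```

A CI job would pass the returned value to `sys.exit`, so any nonzero code blocks the merge.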
Each stage and what it cost
| Stage | Time | Main risk addressed |
|---|---|---|
| Scope & gold set | ~1 day | Building the wrong thing |
| v1 single agent | ~2 days | Over-engineering early |
| Eval-driven iteration | ~3 days | Silent quality gaps |
| Shadow mode | ~1 week | Real-world drift |
| Gated rollout | ~1 week | Blast radius on launch |
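The gated rollout in the last row can be sketched as deterministic bucketing: hash each ticket ID into one of 100 buckets and let the agent act only on the first few. This is one common approach, offered as an assumption rather than the mechanism this team used; `in_rollout` and `route` are hypothetical names.

```python
import hashlib


def in_rollout(ticket_id: str, percent: int) -> bool:
    """Deterministically place a ticket in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def route(ticket_id, agent_decision, human_queue, percent=10):
    """Use the agent's decision only for tickets inside the rollout slice."""
    if in_rollout(ticket_id, percent):
        return agent_decision
    return human_queue
```

Because the bucket depends only on the ticket ID, a ticket that was in the 10% slice stays in the slice when the percentage grows, which keeps comparisons across rollout stages clean.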
An end-to-end agent build is the disciplined progression from a narrowly scoped, measurable problem through an eval-gated single-agent implementation to a shadow-tested, incrementally rolled-out production system. The agent we shipped was unglamorous — one model, two tools, a tight prompt — and that is precisely why it worked.
Frequently asked questions
Why not build the most capable version first?
Because complexity you can't measure is complexity you can't trust. The simplest version that passes a real eval ships sooner and teaches you where the actual limits are.
How big should the eval set be to start?
Thirty to fifty well-labeled cases are enough to catch most regressions and define "done." Whenever the agent surprises you in production, add the failing case to the set.
What does shadow mode actually mean?
The agent runs on real live inputs and produces outputs, but those outputs don't take effect — humans still do the real work. You compare the agent's calls to theirs to validate before granting it control.
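A shadow-mode comparison can be sketched in a few lines. The version below assumes tickets arrive as `{"id", "text"}` dicts, the agent is a callable from text to a decision dict, and human decisions are available keyed by ticket ID; all three shapes are assumptions for the example.

```python
def shadow_compare(tickets, agent_fn, human_decisions):
    """Run the agent on live tickets without letting its output affect routing.

    Returns the agreement rate and a per-ticket log. Nothing here writes to
    the routing system; the human decision remains the one that takes effect.
    """
    log = []
    for ticket in tickets:
        agent = agent_fn(ticket["text"])
        human = human_decisions[ticket["id"]]
        log.append({
            "id": ticket["id"],
            "agree": (agent["category"] == human["category"]
                      and agent["priority"] == human["priority"]),
            "downgraded_urgent": (human["priority"] == "urgent"
                                  and agent["priority"] != "urgent"),
        })
    agreement = sum(row["agree"] for row in log) / len(log) if log else 0.0
    return agreement, log
```

The per-ticket log is the useful part: disagreements can be reviewed by hand and, where the human was right, added to the eval set.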
When did we know it was ready to ship?
When it matched senior agents on the held-out eval, produced zero downgraded-urgent failures, and tracked human decisions closely through a full week of shadow traffic.
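The held-out eval mentioned here depends on a split the prompt is never tuned against. A minimal sketch of such a split follows; the 30% holdout fraction and the fixed seed are assumptions, not values from this build.

```python
import random


def split_gold_set(gold_set, holdout_fraction=0.3, seed=0):
    """Shuffle with a fixed seed, then reserve a slice the prompt never sees.

    Returns (tuning_set, held_out_set). The fraction and seed are illustrative.
    """
    items = list(gold_set)
    random.Random(seed).shuffle(items)
    n_holdout = max(1, round(len(items) * holdout_fraction))
    return items[n_holdout:], items[:n_holdout]


tuning, held_out = split_gold_set(range(30))
print(len(tuning), len(held_out))  # 21 9
```

Fixing the seed keeps the split stable across runs, so a ticket never drifts from the held-out slice into the set the prompt is tuned on.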
Bringing shipped agents to your phone lines
CallSphere runs this exact playbook for voice and chat — eval-gated agents that triage, answer, and book work across every call and message. See a production build live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.