A real Claude agent build: from problem to shipped

Most agent writing lives in the abstract: principles, patterns, diagrams. This post does the opposite. It walks one realistic build end to end — a support-triage agent — from the messy original problem through the design decisions, the code, the evals, the things that broke, and what "shipped" actually meant. The product is invented, but every decision is the kind you really face when you build an agent on Claude in 2026.

Key takeaways

Start from a narrow, measurable problem, not "build a support agent" — scope is the difference between shipping and stalling.
Build the simplest single-agent version first; only a demonstrated limit justifies more complexity.
Write the eval set before the agent so "done" is defined and regressions are caught automatically.
Expose tools via MCP with tight schemas; the agent is only as reliable as the tools you give it.
"Shipped" means gated rollout — shadow mode, then a small traffic slice, then full — never a big-bang launch.

The problem: a drowning triage queue

The team runs a support desk where every incoming ticket has to be read, categorized, tagged with priority, and routed to the right queue. A human does this today, and it takes the first 90 seconds of every ticket. That is the whole problem — not "answer support tickets," just triage them. Narrowing to triage is the most important decision in the entire build, because it makes success measurable: did the agent assign the same category and priority a senior agent would?

The target is concrete: match a human's category on most tickets, never silently downgrade an urgent one, and hand off anything it is unsure about. That target becomes the eval.

How does the build flow from problem to production?

Before any code, it helps to see the full arc. The build is not "write a prompt and deploy" — it is a loop that ends at a gated rollout.

flowchart TD
  A["Narrow the problem: triage only"] --> B["Write eval set: 30 labeled tickets"]
  B --> C["Build single-agent v1 + 2 tools"]
  C --> D{"Eval pass rate OK?"}
  D -->|No| E["Fix context / tools / prompt"]
  E --> D
  D -->|Yes| F["Shadow mode on live traffic"]
  F --> G{"Matches humans in shadow?"}
  G -->|Yes| H["Roll out to 10% then 100%"]
  G -->|No| E

Notice the eval loop sits in the middle and the agent only reaches real users after it has matched humans in shadow mode. That ordering is what separates a shipped agent from a demo.

The eval set, written first

We pulled thirty real tickets, had a senior agent label each with category and priority, and froze that as the gold set. The eval runs the agent over all thirty and scores agreement. Critically, it weights one failure heavily: marking an urgent ticket as low priority is far worse than the reverse.

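def score(predicted, gold):
    """Compare one agent prediction to the senior agent's gold label."""
    cat_ok = predicted["category"] == gold["category"]
    # asymmetric: downgrading urgent is a hard fail
    if gold["priority"] == "urgent" and predicted["priority"] != "urgent":
        return {"pass": False, "reason": "downgraded_urgent"}
    pri_ok = predicted["priority"] == gold["priority"]
    return {"pass": cat_ok and pri_ok}

# run_agent and gold_set are assumed to be defined elsewhere; each gold_set
# item has a "text" field and a "label" dict with category and priority.
results = [score(run_agent(t["text"]), t["label"]) for t in gold_set]
pass_rate = sum(r["pass"] for r in results) / len(results)
# count the hard failures separately, since the ship criterion requires zero
downgraded = sum(r.get("reason") == "downgraded_urgent" for r in results)
print(f"pass rate: {pass_rate:.0%}, downgraded urgent: {downgraded}")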

The asymmetry encodes a real business rule into the eval. Before we wrote a line of agent code, we knew exactly what "good enough to ship" meant: a high overall pass rate and zero downgraded-urgent failures.
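The ship criterion described here (a high overall pass rate plus zero downgraded-urgent failures) can be reduced to a single yes/no gate. The sketch below is one way to do that, assuming result dictionaries shaped like the ones score returns. The ship_gate name and the 0.9 threshold are illustrative assumptions, not figures from this build.

```python
def ship_gate(results, min_pass_rate=0.9):
    """Return (ok, summary) for a list of score()-style results.

    min_pass_rate is a placeholder threshold (an assumption); pick your own.
    """
    if not results:
        # an empty eval run proves nothing, so it never passes the gate
        return False, {"pass_rate": 0.0, "downgraded_urgent": 0}
    passed = sum(1 for r in results if r["pass"])
    downgraded = sum(1 for r in results if r.get("reason") == "downgraded_urgent")
    pass_rate = passed / len(results)
    # both conditions must hold: enough passes AND no downgraded urgent tickets
    ok = pass_rate >= min_pass_rate and downgraded == 0
    return ok, {"pass_rate": pass_rate, "downgraded_urgent": downgraded}


# Example: nine passes and one downgraded-urgent failure.
example = [{"pass": True}] * 9 + [{"pass": False, "reason": "downgraded_urgent"}]
ok, summary = ship_gate(example)
print(ok, summary)
```

Keeping the hard-fail count as its own condition means a strong average cannot hide a single downgraded urgent ticket.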

v1: one agent, two tools

The first version was deliberately boring — a single Claude Sonnet agent with two MCP tools: get_customer_tier (so a paying customer's ticket gets weighted up) and search_similar_tickets (to ground category choice in precedent). No subagents, no orchestration. The system prompt described the categories, the priority rubric, and the rule to escalate uncertainty rather than guess.
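To make "tight schemas" concrete, here is a rough sketch of how the two tools might be described. MCP tool definitions carry a name, a description, and a JSON Schema under inputSchema; the parameter names (customer_id, query, limit) and the validate_args helper are assumptions for illustration, and a real server would normally rely on a proper JSON Schema validator.

```python
GET_CUSTOMER_TIER = {
    "name": "get_customer_tier",
    "description": "Return the support tier for one customer.",
    "inputSchema": {
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},  # assumed parameter
        "required": ["customer_id"],
        "additionalProperties": False,
    },
}

SEARCH_SIMILAR_TICKETS = {
    "name": "search_similar_tickets",
    "description": "Find previously triaged tickets similar to the query text.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "query": {"type": "string"},    # assumed parameter
            "limit": {"type": "integer"},   # assumed parameter
        },
        "required": ["query"],
        "additionalProperties": False,
    },
}

_PY_TYPES = {"string": str, "integer": int}


def validate_args(tool, args):
    """Tiny stand-in for a JSON Schema validator; returns a list of errors."""
    schema = tool["inputSchema"]
    errors = []
    for name in schema["required"]:
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unexpected argument: {name}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif not isinstance(value, _PY_TYPES[spec["type"]]) or isinstance(value, bool):
            errors.append(f"wrong type for {name}: expected {spec['type']}")
    return errors


print(validate_args(SEARCH_SIMILAR_TICKETS, {"query": "refund failed", "limit": "5"}))
```

Rejecting unexpected and mistyped arguments at the tool boundary turns a malformed call into an explicit error the agent can react to, rather than a silently wrong lookup.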

v1 hit a respectable pass rate but failed two of the urgent cases — it didn't reliably pull customer tier before deciding priority. The fix was not a better model; it was context engineering. We restructured the prompt so the agent was instructed to call get_customer_tier first and treat tier as a priority input. Pass rate climbed and the urgent failures went to zero. This is the typical rhythm: the model is rarely the bottleneck; the surrounding wiring is.
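A prompt instruction like "call get_customer_tier first" is easier to trust when the eval also checks that it was followed. The sketch below assumes each run produces a trace, a list of event dictionaries with a "type" of either "tool_call" or "final"; that trace shape and the function name are hypothetical, so adapt them to whatever your agent harness records. It checks the slightly weaker property that tier was fetched at some point before the final decision.

```python
def tier_fetched_before_decision(trace):
    """True if get_customer_tier was called before the final decision.

    `trace` is an assumed format: a list of dicts such as
    {"type": "tool_call", "name": "..."} and {"type": "final", ...}.
    """
    for event in trace:
        if event["type"] == "tool_call" and event["name"] == "get_customer_tier":
            return True
        if event["type"] == "final":
            # decision reached without ever looking up the tier
            return False
    return False


good_trace = [
    {"type": "tool_call", "name": "get_customer_tier"},
    {"type": "tool_call", "name": "search_similar_tickets"},
    {"type": "final", "category": "billing", "priority": "urgent"},
]
bad_trace = [
    {"type": "tool_call", "name": "search_similar_tickets"},
    {"type": "final", "category": "billing", "priority": "low"},
]
print(tier_fetched_before_decision(good_trace), tier_fetched_before_decision(bad_trace))
```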

Common pitfalls we hit (so you can skip them)

Scope creep into answering. We were tempted to have the agent also draft replies. That would have doubled the eval surface. Fix: ship triage, prove it, then expand.
Reaching for multi-agent too early. An early instinct was an orchestrator with a "category agent" and a "priority agent." It was slower, costlier, and no more accurate. Fix: single agent until it provably can't cope.
Evaluating on the data we built the prompt against. Early pass rates looked great because we'd tuned to those tickets. Fix: hold out a fresh test slice the prompt never saw.
Big-bang launch. The plan was to flip it on for all tickets. Fix: shadow mode first — run the agent silently alongside humans and compare before it touches routing.
No uncertainty path. The prompt told v1 to escalate when unsure, but it had no explicit way to do so, and in practice it always guessed. Fix: give the agent an explicit "escalate to human" output and reward using it in the eval.
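The last fix in the list above, rewarding an explicit escalation, can be folded into the scoring function. This is a sketch under stated assumptions: the prediction carries an optional "escalate" flag, and the senior agent marks genuinely unclear tickets with an "ambiguous" flag in the gold label. Neither field name comes from the original build, and treating escalation of a clear ticket as a failure is one possible policy, not the only one.

```python
def score_with_escalation(predicted, gold):
    """Score one prediction, treating escalation as a first-class outcome."""
    if predicted.get("escalate"):
        if gold.get("ambiguous"):
            # handing off an unclear ticket is the desired behavior
            return {"pass": True, "reason": "escalated_ambiguous"}
        # escalating a clear ticket wastes human time (assumed policy)
        return {"pass": False, "reason": "unneeded_escalation"}
    # same asymmetric rule as before: downgrading urgent is a hard fail
    if gold["priority"] == "urgent" and predicted["priority"] != "urgent":
        return {"pass": False, "reason": "downgraded_urgent"}
    ok = (predicted["category"] == gold["category"]
          and predicted["priority"] == gold["priority"])
    return {"pass": ok}


unclear = {"category": "billing", "priority": "normal", "ambiguous": True}
print(score_with_escalation({"escalate": True}, unclear))
```

Because an escalated ticket goes to a human, it is never counted as a downgraded urgent, which keeps the hard-fail rule focused on silent mistakes.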

Ship it in 8 steps (the path we took)

Narrow the problem to one measurable task — here, triage, not answering.
Collect and human-label a gold set; freeze it as the eval.
Encode business rules (urgent never downgraded) directly into the scoring.
Build the simplest single-agent version with the minimum tools.
Iterate on context and tool wiring until the eval passes, including the hard cases.
Run in shadow mode against live traffic and compare to humans.
Roll out to a small traffic slice, watch real traces daily, then expand.
Keep the eval as a CI gate so future changes can't regress.
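For the small traffic slice in the steps above, one common approach is deterministic bucketing: hash a stable ticket identifier into 100 buckets and route the lowest ones to the agent. The sketch below assumes tickets have a string ID; the in_rollout name is hypothetical and this is not necessarily how the team in this post split traffic.

```python
import hashlib


def in_rollout(ticket_id, percent):
    """Deterministically place a ticket in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(ticket_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


# The same ticket always gets the same answer, and raising the
# percentage only ever adds tickets to the slice.
for ticket_id in ["ticket-1", "ticket-2", "ticket-3"]:
    print(ticket_id, in_rollout(ticket_id, 10), in_rollout(ticket_id, 100))
```

Hashing rather than random sampling means a ticket never flips between the agent and the human path on a retry, and expanding from 10% to 100% keeps every ticket that was already in the slice.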

Each stage and what it cost

Stage	Time	Main risk addressed
Scope & gold set	~1 day	Building the wrong thing
v1 single agent	~2 days	Over-engineering early
Eval-driven iteration	~3 days	Silent quality gaps
Shadow mode	~1 week	Real-world drift
Gated rollout	~1 week	Blast radius on launch

An end-to-end agent build is the disciplined progression from a narrowly scoped, measurable problem through an eval-gated single-agent implementation to a shadow-tested, incrementally rolled-out production system. The agent we shipped was unglamorous — one model, two tools, a tight prompt — and that is precisely why it worked.

Frequently asked questions

Why not build the most capable version first?

Because complexity you can't measure is complexity you can't trust. The simplest version that passes a real eval ships sooner and teaches you where the actual limits are.

How big should the eval set be to start?

Thirty to fifty well-labeled cases are enough to catch most regressions and define "done." Every time the agent surprises you in production, add the failing case to the set.
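Growing the set can be as simple as appending the failing ticket with its corrected label. The helper below is a minimal sketch that assumes the gold set is a list of dictionaries with "text" and "label" keys, matching the scoring code earlier in the post; the add_regression_case name and the whitespace-insensitive duplicate check are illustrative choices.

```python
def add_regression_case(gold_set, text, label):
    """Append a production failure to the gold set unless its text is already present."""
    normalized = " ".join(text.split()).lower()
    for case in gold_set:
        if " ".join(case["text"].split()).lower() == normalized:
            return False  # already covered, avoid double-weighting one ticket
    gold_set.append({"text": text, "label": label})
    return True


gold = [{"text": "Card charged twice", "label": {"category": "billing", "priority": "urgent"}}]
added = add_regression_case(
    gold, "Cannot reset password", {"category": "account", "priority": "normal"}
)
print(added, len(gold))
```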

What does shadow mode actually mean?

The agent runs on real live inputs and produces outputs, but those outputs don't take effect: humans still do the real work. You compare the agent's triage decisions to theirs to validate it before granting it control.
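In code, shadow mode comes down to two rules: always return the human's decision, and never let an agent failure interfere with it. The following sketch assumes tickets are dictionaries with "id" and "text", that agent_fn is whatever callable runs your agent, and that an in-memory list is an acceptable log; all of those names are hypothetical.

```python
def handle_ticket_in_shadow(ticket, human_decision, agent_fn, log):
    """Record the agent's decision but return the human's, which is what takes effect."""
    try:
        agent_decision = agent_fn(ticket["text"])
    except Exception as exc:  # a shadow failure must never block real routing
        agent_decision = {"error": str(exc)}
    log.append({"ticket_id": ticket["id"], "human": human_decision, "agent": agent_decision})
    return human_decision


def agreement_rate(log):
    """Share of logged tickets where category and priority both match the human's."""
    if not log:
        return 0.0
    matches = sum(
        1 for row in log
        if row["agent"].get("category") == row["human"]["category"]
        and row["agent"].get("priority") == row["human"]["priority"]
    )
    return matches / len(log)


def fake_agent(text):
    # stand-in for a real agent call, for demonstration only
    return {"category": "billing", "priority": "urgent"}


shadow_log = []
human = {"category": "billing", "priority": "urgent"}
routed = handle_ticket_in_shadow({"id": "t1", "text": "Charged twice"}, human, fake_agent, shadow_log)
print(routed == human, agreement_rate(shadow_log))
```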

When did we know it was ready to ship?

When it matched senior agents on the held-out eval, produced zero downgraded-urgent failures, and tracked human decisions closely through a full week of shadow traffic.

Bringing shipped agents to your phone lines

CallSphere runs this exact playbook for voice and chat — eval-gated agents that triage, answer, and book work across every call and message. See a production build live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
