Building a Claude Support Agent: End-to-End Walkthrough
A realistic startup case study from support backlog to a shipped Claude agent — architecture, evals, and a staged shadow-to-auto rollout.
Most agent tutorials stop at a toy example. This one follows a realistic build from the actual starting point a startup faces: a real problem, a messy reality, constraints on time and money, and the unglamorous decisions that determine whether the thing ships. The scenario is a 12-person SaaS startup drowning in support tickets, and the goal is a Claude agent that resolves a meaningful slice of them without making things worse. We will walk the whole arc — problem, design, build, eval, rollout — so you can see how the pieces fit.
The problem, stated honestly
Support volume has tripled with growth. Two support staff are buried, first-response time has crept past a day, and the founders keep getting pulled in to firefight. About 40% of tickets are repetitive: password and access issues, billing questions answerable from the account record, and how-do-I questions answered in the docs. The team does not need an agent that handles everything. They need one that confidently closes the boring 40% and cleanly hands off the rest.
That framing is the most important decision in the whole project. A narrow, well-bounded job with a clear success bar — resolve common tickets correctly or escalate — is achievable. "Replace support" is not. The team writes down the target explicitly: for in-scope tickets, the agent should produce a resolution a support lead would send unedited, or escalate with a clear summary.
Designing the architecture
The team builds on the Claude Agent SDK so they get the agent loop, tool calling, and approval hooks without reinventing them. The design centers on what the agent can see and do. It needs read access to the customer's account and ticket history, read access to the docs, the ability to draft and send a reply, and the ability to escalate. Crucially, it does not get write access to billing or the ability to issue refunds in version one: that is more blast radius than the team is ready to hand over.
flowchart TD
A["New ticket arrives"] --> B["Agent reads account + history via MCP"]
B --> C{"In-scope & confident?"}
C -->|No| D["Escalate with summary to human"]
C -->|Yes| E["Search docs skill for answer"]
E --> F["Draft reply with Claude"]
F --> G{"Sensitive action involved?"}
G -->|Yes| D
G -->|No| H["Send reply, tag resolved, log trace"]
The MCP layer exposes three scoped servers: a read-only account/orders server, a read-only docs server backing a search Skill, and a messaging server with a single send-reply tool. The system prompt sets the agent's identity, tone, and hard rules: never guess at billing specifics, always escalate anything touching money or account deletion, and never claim certainty it does not have. The confidence gate is the safety valve — when the agent is unsure, escalation is the correct answer, not a guess.
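The two decision points in the flowchart can be sketched as a small routing function. This is a minimal illustration, not the team's implementation: the threshold value, the category names, and the sensitive-topic names are all hypothetical, and the article does not say how the confidence score itself is produced.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ESCALATE = "escalate"
    SEND_REPLY = "send_reply"


# Hypothetical values: the article names neither a threshold nor a category list.
CONFIDENCE_THRESHOLD = 0.9
IN_SCOPE_CATEGORIES = {"access", "billing_question", "how_to"}
SENSITIVE_TOPICS = {"refund", "payment_change", "account_deletion"}


@dataclass
class TicketAssessment:
    category: str      # label assigned to the ticket
    confidence: float  # 0.0 to 1.0, how sure the agent is of its answer
    topics: set[str]   # topics detected in the ticket or the draft reply


def route(assessment: TicketAssessment) -> Action:
    """Mirror the two decision diamonds in the flowchart."""
    in_scope = assessment.category in IN_SCOPE_CATEGORIES
    confident = assessment.confidence >= CONFIDENCE_THRESHOLD
    # First gate: anything out of scope or uncertain goes to a human.
    if not (in_scope and confident):
        return Action.ESCALATE
    # Second gate: anything touching money or account deletion goes to a human.
    if assessment.topics & SENSITIVE_TOPICS:
        return Action.ESCALATE
    return Action.SEND_REPLY
```

Keeping both gates in plain code outside the prompt means the escalation rules hold even when the model's own judgment is off.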
Building the first version
The build itself is fast, which surprises the team. With the Agent SDK handling the loop, the real work falls into three areas: writing the system prompt and iterating on it against real tickets; building the docs Skill so the agent retrieves the right help-center article and quotes it accurately; and wiring the escalation path so a handoff includes a clean summary the human can act on in seconds.
The first version is deliberately conservative. The confidence threshold is set high, so early on the agent escalates more than it resolves. That is the right starting posture: a support agent that escalates too much is annoying; one that confidently sends wrong answers to customers is a brand problem. The team would rather earn trust by being right and tune the threshold down as evals prove it can handle more.
The eval suite that gates the release
Before a single real customer sees an agent reply, the team builds an eval set from history. They pull a few hundred past tickets, including hard and ambiguous ones, and have a support lead label the correct outcome for each: the ideal resolution or "should escalate." They then run the agent against this fixed set and score it. For resolution quality they use Claude as a judge with a strict rubric, plus human spot-checks on a sample.
The metrics that gate the release are concrete. Resolution accuracy on in-scope tickets must clear a high bar. The false-resolution rate — confidently sending a wrong answer — must be near zero, because that is the failure that damages trust. And the escalation handoffs must be judged useful by the support lead. For a citable definition: an agent eval suite is a fixed collection of representative tasks with labeled correct outcomes, run repeatedly to measure whether an agent meets a defined quality bar before and after every change.
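As a rough sketch, the two headline metrics can be computed from labeled eval results like this. The record shape and field names are hypothetical, and the choice to divide false resolutions by the whole eval set (rather than by replies sent) is an assumption; the article does not pin down the denominator.

```python
from dataclasses import dataclass

ESCALATE = "escalate"
RESOLVE = "resolve"


@dataclass
class EvalResult:
    expected: str        # support lead's label: RESOLVE or ESCALATE
    actual: str          # what the agent did on this ticket
    reply_correct: bool  # judge verdict; only meaningful when actual == RESOLVE


def score(results: list[EvalResult]) -> dict[str, float]:
    """Compute resolution accuracy and false-resolution rate for one eval run."""
    in_scope = [r for r in results if r.expected == RESOLVE]
    resolved_right = [
        r for r in in_scope if r.actual == RESOLVE and r.reply_correct
    ]
    # False resolution: the agent replied when it should have escalated,
    # or replied with an answer the judge marked wrong.
    false_resolutions = [
        r for r in results
        if r.actual == RESOLVE
        and (r.expected == ESCALATE or not r.reply_correct)
    ]
    return {
        "resolution_accuracy":
            len(resolved_right) / len(in_scope) if in_scope else 0.0,
        "false_resolution_rate":
            len(false_resolutions) / len(results) if results else 0.0,
    }
```

Because the set is fixed, the same function can be rerun after every prompt or tool change and the two numbers compared directly against the previous run.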
Rollout: shadow, then assist, then act
The team does not flip a switch. They roll out in three stages. In shadow mode, the agent processes real tickets but its replies go to the support team, not the customer; the team compares what the agent would have done to what they did. This surfaces failure patterns safely and builds confidence in the numbers.
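One way to make the shadow-mode comparison concrete is to log what the agent would have done next to what the human did, then count the disagreements by pattern. The sketch below assumes a hypothetical record shape and action labels; none of these names come from the article.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    ticket_id: str
    category: str
    agent_action: str  # "resolve" or "escalate": what the agent would have done
    human_action: str  # what the support team actually did


def shadow_report(records: list[ShadowRecord]) -> dict:
    """Summarize agreement and the most common disagreement patterns."""
    agree = sum(1 for r in records if r.agent_action == r.human_action)
    # Group disagreements by (category, agent action, human action)
    # so recurring failure patterns stand out.
    patterns = Counter(
        (r.category, r.agent_action, r.human_action)
        for r in records
        if r.agent_action != r.human_action
    )
    return {
        "agreement_rate": agree / len(records) if records else 0.0,
        "top_disagreements": patterns.most_common(3),
    }
```

A cluster such as ("billing", "resolve", "escalate") near the top of the list would point to a category where the agent is more willing to answer than the humans are, which is the kind of pattern worth investigating before any reply reaches a customer.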
In assist mode, the agent drafts replies that a human approves with one click before they send. This keeps a human in the loop while cutting handling time dramatically, and it generates more labeled data. Only after the metrics hold across hundreds of real tickets does the team enable auto-resolve for the highest-confidence, lowest-risk ticket categories — and even then, with full audit logging and a circuit breaker that halts auto-resolve if the false-resolution rate ticks up. The result is the boring 40% handled in seconds, the team focused on the hard tickets, and a system the founders trust because they watched it earn that trust at every stage.
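The circuit breaker mentioned above could look something like the following sketch. The window size, rate limit, and minimum sample count are hypothetical defaults chosen for illustration, and the rule that a tripped breaker stays tripped until a human resets it is an assumption rather than something the article specifies.

```python
from collections import deque


class AutoResolveBreaker:
    """Halt auto-resolve when the recent false-resolution rate gets too high."""

    def __init__(self, window: int = 200, max_false_rate: float = 0.01,
                 min_samples: int = 50):
        # Rolling window of booleans: True means a false resolution.
        self.outcomes = deque(maxlen=window)
        self.max_false_rate = max_false_rate
        self.min_samples = min_samples
        self.tripped = False

    def record(self, was_false_resolution: bool) -> None:
        self.outcomes.append(was_false_resolution)
        # Avoid tripping on a tiny sample.
        if len(self.outcomes) >= self.min_samples:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate > self.max_false_rate:
                self.tripped = True  # stays tripped until reset() is called

    def allow_auto_resolve(self) -> bool:
        return not self.tripped

    def reset(self) -> None:
        """Called by a human after the cause has been investigated."""
        self.outcomes.clear()
        self.tripped = False
```

When the breaker trips, the natural fallback is assist mode: the agent keeps drafting, but a human approves every reply until the cause is understood.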
What broke, and how the team fixed it
No real build goes cleanly, and this one did not either. In shadow mode the team found the agent confidently answering billing questions with slightly stale numbers, because the account server cached data that lagged the live system. The fix was not a better prompt — it was tightening the data source and adding a hard rule that any answer involving a current balance must come from a live read or escalate. This is the recurring lesson of agent building: many "model mistakes" are actually data or tooling problems wearing a model costume.
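The live-read rule can be enforced in code rather than left to the prompt. The sketch below is one possible shape, assuming a hypothetical reading object that records when and how a balance was fetched; the 30-second freshness limit is an illustrative value, not one from the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_BALANCE_AGE = timedelta(seconds=30)  # hypothetical freshness limit


@dataclass
class BalanceReading:
    amount_cents: int
    fetched_at: datetime  # timezone-aware timestamp of the read
    from_cache: bool      # True if the value came from a cache


def usable_balance(reading: BalanceReading,
                   now: Optional[datetime] = None) -> Optional[int]:
    """Return the balance only for a fresh live read; None means escalate."""
    now = now or datetime.now(timezone.utc)
    if reading.from_cache:
        return None
    if now - reading.fetched_at > MAX_BALANCE_AGE:
        return None
    return reading.amount_cents
```

Callers should test the result with `is None` rather than truthiness, since a legitimate balance of zero is a valid answer.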
A second issue surfaced in assist mode: the agent's tone drifted toward robotic on emotionally charged tickets — a frustrated customer got a technically correct but cold reply. The support lead flagged it during review, the team added tone guidance and a few labeled examples of empathetic handling to the system prompt, and the edit rate on those tickets dropped. Watching the human edit rate, not just resolution accuracy, is what surfaced the problem early. Each fix went back into the eval suite as a new case, so the agent could never silently regress on it again.
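Tracking the human edit rate per ticket category is simple to do from assist-mode approvals. This is a minimal sketch that assumes each review is logged as a (category, was_edited) pair; the category names in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def edit_rate_by_category(
    reviews: Iterable[tuple[str, bool]],
) -> dict[str, float]:
    """Fraction of agent drafts a human edited before sending, per category."""
    totals: dict[str, int] = defaultdict(int)
    edited: dict[str, int] = defaultdict(int)
    for category, was_edited in reviews:
        totals[category] += 1
        if was_edited:
            edited[category] += 1
    return {category: edited[category] / totals[category] for category in totals}
```

A category such as "frustrated_customer" showing a much higher edit rate than "how_to" would flag a tone or quality problem even when resolution accuracy looks fine.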
The outcome and what made it stick
By the time auto-resolve was live for the safe categories, first-response time on those tickets dropped from over a day to seconds, the two support staff were spending their time on the genuinely hard cases, and the founders had stopped firefighting. The number that mattered most to them was the false-resolution rate staying near zero in production — proof the agent was not quietly damaging customer trust to inflate its volume.
What made the result durable was the discipline around it, not the cleverness of the agent. The eval suite caught regressions before customers did. The staged rollout meant every expansion of scope was earned with data. The audit trail meant any surprising action could be explained. And the narrow framing — handle the boring 40% well, escalate everything else cleanly — kept the project achievable instead of aspirational. A startup that copies this arc, not the specific tooling, will ship an agent that works.
Frequently asked questions
How long does a build like this take?
The first working version comes together quickly because the Agent SDK handles the agent loop and tool calling. The time-consuming parts are building the eval suite from historical tickets and the staged rollout, which together are weeks of careful work — and they are exactly the parts you should not rush.
Why not let the agent issue refunds in version one?
Refunds are irreversible and high value, so a single mistake has a large blast radius. The right move is to ship the agent with read access and a send-reply tool, prove it is reliable on safe tickets, and only later add gated, capped financial actions. Earning trust on low-risk work first is how you safely expand scope.
What stops the agent from sending a wrong answer to a customer?
Three layers: a high confidence gate that escalates anything uncertain, an eval suite that keeps the false-resolution rate near zero before release, and a staged rollout (shadow, then assist, then auto) so humans catch failures before the agent acts autonomously. On top of those, capability scoping via MCP keeps the agent away from sensitive actions entirely.
How do you decide which tickets the agent should auto-resolve?
Start from your data. Categorize historical tickets, find the repetitive, low-risk, well-documented ones, and limit auto-resolution to those. Expand the scope only as evals prove accuracy holds. The agent does not need to handle everything to be valuable — closing the boring, high-volume slice is the win.
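The selection step described in that last answer can be sketched as a filter over per-category statistics. Every field name and cutoff here is hypothetical; the article gives the criteria (repetitive, low-risk, well-documented, proven by evals) but no numbers.

```python
from dataclasses import dataclass


@dataclass
class CategoryStats:
    name: str
    ticket_count: int     # volume in the historical sample
    eval_accuracy: float  # resolution accuracy from the eval suite, 0.0 to 1.0
    low_risk: bool        # team judgment: no money or account deletion involved
    documented: bool      # the answer exists in the docs


def auto_resolve_candidates(stats: list[CategoryStats],
                            min_volume: int = 50,
                            min_accuracy: float = 0.98) -> list[str]:
    """Return eligible category names, highest volume first."""
    eligible = [
        s for s in stats
        if s.low_risk
        and s.documented
        and s.ticket_count >= min_volume
        and s.eval_accuracy >= min_accuracy
    ]
    eligible.sort(key=lambda s: s.ticket_count, reverse=True)
    return [s.name for s in eligible]
```

Rerunning the filter after each eval run gives a data-backed list of categories to consider for expansion, rather than expanding scope on intuition.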
From tickets to phone lines
The same problem-to-shipped arc applies to voice. CallSphere builds Claude-powered voice and chat agents that resolve common requests live, use tools mid-call, escalate cleanly, and book work 24/7 — rolled out with the same shadow-then-act discipline. See it at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.