Multi-agent walkthrough: from messy problem to shipped agent

Architecture diagrams make multi-agent systems look clean. Real builds are messy: a vague problem, an unclear owner, a tool that doesn't quite return what you need, and an eval suite you keep meaning to write. This post walks through a realistic end-to-end build on Claude — not a toy, but the kind of internal system a mid-sized engineering org actually ships — so you can see every decision in the order it really gets made.

The scenario: a B2B software company's support team is drowning in inbound technical tickets. Many are answerable from existing docs, logs, and past tickets, but each one requires hopping across three systems and reading carefully. Leadership wants an agent that triages and drafts resolutions, leaving humans to approve and send. We'll build it from that fuzzy ask to a shipped system.

Step 1: turn the fuzzy ask into a scoped problem

The first job is not engineering — it's scoping. "Build an agent that handles support" is unshippable. We narrow it: the agent will read an incoming ticket, gather relevant context from docs, recent logs, and similar past tickets, and produce a draft resolution plus a confidence signal. It will not send anything to customers; a human approves. That single constraint — draft, don't send — collapses most of the risk and makes the project shippable in weeks instead of quarters.
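The scope above can be written down as a data contract before any agent exists. This is a minimal sketch, not the system's actual schema: the names `DraftResolution` and `Outcome`, the field names, and the 0.0 to 1.0 confidence scale are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """The only two places a finished run can go; both end with a person."""
    QUEUE_FOR_HUMAN_APPROVAL = "queue_for_human_approval"
    FLAG_FOR_FULL_HUMAN_HANDLING = "flag_for_full_human_handling"


@dataclass(frozen=True)
class DraftResolution:
    """What the agent is allowed to produce: a draft, never a sent message."""
    ticket_id: str
    draft_text: str
    confidence: float    # assumed scale: 0.0 (no support found) to 1.0
    sources: tuple = ()  # identifiers of the docs, log events, and tickets used

    def __post_init__(self) -> None:
        # Reject out-of-range scores early so routing never sees garbage.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
```

Note what is missing: there is no `send` method and no outcome that reaches a customer. Under these assumptions, the draft-don't-send rule is a property of the types rather than a line in a prompt.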

We also define what success means before writing a line: a draft is "good" if a support engineer would send it with minor or no edits. That definition becomes our eval target later, so pinning it down now is doing future-us a favor. Vague success criteria are how agent projects die in endless tuning.

Step 2: decide single-agent or multi-agent

The honest default is single-agent. A multi-agent design uses several times more tokens and adds coordination complexity, so you should be able to justify it. Here we can: the work has three genuinely separate research tasks (searching documentation, querying logs, and finding similar tickets), each of which benefits from focused context and can run in parallel. A single agent juggling all three would carry a bloated context and slow down. So we choose an orchestrator with three specialized subagents, plus a final drafting step.

flowchart TD
  A["Incoming ticket"] --> B["Orchestrator agent"]
  B --> C["Docs subagent (read-only)"]
  B --> D["Logs subagent (read-only)"]
  B --> E["Past-tickets subagent (read-only)"]
  C --> F["Orchestrator synthesizes context"]
  D --> F
  E --> F
  F --> G["Draft + confidence score"]
  G --> H{"Confidence high?"}
  H -->|Yes| I["Queue for human approval"]
  H -->|No| J["Flag for full human handling"]

Note what the topology encodes: the three research subagents are strictly read-only, the orchestrator never touches a customer-facing system, and the only output is a draft a human reviews. The architecture itself carries the safety guarantees we scoped in step one.

Step 3: build the tools before the agents

A common mistake is starting with prompts. We start with tools, because an agent's competence is bounded by the quality of what it can call. Using MCP servers, we expose three capabilities: a documentation search that returns ranked passages with source links, a log query that takes a customer ID and time window and returns structured events, and a similar-ticket search over the resolved-ticket archive. Each tool gets a precise description and schema — written the way you'd write a public API, because to Claude that's what it is.
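As a rough illustration of "written the way you'd write a public API", here is what a definition for the log query might look like as a plain dictionary with a JSON Schema for its inputs. The tool name `query_logs`, the parameter names, and the `input_schema` key are assumptions for this sketch; the exact field names depend on the MCP server or SDK you use.

```python
# Hypothetical tool definition; names and wording are illustrative only.
LOG_QUERY_TOOL = {
    "name": "query_logs",
    "description": (
        "Return structured log events for one customer within a time window. "
        "Use it to check what the customer's system actually did around the "
        "time of the reported problem. Does not modify anything."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "Customer identifier taken from the ticket.",
            },
            "start": {
                "type": "string",
                "description": "Start of the window, ISO 8601 timestamp.",
            },
            "end": {
                "type": "string",
                "description": "End of the window, ISO 8601 timestamp.",
            },
        },
        "required": ["customer_id", "start", "end"],
    },
}


def missing_arguments(tool: dict, arguments: dict) -> list[str]:
    """List the required parameters that a call failed to supply."""
    required = tool["input_schema"].get("required", [])
    return [name for name in required if name not in arguments]
```

The description says when to use the tool and what it will not do, which is the same information a human integrator would look for in API docs.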

We test each tool in isolation first, calling it directly with realistic inputs and checking the outputs are clean, ranked, and well-described. A surprising share of "the agent is dumb" complaints trace back to a tool that returns noisy or ambiguous data. Fixing the tool fixes the agent. Only once all three tools behave do we let an agent near them.
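Testing a tool in isolation can be as simple as a function that inspects its output and reports problems. The sketch below assumes the documentation search returns a list of dictionaries with `passage`, `score`, and `source_url` keys; those key names and the function `check_search_results` are hypothetical.

```python
def check_search_results(results: list[dict]) -> list[str]:
    """Return human-readable problems found in one search response."""
    if not results:
        return ["no results returned"]

    problems = []
    scores = [r.get("score") for r in results]
    if any(s is None for s in scores):
        problems.append("a result is missing its score")
    elif scores != sorted(scores, reverse=True):
        problems.append("results are not ranked by descending score")

    for i, r in enumerate(results):
        if not r.get("source_url"):
            problems.append(f"result {i} has no source link")
        if not (r.get("passage") or "").strip():
            problems.append(f"result {i} has an empty passage")
    return problems
```

Run it against the real tool with a handful of realistic queries; an empty list for each query is the bar for letting an agent call the tool.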

A concrete example of where this pays off: our first log-query tool returned raw event dumps — hundreds of lines of JSON per request. The agent technically had the data but drowned in it, producing vague drafts that referenced the wrong events. We didn't touch a prompt. We changed the tool to accept a severity filter and return a compact, summarized event list with timestamps and a one-line description each. Draft quality jumped immediately, because the agent now reasoned over signal instead of noise. The lesson generalizes: shaping tool output to what the agent actually needs is often the highest-leverage change you can make, and it's invisible if you only ever look at prompts.
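A sketch of that reshaping step is below. It is not the actual tool; the event keys (`timestamp`, `severity`, `message`), the severity levels, and the defaults are assumptions, and timestamps are assumed to be ISO 8601 strings in one timezone so that they sort correctly as text.

```python
# Assumed severity levels, lowest to highest.
SEVERITY_ORDER = {"debug": 0, "info": 1, "warning": 2, "error": 3, "critical": 4}


def summarize_events(events: list[dict], min_severity: str = "warning",
                     limit: int = 20) -> list[dict]:
    """Filter raw events by severity and reduce each to a compact record."""
    threshold = SEVERITY_ORDER[min_severity]
    compact = []
    for event in sorted(events, key=lambda e: e["timestamp"]):
        severity = event.get("severity", "info")
        if SEVERITY_ORDER.get(severity, 0) < threshold:
            continue  # drop noise below the requested severity
        message = event.get("message", "")
        first_line = message.splitlines()[0] if message else ""
        compact.append({
            "timestamp": event["timestamp"],
            "severity": severity,
            "summary": first_line[:120],  # one-line description only
        })
    return compact[:limit]
```

Stack traces and payload dumps stay in the log store; the agent sees only the timestamp, the severity, and one line per event.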

Step 4: wire the orchestrator and subagents

Now the orchestration. The orchestrator's system prompt describes its job: given a ticket, delegate research to the three subagents, synthesize their findings, and produce a draft with a confidence score. Each subagent gets a tight prompt scoped to its one tool and its one job — the docs subagent doesn't know logs exist. This focus keeps each agent's context lean and its behavior predictable.
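One way to keep that scoping honest is to declare it as data, so that each subagent is handed exactly one tool and nothing else. The prompts, subagent names, and tool names below are placeholders invented for this sketch, not the prompts used in the build.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubagentSpec:
    name: str
    system_prompt: str
    tool: str  # exactly one tool per subagent


# Hypothetical specs; the wording of real prompts would be more detailed.
SUBAGENTS = (
    SubagentSpec(
        name="docs",
        system_prompt="You search product documentation for passages relevant "
                      "to a support ticket. Return passages with source links.",
        tool="search_docs",
    ),
    SubagentSpec(
        name="logs",
        system_prompt="You query recent logs for one customer and report the "
                      "events that relate to the ticket.",
        tool="query_logs",
    ),
    SubagentSpec(
        name="past_tickets",
        system_prompt="You find resolved tickets similar to the current one "
                      "and report how they were resolved.",
        tool="search_resolved_tickets",
    ),
)


def tools_visible_to(subagent_name: str) -> list[str]:
    """Return the tools a given subagent is allowed to call."""
    return [s.tool for s in SUBAGENTS if s.name == subagent_name]
```

Because the docs subagent's spec names only `search_docs`, it has no way to learn that a log tool exists, which is the property the prompts are meant to preserve.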

We run the subagents in parallel since their tasks are independent, then have the orchestrator combine results. The confidence score is derived from concrete signals — did the subagents find directly relevant sources, or did they come back thin? Low confidence routes the ticket straight to a human rather than producing a shaky draft. This is the difference between a system that's helpful and one that quietly wastes reviewers' time on bad drafts.
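The fan-out and routing can be sketched with `asyncio.gather`. Everything here is a stand-in: the three subagent functions are stubs that return canned findings instead of calling a model, and the confidence formula (the fraction of research streams that found something directly relevant) and the 0.6 threshold are assumptions chosen to keep the example small.

```python
import asyncio


async def docs_subagent(ticket: str) -> list[dict]:
    await asyncio.sleep(0)  # stands in for a model call plus tool use
    return [{"source": "docs/webhooks#retries", "relevant": True}]


async def logs_subagent(ticket: str) -> list[dict]:
    await asyncio.sleep(0)
    return [{"source": "log-event-1", "relevant": True}]


async def past_tickets_subagent(ticket: str) -> list[dict]:
    await asyncio.sleep(0)
    return []  # came back thin


async def research(ticket: str) -> dict[str, list[dict]]:
    """Run the three independent research tasks concurrently."""
    names = ("docs", "logs", "past_tickets")
    results = await asyncio.gather(
        docs_subagent(ticket), logs_subagent(ticket), past_tickets_subagent(ticket)
    )
    return dict(zip(names, results))


def confidence_from(findings: dict[str, list[dict]]) -> float:
    """Fraction of streams with at least one directly relevant source."""
    hits = sum(1 for items in findings.values() if any(i["relevant"] for i in items))
    return hits / len(findings)


def route(confidence: float, threshold: float = 0.6) -> str:
    if confidence >= threshold:
        return "queue_for_human_approval"
    return "flag_for_full_human_handling"


if __name__ == "__main__":
    found = asyncio.run(research("Webhooks stopped firing after upgrade"))
    print(route(confidence_from(found)))
```

The point of deriving the score from findings rather than asking the model how sure it feels is that the signal can be inspected and tested on its own.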

Step 5: build the eval suite that gates release

Before this touches a real queue, we build the eval. We assemble a dataset of a few dozen historical tickets where we know the correct resolution. We run the system over them and grade the drafts two ways: a deterministic check that the draft cites real sources, and an LLM-as-judge grade comparing the draft to the known-good resolution. We also track how often the confidence routing is right — does it escalate the genuinely hard tickets?
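The deterministic half of that grading is easy to sketch. The example assumes drafts cite sources inline as `[source: <id>]`, which is a convention invented here, and that each eval case records whether the ticket was escalated and whether it should have been. The LLM-as-judge grade is left out because it depends on the model and rubric you choose.

```python
import re


def cited_sources(draft: str) -> set[str]:
    """Extract source ids written as [source: <id>] in a draft."""
    return set(re.findall(r"\[source:\s*([^\]\s]+)\s*\]", draft))


def cites_only_real_sources(draft: str, known_sources: set[str]) -> bool:
    """True if the draft cites at least one source and none are made up."""
    cited = cited_sources(draft)
    return bool(cited) and cited <= known_sources


def escalation_accuracy(cases: list[dict]) -> float:
    """Share of eval cases where the routing decision matched the label."""
    correct = sum(1 for c in cases if c["escalated"] == c["should_escalate"])
    return correct / len(cases)
```

A draft with no citations fails the check on purpose: an unsupported answer is exactly what a reviewer should not have to catch by hand.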

This eval becomes a gate. We set a bar, say a strong majority of drafts judged send-ready with minor or no edits, and the system doesn't ship until it clears it. More importantly, the eval runs in CI, so any future prompt or tool change that regresses quality gets caught before it reaches production. Without this, you have a system that works the day you build it and silently degrades after.
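In CI, the gate reduces to a pass rate compared against a threshold, with a non-zero exit code failing the build. This is a sketch under assumptions: the 0.8 bar is a placeholder rather than a recommended number, and `grades` is assumed to hold one boolean per eval ticket (send-ready or not).

```python
def gate(grades: list[bool], required_pass_rate: float = 0.8) -> tuple[bool, float]:
    """Return (passed, pass_rate) for a list of per-ticket grades."""
    if not grades:
        raise ValueError("the eval produced no grades; refusing to pass")
    pass_rate = sum(grades) / len(grades)
    return pass_rate >= required_pass_rate, pass_rate


def main(grades: list[bool]) -> int:
    """Exit code for CI: 0 lets the change through, 1 blocks it."""
    passed, pass_rate = gate(grades)
    print(f"send-ready rate: {pass_rate:.0%}")
    return 0 if passed else 1
```

Treating an empty result as an error matters: an eval job that silently runs zero cases should block the release, not wave it through.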

Step 6: ship narrow, observe, then widen

We don't flip it on for all tickets. We launch on one product area with one supportive team, with every run producing a full transcript and a dashboard tracking draft acceptance rate, escalation accuracy, and token spend per ticket. For the first weeks, a human reviews every draft and we read transcripts daily, feeding what we learn back into prompts and tool descriptions.
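The three dashboard numbers can be computed from one record per run. The `RunRecord` fields below are assumptions about what gets logged; in particular, `should_have_escalated` is taken to be a reviewer's judgment recorded after the fact.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    ticket_id: str
    escalated: bool               # agent flagged it for full human handling
    should_have_escalated: bool   # reviewer's judgment, recorded afterwards
    accepted: Optional[bool]      # None when no draft was reviewed
    tokens: int                   # total tokens spent on this ticket


def dashboard(runs: list[RunRecord]) -> dict:
    """Aggregate acceptance rate, escalation accuracy, and token spend."""
    drafted = [r for r in runs if not r.escalated]
    acceptance = (
        sum(bool(r.accepted) for r in drafted) / len(drafted) if drafted else None
    )
    return {
        "draft_acceptance_rate": acceptance,
        "escalation_accuracy": sum(
            r.escalated == r.should_have_escalated for r in runs
        ) / len(runs),
        "tokens_per_ticket": sum(r.tokens for r in runs) / len(runs),
    }
```

Acceptance is measured only over tickets that produced a draft, so a system that escalates everything cannot look good on that metric by default; escalation accuracy catches that case instead.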

As the acceptance rate proves out and the failure modes become familiar, we widen to more product areas and more of the team. We never grant the agent send access — the human-approval boundary stays. What started as a fuzzy "handle support with AI" is now a shipped, evaluated, observable multi-agent system that measurably saves reviewer time. The path there ran through scoping and tools and evals far more than through clever prompting.

One last lesson from this build worth carrying forward: nearly every hard problem showed up before the model did. The fuzzy scope, the noisy log tool, the missing definition of success, the absent eval — none of those are model limitations, and none would have been fixed by a better prompt. They were engineering and product decisions. The model was the easy part. Internalizing that ordering is what separates teams who ship reliable agents from teams who stay stuck tuning prompts on a foundation that was never going to hold. Build the foundation first, and the model will do its job.

Frequently asked questions

How do you decide whether a problem needs multiple agents?

Default to a single agent and require justification for more. Multi-agent makes sense when the work splits into genuinely separate subtasks that benefit from focused context or parallel execution — like distinct research streams. If one agent with a clear prompt and good tools can do the job, the extra token cost and coordination complexity of multi-agent aren't worth it.

Why build tools before agents?

Because an agent can only be as good as the tools it calls. Noisy, ambiguous, or poorly described tools cause more agent failures than weak reasoning does. Building and testing each tool in isolation first means that when you wire up agents, you're debugging orchestration rather than chasing data-quality problems disguised as model problems.

What makes the eval suite the most important step?

The eval defines and enforces what "working" means, and it runs in CI so quality can't silently regress. Without it, you have a system that's right the day you ship it and unknowable thereafter. With it, every change is gated against real tasks, and you can widen rollout based on measured acceptance rather than optimism.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
