Anatomy of a Claude Agent: Architecture End to End
How a Claude agent is wired internally — model loop, context store, tool router, MCP layer, and control plane — with a diagram, code, and pitfalls.
Most teams discover the hard way that an agent is not a model with a system prompt taped to a few API calls. It is a small distributed system that happens to have a language model at its center. When a Claude agent stalls, loops, or burns through a budget, the cause is almost never the model — it is some seam in the architecture: a context store that grew unbounded, a tool router that returned ambiguous results, or a control loop that never decided when to stop. This post walks the full anatomy of an effective Claude agent so you can reason about each piece independently.
I will use the vocabulary of the Claude ecosystem in 2026 — the model loop, Model Context Protocol (MCP) servers, Agent Skills, and the orchestrator pattern — but the architecture generalizes. The goal is a mental model precise enough that, when something breaks at 2 a.m., you know which box on the diagram to open.
Key takeaways
- A Claude agent decomposes into five planes: model loop, context store, tool router, capability layer (MCP + Skills), and a control plane that owns stopping and budgets.
- The agent loop is just perceive → decide → act → observe repeated until a stop condition fires; most failures are missing stop conditions.
- Context is state, not a transcript — treat it as an explicit store you curate every turn, not an append-only log.
- Tools and MCP servers are the agent's hands; Skills are the instructions that teach Claude when and how to use them.
- The control plane (budgets, retries, guardrails) is what separates a demo from a production agent.
What an agent actually is
An AI agent is a system that uses a language model to choose its own sequence of actions toward a goal, observing the results of each action before deciding the next. The defining word is chooses: a workflow with hard-coded steps is not an agent, even if every step calls Claude. An agent earns the name when the model — not your code — decides what happens next.
Concretely, a Claude agent is a loop wrapped around the Messages API. Each iteration sends the current context to the model, the model responds either with a final answer or a request to call one or more tools, your runtime executes those tool calls, appends the results, and loops again. Everything else in the architecture exists to make that loop reliable, observable, and bounded.
The five planes, end to end
It helps to separate concerns into five planes, each with a single responsibility. The model loop drives decision-making. The context store holds the working state passed to the model. The tool router turns a model's tool request into a concrete execution. The capability layer — MCP servers plus Skills — is the catalog of what the agent can do and how. The control plane enforces budgets, retries, and safety. Here is how a single request flows through them.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Incoming goal"] --> B["Context store assembles state"]
B --> C["Model loop: Claude decides"]
C --> D{"Tool call requested?"}
D -->|No| E["Return final answer"]
D -->|Yes| F["Tool router resolves target"]
F --> G["MCP server / Skill executes"]
G --> H["Observation appended to context"]
H --> I{"Control plane: budget & stop check"}
I -->|Continue| C
I -->|Halt| E
Read the diagram as a cycle with one escape hatch. The control plane sits deliberately after the observation step, because that is the only safe place to ask "should we keep going?" — after you have spent tokens and learned something, before you spend more. Teams that put the budget check at the top of the loop tend to halt agents that were one cheap step from finishing.
The model loop in detail
The model loop is the heart, and its shape is simpler than people expect. On each turn you call Claude with the system prompt, the curated context, and the tool definitions. Claude returns content blocks; a tool_use block is a request, not an execution. Your runtime is responsible for executing it and returning a matching tool_result. Here is the minimal skeleton with the Agent SDK style:
while not done:
resp = client.messages.create(
model="claude-opus-4-8",
system=SYSTEM_PROMPT,
messages=context.render(),
tools=tool_catalog,
max_tokens=4096,
)
if resp.stop_reason == "tool_use":
for block in resp.tool_use_blocks():
result = router.execute(block.name, block.input)
context.add_tool_result(block.id, result)
else:
done = True
budget.charge(resp.usage)
if budget.exceeded() or context.turns > MAX_TURNS:
done = True
Notice that done can be set two ways: Claude decides it is finished, or the control plane forces a stop. Both paths must exist. An agent with only the first path will, on a bad day, loop forever calling the same tool with slightly different arguments.
Context as explicit state
The single biggest architectural lever is treating context as a managed store rather than an ever-growing transcript. With Claude's 1M-token window it is tempting to just append everything, but a bloated context degrades decision quality and cost long before you hit the limit. Effective agents curate: they summarize old tool results, drop stale observations, pin the goal and key constraints, and keep only the last few raw exchanges verbatim.
A good context store exposes operations like pin(fact), summarize(range), and evict(predicate). The control plane calls these between turns. The model never sees your bookkeeping — it sees a clean, relevant working set every time, which is exactly what keeps a long-running agent coherent over dozens of steps.
The control plane is where production lives
If the model loop is the heart, the control plane is the nervous system that keeps the agent from hurting itself. It owns three things no other plane should touch: budgets, retries, and guardrails. A budget is a running tally of tokens and wall-clock time with a hard ceiling. Retries are bounded re-attempts of failed tool calls with backoff, never unlimited. Guardrails are the checks that run before a risky action commits — a confirmation gate on an irreversible write, a rate limit on outbound calls, a deny-list of operations the agent simply may not perform.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
The reason to concentrate these in one plane is testability. When budgets, retries, and stops are scattered across prompts and tool bodies, you cannot reason about the agent's worst case. When they live in one control plane, you can write a test that asserts "this agent never spends more than N tokens" and actually trust it. That single property — a provable upper bound on cost and blast radius — is most of what stands between a demo and something you would run against production data.
Observability: design for replay
An agent you cannot replay is an agent you cannot debug. Bake in a trace from the start: a per-run record of every turn's input context, the model's decision, the tool calls, their results, and the token usage. Key it by a run ID and store it somewhere queryable. When an agent does something inexplicable, you do not reason about it abstractly — you open the trace and watch, turn by turn, where the decision went sideways. Almost every "the model is broken" report turns out, on replay, to be a tool that returned ambiguous data or a context that lost the goal.
Common pitfalls
- No stop condition. The most frequent production incident. Always cap turns and total tokens, and add a heuristic for "no progress" (e.g., the same tool called with near-identical input twice).
- Treating context as a log. Appending every raw tool result eventually drowns the goal. Summarize and evict on a schedule.
- Routing ambiguity. If two tools could plausibly satisfy a request, Claude will sometimes pick the wrong one. Give tools sharp, non-overlapping descriptions.
- Silent tool failures. A tool that returns an empty string on error teaches the model nothing. Return structured errors the model can reason about and retry.
- Mixing planes. Putting retry logic inside a tool, budget logic inside the prompt, and stopping logic nowhere makes the system impossible to debug. Keep responsibilities separated.
Build the skeleton in five steps
- Stand up the bare model loop calling the Messages API with one trivial tool and a hard turn cap.
- Add a context store object with explicit
render(),add_tool_result(), andsummarize()methods. - Introduce a tool router that maps tool names to executors and normalizes errors into structured results.
- Attach the capability layer: register MCP servers and load relevant Skills so Claude knows what is available.
- Wrap it all in a control plane that charges a token budget, enforces stop conditions, and logs every turn for replay.
Which plane owns what?
| Concern | Owned by | Anti-pattern if misplaced |
|---|---|---|
| What to do next | Model loop | Hard-coding in your runtime kills agency |
| What the model sees | Context store | Unbounded append degrades quality |
| How a tool runs | Tool router | Logic in the prompt is untestable |
| When to stop | Control plane | No owner means infinite loops |
Frequently asked questions
How is an agent different from a chained workflow?
A workflow has a fixed graph of steps you author; an agent lets the model choose the next step at runtime based on observations. Workflows are more predictable and cheaper; agents handle open-ended tasks. Many production systems are hybrids: deterministic scaffolding with an agentic core for the genuinely uncertain parts.
Where do MCP servers fit in this architecture?
MCP servers live in the capability layer, behind the tool router. The router translates a Claude tool_use request into an MCP call, the server returns structured data, and that data becomes an observation in the context store. Claude never talks to your database directly — it goes through the MCP boundary, which is what makes the system auditable.
Do I need a multi-agent architecture from the start?
Almost never. Multi-agent runs typically consume several times more tokens than a single agent and add coordination complexity. Get the single-agent loop solid first; reach for orchestrator-subagent patterns only when a task genuinely decomposes into parallel, independent subtasks.
Bringing agentic AI to your phone lines
CallSphere applies these same agentic-AI patterns to voice and chat — multi-agent assistants that answer every call and message, use tools mid-conversation, and book work 24/7. See it live at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.