MCP Agent Architecture: How Claude Reaches Production

The first time an agent you built actually modifies a production record, the abstraction stops being academic. A retrieval-augmented chatbot that summarizes documents fails softly; an agent that issues a refund, reschedules a delivery, or closes a ticket fails loudly and expensively. The difference between a demo and a system that can be trusted near production data is almost entirely architectural. This post walks through how a Claude agent that reaches real systems is actually wired together, layer by layer, so you can see where reliability is won or lost.

The popular mental model — "the LLM calls some tools" — hides the parts that matter. In practice there are at least five distinct planes: the model, the agent harness that runs the loop, the protocol that carries tool calls, the servers that expose your systems, and the systems themselves. Each plane has its own failure modes, its own trust boundary, and its own place to put guardrails. Understanding the seams between them is what lets you reason about a production agent instead of hoping it behaves.

The five planes of a production MCP agent

At the top sits the model — Claude Opus 4.8, Sonnet 4.6, or Haiku 4.5 depending on how much reasoning the task needs. The model never touches your database directly. It emits structured intentions: "I want to call create_refund with these arguments." Everything below the model exists to turn that intention into a real effect and feed the result back.

Beneath the model is the agent harness — the loop that sends a prompt, receives the model's tool-call request, dispatches it, captures the result, appends it to the conversation, and re-invokes the model until it produces a final answer. The Claude Agent SDK provides this loop with the ergonomics already worked out: context management, parallel subagents, hooks, and a clean tool-dispatch interface. The harness is where you decide concurrency, retries, timeouts, and which calls require a human in the loop.

The third plane is the protocol. Model Context Protocol is an open standard, introduced by Anthropic in November 2024, that defines how an agent discovers and invokes external tools and data sources through a uniform interface, so the same agent can talk to many systems without bespoke glue for each one. Below the protocol live the MCP servers — small processes that translate a generic tool call into a real action against one specific system: your Postgres database, your Stripe account, your internal HTTP API. The fifth plane is the systems of record themselves, where the actual state lives and where mistakes have consequences.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

How a single request flows end to end

Consider a concrete request: a customer asks an agent to cancel an order and refund the shipping. Trace what happens. The harness builds a prompt containing the system instructions, the available tool schemas (pulled from each connected MCP server at startup), and the user message. Claude reads it and decides it needs to look up the order first, then issue the refund. It emits a tool call. The harness validates the arguments against the schema, dispatches the call over MCP to the orders server, receives structured JSON, and appends it. Claude reads the order state, confirms eligibility, and emits the refund call. Only after the server returns a confirmation does the model compose a human reply.

flowchart TD
  A["User request"] --> B["Agent harness builds prompt + tool schemas"]
  B --> C["Claude emits tool call"]
  C --> D{"Args valid & allowed?"}
  D -->|No| E["Reject, return error to model"]
  D -->|Yes| F["Dispatch over MCP to server"]
  F --> G["Server acts on system of record"]
  G --> H["Structured result appended to context"]
  H --> I{"Task complete?"}
  I -->|No| C
  I -->|Yes| J["Claude composes final answer"]

Notice the validation diamond between the model and the server. That gate is the single most important architectural element when an agent reaches production. The model proposes; the harness disposes. You never let a raw model intention reach a system of record without a deterministic check in between, because the model's judgment, however good, is not a permission system.

Where the trust boundaries sit

Every arrow that crosses a plane is a trust boundary, and each one deserves an explicit policy. The boundary between the model and the harness is where you enforce that only declared tools can be called and only with schema-valid arguments. The boundary between the harness and the MCP server is where authentication lives — the server holds the real credentials, not the model, so a leaked transcript never leaks a database password. The boundary between the server and the system of record is where you scope permissions narrowly: an MCP server for read-only analytics should connect with a read-only role, full stop.

This layering is what makes the architecture defensible. Because credentials live in the server plane and never enter the model's context, prompt injection in a retrieved document cannot exfiltrate them. Because the harness validates every call, a hallucinated tool name or malformed argument is caught before it does anything. Because each server maps to one system with a scoped role, the blast radius of any single compromised component is bounded. Security here is not a feature you bolt on; it is a property of putting the right thing in the right plane.

State, memory, and the context window

A production agent rarely finishes in one turn. It may loop dozens of times, accumulating tool results that quickly fill even a generous context window. Claude Code and the Agent SDK support a 1M-token context, but treating that as infinite is a mistake — long contexts cost money and dilute attention. The architecture therefore needs a context-management strategy: summarize completed sub-tasks, drop verbose intermediate tool output once its conclusion is captured, and keep a compact running state of what has been decided and what remains.

For genuinely large jobs, the cleaner pattern is to push state out of the context entirely. The harness can maintain an explicit task ledger — a structured record of steps, their status, and their results — and feed the model only the relevant slice each turn. This keeps the model focused and makes the run resumable: if a turn fails, you re-hydrate from the ledger rather than replaying the whole conversation. The model becomes a stateless reasoner over an externally managed state machine, which is exactly what you want when correctness matters.

Single agent versus orchestrated subagents

When one agent's responsibilities sprawl, the architecture often splits into an orchestrator and specialized subagents. The orchestrator owns the plan; each subagent owns a narrow domain with its own tools and its own slice of context. Claude Code runs parallel subagents natively, which is powerful — but multi-agent runs typically consume several times more tokens than a single agent, because each subagent carries its own context and the orchestrator pays to coordinate them. Reach for orchestration when sub-tasks are genuinely independent and parallelizable, not as a default.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

Architecturally, subagents are the same five planes nested one level down: each has a harness, talks MCP, and respects the same trust boundaries. The orchestrator's job is decomposition and synthesis, not direct system access — it should rarely call production tools itself. Keeping that discipline means the dangerous, side-effecting calls happen in a small number of well-audited leaf agents, which is far easier to review and secure than a sprawl of agents that all hold production credentials.

Frequently asked questions

Does the model ever connect to my database directly?

No. The model only emits structured tool-call intentions. An MCP server holds the real connection and credentials and performs the action. This separation is deliberate: it keeps secrets out of the model's context and gives you a deterministic place to validate and authorize every call before it reaches a system of record.

What exactly does the MCP layer add over just calling functions?

MCP standardizes discovery, schemas, and invocation across many systems, so one agent can talk to dozens of tools through a uniform interface and you can reuse the same server across multiple agents. It also cleanly separates the agent that reasons from the server that acts, which is the boundary where auth and permission scoping naturally belong.

How do I keep a long-running agent from filling the context window?

Externalize state. Maintain a task ledger in the harness, summarize completed sub-tasks, and drop verbose tool output once its conclusion is recorded. Feed the model only the relevant slice each turn. This controls cost, preserves the model's attention, and makes runs resumable after a failure.

When should I use multiple agents instead of one?

Use orchestrated subagents when sub-tasks are genuinely independent and benefit from parallelism or domain-specific context and tools. Because multi-agent runs cost several times more tokens, prefer a single well-structured agent until a task clearly outgrows it.

Bringing this architecture to your phone lines

CallSphere builds on exactly these layered patterns for voice and chat — agents that answer every call, validate every action against real systems through MCP-style tooling, and book work around the clock without leaking credentials or guessing at permissions. See the architecture in action at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

MCP Agent Architecture: How Claude Reaches Production

The five planes of a production MCP agent

How a single request flows end to end

Where the trust boundaries sit

State, memory, and the context window

Single agent versus orchestrated subagents

Frequently asked questions

Does the model ever connect to my database directly?

What exactly does the MCP layer add over just calling functions?

How do I keep a long-running agent from filling the context window?

When should I use multiple agents instead of one?

Bringing this architecture to your phone lines

Try CallSphere AI Voice Agents

Related Articles You May Like

Where Claude Code GTM engineering is heading next

Where Claude Cowork is heading and how to prepare

Measuring Claude Cowork success: metrics that prove it

How to measure success of Claude Code GTM workflows

Claude Cowork walkthrough: from problem to shipped

End-to-end Claude Code GTM workflow: a real rebuild