Skip to content
Agentic AI
Agentic AI9 min read0 views

Claude Enterprise Architecture: How It Fits Together

Trace a request through Claude's enterprise architecture: gateway, model tier, MCP tools, memory, and the control plane that keeps it secure at scale.

The first time a team moves Claude from a clever prototype to something running across an enterprise, the question stops being "can the model do this?" and becomes "where does each piece live, and who is allowed to touch what?" A demo is one process talking to one API. A production deployment is a layered system: a model that reasons, a set of tools it can call, a memory it can read and write, an identity layer that decides what it sees, and an audit trail that proves what happened. Understanding how these layers fit together end to end is the difference between an agent you trust on a Tuesday afternoon and one you trust at 2 a.m. on a release night.

This post walks the full architecture of a Claude deployment built for scale. We will not cover any one tool in isolation; instead we trace a single request from the moment a user types it to the moment Claude returns an answer, naming every component it passes through and explaining why that component exists.

Key takeaways

  • A production Claude deployment has five planes: a request gateway, the model (Opus 4.8, Sonnet 4.6, Haiku 4.5), a tool layer (MCP servers and Skills), a state layer (memory and context), and a control plane (identity, policy, audit).
  • The model is stateless between calls; everything an agent "remembers" is reconstructed into context on each turn by the orchestration harness.
  • MCP servers are the only sanctioned door to your data, which is what makes them the right place to enforce auth, schemas, and rate limits.
  • Routing requests across the Opus/Sonnet/Haiku tier by task is the single biggest lever on both cost and latency.
  • Audit logging belongs at the gateway, not the model — capture every tool call and its arguments, because that is your forensic record.

What does "the architecture" actually mean here?

When engineers say "Claude's enterprise architecture," they rarely mean the neural network itself. They mean the surrounding system that turns a language model into a dependable service. The model is a stateless function: text and tool definitions go in, text and tool-call requests come out. It holds no memory of the previous request and no privileged access to your systems. Everything that makes it feel like a coworker — knowing your codebase, remembering a prior decision, fetching a customer record — is supplied by the layers around it.

This statelessness is a feature, not a limitation. Because the model carries no hidden state, every behavior is reproducible from its inputs, and every input is something you control and can log. The architecture's job is to assemble the right inputs (context, tools, instructions) for each turn and to safely act on the model's outputs.

A useful definition to anchor on: an enterprise Claude deployment is a request-processing pipeline in which a stateless reasoning model is wrapped by an orchestration harness that injects context, mediates tool access through MCP, and enforces identity and audit at a control plane. Hold that sentence in mind as we trace a request through it.

How a single request flows end to end

Let us follow one prompt — "Refund order #4821 and email the customer" — through the whole stack. It enters at the gateway, which authenticates the caller and attaches their identity. The orchestration harness then builds the context window: the system prompt, any relevant Skills, recent conversation, and the catalog of tools this user is permitted to use. That bundle goes to the model. The model decides it needs two tools, emits structured tool-call requests, and the harness routes each one to the correct MCP server. Results flow back, the model composes a reply, and the gateway logs the whole exchange before returning it.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["User request"] --> B["Gateway: authN & identity"]
  B --> C["Orchestration harness builds context"]
  C --> D{"Claude: tools needed?"}
  D -->|No| E["Model answers directly"]
  D -->|Yes| F["Route tool call to MCP server"]
  F --> G["Policy & schema check"]
  G --> H["System of record returns data"]
  H --> C
  E --> I["Gateway logs & returns answer"]

The loop in the diagram is the heart of it. The model never reaches your database; it asks the harness to, and the harness asks an MCP server, which is the one place policy and schema validation live. That indirection is what makes the system auditable. Each arrow back into the harness is also a chance to re-check budgets, redact fields, or halt a run that is misbehaving.

The model tier: Opus, Sonnet, and Haiku as one fleet

Treating "Claude" as a single model is the most common architectural mistake at scale. In production you run a fleet. Claude 4.x gives you three tiers — Opus 4.8 for the hardest reasoning, Sonnet 4.6 for the everyday balance of speed and capability, and Haiku 4.5 for high-volume, latency-sensitive classification and extraction. The harness should route each turn to the cheapest tier that can do the job.

A practical pattern is a router step: a fast Haiku call classifies the incoming request, and that classification picks the model for the real work. A simple ticket triage goes to Haiku; a multi-step refactor or a financial reconciliation goes to Opus. This single decision often moves blended cost more than any prompt optimization, because the expensive turns become a small fraction of total volume.

ModelBest forTrade-off
Opus 4.8Deep reasoning, long multi-step agents, hard codeHighest cost & latency
Sonnet 4.6General agents, most tool-using workflowsBalanced default
Haiku 4.5Classification, extraction, routing, high QPSLess depth on hard tasks

Tools and state: MCP servers and the memory layer

Tools reach the model through the Model Context Protocol. Model Context Protocol is an open standard, introduced in November 2024, that connects Claude to external tools and data through MCP servers, each exposing a typed set of callable functions and resources. Architecturally, an MCP server is a façade in front of a real system — a database, a CRM, a billing API — that presents only the operations you want the agent to have, with schemas the model can read and arguments you can validate.

State is separate from tools. The model is stateless, so "memory" is something the harness materializes: a vector store for retrieval, a key-value store for durable facts, and the running conversation buffer. On each turn the harness decides what slice of that memory belongs in context. Getting this slice right is its own discipline, but architecturally the point is simple: memory is a service the harness queries, never something baked into the model.

The control plane: identity, policy, and audit

The control plane is what makes the difference between "an agent" and "an agent we can put in front of an auditor." It has three responsibilities. Identity: every request carries the acting user's identity, and tool access is scoped to what that user could do by hand. Policy: rules that gate which tools can run, with what arguments, and within what budget. Audit: an immutable record of every prompt, every tool call, and every result.

Put all three at the gateway and MCP boundary, never inside the prompt. A prompt instruction like "don't refund more than $500" is a suggestion; a policy check at the MCP server that rejects the call is a control. The architecture earns its keep precisely when the model tries something it should not — and a real boundary, not a sentence, stops it.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

{
  "tool": "issue_refund",
  "args": { "order_id": "4821", "amount_cents": 2999 },
  "actor": "agent:support-bot",
  "on_behalf_of": "user:csr-114",
  "policy": { "max_refund_cents": 50000, "requires_approval_over_cents": 20000 }
}

The snippet shows the shape of a mediated tool call: the action, its arguments, who is acting and for whom, and the policy evaluated before execution. This object is what you log and what your control plane reasons about.

Ship a reference architecture in 6 steps

  1. Stand up a gateway that authenticates callers and stamps every request with an identity and a trace ID.
  2. Put the orchestration harness behind it; have it assemble context, Skills, and the permitted tool catalog per request.
  3. Wrap each backend system in an MCP server with typed schemas — never let the harness hit a database directly.
  4. Add a router that classifies requests with Haiku and dispatches to Opus or Sonnet by difficulty.
  5. Enforce policy and budgets at the MCP boundary, and emit a structured audit event for every tool call.
  6. Wire memory (retrieval + durable store) as a service the harness queries, with redaction before anything enters context.

Common pitfalls

  • Letting the harness query databases directly. You lose the one chokepoint where auth and schema validation belong. Always go through MCP.
  • Running everything on Opus. It works in a demo and quietly triples your bill at scale. Route by task.
  • Putting guardrails in the prompt. Prompt-level limits are advisory. Enforce hard limits in the control plane.
  • Treating memory as model state. Stuffing unbounded history into context degrades reasoning and cost. Curate the slice per turn.
  • Logging only the final answer. Without the intermediate tool calls and arguments, you cannot reconstruct what the agent actually did.

Frequently asked questions

Is Claude stateful across requests?

No. The model is stateless. Anything that looks like memory is reconstructed into the context window by the orchestration harness on every turn, which is why your inputs are fully auditable and reproducible.

Where should authentication and authorization live?

At the gateway for the caller's identity, and at the MCP server boundary for each tool action. Scope tools to what the acting user could do manually, and validate arguments against a schema before execution.

Do I need all three model tiers?

Most serious deployments benefit from routing. Use Haiku 4.5 for high-volume classification, Sonnet 4.6 as the default, and Opus 4.8 for the genuinely hard turns. Routing is usually the largest single lever on cost and latency.

What is the minimum audit record?

For each turn: the resolved prompt, the model and tier used, every tool call with its arguments and result, the acting identity, and a trace ID linking them. That set lets you reconstruct any agent action after the fact.

Bringing agentic AI to your phone lines

CallSphere builds these same layered agent architectures for voice and chat — assistants that authenticate the caller, reason over your systems through tools, and book real work around the clock. See the architecture in action at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.