Skip to content
Agentic AI
Agentic AI8 min read4 views

Inside Claude Agent Architecture: How the Pieces Fit

An end-to-end look at how a Claude agent is wired in 2026 — the model loop, context window, tools, MCP servers, skills, and subagents working together.

The first time you watch a Claude agent debug a failing test, fetch a schema over MCP, write a patch, and re-run the suite — all without you touching anything — it looks like magic. It isn't. Underneath is a surprisingly legible architecture: a model running in a loop, a context window it reads and writes, a registry of tools it can call, and a small number of orchestration primitives that decide what happens next. The teams who ship reliable agents in 2026 are the ones who understand that architecture well enough to reason about where things break. This post walks the whole thing end to end.

What an agent actually is under the hood

An agent, stripped to its core, is a language model placed inside a control loop with access to tools and a durable record of what it has done. An agentic system is a loop in which a model repeatedly observes state, decides on an action, executes that action through a tool, and feeds the result back into its own context until a goal is reached or a stop condition fires. Everything else — Claude Code, the Agent SDK, multi-agent orchestration — is scaffolding around that single idea.

With Claude specifically, the model at the center is one of the Claude 4.x family: Opus 4.8 when you need the deepest reasoning, Sonnet 4.6 for the everyday workhorse balance of speed and capability, Haiku 4.5 for cheap high-volume steps. The loop hands the model a context window — up to a million tokens in Claude Code — containing the system prompt, the task, the conversation so far, and the tool results accumulated along the way. The model emits either a final answer or a tool call. If it's a tool call, the harness executes it, appends the result, and runs the model again. That cycle is the heartbeat of every agent you'll build.

The context window as working memory

The single most important architectural surface is the context window, because it is the only thing the model can see. There is no hidden memory. If a fact isn't in context — a file's contents, a prior decision, a tool's output — the model cannot use it. This reframes a lot of agent engineering as context management: deciding what to load, when to load it, and what to evict before the window fills.

Claude agents manage this on several layers. The system prompt holds stable instructions and identity. The conversation transcript holds the running history. Tool results stream in as the agent works. And because long-running agents would otherwise overflow even a million tokens, the harness compacts: it summarizes older turns, drops stale file reads, and keeps the live working set lean. Understanding that the window is finite and actively curated explains why well-scoped tasks succeed where sprawling ones drift.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
flowchart TD
  A["Task + system prompt"] --> B["Context window assembled"]
  B --> C["Claude model runs"]
  C --> D{"Tool call or answer?"}
  D -->|Answer| E["Return result"]
  D -->|Tool call| F["Harness executes tool / MCP server"]
  F --> G["Result appended to context"]
  G --> H{"Window near limit?"}
  H -->|Yes| I["Compact & summarize older turns"]
  H -->|No| C
  I --> C

Tools and MCP: the agent's hands

A model that can only emit text is a chatbot. What makes it an agent is the ability to act, and actions happen through tools. A tool is a function with a name, a description, and a JSON schema for its inputs; the model decides when to call it and with what arguments, and the harness runs the real code. In Claude Code, the built-in tools include reading and writing files, running shell commands, searching the codebase, and spawning subagents.

The Model Context Protocol extends this beyond the built-ins. Model Context Protocol (MCP) is an open standard, introduced by Anthropic in November 2024, that lets Claude connect to external tools and data sources through standardized server interfaces. An MCP server might expose your production database, a Jira instance, a payments API, or an internal search system. From the model's point of view, MCP tools look identical to built-in ones — same call-and-result shape — which is exactly the point: the architecture is uniform regardless of where a capability physically lives.

Skills: instructions that load on demand

Tools give the agent capability; skills give it competence. An Agent Skill is a folder of instructions, scripts, and resources that Claude loads dynamically when the task calls for it. The key architectural trick is progressive disclosure: rather than stuffing every procedure into the system prompt and burning context on instructions that are usually irrelevant, the agent keeps a lightweight index of available skills and pulls the full content only when a task matches. A skill for generating brand-compliant PDFs sits dormant until someone asks for a PDF, then its detailed steps and helper scripts flow into context.

This pairing of MCP and Skills is deliberate. MCP connects Claude to a tool; the matching skill teaches Claude how and when to use that tool well — the auth quirks, the right sequence of calls, the gotchas your team learned the hard way. Together they turn a generic model into something that behaves like an experienced operator of your specific systems.

Subagents and orchestration

The last architectural layer is coordination. A single agent in a single context window handles a remarkable amount, but some work is too large or too parallel for one thread. Here Claude spawns subagents: child agents, each with their own fresh context window, dispatched to handle a bounded slice of the problem. An orchestrator agent might fan out four subagents to investigate four independent modules, then collect their summaries and synthesize a result.

The benefit is isolation and parallelism — each subagent reasons without the clutter of the others' context, and they run concurrently. The cost is tokens: multi-agent runs commonly consume several times more tokens than a single-agent approach, because every subagent re-establishes its own working context. The architectural discipline, then, is to reach for subagents when the work genuinely decomposes into independent parts, and to keep it single-threaded when it doesn't.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

How the layers compose at runtime

Put it together and a real session looks like this. You give Claude Code a task. The harness assembles context — system prompt, your request, relevant files. The model reasons and calls a built-in tool to search the repo, then an MCP tool to query a service, then loads a skill that codifies your deployment procedure. As the window fills, the harness compacts history. For a parallelizable subtask it spawns subagents, gathers their outputs, and continues. Eventually the stop condition — task complete, or a hook intervenes — ends the loop. Every one of those moves is a predictable interaction between the same handful of primitives. That's what makes the architecture something you can debug rather than pray to.

Frequently asked questions

Where does an agent store memory between steps?

In the context window itself, plus whatever the harness persists. There is no separate hidden memory; the running transcript and tool results are the agent's working memory. Durable state — files written to disk, records saved over MCP — survives across the loop, while in-context detail can be compacted or summarized as the window fills.

What's the difference between a tool and a skill?

A tool is an executable capability with a schema the model calls to take an action. A skill is loadable knowledge — instructions and resources — that teaches the model how and when to use tools effectively. Tools are the hands; skills are the training. MCP servers provide tools; Agent Skills provide the playbooks.

When should I use multiple agents instead of one?

Use subagents when the work splits cleanly into independent parts that benefit from isolated context or parallel execution — for example, investigating several unrelated files at once. For sequential, tightly coupled work, a single agent is simpler and far cheaper, since multi-agent runs can use several times more tokens.

Does the model see my entire codebase at once?

No. It sees only what's loaded into the context window. Claude Code reads files on demand via tools, so the agent pulls in exactly the files it needs and lets older ones fall out of context, rather than holding the whole repository in memory.

Bringing agentic AI to your phone lines

These same architectural building blocks — a model loop, managed context, tools, and coordinated subagents — power CallSphere's voice and chat agents, which answer every call and message, pull data mid-conversation, and book real work around the clock. See the architecture in action at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.