Deploy Claude in Your Enterprise: A Build Walkthrough
Step-by-step guide to deploying Claude at scale: gateway, model routing, MCP tools, audit, approvals, and a safe rollout you can defend in review.
Most "how to deploy an LLM" guides stop at calling the API and printing the response. That is the easy 5%. The other 95% — identity, tool access, cost control, audit, and a rollout you can defend in a security review — is what actually takes time. This walkthrough is the version I wish I had the first time I shipped Claude into a regulated environment: concrete steps, in order, with the decisions called out where they matter.
We will build a support-automation agent that can look up orders, issue refunds within limits, and email customers. The same skeleton generalizes to internal tooling, data analysis, or coding agents. By the end you will have a deployment that authenticates users, routes across the Claude model tiers, talks to your systems only through MCP, and logs everything.
Key takeaways
- Start with a gateway that owns identity and tracing before you write a single prompt.
- Build one MCP server per backend system and validate every tool argument server-side.
- Route by task across Haiku 4.5, Sonnet 4.6, and Opus 4.8 from day one — retrofitting routing is painful.
- Add an approval gate for high-impact actions and an audit event for every tool call.
- Roll out behind a flag with shadow mode first; promote on evals, not vibes.
Step 1 — Lay the gateway and identity foundation
Before Claude enters the picture, build the front door. The gateway authenticates the caller (OIDC, mTLS, or your existing SSO), resolves their role, and stamps the request with a trace ID. Every downstream component — harness, MCP servers, audit log — keys off that trace ID. If you skip this and bolt identity on later, you will be threading user context through code that never expected it.
A minimal gateway contract looks like this: in comes a request plus a bearer token; out goes a normalized request object carrying actor, roles, and trace_id. Reject anything unauthenticated here, loudly, before it costs you a model call.
The gateway is also where you set per-user rate limits and attach an organization or tenant identifier. In a multi-tenant deployment this tenant tag is load-bearing: it is what later scopes every retrieval and tool call so one customer's agent can never see another's data. Resolve it once, here, and carry it everywhere. Retrofitting tenancy after the fact means auditing every query in the system, which is a project nobody enjoys. Treat the gateway as the only component that handles untrusted traffic from the outside world; everything behind it gets to assume identity is already established.
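To make the gateway contract concrete, here is a minimal sketch of the normalization step. Everything named here is an assumption for illustration: `RequestContext`, `normalize_request`, `verify_token`, and the `FAKE_TOKENS` table are hypothetical, and the token lookup stands in for real OIDC or SSO verification.

```python
import uuid
from dataclasses import dataclass


class Unauthenticated(Exception):
    """Raised when a request carries no valid credentials."""


@dataclass(frozen=True)
class RequestContext:
    actor: str
    roles: tuple[str, ...]
    tenant_id: str
    trace_id: str
    body: dict


# Hypothetical token table standing in for real OIDC/SSO verification.
FAKE_TOKENS = {
    "tok-alice": {"sub": "alice", "roles": ["support_agent"], "tenant": "acme"},
}


def verify_token(token: str) -> dict:
    """Placeholder: a real gateway would validate a signed token here."""
    claims = FAKE_TOKENS.get(token)
    if claims is None:
        raise Unauthenticated("unknown or expired token")
    return claims


def normalize_request(headers: dict, body: dict) -> RequestContext:
    """Authenticate the caller, then stamp identity, tenant, and trace ID."""
    auth = headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise Unauthenticated("missing bearer token")
    claims = verify_token(auth[len("Bearer "):])
    return RequestContext(
        actor=claims["sub"],
        roles=tuple(claims["roles"]),
        tenant_id=claims["tenant"],
        trace_id=uuid.uuid4().hex,  # one ID that every downstream log keys off
        body=body,
    )
```

The context object is frozen so that downstream components can read identity and tenancy but cannot quietly rewrite them.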
Step 2 — Wire the model call and the tool loop
Now the core. The harness sends the system prompt, the permitted tool definitions, and the conversation to Claude, then runs the agent loop: if the model returns tool calls, execute them, append the results, and call again until it returns a final message. Below is the shape of that loop with the Anthropic SDK.
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# SYSTEM_PROMPT and dispatch_to_mcp are assumed to be defined elsewhere in the harness.


def run_turn(messages, tools, model="claude-sonnet-4-6"):
    resp = client.messages.create(
        model=model, max_tokens=1024,
        system=SYSTEM_PROMPT, tools=tools, messages=messages,
    )
    if resp.stop_reason == "tool_use":
        # Record the assistant turn once, even if it contains several tool calls.
        messages.append({"role": "assistant", "content": resp.content})
        results = []
        for block in resp.content:
            if block.type == "tool_use":
                result = dispatch_to_mcp(block.name, block.input)  # policy + schema here
                results.append({
                    "type": "tool_result", "tool_use_id": block.id,
                    "content": result,  # a string or a list of content blocks
                })
        # All tool results go back together in a single user message.
        messages.append({"role": "user", "content": results})
        return run_turn(messages, tools, model)
    return resp
```

The important line is dispatch_to_mcp: it is where policy and schema checks run before anything touches a real system. Keep the loop boring and put all the judgment behind that call.
```mermaid
flowchart TD
    A["Request at gateway"] --> B["Resolve identity & trace_id"]
    B --> C["Router: classify with Haiku"]
    C --> D["Pick model tier"]
    D --> E["Agent loop: call Claude"]
    E --> F{"Tool call?"}
    F -->|Yes| G["MCP: policy + schema check"]
    G --> E
    F -->|No| H["Return answer & write audit log"]
```

Step 3 — Build the MCP server for each system
Each backend gets its own MCP server exposing only the operations the agent needs. For our support agent that is get_order, issue_refund, and send_email. Define tight JSON schemas: issue_refund takes an order_id string and an integer amount_cents, nothing more. Validate on the server, not in the prompt, because sooner or later the model will send something off-spec and the server is the only place that can reliably reject it.
Inside the server, enforce the business rules: a refund over a threshold returns a "needs approval" result instead of executing. The model then surfaces that to a human. This keeps dangerous actions behind a real gate while still letting the agent drive the happy path.
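The validation and the approval rule described above might look something like the following. This is a sketch in plain Python rather than any particular MCP SDK. The threshold value, the `InvalidArguments` exception, and the injected `execute_refund` callable are all assumptions made for illustration.

```python
REFUND_APPROVAL_THRESHOLD_CENTS = 5_000  # hypothetical limit; set by your policy


class InvalidArguments(ValueError):
    """Raised when tool arguments do not match the schema."""


def validate_refund_args(args: dict) -> tuple[str, int]:
    """Accept exactly {order_id: str, amount_cents: int}, nothing more."""
    allowed = {"order_id", "amount_cents"}
    extra = set(args) - allowed
    missing = allowed - set(args)
    if extra or missing:
        raise InvalidArguments(
            f"unexpected={sorted(extra)} missing={sorted(missing)}"
        )
    order_id, amount = args["order_id"], args["amount_cents"]
    if not isinstance(order_id, str) or not order_id:
        raise InvalidArguments("order_id must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise InvalidArguments("amount_cents must be a positive integer")
    return order_id, amount


def issue_refund(args: dict, execute_refund) -> dict:
    """Validate first, then either execute or hand back a needs-approval result."""
    order_id, amount = validate_refund_args(args)
    if amount > REFUND_APPROVAL_THRESHOLD_CENTS:
        # Nothing is executed; the model surfaces this result to a human.
        return {"status": "needs_approval", "order_id": order_id, "amount_cents": amount}
    execute_refund(order_id, amount)
    return {"status": "refunded", "order_id": order_id, "amount_cents": amount}
```

Passing `execute_refund` in as a parameter keeps the billing call swappable, which also makes it easy to point the same handler at a sandbox backend.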
One server per system is deliberate. It keeps blast radius small — a bug or compromise in the email server cannot touch billing — and it lets you grant each agent only the servers it needs. It also keeps schemas legible: a focused server with three well-named tools is far easier for the model to use correctly than one sprawling server with thirty. Resist the urge to build a single "do everything" server because it is convenient to wire; the convenience evaporates the first time you need to reason about what an agent can reach.
Step 4 — Add routing across the model tiers
Put a cheap classifier in front of the expensive work. A single Haiku 4.5 call tags each request as simple, standard, or complex, and that tag selects Haiku, Sonnet 4.6, or Opus 4.8 for the real turn. Build this on day one. Retrofitting routing after you have hardcoded one model everywhere means touching every call site and re-running your evals.
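A rough sketch of the router follows. The classifier is passed in as a plain callable so the routing logic can be tested without a network call; in practice it would wrap a single Haiku request. `pick_model`, `MODEL_BY_TIER`, and `DEFAULT_TIER` are hypothetical names, and apart from the Sonnet identifier used earlier in this walkthrough, the model ID strings are placeholders that follow the same naming pattern and should be checked against your account.

```python
from typing import Callable

# Placeholder model IDs; only the Sonnet string appears elsewhere in this guide.
MODEL_BY_TIER = {
    "simple": "claude-haiku-4-5",
    "standard": "claude-sonnet-4-6",
    "complex": "claude-opus-4-8",
}

DEFAULT_TIER = "standard"  # assumed fallback when the label is unrecognized


def pick_model(request_text: str, classify: Callable[[str], str]) -> str:
    """Map the classifier's label to a model ID, falling back to the middle tier."""
    label = classify(request_text).strip().lower()
    tier = label if label in MODEL_BY_TIER else DEFAULT_TIER
    return MODEL_BY_TIER[tier]
```

Keeping the tier map in one place is what makes routing cheap to change later: call sites ask for a model by difficulty and never hardcode an ID.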
| Decision | Do this | Avoid this |
|---|---|---|
| Model choice | Route by classified difficulty | One model for everything |
| Tool access | MCP server with schemas | Direct DB calls from harness |
| Guardrails | Server-side policy checks | "Please don't" in the prompt |
| Rollout | Flag + shadow + evals | Big-bang to all users |
Step 5 — Instrument audit, budgets, and approvals
Emit a structured audit event for every tool call: actor, on-behalf-of, tool, arguments, result status, model tier, trace ID, and token counts. This is both your security record and your cost telemetry. Set a per-request and per-user token budget in the harness; when a run blows the budget, stop it and return a clear error rather than letting an agent loop forever. For high-impact tools, route the call to an approval queue and resume when a human signs off.
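As one possible shape for the audit event and the budget check, consider the sketch below. The field names mirror the list above; `AuditEvent`, `emit_audit`, `TokenBudget`, and `BudgetExceeded` are hypothetical names, and the `write` callable stands in for whatever log sink you actually use.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable


@dataclass
class AuditEvent:
    trace_id: str
    actor: str
    on_behalf_of: str
    tool: str
    arguments: dict
    result_status: str
    model_tier: str
    input_tokens: int
    output_tokens: int
    timestamp: float = field(default_factory=time.time)


def emit_audit(event: AuditEvent, write: Callable[[str], object]) -> None:
    """Serialize one event as a JSON line and hand it to the sink."""
    write(json.dumps(asdict(event), sort_keys=True))


class BudgetExceeded(RuntimeError):
    """Raised when a run spends more tokens than it was allowed."""


class TokenBudget:
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Add one model call's usage and stop the run once over the limit."""
        self.used += input_tokens + output_tokens
        if self.used > self.limit:
            raise BudgetExceeded(f"used {self.used} of {self.limit} tokens")
```

The harness would call `charge` after each model response and catch `BudgetExceeded` at the top of the loop to return a clear error to the caller.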
The approval pattern is worth implementing carefully because it is what lets you ship sooner. Rather than waiting until you fully trust the agent on refunds, you let it draft the refund and pause for a human click. Early on, almost everything routes to approval; as your evals and audit logs build confidence, you raise the auto-approval thresholds tool by tool. This gives you a dial between full automation and full supervision instead of a binary launch decision, and it means a single agent can graduate from "assisted" to "autonomous" on different tools at different times.
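The dial described here can be sketched as a per-tool threshold map. This is an illustration under assumptions: `ApprovalPolicy` and its methods are hypothetical, and using an amount in cents as the measure of impact only fits money-moving tools, so tools without an amount are treated as amount zero.

```python
from dataclasses import dataclass, field


@dataclass
class ApprovalPolicy:
    # Per-tool ceiling (in cents) at or below which a call is auto-approved.
    # A tool that is missing from the map always goes to a human.
    auto_approve_up_to: dict[str, int] = field(default_factory=dict)

    def decide(self, tool: str, amount_cents: int = 0) -> str:
        """Return "auto" or "human" for a proposed tool call."""
        limit = self.auto_approve_up_to.get(tool)
        if limit is not None and amount_cents <= limit:
            return "auto"
        return "human"

    def set_threshold(self, tool: str, limit_cents: int) -> None:
        """Move the dial for one tool without touching the others."""
        if limit_cents < 0:
            raise ValueError("limit_cents must be non-negative")
        self.auto_approve_up_to[tool] = limit_cents
```

Starting with an empty map gives the "almost everything routes to approval" default; each call to `set_threshold` graduates one tool at a time.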
Step 6 — Roll out safely in 5 moves
- Ship behind a feature flag scoped to a small internal cohort.
- Run in shadow mode: the agent proposes actions and logs them, but a human executes — compare proposals to reality.
- Build an eval set from real transcripts and gate promotion on it passing.
- Enable real actions for low-risk tools first; keep refunds and emails behind approval.
- Widen the cohort gradually, watching the audit log and budget dashboards at each step.
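The shadow-mode comparison and the promotion gate from the moves above could be sketched as follows. The function names and every default threshold are assumptions chosen for illustration, and exact dictionary equality is a deliberately strict stand-in for whatever "the agent proposed the same action" means in your system.

```python
def agreement_rate(pairs: list[tuple[dict, dict]]) -> float:
    """Fraction of cases where the agent's proposal equals the human's action.

    Each pair is (agent_proposal, human_action) for the same request.
    """
    if not pairs:
        return 0.0
    matches = sum(1 for proposed, actual in pairs if proposed == actual)
    return matches / len(pairs)


def ready_to_promote(
    pairs: list[tuple[dict, dict]],
    eval_pass_rate: float,
    min_agreement: float = 0.95,  # hypothetical bar for shadow agreement
    min_eval_pass: float = 0.90,  # hypothetical bar for the eval set
    min_samples: int = 200,       # hypothetical minimum shadow volume
) -> bool:
    """Gate promotion on shadow volume, shadow agreement, and eval results."""
    return (
        len(pairs) >= min_samples
        and agreement_rate(pairs) >= min_agreement
        and eval_pass_rate >= min_eval_pass
    )
```

Requiring a minimum sample count matters as much as the rates themselves: a perfect agreement score over a handful of requests says very little.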
Common pitfalls
- Calling the API before building the gateway. Identity bolted on late leaks into every layer. Do it first.
- Trusting tool arguments from the model. Validate server-side; the model is a smart but untrusted client.
- No budget on the agent loop. A confused agent will happily spin through tokens. Cap it.
- Skipping shadow mode. Real traffic surfaces edge cases your evals miss; watch it propose before it acts.
- One giant MCP server. Split by system so blast radius and permissions stay scoped.
Frequently asked questions
How long does a first deployment realistically take?
A focused team can stand up the gateway, one MCP server, the agent loop, and basic audit in a couple of weeks. The longer tail is evals, approvals, and the security review — budget several weeks for hardening before broad rollout.
Do I need a multi-agent setup to start?
No. Begin with a single agent and a clean tool loop. A multi-agent system is a deliberate choice you make later for parallelizable work, and it typically uses several times more tokens, so add it only when the workload demands it.
What goes in the system prompt versus a Skill?
Keep the system prompt small and stable — role, constraints, tone. Put procedural, occasionally-needed knowledge in Skills so Claude loads it only when relevant, which keeps the base context lean.
How do I test tool calls without hitting production systems?
Point your MCP servers at sandboxed backends in non-prod and replay recorded transcripts. Shadow mode in production then validates against real data without taking real actions.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.