Claude Opus in Claude Code: A Real End-to-End Build

Most writing about agentic coding stays at the altitude of principles. This post goes the other way. We are going to walk a single, ordinary feature from a vague ticket to merged-and-deployed code using Claude Opus inside Claude Code, and pay attention to the unglamorous decisions that actually determine whether the run succeeds. No invented metrics, no heroic one-shot. Just the real shape of the work when it goes well.

The task: a SaaS team needs to add per-organization rate limiting to a public API. The ticket says, roughly, "customers are hammering the API, add limits." That sentence is the start of the problem, not the spec — and turning it into something an agent can execute is the first real skill on display.

Step one: turn the ticket into a spec the agent can't misread

Before invoking Opus, the engineer writes a short brief in a scratch file. Limits are per organization, not per user. The default is a configurable requests-per-minute ceiling. Over-limit requests return HTTP 429 with a Retry-After header. State lives in the existing Redis instance, because the service is already multi-instance and in-memory counters would be wrong. Existing middleware patterns in the auth layer should be followed. Done means: middleware added, unit tests for under-limit and over-limit, an integration test against a test Redis, and no change to the public response shape for allowed requests.

This brief is maybe twelve lines, and it is the highest-leverage twelve lines in the whole exercise. It converts an ambiguous ask into explicit constraints, names the components and patterns in scope (the existing Redis instance, the auth-layer middleware), and defines a checkable finish line. The agent now has something to build toward instead of guess at.
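One way to make the brief's constraints concrete is to capture them as configuration. The sketch below is illustrative only: `RateLimitPolicy`, its field names, the 600 requests-per-minute default, and the per-organization `overrides` map are all assumptions, since the brief says only that the ceiling is configurable.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RateLimitPolicy:
    """Hypothetical container for the constraints named in the brief."""

    default_rpm: int = 600      # assumed placeholder value, not from the brief
    window_seconds: int = 60    # "requests per minute"
    # Assumed extension: optional per-organization ceilings (org_id -> rpm).
    overrides: dict = field(default_factory=dict)

    def limit_for(self, org_id: str) -> int:
        # Limits are looked up by organization, never by user.
        return self.overrides.get(org_id, self.default_rpm)


policy = RateLimitPolicy(default_rpm=100, overrides={"org_a": 10})
```

Keeping the ceiling in one small object like this means the later rollout step can tighten it without touching the middleware itself.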

Step two: let Claude Opus plan before it writes

The engineer points Claude Code at the brief and asks Opus to produce a plan first — which files it will touch, the order of operations, and where it is unsure. This is deliberate. A plan is cheap to review and cheap to correct, and catching a wrong assumption here saves a dozen wasted edit loops later.

flowchart TD
  A["Vague ticket"] --> B["Engineer writes spec"]
  B --> C["Opus proposes plan"]
  C --> D{"Plan sane?"}
  D -->|No| B
  D -->|Yes| E["Opus edits files & writes tests"]
  E --> F{"Tests & lint pass?"}
  F -->|No| E
  F -->|Yes| G["Human review of diff"]
  G --> H["Merge & deploy"]

The plan comes back reasonable but with one flaw: Opus proposes a fixed-window counter, which allows a burst at the window boundary. The engineer pushes back in one sentence — use a sliding window so boundary bursts can't double the limit. That single correction, made against a plan rather than against finished code, is the kind of intervention that separates a smooth run from a frustrating one.
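The boundary burst is easy to see with two toy limiters. This is an in-memory sketch for illustration, not the Redis-backed code from the walkthrough. The class names and the limit of 5 per 60 seconds are made up, and the sliding window is implemented here as a timestamp log, which is only one of several sliding-window variants.

```python
from collections import deque


class FixedWindowLimiter:
    """Counts requests per aligned window; the count resets at each boundary."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.current_window = None
        self.count = 0

    def allow(self, now: float) -> bool:
        window_id = int(now // self.window)
        if window_id != self.current_window:
            # A new aligned window starts with a fresh budget.
            self.current_window = window_id
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowLimiter:
    """Keeps timestamps of allowed requests from the trailing `window` seconds."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.stamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the trailing window.
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True


def allowed_in_burst(limiter, times) -> int:
    """Number of requests the limiter admits for the given timestamps."""
    return sum(limiter.allow(t) for t in times)


# Five requests just before the minute boundary, five just after it.
burst = [59.0 + i * 0.1 for i in range(5)] + [60.0 + i * 0.1 for i in range(5)]
fixed_allowed = allowed_in_burst(FixedWindowLimiter(limit=5, window=60), burst)
sliding_allowed = allowed_in_burst(SlidingWindowLimiter(limit=5, window=60), burst)
```

With a limit of 5, the fixed-window limiter admits all ten requests in the burst because its counter resets at second 60, while the sliding-window limiter admits only the first five.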

Step three: the agent writes, runs, and fixes its own failures

With the plan agreed, Opus edits the middleware, adds the Redis-backed sliding-window logic, and writes the tests called for in the spec. Then Claude Code runs the suite. The first run fails on the Retry-After assertion: the middleware emitted the header value in seconds, but the test expected milliseconds. The agent reads the failure, checks which convention the rest of the codebase uses, aligns the two, and reruns. Green. (For what it is worth, HTTP defines a numeric Retry-After as a delay in seconds, so seconds is also what clients will expect.)
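A Redis-backed sliding window is commonly built on a sorted set whose scores are request timestamps. The sketch below shows that shape under stated assumptions: `client` is assumed to behave like a redis-py client (`zremrangebyscore`, `zcard`, `zrange`, `zadd`, `expire`), while the `check_rate_limit` name, the key format, and the member naming are hypothetical. It is not the code Opus wrote, and it is not atomic across instances; a stricter version would wrap these steps in a transaction or a Lua script.

```python
import math


def check_rate_limit(client, org_id: str, limit: int, now: float, window: int = 60):
    """Return (allowed, retry_after_seconds) for one request.

    Assumes `limit` is at least 1 and `client` exposes redis-py style
    sorted-set commands. Key and member formats are illustrative.
    """
    key = f"ratelimit:org:{org_id}"

    # Evict entries older than the trailing window, then count what is left.
    client.zremrangebyscore(key, 0, now - window)
    used = client.zcard(key)

    if used >= limit:
        # The oldest surviving entry decides when a slot frees up.
        _, oldest = client.zrange(key, 0, 0, withscores=True)[0]
        # Whole seconds, rounded up so the client never retries early.
        retry_after = max(1, math.ceil(oldest + window - now))
        return False, retry_after

    # Record this request; the score is its timestamp.
    client.zadd(key, {f"{now}:{used}": now})
    # Let idle keys expire so inactive organizations cost no memory.
    client.expire(key, window)
    return True, 0
```

The second element of the return value is what would go into the Retry-After header on a 429 response, expressed in whole seconds.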

This loop — write, run, read failure, fix — is the core of what makes an agentic tool more than autocomplete. The engineer is not in this inner loop. They set the spec and the gate, and the agent grinds against the tests until they pass. The human cost was the brief and one architectural correction; the agent absorbed the iteration.

Step four: the review that catches what tests can't

Passing tests are necessary, not sufficient. The engineer reviews the diff with a specific lens: not "does this behave as specified" (the tests cover that) but "is the judgment right." Two things surface. First, the rate-limit key is built from the org ID alone, which is correct, but the engineer adds a comment explaining why per-user keying was rejected, so the next person doesn't reverse it. Second, the over-limit path logs the full request body, which could capture sensitive fields. The engineer flags it and Opus trims the log to method and path.
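Both review outcomes are small in code terms. A minimal sketch, assuming Python's standard `logging` module and hypothetical helper names (`rate_limit_key`, `log_over_limit`) and key format, might look like this:

```python
import logging

logger = logging.getLogger("rate_limit")


def rate_limit_key(org_id: str) -> str:
    # Keyed by organization only. Per-user keying was rejected because the
    # brief sets limits per organization: all users in an org share one budget.
    return f"ratelimit:org:{org_id}"


def log_over_limit(method: str, path: str) -> dict:
    """Log a rejected request using only its method and path.

    The request body is deliberately never passed in, so it cannot leak
    into the log line.
    """
    fields = {"method": method, "path": path}
    logger.warning("rate limit exceeded", extra=fields)
    return fields
```

Making the body unavailable to the logging helper, rather than filtering it afterward, is one way to keep a later edit from quietly reintroducing the leak.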

That second catch is exactly the kind of thing eval gates miss and humans must hold: a security-shaped judgment call that no unit test was looking for. The review was fast because the diff was scoped and the spec was clear, which is the whole argument for doing the spec work up front.

A detour that almost derailed the run

It is worth being honest about a moment that did not go cleanly, because real builds always have one. Midway through, the engineer realized the existing Redis client in the codebase used a connection pool with a low default timeout that would not hold up under the new per-request load. They asked Opus to bump the timeout, and the agent obligingly did — but it also started "helpfully" refactoring the surrounding connection-handling code, which was out of scope and touched a shared module other services depended on.

This is the scope-creep failure mode in miniature, and the right response was not to let it ride. The engineer stopped the run, reverted the connection-handling changes, and re-scoped: change only the timeout value, leave the pooling logic alone, and note the broader concern as a follow-up ticket. The lesson is concrete — when an agent expands beyond the brief, the cheap move is to stop and narrow, not to review a sprawling diff after the fact. Catching it mid-run cost a minute; catching it in review would have cost an argument about a module nobody meant to change.

Step five: ship, then watch

The change goes out behind a feature flag, limits set high enough to affect nobody at first. The team watches 429 rates and Redis latency for a day, then tightens the ceiling to the intended value. Nothing dramatic happens, which is the point. The deployment is boring because the risk was contained at every step: a clear spec, a reviewed plan, an eval gate, a focused human review, and a flagged rollout that is trivial to dial back.
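The rollout logic amounts to two knobs: a flag that bypasses the limiter entirely and a ceiling that starts high and is tightened later. The sketch below is an assumption about how that could be wired; `RolloutConfig`, `should_reject`, and every number in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutConfig:
    enabled: bool       # the feature flag
    ceiling_rpm: int    # current ceiling; starts high, tightened later


def should_reject(config: RolloutConfig, requests_in_window: int) -> bool:
    """Decide whether the next request is over the limit.

    `requests_in_window` is the number of requests already counted for
    the organization in the trailing window.
    """
    # Flag off: the limiter is bypassed, so rollback is a single toggle.
    if not config.enabled:
        return False
    return requests_in_window >= config.ceiling_rpm


day_one = RolloutConfig(enabled=True, ceiling_rpm=100_000)   # affects nobody
tightened = RolloutConfig(enabled=True, ceiling_rpm=300)     # intended value (made up)
rolled_back = RolloutConfig(enabled=False, ceiling_rpm=300)  # flag switched off
```

An organization with 500 requests in the window passes on day one, is rejected once the ceiling is tightened, and passes again the moment the flag is switched off.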

Step back and notice where the human time actually went. Not into typing the middleware — the agent did that. It went into specifying the problem, correcting one design choice, catching a scope-creep detour, and reviewing for judgment and security. That redistribution of effort, from implementation toward specification and verification, is the real story of building with Claude Opus in Claude Code.

It is also worth naming what made the whole run feel calm rather than risky, because that calm was engineered, not lucky. The work happened on a feature branch, so every change was a revertible commit. The eval gate ran automatically, so wrong work could not silently pass. The diff stayed small enough to review with full attention. And the rollout hid behind a flag, so the worst-case production impact was a config toggle away from zero. None of these are exotic; together they turn a powerful, autonomous agent into a teammate you can hand real work to. Strip any one of them out and the same run becomes a gamble. That is the quiet discipline behind a build that ships without drama.

If there is a single takeaway from watching this feature go from a one-line ticket to production, it is that the agent did not replace the engineer; it relocated the engineer's effort. The valuable human contributions were front-loaded and back-loaded — a precise spec at the start, a plan correction and a scope-creep catch in the middle, a judgment-and-security review at the end. The repetitive middle, the actual writing and rewriting of code against a test suite, is where Claude Opus carried the load and never tired. A team that internalizes this stops measuring its engineers by how much code their hands produce and starts measuring them by how clearly they can specify, how well they can verify, and how sound their judgment is at the boundaries. That is a more durable definition of engineering skill, and it is the one this way of working rewards.

Frequently asked questions

Why write a spec by hand if the agent can infer requirements?

Because inference fills gaps with guesses, and confident wrong guesses are expensive to find later. A twelve-line spec converts ambiguity into checkable constraints and is the cheapest way to keep the agent building the right thing.

What was the agent genuinely good at in this walkthrough?

The mechanical iteration: writing the implementation and tests, running the suite, reading failures, and fixing them without human help. That inner loop is where an agentic tool pulls decisively ahead of line-by-line assistance.

Where did the human still have to step in?

At the edges agents handle worst: choosing the right algorithm during planning, reining in an out-of-scope refactor mid-run, and catching a security-sensitive logging choice during review. Those are judgment calls, and they're exactly where focused human attention earns its keep.

Bringing agentic AI to your phone lines

CallSphere runs this same problem-to-shipped loop for voice and chat: multi-agent assistants that answer every call and message, use tools mid-conversation, and book work 24/7. See a live build at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
