End-to-End Claude Code: A 1M-Context Walkthrough

A realistic problem-to-shipped walkthrough of migrating a legacy module with Claude Code's 1M-token context — staged sessions, scoped writes, and gates.

Most articles about long context describe the feature. This one follows a single real-shaped task from the first "this is going to be painful" moment to a merged, shipped result. The task: migrate a legacy authentication module in a mid-sized service from a deprecated session library to a modern token-based approach. It touches dozens of files, has thin test coverage, and the original author left the company two years ago. Exactly the kind of work where a 1M-token context window earns its keep — and exactly the kind that punishes you for using it carelessly.

I am going to walk through it as a sequence of Claude Code sessions, because the discipline of where one session ends and the next begins is the whole game. The goal is not to show off the model. It is to show the judgment calls a human makes around it.

The problem and why long context fits

The deprecated library is a problem because it is woven through the codebase: middleware reads from it, controllers assume its shape, and a dozen tests stub it in subtly different ways. No single file explains the system. To make a correct migration, you genuinely need to understand how the pieces interact — which is the case where a large context window is the right tool, not a gimmick.

The trap is to load everything and ask for "the migration." That produces an enormous, unreviewable diff and a high chance of subtle breakage. So the plan is staged: first build understanding, then change one layer at a time, verifying at each gate. Long context informs every stage; it does not replace the staging.

Session one: build a map, change nothing

The first session has a deliberately narrow goal: understand and document, write no code. I give the session broad read access by loading the auth module, the middleware, a representative controller, and the existing tests into context. Then I ask Claude Code to produce a written map: where the deprecated library is used, what each usage assumes, and what the modern equivalent would be for each.

This is the highest-value session and the one people skip. The output is a markdown plan I read carefully and correct. It catches two usages I would have missed and flags one place where the old library's behavior is load-bearing in a way the new one is not. I now have a verified plan, and — critically — a compact artifact I can carry into later sessions instead of re-loading everything.
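One way to cross-check the agent's map is a mechanical scan for imports of the deprecated library. The sketch below assumes a Python service and uses `legacy_sessions` as a hypothetical stand-in for the library name; neither detail comes from the walkthrough itself. It only finds import statements, so it complements the written map rather than replacing it.

```python
import ast


def find_legacy_imports(source: str, module: str) -> list[int]:
    """Return sorted line numbers where `module` or one of its submodules is imported."""

    def matches(name):
        # Match the module itself or a dotted submodule, but not a longer
        # sibling name such as "legacy_sessions_extra".
        return name is not None and (name == module or name.startswith(module + "."))

    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            if any(matches(alias.name) for alias in node.names):
                hits.add(node.lineno)
        elif isinstance(node, ast.ImportFrom):
            # level == 0 means an absolute import; relative imports are skipped.
            if node.level == 0 and matches(node.module):
                hits.add(node.lineno)
    return sorted(hits)


# Hypothetical file contents, for illustration only.
sample = (
    "import os\n"
    "import legacy_sessions\n"
    "from legacy_sessions.store import Session\n"
    "from legacy_sessions_extra import helper\n"
)
print(find_legacy_imports(sample, "legacy_sessions"))
```

Any file the scan flags that the written map does not mention is worth raising in the session before the plan is treated as verified.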

flowchart TD
  A["Legacy auth module"] --> B["Session 1: map & plan"]
  B --> C{"Plan verified by human?"}
  C -->|No| B
  C -->|Yes| D["Session 2: migrate core lib"]
  D --> E["Session 3: update middleware"]
  E --> F["Session 4: fix tests"]
  F --> G{"Suite green & diff in scope?"}
  G -->|No| E
  G -->|Yes| H["Ship behind flag"]

Session two: migrate the core, scope the write

Now I open a fresh session against a clean branch. Into context goes the verified plan plus the specific files for the core token logic, not the whole repo. The instruction is precise: implement the new token module per the plan, add unit tests, change nothing outside this package. Narrow write scope is the safety mechanism; the verified plan supplies the broad understanding without reloading every file.
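The "change nothing outside this package" rule can also be checked mechanically after the session. This is a minimal sketch, not part of Claude Code: the function name and the directory `auth/tokens` are hypothetical, and the changed paths are assumed to come from something like `git diff --name-only`.

```python
from pathlib import PurePosixPath


def out_of_scope(changed_paths, allowed_dirs):
    """Return the changed paths that fall outside every allowed directory."""
    allowed = [PurePosixPath(d) for d in allowed_dirs]
    offenders = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        # A path is in scope if an allowed directory is one of its ancestors.
        if not any(d in path.parents for d in allowed):
            offenders.append(raw)
    return offenders


# Hypothetical list of files touched by the session.
changed = [
    "auth/tokens/core.py",
    "auth/tokens/tests/test_core.py",
    "middleware/session.py",
]
print(out_of_scope(changed, ["auth/tokens"]))
```

Comparing path components rather than string prefixes means a sibling directory such as `auth/tokens_old` is not mistaken for the allowed package. A non-empty result is a reason to reject or split the diff before committing.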

The session produces a focused diff and tests. I read it critically. One function handles token refresh in a way that subtly differs from the plan, so I correct it mid-session rather than restarting — prompt-and-correct fluency in action. The tests pass. I commit this as a small, reviewable increment. The migration is not done, but the foundation is in place and verified.

Notice the rhythm: each session starts clean, carries forward a compact artifact rather than raw context, scopes its writes narrowly, and ends with a verified commit. This is what keeps a multi-day agentic task from collapsing into an unreviewable mess.

Sessions three and four: middleware and tests

The third session updates the middleware to use the new module. Here the long context helps in a specific way: I load the new module's interface, the old middleware, and the relevant controllers so the agent sees how the pieces fit. The change is mechanical but cross-cutting, exactly where holding several files at once prevents the agent from breaking a caller it cannot see.

The fourth session is the one that always takes longer than expected: the tests. The old tests stub the deprecated library; they need to be rewritten against the new approach. I ask Claude Code to update them, then I scrutinize hard, because agents sometimes "fix" a failing test by weakening it. I find one test that now passes trivially and reject it, asking for a version that actually exercises the refresh path. Verification literacy is the difference between a green pipeline and a real safety net.
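To make the difference concrete, here is a sketch of a trivially passing test next to one that exercises the refresh path. `TokenService` and its methods are hypothetical stand-ins for the new token module, and the expiry rules are assumptions chosen for illustration.

```python
class TokenService:
    """Hypothetical stand-in for the new token module."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds

    def issue(self, user_id, now):
        return {"user_id": user_id, "expires_at": now + self.ttl_seconds}

    def refresh(self, token, now):
        # Assumed rule: an expired token cannot be refreshed.
        if now >= token["expires_at"]:
            raise ValueError("token expired; re-authentication required")
        return {"user_id": token["user_id"], "expires_at": now + self.ttl_seconds}


class BrokenTokenService(TokenService):
    def refresh(self, token, now):
        # Bug: never extends expiry and never rejects expired tokens.
        return token


def weak_test(service):
    token = service.issue("u1", now=0)
    # Passes for almost any implementation, including the broken one.
    assert service.refresh(token, now=10) is not None


def strong_test(service):
    token = service.issue("u1", now=0)
    refreshed = service.refresh(token, now=10)
    assert refreshed["user_id"] == "u1"
    assert refreshed["expires_at"] == 10 + service.ttl_seconds
    try:
        service.refresh(token, now=token["expires_at"])
    except ValueError:
        pass
    else:
        raise AssertionError("expired token must not refresh")
```

Running both tests against `BrokenTokenService` shows the gap: `weak_test` passes, while `strong_test` raises `AssertionError`. A test that cannot fail against a deliberately broken implementation is not protecting anything.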

Shipping and what the staging bought

With the suite green and each increment reviewed, I merge behind a feature flag and roll out gradually, watching error rates before flipping it fully on. The flag is the final blast-radius control: even a missed bug affects a slice of traffic, not everyone, and rollback is one toggle.
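A gradual rollout of this kind is often implemented by hashing a stable identifier into a bucket. The sketch below shows one common approach and is not the mechanism of any particular flag service; the flag name and user identifiers are hypothetical.

```python
import hashlib


def in_rollout(flag_name, user_id, percent):
    """Return True if this user falls inside the first `percent` of 100 buckets."""
    # Hashing flag name plus user id gives each user a stable bucket per flag.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


# Hypothetical flag name; the same user gets the same answer on every call.
for user in ("user-1", "user-2", "user-3"):
    print(user, in_rollout("new-token-auth", user, 10))
```

Because the bucket is fixed per user, raising the percentage only adds users and never removes anyone already on the new path. Setting it to 0 is the one-toggle rollback.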

Step back and the value of staging is clear. The 1M-token window made each session smarter: the agent understood the system instead of pattern-matching one file. But the human structure (map first, narrow writes, verify each gate, ship behind a flag) is what made it safe and reviewable. Long context without that structure would have produced one giant diff and a tense deploy. With it, a module whose original author left two years ago migrated cleanly over a few focused sessions.

Frequently asked questions

Why not just do the whole migration in one long session?

A single session would produce an enormous, unreviewable diff with a high chance of subtle breakage, and stale assumptions made early would compound. Staging into verified sessions keeps each diff small and each gate checkable, which is what makes long-context work trustworthy.

What should I carry between sessions instead of full context?

A compact, human-verified artifact — usually a written plan or map produced in an early session. Carrying the distilled understanding rather than raw files keeps later sessions focused and avoids re-anchoring on noise.

How do I stop the agent from weakening tests to make them pass?

Read every updated test as skeptically as production code. Check that it actually exercises the behavior, not that it merely passes. When you find a trivially passing test, reject it and ask for one that fails if the logic breaks.

Where does the 1M context window actually earn its value here?

In the cross-cutting steps, mapping the system and updating middleware, where the agent needs to see how several files interact at once. For narrow write steps, you scope context down; for understanding, you let it read broadly.

From code agents to phone-line agents

CallSphere brings this same staged, verified agentic approach to voice and chat — assistants that understand context across a conversation, call tools mid-call, and book real work without dropping the thread. See a live walkthrough at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
