
The 200K Context Window That Wasn't: Claude's Effective Memory Tested Under Load

Marketing context length is not effective context. We test Claude's memory under realistic load, compare to Gemini and GPT, and give you a hard rule of thumb.

Anthropic's marketing page lists a 1 million token context window for Claude Opus 4.6 and Sonnet 4.6. Gemini 2.5 Pro lists 1M with experimental 2M. GPT-5.4 lists 400K. These are all real numbers — the API will accept that many tokens without erroring. They are also all misleading, in the same specific way: the model can read the tokens, but its ability to retrieve, reason over, and act on them degrades long before the advertised limit.

This post walks through the evidence that "context window" and "effective memory" are different quantities, explains why they are different, and gives you a hard rule of thumb for production systems that need precise retrieval over large documents.

The claim

Long context is one of the most-marketed capabilities in modern LLMs. The implicit promise: you can stuff 500 pages of legal contracts, an entire codebase, or six months of customer logs into a single prompt and the model will reason over all of it as fluidly as a 5K-token chat.

The reality: at 100K+ tokens, every frontier model exhibits measurable degradation on retrieval, instruction adherence, and multi-needle reasoning. The degradation curve differs by model and task, but no model maintains short-context performance at advertised maximum.

What the data actually shows

The first popular long-context test was Greg Kamradt's needle-in-haystack benchmark from 2023, which placed a single specific fact at varying depths in a long document and asked the model to retrieve it. By 2024, frontier models had effectively saturated this test, scoring near 100% across their full advertised windows. This is when the marketing started.
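
For readers who want to reproduce the idea, here is a minimal sketch of how a single-needle harness is put together. This is not Kamradt's original code; the needle sentence, the filler corpus, and the ask_model stub are illustrative placeholders you would swap for your own data and API client.

```python
import random

NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "Based only on the document above: what is the best thing to do in San Francisco?"

FILLER = [
    "The committee reviewed the quarterly figures without reaching a decision.",
    "Migration patterns of the arctic tern remain a subject of active study.",
    "The bridge was repainted twice between 1987 and 1994.",
]

def build_haystack(target_words: int, depth: float) -> str:
    """Pad with filler sentences to roughly `target_words` words, then bury the
    needle at fractional `depth` (0.0 = very start, 1.0 = very end)."""
    sentences, words = [], 0
    while words < target_words:
        s = random.choice(FILLER)
        sentences.append(s)
        words += len(s.split())
    sentences.insert(int(len(sentences) * depth), NEEDLE)
    return " ".join(sentences)

def ask_model(prompt: str) -> str:
    """Placeholder: swap in your actual chat-completion call here."""
    return ""

# Sweep length and depth; score whether the answer contains the planted fact.
for target_words in (6_000, 24_000, 96_000):       # roughly 8K / 32K / 128K tokens
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_haystack(target_words, depth) + "\n\n" + QUESTION
        answer = ask_model(prompt)
        print(target_words, depth, "Dolores Park" in answer)
```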

Subsequent benchmarks revealed that single-needle retrieval was the easy version of the problem.

MRCR, RULER, and LongBench

The benchmark difficulty ladder, roughly:

  • Single-needle retrieval: easy, saturated by 2024.
  • Multi-needle retrieval: hard, degrades by ~50K tokens.
  • Multi-needle retrieval plus reasoning: very hard, degrades by ~25K tokens.
  • Multi-needle retrieval plus instruction following: the production reality, degrades even earlier.
  • End state in the real world: stale or hallucinated answers.

MRCR (Multi-Round Coreference Resolution) asks the model to retrieve and reason over multiple related facts scattered across the document, with coreference chains the model has to maintain. This is much harder than single-needle. As of April 2026, Claude Opus 4.6 publishes 78.3% on MRCR v2 at 1M tokens, which is leading the field but still well below short-context performance on the same kind of task.

RULER is a multi-task long-context benchmark with categories for retrieval, multi-hop tracing, aggregation, and question answering. RULER results consistently show frontier models maintaining strong performance through about 32K tokens, then declining steadily. By 128K, most models have lost 15 to 30 points relative to their 32K performance. By 1M, the decline is steeper.

LongBench is multi-task and multilingual, and it tests the additional dimension that benchmarks like MRCR do not: instruction adherence at length. Models that retrieve facts correctly at 100K can still fail to follow output formatting instructions, ignore late-stage system prompts, or revert to default behaviors when the relevant instruction is buried near the start.

The mid-document attention problem

A consistent finding across all three benchmarks: information placed at the very start or very end of the context is retrieved more reliably than information in the middle. This "lost in the middle" effect was first documented by Stanford's Liu et al. in 2023 and remains true in 2026, just less severe than it was. For Claude specifically, retrieval accuracy on facts placed between roughly 30% and 70% of the way through the context is meaningfully lower than at the endpoints.

In practical terms: if you put a critical instruction in the middle of a 200K-token prompt, the model is more likely to ignore it than if you put the same instruction at the start or end. This is the opposite of how humans read.

Multi-needle reasoning

Single-needle retrieval is finding one fact. Multi-needle reasoning is finding several related facts and combining them. Every frontier model is materially worse at multi-needle than single-needle, and the gap widens with context length. Claude is among the better performers here, but "better" still means double-digit point drops at 128K versus 32K.

Why this happens (technical)

Three architectural realities drive the degradation.


Attention dilution. Transformer attention distributes a fixed budget of "where to look" across all input tokens. As input length grows, the per-token attention budget shrinks. The model can still attend everywhere, but with less precision. Recent architectural improvements — sliding window attention, sparse attention, ring attention — mitigate but do not eliminate this.
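
A toy softmax calculation makes the dilution concrete. It is not a claim about any specific model's internals; it only shows how large a logit advantage a single "needle" token needs over undistinguished distractors for its attention weight to hold steady as the context grows.

```python
import math

def required_logit_margin(context_tokens: int, target_weight: float = 0.5) -> float:
    """Logit advantage the needle token needs over n-1 undistinguished distractors
    for its softmax attention weight to stay at `target_weight`.
    Derivation: w = e^m / (e^m + (n - 1))  =>  m = log(w / (1 - w)) + log(n - 1)."""
    n = context_tokens
    return math.log(target_weight / (1 - target_weight)) + math.log(n - 1)

for n in (5_000, 32_000, 128_000, 500_000, 1_000_000):
    print(f"{n:>9,} tokens -> needle needs a logit margin of {required_logit_margin(n):.2f}")
```

The required margin grows like log(n): roughly 8.5 nats at 5K tokens and almost 14 at 1M, so the query has to be substantially more discriminating just to stand still.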

Position encoding decay. Rotary position embeddings (RoPE) and similar schemes work well at training-distribution lengths and decay outside them. Models trained mostly on 32K examples and extended to 1M through length-extrapolation techniques exhibit cleaner attention near the trained range and noisier attention beyond it. Claude's training mix includes long-context data, but the distribution still tilts toward shorter examples.
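
A small numerical sketch of the extrapolation problem, using the standard RoPE frequency schedule; the head dimension, base, and 32K trained length are illustrative assumptions, not Claude's actual configuration.

```python
import math

DIM, BASE = 128, 10_000.0      # typical per-head RoPE dimension and base (illustrative)
TRAINED_LEN = 32_768           # assumed dominant training context length

def wavelength(i: int) -> float:
    """Full rotation period, in token positions, of RoPE frequency pair i,
    where theta_i = BASE ** (-2 * i / DIM) and the period is 2 * pi / theta_i."""
    return 2 * math.pi * BASE ** (2 * i / DIM)

# Pairs whose period exceeds the trained length never complete a full cycle in
# training; at 1M-token distances they rotate into angles the model has to
# extrapolate to rather than recall.
unseen = [i for i in range(DIM // 2) if wavelength(i) > TRAINED_LEN]
print(f"{len(unseen)} of {DIM // 2} frequency pairs never finish a cycle within {TRAINED_LEN:,} tokens")
```

With these numbers, 4 of the 64 frequency pairs never complete a full rotation inside the trained window, and those slow, long-range components are exactly what a 1M-token prompt leans on.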

Instruction-following interference. Long context contains more material that competes with system prompts and user instructions. A 200K-token document includes thousands of imperative sentences, and the model's instruction-following pathways have to distinguish "real" instructions from textual content. The longer the context, the more interference, the more the model regresses to default behaviors.

The 1M context comparison

Model | Advertised window | Strong-retrieval band | Steep decline begins
Claude Opus 4.6 | 1M | up to ~64K | ~256K
Claude Sonnet 4.6 | 1M | up to ~64K | ~256K
Gemini 2.5 Pro | 1M (2M experimental) | up to ~128K | ~512K
GPT-5.4 | 400K | up to ~64K | ~200K
GPT-5.2 | 200K | up to ~32K | ~100K

These bands are approximate and task-dependent. Gemini holds long context unusually well on retrieval-heavy tasks but is not consistently stronger on multi-needle reasoning. Claude leads on coherent generation over long input. GPT excels at shorter contexts but holds up reasonably to ~200K.

When to use long context vs RAG

Long context is real and useful. The mistake is treating it as a replacement for retrieval-augmented generation in cases where retrieval would be cheaper, faster, and more reliable.

Use long context when

  • The task requires holistic reasoning over the full document (full-document summarization, cross-document synthesis, codebase-wide refactoring planning).
  • The document is small enough to fit entirely in the strong-retrieval band.
  • Cost and latency are not primary constraints.
  • Precision retrieval is not the goal — comprehension is.

Use RAG when

  • The corpus is larger than the strong-retrieval band.
  • Precise retrieval of specific facts is the goal.
  • Cost matters: retrieving 5K relevant tokens is far cheaper than feeding 500K.
  • Latency matters: long-context calls are slower.
  • The corpus updates frequently and you want fresh data without re-running prompts.

Use both when

  • You need precise retrieval (RAG) plus holistic reasoning (long context). Retrieve the relevant subset, then expand with surrounding context, then send a reasonably-sized prompt that fits in the strong-retrieval band.
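
A minimal sketch of that retrieve-then-expand pattern follows. The search, expand, and ask_model callables are hypothetical stand-ins for whatever vector store and model client you already run; only the shape of the pipeline is the point.

```python
# Retrieve-then-expand, hypothetical end to end.
STRONG_RETRIEVAL_BUDGET = 64_000   # tokens; the band from the table above

def estimate_tokens(text: str) -> int:
    return len(text) // 4          # rough chars-per-token heuristic

def answer_with_hybrid_context(question: str, search, expand, ask_model) -> str:
    # 1. Precision step (RAG): pull the chunks most relevant to the question.
    hits = search(question, top_k=20)

    # 2. Comprehension step: widen each hit with its neighbouring paragraphs so
    #    the model can reason holistically, not just over isolated snippets.
    passages = [expand(hit, window=2) for hit in hits]

    # 3. Budget step: keep the prompt inside the strong-retrieval band.
    kept, used = [], 0
    for p in passages:
        if used + estimate_tokens(p) > STRONG_RETRIEVAL_BUDGET:
            break
        kept.append(p)
        used += estimate_tokens(p)

    # 4. Instructions at the start and the end, never only in the middle.
    instructions = "Answer only from the numbered passages. Cite passage numbers."
    numbered = [f"[{i + 1}] {p}" for i, p in enumerate(kept)]
    prompt = "\n\n".join([instructions, *numbered, instructions, question])
    return ask_model(prompt)
```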

Implications for production

The single most useful rule of thumb we have derived from running long-context workloads in production: use no more than 25% of the advertised context window for tasks that require precise retrieval or strict instruction adherence. For Claude Opus 4.6's 1M window, that is 250K. For practical purposes, we set our hard ceiling at 200K and keep our preferred working range under 64K.

For comprehension tasks — summarization, qualitative analysis, sentiment over long transcripts — you can push further, often to 50% of the window, because the failure mode is graceful degradation rather than missed facts.
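
Written down as a tiny helper, so the rule lives in config rather than tribal knowledge. This is a minimal sketch: the 25% and 50% fractions come from the paragraphs above, and the task_type labels are our own naming, not an API concept.

```python
def context_budget(advertised_window: int, task_type: str) -> int:
    """Practical prompt ceiling derived from the advertised window.
    'precision' = precise retrieval / strict instruction adherence (25% rule),
    'comprehension' = summarization, sentiment, qualitative analysis (50% rule)."""
    fraction = {"precision": 0.25, "comprehension": 0.50}[task_type]
    return int(advertised_window * fraction)

# For a 1M-token window: 250_000 for precision work (we cap at 200_000 in practice),
# 500_000 for comprehension work.
print(context_budget(1_000_000, "precision"))      # 250000
print(context_budget(1_000_000, "comprehension"))  # 500000
```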

The other production-tested heuristics:

  • Place critical instructions at the start and end of the prompt, never in the middle (a minimal sketch of this layout follows the list).
  • Repeat key instructions if the prompt is over 50K tokens. Redundancy is cheaper than failure.
  • Test your specific workload with synthetic needles inserted at multiple depths. Depth-specific failures are common and unpredictable from benchmarks alone.
  • Cache aggressively. Anthropic's prompt caching gives you long-context economics that look more like RAG.
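
To make the first two heuristics concrete, here is a minimal prompt-assembly sketch. The sandwich layout and the 50K threshold come straight from the list above; the chars-per-token estimate is a rough placeholder.

```python
def assemble_prompt(instructions: str, document: str, question: str) -> str:
    """Sandwich layout: critical instructions at the start and the end, never
    only in the middle. If the document pushes the prompt past ~50K tokens,
    repeat the instructions again right after it; redundancy is cheaper than failure."""
    est_tokens = len(document) // 4                  # rough chars-per-token estimate
    parts = [instructions, document]
    if est_tokens > 50_000:
        parts.append("Reminder of the instructions above:\n" + instructions)
    parts += [question, instructions]
    return "\n\n".join(parts)
```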

What CallSphere does

Our healthcare voice agents (14 tools) call Claude with carefully sized prompts that pull in only the relevant patient and clinic context per turn — typically 8K to 20K tokens, well inside the strong-retrieval band. For after-hours overflow (7 agents) and IT helpdesk (10 agents plus RAG), we pair RAG retrieval with Claude or Gemini analytics so each prompt stays under 32K. Where we do use long context — full-shift transcript analysis, weekly trend reports across hundreds of calls — we route to Gemini for its longer strong-retrieval band, or to Claude with prompt caching for cost. Voice itself runs on the OpenAI Realtime API for latency reasons and never approaches long-context territory.

FAQ

Q: Is the 1M context window real? The API will accept 1M tokens. The model's reasoning quality at 1M is not the same as at 32K. Both statements are true.

Q: Where does Claude actually start to degrade? Noticeable degradation on multi-needle and instruction-adherence tasks begins around 64K to 128K tokens, with steep decline past 256K. Single-needle retrieval holds up much further.

Q: Should I use long context or RAG? RAG is cheaper, faster, and more reliable for precision retrieval. Long context is better for holistic comprehension. Most mature production systems use both.

Q: Does prompt caching help? Yes, dramatically, for economics. Caching does not change the model's effective memory, only the cost of using long context.
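
For what "cache aggressively" looks like in practice, here is a sketch using Anthropic's cache_control content-block mechanism. The model name is a placeholder and field shapes can vary across SDK versions, so check the current prompt-caching docs before copying.

```python
import anthropic

client = anthropic.Anthropic()
full_contract_text = open("contract.txt").read()      # the long, reused context

response = client.messages.create(
    model="claude-sonnet-latest",                      # placeholder model name
    max_tokens=1024,
    system=[
        {"type": "text",
         "text": "You answer questions about the attached contract."},
        {"type": "text",
         "text": full_contract_text,
         "cache_control": {"type": "ephemeral"}},      # this block is cached across calls
    ],
    messages=[{"role": "user",
               "content": "What is the termination notice period?"}],
)
print(response.content[0].text)
```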

Q: Which model has the best long context as of April 2026? Gemini 2.5 Pro is strongest on retrieval at very long contexts. Claude Opus 4.6 is strongest on coherent reasoning over long input. GPT-5.4 is competitive up to ~200K and falls off thereafter.

Q: How do I know if my prompt is too long? The best signal is your private eval. Run the same task at progressively longer context lengths with a held-out evaluation set, and find the inflection point where accuracy starts to drop measurably. That inflection — not the API limit — is your practical ceiling. For most production teams using Claude Sonnet 4.6, the inflection lands somewhere between 48K and 96K tokens depending on task type.
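
A minimal sketch of that inflection-point search, assuming you already have a run_eval(length) harness that builds prompts padded to a given length from a held-out set and returns accuracy; the 5-point drop threshold is an arbitrary example.

```python
def find_context_ceiling(run_eval,
                         lengths=(8_000, 16_000, 32_000, 64_000, 128_000, 256_000),
                         max_drop: float = 0.05) -> int:
    """Run the same held-out eval at increasing context lengths and return the
    longest length whose score stays within `max_drop` of the shortest-length baseline."""
    baseline = run_eval(lengths[0])
    ceiling = lengths[0]
    for n in lengths[1:]:
        if baseline - run_eval(n) > max_drop:
            break
        ceiling = n
    return ceiling

# run_eval(length) is your harness: build prompts padded to `length` tokens from a
# held-out set and return accuracy in [0, 1]. The returned ceiling, not the API
# limit, is your practical maximum.
```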

Q: Does adaptive thinking help long-context reasoning? Yes, modestly. Extended reasoning gives the model more compute to traverse the context, which helps multi-hop and aggregation tasks. It does not fix attention dilution at 500K tokens, and it adds latency that may not fit voice or real-time use cases.

A note on benchmark recency

Long-context benchmarks have improved faster than most other LLM benchmark categories, because they were obviously broken (single-needle saturation) and because synthetic test generation makes contamination harder. MRCR v2, RULER, and the most recent LongBench releases are credible signals as of April 2026. If you are reading benchmark results published before mid-2024, they are almost certainly using easier tests than the current frontier and will overstate model capability at long context.

The 1M context window is a real capability and a real marketing flourish at the same time. As of April 2026, treat advertised maximums as the upper bound of physical possibility, not the upper bound of practical utility. Build your systems around the strong-retrieval band, not the headline number.


#LongContext #Claude200K #RAG #ContextEngineering #LLMMemory #CallSphere
