Prompt and Context Design for AI Work Classification

You can have a perfect taxonomy, clean MCP wiring, and a frozen eval set, and still get bad numbers — because the classification prompt put the wrong things in Claude's context. Context design is the quiet determinant of accuracy in any Anthropic Economic Index-style system. Every token you include either sharpens the judgment or dilutes it, and the line between the two is not obvious. This post is about that line: what belongs in the context window for work classification, what should be ruthlessly excluded, and why each choice moves accuracy or cost.

Think of context as a budget you spend deliberately. The classifier has one job — pick a task and an interaction mode — and everything in the window should serve that job. The art is including the signal that decision needs while excluding the noise that distracts the model or leaks user data.

Key takeaways

Context for classification should contain the taxonomy, the conversation summary, and almost nothing else.
Inject the taxonomy as structured data per request, not baked into a static system prompt.
Summarize the conversation to the signal the judgment needs; full transcripts add cost and leak PII.
Few-shot examples help the mode judgment more than the task judgment — place them where they pay off.
Leave out user identity, raw metadata, and prior conversations — they bias the label and widen privacy risk.

What the classifier actually needs to see

Strip the problem to its essence. To assign a task and a mode, the model needs the list of valid tasks and a faithful description of what happened in the conversation. That is the whole job. Anything beyond those two ingredients is, at best, neutral and, at worst, a source of bias. A lean context for this task is a feature, not a limitation — it keeps the model focused on the decision and makes the judgment reproducible.

The temptation is always to add "helpful" context: the user's plan tier, the channel, the time of day. But the classifier is not predicting behavior; it is describing what occurred. Demographic or account context can nudge the model toward stereotyped labels and has no legitimate place in a task description. Excluding it is both more accurate and more defensible.

Injecting the taxonomy: data, not prose

The taxonomy is the most important thing in the window, and how you place it matters. Inject it as a structured list per request rather than embedding it in a fixed system prompt. Per-request injection lets you version the taxonomy, A/B different descriptions, and keep the system prompt stable. The diagram shows how context is assembled for each conversation.

flowchart TD
  A["Conversation"] --> B["Summarize to task-relevant signal"]
  C["Current taxonomy version"] --> D["Assemble context window"]
  B --> D
  D --> E{"Within token budget?"}
  E -->|No| F["Trim summary, keep taxonomy"]
  E -->|Yes| G["Send to Claude classifier"]
  G --> H["task_id + interaction_mode"]

Notice the trim rule when over budget: cut the summary, never the taxonomy. The taxonomy is the decision space — losing part of it means the model cannot pick a label it should have. The summary is compressible signal. When forced to choose, protect the vocabulary.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

Summaries over transcripts

Feeding the raw transcript is the most common context mistake. It costs tokens proportional to conversation length, carries PII straight into your prompt, and buries the relevant signal under chitchat. A purpose-built summary fixes all three. Prompt the summarizer to capture only what the classifier needs: the action performed and the degree of human iteration. Here is the kind of summary that classifies well:

SUMMARY: User asked the agent to find why a scheduled job
stopped running. Agent read logs, proposed a config fix.
User questioned it, agent revised, user applied the change.
Human iterated twice before accepting.

That summary is forty words, contains no PII, and encodes both signals: the task (diagnose a failing job) and the mode (augmentation — the human iterated). It will classify as accurately as the full transcript while costing a fraction and leaking nothing.

Where few-shot examples earn their tokens

Examples are expensive context, so spend them where they move accuracy most. For task assignment against a well-described closed list, the model rarely needs examples — the descriptions carry it. The augment-vs-automate judgment is different: the boundary is fuzzy and benefits from two or three contrasting examples showing what iteration looks like versus wholesale delegation.

Place those examples in the mode-judgment context only, not the task-judgment context. This targeted use of few-shot keeps your token budget lean where examples do not help and invests it precisely where the decision is genuinely ambiguous. Generic, everywhere-the-same few-shot blocks are wasted spend.

What to deliberately leave out

Exclusion is as much a design act as inclusion. Leave out user identity and account attributes — they bias labels and have no bearing on what task occurred. Leave out prior conversations from the same user, because they tempt the model to label based on history rather than the conversation at hand. Leave out raw timestamps and channel metadata unless your taxonomy genuinely distinguishes by them. Each omission narrows privacy exposure and sharpens the judgment.

There is a useful test: for every item you consider adding to context, ask whether it changes which task or mode is correct. If it does not, it is noise, and noise in a classifier's context is a slow accuracy leak. The disciplined default is to exclude until proven necessary.

Ordering and emphasis within the window

Once you have decided what goes in, order still matters. Put the instructions and the output contract first, the taxonomy next, and the conversation summary last, closest to where the model generates its answer. This ordering keeps the decision space fresh in the model's attention right before it commits to a label, and it mirrors how the strongest classification prompts are laid out in practice. The summary is the thing being judged, so it should sit immediately before the judgment.

Emphasis is the other lever. State the single most important constraint — pick from the list or return unclassified, never invent — explicitly and early, and do not bury it under qualifications. Models follow instructions that are unambiguous and prominent far more reliably than ones hedged across three sentences. A classification prompt should read like a checklist the model cannot misinterpret, not like a memo it has to parse for intent.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

Caching the stable parts of context

The taxonomy and the system instructions are identical across millions of conversations, while only the summary changes. That structure is made for prompt caching: mark the stable prefix — instructions plus taxonomy — as cacheable, and let only the summary vary per request. At high volume this meaningfully cuts both latency and cost, because the model does not re-process the unchanging vocabulary every single call.

The design implication is to keep the cacheable prefix genuinely stable. If you interleave per-conversation data into the taxonomy block, you break the cache and pay full price every time. Keep the boundary clean: a fixed, cacheable header of instructions and taxonomy, then the small variable summary at the end. This is one more reason to inject the taxonomy as a discrete block rather than weaving user-specific details through it.

Common pitfalls in context design

Dumping the full transcript. It is expensive, leaky, and dilutes signal. Summarize to exactly what the judgment needs.
Baking the taxonomy into a static prompt. You lose versioning and A/B ability. Inject it per request as structured data.
Adding user identity "for context." It biases labels toward stereotypes and adds privacy risk for zero accuracy gain. Leave it out.
Trimming the taxonomy under budget pressure. Cutting the decision space causes wrong labels. Trim the summary instead, always.
Uniform few-shot everywhere. Examples cost tokens; spend them only on the ambiguous mode judgment, not the well-defined task judgment.

Design your context in 5 steps

Reduce each conversation to a task-focused summary that encodes the action and the degree of human iteration.
Inject the current taxonomy version as a structured list in every request, kept separate from the system prompt.
Set a token budget and a trim rule that protects the taxonomy and compresses the summary.
Add two or three contrasting few-shot examples only to the augment-vs-automate context.
Exclude identity, history, and metadata unless an item provably changes the correct label.

Include or exclude?

Context item	Decision	Reason
Task taxonomy	Include (always)	It is the decision space
Conversation summary	Include (compressed)	Carries the signal cleanly
Mode few-shot examples	Include (mode call only)	Boundary is ambiguous
User identity / tier	Exclude	Biases labels, adds risk
Prior conversations	Exclude	Labels drift to history

Run every candidate item through this lens and your context stays lean, your accuracy stays high, and your privacy posture stays clean — the three things an Economic Index-style classifier lives or dies on.

Frequently asked questions

What should never go in a classification prompt?

Raw transcripts with PII, user identity or account tier, and prior conversation history. Each adds privacy risk and biases the label without improving the judgment about what task occurred and how.

Should the taxonomy go in the system prompt?

No — inject it per request as structured data. That lets you version it, A/B different label descriptions, and keep the system prompt stable while the taxonomy evolves.

When do few-shot examples help classification?

Mainly for the augment-vs-automate judgment, where the boundary is genuinely fuzzy. Task assignment against a well-described closed list usually needs no examples, so spend tokens only where they move accuracy.

What do I cut when I am over the token budget?

Trim the conversation summary, never the taxonomy. The taxonomy is the set of labels the model can choose from; cutting it guarantees wrong answers, while a shorter summary still carries the core signal.

Context-disciplined agents for voice and chat

CallSphere applies this same context discipline to voice and chat agents that answer every call, use tools mid-conversation, and book work around the clock — feeding Claude exactly the signal it needs and nothing more. See it at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.