Reusable Patterns for Classifying AI Work With Claude

The Anthropic Economic Index works because of a handful of repeatable engineering patterns, not magic. Once you have built one classification-and-aggregation pipeline, you start to see the same shapes everywhere: the closed-vocabulary prompt, the two-stage judgment, the context budget that keeps long conversations from blowing your token bill, the eval harness that catches taxonomy drift. This post is a pattern catalog, and each pattern is something you can lift directly into your own Claude-based measurement system.

Where the previous walkthroughs covered architecture and step-by-step build, this is about the code-level craft — how to structure prompts, tools, and context so the system stays accurate, cheap, and maintainable as your volume grows from thousands to millions of conversations.

Key takeaways

A closed-vocabulary prompt with an explicit "unclassified" escape is the single highest-leverage pattern.
Enforce schema with tool-use or a strict response contract, never free-text parsing.
Summarize long conversations before classifying to control context cost without losing signal.
Keep the augment/automate label as a separate, independently-evaluable judgment.
An eval set with frozen gold labels is what protects you from silent taxonomy drift.

Pattern 1: The closed-vocabulary classifier prompt

The foundational pattern is constraining the model to a fixed set of labels. The prompt names the vocabulary, forbids invention, and provides an escape hatch. Structure it so the taxonomy is data, not prose — passed as a list the model echoes an ID from. A reusable system prompt looks like:

You are a task classifier. You will receive a list of TASKS
(each with an id and description) and a CONVERSATION SUMMARY.
Return exactly one task id from the list, or "unclassified"
if none fits. Never invent an id. Output JSON only:
{"task_id": "...", "confidence": 0.0-1.0}

Three properties make this durable: the vocabulary is injected per-request (so you can version it), the escape hatch keeps precision high (the model is not forced to guess), and the confidence field lets you filter downstream. Every classification system in the Index family rests on a prompt shaped like this.
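As a minimal sketch of "taxonomy as data", the request can be assembled like this. The function name build_request, the task ids, and the version string are illustrative assumptions, not part of any published pipeline; only the system prompt text comes from above.

```python
import json

SYSTEM_PROMPT = (
    "You are a task classifier. You will receive a list of TASKS "
    "(each with an id and description) and a CONVERSATION SUMMARY. "
    'Return exactly one task id from the list, or "unclassified" '
    "if none fits. Never invent an id. Output JSON only: "
    '{"task_id": "...", "confidence": 0.0-1.0}'
)


def build_request(tasks, summary, taxonomy_version):
    """Assemble the system prompt and user message for one classification."""
    # The taxonomy travels as JSON data so the model can echo an id verbatim.
    task_block = json.dumps(
        [{"id": t["id"], "description": t["description"]} for t in tasks],
        indent=2,
    )
    user = f"TASKS:\n{task_block}\n\nCONVERSATION SUMMARY:\n{summary}"
    return {
        "system": SYSTEM_PROMPT,
        "user": user,
        # Kept alongside the request so the result can be versioned later.
        "taxonomy_version": taxonomy_version,
    }


# Hypothetical two-entry taxonomy, for illustration only.
TASKS = [
    {"id": "T001", "description": "Draft or edit written correspondence"},
    {"id": "T002", "description": "Debug or modify source code"},
]
request = build_request(TASKS, "User iterated on a reply to a customer.", "2024-01")
```

Because the task list is a parameter rather than prose baked into the prompt, swapping in a new taxonomy version is a data change, not a prompt rewrite.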

Pattern 2: Schema enforcement over parsing

Never regex a label out of prose. Use Claude's tool-use to define the output shape and let the model fill it: define a single tool whose input schema is your judgment, and the model returns a structured object you can validate. The Mermaid flowchart below shows where enforcement sits in the request lifecycle.

flowchart TD
  A["Conversation summary"] --> B["Build prompt + inject taxonomy"]
  B --> C["Call Claude with classify tool"]
  C --> D{"Valid tool output?"}
  D -->|No| E["Retry once, then 'unclassified'"]
  D -->|Yes| F["Validate confidence floor"]
  F --> G["Emit fact row"]

A tool definition makes the contract explicit. Here is the schema you would register so the model can only respond in a shape your pipeline accepts:

{
  "name": "classify_task",
  "description": "Assign the conversation to one task and one interaction mode.",
  "input_schema": {
    "type": "object",
    "properties": {
      "task_id": {"type": "string", "description": "An id from the injected task list, or \"unclassified\"."},
      "interaction_mode": {"type": "string", "enum": ["augmentation", "automation"]},
      "confidence": {"type": "number", "minimum": 0, "maximum": 1}
    },
    "required": ["task_id", "interaction_mode", "confidence"]
  }
}

With a tool schema, malformed output is a validation failure you can retry, not a silent corruption that poisons your aggregates. This is the difference between a toy and a system you trust at scale.
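The validate-and-retry branch of the flowchart can be sketched in a few lines. This is an illustration under stated assumptions: call_model is a hypothetical stand-in for whatever function sends the request and returns the tool input as a dict, and the fallback row shape is my own choice, not something the source specifies.

```python
def validate_judgment(payload, valid_task_ids):
    """Return True if payload matches the classify_task contract."""
    if not isinstance(payload, dict):
        return False
    task_id = payload.get("task_id")
    mode = payload.get("interaction_mode")
    confidence = payload.get("confidence")
    if not isinstance(task_id, str):
        return False
    # Reject invented ids: only the injected list or the escape hatch pass.
    if task_id != "unclassified" and task_id not in valid_task_ids:
        return False
    if mode not in ("augmentation", "automation"):
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        return False
    return 0.0 <= confidence <= 1.0


def classify_with_retry(call_model, summary, valid_task_ids, max_attempts=2):
    """Try the model, retry once on invalid output, then fall back."""
    for _ in range(max_attempts):
        payload = call_model(summary)  # hypothetical model-calling function
        if validate_judgment(payload, valid_task_ids):
            return payload
    # Assumed fallback shape: an explicit 'unclassified' row with no mode.
    return {"task_id": "unclassified", "interaction_mode": None, "confidence": 0.0}


# Fake model: first response invents an id, second response is valid.
responses = iter([
    {"task_id": "T999", "interaction_mode": "automation", "confidence": 0.9},
    {"task_id": "T001", "interaction_mode": "augmentation", "confidence": 0.8},
])
result = classify_with_retry(lambda s: next(responses), "summary text", {"T001", "T002"})
always_bad = classify_with_retry(lambda s: "not a dict", "summary text", {"T001"})
```

Validating against the injected id set in your own code matters even with a tool schema, since the schema constrains the shape of task_id but not which ids exist in the current taxonomy version.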

Pattern 3: Summarize-then-classify for context economy

Long conversations are expensive to classify in full. The pattern is a two-pass approach: a cheap pass condenses the conversation into a few sentences capturing what task was performed and how, then the classifier reads only the summary. This caps the classifier's per-conversation token cost regardless of how long the original was (the summarizer still reads the full transcript once), and it tends to drop incidental PII as a side effect, though it is no substitute for deliberate redaction.

The trick is summarizing for the downstream judgment, not generically. Prompt the summarizer to preserve exactly the signal the classifier needs (the action taken and the degree of human iteration) and to drop everything else. A two-line, task-focused summary can then classify about as accurately as the full transcript at a fraction of the cost, which is worth confirming against your own eval set.
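A rough sketch of the two-pass flow follows. The prompt wording, the summarize and classify callables, and the 400-character cap are all assumptions made for illustration; in a real system both callables would wrap model calls.

```python
SUMMARIZER_PROMPT = (
    "Summarize the conversation in at most two sentences. "
    "State (1) the action the assistant performed and "
    "(2) how much the user iterated on or directed the result. "
    "Omit names, contact details, and anything not needed for (1) and (2)."
)


def summarize_then_classify(transcript, summarize, classify, max_summary_chars=400):
    """Run the cheap summarizing pass, then classify only the summary."""
    summary = summarize(SUMMARIZER_PROMPT, transcript)
    # Crude hard cap so a verbose summarizer cannot reintroduce unbounded cost.
    summary = summary[:max_summary_chars]
    return summary, classify(summary)


# Stub callables standing in for the two model calls.
seen_by_classifier = []


def fake_summarize(prompt, transcript):
    return "Assistant drafted a customer reply; user requested three rounds of edits. " * 20


def fake_classify(summary):
    seen_by_classifier.append(summary)
    return {"task_id": "T001", "confidence": 0.9}


summary, judgment = summarize_then_classify(
    "very long transcript " * 5000, fake_summarize, fake_classify
)
```

The classifier's input is bounded by max_summary_chars no matter how long the transcript is, which is the property the pattern is after. A character cap is only a blunt proxy for tokens; a token-based limit on the summarizer's output would be the more precise control.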

Pattern 4: Separate judgments, shared context

The task label and the augment/automate label answer different questions and should be independently evaluable, but they can share the same summarized context. The pattern is to compute the summary once, then issue two judgments against it. This keeps each label cleanly testable while avoiding the cost of re-summarizing. If you later need to retune the mode definition, you re-run only that judgment against cached summaries.

Resist the urge to derive the mode from the task — they are orthogonal. The same task ("draft a reply") can be augmentation for one user and automation for another, depending entirely on how they interacted. Treating them as separate dimensions is what gives the Index its most interesting findings.
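One way to picture "summarize once, judge twice" is a small class that caches summaries by conversation id. The class name JudgmentPipeline and the three injected callables are hypothetical; the in-memory dict stands in for whatever store you would actually use.

```python
class JudgmentPipeline:
    """Caches one summary per conversation and issues two separate judgments."""

    def __init__(self, summarize, judge_task, judge_mode):
        self.summarize = summarize
        self.judge_task = judge_task
        self.judge_mode = judge_mode
        self._summaries = {}  # conversation_id -> cached summary

    def summary_for(self, conversation_id, transcript):
        # Summarize only on a cache miss.
        if conversation_id not in self._summaries:
            self._summaries[conversation_id] = self.summarize(transcript)
        return self._summaries[conversation_id]

    def label(self, conversation_id, transcript):
        summary = self.summary_for(conversation_id, transcript)
        # Two independent calls over the same context.
        return {
            "task_id": self.judge_task(summary),
            "interaction_mode": self.judge_mode(summary),
        }

    def rerun_mode(self, judge_mode):
        """Re-judge mode only, from cached summaries, with a retuned judge."""
        return {cid: judge_mode(s) for cid, s in self._summaries.items()}


summarize_calls = []


def fake_summarize(transcript):
    summarize_calls.append(transcript)
    return "User asked for a reply draft and sent it unchanged."


pipeline = JudgmentPipeline(
    fake_summarize,
    judge_task=lambda s: "T001",
    judge_mode=lambda s: "automation",
)
first = pipeline.label("c1", "full transcript text")
second = pipeline.label("c1", "full transcript text")
retuned = pipeline.rerun_mode(lambda s: "augmentation")
```

Labeling the same conversation twice triggers only one summarizer call, and rerun_mode touches neither the summarizer nor the task judge.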

Pattern 5: A frozen eval set against drift

Taxonomy classification silently degrades when you change prompts, models, or the taxonomy itself. The defense is a small, frozen set of gold-labeled conversations (a few hundred is enough) that you re-score on every change. If accuracy on the gold set drops, you have caught the regression before it reaches your dashboards. This eval harness is the most underrated pattern in the whole stack.

Store the gold set as (summary, expected_task_id, expected_mode) tuples and compute per-label precision and recall, not just overall accuracy. Per-label metrics reveal when a single category starts absorbing conversations it should not — the classic symptom of a vague task description that needs tightening.
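Per-label precision and recall need nothing beyond the standard library. The sketch below assumes single-label classification and uses made-up task ids; returning None for an undefined ratio is a design choice of this example.

```python
from collections import Counter


def per_label_metrics(expected, predicted):
    """Compute precision and recall per label for single-label predictions."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for exp, pred in zip(expected, predicted):
        if exp == pred:
            tp[exp] += 1
        else:
            fp[pred] += 1  # the predicted label absorbed a wrong item
            fn[exp] += 1   # the expected label lost an item
    metrics = {}
    for label in set(expected) | set(predicted):
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        metrics[label] = {
            # None marks an undefined ratio (label never predicted or never expected).
            "precision": tp[label] / p_den if p_den else None,
            "recall": tp[label] / r_den if r_den else None,
        }
    return metrics


# Tiny illustrative gold set: one T001 conversation is mislabeled as T002.
expected = ["T001", "T001", "T002", "T002"]
predicted = ["T001", "T002", "T002", "T002"]
metrics = per_label_metrics(expected, predicted)
```

Here overall accuracy is 0.75, which hides the shape of the error. The per-label view shows T002 with perfect recall but precision of 2/3, and T001 with perfect precision but recall of 0.5: T002 is absorbing conversations that belong to T001.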

Pattern 6: Confidence-gated human review

No classifier is perfect, and the honest pattern is to route the uncertain cases somewhere useful rather than pretending the label is final. Because every classification carries a confidence score, you can gate on it: high-confidence judgments flow straight to the fact table, while low-confidence ones are diverted to a review queue or simply marked unclassified. This keeps your aggregates clean — they reflect only judgments the model was sure about — and it surfaces exactly the conversations your taxonomy struggles with.

The diverted cases are gold for improving the system. A cluster of low-confidence conversations almost always means a missing or overlapping task label. Periodically reviewing that queue tells you where to split a category or add a new one, closing the loop between the classifier's uncertainty and your taxonomy's coverage. The Index's taxonomy is fixed by O*NET, but a product taxonomy you own should evolve, and this pattern is how you learn where it needs to.
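A confidence gate is a short partition function. The 0.7 floor and the decision to send "unclassified" rows to review are assumptions for this sketch; the right floor depends on how your model's confidence scores behave on your eval set.

```python
from collections import Counter


def gate(judgments, floor=0.7):
    """Split judgments into accepted facts and a review queue."""
    accepted, review = [], []
    for j in judgments:
        confident = j["task_id"] != "unclassified" and j["confidence"] >= floor
        (accepted if confident else review).append(j)
    return accepted, review


# Illustrative judgments with made-up ids and scores.
judgments = [
    {"conversation_id": "c1", "task_id": "T001", "confidence": 0.92},
    {"conversation_id": "c2", "task_id": "T002", "confidence": 0.41},
    {"conversation_id": "c3", "task_id": "T002", "confidence": 0.55},
    {"conversation_id": "c4", "task_id": "unclassified", "confidence": 0.30},
]
accepted, review = gate(judgments)

# Count which labels dominate the review queue to find taxonomy weak spots.
hotspots = Counter(j["task_id"] for j in review)
```

In this toy data the review queue clusters on T002, which is the signal that its description may be vague or overlapping with a neighbor.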

Pattern 7: Model and prompt versioning on every fact

The subtle pattern that saves you months later is stamping each fact row with the model id and prompt version that produced it. When you upgrade from one Claude model to the next, or tweak a prompt, the labels can shift slightly — and if every fact looks identical, you cannot tell a real change in user behavior from an artifact of your own pipeline. Storing model_version and prompt_version alongside each label lets you segment trends by pipeline generation and rule out measurement drift.

This is cheap insurance. Two extra columns turn an ambiguous "did usage really change or did we just change the classifier?" into a query you can answer definitively. For any system meant to track change over time — which is the entire point of an Economic Index — provenance on every fact is non-negotiable.
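To make the two extra columns concrete, here is a sketch of a fact row and a per-generation breakdown. The FactRow field layout and the identifiers "model-a", "model-b", and "p1" are placeholders invented for the example, not real model ids.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FactRow:
    conversation_id: str
    task_id: str
    interaction_mode: str
    confidence: float
    model_version: str   # which model produced the label
    prompt_version: str  # which prompt revision produced the label


def share_by_generation(rows, task_id):
    """Share of rows labeled task_id within each (model, prompt) generation."""
    totals, hits = Counter(), Counter()
    for r in rows:
        gen = (r.model_version, r.prompt_version)
        totals[gen] += 1
        if r.task_id == task_id:
            hits[gen] += 1
    return {gen: hits[gen] / totals[gen] for gen in totals}


rows = [
    FactRow("c1", "T001", "augmentation", 0.9, "model-a", "p1"),
    FactRow("c2", "T002", "automation", 0.8, "model-a", "p1"),
    FactRow("c3", "T001", "augmentation", 0.9, "model-b", "p1"),
    FactRow("c4", "T001", "automation", 0.7, "model-b", "p1"),
]
shares = share_by_generation(rows, "T001")
```

If a task's share jumps exactly where the pipeline generation changes, as it does in this toy data, that points at the classifier rather than at user behavior. Re-labeling a fixed sample with both generations is one way to confirm it.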

Common pitfalls in the patterns

Open vocabularies. Letting the model name its own categories feels flexible and destroys comparability. Always inject a closed list.
Parsing prose instead of using tools. Free-text parsing breaks the moment the model phrases things differently. Use tool-use schemas.
Classifying full transcripts. You pay for tokens you do not need and risk leaking content. Summarize first.
Coupling task and mode. Deriving the interaction mode from the task throws away the most valuable signal. Judge them separately.
No eval set. Without frozen gold labels, you will not notice drift until a stakeholder asks why the numbers moved. Build the harness on day one.

Apply these patterns in 5 steps

1. Write the closed-vocabulary system prompt with an "unclassified" escape and a confidence field.
2. Register a tool/schema for the judgment so output is validated, not parsed.
3. Add a summarize-then-classify pass to cap context cost and strip PII.
4. Compute task and mode as separate judgments over the same cached summary.
5. Stand up a frozen gold eval set with per-label precision and recall, run on every change.

Pattern selection guide

Problem	Pattern	Why
Labels drift over time	Closed vocabulary + eval set	Comparable, regression-caught
Malformed outputs	Tool schema enforcement	Validated, retryable
High token cost	Summarize-then-classify	Caps per-item context
Weak mode signal	Separate judgments	Orthogonal, testable

Adopt them in that order of leverage. The closed vocabulary and the eval set together prevent the most expensive failure mode — quietly wrong numbers — and everything else is optimization on top.

Frequently asked questions

Why is a closed vocabulary so important?

Because comparability depends on stable labels. If the model can invent categories, your counts mean something different every run and trend lines become noise. A closed list with an "unclassified" escape keeps precision high and aggregates meaningful.

Should I use tool-use or JSON mode for the schema?

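Tool-use gives you the stronger contract. Asking for JSON in the prompt, as in Pattern 1, still leaves you parsing text that may be malformed. With a tool, the model fills a declared input schema, and any response that fails validation is a failure you can retry. It is the more robust choice for high-volume classification.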

How big should my eval set be?

A few hundred carefully labeled conversations is usually enough to catch regressions, provided they cover every label. Track per-label precision and recall, not just overall accuracy, so you spot a single category going wrong.

Can one prompt do both task and mode labels?

Yes, and a shared tool schema is a clean way to do it. Just keep the two judgments conceptually separate so you can evaluate and retune them independently.

Patterns in production on voice and chat

CallSphere uses these same agentic-AI classification patterns across voice and chat assistants that answer every call, invoke tools mid-conversation, and book work 24/7 — all measured with closed-vocabulary, schema-enforced labeling. See it at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.