Build an Economic Index-Style Pipeline With Claude
Step-by-step: classify agent conversations against a task taxonomy with Claude, label augment vs. automate, and aggregate safely — like the Anthropic Economic Index.
Reading the Anthropic Economic Index sparks an obvious engineering question: could I build the same thing for my own agent traffic? You have thousands of Claude conversations flowing through your product, and somewhere in them is a story about which tasks your users actually delegate, which they collaborate on, and where the value concentrates. The Index proves the method works at scale. This walkthrough shows you how to stand up a small version of that pipeline end to end, with code you can run today.
We will build it in concrete stages: ingest a conversation, classify it against a task taxonomy with Claude, attach an augmentation/automation label, push everything through an aggregation step that protects privacy, and emit a roll-up. By the end you will have a working skeleton and a clear map of where to harden it for production.
Key takeaways
- You can replicate the Index method on your own agent logs without exposing raw conversations to anyone.
- Claude acts as the classifier — give it a closed taxonomy and force structured JSON output.
- The augment-vs-automate label is a second, separate judgment, not a byproduct of classification.
- Privacy is enforced by thresholding small buckets before anything is written to a human-visible table.
- One fine-grained fact table feeds every report, so new questions never re-run classification.
Step 1: Define your task taxonomy
Before any code, decide what you are classifying into. The Index uses O*NET; for a product you usually want your own catalog of the tasks your agent performs. Keep it closed and finite — twenty to two hundred labels — because a closed set is what makes counts comparable. A snippet of a taxonomy for a support agent might be:
TASKS = {
    "T01": "Diagnose a failing integration",
    "T02": "Draft a customer-facing reply",
    "T03": "Reconcile a billing discrepancy",
    "T04": "Summarize a long ticket thread",
    "T05": "Escalate to a human specialist",
}
The discipline here is that the model must pick from this set or return a sentinel "unclassified" — it never invents a category. That single rule is what keeps your aggregates honest three months later when you compare them across time.
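The closed-set rule can be enforced in code as well as in the prompt. The following is a minimal sketch, assuming the TASKS dictionary above; `normalize_task_id` and `UNCLASSIFIED` are hypothetical names introduced here for illustration.

```python
# Closed taxonomy from the article, plus a sentinel for anything that does not fit.
TASKS = {
    "T01": "Diagnose a failing integration",
    "T02": "Draft a customer-facing reply",
    "T03": "Reconcile a billing discrepancy",
    "T04": "Summarize a long ticket thread",
    "T05": "Escalate to a human specialist",
}
UNCLASSIFIED = "unclassified"  # hypothetical constant name


def normalize_task_id(raw_label: object) -> str:
    """Return a known task_id, or the sentinel if the label is outside the set."""
    if isinstance(raw_label, str):
        candidate = raw_label.strip().upper()
        if candidate in TASKS:
            return candidate
    # Invented categories, wrong types, and the sentinel itself all land here.
    return UNCLASSIFIED
```

Running every model answer through a check like this means an invented label can never reach the fact table, whatever the prompt allowed.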
Step 2: Classify each conversation with Claude
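Now feed a redacted conversation summary to Claude and demand a structured judgment. The simplest version prompts for JSON only, as in the request body below; forcing a tool call with a fixed input schema is stricter and keeps the output machine-parseable. The request body looks like this: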
POST /v1/messages
{
    "model": "claude-haiku-4-5",
    "max_tokens": 300,
    "system": "You are a labeler. Pick exactly one task_id from the provided list or 'unclassified'. Return only a JSON object with the keys task_id and confidence (a number from 0 to 1).",
    "messages": [{
        "role": "user",
        "content": "TASKS: {...}\n\nCONVERSATION SUMMARY: user asked the agent to find why a webhook stopped firing; agent inspected logs and proposed a fix the user then applied."
    }]
}
Use a fast, cheap model here. Haiku 4.5 is a good fit because classification is high-volume and the schema is tight. The expected response is a single object: {"task_id":"T01","confidence":0.88}. Drop anything below a confidence floor (the flow in Step 3 uses 0.6) so noise never reaches your tallies.
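One way to make the JSON contract stricter is to force a tool call and validate what comes back. The sketch below only builds the request body and parses a response dictionary; it makes no network call. The tool name `record_task` and both function names are assumptions made for illustration, and the response shape assumed is a `content` list containing a `tool_use` block with an `input` object.

```python
import json
from typing import Optional, Tuple

CONFIDENCE_FLOOR = 0.6  # same floor as the flowchart in Step 3
TOOL_NAME = "record_task"  # hypothetical tool name


def build_classify_request(tasks: dict, summary: str) -> dict:
    """Build a POST /v1/messages body that forces one structured tool call."""
    return {
        "model": "claude-haiku-4-5",
        "max_tokens": 300,
        "system": "You are a labeler. Pick exactly one task_id from the "
                  "provided list or 'unclassified'.",
        "tools": [{
            "name": TOOL_NAME,
            "description": "Record the single best task_id for the conversation.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "task_id": {"type": "string",
                                "enum": sorted(tasks) + ["unclassified"]},
                    "confidence": {"type": "number"},
                },
                "required": ["task_id", "confidence"],
            },
        }],
        # Forcing the tool means the reply carries a structured input object.
        "tool_choice": {"type": "tool", "name": TOOL_NAME},
        "messages": [{
            "role": "user",
            "content": f"TASKS: {json.dumps(tasks)}\n\nCONVERSATION SUMMARY: {summary}",
        }],
    }


def parse_classification(response: dict, tasks: dict,
                         floor: float = CONFIDENCE_FLOOR) -> Optional[Tuple[str, float]]:
    """Return (task_id, confidence), or None if invalid or below the floor."""
    for block in response.get("content", []):
        if block.get("type") != "tool_use" or block.get("name") != TOOL_NAME:
            continue
        data = block.get("input", {})
        task_id, confidence = data.get("task_id"), data.get("confidence")
        if task_id != "unclassified" and task_id not in tasks:
            return None  # the model strayed outside the closed set
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            return None
        if confidence < floor:
            return None  # below the floor: drop it
        return task_id, float(confidence)
    return None
```

Treat the confidence value as a rough filter rather than a calibrated probability; spot-check a sample before trusting any particular floor.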
Step 3: Wire the stages together
With classification working, the rest is plumbing. The flow below is the runtime shape of the pipeline — each conversation passes through redaction, classification, mode-labeling, and aggregation, and only aggregates ever land in the reporting store.
flowchart TD
    A["Raw conversation"] --> B["Redact PII"]
    B --> C["Claude: classify task_id"]
    C --> D{"confidence >= 0.6?"}
    D -->|No| E["Drop"]
    D -->|Yes| F["Claude: augment vs automate"]
    F --> G["Append to fact table"]
    G --> H["Nightly aggregate + threshold"]
    H --> I["Reporting view"]
Notice that two separate Claude calls appear: one for the task, one for the interaction mode. You could fuse them into a single prompt to save tokens, but keeping them separate makes each judgment easier to evaluate and re-run independently when you tune prompts. For a first build, separate is clearer.
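The per-conversation part of that flow can be sketched as one function with the stages passed in as callables. This is an illustrative skeleton, not a finished implementation: `Fact`, `process_conversation`, and the three stage parameters are hypothetical names, and the two Claude calls are assumed to be wrapped by `classify` and `label_mode`.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    """One fine-grained row: nothing identifiable, only date and two labels."""
    date: str
    task_id: str
    interaction_mode: str


def process_conversation(
    raw_text: str,
    date: str,
    redact: Callable[[str], str],
    classify: Callable[[str], Tuple[str, float]],
    label_mode: Callable[[str], str],
    floor: float = 0.6,
) -> Optional[Fact]:
    """Run redact -> classify -> confidence gate -> mode label for one conversation."""
    summary = redact(raw_text)              # PII is stripped before any model call
    task_id, confidence = classify(summary)  # first Claude call
    if confidence < floor:
        return None                          # dropped; the second call never runs
    mode = label_mode(summary)               # second Claude call
    return Fact(date=date, task_id=task_id, interaction_mode=mode)
```

Passing the stages in as arguments keeps each one replaceable by a stub, which is what makes the two judgments easy to evaluate separately.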
Step 4: Add the augment-vs-automate label
This second judgment is what makes the output interesting. A conversation where the user delegates a whole task and accepts the result is automation; one where they iterate, correct, and co-author is augmentation. Prompt Claude with a crisp definition and two or three examples, and return a single enum:
{
    "interaction_mode": "augmentation",
    "rationale": "user revised the agent's draft twice before sending"
}
Keep the rationale short and never persist it to a human-visible table — it can leak content. Store only the enum in your fact rows; the rationale exists for spot-checking during development and should be discarded after.
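A small parser can enforce the "enum only" rule by construction. The sketch below assumes the model's reply arrives as a JSON string shaped like the example above; `extract_mode` and `VALID_MODES` are hypothetical names.

```python
import json
from typing import Optional

VALID_MODES = {"augmentation", "automation"}


def extract_mode(raw_json: str) -> Optional[str]:
    """Return only the interaction_mode enum; the rationale is never returned."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    mode = payload.get("interaction_mode")
    if isinstance(mode, str) and mode in VALID_MODES:
        return mode
    return None  # unknown or missing value: treat as unlabeled
```

Because the function returns a bare string, the free-text rationale has no path into the fact row unless someone adds one deliberately.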
Step 5: Aggregate behind a privacy threshold
The fact table holds one row per conversation: (date, task_id, interaction_mode) and nothing identifiable. The nightly job groups by task and mode, counts each group, and, critically, suppresses any group with fewer than k rows (k=20 is a reasonable floor). Only the counts that survive suppression are written to the view your team queries. The privacy guarantee lives in this step, so test it first and test it hard.
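The thresholding step itself is small. Here is a minimal in-memory sketch, assuming facts are (date, task_id, interaction_mode) tuples; in practice this would more likely be a GROUP BY with a HAVING clause in your warehouse, and `aggregate_with_threshold` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

FactRow = Tuple[str, str, str]  # (date, task_id, interaction_mode)


def aggregate_with_threshold(
    facts: Iterable[FactRow], k: int = 20
) -> Dict[Tuple[str, str], int]:
    """Count rows per (task_id, interaction_mode) and suppress groups smaller than k."""
    counts = Counter((task_id, mode) for _date, task_id, mode in facts)
    # Groups with fewer than k rows are dropped entirely, not reported as small numbers.
    return {group: n for group, n in counts.items() if n >= k}
```

A test worth writing first: feed in one group just under k and one at exactly k, and check that only the second appears in the output.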
From here, every report is a roll-up. Map each task_id to a higher-level category and sum to get department-level views. Compute the augmentation ratio per task as a single division. Because you stored facts at the finest grain, none of these views require touching Claude again.
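Both roll-ups can be sketched over the thresholded counts. The category mapping passed to `rollup_by_category` is a hypothetical example, as are the function names.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

Counts = Dict[Tuple[str, str], int]  # (task_id, interaction_mode) -> row count


def rollup_by_category(counts: Counts, category_of: Dict[str, str]) -> Dict[str, int]:
    """Sum counts into higher-level categories; unmapped tasks go to 'Other'."""
    totals: Dict[str, int] = defaultdict(int)
    for (task_id, _mode), n in counts.items():
        totals[category_of.get(task_id, "Other")] += n
    return dict(totals)


def augmentation_ratio(counts: Counts, task_id: str) -> Optional[float]:
    """Share of a task's conversations labeled augmentation, or None if no data."""
    aug = counts.get((task_id, "augmentation"), 0)
    auto = counts.get((task_id, "automation"), 0)
    total = aug + auto
    return aug / total if total else None
```

One caveat with this sketch: a mode that was suppressed by the threshold is treated as zero here, so ratios for tasks with a small minority mode will be skewed. Thresholding on the task total instead is one possible alternative.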
Scaling from hundreds to millions
The skeleton above runs fine on a laptop against a few hundred conversations, but the Index operates at a vastly larger scale, and a few changes make your version scale with it. The first is concurrency: instead of one API call per conversation in a tight sequential loop, run many classification calls in parallel while respecting your rate limits. Classification is embarrassingly parallel because each conversation is independent, so throughput is bounded by your concurrency settings, rate limits, and budget, not by any ordering constraint.
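Bounded concurrency can be sketched with the standard library's thread pool. This assumes `classify_one` is your own function wrapping a single API call; `classify_many` is a hypothetical name, and `max_workers` should be set from your actual rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def classify_many(
    summaries: Iterable[str],
    classify_one: Callable[[str], T],
    max_workers: int = 8,
) -> List[Optional[T]]:
    """Classify summaries concurrently; results come back in input order."""

    def _safe(summary: str) -> Optional[T]:
        # One failed call should not abort the whole run; None marks it for retry.
        try:
            return classify_one(summary)
        except Exception:
            return None

    # max_workers caps in-flight requests, which is the simplest rate-limit guard.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_safe, summaries))
```

Threads suit this workload because each worker spends nearly all its time waiting on the network rather than computing.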
The second change is decoupling ingestion from classification with a queue. Conversations land on a queue as they finish; a pool of workers pulls them, classifies, and appends facts. This isolates spikes (a busy hour fills the queue rather than overwhelming the model) and it makes retries trivial, since a failed message simply goes back on the queue. Crucially, if each write is idempotent on a stable, opaque conversation id (kept as a deduplication key and never exposed in the reporting view), a re-processed message never double-counts, which is exactly the property you need when workers can crash mid-flight.
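Idempotent writes can be illustrated with SQLite, using a primary key on the stable id so a redelivered message becomes a no-op. The table layout, the `conversation_key` column, and both function names are assumptions for this sketch; a production store would differ.

```python
import sqlite3


def open_fact_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a fact table keyed on an opaque conversation key."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS facts (
               conversation_key TEXT PRIMARY KEY,  -- opaque dedup key, not user data
               date TEXT NOT NULL,
               task_id TEXT NOT NULL,
               interaction_mode TEXT NOT NULL
           )"""
    )
    return conn


def append_fact(conn: sqlite3.Connection, conversation_key: str,
                date: str, task_id: str, interaction_mode: str) -> bool:
    """Insert one fact; return False if this key was already written."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO facts VALUES (?, ?, ?, ?)",
        (conversation_key, date, task_id, interaction_mode),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 means the row existed and the insert was ignored
```

With this in place a worker can crash after writing but before acknowledging the message, and the retry leaves the counts unchanged.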
The third change is cost observability. At volume, classification spend is real money, so track tokens per conversation and tasks-classified-per-dollar as first-class metrics. If a prompt change quietly doubles input tokens, you want a dashboard to catch it the same day, not a surprise invoice at month end. Cheap models plus tight summaries keep this number small, but only measurement keeps it honest.
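Both metrics fall out of the token usage reported with each response. The sketch below assumes each usage record is a dictionary with `input_tokens` and `output_tokens`; prices are passed in as parameters because they change, and `cost_metrics` is a hypothetical name.

```python
from typing import Dict, List, Optional


def cost_metrics(
    usages: List[Dict[str, int]],
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> Dict[str, Optional[float]]:
    """Compute tokens per conversation and conversations classified per dollar."""
    n = len(usages)
    input_tokens = sum(u["input_tokens"] for u in usages)
    output_tokens = sum(u["output_tokens"] for u in usages)
    # Prices are quoted per million tokens, so scale accordingly.
    dollars = (input_tokens * input_price_per_mtok
               + output_tokens * output_price_per_mtok) / 1_000_000
    return {
        "tokens_per_conversation": (input_tokens + output_tokens) / n if n else None,
        "classified_per_dollar": n / dollars if dollars else None,
        "total_dollars": dollars,
    }
```

Emitting these numbers per prompt version makes a doubling of input tokens show up as a step change on the day it ships.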
Common pitfalls in the build
- Letting the model invent labels. If you do not force a closed set, your taxonomy drifts and month-over-month comparisons become meaningless. Always include an "unclassified" escape hatch instead.
- Skipping redaction before classification. If raw PII reaches the model prompt or your logs, your privacy story collapses. Redact first, classify second.
- Persisting rationales. The free-text rationale is useful in dev and dangerous in prod. Keep it ephemeral.
- Aggregating without a threshold. A group of two conversations can re-identify a user. Suppress small buckets every single run.
- Using your biggest model for classification. This is a volume task with a tight schema — Haiku-class models are faster and cheaper and rarely less accurate at picking from a closed list.
Ship it in 6 steps
- Write your closed task taxonomy as a versioned config file.
- Build a redaction pass that strips names, emails, and IDs from conversation summaries.
- Stand up the classification call with strict JSON output and a confidence floor.
- Add the second call for the augment/automate label, persisting only the enum.
- Write each result as one fine-grained row to an append-only fact table.
- Run a nightly aggregate that thresholds small groups and populates the reporting view.
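The redaction pass in the second step might start as something like the following. It is a deliberately naive sketch: the regular expressions are illustrative assumptions that will miss many real-world formats, and names are only caught if you supply them via `known_names`. A production pass would need a dedicated PII detection tool.

```python
import re
from typing import Iterable

# Illustrative patterns only; real email and ID formats vary widely.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER_RE = re.compile(r"\b\d{6,}\b")  # assumes IDs are runs of 6+ digits


def redact(text: str, known_names: Iterable[str] = ()) -> str:
    """Replace emails, long numeric IDs, and supplied names with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = LONG_NUMBER_RE.sub("[ID]", text)
    for name in known_names:
        # Names come from your own records (e.g. the account holder on the ticket).
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text, flags=re.IGNORECASE)
    return text
```

For example, `redact("Jane Doe (jane.doe@example.com) asked about account 12345678", ["Jane Doe"])` yields `[NAME] ([EMAIL]) asked about account [ID]`.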
Single call vs. two calls
| Aspect | One fused call | Two separate calls |
|---|---|---|
| Token cost | Lower | Higher |
| Eval & tuning | Harder to isolate | Each judgment testable |
| Re-run a label | Re-runs both | Re-run just one |
| Best for | Mature, cost-sensitive | First build, iterating |
Start with two calls while you are still tuning, then fuse them once the prompts stabilize and cost matters more than flexibility.
Frequently asked questions
Which Claude model should I use for classification?
A fast, inexpensive model like Haiku 4.5. Classifying into a closed list with a strict schema is a high-volume, low-reasoning task, so paying for a frontier model rarely improves accuracy and dramatically raises cost.
How do I keep this from leaking user content?
Redact before the model ever sees the text, store only enums and IDs in your fact table, discard free-text rationales, and threshold small groups before anything reaches a human-visible report.
Do I need O*NET specifically?
Only if you want labor-market comparability like the official Index. For internal product analytics, your own closed task catalog is usually more actionable and easier to maintain.
How often should I re-run classification?
Classify each conversation once as it arrives and append the fact. Reports are roll-ups over the existing facts, so you only re-classify when you change the taxonomy or prompt.
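One way to know when a re-classification is due is to fingerprint the taxonomy and prompt together and store that value alongside each fact. This is an optional sketch, not something the pipeline above requires; `classifier_version` is a hypothetical helper.

```python
import hashlib
import json
from typing import Dict


def classifier_version(tasks: Dict[str, str], system_prompt: str) -> str:
    """Short, stable fingerprint of the taxonomy plus the prompt."""
    # sort_keys makes the fingerprint independent of dictionary ordering.
    payload = json.dumps({"tasks": tasks, "prompt": system_prompt}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

If the fingerprint on a stored fact differs from the current one, that row was labeled under an older taxonomy or prompt and is a candidate for re-running.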
Measuring agentic AI on live conversations
CallSphere runs these same agentic-AI patterns over voice and chat — assistants that handle every call, call tools mid-conversation, and book work around the clock, all measurable with a pipeline like the one above. See it at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.