Build a Claude Clinical Abstractor: Step-by-Step Guide
An engineer's walkthrough to build a Claude agent that abstracts source-attributed clinical data: schema, parsing, tool calls, quote verification, and evals.
You have read about evidence-bound abstraction in the abstract. Now let's build it. This is a concrete, do-it-in-order walkthrough for an engineer who wants Claude to read a clinical note and return structured, source-attributed fields you can actually trust. I will assume you are using the Claude Agent SDK or a direct API integration, and I will keep the example small enough to follow but real enough to extend.
The target: given a discharge summary, produce a record with principal diagnosis, comorbidities, and procedures — each value tied to the exact text that justifies it. We will build it in seven steps, and each step produces something you can run before moving on.
Step 1: Pin down the schema first
Do not start with a prompt. Start with the output schema, because in an abstraction system the schema is the spec. Define each element as an object that demands evidence: { "value": string, "icd10_code": string, "evidence_quote": string, "document_id": string, "confidence": number }. Notice there is no way to express a value without a quote. That single constraint does more for accuracy than any clever wording you could add later.
Write the schema as a JSON Schema object or a tool input schema, depending on how you will enforce it. Keep it strict: required fields, enumerated code systems, bounded confidence. You will hand this exact schema to Claude as the contract for a structured-output tool, so spend real time here.
Step 2: Ingest and section the note
Clinical notes have structure even when they look like a wall of text — Chief Complaint, History of Present Illness, Assessment and Plan, Discharge Diagnoses. Write a parser that splits the note into labeled sections and assigns each a stable id. You do not need machine learning for this; section headers are largely regular. Store each section with its id, label, and character offsets so you can map evidence quotes back to precise locations later.
The payoff comes at attribution time. When Claude cites a quote, you verify it against the parsed sections and recover its exact offset. If the quote does not appear verbatim in any section, you reject the value automatically — a cheap and powerful hallucination guard.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart TD
A["Discharge summary"] --> B["Parse into labeled sections"]
B --> C["Build per-element retrieval set"]
C --> D["Call Claude with schema tool"]
D --> E{"Quote found verbatim?"}
E -->|No| F["Reject & flag for review"]
E -->|Yes| G["Validate code & confidence"]
G --> H["Append to structured record"]Step 3: Assemble a focused context per element
For each element you want — say, principal diagnosis — select the sections most likely to contain it. The Assessment and Plan and the Discharge Diagnoses sections carry the diagnosis; you do not need the social history. Concatenate those sections, each prefixed with its section id, into a compact context block. This is lightweight retrieval, and for a single document it can be rule-based rather than vector-based. The discipline is the same as full RAG: give the model the smallest sufficient slice.
Keep the section ids visible in the context. When you ask Claude to cite evidence, you want it to reference real ids you can resolve, not invent locations. Visible ids make the citation step concrete.
Step 4: Define the abstraction as a tool call
Now wire the schema from Step 1 into a Claude tool definition. The tool — call it record_diagnosis — takes exactly the structured fields you defined. By forcing Claude to call this tool rather than free-text its answer, you get validated JSON and a natural place to reject malformed output. In your system prompt, state the abstractor's job and rules plainly: extract only what the text supports, quote the supporting sentence verbatim, and set low confidence when the documentation is ambiguous.
Resist the urge to overload the prompt with edge cases. Put durable abstraction rules — coding definitions, tie-breaking policy — into an Agent Skill or a referenced rulebook that Claude loads when relevant. The prompt stays short and the rules stay maintainable. We will lean on Claude Sonnet 4.6 for the routine extractions and reserve Opus 4.8 for elements your eval shows are error-prone.
Step 5: Verify every quote before you trust the value
This is the step engineers skip and then wonder why fabrication slips through. When Claude returns a value with an evidence_quote, do not take it on faith. Search for that quote in the parsed sections. If it is not present verbatim — allowing for trivial whitespace normalization — discard the value and route the element to human review. A model that paraphrased its evidence is a model that may have reasoned from text that does not exist.
This verification is deterministic code, not another model call, which keeps it fast and trustworthy. It is also where you recover the precise character offset for the quote, so your final record links each field to a clickable location in the source document.
Step 6: Score confidence and set the review threshold
Combine two signals into a final confidence: Claude's self-reported number and rule-based checks. If two retrieved sections disagree, demote confidence. If the cited code does not match the value's expected code family, demote. Then pick a threshold — start conservative, maybe 0.85 — above which the value auto-commits and below which a human reviews it. Calibrate the threshold against a labeled set rather than guessing; the right number depends entirely on your tolerance for error versus review cost.
Present low-confidence items to reviewers with the proposed value and the evidence quote already highlighted. Reviewers confirm or correct in seconds instead of re-reading the chart. Every correction is labeled data you fold back into your eval suite.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
Step 7: Wrap it in an eval loop before you scale
Hold out a set of charts abstracted by humans and measure your agent against them: per-element accuracy, attribution validity (did the cited quote actually support the value), and the false-confident rate (high-confidence values that were wrong). The false-confident rate is the one that hurts in production, so watch it closely. Gate deployments on these metrics — no new prompt, rulebook, or model change ships unless the eval holds or improves.
With the loop in place you can iterate safely: tweak the rulebook, try routing more elements to Haiku for cost, swap a hard element up to Opus, and let the eval tell you whether each change was a win. That feedback loop is what turns a demo into a system.
Frequently asked questions
How long does it take to build a first working version?
A focused prototype handling a handful of element types is a few days of work — most of it spent on the schema, the section parser, and the quote-verification step rather than on prompting. The agent loop itself is small.
Should I extract all elements in one call or one per element?
For a handful of related elements, one call with a clear schema is efficient. As the element list grows or the reasoning per element diverges, split into focused calls so each gets a tight context and clean attribution. The smaller payloads also keep token use sane.
What model should I start with?
Start the whole pipeline on Claude Sonnet 4.6 to establish a baseline, then use your eval to find the elements where it slips and route only those to Opus 4.8. Push trivially structured fields down to Haiku 4.5 to cut cost.
How do I handle notes that genuinely lack the information?
Make "not documented" a valid, first-class outcome with its own low-confidence-but-clear status. Forcing a value when the chart is silent is how you manufacture errors; the abstractor's honest answer is sometimes that the element is absent.
Bringing agentic reasoning to your phone lines
The build pattern here — strict schema, verified evidence, confidence-gated review — is exactly what makes an AI safe enough for real-world stakes. CallSphere brings the same agentic-AI discipline to voice and chat, with assistants that pull the right data mid-call, act on it, and book work 24/7. See it working at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.