---
title: "Claude Clinical Abstraction: A Full Use-Case Walkthrough"
description: "An end-to-end build of a Claude agent that abstracts oncology pathology reports — gold set, MCP, skills, eval gate, and human loop, step by step."
canonical: https://callsphere.ai/blog/claude-clinical-abstraction-a-full-use-case-walkthrough
category: "Agentic AI"
tags: ["agentic ai", "claude", "clinical abstraction", "use case", "mcp", "agent skills", "healthcare ai"]
author: "CallSphere Team"
published: 2026-04-08T17:46:22.000Z
updated: 2026-06-06T21:47:43.789Z
---

# Claude Clinical Abstraction: A Full Use-Case Walkthrough

> An end-to-end build of a Claude agent that abstracts oncology pathology reports — gold set, MCP, skills, eval gate, and human loop, step by step.

Let me walk through one concrete project end to end, because the abstract advice about evals and grounding only makes sense once you see it applied to a real, messy problem. The setup: a hospital cancer registry receives thousands of pathology reports a year, and a small team of certified abstractors manually pulls primary site, histology, stage, and biomarker status from each one. The backlog runs weeks long. The goal was to use Claude as a first-pass abstractor so humans review and correct rather than start from a blank field. This is the kind of narrow, high-value, well-bounded task where agentic AI earns its keep.

What follows is the actual sequence — problem framing, data plumbing, the agent build, the eval gate, the human loop, and what shipping looked like. Nothing here is hypothetical architecture; it is the order of operations that repeatedly works, and the places where teams trip if they skip a step.

## Step one: frame the task so success is checkable

The project did not start with a prompt. It started with a spreadsheet. The team took 300 already-abstracted reports — records where certified humans had produced the structured answer — and treated those human answers as the gold standard. This gold set is the most important artifact in the entire project, because it converts "is Claude good at this?" from an opinion into a measurement. Without it, every conversation about quality dissolves into anecdote.

Framing also meant narrowing scope ruthlessly. The first version abstracted only four fields, not the forty the registry ultimately needs. Four fields meant the team could build a tight eval, get a real signal fast, and learn the failure modes before scaling surface area. The lesson generalizes: pick the smallest slice that is still genuinely useful, prove it, then widen. Teams that try to abstract everything at once never get a clean read on whether anything works.

## Step two: plumb the data with MCP and de-identification

With the task framed, the next job was getting reports to Claude safely. The reports lived in a document store, so the team stood up a Model Context Protocol server that exposed a single scoped tool: fetch a specific report by ID and return its text sections. Crucially, the MCP layer ran de-identification first, stripping direct identifiers before any text reached the model or the logs. The agent never needed the patient's name to abstract a histology code, so it never saw it. Least privilege, applied to the model itself.

```mermaid
flowchart TD
  A["Pathology report in doc store"] --> B["MCP server: de-identify & fetch"]
  B --> C["Claude + abstraction skill"]
  C --> D["Structured fields + source citations"]
  D --> E{"Eval gate vs gold set"}
  E -->|Below bar| F["Fix skill, re-run"]
  E -->|Passes bar| G["Human abstractor reviews flagged cases"]
  G --> H["Approved record to registry"]
  F --> C
```

The diagram captures the shape of the shipped system, but the order it was built in matters as much as the topology. The MCP server and de-identification came before any serious prompting, because nobody wanted PHI sloshing through experimental logs. Getting the data path clean and auditable first meant every later experiment was safe to run freely, which dramatically sped up iteration. Teams that bolt on privacy at the end instead pay for it with a frozen, fearful iteration loop.

## Step three: build the agent with a skill, not a mega-prompt

Rather than cram every abstraction rule into one giant prompt, the team wrote an Agent Skill: a structured set of instructions for how to read a pathology report, where each field typically appears, how to handle the clinical-versus-pathologic staging distinction, and how to treat certainty language. The skill also mandated the output contract — return each field with the exact source span that justifies it. This made the agent's reasoning auditable from day one, because every value pointed back at the words that produced it.

The early runs were instructive. Claude nailed primary site and histology almost immediately but stumbled on stage in exactly the neoadjuvant-therapy edge case predicted earlier — choosing clinical stage when the registry rule wanted pathologic. Because the gold set caught it and the citations showed the reasoning, the fix was surgical: a few precise lines added to the skill about post-treatment staging. No retraining, no model change. That tight loop between a failing eval case and a skill edit is what makes Claude-based abstraction practical to maintain.

## Step four: gate on evals, then design the human loop

The eval gate ran every change against the 300-report gold set and reported per-field agreement. The team set a rule: no version ships unless each field meets its agreement bar, and high-stakes fields like stage carry a higher bar than low-stakes comments. When a version passed, it did not go to full autonomy. It went to a human-in-the-loop workflow where abstractors saw Claude's pre-filled answer plus its citations and either approved or corrected each field. Pre-filled-and-reviewed is far faster than abstracting from scratch, which is where the actual time savings came from.

The human corrections were not thrown away. Each correction became a new labeled example, expanding the gold set and revealing patterns — if abstractors kept fixing the same field the same way, that was a signal to update the skill. Over a few cycles the override rate on the easy fields dropped to near zero, and humans concentrated their attention on the genuinely ambiguous cases. The system did not replace abstractors; it moved their effort from rote extraction to judgment, which is the only place human time is well spent.

## What shipping actually looked like

Production was deliberately boring. The agent ran as a batch job overnight, pre-filling the next day's queue so abstractors arrived to reviewed-ready records instead of raw reports. Monitoring tracked per-field agreement, override rate, and blocked-extraction rate, with alerts if any drifted. When the registry later changed a coding guideline, the override rate on the affected field ticked up within days, the team updated the skill, and the metric recovered — exactly the closed loop the design intended. The backlog that once ran weeks compressed substantially, not because Claude was perfect, but because it turned a blank-page task into a review task.

The transferable lesson is the sequence: gold set first, clean and de-identified data path second, skill-based agent third, eval gate fourth, human loop fifth, monitoring last. Each step earns the right to the next. Skip the gold set and you cannot measure; skip the citations and you cannot audit; skip the human loop and you cannot ship safely. Followed in order, a narrow clinical-abstraction agent goes from messy problem to trusted production workflow in a manageable number of cycles.

## Frequently asked questions

### How many fields should a first version abstract?

As few as are still genuinely useful — often just three or four high-value fields. A narrow scope lets you build a tight gold-standard eval, get a real accuracy signal quickly, and learn the failure modes before expanding. Trying to abstract dozens of fields at once makes it nearly impossible to tell what is working and slows the whole project down.

### Why use an Agent Skill instead of one large prompt?

A skill keeps abstraction rules structured, versioned, and editable, so fixing a specific failure means adding a few precise lines rather than rewriting a monolithic prompt. It also lets you encode the output contract — every field with its source citation — in one place. When an eval catches an error, a skill edit is a surgical, low-risk fix that does not disturb the rest of the agent's behavior.

### Where do the time savings actually come from?

From converting a blank-page task into a review task. Abstractors approving or correcting Claude's pre-filled, citation-backed answers work far faster than abstracting each report from scratch. The model does not need to be perfect; it needs to produce a good, auditable first pass so humans spend their time on judgment about ambiguous cases rather than on rote extraction.

### What happens when coding guidelines change after launch?

Monitoring catches it. When a guideline shifts, the human-override rate on the affected field rises within days. That signal triggers a skill update, you re-run the eval gate against the gold set, and the metric recovers. The closed loop of monitoring, skill edit, and re-evaluation is what keeps the system accurate as the real world changes underneath it.

## From abstracted records to answered calls

This same build sequence — measure first, ground every answer, keep a human in the loop — is what turns any agent from demo to dependable. CallSphere applies it to **voice and chat**, with assistants that answer every call, pull live data mid-conversation, and book work 24/7. See the full workflow at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/claude-clinical-abstraction-a-full-use-case-walkthrough