---
title: "Building a dispute-resolution agent with Claude"
description: "End-to-end walkthrough of shipping a verifiable card-dispute agent on Claude — from scoping to shadow mode to audited production rollout."
canonical: https://callsphere.ai/blog/building-a-dispute-resolution-agent-with-claude
category: "Agentic AI"
tags: ["agentic ai", "claude", "financial services", "use case", "dispute resolution", "verifiable ai", "agent sdk"]
author: "CallSphere Team"
published: 2026-04-30T17:46:22.000Z
updated: 2026-06-06T21:47:42.949Z
---

# Building a dispute-resolution agent with Claude

> End-to-end walkthrough of shipping a verifiable card-dispute agent on Claude — from scoping to shadow mode to audited production rollout.

Abstract advice about verifiable AI only goes so far. So let us build one, start to finish. The task: a card-dispute resolution agent for a mid-size issuer, built on Claude, that has to triage and act on customer disputes accurately enough to satisfy both the operations team drowning in volume and the compliance team that will be audited on every adverse decision. This is a real shape of problem in financial services, and walking through it concretely shows where the verifiability work actually lives — which is rarely where newcomers expect.

## Day one: the problem nobody scoped correctly

The stated problem was "automate dispute handling." The real problem, once we sat with the ops team, was narrower and more interesting. Of the disputes coming in, roughly two-thirds were routine and well-understood — duplicate charges, clear merchant errors, cancelled subscriptions still billing. The remaining third were genuinely ambiguous and carried regulatory weight: cases where a wrong decision means a Regulation E violation or an unhappy customer who escalates. The agent's job was not to decide everything. It was to confidently and correctly handle the routine two-thirds, and to recognize the hard third and route it to a human with a clean summary.

That reframing changed everything downstream. We were not building a decision engine; we were building a triage-and-act system whose most important skill was knowing what it did not know. The success of the project hinged on the agent's calibration far more than its raw accuracy.

## Designing the agent: tools, skills, and structure

We built it on the Claude Agent SDK with a deliberately small tool surface exposed through Model Context Protocol servers: a transaction-lookup tool, a customer-history tool, a merchant-reputation tool, and a single action tool — propose-resolution — that could not itself execute anything. Every consequential action went through proposal, never direct execution. The agent's resolutions were proposals that either auto-approved under a value threshold or queued for a human.

The dispute policy itself lived as an Agent Skill — a folder of the issuer's actual dispute-handling rules, reason-code definitions, and worked examples that Claude loaded when relevant. This mattered for verifiability: when the agent applied a rule, it cited the specific section of the policy skill, and we could trace any decision back to the exact rule text it relied on. The policy was version-controlled, so we always knew which version of the rules produced which decision.

```mermaid
flowchart TD
  A["Dispute arrives"] --> B["Agent pulls transaction & customer history"]
  B --> C{"Routine pattern matched?"}
  C -->|No| D["Summarize & escalate to human"]
  C -->|Yes| E["Apply policy rule, emit reason code"]
  E --> F{"Resolution value under threshold?"}
  F -->|No| D
  F -->|Yes| G["Auto-resolve & write audit record"]
  D --> H["Human decides, feedback to eval set"]
```

## The eval set that made it shippable

The artifact that turned this from a prototype into a production system was the evaluation set. We assembled 1,500 historical disputes with known correct outcomes, deliberately oversampling the hard and ambiguous cases. Every change to the prompt, the policy skill, or the tools ran against this set, and we tracked two numbers separately: accuracy on the routine cases the agent was allowed to auto-resolve, and — more importantly — the escalation precision, meaning how often a case the agent auto-resolved should actually have been escalated.

That second number was the real gate. A false auto-resolution — confidently handling a case that needed a human — was the failure we could not tolerate, because it is the one that becomes a compliance finding. We tuned the system to be aggressively cautious: when in doubt, escalate. Early versions escalated too much, which annoyed ops, but that was the safe direction to start and loosen from, exactly as risk discipline demands.

## Shadow mode: running without consequences

Before the agent touched a single real resolution, we ran it in shadow mode for three weeks. It processed every incoming dispute in parallel with the human team, produced its proposed resolution, and wrote it to a log — but its proposals did nothing. We then compared, every day, where the agent and the humans agreed and disagreed. Disagreements were gold: each one was either a real agent error we needed to fix, or a case where the agent was actually right and a human had been sloppy, or a genuinely ambiguous case where reasonable parties differ.

Shadow mode did two things at once. It built the evidence the model-risk committee needed to approve a real rollout — concrete agreement rates on real traffic, broken down by dispute type — and it grew the eval set, because every interesting disagreement became a new labeled test case. By the end of shadow mode we trusted the system not because it was clever but because we had watched it handle three weeks of real volume without authority to do harm.

## Rollout: narrow, then widen

We did not flip a switch. The first production deployment let the agent auto-resolve exactly one dispute category — clear duplicate charges under a low dollar threshold — and escalate everything else. That category was chosen because it was unambiguous, low-risk, and high-volume, so it delivered real relief to ops while exposing almost no blast radius. We watched it for two weeks, confirmed the audit trail was clean and the resolutions held up, then added the next category.

Each new category went through the same gate: strong eval performance, clean shadow-mode agreement, model-risk sign-off, narrow rollout, monitoring, widen. Three months in, the agent auto-resolved a large majority of routine disputes, the ops team spent its time on the genuinely hard cases plus auditing a sample of the agent's auto-resolutions, and the compliance team had a complete reconstructable record of every decision. That record — not the automation rate — was what made the project a success in the eyes of the people who could have killed it.

## What we would do differently

Two lessons stand out. First, we underinvested early in the audit-record schema and had to retrofit fields once compliance reviewed it; design the record format with the auditor in the room from day one. Second, we initially treated escalations as failures to minimize, which created pressure to auto-resolve marginal cases. Reframing escalation as a correct, valuable outcome — the agent doing its job of knowing its limits — fixed the incentive and made the whole system safer.

## Frequently asked questions

### How long did this realistically take to ship?

From first scoping to the first narrow production category was about ten weeks, with most of that time spent on the eval set and shadow run rather than building the agent itself. The agent logic was a small fraction of the effort; the verification scaffolding was the bulk of it, which is typical for finance.

### Why propose-and-approve instead of letting the agent act directly?

Because it makes blast radius a dial you control. Proposals let you run the full agent in production while keeping a human gate on anything risky, then progressively auto-approve the categories that have earned trust through evals and shadow data. Direct action removes that gradient and forces an all-or-nothing trust decision.

### What made the compliance team comfortable?

The reconstructable audit trail and the shadow-mode evidence. They could pull any decision and see the inputs, the policy version, the rule applied, the reason code, and the approval path. Verifiability was not a feature we added at the end; it was the spine the whole system was built around.

## Bringing agentic AI to your phone lines

The same triage-then-act pattern works when the customer is on the phone. CallSphere builds voice and chat agents that resolve the routine, escalate the hard, and log every step. See a live walkthrough at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*

---

Source: https://callsphere.ai/blog/building-a-dispute-resolution-agent-with-claude