---
title: "Build a Verifiable Claude Finance Agent: Walkthrough"
description: "A step-by-step guide to building a verifiable financial-services agent on Claude: deterministic tools, evidence capture, a verifier loop, and policy gating."
canonical: https://callsphere.ai/blog/build-a-verifiable-claude-finance-agent-walkthrough
category: "Agentic AI"
tags: ["agentic ai", "claude", "financial services", "implementation", "agent sdk", "verifiable ai", "tutorial"]
author: "CallSphere Team"
published: 2026-04-30T08:23:11.000Z
updated: 2026-06-06T21:47:42.883Z
---

# Build a Verifiable Claude Finance Agent: Walkthrough

> A step-by-step guide to building a verifiable financial-services agent on Claude: deterministic tools, evidence capture, a verifier loop, and policy gating.

Architecture diagrams are easy to admire and hard to build from. This post is the opposite: a concrete, ordered walkthrough an engineer can follow to stand up a verifiable financial-services agent on Claude. We'll build a small but real example — an assistant that answers retirement-account questions like "how much can I contribute this year given my income and what I've already put in?" — and we'll wire it so that every number it states is backed by a tool result and checked before it reaches the user. By the end you'll have a mental build order you can apply to any regulated-finance use case.

The principle that guides every step: **build the verification path before you build the conversation.** It is tempting to get Claude chatting first and bolt on safety later, but in finance the evidence trail is the product. We construct the tools and the ledger first, then let the model orchestrate over them.

## Step 1: Define the deterministic tools first

Start by listing every fact your agent will need and turning each into a tool with a precise schema. For our example we need three: `get_account_summary` (returns balance, year-to-date contributions, and account type for an authenticated user), `contribution_limit` (takes age, income, account type, and tax year; returns the legal limit and the rule version used), and `lookup_rule` (retrieves the relevant regulation text by topic and year). Each tool returns structured JSON plus a `source_id` field. That `source_id` is non-negotiable; it is what later proves where a number came from.

Write these as plain functions first and test them in isolation. The `contribution_limit` tool should be pure arithmetic over a versioned rules table — no model involvement, fully unit-tested against known cases. When you can call each tool from a script and get correct, identical results every time, you have the trustworthy foundation the agent will stand on. Resist the urge to let Claude compute the limit "to save a tool call." That shortcut is exactly the kind of unverifiable behavior we are designing out.
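As a rough illustration of a pure, versioned limit calculation, here is a minimal sketch. The rules table, the `example_ira` account type, the dollar amounts, and the rule that the limit cannot exceed income are all placeholder assumptions for the example, not real statutory values; a real table would be loaded from your rules system of record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitResult:
    limit: int
    rule_version: str
    source_id: str


# Hypothetical versioned rules table. Amounts are illustrative placeholders,
# NOT real contribution limits.
RULES_TABLE = {
    ("example_ira", 2030): {
        "rule_version": "example-2030.1",
        "base_limit": 6000,
        "catch_up_age": 50,
        "catch_up_amount": 1000,
    },
}


def contribution_limit(age: int, income: int, account_type: str, tax_year: int) -> LimitResult:
    """Pure arithmetic over the rules table: same inputs, same output, no model involved."""
    key = (account_type, tax_year)
    if key not in RULES_TABLE:
        # Fail loudly instead of guessing a limit for an unknown rule.
        raise KeyError(f"no rule for {account_type!r} in {tax_year}")
    rule = RULES_TABLE[key]
    cap = rule["base_limit"]
    if age >= rule["catch_up_age"]:
        cap += rule["catch_up_amount"]
    # Assumed rule for this sketch: the limit cannot exceed income.
    limit = min(cap, max(income, 0))
    return LimitResult(
        limit=limit,
        rule_version=rule["rule_version"],
        source_id=f"rules:{account_type}:{tax_year}:{rule['rule_version']}",
    )
```

Because the function is deterministic, unit tests can pin known cases (under the catch-up age, over it, income below the cap) and fail the build whenever the rules table changes without a matching test update.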

## Step 2: Register the tools with Claude and write the system prompt

Now expose the tools to Claude via the Agent SDK's tool-use interface (or behind MCP servers, which we cover in a companion post). The system prompt establishes the rules of engagement: the agent must use tools for every fact, must never state a number it did not receive from a tool, must cite the `source_id` for each claim, and must say "I don't have that" rather than guess. Be explicit that contribution limits, balances, and rule text come only from the corresponding tools.

Keep the prompt focused on behavior, not data. Do not paste tax tables into the prompt — that's what the tool is for, and inlined data goes stale and can't be cited. Instead, describe the workflow: identify what the user is asking, gather the needed facts via tools, then compose an answer that attributes each fact. A tight, behavior-focused system prompt plus well-described tools gets you most of the way to reliable orchestration.
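To make the split between behavior and data concrete, here is one possible shape for the prompt and tool definitions. It assumes the `name` / `description` / `input_schema` layout used for tool definitions in Anthropic's Messages API; the exact registration call depends on the SDK version you use, so treat the field wording and argument names as illustrative.

```python
# Behavior only: no tax tables, balances, or rule text are inlined here.
SYSTEM_PROMPT = """\
You answer retirement-account questions.
Rules:
1. Use tools for every fact. Never state a number you did not receive from a tool.
2. Cite the source_id of the tool result behind each claim.
3. If no tool result supports an answer, say "I don't have that" rather than guess.
Workflow: identify what the user is asking, gather the needed facts via tools,
then compose an answer that attributes each fact.
"""

# Tool definitions as JSON Schema. Argument names are assumptions for this sketch.
TOOLS = [
    {
        "name": "get_account_summary",
        "description": "Balance, year-to-date contributions, and account type "
                       "for the authenticated user. Returns a source_id.",
        # No arguments: identity comes from the authenticated session, not the model.
        "input_schema": {"type": "object", "properties": {}, "required": []},
    },
    {
        "name": "contribution_limit",
        "description": "Legal contribution limit and the rule version used. "
                       "Returns a source_id.",
        "input_schema": {
            "type": "object",
            "properties": {
                "age": {"type": "integer"},
                "income": {"type": "integer"},
                "account_type": {"type": "string"},
                "tax_year": {"type": "integer"},
            },
            "required": ["age", "income", "account_type", "tax_year"],
        },
    },
    {
        "name": "lookup_rule",
        "description": "Regulation text for a topic and tax year. Returns a source_id.",
        "input_schema": {
            "type": "object",
            "properties": {
                "topic": {"type": "string"},
                "tax_year": {"type": "integer"},
            },
            "required": ["topic", "tax_year"],
        },
    },
]
```

Note that `get_account_summary` takes no user identifier in this sketch, so the model cannot ask for someone else's account by changing an argument.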

```mermaid
flowchart TD
  A["User question"] --> B["Claude plans tool calls"]
  B --> C["get_account_summary"]
  B --> D["contribution_limit"]
  C --> E["Append to evidence ledger"]
  D --> E
  E --> F["Claude drafts cited answer"]
  F --> G{"Verifier checks each claim"}
  G -->|Unsupported| B
  G -->|All backed| H["Policy gate + deliver"]
```

## Step 3: Capture evidence on every tool return

Wrap your tool-execution layer so that every call appends to an evidence ledger before the result goes back to the model. The entry records the tool name, the arguments, the full response, the `source_id`, and a timestamp, and it returns a short `evidence_id` that travels with the data. When Claude later writes "your remaining contribution room is $4,500," it must tag the sentence with the `evidence_id` of each result the figure rests on: the `contribution_limit` result and, since remaining room depends on year-to-date contributions, the `get_account_summary` result as well. This tagging is the link the verifier will check.

Implement the ledger as append-only storage — even a single table works to start — keyed by a run ID so an entire conversation's evidence can be pulled together. The discipline here pays off twice: at runtime it enables verification, and afterward it is your audit log. Make sure the raw tool response is stored verbatim, not a summary, because a summary you wrote is one more place a fact could drift from its source.
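A minimal in-memory sketch of the ledger and the wrapper might look like the following. The class name, the `ev-0001` ID format, and the `call_with_evidence` helper are hypothetical; a production version would write to durable append-only storage rather than a Python dict.

```python
import json
import time
from typing import Any, Callable


class EvidenceLedger:
    """Append-only store for one run: entries can be added and read, never changed."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self._entries: dict[str, dict[str, Any]] = {}

    def append(self, tool: str, args: dict[str, Any], response: dict[str, Any]) -> str:
        if "source_id" not in response:
            # A result without provenance is rejected rather than stored.
            raise ValueError(f"{tool} returned no source_id")
        evidence_id = f"ev-{len(self._entries) + 1:04d}"
        self._entries[evidence_id] = {
            "run_id": self.run_id,
            "evidence_id": evidence_id,
            "tool": tool,
            "args": dict(args),
            # Store the raw response verbatim (serialized), not a summary.
            "response_raw": json.dumps(response, sort_keys=True),
            "source_id": response["source_id"],
            "timestamp": time.time(),
        }
        return evidence_id

    def get(self, evidence_id: str) -> dict[str, Any]:
        # Return a copy so callers cannot mutate the stored entry.
        return dict(self._entries[evidence_id])


def call_with_evidence(
    ledger: EvidenceLedger, tool_name: str, tool_fn: Callable[..., dict], **args: Any
) -> dict[str, Any]:
    """Run a tool, record the result, and attach the evidence_id before returning."""
    response = tool_fn(**args)
    evidence_id = ledger.append(tool_name, args, response)
    return {**response, "evidence_id": evidence_id}
```

The model only ever sees results that went through `call_with_evidence`, so every fact it can cite already has a ledger entry by the time it drafts an answer.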

## Step 4: Build the verifier as a separate pass

Once Claude produces a draft with tagged claims, run a verification pass that is independent of the drafting step. For each fact-bearing sentence, the verifier pulls the referenced ledger entry and confirms the claim is actually supported by that data. You can implement the check two ways and should use both: a structured check that the cited number literally matches the tool output, and a Claude-as-judge check that reads the sentence and the evidence and rules "supported" or "not supported." The structured check catches transcription errors; the judge catches subtler misstatements like applying a rule to the wrong account type.

When the verifier rejects a claim, send the agent back to either fetch missing evidence or revise the wording — do not let it ship the unsupported sentence. Cap the number of revision loops so a stubborn case fails closed to a human handoff rather than spinning. In practice a one- or two-iteration loop resolves the large majority of issues, and the failures that remain are exactly the ones you want a person to see.
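The structured half of the verifier and the capped revision loop can be sketched roughly as below. The claim shape (`text` plus `evidence_id`), the `draft_fn` callback, and the `deliver` / `handoff` statuses are assumptions made for the example; the Claude-as-judge check is not shown and would run alongside the structured check.

```python
import re
from decimal import Decimal

# Matches 4,500 or 4500 or 4500.25 (thousands separators are optional).
NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def numbers_in_text(text):
    return {Decimal(m.replace(",", "")) for m in NUMBER.findall(text)}


def numbers_in_evidence(value):
    """Collect every numeric value in a nested tool response."""
    if isinstance(value, bool):  # bool is an int subclass; it is not evidence of a number
        return set()
    if isinstance(value, (int, float)):
        return {Decimal(str(value))}
    if isinstance(value, dict):
        return set().union(*(numbers_in_evidence(v) for v in value.values()))
    if isinstance(value, list):
        return set().union(*(numbers_in_evidence(v) for v in value))
    return set()


def structured_check(claim, lookup):
    """True only if every number in the sentence appears in the cited evidence.

    Sentences with no numbers pass here and are left to the judge check.
    """
    evidence = lookup(claim["evidence_id"])
    if evidence is None:
        return False
    return numbers_in_text(claim["text"]) <= numbers_in_evidence(evidence)


def verify_with_retries(draft_fn, lookup, max_revisions=2):
    """Draft, verify, and revise; fail closed to a human after max_revisions."""
    unsupported = []
    for _ in range(max_revisions + 1):
        claims = draft_fn(unsupported)  # rejected claims are passed back as feedback
        unsupported = [c for c in claims if not structured_check(c, lookup)]
        if not unsupported:
            return {"status": "deliver", "claims": claims}
    return {"status": "handoff", "unsupported": unsupported}
```

A strict exact-match rule like this also rejects sentences that mention any number absent from the evidence, including years, which is usually the desired behavior: the fix is to cite evidence that contains the number or to drop it from the sentence.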

## Step 5: Gate on entitlements and disclosures, then deliver

The final step before delivery is the policy gate. Using the identity you established when the request arrived, confirm the user is entitled to every account referenced and attach any legally required disclosures — for instance, risk language whenever the answer edges toward a recommendation. The gate runs in code owned by compliance, not in the prompt, so it can be updated and audited independently of the agent's behavior. Log the gate's decision to the ledger so you can later prove what was shown and what was withheld.

Only after the answer passes verification and the gate does it reach the user. Wire up your handoff path now too: when verification fails repeatedly, when the gate blocks, or when the user asks something outside scope, the conversation routes cleanly to a human with the full evidence ledger attached. That context-rich handoff is what makes the human fast instead of frustrated, and it closes the loop on a system you can actually defend.
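As a sketch of what a code-level gate could look like, the function below checks entitlements and attaches a disclosure. The keyword cues and the disclosure text are placeholders invented for this example; a real gate would use whatever entitlement service and disclosure rules your compliance team owns.

```python
from dataclasses import dataclass, field


@dataclass
class GateDecision:
    allowed: bool
    reason: str
    disclosures: list[str] = field(default_factory=list)


# Placeholder cues and wording; a crude stand-in for compliance-owned rules.
RECOMMENDATION_CUES = ("should", "recommend", "consider")
RISK_DISCLOSURE = "This information is not investment advice."


def policy_gate(
    entitled_accounts: set[str], referenced_accounts: set[str], answer: str
) -> GateDecision:
    """Block answers that touch accounts the user is not entitled to; add disclosures."""
    unauthorized = referenced_accounts - entitled_accounts
    if unauthorized:
        return GateDecision(False, f"not entitled to: {sorted(unauthorized)}")
    disclosures = []
    lowered = answer.lower()
    if any(cue in lowered for cue in RECOMMENDATION_CUES):
        disclosures.append(RISK_DISCLOSURE)
    return GateDecision(True, "ok", disclosures)
```

Whatever the gate returns, allowed or blocked, the decision itself would be appended to the evidence ledger so the record shows what was delivered and what was withheld.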

## Frequently asked questions

### How long does it take to build a first working version?

A focused single-use-case agent like the contribution example is a few days of work for one engineer once the source-of-record APIs exist: a day for the deterministic tools and tests, a day to wire Claude and the ledger, and a day for the verifier and gate. The long pole is almost always clean access to the underlying account and rules data, not the model integration.

### Should the verifier use the same model as the agent?

It can, but run it as a separate, single-purpose call with only the claim and its cited evidence in context — not the whole conversation. The narrow framing makes the judgment more reliable and cheaper. Pairing the model judge with a structured exact-match check gives you defense in depth without much added latency.

### What do I do when a tool itself is wrong or unavailable?

Fail closed. If `get_account_summary` errors or returns stale data, the agent should tell the user it can't answer right now rather than fall back to a guess, and the incident should be logged. Because facts only enter through tools, a tool outage degrades the agent into honest unavailability instead of confident fabrication — which is exactly the failure mode you want in finance.
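One way to express fail-closed behavior in the tool layer is a small wrapper like the sketch below. The `as_of` timestamp field, the one-hour freshness window, and the `unavailable` status are assumptions for illustration; your source-of-record APIs may signal staleness differently.

```python
import logging
import time

logger = logging.getLogger("agent.tools")

UNAVAILABLE = {"status": "unavailable", "message": "I can't answer that right now."}


def call_fail_closed(tool_fn, *, max_age_seconds=3600.0, now=None, **args):
    """Return the tool result only if the call succeeded and the data is fresh."""
    now = time.time() if now is None else now
    name = getattr(tool_fn, "__name__", "tool")
    try:
        result = tool_fn(**args)
    except Exception:
        # Log the incident and refuse; never substitute a guess.
        logger.exception("tool %s failed", name)
        return dict(UNAVAILABLE)
    as_of = result.get("as_of")  # hypothetical freshness timestamp (epoch seconds)
    if as_of is None or now - as_of > max_age_seconds:
        logger.warning("tool %s returned stale or undated data", name)
        return dict(UNAVAILABLE)
    return {"status": "ok", "data": result}
```

The agent treats an `unavailable` status as a reason to tell the user it cannot answer, which keeps an outage from turning into a fabricated figure.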

## Bringing the same build pattern to live conversations

CallSphere ships this exact tool-first, verify-before-you-speak pattern into **voice and chat** agents that handle real customer calls — looking up accounts, checking policy, and booking next steps in real time. See a working version at [callsphere.ai](https://callsphere.ai).

---

*Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.*
