Skip to content
Agentic AI
Agentic AI8 min read0 views

Build a Claude Contract Review Agent: Full Walkthrough

Step-by-step guide to building a Claude contract-review agent: MCP setup, retrieval, prompts, the tool loop, governance, and shipping safely.

Reading about agent architecture is one thing; standing one up is another. This is the build log — a concrete, ordered walkthrough an engineer can follow to ship a Claude contract-review agent that pulls a clause, compares it to a firm playbook, and returns a sourced redline note. No hand-waving. Each step names what you build, why it exists, and what breaks if you skip it. By the end you have a working loop you can extend into discovery, due diligence, or intake.

The target is deliberately narrow: an agent that, given a contract and a matter, identifies risky clauses against a standard and explains the risk with citations. Narrow scope is the single best decision you can make on day one. A focused agent that does one thing reliably earns trust; a sprawling 'legal copilot' that does ten things poorly never ships past the pilot.

Step 1: Stand up the document store as an MCP server

Start with where the contracts live. Wrap your document store — whether that is S3, a DMS, or a Postgres table of clause chunks — behind a Model Context Protocol server. The server exposes a small, deliberate set of tools: get_document by ID, search_clauses by semantic query, and list_documents filtered by matter, type, and date. Keep each tool's input schema tight and typed. Claude reasons better against a handful of clear tools than against a sprawling, ambiguous API surface.

The critical detail here is what each tool returns. Every clause must come back with a stable identifier — document ID plus paragraph or section reference — alongside the text. This provenance is what makes the agent's output verifiable later. If your store cannot return a citable reference, fix that before writing a line of agent code, because retrofitting provenance into a finished agent is painful and you will be tempted to skip it.

Step 2: Build the retrieval layer with two paths

Now make the contracts findable. Index your clauses two ways. For semantic search, chunk each contract at the clause level — not by fixed token windows, which slice clauses in half — embed each chunk, and store the embeddings with their document references. For structured lookup, tag every document with type, parties, governing law, and effective date so the agent can fetch deterministically. Expose both through the MCP server you just built.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

When you chunk, preserve the clause heading and the enclosing section as metadata. A limitation-of-liability clause read in isolation is ambiguous; the same clause tagged with its section and the contract's definitions becomes interpretable. This metadata is also how Claude decides which retrieval path to use — it can see from a structured listing whether a deterministic fetch will answer the question before spending a semantic search.

flowchart TD
  A["Contract + matter in"] --> B["Chunk at clause level"]
  B --> C["Embed + tag metadata"]
  C --> D["Index: semantic + structured"]
  D --> E["MCP server exposes search tools"]
  E --> F{"Claude: which path?"}
  F -->|Concept match| G["search_clauses"]
  F -->|Known type/date| H["list_documents + get_document"]
  G --> I["Return clauses with pin-cites"]
  H --> I

Step 3: Write the system prompt and load the playbook

The system prompt is where the firm's judgment lives. It tells Claude its role — a contract-review assistant for licensed attorneys — its standards, and its hard rules: cite every clause you reference, never assert a fact without a source, frame conclusions as analysis for attorney review. Keep the prompt declarative and specific. Vague instructions like 'be careful' produce careless output; concrete rules like 'every flagged clause must include the document ID and paragraph number' produce verifiable output.

The playbook — your firm's standard positions on indemnification, liability caps, termination, and so on — does not belong hard-coded in the prompt. Load it through retrieval so it can change without redeploying the agent. The system prompt instructs Claude to compare each contract clause against the retrieved standard and explain any deviation. This separation keeps the prompt stable while the legal standards stay editable by the lawyers who own them.

Step 4: Wire the tool-use loop

Now the engine. Send Claude the system prompt, the contract reference, and the user's request. Claude responds with either an answer or a tool call. If it calls a tool, execute it against the MCP server, append the structured result to the conversation, and call Claude again. Repeat until Claude returns a final answer with no further tool calls. This loop — call, execute, append, recall — is the entire agent runtime in about forty lines of orchestration code.

Two implementation notes save you grief. First, cap the loop. Set a maximum number of tool iterations so a confused agent cannot spin forever; if it hits the cap, return what it has with a flag for human review. Second, validate every tool result before appending it. A tool that errors should return a structured error message Claude can reason about — 'document not found' — not a raw stack trace that derails the conversation. Claude handles clean error signals gracefully and will often retry with a corrected query.

Step 5: Add the governance gate before returning

Before any answer reaches the user, run it through a governance check. At minimum: confirm the requesting user is entitled to the matter, verify that every clause the answer cites actually appears in the retrieved sources, and ensure the output is framed as analysis rather than advice. The entitlement check should ideally run before retrieval too, but the output check is your last line of defense against a fabricated citation reaching a document that goes to a court or a counterparty.

A practical way to implement the citation check is to have the agent emit its answer as structured output — a list of findings, each with the clause text and its source reference — rather than free prose. Your governance code then mechanically confirms each referenced source exists in the turn's retrieval results. If a finding cites a source that was never retrieved, the agent hallucinated it, and you block the response. This single check eliminates the most dangerous failure mode in legal AI.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 6: Test against real contracts and ship narrow

Before production, build an evaluation set from real, anonymized contracts where a lawyer has already marked the risky clauses. Run the agent and compare. You are measuring two things: does it find the deviations a human found (recall), and does it avoid flagging things that are not problems (precision). A legal agent that cries wolf gets ignored; one that misses a missing liability cap is worse. Tune the prompt and retrieval against this set until both numbers satisfy the attorneys who will rely on it.

Ship to a small group first. Give three attorneys the agent for their real contract reviews, watch where it fails, and fix the retrieval or prompt accordingly. An agentic contract-review system is software that combines a language model, document retrieval, and a tool loop to identify and explain contractual risk against a defined standard, under human supervision. The walkthrough above is the minimum viable version of exactly that — and it is enough to earn the first real users.

Frequently asked questions

How long does it take to build this?

A focused engineer can stand up the document MCP server, dual-path retrieval, the tool loop, and a basic governance gate in one to two weeks for a single contract type. The longer work is the evaluation set and the iteration with attorneys, which is where quality actually comes from.

Should the playbook be in the prompt or in retrieval?

In retrieval. Hard-coding standards in the system prompt means every change requires a redeploy and couples engineering to legal policy. Loading the playbook through a tool lets the lawyers who own those positions edit them directly, and keeps the prompt stable.

What stops the agent from inventing a citation?

The structured-output governance check. By having Claude emit findings with explicit source references and mechanically verifying each reference against that turn's retrieved documents, you block any answer that cites a source the agent never actually saw. This is the single most important safety control in the build.

From the redline to the phone line

The same build pattern — tools behind MCP, a tight reasoning loop, a governance gate before output — powers agents far beyond contracts. CallSphere uses it for voice and chat, with assistants that answer calls, query your systems live, and schedule work day and night. See it working at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.

Share

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.