Building a Cited Support Agent on Claude: A Walkthrough

A real end-to-end build of a Claude support agent that cites every answer — from messy knowledge base to shipped, monitored, span-grounded system.

Most citation tutorials stop at "here's a prompt that adds [Source 1]." Real projects don't stop there — they start there and then spend three weeks discovering why the citations are wrong. This is a full walkthrough of one realistic build: a customer-support agent on Claude that has to answer billing and policy questions and cite the exact help-center article behind every claim. I'll take it from the messy starting problem to a shipped, monitored system, including the parts that broke.

Key takeaways

  • The hard part isn't the prompt — it's turning a messy knowledge base into citable evidence with clean provenance.
  • A naive first prototype will cite confidently and wrongly; the fix is span-level retrieval plus a verification pass.
  • Shipping requires an abstention path and a human-handoff path, not just an answer path.
  • You'll ship faster by encoding the citation contract as an Agent Skill reused across agents.
  • The last 20% — monitoring unsupported-claim rate — is what keeps the system trustworthy after launch.

The starting problem

The brief was simple to state and hard to deliver: "Customers ask billing questions in chat. Answer them accurately, and show which help article each answer comes from, so agents and customers can trust it." The knowledge base was 400 help articles in a CMS, many overlapping, some contradicting each other, none with stable IDs or reliable publish dates. That last detail — no provenance — is where the real work hid.

From documents to citable evidence

Before any Claude call, we had to make the corpus citable. That meant exporting each article with a stable ID, title, URL, and last-updated date, then chunking on semantic boundaries — one chunk per distinct policy statement — so a citation would land on a coherent claim instead of half a paragraph.

flowchart TD
  A["400 help articles"] --> B["Export with ID, URL, date"]
  B --> C["Chunk on policy boundaries"]
  C --> D["Embed & index spans"]
  D --> E["User question"]
  E --> F["Retrieve top spans"]
  F --> G["Claude: answer + cite span IDs"]
  G --> H{"Auditor: spans support claims?"}
  H -->|No| I["Abstain or hand to human"]
  H -->|Yes| J["Reply with clickable citations"]

The indexing step stored both the text and the metadata for every chunk, so a returned span could render as a real clickable link with a date stamp. Capturing provenance up front, at export time, is what made cited answers possible downstream.
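To make the export-and-chunk step concrete, here is a minimal sketch of what a citable span record could look like. The `Span` fields, the `chunk_article` and `render_citation` helpers, and the paragraph-per-policy splitting rule are all assumptions for illustration; the only detail taken from this build is the `HC-1042#s3` style of span ID.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Span:
    span_id: str         # immutable article ID plus span index, e.g. "HC-1042#s3"
    article_id: str
    title: str
    url: str
    last_updated: date
    text: str


def chunk_article(article_id: str, title: str, url: str,
                  last_updated: date, body: str) -> list[Span]:
    """Split an article into spans, one per paragraph.

    Assumption: each blank-line-separated paragraph holds one policy
    statement. A real corpus would likely need a smarter boundary detector.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return [
        Span(f"{article_id}#s{i}", article_id, title, url, last_updated, text)
        for i, text in enumerate(paragraphs, start=1)
    ]


def render_citation(span: Span) -> str:
    """Render a span as a link with a date stamp (the format is illustrative)."""
    return (f"[{span.span_id}] {span.title} ({span.url}), "
            f"updated {span.last_updated.isoformat()}")
```

Because every span carries its own article ID, URL, and date, the reply layer never has to look anything up by title, which is the failure mode that renamed articles trigger.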

The first prototype, and why it lied

The naive version retrieved whole articles and asked Claude to "answer and cite." It worked beautifully in demos and failed in testing. The retriever would return a relevant-looking article, and the model would synthesize a plausible answer and cite that article, even when the article didn't actually contain the specific fact. This is the classic faithfulness failure: a real citation attached to an unsupported claim.

Two fixes turned it around. First, retrieve and cite spans, not whole articles, so the model had to point at the exact sentence. Second, add an independent auditor pass that checked each claim against its cited span. Here's the grounding instruction that shipped, encoded once as a Skill so every support agent inherited it:

RULES for grounded support answers:
- Cite the exact span: [HC-1042#s3].
- Quote no more than needed; paraphrase but stay faithful.
- If no span supports the answer, reply:
  "I can't confirm that from our help center — connecting you to an agent."
- Never invent policy. Never merge two policies into one claim.
- If spans conflict, say so and escalate.

The escalation line is doing quiet heavy lifting: it converts "I don't know" from an embarrassment into a clean human handoff, which customers actually preferred to a guess.
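The verify-then-deliver step can be sketched roughly as follows. This is a simplified illustration, not the auditor that shipped: `audit`, `Verdict`, and `lexical_support` are hypothetical names, the word-overlap check is a toy stand-in for what would more plausibly be a separate model call or entailment check, and the sketch assumes each sentence is one claim with its citation placed before the final punctuation.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Matches span citations in the format used by the rules, e.g. [HC-1042#s3].
CITATION = re.compile(r"\[([A-Z]+-\d+#s\d+)\]")


@dataclass
class Verdict:
    deliver: bool
    reason: str


def lexical_support(claim: str, span_text: str, threshold: float = 0.6) -> bool:
    """Toy support check: share of claim words that appear in the span.

    Assumption: a production auditor would not rely on word overlap.
    """
    claim_words = set(re.findall(r"[a-z0-9]+", claim.lower()))
    span_words = set(re.findall(r"[a-z0-9]+", span_text.lower()))
    if not claim_words:
        return False
    return len(claim_words & span_words) / len(claim_words) >= threshold


def audit(answer: str, retrieved: dict[str, str],
          supports: Callable[[str, str], bool] = lexical_support) -> Verdict:
    """Check every sentence of the answer against the spans it cites."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return Verdict(False, "empty answer")
    for sentence in sentences:
        span_ids = CITATION.findall(sentence)
        if not span_ids:
            return Verdict(False, "uncited claim")
        claim = CITATION.sub("", sentence).strip()
        for span_id in span_ids:
            if span_id not in retrieved:
                return Verdict(False, f"unknown span {span_id}")
            if not supports(claim, retrieved[span_id]):
                return Verdict(False, f"claim not supported by {span_id}")
    return Verdict(True, "all claims supported")
```

When `deliver` is false, the caller would send the abstention line from the rules and hand the conversation to a human instead of showing the draft answer. Passing `supports` as a parameter keeps the routing logic testable while the actual support check is swapped for something stronger.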

Common pitfalls we hit (so you don't have to)

  • Citing articles, not spans. Article-level citations let the model hide unsupported claims inside relevant documents. Always cite the smallest supporting span.
  • No stable IDs. Our first export used titles as keys; an editor renamed an article and every citation broke. Use immutable IDs from day one.
  • Ignoring contradictions in the corpus. Two articles gave different refund windows. The agent confidently cited one until we forced conflict detection. Clean the corpus or teach the agent to surface the clash.
  • Demo-driven confidence. The prototype looked perfect on five hand-picked questions. We only caught the faithfulness failures with a 200-question eval set. Build the eval before you trust the demo.
  • Forgetting freshness. A cited answer from an outdated article is still wrong. We added a last-updated badge to every citation and a TTL that flags stale spans.
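The freshness pitfall lends itself to a small check. The sketch below shows one way a last-updated badge and a TTL flag could work together; the 180-day TTL, the function names, and the badge wording are assumptions for illustration, not the values used in this build.

```python
from datetime import date, timedelta

# Assumption: 180 days is an illustrative TTL, not the value the team used.
STALE_AFTER = timedelta(days=180)


def is_stale(last_updated: date, today: date,
             ttl: timedelta = STALE_AFTER) -> bool:
    """A span is stale once its article is older than the TTL."""
    return today - last_updated > ttl


def citation_badge(span_id: str, last_updated: date, today: date) -> str:
    """Attach the last-updated date to a citation and flag stale spans."""
    badge = f"[{span_id}] updated {last_updated.isoformat()}"
    if is_stale(last_updated, today):
        badge += " (stale, needs review)"
    return badge
```

Taking `today` as an argument rather than calling `date.today()` inside the function keeps the check deterministic in an eval run.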

Ship a cited agent in five steps

  1. Export your corpus with stable IDs, URLs, and dates — fix provenance before anything else.
  2. Chunk on claim boundaries and index spans, not whole documents.
  3. Write a grounding Skill with explicit citation format, abstention, and escalation rules.
  4. Add an independent auditor pass that verifies claim-to-span support before delivery.
  5. Launch behind a 200-question eval and monitor unsupported-claim rate weekly.
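For step 5, a metric like unsupported-claim rate can be computed from eval results in a few lines. This sketch assumes one particular definition (unsupported claims divided by all claims in answered questions, with abstentions excluded and tracked separately); the `EvalResult` fields and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    question_id: str
    abstained: bool    # agent declined and handed off to a human
    claims: int        # number of claims in the delivered answer
    unsupported: int   # claims an auditor or reviewer marked unsupported


def unsupported_claim_rate(results: list[EvalResult]) -> float:
    """Unsupported claims as a share of all claims in answered questions."""
    answered = [r for r in results if not r.abstained]
    total = sum(r.claims for r in answered)
    bad = sum(r.unsupported for r in answered)
    return bad / total if total else 0.0


def abstention_rate(results: list[EvalResult]) -> float:
    """Share of questions where the agent abstained."""
    return sum(r.abstained for r in results) / len(results) if results else 0.0
```

Reporting the two rates side by side matters: an agent can drive unsupported claims toward zero simply by abstaining on everything, so a falling unsupported-claim rate only means something if the abstention rate stays reasonable.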

Prototype vs. shipped system

| Aspect | Naive prototype | Shipped system |
| --- | --- | --- |
| Retrieval unit | Whole article | Claim-level span |
| Citation target | Article link | Exact span + date |
| Verification | None | Independent auditor pass |
| Unknown handling | Guesses | Abstain & escalate |
| Confidence basis | Demo | 200-question eval |

Frequently asked questions

How long did this take?

About three weeks: roughly one on corpus provenance and chunking, one on the verification loop and eval set, and one on handoff UX and monitoring. The corpus work was the longest and least glamorous.

Could we skip the auditor pass to save cost?

We tried; unsupported-claim rate roughly tripled. We kept the auditor on transactional answers and dropped it only for purely informational ones.

What surprised you most?

Customers preferred a clean "I'll connect you to an agent" over a confident wrong answer by a wide margin. Abstention improved satisfaction, not just safety.

From help-center citations to answered calls

This exact pattern — citable corpus, span-level grounding, verify-then-deliver, escalate when unsure — is how CallSphere builds voice and chat agents that handle support 24/7 and stay grounded in your real policies. See the live system at callsphere.ai.


Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
