Migrating a Workflow to Claude Agents Without Breaking It (Skills For Organizations)
A staged playbook to migrate an existing workflow onto Claude agents — strangler-fig rollout, shadow mode, human-in-the-loop, and reversible cutover.
Most agent projects don't start from a blank page. They start with a workflow that already runs — a rules engine, a pile of scripts, a team doing the task by hand, or a brittle automation everyone is afraid to touch. The temptation is to rip it out and replace it with a shiny Claude-powered agent in one heroic cutover. That is also the fastest way to take down a process the business depends on. Migrating to agents safely is a different skill from building agents: it's mostly about controlling risk while you swap the engine on a moving car.
This post lays out a staged playbook — wrap, shadow, assist, then hand off — that lets you move an existing workflow onto Claude agents and skills with a rollback at every step and no big-bang moment to dread.
Key takeaways
- Migrate incrementally with a strangler-fig approach — wrap the old workflow, replace one slice at a time, never all at once.
- Run the agent in shadow mode first: it processes real inputs and you compare its output to the existing system without acting on it.
- Graduate to human-in-the-loop, where the agent proposes and a person approves, before any autonomous action.
- Keep the old path as a fallback behind a feature flag, so you can roll back instantly at every stage.
- Don't port the legacy logic literally — map the workflow to tools and skills, letting the model handle the judgment the old rules approximated.
Map the workflow before you touch the model
Start by writing down what the existing workflow actually does — not what the documentation claims, but the real steps, the inputs, the decision points, and the edge cases the current system fumbles. Identify which steps are deterministic (these may stay as plain code or become tools) and which require judgment (these are where an agent earns its keep). The output of this exercise is a map: inputs, the sequence of decisions, the tools each decision needs, and a clear definition of what a successful run looks like.
This mapping is also where you decide the shape of the agent. Deterministic lookups and mutations become tools with tight schemas. Domain knowledge — the policies, the formats, the "how we do it here" — becomes a skill the agent loads. The agent orchestrates; your tools do the irreversible work under controlled permissions.
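As a rough illustration of a tool with a tight schema, the sketch below defines one read-only lookup in the `name` / `description` / `input_schema` shape that Claude tool definitions use. The ticket workflow, the `lookup_ticket` name, the ID pattern, and the hand-rolled `validate_tool_input` helper are all assumptions made for this example; a real project would more likely lean on a full JSON Schema validator.

```python
import re

# Hypothetical read-only tool for a ticket-routing workflow (illustrative names).
LOOKUP_TICKET_TOOL = {
    "name": "lookup_ticket",
    "description": "Fetch one support ticket by ID. Read-only; never modifies the ticket.",
    "input_schema": {
        "type": "object",
        "properties": {
            "ticket_id": {"type": "string", "pattern": "^TCK-[0-9]{6}$"},
        },
        "required": ["ticket_id"],
        "additionalProperties": False,
    },
}


def validate_tool_input(tool: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the input is acceptable.

    Covers only the schema features used above (required fields, extra
    fields, string type, pattern). It is not a general JSON Schema validator.
    """
    schema = tool["input_schema"]
    props = schema["properties"]
    problems = []
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            if schema.get("additionalProperties", True) is False:
                problems.append(f"unexpected field: {name}")
            continue
        spec = props[name]
        if spec.get("type") == "string" and not isinstance(value, str):
            problems.append(f"{name} must be a string")
        elif "pattern" in spec and not re.search(spec["pattern"], value):
            problems.append(f"{name} does not match {spec['pattern']}")
    return problems
```

Rejecting unexpected fields and malformed IDs at the tool boundary keeps the agent's mistakes from reaching the system of record, which is the point of putting irreversible work behind tools rather than in the prompt.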
The staged rollout: wrap, shadow, assist, hand off
The safe path has four stages, and you don't advance until the current stage proves out on real traffic. Each stage adds agent autonomy while keeping the old system reachable. The point is that at no moment is the business relying on something you haven't watched run on real data.
    flowchart TD
        A["Existing workflow (live)"] --> B["Stage 1: wrap behind a flag"]
        B --> C["Stage 2: shadow mode (agent runs, output compared)"]
        C --> D{"Match rate acceptable?"}
        D -->|No| C
        D -->|Yes| E["Stage 3: human-in-the-loop approval"]
        E --> F{"Approval rate high & stable?"}
        F -->|No| E
        F -->|Yes| G["Stage 4: autonomous with fallback"]

Stage one is a wrapper: put the whole workflow behind a feature flag and a clean interface, changing nothing about behavior. This buys you a switch you can flip later and a seam to insert the agent. Stage two is shadow mode. Stage three is human-in-the-loop. Stage four is supervised autonomy. We'll take the two riskiest in turn.
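One way the stage-one wrapper might look is sketched below. The `FlagStore` class is an in-memory stand-in for whatever feature-flag service you already run, and the `agent_primary` flag name and `wrap_workflow` function are hypothetical. With the flag off, or with no agent supplied, behavior is exactly the legacy behavior.

```python
from typing import Any, Callable, Optional

Handler = Callable[[Any], Any]


class FlagStore:
    """In-memory stand-in for a real feature-flag service (assumption)."""

    def __init__(self, enabled: tuple = ()) -> None:
        self._enabled = set(enabled)

    def enabled(self, name: str) -> bool:
        return name in self._enabled


def wrap_workflow(
    legacy: Handler, flags: FlagStore, agent: Optional[Handler] = None
) -> Handler:
    """Return a single entry point that callers use instead of the legacy code."""

    def run(payload: Any) -> Any:
        # The agent path is taken only when an agent exists AND the flag is on.
        if agent is not None and flags.enabled("agent_primary"):
            try:
                return agent(payload)
            except Exception:
                # Any agent failure falls back to the legacy path.
                return legacy(payload)
        return legacy(payload)

    return run
```

At stage one you would call `wrap_workflow(legacy, flags)` with no agent at all, so the only change is that callers now go through one seam. Later stages pass an agent in and flip the flag.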
Shadow mode: measure before you trust
In shadow mode the agent receives the same real inputs as the production workflow and produces its output, but that output is logged and compared — never acted on. The existing system stays in charge. This is the cheapest, safest way to learn how the agent behaves on your actual data distribution, including the long tail the demo never showed you.
Define a comparison metric up front. For a classification or routing task, that's agreement rate with the current system, with a human resolving disagreements to find out who was right (sometimes the agent is). For a generative task, sample outputs into your eval rubric. Run shadow mode long enough to cover real variety — peaks, edge cases, the weird Tuesday inputs — and watch not just accuracy but cost and latency. You're deciding whether this is ready to influence reality.
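For a routing or classification task, the agreement metric can be computed from logged comparisons along these lines. The `Comparison` record and both helper functions are hypothetical names for this sketch, and it assumes legacy and agent outputs are simple labels that can be compared for equality.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Comparison:
    input_id: str
    legacy: str  # label chosen by the existing system
    agent: str   # label chosen by the agent in shadow mode


def agreement_rate(rows: list[Comparison]) -> float:
    """Fraction of inputs where the agent matched the legacy system."""
    if not rows:
        raise ValueError("no comparisons logged yet")
    return sum(r.legacy == r.agent for r in rows) / len(rows)


def top_disagreements(rows: list[Comparison], n: int = 3):
    """Most common (legacy, agent) mismatches, for human adjudication."""
    pairs = Counter((r.legacy, r.agent) for r in rows if r.legacy != r.agent)
    return pairs.most_common(n)
```

The disagreement list matters as much as the headline rate: it tells reviewers which mismatches to adjudicate first, and some of those will turn out to be cases where the legacy system was the one that was wrong.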
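    # legacy_workflow, flags, claude_agent, and the two log helpers stand for
    # your application's own objects; the function name is illustrative.
    def run_workflow(payload):
        result = legacy_workflow.run(payload)  # still authoritative
        if flags.enabled("agent_shadow", payload):
            try:
                shadow = claude_agent.run(payload)  # logged, not acted on
                log_comparison(payload, legacy=result, agent=shadow)
            except Exception as e:
                log_shadow_error(payload, e)  # never breaks the live path
        return result

Human-in-the-loop: the agent proposes, a person disposes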
When shadow data looks good, promote the agent to assist a human rather than replace one. Now the agent does the work and produces a proposed action — a draft reply, a suggested routing, a filled form — and a person reviews and approves before anything executes. This stage does two things at once: it protects you from the agent's mistakes, and it generates a stream of approve/edit/reject signals that are pure gold for your eval set and your skill instructions.
Track the approval rate and the edit rate. If reviewers approve most proposals untouched, the agent is ready for more autonomy on that slice. If they constantly rewrite a particular kind of output, you've found a precise weakness to fix in a tool or a skill before going further. Only when approval is high and stable do you let the agent act on the low-risk, high-confidence slice autonomously — keeping human review on the rest and the legacy fallback wired in.
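A minimal sketch of that tracking, broken down by slice, might look like this. The `Review` record, the slice names, and the thresholds in `ready_for_autonomy` are assumptions for illustration; the right approval bar and sample size depend on how costly a mistake is in your workflow.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    slice_name: str  # e.g. a ticket category (hypothetical)
    decision: str    # "approve", "edit", or "reject"


def rates_by_slice(reviews: list[Review]) -> dict[str, dict[str, float]]:
    """Approval and edit rates per slice, plus the review count."""
    counts = defaultdict(lambda: {"approve": 0, "edit": 0, "reject": 0})
    for r in reviews:
        counts[r.slice_name][r.decision] += 1  # KeyError on an unknown decision
    stats = {}
    for name, c in counts.items():
        total = sum(c.values())
        stats[name] = {
            "approval_rate": c["approve"] / total,
            "edit_rate": c["edit"] / total,
            "n": total,
        }
    return stats


def ready_for_autonomy(
    stats: dict[str, float], min_approval: float = 0.95, min_reviews: int = 200
) -> bool:
    """Gate on both volume and approval rate (threshold values are assumptions)."""
    return stats["n"] >= min_reviews and stats["approval_rate"] >= min_approval
```

Requiring a minimum number of reviews guards against promoting a slice on the strength of a handful of easy cases. A high edit rate on one slice points at the specific tool or skill to fix.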
Cutover and fallback: never burn the bridge
Even at stage four, autonomy should be partial and reversible. Let the agent run unattended on the cases it has earned, route the uncertain ones to a human, and keep the old workflow one flag-flip away. Watch your eval scores and operational metrics continuously; a model upgrade or a data shift can regress behavior, and you want to catch it from a dashboard, not a customer complaint.
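The partial, reversible autonomy described above can be sketched as a small routing function. Everything here is hypothetical: the `Route` names, the `choose_route` signature, and in particular the `confidence` score, which this article does not specify how to obtain and which you would need to define and calibrate for your own workflow.

```python
from enum import Enum


class Route(Enum):
    LEGACY = "legacy"              # flag off: old workflow handles it
    HUMAN_REVIEW = "human_review"  # agent proposes, a person approves
    AUTONOMOUS = "autonomous"      # agent acts without review


def choose_route(
    slice_name: str,
    confidence: float,
    *,
    agent_enabled: bool,
    autonomous_slices: frozenset[str],
    min_confidence: float = 0.9,  # assumed threshold
) -> Route:
    """Decide who handles one input at stage four."""
    if not agent_enabled:
        # The one flag-flip rollback: everything returns to the legacy path.
        return Route.LEGACY
    if slice_name in autonomous_slices and confidence >= min_confidence:
        return Route.AUTONOMOUS
    # Unearned slices and uncertain cases still go to a person.
    return Route.HUMAN_REVIEW
```

Checking the flag first means a rollback overrides every other rule, so a regression after a model upgrade can be contained before anyone debugs it.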
Decommission the legacy path only after the agent has run autonomously and clean for a meaningful period across the full range of inputs. Until then, the old system isn't technical debt — it's your insurance policy, and it's cheap.
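To make "a meaningful period across the full range of inputs" concrete, one option is an explicit readiness check like the sketch below. The run-log shape, the function name, and the default window and volume thresholds are all assumptions; pick values that match your traffic and risk tolerance.

```python
from datetime import date, timedelta


def ready_to_decommission(
    runs: list[tuple[str, date, bool]],  # (input_type, day, succeeded)
    expected_types: set[str],
    today: date,
    window_days: int = 30,        # assumed length of the clean period
    min_runs_per_type: int = 50,  # assumed minimum volume per input type
) -> bool:
    """True only if every expected input type ran cleanly, and often enough,
    inside the recent window."""
    cutoff = today - timedelta(days=window_days)
    seen = {t: 0 for t in expected_types}
    for input_type, day, succeeded in runs:
        if day < cutoff:
            continue  # outside the window: ignored either way
        if not succeeded:
            return False  # any recent failure blocks decommissioning
        if input_type in seen:
            seen[input_type] += 1
    return all(count >= min_runs_per_type for count in seen.values())
```

Counting runs per input type is what enforces the "full range" condition: a spotless month that never included a rare input type does not pass.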
Common pitfalls
- Big-bang cutover. Replacing the whole workflow at once removes your ability to roll back and concentrates all the risk into one moment.
- Skipping shadow mode. Going straight to live action means your first encounter with the long tail is in production.
- Porting legacy rules literally. Re-encoding a thousand brittle if-statements as prompt text wastes the model's judgment; map to tools and skills instead.
- No comparison metric. Running a shadow with no defined agreement or quality measure gives you a vague feeling, not a go/no-go decision.
- Decommissioning the fallback too early. The old path is your rollback; keep it until the agent has proven out across the full input range.
Migrate a workflow in 6 steps
1. Map the real workflow: inputs, decisions, deterministic vs. judgment steps, success definition.
2. Wrap the existing workflow behind a feature flag with a clean interface.
3. Run the agent in shadow mode on real inputs and measure agreement, cost, and latency.
4. Promote to human-in-the-loop, where the agent proposes and a person approves.
5. Grant autonomy only on the high-confidence slice, routing the rest to humans, with the fallback wired in.
6. Decommission the legacy path only after a clean autonomous period across all input types.
Rollout stages at a glance
| Stage | Agent autonomy | Safety net |
|---|---|---|
| Wrap | None | Behavior unchanged |
| Shadow | Runs, doesn't act | Legacy still authoritative |
| Human-in-loop | Proposes only | Person approves each action |
| Supervised autonomy | Acts on safe slice | Flag rollback + legacy fallback |
Frequently asked questions
What is the strangler-fig approach for agent migration?
The strangler-fig approach replaces an existing workflow incrementally — wrapping it, then substituting one slice at a time with an agent — until the new system fully takes over, rather than doing a single risky cutover. It keeps a rollback available at every step.
What is shadow mode and why use it first?
Shadow mode runs the agent on real production inputs and logs its output for comparison without acting on it, so you can measure real-world behavior, cost, and accuracy against the existing system before trusting it with any live action.
Should I copy my old business rules into the prompt?
No. Map the workflow to tools (for deterministic actions) and skills (for domain knowledge), and let the model handle the judgment the old rules approximated. Porting hundreds of brittle rules verbatim wastes the agent's reasoning and is hard to maintain.
When can I turn off the legacy system?
Only after the agent has run autonomously and cleanly for a meaningful period across the full range of real inputs, with eval scores and operational metrics holding. Until then, keep the old path behind a flag as your instant rollback.
A safe path to agents on your phone lines
CallSphere uses this same staged, fallback-first rollout to move call and message handling onto voice and chat agents without disrupting the work that's already running. See how it's done at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.