A dynamic workflow in Claude Code, problem to shipped

Most writing about agentic AI stays at the level of capability lists — it can spawn subagents, load skills, call tools. What that misses is the texture of an actual run: where the agent decides things, where it stalls, where a human steps in, and what the harness has to look like for the whole thing to end in shipped code rather than a pile of plausible-looking diffs. This post follows one realistic task end to end, with the dynamic workflow doing the assembling at runtime.

The scenario is ordinary on purpose. A mid-size product team has a ticket: customers on annual plans are being charged a proration amount that does not match the invoice line items in a specific upgrade case. It is the kind of bug that touches billing logic, a database, and a customer-facing total — moderately reversible, moderately risky, frequent enough to be worth automating the investigation. We will trace how Claude Code takes it from problem to merge.

Framing the problem so the agent can own it

The engineer does not paste the raw ticket and walk away. The first move is to give the agent enough context to reason: a pointer to the billing module, the relevant CLAUDE.md notes about how proration is supposed to work, and a crisp acceptance condition — the invoice total must equal the sum of line items for the upgrade case, proven by a test. This framing is the difference between an agent that flails and one that has a target.

This is where the dynamic part begins. The engineer has not specified steps. They have specified the system's shape, the rules, and what "done" means. The agent will choose the steps. A dynamic workflow is exactly this: the agent assembles its own sequence of investigation, edits, and checks at runtime, guided by constraints rather than a fixed script.

How the agent assembles its plan at runtime

Claude Code starts by reading. It pulls the billing module, traces the proration function, and reads the failing case. It forms a hypothesis: the proration calculation rounds per-line-item, but the invoice total rounds once at the end, so the sums diverge by a cent in the upgrade path. It does not assume this is right — it writes a test that reproduces the discrepancy first.
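The rounding hypothesis is easy to check with a few lines of arithmetic. The sketch below is illustrative only: the amounts, the helper names, and the half-up rounding mode are assumptions, not details taken from the billing module in this scenario.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def round_cents(amount: Decimal) -> Decimal:
    """Round to whole cents, half-up (assumed rounding mode)."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def sum_of_rounded_lines(lines):
    """Round each line item first, then add them up."""
    return sum((round_cents(x) for x in lines), Decimal("0"))


def rounded_sum(lines):
    """Add the raw amounts first, then round once at the end."""
    return round_cents(sum(lines, Decimal("0")))


# Hypothetical prorated charges, each carrying a half-cent fraction.
lines = [Decimal("10.005"), Decimal("20.005"), Decimal("30.005")]

print(sum_of_rounded_lines(lines))  # 60.03
print(rounded_sum(lines))           # 60.02
```

The two strategies disagree by exactly one cent on this input, which is the kind of mismatch between line items and invoice total the ticket describes.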

flowchart TD
  A["Ticket + context loaded"] --> B["Agent reads billing module"]
  B --> C["Forms rounding hypothesis"]
  C --> D["Writes failing test to confirm"]
  D --> E{"Test reproduces bug?"}
  E -->|No| F["Revise hypothesis, reread"]
  F --> C
  E -->|Yes| G["Fix rounding, run full suite"]
  G --> H{"All green?"}
  H -->|No| I["Diagnose regression, adjust"]
  I --> G
  H -->|Yes| J["Open PR with explanation"]

Notice the loop the agent built without being told to: hypothesize, write a reproducing test, confirm, then fix. If the test does not reproduce the bug, the agent does not barrel ahead; it revises the hypothesis and rereads the code. This self-directed verification is what makes the run trustworthy. The harness gave it a test runner and the ability to see failures, and the agent used them to keep itself honest. That is dynamic workflow behavior: the path was not scripted; it emerged from what the agent found.
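A reproducing test for the acceptance condition might look like the sketch below. Everything named here is hypothetical: `invoice_buggy` and `invoice_fixed` stand in for the real billing code, the charges are made up, and deriving the total from the rounded lines is only one possible fix, since the walkthrough does not say which side of the calculation was changed.

```python
import unittest
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

# Hypothetical raw prorated charges for the upgrade case.
UPGRADE_CHARGES = [Decimal("10.005"), Decimal("20.005"), Decimal("30.005")]


def to_cents(amount: Decimal) -> Decimal:
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def invoice_buggy(charges):
    """Lines are rounded one by one; the total is rounded once from raw amounts."""
    lines = [to_cents(c) for c in charges]
    total = to_cents(sum(charges, Decimal("0")))
    return lines, total


def invoice_fixed(charges):
    """Assumed fix: the total is the sum of the already-rounded lines."""
    lines = [to_cents(c) for c in charges]
    return lines, sum(lines, Decimal("0"))


def make_case(build_invoice):
    """Build a TestCase that checks the acceptance condition for one implementation."""
    class UpgradeInvoiceCase(unittest.TestCase):
        def test_total_equals_sum_of_lines(self):
            lines, total = build_invoice(UPGRADE_CHARGES)
            self.assertEqual(total, sum(lines, Decimal("0")))
    return UpgradeInvoiceCase


def passes(build_invoice) -> bool:
    """Run the test against one implementation and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_case(build_invoice))
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()


print(passes(invoice_buggy))  # False: the test reproduces the discrepancy
print(passes(invoice_fixed))  # True: the same test now guards against regression
```

The test is written before the fix and must fail first. A test that passes against the unfixed code would mean the hypothesis is wrong, which is the "No" branch in the flowchart.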

Where it stalls and where the human steps in

The run is not friction-free, and pretending otherwise sells a fantasy. Halfway through, the agent finds that the proration logic is also used by a refund path it does not fully understand, and the rules CLAUDE.md describes do not cover refunds. A weak harness would let the agent guess. A good one has it surface the ambiguity: it pauses and asks whether the rounding fix should apply to refunds too, because it cannot verify the intended behavior from the code alone.

The engineer answers in one line — refunds round the same way, here is the rule — and the agent continues. That single exchange is the human-in-the-loop pattern working as designed. The agent did the heavy lifting of investigation and surfaced exactly the one decision that needed human judgment, instead of either stopping constantly or plowing through a guess. After the answer, the engineer adds the refund rule to CLAUDE.md so the next run never has to ask.
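The ask-when-uncertain behavior can be pictured as a small decision function: proceed when a documented rule covers the code path, otherwise return a question for a human. This is a conceptual sketch, not how Claude Code is implemented. The `Proceed` and `AskHuman` types, the `decide` function, and the rule strings are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Proceed:
    path: str
    rule: str


@dataclass(frozen=True)
class AskHuman:
    path: str
    question: str


def decide(path: str, documented_rules: Dict[str, str]) -> Union[Proceed, AskHuman]:
    """Proceed only when a documented rule covers this code path."""
    rule = documented_rules.get(path)
    if rule is None:
        return AskHuman(
            path,
            f"Should the rounding fix apply to the {path} path? "
            "No documented rule covers it.",
        )
    return Proceed(path, rule)


# Hypothetical stand-in for the proration notes in CLAUDE.md.
rules = {"upgrade": "invoice total equals the sum of rounded line items"}

first = decide("refund", rules)   # no rule yet, so the question surfaces
rules["refund"] = "refunds round the same way as upgrades"
second = decide("refund", rules)  # rule recorded, so the next run proceeds

print(type(first).__name__)   # AskHuman
print(type(second).__name__)  # Proceed
```

Recording the answer in the rules is what turns a one-off interruption into permanent context, mirroring the engineer's update to CLAUDE.md.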

Verification before anything counts as done

With the fix in place, the agent runs the full test suite, not just the new test. This catches the second-order risk: a rounding change in shared code could break the refund path or the downgrade path. The suite comes back green except for one downgrade test, which surfaces a genuine interaction the agent had not considered. It diagnoses, adjusts the fix to handle the downgrade case, and reruns until everything passes.

This is the part that separates a shipped outcome from a demo. The agent's first fix was plausible and wrong in a subtle way. The verification loop — comprehensive tests, run inside the workflow — caught it, and the agent corrected without a human noticing the near-miss. The lesson teams internalize quickly is that the value of dynamic workflows is gated almost entirely by the quality of the checks. Without the downgrade test, this run ships a regression.
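The gate itself is simple to express: nothing counts as done until every check in the suite passes, not just the new one. In the sketch below the check names and their hard-coded results are placeholders that mimic the run described here; a real harness would invoke the project's actual test runner.

```python
from typing import Callable, Dict, List


def run_suite(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of the ones that fail."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failures.append(name)
    return failures


def is_done(checks: Dict[str, Callable[[], bool]]) -> bool:
    """The change is done only when the whole suite is green."""
    return not run_suite(checks)


# Placeholder results for the first fix: the new test passes, downgrade breaks.
first_fix = {
    "upgrade_total_matches_lines": lambda: True,
    "refund_total_matches_lines": lambda: True,
    "downgrade_total_matches_lines": lambda: False,
}

# Placeholder results after the fix is adjusted for the downgrade case.
adjusted_fix = {name: (lambda: True) for name in first_fix}

print(run_suite(first_fix))   # ['downgrade_total_matches_lines']
print(is_done(first_fix))     # False
print(is_done(adjusted_fix))  # True
```

Judging the first fix by the new test alone would have reported success. Only the full run exposes the downgrade failure.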

From green build to merged change

The agent opens a pull request. The description is not boilerplate — it explains the rounding root cause, the cases covered, the refund and downgrade interactions it handled, and the one decision it asked a human about. A reviewer reads it in a couple of minutes, confirms the reasoning, and merges. The whole cycle, from ticket to merge, took a fraction of the time a manual investigation would have, and most of the engineer's involvement was a single clarifying answer plus a short review.

What made it work was not the agent's raw intelligence in isolation. It was the harness around it: the context that framed the problem, the test runner that let it self-verify, the permission to ask when genuinely uncertain, and the comprehensive suite that caught the subtle regression. Strip any of those out and the run degrades into a guess, an unverified diff, or an agent that needs babysitting at every step.

What this walkthrough generalizes to

The shape repeats across task classes: frame the problem with context and an acceptance condition, let the agent assemble its own investigation-and-fix loop, have it self-verify with real checks, surface only the decisions that need human judgment, and gate the merge on a comprehensive suite. Teams that codify this shape — investing in the context and the tests once — get to run a whole category of work this way, not just one heroic bug fix. The first run is slow because you are building the harness. The hundredth is fast because the harness already exists.

What the engineer chose not to do

It is worth naming the things the engineer deliberately did not do, because they are as instructive as the actions. They did not script the agent's steps — no "first read this file, then change that line." Scripting would have defeated the point; the value was in the agent discovering the rounding mismatch and the downgrade interaction on its own. They also did not walk away entirely. The one clarifying question about refunds was a genuine judgment call the code could not answer, and a harness that suppressed it would have shipped a guess.

And they did not skip the slow part. It is tempting, when an agent produces a clean-looking fix in minutes, to merge on the strength of how reasonable it sounds. The engineer instead let the full suite run and treated the downgrade failure as a gift rather than an annoyance: it caught a regression that human eyes would likely have missed in review. The whole episode is a small case study in restraint: trust the agent with the work, keep the judgment with the human, insist on verification before merge, and let the harness mediate between the two.

Frequently asked questions

How much of this run is the agent versus the human?

The agent does the investigation, hypothesis, test-writing, fixing, and full-suite verification. The human frames the problem with context and an acceptance condition up front, answers one clarifying question mid-run, and reviews the final PR. Most of the labor shifts to the agent; the judgment stays with the human.

What makes a task a good fit for a dynamic workflow?

Good candidates are tasks that are frequent enough to justify building the harness, reversible enough to survive a wrong attempt, and verifiable with automated checks. This billing bug qualifies: it recurs as a class, lives on a version-controlled branch, and has a clear test-based definition of done.

Why does the agent write a failing test before fixing the bug?

To confirm its hypothesis is actually correct before changing code. A reproducing test turns a guess into a verified diagnosis and gives the agent a concrete target. It also leaves behind a regression test, so the same bug cannot silently return later.

What happens if the agent's first fix is wrong?

The in-loop verification catches it. In this walkthrough the first fix passed the new test but broke a downgrade case the full suite caught. The agent diagnosed and corrected before the change reached a human, which is exactly why comprehensive, agent-runnable tests are the load-bearing part of the harness.

Bringing agentic AI to your phone lines

The same problem-to-shipped loop powers CallSphere's voice and chat agents: they investigate a caller's need, use tools mid-conversation, verify before acting, and hand off to a human on the one decision that needs it. See it live at callsphere.ai.

Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.
