A Claude Cowork plugin walkthrough: problem to shipped
A realistic end-to-end Claude Cowork plugin build — decompose the problem, wire connectors and skills, prove it with evals, and ship a reusable workflow.
Abstract advice about agentic work only gets you so far. To see how Claude Cowork plugins actually deliver value, it helps to follow one all the way through — from the messy problem a real team faces, through building the plugin, to the point where it is shipped, trusted, and reused. This walkthrough is a composite of how these projects tend to go, with the decisions and dead ends left in rather than smoothed over.
The team in question is a mid-sized company's revenue operations group. Every Monday they assemble a pipeline review: pulling deal data from the CRM, cross-referencing recent activity, flagging at-risk deals, and writing a narrative summary for leadership. It takes two analysts most of a day, the output is inconsistent depending on who does it, and by the time it is ready the data is already stale. This is a perfect candidate for a plugin: repetitive, rule-bound at its core, but requiring judgment in the summary.
Step one: define the problem precisely
The temptation is to say "automate the pipeline review." That is too vague to build against. The first real work is decomposition. We sat with the analysts and broke the task into five stages: gather deal records updated in the last week, join them to activity logs, apply the at-risk rules (no contact in 14 days, slipped close date, stalled stage), rank the flags, and draft a summary in leadership's preferred format. Writing this down turned a fuzzy chore into a sequence with clear inputs, outputs, and a definition of done at each stage.
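To make "explicit rules" concrete, here is a minimal Python sketch of the three at-risk checks named above. It is an illustration, not the plugin's actual skill (which, in this walkthrough, is written in plain language). The `Deal` fields, the rule labels, and the 30-day stalled-stage threshold are all assumptions; the text only fixes the 14-day contact window.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deal:
    # Hypothetical field names; a real CRM export will differ.
    name: str
    last_contact: date      # most recent logged activity
    original_close: date    # close date when the deal was first forecast
    current_close: date     # close date as of this review
    days_in_stage: int      # days since the deal last changed stage


NO_CONTACT_DAYS = 14        # from the rule stated in the text
STALLED_STAGE_DAYS = 30     # assumed threshold; the text gives no number


def at_risk_reasons(deal: Deal, today: date) -> list[str]:
    """Return the label of every at-risk rule the deal triggers."""
    reasons = []
    if (today - deal.last_contact).days >= NO_CONTACT_DAYS:
        reasons.append("no_contact_14d")
    if deal.current_close > deal.original_close:
        reasons.append("slipped_close_date")
    if deal.days_in_stage > STALLED_STAGE_DAYS:
        reasons.append("stalled_stage")
    return reasons
```

Returning the list of triggered rules, rather than a single yes/no, keeps the reason for each flag visible to whoever reviews the output.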
This decomposition is also where you discover what the agent needs to touch. To gather and join data it needs read access to the CRM and the activity system. To draft the summary it needs the past few reports as examples. Nothing in the task requires write access to the CRM, which immediately tells us the connectors should be read-only — a risk decision made for free, just by being precise about the problem.
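One way to keep that decision honest is to write the access requirements down and check them mechanically. The sketch below is hypothetical: `ConnectorScope` and `scope_violations` are illustrative names, not part of any Claude Cowork or MCP manifest format. It simply reports write access that is granted and any field the task needs but has not been given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorScope:
    # Hypothetical description of what one connector is allowed to do.
    system: str
    fields: frozenset
    can_write: bool = False


def scope_violations(scopes, required_fields):
    """List problems: any write access, or a required field not granted.

    required_fields is a set of (system, field) pairs.
    """
    problems = []
    for scope in scopes:
        if scope.can_write:
            problems.append(f"{scope.system}: write access not needed")
    granted = {(s.system, f) for s in scopes for f in s.fields}
    for system, field in sorted(required_fields):
        if (system, field) not in granted:
            problems.append(f"{system}.{field}: required but not granted")
    return problems
```

An empty list means the granted scopes match the decomposition; anything else is a conversation to have before the first run.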
Step two: wire the connectors and skills
With the stages defined, we built the plugin's parts. Two MCP connectors gave read access to the CRM and the activity log, scoped to exactly the fields the task needed. A skill encoded the at-risk rules in plain language with examples, so the agent applied them consistently rather than improvising. Another skill captured the report format and tone, with two anonymized past reports as references.
```mermaid
flowchart TD
    A["Monday trigger"] --> B["CRM connector: pull deals updated in 7 days"]
    B --> C["Activity connector: join contact logs"]
    C --> D["Apply at-risk rules skill"]
    D --> E{"Any flags found?"}
    E -->|No| F["Draft 'pipeline healthy' summary"]
    E -->|Yes| G["Rank flags by deal value & risk"]
    G --> H["Draft summary in leadership format"]
    F --> I["Analyst reviews & approves"]
    H --> I --> J["Post to leadership channel"]
```

The diagram is the plugin's actual flow. A Claude Cowork plugin is a packaged bundle of skills, MCP connectors, and sub-agents that a non-engineer can install to run a complete agentic workflow. Notice the human approval node before anything is posted. That was non-negotiable for a report executives read, and it cost almost nothing because the agent did all the gathering and drafting.
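The "rank flags by deal value & risk" node leaves the ordering unspecified. As a sketch under stated assumptions, the function below treats "risk" as the number of rules a deal triggered and breaks ties by deal value. Both the ordering and the `reasons` and `value` keys are illustrative choices, not something the walkthrough prescribes.

```python
def rank_flags(flagged):
    """Sort flagged deals: most rules triggered first, then largest value.

    Each item is a dict with hypothetical keys "reasons" (list of rule
    labels) and "value" (deal size as a number).
    """
    return sorted(flagged, key=lambda d: (-len(d["reasons"]), -d["value"]))
```

Whatever ordering a team picks, writing it down as a single sort key makes it easy to explain why one deal appears above another.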
Step three: the first runs go wrong (usefully)
The first version did not work cleanly, which is normal and good. On the first real run, the agent flagged deals as at-risk that the analysts knew were fine: the "no contact in 14 days" rule fired on deals that were intentionally paused. The analysts caught it in review, which is exactly what the approval step is for. Rather than patch the prompt ad hoc, we updated the at-risk skill to recognize a "paused" status and exclude it. The fix became part of the plugin, so the same false positive would not recur.
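Expressed as logic, the fix is a guard that runs before the contact rule. The sketch below is illustrative only: the real change in this walkthrough was made to a plain-language skill, and the `status` parameter and the literal value `"paused"` are assumptions about how the CRM records a paused deal.

```python
from datetime import date

NO_CONTACT_DAYS = 14  # from the rule stated in the text


def no_contact_flag(last_contact: date, status: str, today: date) -> bool:
    """Fire the no-contact rule unless the deal is intentionally paused."""
    if status.strip().lower() == "paused":  # assumed status value
        return False
    return (today - last_contact).days >= NO_CONTACT_DAYS
```

The case that exposed the problem (a paused deal with no recent contact) is worth keeping as a permanent check, so a later edit to the rules cannot quietly bring the false positive back.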
The second issue was tone: the draft summaries were accurate but read like a robot wrote them. We fed two more example reports into the format skill and the next drafts landed in the right voice. This is the pattern — every gap between the agent's output and what good looks like becomes a concrete improvement to a skill, not a vague instruction to "do better."
Step four: prove it with evals before going wide
Before letting the plugin run without an analyst hovering over every line, we built a small evaluation set: ten past Mondays where we knew the correct flags and roughly what the summary should say. We ran the plugin against each and checked whether it caught the right at-risk deals and produced a usable draft. It got the flags right on nine of ten, and the one miss was a genuinely ambiguous case the analysts had debated themselves.
That eval set is now a guardrail. When the underlying model updates or we change a skill, we re-run those ten cases first. If the plugin still passes, we ship the change with confidence. If it regresses, we catch it before it reaches the team. This is what separates a demo from a production plugin: the ability to prove it still works after a change without manual spot-checking.
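A regression gate of this kind can be very small. The following is a minimal sketch, assuming the plugin can be called as a function that takes a week's deals and returns the flagged deal IDs; `run_evals`, the case dictionary keys, and the 0.9 baseline (nine of ten) are illustrative. It checks only the flags, since judging whether a draft summary is "usable" needs a human or a separate grader.

```python
def run_evals(plugin, cases, baseline=0.9):
    """Run the plugin on each stored case and compare flag sets exactly.

    plugin: callable taking a list of deals, returning flagged deal IDs.
    cases:  list of dicts with hypothetical keys "week", "deals",
            and "expected_flags".
    Returns (passed_gate, pass_rate, weeks_that_failed).
    """
    failures = []
    for case in cases:
        predicted = set(plugin(case["deals"]))
        if predicted != set(case["expected_flags"]):
            failures.append(case["week"])
    passed = len(cases) - len(failures)
    pass_rate = passed / len(cases)
    return pass_rate >= baseline, pass_rate, failures
```

Returning the failing weeks alongside the pass rate matters in practice: a regression is only actionable if you can see which case broke.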
Step five: ship, measure, and reuse
The shipped plugin turned a full day of two analysts' time into roughly thirty minutes of review and light editing. The summary is consistent week to week because the format lives in a skill, not in a person's head. And because the at-risk rules are explicit, leadership trusts the flags — they know exactly why a deal was surfaced.
The reuse came later and almost by accident. A neighboring team that does renewal reviews realized the same scaffold — pull data, apply rules, rank, draft, approve — fit their problem with different connectors and rules. They forked the plugin, swapped the skills, and were running in days rather than weeks. That is the compounding value of building this way: the first plugin is an investment, and every subsequent one is cheaper because the patterns and connectors already exist.
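The shared scaffold can be pictured as one function with the team-specific parts passed in. This is a hedged sketch, not how a Cowork plugin is actually packaged: `run_review` and its four parameters are hypothetical names standing in for the connectors, the rules skill, the format skill, and the human approval step.

```python
from typing import Callable, Iterable, Optional


def run_review(
    pull: Callable[[], Iterable[dict]],
    rules: Callable[[dict], list],
    draft: Callable[[list], str],
    approve: Callable[[str], bool],
) -> Optional[str]:
    """Pull data, apply rules, rank, draft; return the draft only if approved."""
    flagged = []
    for record in pull():
        reasons = rules(record)
        if reasons:
            flagged.append({**record, "reasons": reasons})
    # Simple ranking assumption: records that trigger more rules come first.
    flagged.sort(key=lambda r: len(r["reasons"]), reverse=True)
    summary = draft(flagged)
    return summary if approve(summary) else None
```

Under this framing, the renewal team's fork amounts to supplying a different `pull` and `rules` while the ordering of stages and the approval gate stay untouched.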
Frequently asked questions
How long did this plugin take to build?
The bulk of the work was decomposition and the first few corrective runs, spread over a couple of weeks of part-time effort by a plugin author working with the analysts. The connectors were the smallest piece. Most of the time went into getting the at-risk rules and tone right and building the eval set — the parts that make it trustworthy.
Why keep a human approval step if the agent is reliable?
Because the output goes to executives and the cost of approval is tiny relative to the cost of a wrong report reaching leadership. The agent eliminated the gathering and drafting toil; the human kept the judgment call. Over time, as the evals proved consistent, the review got faster, but for high-visibility output the approval stayed.
What made this a good first plugin to build?
It was repetitive, the rules could be made explicit, it needed only read access, and there was a clear definition of done. Tasks that are pure judgment with no structure are hard first projects; tasks that are pure mechanics do not need an agent. This one sat in the sweet spot of structured work plus a judgment layer.
How do you handle the agent being wrong in production?
The approval step catches errors before they ship, and every caught error becomes a skill update so it does not recur. Combined with the eval set that runs before any change, the plugin gets more reliable over time rather than drifting. Errors are treated as inputs to improvement, not as one-off fixes.
From workflows to live conversations
This same problem-to-shipped arc applies when the agent talks to customers. CallSphere builds voice and chat assistants the same way — decompose the call flow, wire scoped tools, prove it with evals, and ship an agent that answers every call and books work 24/7. See a working example at callsphere.ai.
Source & attribution: This is an independent, original explainer inspired by Anthropic's coverage on the Claude blog. Claude, Claude Code, Claude Cowork, Claude Opus, and the Model Context Protocol are products and trademarks of Anthropic. CallSphere is not affiliated with or endorsed by Anthropic.