Building a Code Review Multi-Agent Pipeline
Build a complete multi-agent code review system with specialized agents for analysis, security review, style checking, and a manager agent that synthesizes findings into actionable review comments.
Why Multi-Agent Code Review
Single-prompt code review — pasting code into ChatGPT and asking "review this" — produces shallow, generic feedback. The model tries to be everything at once: security auditor, style guide enforcer, performance analyst, and architecture reviewer. The result is a laundry list of surface-level observations that misses the deep issues.
Human code review teams work differently. A security specialist focuses exclusively on vulnerabilities. A performance engineer looks for bottlenecks and unnecessary allocations. A senior architect evaluates design decisions. Each reviewer brings specialized expertise and a focused lens.
Multi-agent code review replicates this structure. Specialized agents focus on specific review dimensions, and a manager agent synthesizes their findings into a coherent, prioritized review. The result is dramatically better than what any single agent — or single prompt — can produce.
System Architecture
The code review pipeline has three specialist agents and one manager agent:
- Code Analyzer — examines logic, control flow, error handling, and potential bugs
- Security Reviewer — focuses exclusively on security vulnerabilities and unsafe patterns
- Style Checker — evaluates code style, naming conventions, readability, and documentation
- Manager Agent — receives all specialist reports and produces the final unified review
The specialists run in parallel (they are independent). The manager runs after all specialists complete.
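This fan-out/fan-in flow can be sketched with plain asyncio before any agents are involved. The coroutines below are stand-ins for the real agent calls, not part of the Agents SDK:

```python
import asyncio

async def specialist(name: str, delay: float) -> str:
    # Stand-in for a real agent call; sleep simulates model latency.
    await asyncio.sleep(delay)
    return f"{name} report"

async def manager(reports: list[str]) -> str:
    # Stand-in for the manager agent: consumes all specialist reports.
    return f"final review from {len(reports)} reports"

async def pipeline() -> str:
    # Fan out: the specialists run concurrently, so total latency is
    # roughly the slowest specialist, not the sum of all three.
    reports = await asyncio.gather(
        specialist("analyzer", 0.01),
        specialist("security", 0.02),
        specialist("style", 0.01),
    )
    # Fan in: the manager runs only after every specialist completes.
    return await manager(list(reports))

print(asyncio.run(pipeline()))
```

Because `gather` preserves argument order, the manager always receives the reports in a stable order regardless of which specialist finishes first.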
Defining the Output Schemas
Structured outputs are essential for this pipeline. Each specialist produces a typed report, and the manager consumes all reports to produce the final review.
from pydantic import BaseModel
from enum import Enum

class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFO = "info"

class ReviewFinding(BaseModel):
    line_range: str  # e.g., "15-22" or "42"
    severity: Severity
    category: str
    description: str
    suggestion: str
    code_snippet: str  # The problematic code

class SpecialistReport(BaseModel):
    agent_name: str
    summary: str
    findings: list[ReviewFinding]
    overall_assessment: str

class FinalReview(BaseModel):
    summary: str
    critical_issues: list[ReviewFinding]
    high_issues: list[ReviewFinding]
    medium_issues: list[ReviewFinding]
    low_issues: list[ReviewFinding]
    positive_observations: list[str]
    recommendation: str  # "approve", "request_changes", "needs_discussion"
Building the Specialist Agents
Each specialist has narrowly focused instructions and returns a SpecialistReport. Narrow focus is what makes them effective — they are not trying to review everything, just their specific dimension.
from agents import Agent

code_analyzer = Agent(
    name="CodeAnalyzer",
    instructions="""You are an expert code analyst. Review the provided code
focusing EXCLUSIVELY on:

1. Logic errors — incorrect conditions, off-by-one errors, wrong operators
2. Control flow — unreachable code, missing break/return, infinite loops
3. Error handling — uncaught exceptions, swallowed errors, missing validation
4. Edge cases — null/undefined handling, empty collections, boundary values
5. Resource management — unclosed connections, memory leaks, missing cleanup

Do NOT comment on style, naming, or security. Those are handled by other
reviewers. Focus only on correctness and robustness.

For each finding, specify the exact line range, provide the problematic
code snippet, explain the issue, and suggest a specific fix.""",
    model="gpt-4o",
    output_type=SpecialistReport,
)

security_reviewer = Agent(
    name="SecurityReviewer",
    instructions="""You are a security-focused code reviewer. Review the
provided code focusing EXCLUSIVELY on:

1. Injection vulnerabilities — SQL injection, command injection, XSS
2. Authentication/authorization — missing auth checks, privilege escalation
3. Data exposure — sensitive data in logs, responses, or error messages
4. Input validation — unsanitized user input, missing bounds checks
5. Cryptographic issues — weak algorithms, hardcoded secrets, insecure random
6. Dependency risks — known vulnerable patterns, unsafe deserialization

Do NOT comment on code style or general logic. Focus only on security.

Rate severity based on exploitability and impact. A SQL injection in a
public endpoint is CRITICAL. A missing CSRF token on an internal-only
endpoint is MEDIUM.""",
    model="gpt-4o",
    output_type=SpecialistReport,
)

style_checker = Agent(
    name="StyleChecker",
    instructions="""You are a code style and readability reviewer. Review
the provided code focusing EXCLUSIVELY on:

1. Naming — are variables, functions, and classes named clearly and consistently?
2. Documentation — are public APIs documented? Are complex algorithms explained?
3. Code organization — is the code structured logically? Are functions too long?
4. Readability — could a new team member understand this code in one reading?
5. Consistency — does the code follow consistent patterns throughout?
6. Duplication — is there copy-pasted logic that should be extracted?

Do NOT comment on security or logic correctness. Focus only on
maintainability and readability. Severity should reflect impact on
long-term maintenance: duplicated logic across 5 functions is HIGH,
a slightly unclear variable name is LOW.""",
    model="gpt-4o",
    output_type=SpecialistReport,
)
The Manager Agent
The manager agent consumes all specialist reports and produces the final unified review. Its job is to deduplicate findings (different specialists may flag the same line for different reasons), prioritize by severity, and produce a recommendation.
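The model handles deduplication with judgment, but the core of what it is asked to do can be sketched deterministically in plain Python: group findings by line range, merge overlapping observations, and keep the highest severity. Plain dicts stand in for ReviewFinding models here, and the severity ranks mirror the Severity enum:

```python
# Lower rank = more severe, mirroring the Severity enum's ordering.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3, "info": 4}

def deduplicate(findings: list[dict]) -> list[dict]:
    merged: dict[str, dict] = {}
    for f in findings:
        key = f["line_range"]
        existing = merged.get(key)
        if existing is None:
            merged[key] = dict(f)
        else:
            # Same lines flagged twice: merge the descriptions and
            # keep the highest severity of the two.
            existing["description"] += "; " + f["description"]
            if SEVERITY_RANK[f["severity"]] < SEVERITY_RANK[existing["severity"]]:
                existing["severity"] = f["severity"]
    # Prioritize: critical first, then high, medium, low, info.
    return sorted(merged.values(), key=lambda f: SEVERITY_RANK[f["severity"]])

findings = [
    {"line_range": "9", "severity": "medium", "description": "raw SQL string"},
    {"line_range": "9", "severity": "critical", "description": "SQL injection"},
    {"line_range": "15", "severity": "high", "description": "no null check"},
]
for f in deduplicate(findings):
    print(f["line_range"], f["severity"], f["description"])
```

The two findings on line 9 collapse into a single critical finding, and the result comes back sorted by severity.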
manager_agent = Agent(
    name="ReviewManager",
    instructions="""You are a senior engineering manager synthesizing a code
review from multiple specialist reviewers. You will receive reports from
a code analyzer, security reviewer, and style checker.

Your job:
1. Deduplicate findings — if multiple reviewers flag the same code, merge
   their observations into a single finding with the highest severity
2. Prioritize — critical and high issues first, grouped by location in the code
3. Add positive observations — note things the code does well
4. Make a recommendation:
   - "approve" if no critical or high issues exist
   - "request_changes" if any critical or high issues exist
   - "needs_discussion" if findings are ambiguous or involve architectural decisions

Be constructive. Frame feedback as suggestions, not criticisms. Acknowledge
good patterns alongside issues.""",
    model="gpt-4o",
    output_type=FinalReview,
)
Running the Pipeline
The pipeline runs specialists in parallel using asyncio.gather, then feeds their combined output to the manager.
import asyncio
import json

from agents import Runner

async def review_code(code: str, filename: str = "unknown") -> FinalReview:
    """Run the full multi-agent code review pipeline."""
    review_input = f"Review this code from file '{filename}':\n\n{code}"

    # Step 1: Run all specialists in parallel
    print("Running specialist reviews in parallel...")
    specialist_results = await asyncio.gather(
        Runner.run(code_analyzer, input=review_input),
        Runner.run(security_reviewer, input=review_input),
        Runner.run(style_checker, input=review_input),
        return_exceptions=True,
    )

    # Step 2: Collect successful reports
    reports = []
    for i, result in enumerate(specialist_results):
        if isinstance(result, Exception):
            print(f"  Specialist {i} failed: {result}")
            continue
        report = result.final_output
        reports.append(report)
        print(f"  {report.agent_name}: {len(report.findings)} findings")

    if not reports:
        raise RuntimeError("All specialist agents failed")

    # Step 3: Feed combined reports to manager
    print("Synthesizing final review...")
    combined_reports = "\n\n---\n\n".join([
        f"## Report from {r.agent_name}\n"
        f"Summary: {r.summary}\n"
        f"Findings:\n{json.dumps([f.model_dump() for f in r.findings], indent=2)}\n"
        f"Assessment: {r.overall_assessment}"
        for r in reports
    ])
    manager_input = (
        f"Synthesize the following specialist code review reports into a "
        f"final unified review.\n\n"
        f"File: {filename}\n\n"
        f"{combined_reports}"
    )
    manager_result = await Runner.run(manager_agent, input=manager_input)
    return manager_result.final_output
Example: Reviewing a Real Endpoint
Let us run the pipeline on a realistic code sample — a FastAPI endpoint with several issues planted across security, logic, and style dimensions.
sample_code = '''
from fastapi import APIRouter, Depends
from sqlalchemy.orm import Session
import os, subprocess

router = APIRouter()

@router.post("/users/search")
async def search_users(query: str, db: Session = Depends(get_db)):
    # search for users
    results = db.execute(f"SELECT * FROM users WHERE name LIKE '%{query}%'")
    users = results.fetchall()
    return {"users": [dict(u) for u in users], "count": len(users)}

@router.delete("/users/{user_id}")
async def delete_user(user_id: int, db: Session = Depends(get_db)):
    user = db.query(User).filter(User.id == user_id).first()
    db.delete(user)
    db.commit()
    return {"deleted": True}

@router.post("/admin/run-report")
async def run_report(report_name: str):
    result = subprocess.run(
        f"python reports/{report_name}.py",
        shell=True, capture_output=True, text=True,
    )
    return {"output": result.stdout, "error": result.stderr}
'''

async def main():
    review = await review_code(sample_code, filename="api/users.py")

    print(f"\nRecommendation: {review.recommendation}")
    print(f"\nCritical issues ({len(review.critical_issues)}):")
    for issue in review.critical_issues:
        print(f"  [{issue.severity.value}] Line {issue.line_range}: {issue.description}")
    print(f"\nHigh issues ({len(review.high_issues)}):")
    for issue in review.high_issues:
        print(f"  [{issue.severity.value}] Line {issue.line_range}: {issue.description}")
    print(f"\nPositive observations:")
    for obs in review.positive_observations:
        print(f"  + {obs}")

asyncio.run(main())
The specialists would identify: SQL injection in the search endpoint (security, critical), command injection in the report endpoint (security, critical), no null check before deleting a user (code analyzer, high), no authentication on the delete and admin endpoints (security, high), raw SQL instead of ORM query (style, medium), and minimal documentation (style, low).
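The fix the security reviewer would suggest for the search endpoint is a parameterized query, where the driver binds user input as data rather than splicing it into the SQL string. A minimal sketch of the difference using the stdlib sqlite3 module (the table and rows here are hypothetical, not the sample's schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")

query = "ali"

# Vulnerable pattern from the sample endpoint: user input interpolated
# directly into the SQL string.
#   conn.execute(f"SELECT * FROM users WHERE name LIKE '%{query}%'")

# Safe pattern: the driver binds the value, so input like
# "%' OR '1'='1" is treated as data, not as SQL.
rows = conn.execute(
    "SELECT * FROM users WHERE name LIKE ?", (f"%{query}%",)
).fetchall()
print(rows)
```

The same idea applies to the SQLAlchemy session in the sample: pass bound parameters instead of building the statement with an f-string.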
Extending the Pipeline
The modular architecture makes it straightforward to add new specialist agents. Common additions include a performance reviewer that looks for N+1 queries, unnecessary allocations, and missing pagination, and a test coverage agent that identifies untested code paths and suggests test cases.
performance_reviewer = Agent(
    name="PerformanceReviewer",
    instructions="""You are a performance-focused code reviewer. Focus on:

1. Database query efficiency — N+1 queries, missing indexes, full table scans
2. Memory usage — large object creation in loops, unbounded list growth
3. I/O patterns — blocking calls in async contexts, missing connection pooling
4. Pagination — endpoints returning unbounded result sets
5. Caching opportunities — repeated expensive computations

Only flag issues with measurable performance impact.""",
    model="gpt-4o",
    output_type=SpecialistReport,
)
To add it to the pipeline, include it in the asyncio.gather call alongside the other specialists. The manager agent's instructions already handle an arbitrary number of specialist reports because it processes them generically.
Cost and Latency Profile
For a 200-line code file, the pipeline costs approximately:
- 3 specialist agents running in parallel on gpt-4o: ~15,000 input tokens + ~3,000 output tokens each
- 1 manager agent: ~12,000 input tokens (combined reports) + ~2,000 output tokens
- Total: ~68,000 tokens, approximately $0.40-0.60 per review
Latency is dominated by the specialist parallel step (5-8 seconds) plus the manager step (3-5 seconds), giving a total pipeline time of 8-13 seconds. This is fast enough for CI/CD integration on pull requests.
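Summing the per-agent estimates above gives the total token figure:

```python
# Per-review token estimates from the cost profile above.
specialists = 3
spec_in, spec_out = 15_000, 3_000   # each specialist, in parallel
mgr_in, mgr_out = 12_000, 2_000    # manager consumes combined reports

total = specialists * (spec_in + spec_out) + mgr_in + mgr_out
print(total)  # 68000
```

Note that parallelism saves latency, not tokens: all three specialists are billed in full even though they run concurrently.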
Integration with CI/CD
The pipeline can be triggered on pull request events. Parse the diff to extract changed files, run each file through the review pipeline, and post the findings as PR comments. Filter to only critical and high severity findings for automated comments — medium and low findings can go into a summary comment that does not block the review.
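The severity split described above is a simple filter over the final review's findings. A sketch with plain dicts standing in for ReviewFinding models:

```python
def split_for_pr(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    # Blocking findings become inline PR comments; the rest go into
    # a single non-blocking summary comment.
    blocking = [f for f in findings if f["severity"] in ("critical", "high")]
    summary = [f for f in findings if f["severity"] in ("medium", "low", "info")]
    return blocking, summary

findings = [
    {"severity": "critical", "description": "SQL injection in search endpoint"},
    {"severity": "medium", "description": "raw SQL instead of ORM query"},
    {"severity": "low", "description": "minimal documentation"},
]
blocking, summary = split_for_pr(findings)
print(len(blocking), len(summary))  # 1 2
```

In a real CI job, `blocking` would map to inline review comments that fail the check, while `summary` feeds one aggregated comment.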
This multi-agent code review pipeline demonstrates how specialized agents working in parallel, coordinated by a manager, produce dramatically better results than any single-agent approach. Each specialist focuses deeply on its domain, and the manager synthesizes a coherent, actionable review.
Written by
CallSphere Team