Learn Agentic AI

Building a Code Review Multi-Agent Pipeline

Build a complete multi-agent code review system with specialized agents for analysis, security review, style checking, and a manager agent that synthesizes findings into actionable review comments.

Why Multi-Agent Code Review

Single-prompt code review — pasting code into ChatGPT and asking "review this" — produces shallow, generic feedback. The model tries to be everything at once: security auditor, style guide enforcer, performance analyst, and architecture reviewer. The result is a laundry list of surface-level observations that misses the deep issues.

Human code review teams work differently. A security specialist focuses exclusively on vulnerabilities. A performance engineer looks for bottlenecks and unnecessary allocations. A senior architect evaluates design decisions. Each reviewer brings specialized expertise and a focused lens.

Multi-agent code review replicates this structure. Specialized agents focus on specific review dimensions, and a manager agent synthesizes their findings into a coherent, prioritized review. The result is dramatically better than what any single agent — or single prompt — can produce.

System Architecture

The code review pipeline has three specialist agents and one manager agent:

  1. Code Analyzer — examines logic, control flow, error handling, and potential bugs
  2. Security Reviewer — focuses exclusively on security vulnerabilities and unsafe patterns
  3. Style Checker — evaluates code style, naming conventions, readability, and documentation
  4. Manager Agent — receives all specialist reports and produces the final unified review

The specialists run in parallel (they are independent). The manager runs after all specialists complete.

Defining the Output Schemas

Structured outputs are essential for this pipeline. Each specialist produces a typed report, and the manager consumes all reports to produce the final review.

from pydantic import BaseModel
from enum import Enum

class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFO = "info"

class ReviewFinding(BaseModel):
    line_range: str  # e.g., "15-22" or "42"
    severity: Severity
    category: str
    description: str
    suggestion: str
    code_snippet: str  # The problematic code

class SpecialistReport(BaseModel):
    agent_name: str
    summary: str
    findings: list[ReviewFinding]
    overall_assessment: str

class FinalReview(BaseModel):
    summary: str
    critical_issues: list[ReviewFinding]
    high_issues: list[ReviewFinding]
    medium_issues: list[ReviewFinding]
    low_issues: list[ReviewFinding]
    positive_observations: list[str]
    recommendation: str  # "approve", "request_changes", "needs_discussion"
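
Because Severity subclasses str, severities serialize as plain strings, and because enum members iterate in definition order, a rank table for sorting findings falls out directly. A standalone sketch (SEVERITY_RANK is a helper introduced here for illustration, not part of the schemas above):

```python
from enum import Enum

class Severity(str, Enum):  # mirrors the definition above
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFO = "info"

# Members iterate in definition order, so lower rank = more urgent.
SEVERITY_RANK = {s: i for i, s in enumerate(Severity)}

severities = [Severity.LOW, Severity.CRITICAL, Severity.MEDIUM]
ordered = sorted(severities, key=SEVERITY_RANK.__getitem__)
assert ordered[0] is Severity.CRITICAL
assert Severity.CRITICAL == "critical"  # str subclass: JSON-friendly
```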

Building the Specialist Agents

Each specialist has narrowly focused instructions and returns a SpecialistReport. Narrow focus is what makes them effective — they are not trying to review everything, just their specific dimension.

from agents import Agent

code_analyzer = Agent(
    name="CodeAnalyzer",
    instructions="""You are an expert code analyst. Review the provided code
    focusing EXCLUSIVELY on:

    1. Logic errors — incorrect conditions, off-by-one errors, wrong operators
    2. Control flow — unreachable code, missing break/return, infinite loops
    3. Error handling — uncaught exceptions, swallowed errors, missing validation
    4. Edge cases — null/undefined handling, empty collections, boundary values
    5. Resource management — unclosed connections, memory leaks, missing cleanup

    Do NOT comment on style, naming, or security. Those are handled by other
    reviewers. Focus only on correctness and robustness.

    For each finding, specify the exact line range, provide the problematic
    code snippet, explain the issue, and suggest a specific fix.""",
    model="gpt-4o",
    output_type=SpecialistReport,
)

security_reviewer = Agent(
    name="SecurityReviewer",
    instructions="""You are a security-focused code reviewer. Review the
    provided code focusing EXCLUSIVELY on:

    1. Injection vulnerabilities — SQL injection, command injection, XSS
    2. Authentication/authorization — missing auth checks, privilege escalation
    3. Data exposure — sensitive data in logs, responses, or error messages
    4. Input validation — unsanitized user input, missing bounds checks
    5. Cryptographic issues — weak algorithms, hardcoded secrets, insecure random
    6. Dependency risks — known vulnerable patterns, unsafe deserialization

    Do NOT comment on code style or general logic. Focus only on security.
    Rate severity based on exploitability and impact. A SQL injection in a
    public endpoint is CRITICAL. A missing CSRF token on an internal-only
    endpoint is MEDIUM.""",
    model="gpt-4o",
    output_type=SpecialistReport,
)

style_checker = Agent(
    name="StyleChecker",
    instructions="""You are a code style and readability reviewer. Review
    the provided code focusing EXCLUSIVELY on:

    1. Naming — are variables, functions, and classes named clearly and consistently?
    2. Documentation — are public APIs documented? Are complex algorithms explained?
    3. Code organization — is the code structured logically? Are functions too long?
    4. Readability — could a new team member understand this code in one reading?
    5. Consistency — does the code follow consistent patterns throughout?
    6. Duplication — is there copy-pasted logic that should be extracted?

    Do NOT comment on security or logic correctness. Focus only on
    maintainability and readability. Severity should reflect impact on
    long-term maintenance: duplicated logic across 5 functions is HIGH,
    a slightly unclear variable name is LOW.""",
    model="gpt-4o",
    output_type=SpecialistReport,
)

The Manager Agent

The manager agent consumes all specialist reports and produces the final unified review. Its job is to deduplicate findings (different specialists may flag the same line for different reasons), prioritize by severity, and produce a recommendation.
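
The manager performs this merge through its instructions, but the rule itself is mechanical. A sketch of the same dedup logic in plain Python, using dicts instead of the ReviewFinding model for brevity (merge_findings is a hypothetical helper, not part of the pipeline):

```python
from collections import OrderedDict

SEVERITY_ORDER = ["critical", "high", "medium", "low", "info"]
RANK = {name: i for i, name in enumerate(SEVERITY_ORDER)}

def merge_findings(findings: list[dict]) -> list[dict]:
    """Merge findings that flag the same line range, keeping the
    highest severity and concatenating the descriptions."""
    merged: "OrderedDict[str, dict]" = OrderedDict()
    for f in findings:
        key = f["line_range"]
        if key not in merged:
            merged[key] = dict(f)
        else:
            existing = merged[key]
            if RANK[f["severity"]] < RANK[existing["severity"]]:
                existing["severity"] = f["severity"]
            existing["description"] += " / " + f["description"]
    # Most urgent issues first.
    return sorted(merged.values(), key=lambda f: RANK[f["severity"]])

findings = [
    {"line_range": "4", "severity": "medium", "description": "raw SQL"},
    {"line_range": "4", "severity": "critical", "description": "SQL injection"},
    {"line_range": "9", "severity": "high", "description": "no null check"},
]
result = merge_findings(findings)
assert len(result) == 2
assert result[0]["severity"] == "critical"
```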

manager_agent = Agent(
    name="ReviewManager",
    instructions="""You are a senior engineering manager synthesizing a code
    review from multiple specialist reviewers. You will receive reports from
    a code analyzer, security reviewer, and style checker.

    Your job:
    1. Deduplicate findings — if multiple reviewers flag the same code, merge
       their observations into a single finding with the highest severity
    2. Prioritize — critical and high issues first, grouped by location in the code
    3. Add positive observations — note things the code does well
    4. Make a recommendation:
       - "approve" if no critical or high issues exist
       - "request_changes" if any critical or high issues exist
       - "needs_discussion" if findings are ambiguous or involve architectural decisions

    Be constructive. Frame feedback as suggestions, not criticisms. Acknowledge
    good patterns alongside issues.""",
    model="gpt-4o",
    output_type=FinalReview,
)

Running the Pipeline

The pipeline runs specialists in parallel using asyncio.gather, then feeds their combined output to the manager.

import asyncio
import json
from agents import Runner

async def review_code(code: str, filename: str = "unknown") -> FinalReview:
    """Run the full multi-agent code review pipeline."""
    review_input = f"Review this code from file '{filename}':\n\n{code}"

    # Step 1: Run all specialists in parallel
    print("Running specialist reviews in parallel...")
    specialist_results = await asyncio.gather(
        Runner.run(code_analyzer, input=review_input),
        Runner.run(security_reviewer, input=review_input),
        Runner.run(style_checker, input=review_input),
        return_exceptions=True,
    )

    # Step 2: Collect successful reports
    reports = []
    for i, result in enumerate(specialist_results):
        if isinstance(result, Exception):
            print(f"  Specialist {i} failed: {result}")
            continue
        report = result.final_output
        reports.append(report)
        print(f"  {report.agent_name}: {len(report.findings)} findings")

    if not reports:
        raise RuntimeError("All specialist agents failed")

    # Step 3: Feed combined reports to manager
    print("Synthesizing final review...")
    combined_reports = "\n\n---\n\n".join([
        f"## Report from {r.agent_name}\n"
        f"Summary: {r.summary}\n"
        f"Findings:\n{json.dumps([f.model_dump() for f in r.findings], indent=2)}\n"
        f"Assessment: {r.overall_assessment}"
        for r in reports
    ])

    manager_input = (
        f"Synthesize the following specialist code review reports into a "
        f"final unified review.\n\n"
        f"File: {filename}\n\n"
        f"{combined_reports}"
    )

    manager_result = await Runner.run(manager_agent, input=manager_input)
    return manager_result.final_output

Example: Reviewing a Real Endpoint

Let us run the pipeline on a realistic code sample — a FastAPI endpoint with several issues planted across security, logic, and style dimensions.

sample_code = '''
from fastapi import APIRouter, Depends
from sqlalchemy.orm import Session
import os, subprocess

router = APIRouter()

@router.post("/users/search")
async def search_users(query: str, db: Session = Depends(get_db)):
    # search for users
    results = db.execute(f"SELECT * FROM users WHERE name LIKE '%{query}%'")
    users = results.fetchall()
    return {"users": [dict(u) for u in users], "count": len(users)}

@router.delete("/users/{user_id}")
async def delete_user(user_id: int, db: Session = Depends(get_db)):
    user = db.query(User).filter(User.id == user_id).first()
    db.delete(user)
    db.commit()
    return {"deleted": True}

@router.post("/admin/run-report")
async def run_report(report_name: str):
    result = subprocess.run(
        f"python reports/{report_name}.py",
        shell=True, capture_output=True, text=True,
    )
    return {"output": result.stdout, "error": result.stderr}
'''

async def main():
    review = await review_code(sample_code, filename="api/users.py")

    print(f"\nRecommendation: {review.recommendation}")
    print(f"\nCritical issues ({len(review.critical_issues)}):")
    for issue in review.critical_issues:
        print(f"  [{issue.severity.value}] Line {issue.line_range}: {issue.description}")

    print(f"\nHigh issues ({len(review.high_issues)}):")
    for issue in review.high_issues:
        print(f"  [{issue.severity.value}] Line {issue.line_range}: {issue.description}")

    print(f"\nPositive observations:")
    for obs in review.positive_observations:
        print(f"  + {obs}")

asyncio.run(main())

The specialists would identify: SQL injection in the search endpoint (security, critical), command injection in the report endpoint (security, critical), no null check before deleting a user (code analyzer, high), no authentication on the delete and admin endpoints (security, high), raw SQL instead of ORM query (style, medium), and minimal documentation (style, low).

Extending the Pipeline

The modular architecture makes it straightforward to add new specialist agents. Common additions include a performance reviewer that looks for N+1 queries, unnecessary allocations, and missing pagination, and a test coverage agent that identifies untested code paths and suggests test cases.

performance_reviewer = Agent(
    name="PerformanceReviewer",
    instructions="""You are a performance-focused code reviewer. Focus on:
    1. Database query efficiency — N+1 queries, missing indexes, full table scans
    2. Memory usage — large object creation in loops, unbounded list growth
    3. I/O patterns — blocking calls in async contexts, missing connection pooling
    4. Pagination — endpoints returning unbounded result sets
    5. Caching opportunities — repeated expensive computations
    Only flag issues with measurable performance impact.""",
    model="gpt-4o",
    output_type=SpecialistReport,
)

To add it to the pipeline, include it in the asyncio.gather call alongside the other specialists. The manager agent's instructions already handle an arbitrary number of specialist reports because it processes them generically.
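
The gather step generalizes to any roster. A sketch of a parameterized runner, with run_one standing in for Runner.run so the snippet stays self-contained (run_specialists and fake_run are illustrative, not SDK APIs):

```python
import asyncio

async def run_specialists(specialists, review_input, run_one):
    """Run every specialist concurrently; drop the ones that fail.

    `run_one(agent, review_input)` is a stand-in for Runner.run from
    the pipeline above; `specialists` is any iterable of agents.
    """
    results = await asyncio.gather(
        *(run_one(agent, review_input) for agent in specialists),
        return_exceptions=True,
    )
    return [r for r in results if not isinstance(r, Exception)]

# Demo with stub "agents" (plain strings) and a stub runner.
async def fake_run(agent, review_input):
    if agent == "flaky":
        raise RuntimeError("model timeout")
    return f"{agent} reviewed {len(review_input)} chars"

reports = asyncio.run(
    run_specialists(["analyzer", "security", "flaky"], "code...", fake_run)
)
assert len(reports) == 2
```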

Cost and Latency Profile

For a 200-line code file, the pipeline costs approximately:

  • 3 specialist agents running in parallel on gpt-4o: ~15,000 input tokens + ~3,000 output tokens each
  • 1 manager agent: ~12,000 input tokens (combined reports) + ~2,000 output tokens
  • Total: ~68,000 tokens (~57,000 input + ~11,000 output), approximately $0.40-0.60 per review
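
The arithmetic behind those figures, using the per-agent estimates above (the gpt-4o prices here are an assumption — $5 input / $15 output per million tokens — so check current rates):

```python
# Per-agent token estimates from the profile above.
SPECIALISTS = 3
SPEC_IN, SPEC_OUT = 15_000, 3_000   # per specialist
MGR_IN, MGR_OUT = 12_000, 2_000     # manager

# Assumed gpt-4o pricing, USD per million tokens (verify current rates).
PRICE_IN, PRICE_OUT = 5.00, 15.00

total_in = SPECIALISTS * SPEC_IN + MGR_IN     # 57,000
total_out = SPECIALISTS * SPEC_OUT + MGR_OUT  # 11,000
cost = total_in / 1e6 * PRICE_IN + total_out / 1e6 * PRICE_OUT

assert total_in + total_out == 68_000
assert abs(cost - 0.45) < 0.01  # within the $0.40-0.60 range quoted above
```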

Latency is dominated by the specialist parallel step (5-8 seconds) plus the manager step (3-5 seconds), giving a total pipeline time of 8-13 seconds. This is fast enough for CI/CD integration on pull requests.

Integration with CI/CD

The pipeline can be triggered on pull request events. Parse the diff to extract changed files, run each file through the review pipeline, and post the findings as PR comments. Filter to only critical and high severity findings for automated comments — medium and low findings can go into a summary comment that does not block the review.
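
The severity split is a simple partition. A sketch (split_for_pr is a hypothetical helper operating on plain finding dicts; the actual comment-posting call depends on your CI provider):

```python
BLOCKING = {"critical", "high"}

def split_for_pr(findings: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition findings: blocking ones become inline PR comments,
    the rest are rolled into a single non-blocking summary comment."""
    inline = [f for f in findings if f["severity"] in BLOCKING]
    summary = [f for f in findings if f["severity"] not in BLOCKING]
    return inline, summary

findings = [
    {"severity": "critical", "description": "SQL injection"},
    {"severity": "medium", "description": "raw SQL instead of ORM"},
    {"severity": "low", "description": "missing docstring"},
]
inline, summary = split_for_pr(findings)
assert [f["severity"] for f in inline] == ["critical"]
assert len(summary) == 2
```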

This multi-agent code review pipeline demonstrates how specialized agents working in parallel, coordinated by a manager, produce dramatically better results than any single-agent approach. Each specialist focuses deeply on its domain, and the manager synthesizes a coherent, actionable review.

Written by CallSphere Team