Learn Agentic AI

Building a Debate Agent System: Two AI Agents That Argue to Find Better Answers

Build a multi-agent debate system where pro and con agents construct opposing arguments while a judge agent evaluates quality, driving convergence toward more accurate and nuanced answers.

Why AI Debates Produce Better Answers

A single LLM answering a question tends to commit to one perspective early and then reinforce it. This leads to confirmation bias, missed nuances, and overconfident conclusions. The debate architecture fixes this by forcing two agents to argue opposing sides while a third agent judges the quality of their arguments.

Research from Anthropic, Google DeepMind, and others has shown that multi-agent debate consistently improves accuracy on reasoning, math, and factual tasks compared to single-agent approaches. The mechanism is simple: adversarial pressure exposes weak reasoning that self-reflection alone would miss.

Architecture Overview

The system has three agent roles:

  1. Pro Agent — argues in favor of a position
  2. Con Agent — argues against the same position
  3. Judge Agent — evaluates arguments, identifies the strongest points, and synthesizes a final answer
Two Pydantic models capture the state: one per round, and one for the debate as a whole.

from pydantic import BaseModel
from openai import OpenAI
import json

client = OpenAI()

class DebateRound(BaseModel):
    round_number: int
    pro_argument: str
    con_argument: str
    judge_feedback: str
    pro_score: float
    con_score: float

class DebateResult(BaseModel):
    question: str
    rounds: list[DebateRound]
    final_answer: str
    confidence: float

The Debater Agents

Each debater receives the question, its assigned side, and the history of previous rounds so it can respond to the opponent:

def create_debater_message(
    question: str,
    side: str,
    history: list[DebateRound],
    round_num: int,
) -> str:
    """Generate an argument for one side of the debate."""
    history_text = ""
    for r in history:
        history_text += f"\n--- Round {r.round_number} ---\n"
        history_text += f"Pro: {r.pro_argument}\n"
        history_text += f"Con: {r.con_argument}\n"
        history_text += f"Judge: {r.judge_feedback}\n"

    side_instruction = {
        "pro": "Argue IN FAVOR of the position. Build on your previous points and directly counter the opponent's strongest arguments.",
        "con": "Argue AGAINST the position. Build on your previous points and directly counter the opponent's strongest arguments.",
    }

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a skilled debater arguing the {side} side.
Rules:
- Make specific, evidence-based arguments
- Directly address your opponent's strongest points
- Acknowledge valid opposing points but explain why your side is stronger
- Do NOT strawman the opponent
- Be concise: 150-200 words per round

{side_instruction[side]}"""},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Debate history: {history_text}\n"
                f"Round {round_num}: Present your {side} argument."
            )},
        ],
    )
    return response.choices[0].message.content

The Judge Agent

The judge evaluates both sides after each round, scores them, and provides feedback that guides the next round:


def judge_round(
    question: str,
    pro_argument: str,
    con_argument: str,
    history: list[DebateRound],
) -> dict:
    """Judge evaluates both arguments, with prior rounds as context."""
    # Summarize earlier judgments so the judge can track how the debate evolved
    history_text = "\n".join(
        f"Round {r.round_number}: Pro {r.pro_score}, Con {r.con_score}. {r.judge_feedback}"
        for r in history
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are an impartial debate judge.
Evaluate both arguments on:
1. Logical validity (is the reasoning sound?)
2. Evidence quality (are claims supported?)
3. Responsiveness (does it address the opponent's points?)
4. Persuasiveness (how compelling is the overall argument?)

Score each side 0-10. Identify:
- The single strongest point from each side
- The single weakest point from each side
- What each side should address in the next round

Be genuinely impartial. Do not favor either side by default."""},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Previous rounds:\n{history_text}\n"
                f"Pro argument: {pro_argument}\n"
                f"Con argument: {con_argument}\n"
                "Evaluate and return JSON with: pro_score, con_score, "
                "feedback, strongest_pro_point, strongest_con_point."
            )},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

The Debate Loop

def run_debate(question: str, num_rounds: int = 3) -> DebateResult:
    """Run a full multi-round debate and produce a final answer."""
    rounds: list[DebateRound] = []

    for round_num in range(1, num_rounds + 1):
        # Both sides argue
        pro_arg = create_debater_message(question, "pro", rounds, round_num)
        con_arg = create_debater_message(question, "con", rounds, round_num)

        # Judge evaluates
        judgment = judge_round(question, pro_arg, con_arg, rounds)

        round_result = DebateRound(
            round_number=round_num,
            pro_argument=pro_arg,
            con_argument=con_arg,
            judge_feedback=judgment.get("feedback", ""),
            pro_score=judgment.get("pro_score", 5.0),
            con_score=judgment.get("con_score", 5.0),
        )
        rounds.append(round_result)
        print(f"Round {round_num}: Pro={round_result.pro_score:.1f} Con={round_result.con_score:.1f}")

    # Synthesize final answer from the full debate
    final = synthesize_debate(question, rounds)
    return DebateResult(
        question=question,
        rounds=rounds,
        final_answer=final["answer"],
        confidence=final["confidence"],
    )

Synthesis: Combining the Best of Both Sides

The final answer should not simply pick a winner. Instead, it synthesizes the strongest points from both sides into a nuanced conclusion:

def synthesize_debate(question: str, rounds: list[DebateRound]) -> dict:
    """Produce a final answer that incorporates the best arguments."""
    debate_summary = "\n".join(
        f"Round {r.round_number}: Pro({r.pro_score}) said: {r.pro_argument[:200]}... "
        f"Con({r.con_score}) said: {r.con_argument[:200]}..."
        for r in rounds
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Synthesize the debate into a final answer.
- Incorporate the strongest validated points from both sides
- Acknowledge genuine uncertainty where the debate was inconclusive
- Provide a clear conclusion with appropriate caveats
- Rate your confidence (0.0-1.0) based on how decisive the debate was"""},
            {"role": "user", "content": (
                f"Question: {question}\nDebate summary:\n{debate_summary}\n"
                "Return JSON with keys: answer, confidence."
            )},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

Convergence and Quality

Well-designed debates converge over 2-4 rounds. Watch for: (1) debaters repeating arguments without new substance (time to stop), (2) scores stabilizing (the strongest arguments have been found), or (3) the judge identifying that both sides agree on key points (consensus reached). Set a maximum round limit of 4-5 to control costs.
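The stopping criteria above can be sketched as a simple check over recent judge scores. This is a sketch, not code from the system itself: the plain dataclass mirrors the `pro_score`/`con_score` fields of `DebateRound`, and the 0.5 tolerance and two-round window are arbitrary assumptions you should tune.

```python
from dataclasses import dataclass

@dataclass
class RoundScores:
    # Mirrors the pro_score/con_score fields of DebateRound
    pro_score: float
    con_score: float

def scores_converged(rounds: list[RoundScores], window: int = 2, tol: float = 0.5) -> bool:
    """True when both sides' scores have moved less than `tol`
    across the last `window + 1` rounds, i.e. they have stabilized."""
    if len(rounds) <= window:
        return False  # not enough rounds yet to judge stability
    recent = rounds[-(window + 1):]
    pro = [r.pro_score for r in recent]
    con = [r.con_score for r in recent]
    return (max(pro) - min(pro) < tol) and (max(con) - min(con) < tol)
```

Inside `run_debate`, you could break out of the loop once `scores_converged(rounds)` returns true, rather than always running to the maximum round limit.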

FAQ

Does the debate always improve answer quality?

For factual and reasoning tasks, yes — studies consistently show improvement over single-agent baselines. For creative tasks, debates can be overly analytical and suppress creative thinking. For opinion-based questions, debates produce more nuanced answers but may feel indecisive.

Can you use more than two debaters?

Yes. A "panel" format with 3-4 agents each defending a different position works well for questions with more than two viable answers. The judge then evaluates across all positions. Be aware that costs scale linearly with the number of debaters.
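The two-sided `side_instruction` mapping from earlier generalizes naturally to a panel: build one instruction per position and loop over all of them each round. The sketch below is an assumption about how you might structure this, with hypothetical position labels; it also makes the linear cost scaling concrete.

```python
def build_panel_instructions(positions: list[str]) -> dict[str, str]:
    """One system instruction per position; each panelist defends its own
    position and must engage with every rival position."""
    return {
        pos: (
            f"Argue FOR the position: {pos!r}. Build on your previous points "
            f"and directly counter the strongest arguments made for the "
            f"rival positions: {[p for p in positions if p != pos]}."
        )
        for pos in positions
    }

def calls_per_debate(num_positions: int, num_rounds: int) -> int:
    # One debater call per position per round, plus one judge call per round
    return (num_positions + 1) * num_rounds
```

With three positions and three rounds that is 12 model calls, versus 9 for the two-sided format, so panels are only worth the extra cost when the question genuinely has more than two viable answers.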

How do you prevent the debaters from agreeing too quickly?

Assign strong contrarian system prompts and penalize the judge for scoring both sides equally in early rounds. Some implementations use a "devil's advocate" instruction that forces the con agent to find flaws even when it might privately agree with the pro side.
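Both tactics can be sketched in a few lines: a contrarian suffix you might append to the con agent's system prompt, and a check that flags suspiciously equal scores in early rounds. The 0.5 margin and two-round cutoff are arbitrary choices for illustration, not values from the system above.

```python
DEVILS_ADVOCATE_SUFFIX = (
    "You are a devil's advocate. Even if you privately find the pro side "
    "persuasive, your job is to surface every flaw, risk, and unstated "
    "assumption in it. Never concede the overall question."
)

def flag_early_tie(round_number: int, pro_score: float, con_score: float,
                   early_rounds: int = 2, margin: float = 0.5) -> bool:
    """True when the judge scored both sides nearly equally in an early
    round, which often signals premature agreement rather than a real tie."""
    return round_number <= early_rounds and abs(pro_score - con_score) < margin
```

When a round is flagged, one option is to re-run the judge with an instruction to commit to a leader; another is to append `DEVILS_ADVOCATE_SUFFIX` to the con agent's prompt for the next round.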


#DebateAgents #MultiAgentSystems #AdversarialAI #AIDebate #AgenticAI #PythonAI #ReasoningImprovement #AgentArchitecture

Written by CallSphere Team