
MCPServerManager: Orchestrating Multiple MCP Servers

Use MCPServerManager to orchestrate multiple MCP server connections with automatic failure detection, reconnection strategies, and health monitoring using active_servers, failed_servers, and drop_failed_servers.

The Multi-Server Challenge

Production agents rarely use a single MCP server. A typical enterprise agent might connect to:

  • A filesystem server for document access
  • A database server for customer records
  • A search server for knowledge base queries
  • A custom business logic server for domain operations
  • An email server for sending notifications

When everything is healthy, this works well. But in production, servers crash, network connections drop, and deployments restart services. A single failed server can break the entire agent if connections are not managed properly.

MCPServerManager is the orchestration layer that handles multi-server lifecycle management. It tracks which servers are active and which have failed, and it gives you hooks for recovery, so your agent degrades gracefully instead of crashing.

Setting Up MCPServerManager

MCPServerManager wraps multiple MCP server instances and provides a unified interface for connection management:

from agents.mcp import (
    MCPServerStdio,
    MCPServerStreamableHttp,  # note the casing: "Http", not "HTTP"
    MCPServerManager,
)

# Define your servers
filesystem = MCPServerStdio(
    name="Filesystem",
    params={
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/data"],
    },
    cache_tools_list=True,
)

database = MCPServerStreamableHttp(
    name="Database",
    params={"url": "http://db-mcp:8001/mcp"},
    cache_tools_list=True,
)

search = MCPServerStreamableHttp(
    name="Search",
    params={"url": "http://search-mcp:8002/mcp"},
    cache_tools_list=True,
)

custom_tools = MCPServerStdio(
    name="BusinessLogic",
    params={
        "command": "python",
        "args": ["business_logic_server.py"],
    },
    cache_tools_list=True,
)

# Create the manager
manager = MCPServerManager(
    servers=[filesystem, database, search, custom_tools]
)

Connecting with the Manager

Use the manager as an async context manager. It handles connecting to all servers and provides status tracking:

from agents import Agent, Runner

async def run_agent(user_message: str):
    async with manager:
        # Check which servers connected successfully
        active = manager.active_servers
        failed = manager.failed_servers

        print(f"Active servers: {[s.name for s in active]}")
        print(f"Failed servers: {[s.name for s in failed]}")

        if not active:
            return "All MCP servers are unavailable. Please try again later."

        # Give the agent only the servers that actually connected; an
        # unconnected server raises when the runner tries to list its tools.
        agent = Agent(
            name="Enterprise Assistant",
            instructions="You help employees with file access, data queries, and business operations.",
            mcp_servers=active,
        )

        result = await Runner.run(agent, user_message)
        return result.final_output

The key difference from managing servers individually is that MCPServerManager does not raise an exception if one server fails to connect. Instead, it tracks the failure and lets you decide how to respond.
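
To make the "track the failure instead of raising" behaviour concrete, here is a minimal standard-library sketch of the pattern. It is not the SDK's implementation: FakeServer and connect_all are hypothetical names invented for this illustration, and the only assumption is that each server exposes an async connect() method.

```python
import asyncio


class FakeServer:
    """Hypothetical stand-in for an MCP server; not an SDK class."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    async def connect(self) -> None:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unreachable")


async def connect_all(servers):
    """Connect every server and return (active, failed) instead of raising."""
    results = await asyncio.gather(
        *(s.connect() for s in servers), return_exceptions=True
    )
    active, failed = [], {}
    for server, result in zip(servers, results):
        if isinstance(result, BaseException):
            failed[server.name] = result  # remember why it failed
        else:
            active.append(server)
    return active, failed


if __name__ == "__main__":
    demo = [FakeServer("Filesystem"), FakeServer("Database", healthy=False)]
    ok, bad = asyncio.run(connect_all(demo))
    print([s.name for s in ok], list(bad))
```

One broken server ends up in the failed mapping together with its exception, while the others stay usable. That is the property the rest of this article relies on.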

Monitoring Active and Failed Servers

MCPServerManager provides two properties for monitoring server health:

  • active_servers — A list of server instances that are currently connected and operational.
  • failed_servers — A list of server instances that failed to connect or lost their connection.

Use these to build health checks and adaptive behavior:

from fastapi import FastAPI

app = FastAPI()

@app.get("/health/mcp")
async def mcp_health():
    active = manager.active_servers
    failed = manager.failed_servers
    return {
        "status": "degraded" if failed else "healthy",
        "active": [s.name for s in active],
        "failed": [s.name for s in failed],
        "total": len(active) + len(failed),
        "active_count": len(active),
    }

You can also use server status to adjust agent behavior dynamically:

async def adaptive_instructions(run_context, agent):
    active_names = {s.name for s in manager.active_servers}
    base = "You are an enterprise assistant."

    if "Database" not in active_names:
        base += (
            " The database server is currently unavailable. "
            "Let the user know you cannot look up records right now "
            "and suggest they try again in a few minutes."
        )

    if "Search" not in active_names:
        base += (
            " The search server is offline. You cannot search the "
            "knowledge base. Answer from your training data and note "
            "that results may not reflect the latest documentation."
        )

    return base

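# Create the agent after the manager has connected (inside `async with manager:`)
# so that only connected servers are passed to it.
agent = Agent(
    name="Enterprise Assistant",
    instructions=adaptive_instructions,
    mcp_servers=manager.active_servers,
)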
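
As the number of servers grows, the chain of if-statements in adaptive_instructions gets long. One possible refactoring is a lookup table, sketched below. DEGRADATION_NOTES and build_instructions are hypothetical names introduced for illustration, and the notice texts are shortened versions of the ones above.

```python
# Hypothetical mapping from server name to the notice added when it is down.
DEGRADATION_NOTES = {
    "Database": (
        "The database server is currently unavailable. "
        "Tell the user you cannot look up records right now."
    ),
    "Search": (
        "The search server is offline. Answer from general knowledge "
        "and note that results may be out of date."
    ),
}


def build_instructions(active_names, base="You are an enterprise assistant."):
    """Append one notice for every known server missing from active_names."""
    notes = [
        note
        for name, note in DEGRADATION_NOTES.items()
        if name not in active_names
    ]
    return " ".join([base, *notes])


if __name__ == "__main__":
    # Database is down, Search is up.
    print(build_instructions({"Search", "Filesystem"}))
```

Inside adaptive_instructions you could then return build_instructions({s.name for s in manager.active_servers}), which keeps the wording for each outage in one place.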

Dropping Failed Servers

A server that failed to connect cannot serve tool calls, so it should not be passed to the agent. In the Agents SDK, drop_failed_servers is a constructor option on MCPServerManager rather than a method you call later. With it enabled, active_servers contains only the servers that connected, and the failures remain visible in failed_servers:

manager = MCPServerManager(
    servers=[filesystem, database, search, custom_tools],
    drop_failed_servers=True,  # keep failed servers out of active_servers
)

async def run_with_cleanup():
    async with manager:
        # Some servers may have failed to connect
        if manager.failed_servers:
            failed_names = [s.name for s in manager.failed_servers]
            print(f"Dropped failed servers: {failed_names}")

        # Only healthy servers are handed to the agent
        agent = Agent(
            name="Enterprise Assistant",
            instructions="You help employees with file access, data queries, and business operations.",
            mcp_servers=manager.active_servers,
        )
        result = await Runner.run(agent, "Check my recent orders")
        return result.final_output

This is useful when you know a server will not recover during the current session. Keeping it out of the agent's server list prevents the agent from wasting tokens generating tool calls that will fail.

Reconnection Strategies

For long-running services, you need a strategy to reconnect failed servers. The manager itself does not auto-reconnect, but you can build reconnection logic on top of it:

import asyncio
import logging

logger = logging.getLogger(__name__)

class ResilientMCPManager:
    def __init__(self, servers, reconnect_interval=60, max_retries=5):
        self.all_servers = servers
        self.manager = MCPServerManager(servers=servers)
        self.reconnect_interval = reconnect_interval
        self.max_retries = max_retries
        self.retry_counts = {s.name: 0 for s in servers}
        self._reconnect_task = None

    async def __aenter__(self):
        await self.manager.__aenter__()
        self._reconnect_task = asyncio.create_task(self._reconnect_loop())
        return self

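    async def __aexit__(self, *args):
        if self._reconnect_task:
            self._reconnect_task.cancel()
            # Wait for the cancelled loop to unwind before closing connections
            try:
                await self._reconnect_task
            except asyncio.CancelledError:
                pass
        await self.manager.__aexit__(*args)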

    async def _reconnect_loop(self):
        while True:
            await asyncio.sleep(self.reconnect_interval)
            failed = list(self.manager.failed_servers)
            for server in failed:
                if self.retry_counts[server.name] >= self.max_retries:
                    logger.warning(
                        f"Server {server.name} exceeded max retries, skipping"
                    )
                    continue
                try:
                    logger.info(f"Attempting reconnect: {server.name}")
                    await server.connect()
                    self.retry_counts[server.name] = 0
                    logger.info(f"Reconnected: {server.name}")
                except Exception as e:
                    self.retry_counts[server.name] += 1
                    logger.error(
                        f"Reconnect failed for {server.name}: {e} "
                        f"(attempt {self.retry_counts[server.name]}/"
                        f"{self.max_retries})"
                    )

    @property
    def active_servers(self):
        return self.manager.active_servers

    @property
    def failed_servers(self):
        return self.manager.failed_servers
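
The loop above retries on a fixed interval. A common alternative, which the class does not implement, is exponential backoff with jitter: wait longer after each consecutive failure and randomize the delay slightly so that many clients do not retry in lockstep. The sketch below shows only the delay calculation. backoff_delay and its parameters are hypothetical names, not part of the SDK.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.1):
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles with each attempt, is capped at `cap`, and is then
    stretched by a random factor between 0 and `jitter` (as a fraction).
    """
    delay = min(cap, base * 2 ** attempt)
    return delay * (1 + random.uniform(0, jitter))


if __name__ == "__main__":
    for attempt in range(8):
        print(attempt, round(backoff_delay(attempt), 2))
```

To use it in _reconnect_loop you would track a next-attempt time per server and skip servers whose time has not yet arrived, using retry_counts[server.name] as the attempt number.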

Integrating with Agent Runner

Here is a complete example that ties the manager into an agent service:

from contextlib import asynccontextmanager
import logging

from agents import Agent, Runner
from fastapi import FastAPI, HTTPException

logger = logging.getLogger(__name__)

resilient_manager = ResilientMCPManager(
    servers=[filesystem, database, search, custom_tools],
    reconnect_interval=30,
    max_retries=10,
)

# Lifespan handler replaces the deprecated @app.on_event hooks
@asynccontextmanager
async def lifespan(app: FastAPI):
    # Connect on startup; connections are cleaned up on shutdown
    async with resilient_manager:
        active = resilient_manager.active_servers
        failed = resilient_manager.failed_servers
        logger.info(f"MCP servers active: {[s.name for s in active]}")
        if failed:
            logger.warning(f"MCP servers failed: {[s.name for s in failed]}")
        yield

app = FastAPI(lifespan=lifespan)

@app.post("/chat")
async def chat(message: str):
    active = resilient_manager.active_servers
    if not active:
        # Send a real HTTP 503 instead of a 200 response with an error body
        raise HTTPException(status_code=503, detail="All MCP servers are unavailable")

    # Build the agent per request so it only sees currently connected servers
    agent = Agent(
        name="Enterprise Assistant",
        instructions=adaptive_instructions,
        mcp_servers=active,
    )
    result = await Runner.run(agent, message)
    return {
        "response": result.final_output,
        "servers_used": [s.name for s in active],
    }

Best Practices for Multi-Server Agents

  1. Always use MCPServerManager when connecting to two or more MCP servers. Direct management of multiple servers leads to inconsistent error handling.
  2. Categorize servers by criticality. Fail fast if essential servers are down. Degrade gracefully for optional ones.
  3. Set connection timeouts. Do not let a slow server block the entire startup sequence.
  4. Drop permanently failed servers. If a server exceeds your retry limit, remove it to prevent useless tool calls.
  5. Expose health endpoints. Report which servers are active and wire this into your alerting system.
  6. Log every lifecycle event. Connection, disconnection, and reconnection attempts should all produce structured log entries with server names and error details.
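
Practices 2 and 3 can be sketched with the standard library alone. The names below (assess, connect_with_timeout, and the example server sets) are hypothetical and chosen for illustration. The sketch assumes you know the names of the connected servers and that connecting is an awaitable call.

```python
import asyncio


def assess(active_names, required, optional):
    """Classify overall health from the set of connected server names."""
    missing_required = sorted(required - active_names)
    missing_optional = sorted(optional - active_names)
    if missing_required:
        status = "unavailable"  # fail fast: an essential server is down
    elif missing_optional:
        status = "degraded"  # keep serving with reduced capability
    else:
        status = "healthy"
    return {
        "status": status,
        "missing_required": missing_required,
        "missing_optional": missing_optional,
    }


async def connect_with_timeout(connect, timeout):
    """Return True if connect() finishes within `timeout` seconds, else False.

    Exceptions raised by connect() itself are not caught here.
    """
    try:
        await asyncio.wait_for(connect(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        return False


if __name__ == "__main__":
    print(assess({"Database", "Search"}, {"Database"}, {"Search", "Filesystem"}))
```

A startup routine could refuse to serve traffic when assess(...) reports "unavailable" and log a warning when it reports "degraded".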

MCPServerManager transforms multi-server MCP from a fragile setup into a resilient system. By tracking server health, supporting graceful degradation, and enabling reconnection, it gives your production agents the reliability they need to serve real users.


Written by

CallSphere Team
