The Multi-Server Challenge

Production agents rarely use a single MCP server. A typical enterprise agent might connect to:

A filesystem server for document access
A database server for customer records
A search server for knowledge base queries
A custom business logic server for domain operations
An email server for sending notifications

When everything is healthy, this works well. But in production, servers crash, network connections drop, and deployments restart services. A single failed server can break the entire agent if connections are not managed properly.

MCPServerManager is the orchestration layer that handles multi-server lifecycle management. It tracks which servers are active and which have failed, and it gives you the hooks to build recovery strategies, so your agent degrades gracefully instead of crashing.
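To make the idea concrete, here is a minimal standard-library sketch of "record failures instead of raising". It does not use the Agents SDK: FakeServer, ConnectionReport, connect_all, and demo_report are hypothetical names invented for this illustration, and the timeout values are arbitrary.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """Hypothetical stand-in for an MCP server; not part of any SDK."""
    name: str
    healthy: bool = True
    delay: float = 0.0

    async def connect(self) -> None:
        await asyncio.sleep(self.delay)
        if not self.healthy:
            raise ConnectionError(f"{self.name} refused the connection")


@dataclass
class ConnectionReport:
    active: list = field(default_factory=list)   # names that connected
    failed: dict = field(default_factory=dict)   # name -> error description


async def connect_all(servers, timeout: float = 1.0) -> ConnectionReport:
    """Connect to every server concurrently and record failures instead of raising."""

    async def attempt(server):
        # wait_for bounds each attempt so one slow server cannot stall startup
        await asyncio.wait_for(server.connect(), timeout=timeout)

    # return_exceptions=True hands back errors as values, in input order
    results = await asyncio.gather(
        *(attempt(s) for s in servers), return_exceptions=True
    )
    report = ConnectionReport()
    for server, result in zip(servers, results):
        if isinstance(result, Exception):
            report.failed[server.name] = repr(result)
        else:
            report.active.append(server.name)
    return report


def demo_report() -> ConnectionReport:
    servers = [
        FakeServer("Filesystem"),
        FakeServer("Database", healthy=False),  # fails immediately
        FakeServer("Search", delay=0.5),        # slower than the timeout
    ]
    return asyncio.run(connect_all(servers, timeout=0.05))
```

In this sketch a refused connection and a timeout both end up in the failed map, while the healthy server is still usable. That is the behavior the rest of this article relies on from the manager.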

Setting Up MCPServerManager

MCPServerManager wraps multiple MCP server instances and provides a unified interface for connection management:

flowchart LR
    HOST(["MCP host<br/>Claude Desktop or IDE"])
    CLIENT["MCP client"]
    subgraph SERVERS["MCP Servers"]
        S1["Filesystem server"]
        S2["GitHub server"]
        S3["Postgres server"]
        SX["Custom tool server"]
    end
    LLM["LLM session"]
    OUT(["Grounded action"])
    HOST <--> CLIENT
    CLIENT <-->|stdio or HTTP+SSE| S1
    CLIENT <--> S2
    CLIENT <--> S3
    CLIENT <--> SX
    CLIENT --> LLM --> OUT
    style HOST fill:#f1f5f9,stroke:#64748b,color:#0f172a
    style CLIENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff

from agents.mcp import (
    MCPServerManager,
    MCPServerStdio,
    MCPServerStreamableHttp,
)

# Define your servers
# Local subprocess server, reached over stdio
filesystem = MCPServerStdio(
    name="Filesystem",
    params={
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/data"],
    },
    cache_tools_list=True,
)

# Remote servers, reached over Streamable HTTP
database = MCPServerStreamableHttp(
    name="Database",
    params={"url": "http://db-mcp:8001/mcp"},
    cache_tools_list=True,
)

search = MCPServerStreamableHttp(
    name="Search",
    params={"url": "http://search-mcp:8002/mcp"},
    cache_tools_list=True,
)

# Another local subprocess server, this one written in Python
custom_tools = MCPServerStdio(
    name="BusinessLogic",
    params={
        "command": "python",
        "args": ["business_logic_server.py"],
    },
    cache_tools_list=True,
)

# Create the manager
manager = MCPServerManager(
    servers=[filesystem, database, search, custom_tools]
)

Connecting with the Manager

Use the manager as an async context manager. It handles connecting to all servers and provides status tracking:

from agents import Agent, Runner

async def run_agent(user_message: str):
    async with manager:
        # Check which servers connected successfully
        active = manager.active_servers
        failed = manager.failed_servers

        print(f"Active servers: {[s.name for s in active]}")
        print(f"Failed servers: {[s.name for s in failed]}")

        if not active:
            return "All MCP servers are unavailable. Please try again later."

        # Build the agent from the servers that actually connected,
        # so it never tries to list tools on a server that is down
        agent = Agent(
            name="Enterprise Assistant",
            instructions="You help employees with file access, data queries, and business operations.",
            mcp_servers=list(active),
        )

        result = await Runner.run(agent, user_message)
        return result.final_output

The key difference from managing servers individually is that MCPServerManager does not raise an exception if one server fails to connect. Instead, it tracks the failure and lets you decide how to respond.

Monitoring Active and Failed Servers

MCPServerManager provides two properties for monitoring server health:

active_servers — A list of server instances that are currently connected and operational.
failed_servers — A list of server instances that failed to connect or lost their connection.

Use these to build health checks and adaptive behavior:

from fastapi import FastAPI

app = FastAPI()

@app.get("/health/mcp")
async def mcp_health():
    active = manager.active_servers
    failed = manager.failed_servers
    return {
        "status": "degraded" if failed else "healthy",
        "active": [s.name for s in active],
        "failed": [s.name for s in failed],
        "total": len(active) + len(failed),
        "active_count": len(active),
    }
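The endpoint above reports "degraded" even when no server is reachable at all. One possible refinement, sketched here as a plain function with no framework dependency, separates a total outage from a partial one and suggests an HTTP status code for each. summarize_health and the "unavailable" label are assumptions made for this example, not part of the SDK or FastAPI.

```python
def summarize_health(active_names, failed_names):
    """Return (http_status, body) for a health endpoint.

    Hypothetical helper: takes plain server names so it can be
    unit-tested without any MCP connections.
    """
    active = sorted(active_names)
    failed = sorted(failed_names)

    if not active:
        # Nothing is reachable: signal the outage to load balancers
        status, http_status = "unavailable", 503
    elif failed:
        # Some tools are missing, but the agent can still answer
        status, http_status = "degraded", 200
    else:
        status, http_status = "healthy", 200

    body = {
        "status": status,
        "active": active,
        "failed": failed,
        "total": len(active) + len(failed),
        "active_count": len(active),
    }
    return http_status, body
```

Inside the route you could pass in the names taken from manager.active_servers and manager.failed_servers and set the response status from the first element of the tuple.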

You can also use server status to adjust agent behavior dynamically:

async def adaptive_instructions(run_context, agent):
    active_names = {s.name for s in manager.active_servers}
    base = "You are an enterprise assistant."

    if "Database" not in active_names:
        base += (
            " The database server is currently unavailable. "
            "Let the user know you cannot look up records right now "
            "and suggest they try again in a few minutes."
        )

    if "Search" not in active_names:
        base += (
            " The search server is offline. You cannot search the "
            "knowledge base. Answer from your training data and note "
            "that results may not reflect the latest documentation."
        )

    return base

agent = Agent(
    name="Enterprise Assistant",
    instructions=adaptive_instructions,
    mcp_servers=[filesystem, database, search, custom_tools],
)
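As the number of servers grows, a chain of if statements gets hard to maintain. A table-driven variant of the same idea might look like the sketch below. BASE_PROMPT, OUTAGE_NOTICES, and build_instructions are hypothetical names, and the function takes a set of active server names rather than reading the manager directly, which keeps it easy to test.

```python
BASE_PROMPT = "You are an enterprise assistant."

# One notice per server name; wording follows the examples above
OUTAGE_NOTICES = {
    "Database": (
        "The database server is currently unavailable. "
        "Let the user know you cannot look up records right now "
        "and suggest they try again in a few minutes."
    ),
    "Search": (
        "The search server is offline. You cannot search the "
        "knowledge base. Answer from your training data and note "
        "that results may not reflect the latest documentation."
    ),
}


def build_instructions(active_names, base=BASE_PROMPT, notices=None):
    """Append an outage notice for every known server that is not active."""
    notices = OUTAGE_NOTICES if notices is None else notices
    parts = [base]
    for server_name, notice in notices.items():
        if server_name not in active_names:
            parts.append(notice)
    return " ".join(parts)
```

A dynamic instructions function could then collect the active server names from the manager and return build_instructions(active_names).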

Dropping Failed Servers

When a server fails, it stays in the manager's server list by default. The agent SDK will skip it when listing tools, but it still occupies a connection slot and may cause timeouts if the agent tries to reach it.

drop_failed_servers() removes failed servers from the manager entirely:

async def run_with_cleanup():
    async with manager:
        # Some servers may have failed to connect
        if manager.failed_servers:
            failed_names = [s.name for s in manager.failed_servers]
            print(f"Dropping failed servers: {failed_names}")
            manager.drop_failed_servers()

        # Now only healthy servers remain
        result = await Runner.run(agent, "Check my recent orders")
        return result.final_output

This is useful when you know a server will not recover during the current session. Dropping it prevents the agent from wasting tokens generating tool calls that will fail.

Reconnection Strategies

For long-running services, you need a strategy to reconnect failed servers. The manager itself does not auto-reconnect, but you can build reconnection logic on top of it:

import asyncio
import logging

logger = logging.getLogger(__name__)

class ResilientMCPManager:
    def __init__(self, servers, reconnect_interval=60, max_retries=5):
        self.all_servers = servers
        self.manager = MCPServerManager(servers=servers)
        self.reconnect_interval = reconnect_interval
        self.max_retries = max_retries
        self.retry_counts = {s.name: 0 for s in servers}
        self._reconnect_task = None

    async def __aenter__(self):
        await self.manager.__aenter__()
        self._reconnect_task = asyncio.create_task(self._reconnect_loop())
        return self

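    async def __aexit__(self, *args):
        if self._reconnect_task:
            self._reconnect_task.cancel()
            # Wait for the loop to unwind before closing the servers,
            # so a reconnect attempt cannot race with cleanup
            try:
                await self._reconnect_task
            except asyncio.CancelledError:
                pass
        await self.manager.__aexit__(*args)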

    async def _reconnect_loop(self):
        while True:
            await asyncio.sleep(self.reconnect_interval)
            failed = list(self.manager.failed_servers)
            for server in failed:
                if self.retry_counts[server.name] >= self.max_retries:
                    logger.warning(
                        f"Server {server.name} exceeded max retries, skipping"
                    )
                    continue
                try:
                    logger.info(f"Attempting reconnect: {server.name}")
                    await server.connect()
                    self.retry_counts[server.name] = 0
                    logger.info(f"Reconnected: {server.name}")
                except Exception as e:
                    self.retry_counts[server.name] += 1
                    logger.error(
                        f"Reconnect failed for {server.name}: {e} "
                        f"(attempt {self.retry_counts[server.name]}/"
                        f"{self.max_retries})"
                    )

    @property
    def active_servers(self):
        return self.manager.active_servers

    @property
    def failed_servers(self):
        return self.manager.failed_servers
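The loop above retries on a fixed interval. If many agent instances lose the same server at once, a fixed interval means they all retry at the same moment. A common alternative is exponential backoff with jitter, sketched below. backoff_delay is a hypothetical helper, and its defaults (1 second base, 60 second cap) are illustrative assumptions rather than recommended values.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, jitter=0.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based).

    The delay doubles with each attempt and is capped at `cap`.
    `jitter` is a fraction: 0.25 spreads the result by up to +/- 25%.
    The cap is applied before jitter, so the final value can exceed
    `cap` by at most the jitter fraction.
    """
    delay = min(cap, base * (2 ** attempt))
    if jitter:
        # rng() is in [0, 1); rescale it to [-1, 1) around the base delay
        delay *= 1 + jitter * (2 * rng() - 1)
    return delay
```

In the reconnect loop, one way to use this would be to track the next allowed retry time per server and skip any server whose time has not come yet, passing the current retry count as attempt.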

Integrating with Agent Runner

Here is a complete example that ties the manager into an agent service:

from agents import Agent, Runner
from fastapi import FastAPI, HTTPException
import logging

logger = logging.getLogger(__name__)
app = FastAPI()

resilient_manager = ResilientMCPManager(
    servers=[filesystem, database, search, custom_tools],
    reconnect_interval=30,
    max_retries=10,
)

@app.on_event("startup")
async def startup():
    # Connect once at boot and start the background reconnect loop
    await resilient_manager.__aenter__()
    active = resilient_manager.active_servers
    failed = resilient_manager.failed_servers
    logger.info(f"MCP servers active: {[s.name for s in active]}")
    if failed:
        logger.warning(f"MCP servers failed: {[s.name for s in failed]}")

@app.on_event("shutdown")
async def shutdown():
    await resilient_manager.__aexit__(None, None, None)

@app.post("/chat")
async def chat(message: str):
    active = list(resilient_manager.active_servers)
    if not active:
        # Return a real 503 so clients and load balancers see the outage
        raise HTTPException(
            status_code=503, detail="All MCP servers are unavailable"
        )

    # Build the agent per request so it only sees servers that are up now.
    # adaptive_instructions must read resilient_manager.active_servers in
    # this setup, not the standalone manager from the earlier examples.
    agent = Agent(
        name="Enterprise Assistant",
        instructions=adaptive_instructions,
        mcp_servers=active,
    )

    result = await Runner.run(agent, message)
    return {
        "response": result.final_output,
        "servers_used": [s.name for s in active],
    }

Best Practices for Multi-Server Agents

Always use MCPServerManager when connecting to two or more MCP servers. Direct management of multiple servers leads to inconsistent error handling.
Categorize servers by criticality. Fail fast if essential servers are down. Degrade gracefully for optional ones.
Set connection timeouts. Do not let a slow server block the entire startup sequence.
Drop permanently failed servers. If a server exceeds your retry limit, remove it to prevent useless tool calls.
Expose health endpoints. Report which servers are active and wire this into your alerting system.
Log every lifecycle event. Connection, disconnection, and reconnection attempts should all produce structured log entries with server names and error details.
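The criticality rule can be expressed as a small startup gate. The sketch below is one possible shape for it. StartupDecision and decide_startup are hypothetical names, and which servers count as required is an assumption you would replace with your own policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupDecision:
    can_start: bool
    missing_required: tuple
    missing_optional: tuple


def decide_startup(active_names, required, optional=()):
    """Fail fast when a required server is down; tolerate missing optional ones."""
    active = set(active_names)
    missing_required = tuple(n for n in required if n not in active)
    missing_optional = tuple(n for n in optional if n not in active)
    return StartupDecision(
        can_start=not missing_required,
        missing_required=missing_required,
        missing_optional=missing_optional,
    )
```

At startup you might raise an error when can_start is false and log missing_optional as a warning, which keeps the fail-fast and degrade-gracefully paths explicit.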

MCPServerManager transforms multi-server MCP from a fragile setup into a resilient system. By tracking server health, supporting graceful degradation, and enabling reconnection, it gives your production agents the reliability they need to serve real users.
