MCPServerManager: Orchestrating Multiple MCP Servers
Use MCPServerManager to orchestrate multiple MCP server connections with automatic failure detection, reconnection strategies, and health monitoring via its active_servers and failed_servers properties and the drop_failed_servers() method.
The Multi-Server Challenge
Production agents rarely use a single MCP server. A typical enterprise agent might connect to:
- A filesystem server for document access
- A database server for customer records
- A search server for knowledge base queries
- A custom business logic server for domain operations
- An email server for sending notifications
When everything is healthy, this works well. But in production, servers crash, network connections drop, and deployments restart services. A single failed server can break the entire agent if connections are not managed properly.
MCPServerManager is the orchestration layer that handles multi-server lifecycle management. It tracks which servers are active, which have failed, and provides strategies for recovery — so your agent degrades gracefully instead of crashing.
Setting Up MCPServerManager
MCPServerManager wraps multiple MCP server instances and provides a unified interface for connection management:
from agents.mcp import (
    MCPServerStdio,
    MCPServerStreamableHTTP,
    MCPServerManager,
)

# Define your servers
filesystem = MCPServerStdio(
    name="Filesystem",
    params={
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/data"],
    },
    cache_tools_list=True,
)

database = MCPServerStreamableHTTP(
    name="Database",
    params={"url": "http://db-mcp:8001/mcp"},
    cache_tools_list=True,
)

search = MCPServerStreamableHTTP(
    name="Search",
    params={"url": "http://search-mcp:8002/mcp"},
    cache_tools_list=True,
)

custom_tools = MCPServerStdio(
    name="BusinessLogic",
    params={
        "command": "python",
        "args": ["business_logic_server.py"],
    },
    cache_tools_list=True,
)

# Create the manager
manager = MCPServerManager(
    servers=[filesystem, database, search, custom_tools]
)
Connecting with the Manager
Use the manager as an async context manager. It handles connecting to all servers and provides status tracking:
from agents import Agent, Runner

agent = Agent(
    name="Enterprise Assistant",
    instructions="You help employees with file access, data queries, and business operations.",
    mcp_servers=[filesystem, database, search, custom_tools],
)

async def run_agent(user_message: str):
    async with manager:
        # Check which servers connected successfully
        active = manager.active_servers
        failed = manager.failed_servers
        print(f"Active servers: {[s.name for s in active]}")
        print(f"Failed servers: {[s.name for s in failed]}")

        if not active:
            return "All MCP servers are unavailable. Please try again later."

        result = await Runner.run(agent, user_message)
        return result.final_output
The key difference from managing servers individually is that MCPServerManager does not raise an exception if one server fails to connect. Instead, it tracks the failure and lets you decide how to respond.
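For contrast, managing servers individually means writing this connect-and-track bookkeeping yourself. A minimal sketch of the manual pattern, using stub objects in place of real MCP server instances (StubServer, connect_all, and the healthy flag are illustrative, not SDK APIs):

```python
import asyncio

# Hypothetical stand-ins for MCP server objects, to illustrate the pattern;
# real code would use MCPServerStdio / MCPServerStreamableHTTP instances.
class StubServer:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    async def connect(self):
        if not self.healthy:
            raise ConnectionError(f"{self.name} refused connection")

async def connect_all(servers):
    """Manual equivalent of what MCPServerManager tracks for you:
    connect each server, collecting successes and failures instead of
    letting one failure abort the whole startup."""
    active, failed = [], []
    for server in servers:
        try:
            await server.connect()
            active.append(server)
        except Exception:
            failed.append(server)
    return active, failed

active, failed = asyncio.run(
    connect_all([StubServer("Filesystem"), StubServer("Database", healthy=False)])
)
print([s.name for s in active])  # ['Filesystem']
print([s.name for s in failed])  # ['Database']
```

The manager gives you the same two lists without the boilerplate, and applies one consistent error-handling policy across every server.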
Monitoring Active and Failed Servers
MCPServerManager provides two properties for monitoring server health:
- active_servers — A list of server instances that are currently connected and operational.
- failed_servers — A list of server instances that failed to connect or lost their connection.
Use these to build health checks and adaptive behavior:
from fastapi import FastAPI

app = FastAPI()

@app.get("/health/mcp")
async def mcp_health():
    active = manager.active_servers
    failed = manager.failed_servers
    return {
        "status": "degraded" if failed else "healthy",
        "active": [s.name for s in active],
        "failed": [s.name for s in failed],
        "total": len(active) + len(failed),
        "active_count": len(active),
    }
You can also use server status to adjust agent behavior dynamically:
async def adaptive_instructions(run_context, agent):
    active_names = {s.name for s in manager.active_servers}
    base = "You are an enterprise assistant."

    if "Database" not in active_names:
        base += (
            " The database server is currently unavailable. "
            "Let the user know you cannot look up records right now "
            "and suggest they try again in a few minutes."
        )

    if "Search" not in active_names:
        base += (
            " The search server is offline. You cannot search the "
            "knowledge base. Answer from your training data and note "
            "that results may not reflect the latest documentation."
        )

    return base

agent = Agent(
    name="Enterprise Assistant",
    instructions=adaptive_instructions,
    mcp_servers=[filesystem, database, search, custom_tools],
)
Dropping Failed Servers
When a server fails, it stays in the manager's server list by default. The agent SDK will skip it when listing tools, but it still occupies a connection slot and may cause timeouts if the agent tries to reach it.
drop_failed_servers() removes failed servers from the manager entirely:
async def run_with_cleanup():
    async with manager:
        # Some servers may have failed to connect
        if manager.failed_servers:
            failed_names = [s.name for s in manager.failed_servers]
            print(f"Dropping failed servers: {failed_names}")
            manager.drop_failed_servers()

        # Now only healthy servers remain
        result = await Runner.run(agent, "Check my recent orders")
        return result.final_output
This is useful when you know a server will not recover during the current session. Dropping it prevents the agent from wasting tokens generating tool calls that will fail.
Reconnection Strategies
For long-running services, you need a strategy to reconnect failed servers. The manager itself does not auto-reconnect, but you can build reconnection logic on top of it:
import asyncio
import logging

logger = logging.getLogger(__name__)

class ResilientMCPManager:
    def __init__(self, servers, reconnect_interval=60, max_retries=5):
        self.all_servers = servers
        self.manager = MCPServerManager(servers=servers)
        self.reconnect_interval = reconnect_interval
        self.max_retries = max_retries
        self.retry_counts = {s.name: 0 for s in servers}
        self._reconnect_task = None

    async def __aenter__(self):
        await self.manager.__aenter__()
        self._reconnect_task = asyncio.create_task(self._reconnect_loop())
        return self

    async def __aexit__(self, *args):
        if self._reconnect_task:
            self._reconnect_task.cancel()
        await self.manager.__aexit__(*args)

    async def _reconnect_loop(self):
        while True:
            await asyncio.sleep(self.reconnect_interval)
            failed = list(self.manager.failed_servers)
            for server in failed:
                if self.retry_counts[server.name] >= self.max_retries:
                    logger.warning(
                        f"Server {server.name} exceeded max retries, skipping"
                    )
                    continue
                try:
                    logger.info(f"Attempting reconnect: {server.name}")
                    await server.connect()
                    self.retry_counts[server.name] = 0
                    logger.info(f"Reconnected: {server.name}")
                except Exception as e:
                    self.retry_counts[server.name] += 1
                    logger.error(
                        f"Reconnect failed for {server.name}: {e} "
                        f"(attempt {self.retry_counts[server.name]}/"
                        f"{self.max_retries})"
                    )

    @property
    def active_servers(self):
        return self.manager.active_servers

    @property
    def failed_servers(self):
        return self.manager.failed_servers
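This wrapper retries on a fixed schedule. A common refinement is exponential backoff with a cap, so a freshly failed server is probed quickly and a persistently flapping one is probed less often. A minimal helper sketch (backoff_delay is illustrative, not an SDK function):

```python
def backoff_delay(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Seconds to wait before reconnect attempt `attempt` (0-based):
    doubles each attempt, capped at `cap` so delays never grow unbounded."""
    return min(base * (2 ** attempt), cap)

# First six retry delays: 5s, 10s, 20s, 40s, 80s, 160s
print([backoff_delay(n) for n in range(6)])
```

Replacing the constant `reconnect_interval` sleep with a per-server delay computed from its retry count gives each connection its own schedule.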
Integrating with Agent Runner
Here is a complete example that ties the manager into an agent service:
from agents import Agent, Runner
from fastapi import FastAPI
import logging

logger = logging.getLogger(__name__)
app = FastAPI()

resilient_manager = ResilientMCPManager(
    servers=[filesystem, database, search, custom_tools],
    reconnect_interval=30,
    max_retries=10,
)

agent = Agent(
    name="Enterprise Assistant",
    instructions=adaptive_instructions,
    mcp_servers=[filesystem, database, search, custom_tools],
)

@app.on_event("startup")
async def startup():
    await resilient_manager.__aenter__()
    active = resilient_manager.active_servers
    failed = resilient_manager.failed_servers
    logger.info(f"MCP servers active: {[s.name for s in active]}")
    if failed:
        logger.warning(f"MCP servers failed: {[s.name for s in failed]}")

@app.on_event("shutdown")
async def shutdown():
    await resilient_manager.__aexit__(None, None, None)

@app.post("/chat")
async def chat(message: str):
    active = resilient_manager.active_servers
    if not active:
        return {"error": "All MCP servers are unavailable", "status": 503}

    result = await Runner.run(agent, message)
    return {
        "response": result.final_output,
        "servers_used": [s.name for s in active],
    }
Best Practices for Multi-Server Agents
- Always use MCPServerManager when connecting to two or more MCP servers. Direct management of multiple servers leads to inconsistent error handling.
- Categorize servers by criticality. Fail fast if essential servers are down. Degrade gracefully for optional ones.
- Set connection timeouts. Do not let a slow server block the entire startup sequence.
- Drop permanently failed servers. If a server exceeds your retry limit, remove it to prevent useless tool calls.
- Expose health endpoints. Report which servers are active and wire this into your alerting system.
- Log every lifecycle event. Connection, disconnection, and reconnection attempts should all produce structured log entries with server names and error details.
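The first two practices combine into a small startup gate. A sketch, assuming for illustration that "Database" and "BusinessLogic" are this agent's critical servers (the names and the classify_startup helper are hypothetical):

```python
# Assumption: which servers are critical is specific to your deployment.
CRITICAL = {"Database", "BusinessLogic"}

def classify_startup(failed_names) -> str:
    """Fail fast if a critical server is down; degrade gracefully
    when only optional servers failed to connect."""
    failed = set(failed_names)
    if failed & CRITICAL:
        return "fatal"      # abort startup, page on-call
    if failed:
        return "degraded"   # serve traffic, surface the gap in instructions
    return "healthy"

print(classify_startup(["Search"]))    # degraded
print(classify_startup(["Database"]))  # fatal
print(classify_startup([]))            # healthy
```

Wiring this check into the startup handler, against manager.failed_servers, turns a vague "some servers failed" log line into a concrete go/no-go decision.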
MCPServerManager transforms multi-server MCP from a fragile setup into a resilient system. By tracking server health, supporting graceful degradation, and enabling reconnection, it gives your production agents the reliability they need to serve real users.
Written by
CallSphere Team