---
title: "MCPServerManager: Orchestrating Multiple MCP Servers"
description: "Use MCPServerManager to orchestrate multiple MCP server connections with automatic failure detection, reconnection strategies, and health monitoring using active_servers, failed_servers, and drop_failed_servers."
canonical: https://callsphere.ai/blog/mcp-server-manager-orchestrating-multiple-servers
category: "Learn Agentic AI"
tags: ["OpenAI", "MCP", "Server Manager", "Multi-Server"]
author: "CallSphere Team"
published: 2026-03-14T00:00:00.000Z
updated: 2026-05-07T07:12:32.530Z
---

# MCPServerManager: Orchestrating Multiple MCP Servers

> Use MCPServerManager to orchestrate multiple MCP server connections with automatic failure detection, reconnection strategies, and health monitoring using active_servers, failed_servers, and drop_failed_servers.

## The Multi-Server Challenge

Production agents rarely use a single MCP server. A typical enterprise agent might connect to:

- A filesystem server for document access
- A database server for customer records
- A search server for knowledge base queries
- A custom business logic server for domain operations
- An email server for sending notifications

When everything is healthy, this works well. But in production, servers crash, network connections drop, and deployments restart services. A single failed server can break the entire agent if connections are not managed properly.

MCPServerManager is the orchestration layer that handles multi-server lifecycle management. It tracks which servers are active, which have failed, and provides strategies for recovery — so your agent degrades gracefully instead of crashing.

## Setting Up MCPServerManager

MCPServerManager wraps multiple MCP server instances and provides a unified interface for connection management:

```mermaid
flowchart LR
    HOST(["MCP host<br/>Claude Desktop or IDE"])
    CLIENT["MCP client"]
    subgraph SERVERS["MCP Servers"]
        S1["Filesystem server"]
        S2["GitHub server"]
        S3["Postgres server"]
        SX["Custom tool server"]
    end
    LLM["LLM session"]
    OUT(["Grounded action"])
    HOST --> CLIENT
    CLIENT -->|stdio or HTTP+SSE| S1
    CLIENT --> S2
    CLIENT --> S3
    CLIENT --> SX
    CLIENT --> LLM --> OUT
    style HOST fill:#f1f5f9,stroke:#64748b,color:#0f172a
    style CLIENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
```

```python
from agents.mcp import (
    MCPServerStdio,
    MCPServerStreamableHTTP,
    MCPServerManager,
)

# Define your servers
filesystem = MCPServerStdio(
    name="Filesystem",
    params={
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/data"],
    },
    cache_tools_list=True,
)

database = MCPServerStreamableHTTP(
    name="Database",
    params={"url": "http://db-mcp:8001/mcp"},
    cache_tools_list=True,
)

search = MCPServerStreamableHTTP(
    name="Search",
    params={"url": "http://search-mcp:8002/mcp"},
    cache_tools_list=True,
)

custom_tools = MCPServerStdio(
    name="BusinessLogic",
    params={
        "command": "python",
        "args": ["business_logic_server.py"],
    },
    cache_tools_list=True,
)

# Create the manager
manager = MCPServerManager(
    servers=[filesystem, database, search, custom_tools]
)
```

## Connecting with the Manager

Use the manager as an async context manager. It handles connecting to all servers and provides status tracking:

```python
from agents import Agent, Runner

agent = Agent(
    name="Enterprise Assistant",
    instructions="You help employees with file access, data queries, and business operations.",
    mcp_servers=[filesystem, database, search, custom_tools],
)

async def run_agent(user_message: str):
    async with manager:
        # Check which servers connected successfully
        active = manager.active_servers
        failed = manager.failed_servers

        print(f"Active servers: {[s.name for s in active]}")
        print(f"Failed servers: {[s.name for s in failed]}")

        if not active:
            return "All MCP servers are unavailable. Please try again later."

        result = await Runner.run(agent, user_message)
        return result.final_output
```

The key difference from managing servers individually is that MCPServerManager does not raise an exception if one server fails to connect. Instead, it tracks the failure and lets you decide how to respond.
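You can act on that status with a small helper that checks whether the servers a given workflow depends on are among the active ones. This is a minimal sketch; the server names are illustrative:

```python
def missing_required(active_names, required):
    """Return the set of required servers that are not currently active."""
    return set(required) - set(active_names)

# Example: Database is required for this workflow but did not connect.
missing = missing_required(["Filesystem", "Search"], ["Filesystem", "Database"])
```

If `missing` is non-empty for a workflow's required set, you can return an early error for that request while still serving requests that only need the active servers.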

## Monitoring Active and Failed Servers

MCPServerManager provides two properties for monitoring server health:

- `active_servers` — A list of server instances that are currently connected and operational.
- `failed_servers` — A list of server instances that failed to connect or lost their connection.

Use these to build health checks and adaptive behavior:

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/health/mcp")
async def mcp_health():
    active = manager.active_servers
    failed = manager.failed_servers
    return {
        "status": "degraded" if failed else "healthy",
        "active": [s.name for s in active],
        "failed": [s.name for s in failed],
        "total": len(active) + len(failed),
        "active_count": len(active),
    }
```

You can also use server status to adjust agent behavior dynamically:

```python
async def adaptive_instructions(run_context, agent):
    active_names = {s.name for s in manager.active_servers}
    base = "You are an enterprise assistant."

    if "Database" not in active_names:
        base += (
            " The database server is currently unavailable. "
            "Let the user know you cannot look up records right now "
            "and suggest they try again in a few minutes."
        )

    if "Search" not in active_names:
        base += (
            " The search server is offline. You cannot search the "
            "knowledge base. Answer from your training data and note "
            "that results may not reflect the latest documentation."
        )

    return base

agent = Agent(
    name="Enterprise Assistant",
    instructions=adaptive_instructions,
    mcp_servers=[filesystem, database, search, custom_tools],
)
```

## Dropping Failed Servers

When a server fails, it stays in the manager's server list by default. The agent SDK will skip it when listing tools, but it still occupies a connection slot and may cause timeouts if the agent tries to reach it.

`drop_failed_servers()` removes failed servers from the manager entirely:

```python
async def run_with_cleanup():
    async with manager:
        # Some servers may have failed to connect
        if manager.failed_servers:
            failed_names = [s.name for s in manager.failed_servers]
            print(f"Dropping failed servers: {failed_names}")
            manager.drop_failed_servers()

        # Now only healthy servers remain
        result = await Runner.run(agent, "Check my recent orders")
        return result.final_output
```

This is useful when you know a server will not recover during the current session. Dropping it prevents the agent from wasting tokens generating tool calls that will fail.
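Whether a failure is worth dropping depends on its cause. One heuristic, purely illustrative since the exception types you actually see depend on the transport, is to drop only on errors that are unlikely to heal within the session:

```python
# Hypothetical classification — adjust to the errors your transport raises.
PERMANENT = (FileNotFoundError, PermissionError)  # e.g. missing server binary

def should_drop(error: Exception) -> bool:
    """Treat configuration-style errors as permanent; keep retrying the rest."""
    return isinstance(error, PERMANENT)
```

A `ConnectionError` from a service that is mid-restart would return `False` here and remain eligible for reconnection.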

## Reconnection Strategies

For long-running services, you need a strategy to reconnect failed servers. The manager itself does not auto-reconnect, but you can build reconnection logic on top of it:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

class ResilientMCPManager:
    def __init__(self, servers, reconnect_interval=60, max_retries=5):
        self.all_servers = servers
        self.manager = MCPServerManager(servers=servers)
        self.reconnect_interval = reconnect_interval
        self.max_retries = max_retries
        self.retry_counts = {s.name: 0 for s in servers}
        self._reconnect_task = None

    async def __aenter__(self):
        await self.manager.__aenter__()
        self._reconnect_task = asyncio.create_task(self._reconnect_loop())
        return self

    async def __aexit__(self, *args):
        if self._reconnect_task:
            self._reconnect_task.cancel()
        await self.manager.__aexit__(*args)

    async def _reconnect_loop(self):
        while True:
            await asyncio.sleep(self.reconnect_interval)
            failed = list(self.manager.failed_servers)
            for server in failed:
                if self.retry_counts[server.name] >= self.max_retries:
                    logger.warning(
                        f"Server {server.name} exceeded max retries, skipping"
                    )
                    continue
                try:
                    logger.info(f"Attempting reconnect: {server.name}")
                    await server.connect()
                    self.retry_counts[server.name] = 0
                    logger.info(f"Reconnected: {server.name}")
                except Exception as e:
                    self.retry_counts[server.name] += 1
                    logger.error(
                        f"Reconnect failed for {server.name}: {e} "
                        f"(attempt {self.retry_counts[server.name]}/"
                        f"{self.max_retries})"
                    )

    @property
    def active_servers(self):
        return self.manager.active_servers

    @property
    def failed_servers(self):
        return self.manager.failed_servers
```
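The loop above retries on a fixed interval. If a server is restarting under load, exponential backoff with jitter is gentler on it; a sketch of the delay schedule (the constants are illustrative):

```python
import random

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Delay before retry number `attempt`: base * 2^attempt, capped, plus jitter."""
    delay = min(cap, base * (2 ** attempt))
    # Up to 25% jitter so a fleet of clients does not retry in lockstep
    return delay + random.uniform(0, delay * 0.25)
```

Feeding `retry_counts[server.name]` into this, instead of sleeping a flat `reconnect_interval`, spaces out attempts for servers that keep failing while retrying fresh failures quickly.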

## Integrating with Agent Runner

Here is a complete example that ties the manager into an agent service:

```python
from agents import Agent, Runner
from fastapi import FastAPI, HTTPException
import logging

logger = logging.getLogger(__name__)
app = FastAPI()

resilient_manager = ResilientMCPManager(
    servers=[filesystem, database, search, custom_tools],
    reconnect_interval=30,
    max_retries=10,
)

agent = Agent(
    name="Enterprise Assistant",
    instructions=adaptive_instructions,
    mcp_servers=[filesystem, database, search, custom_tools],
)

@app.on_event("startup")
async def startup():
    await resilient_manager.__aenter__()
    active = resilient_manager.active_servers
    failed = resilient_manager.failed_servers
    logger.info(f"MCP servers active: {[s.name for s in active]}")
    if failed:
        logger.warning(f"MCP servers failed: {[s.name for s in failed]}")

@app.on_event("shutdown")
async def shutdown():
    await resilient_manager.__aexit__(None, None, None)

@app.post("/chat")
async def chat(message: str):
    active = resilient_manager.active_servers
    if not active:
        # Return a real 503 instead of a 200 response with an error body
        raise HTTPException(status_code=503, detail="All MCP servers are unavailable")

    result = await Runner.run(agent, message)
    return {
        "response": result.final_output,
        "servers_used": [s.name for s in active],
    }
```

## Best Practices for Multi-Server Agents

1. **Always use MCPServerManager** when connecting to two or more MCP servers. Direct management of multiple servers leads to inconsistent error handling.
2. **Categorize servers by criticality.** Fail fast if essential servers are down. Degrade gracefully for optional ones.
3. **Set connection timeouts.** Do not let a slow server block the entire startup sequence.
4. **Drop permanently failed servers.** If a server exceeds your retry limit, remove it to prevent useless tool calls.
5. **Expose health endpoints.** Report which servers are active and wire this into your alerting system.
6. **Log every lifecycle event.** Connection, disconnection, and reconnection attempts should all produce structured log entries with server names and error details.
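Practice 2 can be made concrete with a criticality map consulted at startup. The assignments below are illustrative and reuse the server names from the setup section:

```python
# Illustrative criticality assignments for the servers defined earlier.
CRITICALITY = {
    "Filesystem": "critical",
    "Database": "critical",
    "Search": "optional",
    "BusinessLogic": "optional",
}

def startup_decision(failed_names: list[str]) -> tuple[str, list[str]]:
    """Abort startup if any critical server is down; otherwise degrade."""
    critical_down = [n for n in failed_names if CRITICALITY.get(n) == "critical"]
    if critical_down:
        return ("abort", critical_down)
    return ("degrade", list(failed_names))
```

An "abort" result would translate to refusing to start (or failing a readiness probe), while "degrade" lets the service come up with adjusted instructions like the `adaptive_instructions` example above.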

MCPServerManager transforms multi-server MCP from a fragile setup into a resilient system. By tracking server health, supporting graceful degradation, and enabling reconnection, it gives your production agents the reliability they need to serve real users.

---

Source: https://callsphere.ai/blog/mcp-server-manager-orchestrating-multiple-servers
