Overview
Many multi-agent architectures rely on a central orchestrator to coordinate tasks, route requests, and manage sub-agents. When nothing else can take over those duties, the orchestrator is a single point of failure (SPOF): if it fails, the whole system becomes unavailable.
The SPOF Pattern
          ┌─────────────────┐
          │  Orchestrator   │ ← Single Point of Failure
          │  (Supervisor)   │
          └────────┬────────┘
                   │
     ┌─────────────┼─────────────┐
     ↓             ↓             ↓
┌─────────┐   ┌─────────┐   ┌─────────┐
│ Agent A │   │ Agent B │   │ Agent C │
└─────────┘   └─────────┘   └─────────┘
If the orchestrator fails:
- No task routing
- No coordination
- Agents sit idle
- System appears dead
Failure Modes
Orchestrator Crash
Orchestrator process terminates unexpectedly
- Memory exhaustion
- Unhandled exception
- Infrastructure failure
- Deployment error
All in-flight tasks lost
New requests rejected
No coordination possible
Orchestrator Overload
Request volume exceeds orchestrator capacity
Time →  [Normal]   [High Load]   [Overload]   [Collapse]
        100 req    500 req       2000 req     Timeout
Orchestrator becomes bottleneck
Response times grow non-linearly as load approaches capacity
Eventually stops responding entirely
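A simple queueing model illustrates why latency climbs so steeply near capacity. The sketch below assumes an M/M/1 queue (random arrivals, one server), which is an idealization rather than a description of any particular orchestrator; `mean_response_time` and the capacity of 2000 requests per second are hypothetical names and numbers chosen for illustration.

```python
def mean_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean time a request spends in an M/M/1 system: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        # At or beyond capacity the queue grows without bound.
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


CAPACITY = 2000.0  # hypothetical service rate, requests per second

for load in (100.0, 500.0, 1900.0, 2000.0):
    latency = mean_response_time(load, CAPACITY)
    print(f"load={load:>6.0f} req/s  mean response time={latency:.5f} s")
```

Under these assumptions the mean response time at 95% utilization is 19 times the value at 5% utilization, and it is unbounded once load reaches capacity.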
Orchestrator Logic Failure
Orchestrator enters bad state:
- Infinite routing loop
- Deadlock waiting for sub-agent
- Resource leak accumulating
- Configuration corruption
System appears up but non-functional
Network Partition
                   Network
┌──────────────┐  Partition  ┌──────────────┐
│ Orchestrator │ X─────────X │    Agents    │
└──────────────┘             └──────────────┘
Orchestrator running but can't reach agents
Agents running but can't receive tasks
System functionally dead
Cascade Effects
Task Queue Backup
Orchestrator down for 5 minutes:
- 500 new requests queued
- Orchestrator recovers
- Struggles to process backlog
- Performance degraded for hours
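The length of the recovery period depends on how much spare capacity the orchestrator has once it is back. As a rough worked example, 500 requests queued over 5 minutes implies about 100 requests per minute if arrivals are steady. The function name `drain_minutes` and both capacity figures below are assumptions for illustration only.

```python
def drain_minutes(backlog: int, arrival_per_min: float, capacity_per_min: float) -> float:
    """Minutes needed to clear a backlog while new requests keep arriving."""
    spare = capacity_per_min - arrival_per_min
    if spare <= 0:
        # No headroom: the backlog never shrinks.
        return float("inf")
    return backlog / spare


ARRIVAL_PER_MIN = 500 / 5  # 100 req/min, derived from the scenario above

# Hypothetical capacities: barely above the arrival rate vs. double it.
tight = drain_minutes(500, ARRIVAL_PER_MIN, capacity_per_min=102.0)
roomy = drain_minutes(500, ARRIVAL_PER_MIN, capacity_per_min=200.0)
print(f"2% headroom: {tight:.0f} min, 100% headroom: {roomy:.0f} min")
```

With only 2% headroom the backlog takes 250 minutes to clear, while doubling capacity clears it in 5 minutes, which is why a short outage can degrade performance for hours on a system that normally runs near its limit.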
State Loss
Orchestrator maintained:
- In-flight task state
- Agent availability map
- Routing decisions
- Conversation context
Crash loses all ephemeral state
Recovery requires expensive reconstruction
Agent Confusion
Agent A: Waiting for instructions...
Agent B: Waiting for instructions...
Agent C: (Times out, starts autonomous action)
Agent D: (Receives stale instruction from queue)
Uncoordinated chaos when orchestrator returns
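One possible guard against the stale-instruction case is to stamp each instruction with an issue time and a time-to-live, and have agents discard anything that has expired. This is a minimal sketch, not part of the pattern described above: the `Instruction` type, its fields, and `is_stale` are hypothetical, and the check assumes the issuer's and agent's clocks are roughly in sync.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    task_id: str
    issued_at: float  # seconds, from the issuer's clock (assumed synchronized)
    ttl_s: float      # how long the instruction stays valid


def is_stale(instruction: Instruction, now: float) -> bool:
    """True when the instruction has outlived its time-to-live."""
    return now - instruction.issued_at > instruction.ttl_s


queued = Instruction(task_id="task-42", issued_at=1000.0, ttl_s=60.0)
fresh_check = is_stale(queued, now=1030.0)  # 30 s old, still valid
stale_check = is_stale(queued, now=1400.0)  # 400 s old, e.g. queued before an outage
print(fresh_check, stale_check)
```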
Anti-Patterns
Stateful Orchestrator
# Dangerous: All state in memory
class StatefulOrchestrator:
    def __init__(self):
        self.active_tasks = {}   # Lost on crash
        self.agent_states = {}   # Lost on crash
        self.routing_cache = {}  # Lost on crash
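A common alternative is to keep task state in a durable store so that a restarted orchestrator can reload it. The sketch below uses SQLite from the Python standard library purely as a stand-in for whatever store a real deployment would use; `DurableTaskStore`, its methods, and the table layout are hypothetical.

```python
import json
import os
import sqlite3
import tempfile


class DurableTaskStore:
    """Keeps in-flight task state on disk so it survives a process restart."""

    def __init__(self, path: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS tasks "
            "(task_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        self.db.commit()

    def save(self, task_id: str, state: dict) -> None:
        # Upsert so repeated saves overwrite the previous state.
        self.db.execute(
            "INSERT OR REPLACE INTO tasks (task_id, state) VALUES (?, ?)",
            (task_id, json.dumps(state)),
        )
        self.db.commit()

    def load_all(self) -> dict:
        rows = self.db.execute("SELECT task_id, state FROM tasks").fetchall()
        return {task_id: json.loads(state) for task_id, state in rows}

    def close(self) -> None:
        self.db.close()


# Simulate a crash and restart by closing the store and reopening the same file.
db_path = os.path.join(tempfile.mkdtemp(), "tasks.db")
store = DurableTaskStore(db_path)
store.save("task-1", {"agent": "A", "status": "running"})
store.close()

recovered = DurableTaskStore(db_path).load_all()
print(recovered)
```

Committing on every save trades throughput for durability; a real system would likely batch writes or use a replicated store, which this sketch does not attempt.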
No Health Checking
# No way to detect orchestrator failure
while True:
    task = queue.get()
    orchestrator.route(task)  # Blocks forever if down
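One way to make a hung orchestrator visible is to bound every routing call with a deadline. The following sketch uses `concurrent.futures` from the standard library; `route_with_timeout` is a hypothetical helper, and the lambdas stand in for a real `orchestrator.route`. Note that the timed-out call is not cancelled here: its worker thread keeps running until the call returns.

```python
import concurrent.futures
import time


def route_with_timeout(route, task, timeout_s: float, executor) -> bool:
    """Run route(task) on the executor; return False if it misses the deadline."""
    future = executor.submit(route, task)
    try:
        future.result(timeout=timeout_s)
        return True
    except concurrent.futures.TimeoutError:
        # The caller can now alert, retry elsewhere, or trigger failover.
        return False


with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # A healthy route returns immediately; a hung one sleeps past the deadline.
    healthy = route_with_timeout(lambda t: None, "task-1", 0.5, pool)
    hung = route_with_timeout(lambda t: time.sleep(0.3), "task-2", 0.05, pool)

print(healthy, hung)
```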
Tight Coupling
# Agents can't function without orchestrator
class DependentAgent:
    def process(self, task):
        instructions = orchestrator.get_instructions(task)
        # Blocks if orchestrator unavailable
        return self.execute(instructions)
Resilience Patterns
Active-Passive Failover
┌────────────────┐    ┌────────────────┐
│  Orchestrator  │    │    Standby     │
│    (Active)    │───→│  Orchestrator  │
└────────────────┘    └────────────────┘
        ↓                     ↓
   [State Sync]           [Monitors]
        ↓                     ↓
   [Heartbeat] ←──────────────┘
If the active orchestrator fails:
1. Standby detects via heartbeat
2. Standby takes over
3. Clients redirect to standby
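The detection and takeover steps can be sketched as a standby that promotes itself once heartbeats have been silent for longer than a timeout. The `Standby` class, its method names, and the 3-second timeout are hypothetical; the clock is injectable so the behavior can be exercised without real waiting. The sketch does not address split-brain, where the active node is alive but unreachable, which real deployments usually handle with leases or fencing.

```python
import time
from typing import Callable


class Standby:
    """Hypothetical standby that promotes itself when heartbeats stop."""

    def __init__(self, heartbeat_timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.clock = clock
        self.last_heartbeat = clock()
        self.role = "standby"

    def on_heartbeat(self) -> None:
        # Called whenever a heartbeat from the active orchestrator arrives.
        self.last_heartbeat = self.clock()

    def check(self) -> str:
        # Step 1: detect silence. Step 2: take over.
        silent_for = self.clock() - self.last_heartbeat
        if self.role == "standby" and silent_for > self.heartbeat_timeout_s:
            self.role = "active"
        return self.role


# Drive the standby with a fake clock instead of sleeping.
now = [0.0]
standby = Standby(heartbeat_timeout_s=3.0, clock=lambda: now[0])

now[0] = 2.0
standby.on_heartbeat()
now[0] = 4.0
role_while_healthy = standby.check()   # 2 s of silence, within the timeout
now[0] = 6.5
role_after_failure = standby.check()   # 4.5 s of silence, past the timeout
print(role_while_healthy, role_after_failure)
```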
Distributed Orchestration
┌─────────────────────────────────────┐
│        Orchestrator Cluster         │
│   ┌───────┐  ┌───────┐  ┌───────┐   │
│   │ Node1 │  │ Node2 │  │ Node3 │   │
│   └───────┘  └───────┘  └───────┘   │
│        [Consensus Protocol]         │
└─────────────────────────────────────┘
Any node can handle requests
State replicated across nodes
Survives the loss of a minority of nodes (a 3-node cluster tolerates 1 failure)
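For clusters that replicate state through a majority-quorum consensus protocol (Raft and Paxos are common examples, named here as an assumption since no specific protocol is given), the cluster can only make progress while a majority of nodes is reachable. The helper names below are hypothetical; the arithmetic is the standard majority rule.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """Nodes that can fail while a majority remains available."""
    return nodes - quorum_size(nodes)


for n in (3, 4, 5):
    print(f"{n} nodes: quorum={quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Under this rule a 4-node cluster tolerates no more failures than a 3-node one, which is why odd cluster sizes are the usual choice.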
Autonomous Agent Fallback
import asyncio

class ResilientAgent:
    async def process(self, task):
        try:
            # Bound the wait so an unresponsive orchestrator cannot block the agent
            instructions = await asyncio.wait_for(
                self.orchestrator.get_instructions(task),
                timeout=5.0,
            )
            return await self.execute(instructions)
        except asyncio.TimeoutError:
            # Orchestrator unavailable - fall back to autonomous mode
            return await self.autonomous_process(task)
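To see the fallback path run end to end, the pattern can be exercised against stub orchestrators. Everything here is a hypothetical test harness: `HungOrchestrator`, `HealthyOrchestrator`, `DemoAgent`, and the 0.05-second timeout are stand-ins, and the string results only mark which path was taken.

```python
import asyncio


class HungOrchestrator:
    async def get_instructions(self, task):
        await asyncio.sleep(3600)  # never answers within any sensible timeout


class HealthyOrchestrator:
    async def get_instructions(self, task):
        return f"plan-for-{task}"


class DemoAgent:
    def __init__(self, orchestrator, timeout_s: float = 0.05):
        self.orchestrator = orchestrator
        self.timeout_s = timeout_s

    async def process(self, task):
        try:
            instructions = await asyncio.wait_for(
                self.orchestrator.get_instructions(task),
                timeout=self.timeout_s,
            )
            return f"coordinated:{instructions}"
        except asyncio.TimeoutError:
            # Orchestrator did not answer in time: act without it.
            return f"autonomous:{task}"


coordinated = asyncio.run(DemoAgent(HealthyOrchestrator()).process("summarize"))
fallback = asyncio.run(DemoAgent(HungOrchestrator()).process("summarize"))
print(coordinated)
print(fallback)
```

A timeout alone only catches an orchestrator that does not answer; an agent that also needs to survive connection errors would have to catch those separately.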