Overview
Graceful degradation failure occurs when a multi-agent system is designed as all-or-nothing: it is either fully functional or completely unavailable. When any single component fails, the entire system fails, even when partial operation would still be valuable.
Degradation Spectrum
Ideal Degradation:
100% ──────────────────────── Full functionality
80% ───────────────────── Some features disabled
60% ──────────────── Core features only
40% ─────────── Emergency mode
20% ──── Read-only
0% ─ Complete outage
Actual (No Degradation):
100% ──────────────────────── Full functionality
0% ─ Complete outage
Any failure → Total failure
Failure Cascade Patterns
Single Dependency Failure
System depends on Agent X
Agent X fails:
- Agent A: Can't get data from X → Fails
- Agent B: Can't route through X → Fails
- Agent C: Can't verify with X → Fails
- System: Complete outage
Should have been:
- Agent A: Uses cached data, marks as stale
- Agent B: Routes directly, skips optimization
- Agent C: Skips verification, flags for review
- System: Degraded but functional
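The "uses cached data, marks as stale" behavior for Agent A could look roughly like the sketch below. The names `CachedFetcher` and `Result` are hypothetical and only illustrate the idea; a real system would also need cache expiry and size limits.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class Result:
    value: Any
    stale: bool = False        # True when served from cache after a failure
    age_seconds: float = 0.0   # how old the cached value is


class CachedFetcher:
    """Hypothetical wrapper: fall back to the last good value when the fetch fails."""

    def __init__(self, fetch: Callable[[str], Any]):
        self._fetch = fetch
        self._cache: Dict[str, Tuple[Any, float]] = {}

    def get(self, key: str) -> Optional[Result]:
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._cache.get(key)
            if cached is None:
                return None  # nothing to degrade to; the caller must decide
            value, stored_at = cached
            return Result(value, stale=True,
                          age_seconds=time.monotonic() - stored_at)
        # Fresh value: remember it for future fallbacks.
        self._cache[key] = (value, time.monotonic())
        return Result(value)
```

Returning an explicit `stale` flag lets downstream agents decide whether stale data is acceptable for their task instead of failing outright.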
Chain Reaction
External API fails
↓
Agent A can't fetch data
↓
Agent B can't process (no input)
↓
Agent C times out waiting
↓
Orchestrator marks all as failed
↓
System declares outage
No agent attempted to work with partial information
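One way to break this chain is to let every stage substitute a fallback and pass along a record of what was degraded, rather than propagating the failure. The following is a minimal sketch; `StageOutput` and `run_stage` are illustrative names, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class StageOutput:
    data: Any
    degraded: List[str] = field(default_factory=list)  # stages that fell back


def run_stage(name: str,
              func: Callable[[Any], Any],
              fallback: Callable[[Any], Any],
              upstream: StageOutput) -> StageOutput:
    """Run one stage; on failure, use the fallback and record the stage name."""
    try:
        return StageOutput(func(upstream.data), list(upstream.degraded))
    except Exception:
        return StageOutput(fallback(upstream.data), upstream.degraded + [name])


def fetch(_):
    raise TimeoutError("external API unavailable")


def process(records):
    return [r.upper() for r in records]


out = StageOutput(None)
out = run_stage("fetch", fetch, lambda _: ["cached-a", "cached-b"], out)
out = run_stage("process", process, lambda records: records, out)
# out.data is ["CACHED-A", "CACHED-B"] and out.degraded is ["fetch"]
```

The orchestrator can then inspect `degraded` to report a degraded result instead of declaring an outage.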
Resource Exhaustion Cascade
Database slow (not down):
- Queries take 10x longer
- Thread pools exhaust
- Connections time out
- Retry storms amplify load
- System collapses
Should have been:
- Shed load
- Return cached results
- Queue non-critical requests
- Maintain critical path
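Retry storms in particular can be dampened by bounding the number of attempts and adding randomized backoff before falling back to cached results. The sketch below uses exponential backoff with full jitter; `call_with_backoff` and its parameters are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(func: Callable[[], T],
                      fallback: Callable[[], T],
                      max_attempts: int = 3,
                      base_delay: float = 0.1,
                      max_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Try func a bounded number of times, then return the fallback result."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                break  # out of attempts; do not sleep again
            # Full jitter: random delay up to the capped exponential bound,
            # so many clients do not retry in lockstep.
            bound = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, bound))
    return fallback()
```

The `sleep` parameter is injectable only so the delay logic can be tested without waiting.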
Anti-Patterns
Hard Dependencies
# Bad: Hard failure on any dependency
def process_request(request):
    data = external_api.fetch()  # Raises if unavailable
    enriched = enrichment_service.enrich(data)  # Raises
    validated = validation_agent.validate(enriched)  # Raises
    return validated  # All or nothing

# Good: Graceful degradation
def process_request(request):
    # Each step has a fallback, so one failed dependency does not abort the request
    data = external_api.fetch_or_default(DEFAULT_DATA)
    enriched = enrichment_service.enrich_if_available(data)
    validated = validation_agent.validate_or_flag(enriched)
    # Tell the caller which parts of the result are degraded
    return validated.with_degradation_status()
No Fallback Logic
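# Bad: Only one path
result = premium_agent.analyze(data)

# Good: Fallback chain
# Note: `or` only falls through on a falsy return value (e.g. None).
# This assumes each analyzer returns None on failure instead of raising.
result = (
    premium_agent.analyze(data)
    or standard_agent.analyze(data)
    or basic_analysis(data)
    or {"status": "analysis_unavailable", "data": data}
)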
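An `or` chain only falls through on falsy return values, so it does not help when an analyzer raises, and it would also discard a legitimate but empty result. A sketch of a fallback chain that handles exceptions explicitly follows; `first_successful` is a hypothetical helper, and the handlers stand in for the agents named above.

```python
from typing import Any, Callable, Iterable, Tuple


def first_successful(handlers: Iterable[Tuple[str, Callable[[Any], Any]]],
                     data: Any,
                     default: Any) -> Tuple[str, Any]:
    """Return (handler_name, result) from the first handler that succeeds."""
    for name, handler in handlers:
        try:
            result = handler(data)
        except Exception:
            continue  # this tier is unavailable; try the next one
        if result is not None:  # None means "no answer"; empty results are kept
            return name, result
    return "unavailable", default


def premium(data):
    raise ConnectionError("premium agent down")


def standard(data):
    return None  # the agent responded but had no analysis


def basic(data):
    return {"summary": f"{len(data)} items"}


tier, result = first_successful(
    [("premium", premium), ("standard", standard), ("basic", basic)],
    data=[1, 2, 3],
    default={"status": "analysis_unavailable"},
)
# tier is "basic" and result is {"summary": "3 items"}
```

Returning the tier name alongside the result makes the degradation visible to the caller.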
Eager Failure
# Bad: Fail immediately
if not all_agents_healthy():
    raise SystemUnavailable()

# Good: Assess impact
unhealthy = get_unhealthy_agents()
if can_operate_without(unhealthy):
    proceed_with_degradation(unhealthy)
else:
    partial_shutdown(unhealthy)
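The impact assessment itself can be as simple as checking the unhealthy agents against a list of critical ones. The sketch below assumes a static, hypothetical `CRITICAL_AGENTS` set; real systems may need per-request or per-feature criticality.

```python
from typing import Iterable, Set

# Hypothetical set of agents the system cannot run without.
CRITICAL_AGENTS: Set[str] = {"orchestrator", "auth_agent"}


def can_operate_without(unhealthy: Iterable[str]) -> bool:
    """True when none of the unhealthy agents is on the critical path."""
    return CRITICAL_AGENTS.isdisjoint(unhealthy)


def choose_mode(unhealthy: Iterable[str]) -> str:
    """Map the set of unhealthy agents to an operating mode."""
    unhealthy = set(unhealthy)
    if not unhealthy:
        return "full"
    return "degraded" if can_operate_without(unhealthy) else "partial_shutdown"
```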
Degradation Strategies
Feature Flags
class DegradedMode:
    def __init__(self):
        self.features = {
            "premium_analysis": True,
            "real_time_data": True,
            "personalization": True,
            "caching_only": False,
        }

    def degrade(self, failed_component):
        if failed_component == "analysis_agent":
            self.features["premium_analysis"] = False
        elif failed_component == "data_feed":
            self.features["real_time_data"] = False
            self.features["caching_only"] = True
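Flags are only useful if request handlers consult them. As a rough illustration, a data-access function could branch on the same `real_time_data` and `caching_only` keys; `fetch_data`, `live_fetch`, and `cache` are hypothetical names introduced for this sketch.

```python
from typing import Any, Callable, Dict


def fetch_data(key: str,
               features: Dict[str, bool],
               live_fetch: Callable[[str], Any],
               cache: Dict[str, Any]) -> Dict[str, Any]:
    """Pick a data source based on the current feature flags."""
    if features.get("real_time_data"):
        return {"value": live_fetch(key), "source": "live"}
    if features.get("caching_only") and key in cache:
        # Degraded path: serve the cached value and say so.
        return {"value": cache[key], "source": "cache", "stale": True}
    return {"value": None, "source": "unavailable"}
```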
Circuit Breakers
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.timeout = timeout  # seconds to stay OPEN before a trial call
        self.state = "CLOSED"
        self.last_failure = None

    async def call(self, func, fallback):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure > self.timeout:
                self.state = "HALF_OPEN"  # let one trial call through
            else:
                return await fallback()  # fail fast while the circuit is open
        try:
            result = await func()
        except Exception:
            self.failures += 1
            self.last_failure = time.monotonic()
            # A failed trial call in HALF_OPEN reopens the circuit immediately
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"
            return await fallback()
        # Success: reset the failure count and close the circuit
        self.failures = 0
        self.state = "CLOSED"
        return result
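The CLOSED, OPEN, and HALF_OPEN transitions are easier to follow with a fake clock. The sketch below is a simplified synchronous variant written for this walkthrough (the name `SimpleBreaker` and the injectable `clock` are assumptions), not a replacement for the async class above.

```python
import time
from typing import Any, Callable, Optional


class SimpleBreaker:
    """Synchronous circuit breaker with an injectable clock (for testing)."""

    def __init__(self, threshold: int = 2, timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.timeout = timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means CLOSED

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if self.clock() - self.opened_at >= self.timeout:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self.state == "OPEN":
            return fallback()  # fail fast without touching the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # (re)open and restart the timer
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result


now = [0.0]  # fake clock value, advanced manually
breaker = SimpleBreaker(threshold=2, timeout=30.0, clock=lambda: now[0])


def failing():
    raise ConnectionError("agent unreachable")


breaker.call(failing, lambda: "fallback")       # first failure, still CLOSED
breaker.call(failing, lambda: "fallback")       # threshold reached, now OPEN
now[0] = 31.0                                   # timeout elapsed, HALF_OPEN
breaker.call(lambda: "ok", lambda: "fallback")  # trial succeeds, CLOSED again
```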
Load Shedding
HIGH = 2  # assumed priority level; requests below it are shed at capacity

class LoadShedder:
    def __init__(self, capacity):
        self.capacity = capacity
        self.current_load = 0  # callers must update this as requests start and finish

    def should_accept(self, request):
        if self.current_load >= self.capacity:
            # Shed low-priority requests
            if request.priority < HIGH:
                return False, "System at capacity, try later"
        return True, None
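The class above never changes `current_load`, so something has to increment it when a request starts and decrement it when the request finishes. One possible approach is a context manager that does both, sketched below; `Priority`, `Rejected`, and `TrackingLoadShedder` are hypothetical names, and the counter is not thread-safe as written.

```python
from contextlib import contextmanager
from enum import IntEnum


class Priority(IntEnum):
    LOW = 0
    NORMAL = 1
    HIGH = 2


class Rejected(Exception):
    """Raised when a request is shed."""


class TrackingLoadShedder:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.current_load = 0

    @contextmanager
    def admit(self, priority: Priority):
        # At capacity, only high-priority requests are let through.
        if self.current_load >= self.capacity and priority < Priority.HIGH:
            raise Rejected("System at capacity, try later")
        self.current_load += 1
        try:
            yield
        finally:
            self.current_load -= 1  # always release, even if the request fails
```

A handler would wrap its work in `with shedder.admit(request_priority):` and translate `Rejected` into a "try later" response.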