Most production AI applications are compound systems rather than single model calls, so evaluating a model in isolation does not tell you how the application as a whole behaves.
Components
- Multiple models (different sizes/capabilities)
- Retrieval systems
- External tools
- Orchestration logic
- Guardrails
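To make the component list concrete, here is a minimal sketch of a compound system in Python. Every name in it (`retrieve`, `small_model`, `large_model`, `calculator`, `guardrail`, `run`) is a hypothetical stand-in invented for illustration, the "models" are stubs that echo retrieved text, and the routing rule is an arbitrary assumption rather than a recommended design.

```python
from typing import List

# Hypothetical toy corpus; a real system would query an index or database.
CORPUS = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]

def _terms(text: str) -> set:
    """Lowercase words with trailing punctuation stripped."""
    return {w.strip("?.,").lower() for w in text.split()}

def retrieve(query: str, k: int = 2) -> List[str]:
    """Retrieval system: rank documents by word overlap with the query."""
    q = _terms(query)
    scored = [(len(q & _terms(doc)), doc) for doc in CORPUS]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [doc for _, doc in ranked[:k]]

def small_model(query: str, docs: List[str]) -> str:
    """Stub for a cheaper model: answers from the top document only."""
    return docs[0] if docs else "I don't know."

def large_model(query: str, docs: List[str]) -> str:
    """Stub for a more capable model: uses all retrieved documents."""
    return " ".join(docs) if docs else "I don't know."

def calculator(expression: str) -> str:
    """External tool: evaluates 'a op b' for +, -, *."""
    left, op, right = expression.split()
    a, b = float(left), float(right)
    return str({"+": a + b, "-": a - b, "*": a * b}[op])

def guardrail(text: str) -> str:
    """Guardrail: block outputs containing a banned term (assumed policy)."""
    return "[blocked]" if "password" in text.lower() else text

def run(query: str) -> str:
    """Orchestration logic: route to the tool or to retrieval plus a model."""
    if query.startswith("calc:"):
        draft = calculator(query[len("calc:"):].strip())
    else:
        docs = retrieve(query)
        # Arbitrary routing assumption: short queries go to the small model.
        model = small_model if len(query.split()) <= 8 else large_model
        draft = model(query, docs)
    return guardrail(draft)

print(run("calc: 2 + 3"))
print(run("What is the capital of France?"))
```

Even in this toy version, the final output depends on the routing decision, the retrieval ranking, the model choice, and the guardrail, which is why the output of any single stage says little about the whole.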
Implications
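- End-to-end evaluation is needed; per-component scores alone are not enough
- Component interactions matter: a component that passes its own tests can still fail when combined with the others
- More failure modes: each component can fail on its own, and so can the handoffs between them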
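The sketch below illustrates these implications with a deliberately contrived example: a retriever and a model that each pass their own component check, yet fail when combined. All names (`retrieve`, `answer`, `eval_retrieval`, `eval_model`, `eval_end_to_end`) and the test case are hypothetical, and the stubs are assumptions chosen to expose the interaction, not a description of any real system.

```python
from typing import Dict, List

# Hypothetical corpus and test case, invented for illustration.
CORPUS: Dict[str, str] = {
    "d1": "The Eiffel Tower is in Paris.",
    "d2": "The capital of France is Paris.",
}
CASE = {
    "query": "What is the capital of France?",
    "gold_doc": "d2",
    "gold_answer": "Paris",
}

def retrieve(query: str) -> List[str]:
    """Toy retriever: finds the gold document but ranks it second."""
    return ["d1", "d2"]

def answer(query: str, doc_ids: List[str]) -> str:
    """Toy model: reads only the first document it is given."""
    text = CORPUS[doc_ids[0]] if doc_ids else ""
    return "Paris" if "capital" in text else "unknown"

def eval_retrieval(case: dict, k: int = 2) -> bool:
    """Component check: is the gold document in the top-k results?"""
    return case["gold_doc"] in retrieve(case["query"])[:k]

def eval_model(case: dict) -> bool:
    """Component check: is the model correct when handed the gold document?"""
    return answer(case["query"], [case["gold_doc"]]) == case["gold_answer"]

def eval_end_to_end(case: dict) -> bool:
    """System check: is the answer correct using the real retriever output?"""
    docs = retrieve(case["query"])
    return answer(case["query"], docs) == case["gold_answer"]

report = {
    "retrieval": eval_retrieval(CASE),
    "model": eval_model(CASE),
    "end_to_end": eval_end_to_end(CASE),
}
print(report)
```

Here both component checks pass while the end-to-end check fails, because the model's assumption (the relevant document comes first) does not match what the retriever delivers. Only the end-to-end check exercises that handoff.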