LLM-as-Judge patterns use a language model's reasoning capabilities to assess the quality, correctness, or appropriateness of agent outputs at scale.
Advantages
- Scalable evaluation without a human bottleneck
- Consistent criteria application
- Fast feedback loops
Limitations
- Potential for systematic biases
- May miss domain-specific nuances
- Can be gamed if evaluation criteria leak
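The pattern above can be sketched as a small judging loop: build a rubric prompt, send it to a judge model, and parse a structured verdict. This is a minimal illustration, not a definitive implementation; the prompt wording, the 1-5 rubric, and the `call_model` parameter are all assumptions, and the model call is injected as a stub so the parsing logic runs offline.

```python
# Minimal sketch of the LLM-as-Judge pattern. `call_model` is a stand-in
# for a real LLM API call (an assumption, not a specific library); it is
# injected so the scoring logic can be exercised without network access.
import json
import re

JUDGE_PROMPT = """You are an impartial evaluator. Score the RESPONSE to the TASK
on a 1-5 scale for correctness and helpfulness. Reply with JSON only:
{{"score": <int 1-5>, "rationale": "<one sentence>"}}

TASK: {task}
RESPONSE: {response}"""

def judge(task: str, response: str, call_model) -> dict:
    """Ask a judge model to grade `response`; return the parsed verdict."""
    raw = call_model(JUDGE_PROMPT.format(task=task, response=response))
    # Extract the first JSON object; judges sometimes wrap output in prose.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError(f"unparseable judge output: {raw!r}")
    verdict = json.loads(match.group(0))
    if not 1 <= int(verdict["score"]) <= 5:
        raise ValueError(f"score out of range: {verdict['score']}")
    return verdict

# Stubbed judge model so the example runs offline.
def fake_model(prompt: str) -> str:
    return 'Sure! {"score": 4, "rationale": "Mostly correct, minor omission."}'

verdict = judge("Summarize the report", "The report covers Q3 sales.", fake_model)
print(verdict["score"])  # prints 4 with the stub above
```

Swapping `fake_model` for a real API call is the only change needed in practice; validating the score range and tolerating prose around the JSON are small defenses against the systematic-bias and gaming limitations noted above.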