Evaluation

LLM-as-Judge

Definition

Using a large language model to evaluate another agent's outputs, replacing or supplementing human evaluation.

LLM-as-Judge patterns use the reasoning capabilities of language models to assess quality, correctness, or appropriateness of agent outputs at scale.
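
A minimal sketch of the pattern follows. The `call_llm` helper is a stand-in for whatever completion API is in use (it is assumed here, not a real library call), and the prompt wording and 1-5 scale are illustrative choices rather than a fixed standard.

```python
import json

# Judge prompt template; the rubric and output format are illustrative.
JUDGE_PROMPT = """You are an impartial evaluator. Rate the response below
against the criteria, then answer with JSON: {{"score": <1-5>, "reason": "..."}}.

Criteria: {criteria}
Task: {task}
Response to evaluate: {response}
"""

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model and return its text output."""
    raise NotImplementedError

def judge(task: str, response: str, criteria: str) -> dict:
    """Ask the judge model for a structured verdict and parse it."""
    raw = call_llm(JUDGE_PROMPT.format(criteria=criteria, task=task, response=response))
    return json.loads(raw)  # in practice, validate and retry on malformed output

# Example:
# verdict = judge(task="Summarize the article.",
#                 response=agent_output,
#                 criteria="Factually accurate, concise, covers key points.")
```

Requesting a structured verdict rather than free-form prose makes scores easy to aggregate across many outputs, which is where the scalability of the pattern comes from.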

Advantages

  • Scalable evaluation without human bottleneck
  • Consistent criteria application
  • Fast feedback loops

Limitations

  • Potential for systematic biases, e.g. position bias, where the judge favors whichever response it sees first (see the sketch after this list)
  • May miss domain-specific nuances
  • Can be gamed if evaluation criteria leak to the system being evaluated
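
One common mitigation for position bias, sketched below under the same assumptions as the earlier example: present two candidate responses in both orders and accept the verdict only when the judge agrees with itself. The prompt wording is illustrative and `call_llm` remains a hypothetical placeholder.

```python
# Pairwise judging with order swapping to control for position bias.
PAIR_PROMPT = """Which response better completes the task? Answer with a single
letter, 'A' or 'B'.

Task: {task}
Response A: {a}
Response B: {b}
"""

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to your model and return its text output."""
    raise NotImplementedError

def pairwise_judge(task: str, x: str, y: str) -> str:
    """Return 'x', 'y', or 'tie'; swapping the presentation order controls
    for the judge's tendency to favor a fixed position."""
    first = call_llm(PAIR_PROMPT.format(task=task, a=x, b=y)).strip()   # x shown as A
    second = call_llm(PAIR_PROMPT.format(task=task, a=y, b=x)).strip()  # y shown as A
    if first == "A" and second == "B":
        return "x"  # x preferred in both orderings
    if first == "B" and second == "A":
        return "y"  # y preferred in both orderings
    return "tie"  # inconsistent verdicts suggest position bias
```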