Overview
LLM-as-Judge uses an LLM to evaluate the quality of AI-generated content or agent behavior. Research such as the MT-Bench study shows that strong judge models can agree with human judgment about 85% of the time, exceeding typical human-to-human agreement (81%).
Evaluation Approaches
Direct Assessment (Point-wise)
The judge evaluates a single response against criteria:
Rate this response on a scale of 1-5 for:
- Accuracy: [1-5]
- Helpfulness: [1-5]
- Safety: [1-5]
Justification: ...
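As a minimal sketch of how this template can be used in practice, the snippet below fills it in and parses one score per criterion from the judge's reply. The `judge` callable is an assumption standing in for whatever LLM client you use; it is not a specific API.

```python
import re
from typing import Callable, Optional

POINTWISE_TEMPLATE = """Rate this response on a scale of 1-5 for:
- Accuracy: [1-5]
- Helpfulness: [1-5]
- Safety: [1-5]
Justification: ...

Question: {question}
Response: {response}"""

def pointwise_scores(question: str, response: str,
                     judge: Callable[[str], str]) -> dict[str, Optional[int]]:
    """Fill the point-wise template, call the judge, and parse one score per criterion."""
    reply = judge(POINTWISE_TEMPLATE.format(question=question, response=response))
    scores: dict[str, Optional[int]] = {}
    for criterion in ("Accuracy", "Helpfulness", "Safety"):
        # Expect lines such as "Accuracy: 4" in the judge's reply.
        match = re.search(rf"{criterion}\s*:\s*([1-5])", reply)
        scores[criterion.lower()] = int(match.group(1)) if match else None
    return scores
```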
Pairwise Comparison
The judge compares two responses:
Which response better answers the question?
A: [Response A]
B: [Response B]
Winner: A/B/Tie
Reasoning: ...
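The same pattern applies to pairwise comparison. The sketch below (again assuming a generic `judge` callable) builds the prompt and parses the Winner line from the reply:

```python
import re
from typing import Callable

PAIRWISE_TEMPLATE = """Which response better answers the question?

Question: {question}
A: {response_a}
B: {response_b}

Winner: A/B/Tie
Reasoning: ..."""

def pairwise_verdict(question: str, response_a: str, response_b: str,
                     judge: Callable[[str], str]) -> str:
    """Ask the judge for a pairwise verdict and return 'A', 'B', 'Tie', or 'Unparsed'."""
    reply = judge(PAIRWISE_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b))
    match = re.search(r"Winner\s*:\s*(A|B|Tie)", reply, re.IGNORECASE)
    return match.group(1).capitalize() if match else "Unparsed"
```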
Reference-Based
Compare against a gold-standard answer:
Reference answer: [Gold standard]
Generated answer: [Agent output]
How well does the generated answer match the reference?
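A reference-based judge can follow the same shape. In this sketch the explicit `Score: <1-5>` instruction is an addition of mine to make the reply easy to parse; the `judge` callable is again a placeholder for your client:

```python
import re
from typing import Callable, Optional

REFERENCE_TEMPLATE = """Reference answer: {reference}
Generated answer: {generated}

How well does the generated answer match the reference?
Answer with "Score: <1-5>" followed by a short justification."""

def reference_score(reference: str, generated: str,
                    judge: Callable[[str], str]) -> Optional[int]:
    """Score the generated answer against the gold standard on a 1-5 scale."""
    reply = judge(REFERENCE_TEMPLATE.format(reference=reference, generated=generated))
    match = re.search(r"Score\s*:\s*([1-5])", reply)
    return int(match.group(1)) if match else None
```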
Judge Prompt Design
Rubric-Based
Define explicit scoring criteria:
Score 5: Completely accurate with comprehensive detail
Score 4: Accurate with minor omissions
Score 3: Mostly accurate but missing key points
Score 2: Partially accurate with significant errors
Score 1: Mostly inaccurate or harmful
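One convenient way to manage a rubric like this (a sketch, not a required structure) is to keep it as data, so the same scoring anchors can be embedded in any judge prompt and logged alongside the resulting scores:

```python
# Scoring anchors from the rubric above, kept as data for reuse across prompts.
ACCURACY_RUBRIC = {
    5: "Completely accurate with comprehensive detail",
    4: "Accurate with minor omissions",
    3: "Mostly accurate but missing key points",
    2: "Partially accurate with significant errors",
    1: "Mostly inaccurate or harmful",
}

def rubric_block(rubric: dict[int, str]) -> str:
    """Render the rubric as 'Score N: ...' lines, highest score first."""
    return "\n".join(f"Score {score}: {desc}"
                     for score, desc in sorted(rubric.items(), reverse=True))
```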
Chain-of-Thought
Request reasoning before scoring:
First, analyze the response step by step.
Then, identify any issues or strengths.
Finally, provide your score.
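A practical detail with chain-of-thought judging is extracting the final score without picking up numbers that appear in the reasoning. The sketch below adds an explicit last-line format (my assumption, not part of the template above) and takes the last match:

```python
import re
from typing import Callable, Optional

COT_TEMPLATE = """First, analyze the response step by step.
Then, identify any issues or strengths.
Finally, provide your score on the last line as "Score: <1-5>".

Question: {question}
Response: {response}"""

def cot_score(question: str, response: str,
              judge: Callable[[str], str]) -> Optional[int]:
    """Take the last 'Score: N' in the reply so numbers inside the reasoning are ignored."""
    reply = judge(COT_TEMPLATE.format(question=question, response=response))
    matches = re.findall(r"Score\s*:\s*([1-5])", reply)
    return int(matches[-1]) if matches else None
```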
Multi-Agent Evaluation
For agent systems, judges evaluate:
- Planning quality: Are agent plans logical and complete?
- Tool use: Are tools invoked correctly?
- Coordination: Do agents collaborate effectively?
- Final output: Does the result meet requirements?
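One way to operationalize these dimensions (a sketch under my own assumed JSON format, not a standard schema) is to ask the judge for one score per dimension as JSON and validate the reply:

```python
import json
from typing import Callable, Optional

AGENT_TRACE_TEMPLATE = """Evaluate the following agent trace on a 1-5 scale for each dimension.
Reply with JSON only: {{"planning": n, "tool_use": n, "coordination": n, "final_output": n}}

Task: {task}
Trace: {trace}"""

DIMENSIONS = ("planning", "tool_use", "coordination", "final_output")

def judge_agent_trace(task: str, trace: str,
                      judge: Callable[[str], str]) -> Optional[dict]:
    """Request per-dimension scores as JSON and keep only the expected keys."""
    reply = judge(AGENT_TRACE_TEMPLATE.format(task=task, trace=trace))
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return None  # unparseable reply: flag for retry or human review
    return {dim: parsed.get(dim) for dim in DIMENSIONS}
```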
Known Biases
Position Bias
Judges may prefer the first or last option in pairwise comparisons. Mitigate by presenting each pair in both orders (or randomizing the order) and aggregating the two verdicts.
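A minimal sketch of that mitigation, reusing the hypothetical `pairwise_verdict` helper from the pairwise example above, judges both orderings and only accepts a winner when the two passes agree:

```python
from typing import Callable

def debiased_pairwise(question: str, response_a: str, response_b: str,
                      judge: Callable[[str], str]) -> str:
    """Judge both presentation orders; accept a winner only if the verdicts agree."""
    first = pairwise_verdict(question, response_a, response_b, judge)
    # Swap the presentation order, then map the verdict back to the original labels.
    swapped = pairwise_verdict(question, response_b, response_a, judge)
    second = {"A": "B", "B": "A"}.get(swapped, swapped)
    return first if first == second else "Tie"
```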
Length Bias
Longer responses often score higher regardless of quality. Mitigate with length-agnostic criteria, for example by instructing the judge not to reward verbosity.
Self-Enhancement
Models judge their own outputs more favorably. Use different model families for generation and judging.
Reliability Strategies
- Multiple judges: Average scores from multiple models
- Calibration sets: Validate judge accuracy on human-labeled data
- Confidence thresholds: Flag low-confidence judgments for human review
- Structured output: Use JSON schemas to ensure consistent scoring format
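The sketch below combines two of these strategies, multiple judges and a disagreement-based review flag, reusing the hypothetical `pointwise_scores` helper from the direct-assessment example; the disagreement threshold is an arbitrary assumption:

```python
from statistics import mean, pstdev
from typing import Callable, Sequence

def ensemble_accuracy(question: str, response: str,
                      judges: Sequence[Callable[[str], str]],
                      disagreement_threshold: float = 1.0) -> dict:
    """Average accuracy scores from several judges and flag high disagreement for review."""
    scores = []
    for judge in judges:
        parsed = pointwise_scores(question, response, judge)
        if parsed.get("accuracy") is not None:
            scores.append(parsed["accuracy"])
    if not scores:
        return {"score": None, "needs_human_review": True}
    spread = pstdev(scores)  # 0.0 when only one judge returned a score
    return {"score": mean(scores),
            "needs_human_review": spread > disagreement_threshold}
```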
Benchmarks
- JudgeBench: Evaluates judge models on challenging response pairs
- Judge Arena: Leaderboard tracking model performance on judging tasks
- MT-Bench: Multi-turn conversation evaluation
Many strong models (e.g., GPT-4o) perform only slightly better than random guessing on JudgeBench's challenging pairs, highlighting the difficulty of consistent judgment.