
LLM-as-Judge Pattern

Scalable quality assessment of agent outputs without human reviewers

Overview

The Challenge

Evaluating LLM agent outputs at scale is expensive with human reviewers, and traditional metrics cannot capture nuanced quality dimensions.

The Solution

Use a separate LLM (the "judge") to evaluate agent outputs against defined criteria, providing scalable, consistent quality assessment.

When to Use
  • High-volume output evaluation
  • Consistent scoring across large datasets
  • Rapid iteration on agent quality
  • Regression testing and benchmarking
When NOT to Use
  • Mission-critical decisions requiring human judgment
  • Highly subjective or creative evaluations
  • When judge model biases are not understood

Trade-offs

Advantages
  • Scalable to millions of evaluations
  • Consistent application of criteria
  • Much faster than human review
  • Can evaluate 24/7 without fatigue
Considerations
  • Judges have their own biases
  • May miss nuanced quality issues
  • Requires calibration against human judgment
  • Can be gamed by adversarial outputs

Deep Dive

Overview

LLM-as-Judge uses an LLM to evaluate the quality of AI-generated content or agent behavior. Research shows sophisticated judge models can align with human judgment up to 85%—higher than human-to-human agreement (81%).

Evaluation Approaches

Direct Assessment (Point-wise)

The judge evaluates a single response against criteria:

Rate this response on a scale of 1-5 for:
- Accuracy: [1-5]
- Helpfulness: [1-5]
- Safety: [1-5]

Justification: ...
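
A minimal sketch of point-wise judging, assuming a hypothetical call_judge() helper that wraps whatever LLM client you use. The judge is asked to return JSON so the scores can be parsed programmatically.

import json

def call_judge(prompt: str) -> str:
    # Hypothetical helper: replace with a call to your LLM provider.
    raise NotImplementedError

POINTWISE_PROMPT = """You are an impartial evaluator.
Rate the response below on a 1-5 scale for accuracy, helpfulness, and safety.
Return only JSON: {{"accuracy": int, "helpfulness": int, "safety": int, "justification": str}}

Question: {question}
Response: {response}"""

def judge_pointwise(question: str, response: str) -> dict:
    raw = call_judge(POINTWISE_PROMPT.format(question=question, response=response))
    return json.loads(raw)  # validate keys and score ranges before trusting the result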

Pairwise Comparison

The judge compares two responses:

Which response better answers the question?
A: [Response A]
B: [Response B]

Winner: A/B/Tie
Reasoning: ...
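
A pairwise judge under the same assumption (call_judge() stands in for your LLM client). Constraining the verdict to A, B, or Tie makes results easy to tally across a dataset.

import json

def call_judge(prompt: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around your LLM client

PAIRWISE_PROMPT = """Which response better answers the question?
Question: {question}
A: {a}
B: {b}
Return only JSON: {{"winner": "A" or "B" or "Tie", "reasoning": str}}"""

def judge_pairwise(question: str, a: str, b: str) -> str:
    verdict = json.loads(call_judge(PAIRWISE_PROMPT.format(question=question, a=a, b=b)))
    return verdict["winner"]  # "A", "B", or "Tie"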

Reference-Based

Compare against a gold-standard answer:

Reference answer: [Gold standard]
Generated answer: [Agent output]
How well does the generated answer match the reference?
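
A reference-based variant along the same lines (call_judge() is again a hypothetical wrapper). Asking for missing points as well as a match score makes the judgment easier to audit.

import json

def call_judge(prompt: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around your LLM client

REFERENCE_PROMPT = """Reference answer: {reference}
Generated answer: {generated}
How well does the generated answer match the reference?
Return only JSON: {{"match": int from 1 to 5, "missing_points": [str]}}"""

def judge_against_reference(reference: str, generated: str) -> dict:
    prompt = REFERENCE_PROMPT.format(reference=reference, generated=generated)
    return json.loads(call_judge(prompt))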

Judge Prompt Design

Rubric-Based

Define explicit scoring criteria:

Score 5: Completely accurate with comprehensive detail
Score 4: Accurate with minor omissions
Score 3: Mostly accurate but missing key points
Score 2: Partially accurate with significant errors
Score 1: Mostly inaccurate or harmful

Chain-of-Thought

Request reasoning before scoring:

First, analyze the response step by step.
Then, identify any issues or strengths.
Finally, provide your score.
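
One way to combine the rubric above with reasoning-first output, again assuming a hypothetical call_judge() wrapper: the rubric is injected into the prompt, and the JSON schema places "analysis" before "score" so the judge reasons before committing to a number.

import json

def call_judge(prompt: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around your LLM client

RUBRIC = """Score 5: Completely accurate with comprehensive detail
Score 4: Accurate with minor omissions
Score 3: Mostly accurate but missing key points
Score 2: Partially accurate with significant errors
Score 1: Mostly inaccurate or harmful"""

COT_PROMPT = """Use this rubric:
{rubric}

First, analyze the response step by step and note issues or strengths.
Then give a score. Return only JSON, with "analysis" before "score":
{{"analysis": str, "score": int from 1 to 5}}

Question: {question}
Response: {response}"""

def judge_with_rubric(question: str, response: str) -> dict:
    prompt = COT_PROMPT.format(rubric=RUBRIC, question=question, response=response)
    return json.loads(call_judge(prompt))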

Multi-Agent Evaluation

For agent systems, judges evaluate the following (see the sketch after this list):

  • Planning quality: Are agent plans logical and complete?
  • Tool use: Are tools invoked correctly?
  • Coordination: Do agents collaborate effectively?
  • Final output: Does the result meet requirements?
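
A sketch of scoring an agent trajectory on these four dimensions, once more assuming a hypothetical call_judge() wrapper. The plan, tool calls, and final output are serialized into the prompt, and one score per dimension is returned.

import json

def call_judge(prompt: str) -> str:
    raise NotImplementedError  # hypothetical wrapper around your LLM client

TRAJECTORY_PROMPT = """Evaluate this agent run on a 1-5 scale for each dimension:
planning, tool_use, coordination, final_output.

Task: {task}
Plan: {plan}
Tool calls: {tool_calls}
Final output: {output}

Return only JSON: {{"planning": int, "tool_use": int, "coordination": int, "final_output": int}}"""

def judge_trajectory(task: str, plan: str, tool_calls: list, output: str) -> dict:
    prompt = TRAJECTORY_PROMPT.format(
        task=task, plan=plan, tool_calls=json.dumps(tool_calls), output=output)
    return json.loads(call_judge(prompt))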

Known Biases

Position Bias

Judges may prefer the first or last option in pairwise comparisons. Mitigate by randomizing order and averaging.
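
A sketch of the swap-and-compare mitigation, assuming a pairwise judge function like judge_pairwise above is passed in as judge_fn. Each pair is judged in both orders, and only consistent verdicts count as a win.

def judge_both_orders(question: str, a: str, b: str, judge_fn) -> str:
    # judge_fn(question, first, second) returns "A", "B", or "Tie",
    # where "A"/"B" refer to the positions shown in the prompt.
    first = judge_fn(question, a, b)   # a shown first
    second = judge_fn(question, b, a)  # positions swapped: b shown first
    if first == "A" and second == "B":
        return "A"  # a preferred in both orders
    if first == "B" and second == "A":
        return "B"  # b preferred in both orders
    return "Tie"    # inconsistent or tied verdicts are treated as a tie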

Length Bias

Longer responses often score higher regardless of quality. Include length-agnostic criteria.
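
A small diagnostic, assuming you already have a batch of judged responses with their scores: a strong positive correlation between response length and judge score is a warning sign that the judge is rewarding verbosity rather than quality.

from statistics import correlation  # Python 3.10+

def length_score_correlation(responses: list, scores: list) -> float:
    # Pearson correlation between response length and judge score.
    # Values near +1 suggest length bias; investigate before trusting the scores.
    lengths = [float(len(r)) for r in responses]
    return correlation(lengths, scores)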

Self-Enhancement

Models judge their own outputs more favorably. Use different model families for generation and judging.

Reliability Strategies

  • Multiple judges: Average scores from multiple models (see the sketch after this list)
  • Calibration sets: Validate judge accuracy on human-labeled data
  • Confidence thresholds: Flag low-confidence judgments for human review
  • Structured output: Use JSON schemas to ensure consistent scoring format
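
A sketch combining two of these strategies, multiple judges plus a review threshold, assuming each judge model has already produced a 1-5 score for the same output. High disagreement between judges is treated as low confidence and routed to a human.

from statistics import mean, stdev

def ensemble_judgment(scores: list, disagreement_threshold: float = 1.0) -> dict:
    # scores: one 1-5 score per judge model for the same output.
    avg = mean(scores)
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {
        "score": avg,
        "needs_human_review": spread > disagreement_threshold,
    }

# Example: ensemble_judgment([4, 5, 2]) averages to ~3.7 and is flagged for review.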

Benchmarks

  • JudgeBench: Evaluates judge models on challenging response pairs
  • Judge Arena: Leaderboard tracking model performance on judging tasks
  • MT-Bench: Multi-turn conversation evaluation

Many strong models (e.g., GPT-4o) perform only slightly better than random on JudgeBench, highlighting the difficulty of consistent judgment.

Example Scenarios

Customer Support Quality

A support bot generates thousands of responses daily. An LLM judge scores each response for helpfulness, accuracy, and tone, flagging low-scoring responses for human review.

Outcome: Quality issues caught within hours instead of days; 95% reduction in human review burden.
Considerations

LLM judges exhibit their own biases. Use calibration data, multiple judges, and human spot-checks to ensure reliability.

Dimension Scores
  • Safety: 3/5
  • Accuracy: 4/5
  • Cost: 4/5
  • Speed: 4/5
Implementation
  • Complexity: simple

Implementation Checklist
  • Evaluation prompts
  • Calibration dataset
Tags: evaluation, quality, automated, llm, benchmarking
