Evaluation

Inter-Rater Reliability

What It Means

The degree to which different human evaluators agree when assessing the same agent outputs.

High inter-rater reliability indicates clear evaluation criteria. Low reliability suggests subjective or ambiguous standards.
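
A quick way to see why dedicated metrics are needed: raw percent agreement can look respectable even when much of it is luck. A minimal sketch in Python, using hypothetical pass/fail judgments (the labels and data are illustrative, not from this article):

```python
# Minimal sketch: raw percent agreement between two raters on six agent outputs.
rater_a = ["pass", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "pass", "pass"]

agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"Raw agreement: {agreement:.0%}")  # 67%, but some of that is chance
```

The metrics below correct for the agreement raters would reach by chance alone.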

Metrics

  • Cohen's Kappa: chance-corrected agreement between exactly two raters
  • Krippendorff's Alpha: handles any number of raters, missing ratings, and nominal through ratio scales
  • ICC (intraclass correlation coefficient): agreement on continuous or ordinal ratings (a computation sketch follows this list)
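
A minimal sketch of computing the first two metrics, assuming hypothetical 1-5 quality scores and the third-party packages scikit-learn and krippendorff (neither is named in this article); ICC can be computed from a long-format table with, for example, pingouin's intraclass_corr:

```python
# Sketch: chance-corrected agreement on hypothetical 1-5 quality scores.
import numpy as np
import krippendorff                       # pip install krippendorff
from sklearn.metrics import cohen_kappa_score

rater_a = [5, 4, 4, 2, 5, 3, 1, 4]
rater_b = [5, 4, 3, 2, 4, 3, 1, 4]

# Cohen's kappa: exactly two raters, agreement corrected for chance.
print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))

# Krippendorff's alpha: any number of raters; missing ratings are allowed as np.nan.
rater_c = [5, np.nan, 4, 2, 5, 3, 2, 4]
scores = np.array([rater_a, rater_b, rater_c], dtype=float)  # shape: (raters, items)
print("Krippendorff's alpha:",
      krippendorff.alpha(reliability_data=scores, level_of_measurement="ordinal"))
```

Both statistics run from roughly 0 (no better than chance) to 1 (perfect agreement), which makes them comparable across evaluation rounds in a way raw percent agreement is not.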

Improving Reliability

  • Write rubrics with explicit criteria and anchored examples for each score (a sketch follows this list)
  • Run calibration sessions in which raters score the same outputs and discuss disagreements before live evaluation
  • Use double-blind evaluation so raters see neither each other's scores nor which system produced the output
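
One way to keep a rubric unambiguous is to store it as structured data that every rater reads from; a minimal sketch, with hypothetical criterion text, anchors, and example IDs:

```python
# Sketch of an anchored rubric (criterion, anchors, and example IDs are hypothetical).
RUBRIC = {
    "criterion": "Task completion",
    "scale": {
        1: "Output ignores the task or is unusable.",
        3: "Output addresses the task but misses key requirements.",
        5: "Output fully satisfies the task with no corrections needed.",
    },
    # Reference outputs agreed on during calibration, so new raters can
    # compare borderline cases against a shared anchor.
    "anchor_examples": {5: ["output_017", "output_042"]},
}
```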
Tags: evaluation, human, reliability