Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach

Jodi M. Casabianca, Maggie Beiting-Parrish

Abstract (arXiv:2602.22585v1): Human evaluations play a central role in training and assessing AI models, yet these data are rarely treated as measurements subject to systematic error. This paper integrates psychometric rater models into the AI pipeline to improve the reliability and validity of conclusions drawn from human judgments. The paper reviews common rater effects, severity and centrality, that distort observed ratings, and demonstrates how item response theory rater models, particularly the multi-faceted Rasch model, can separate true output quality from rater behavior. Using the OpenAI summarization dataset as an empirical example, we show how adjusting for rater severity produces corrected estimates of summary quality and provides diagnostic insight into rater performance. Incorporating psychometric modeling into human-in-the-loop evaluation offers more principled and transparent use of human data, enabling developers to make decisions based on adjusted scores rather than raw, error-prone ratings. This perspective highlights a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.

Executive Summary

This article presents an innovative approach to human-in-the-loop evaluation of AI models by incorporating item response theory (IRT) rater models to correct for rater effects. The authors demonstrate the effectiveness of their method using the OpenAI summarization dataset, showing how adjusting for rater severity produces more accurate estimates of summary quality. The article highlights the importance of treating human evaluations as measurements subject to systematic error and proposes a more principled and transparent use of human data. The authors' approach has the potential to improve the reliability and validity of conclusions drawn from human judgments, enabling developers to make more informed decisions. This perspective offers a path toward more robust, interpretable, and construct-aligned practices for AI development and evaluation.
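
For orientation, the block below gives the standard many-facet Rasch (rating-scale) formulation that the paper's approach builds on. The notation is ours rather than the paper's, and the paper's exact parameterization (for instance, whether an additional item or criterion facet is included) may differ.

```latex
% Standard many-facet Rasch (rating-scale) formulation; notation is
% illustrative and may not match the paper's exact parameterization.
\[
  \log\!\left(\frac{P_{njk}}{P_{nj(k-1)}}\right)
  \;=\; \theta_n \;-\; \alpha_j \;-\; \tau_k
\]
% P_{njk}  : probability that rater j assigns category k to output n
% \theta_n : latent quality of output n
% \alpha_j : severity of rater j (larger = harsher ratings)
% \tau_k   : threshold between adjacent rating categories k-1 and k
```

Because rater severity enters as its own parameter, the estimated quality parameters are, in principle, comparable across raters of different harshness, which is what permits the severity-adjusted summary scores described above.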

Key Points

  • Human evaluations of AI models should be treated as measurements subject to systematic error rather than as error-free ground truth.
  • IRT rater models can correct for rater effects, such as severity and centrality, that distort observed ratings.
  • The multi-faceted Rasch model is a suitable IRT rater model for separating true output quality from rater behavior; a simplified numerical sketch of this separation follows this list.
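
To make the separation idea concrete, here is a minimal numerical sketch. It is not the paper's method: it simulates ratings on a continuous scale and uses a simple additive quality-plus-severity decomposition fit by alternating averages rather than the ordinal multi-faceted Rasch model, and every name and number in it is invented for illustration.

```python
import numpy as np

# Simplified sketch (not the paper's MFRM fit): simulate ratings that mix
# summary quality with rater severity, then recover both with an additive
# decomposition  rating ~ quality[summary] - severity[rater].
rng = np.random.default_rng(0)

n_summaries, n_raters = 200, 8
true_quality = rng.normal(0.0, 1.0, n_summaries)
true_severity = rng.normal(0.0, 0.5, n_raters)      # positive = harsher rater

# Each summary is scored by a random subset of raters (a sparse design).
rows, cols, ratings = [], [], []
for s in range(n_summaries):
    for r in rng.choice(n_raters, size=3, replace=False):
        rows.append(s)
        cols.append(r)
        ratings.append(true_quality[s] - true_severity[r] + rng.normal(0.0, 0.3))
rows, cols, ratings = map(np.asarray, (rows, cols, ratings))

# Alternating estimation of summary quality and rater severity.
quality = np.zeros(n_summaries)
severity = np.zeros(n_raters)
for _ in range(50):
    for s in range(n_summaries):
        m = rows == s
        quality[s] = np.mean(ratings[m] + severity[cols[m]])
    for r in range(n_raters):
        m = cols == r
        severity[r] = np.mean(quality[rows[m]] - ratings[m])
    severity -= severity.mean()                      # identifiability constraint

raw_means = np.array([ratings[rows == s].mean() for s in range(n_summaries)])
print("raw-mean correlation with true quality:",
      np.corrcoef(raw_means, true_quality)[0, 1])
print("adjusted correlation with true quality:",
      np.corrcoef(quality, true_quality)[0, 1])
```

With only three raters per summary, the raw per-summary means absorb whichever raters happened to score each summary; the adjusted estimates remove that severity component, which is the same logic the paper applies with a proper IRT model on ordinal ratings.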

Merits

Innovative Approach

The authors' integration of IRT rater models into the AI pipeline offers a novel and effective solution to the problem of rater effects.

Improved Reliability and Validity

By correcting for rater effects, the authors' approach improves the reliability and validity of conclusions drawn from human judgments.

Increased Transparency

The authors' method provides diagnostic insight into rater performance, enabling developers to make more informed decisions.
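
As a concrete, if deliberately crude, illustration of what such diagnostics can look like, the sketch below computes model-free proxies for the two rater effects named in the abstract: a per-rater mean deviation as a severity proxy and a per-rater spread as a centrality proxy. The raters and ratings are invented, and a full MFRM analysis would replace these proxies with model-based severity estimates and fit statistics.

```python
import numpy as np
import pandas as pd

# Model-free rater diagnostics sketch (not the paper's MFRM statistics).
# The data are invented: rater 3 is simulated as harsh, rater 4 as prone
# to central tendency (rarely straying from the scale midpoint).
rng = np.random.default_rng(1)
records = []
for rater in range(5):
    shift = -1.5 if rater == 3 else 0.0   # harsh rater scores lower on average
    spread = 0.4 if rater == 4 else 1.3   # central rater barely varies
    scores = np.clip(np.round(4 + shift + spread * rng.normal(size=120)), 1, 7)
    records += [{"rater": rater, "rating": float(s)} for s in scores]
data = pd.DataFrame(records)

overall_mean = data["rating"].mean()
diag = data.groupby("rater")["rating"].agg(["mean", "std", "count"])
diag["severity_proxy"] = overall_mean - diag["mean"]  # large positive = harsher than average
diag["centrality_proxy"] = diag["std"]                # small = clusters near the midpoint
print(diag.round(2))
```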

Demerits

Limited Scope

The empirical demonstration covers a single application (text summarization), so the findings may not generalize to other domains or evaluation tasks.

Data Requirements

The authors' method requires rating designs in which raters overlap across enough outputs to separate severity from quality, and such well-annotated datasets may not be readily available for all applications.

Computational Complexity

The multi-faceted Rasch model may be computationally intensive, requiring significant resources and expertise.

Expert Commentary

This article makes a significant contribution to the field of AI development and evaluation by highlighting the importance of correcting for rater effects in human-in-the-loop evaluation. The authors' innovative approach using IRT rater models offers a promising solution to this challenge. However, further research is needed to explore the generalizability of this approach to other domains and to address the computational complexity and data requirements associated with the multi-faceted Rasch model. Additionally, policymakers and regulatory bodies should take note of the implications of this research and develop frameworks to ensure the use of robust and transparent human evaluation methods in AI development and evaluation.

Recommendations

  • Further research should be conducted to explore the generalizability of the authors' approach to other domains.
  • Developers and policymakers should prioritize the use of IRT rater models to correct for rater effects in human-in-the-loop evaluation.
