
Robust AI Evaluation through Maximal Lotteries

arXiv:2602.21297v1 (Announce Type: new)

Abstract: The standard way to evaluate language models on subjective tasks is through pairwise comparisons: an annotator chooses the "better" of two responses to a prompt. Leaderboards aggregate these comparisons into a single Bradley-Terry (BT) ranking, forcing heterogeneous preferences into a total order and violating basic social-choice desiderata. In contrast, social choice theory provides an alternative approach called maximal lotteries, which aggregates pairwise preferences without imposing any assumptions on their structure. However, we show that maximal lotteries are highly sensitive to preference heterogeneity and can favor models that severely underperform on specific tasks or user subpopulations. We introduce robust lotteries that optimize worst-case performance under plausible shifts in the preference data. On large-scale preference datasets, robust lotteries provide more reliable win rate guarantees across the annotator distribution and recover a stable set of top-performing models. By moving from rankings to pluralistic sets of winners, robust lotteries offer a principled step toward an ecosystem of complementary AI systems that serve the full spectrum of human preferences.
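As a concrete illustration of the maximal-lottery idea the abstract contrasts with Bradley-Terry rankings, the sketch below computes a maximal lottery from a matrix of pairwise win rates. A maximal lottery is a maximin strategy of the symmetric zero-sum game whose payoff is the skew-symmetric win-rate margin; because that game has value zero, any feasible point of a small linear program is a maximal lottery. This is the standard construction rather than the paper's own code, the model names and win rates are hypothetical, and NumPy/SciPy are assumed to be available.

    # Minimal sketch: computing a maximal lottery from pairwise win rates.
    # Model names and win rates are hypothetical.
    import numpy as np
    from scipy.optimize import linprog

    models = ["model_a", "model_b", "model_c"]
    # win[i, j] = fraction of annotators preferring model i's response over model j's
    win = np.array([[0.5, 0.6, 0.3],
                    [0.4, 0.5, 0.7],
                    [0.7, 0.3, 0.5]])

    # Skew-symmetric margin matrix of the symmetric zero-sum game between models.
    M = win - win.T

    # A maximal lottery is a mixed strategy p with p >= 0, sum(p) = 1 and
    # M^T p >= 0; since the game value is 0, any feasible point of this LP works.
    n = len(models)
    res = linprog(c=np.zeros(n),                      # pure feasibility problem
                  A_ub=-M.T, b_ub=np.zeros(n),        # encodes M^T p >= 0
                  A_eq=np.ones((1, n)), b_eq=[1.0],   # probabilities sum to 1
                  bounds=[(0, None)] * n)

    for name, prob in zip(models, res.x):
        print(f"{name}: {prob:.3f}")

In this toy example the three models form a preference cycle, so no single winner exists and the lottery spreads probability across all of them (here 0.4, 0.4, 0.2); a single Bradley-Terry score per model cannot represent such a cycle.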

Executive Summary

This article introduces 'robust lotteries' as an alternative approach to evaluating language models on subjective tasks, addressing the limitations of aggregating pairwise comparisons into a single Bradley-Terry ranking. Maximal lotteries from social choice theory aggregate pairwise preferences without imposing assumptions on their structure, but the authors show they are highly sensitive to preference heterogeneity and can favor models that severely underperform on specific tasks or user subpopulations. Robust lotteries mitigate this by optimizing worst-case performance under plausible shifts in the preference data. On large-scale preference datasets, they provide more reliable win rate guarantees across the annotator distribution and recover a stable set of top-performing models, moving toward an ecosystem of complementary AI systems that serves the full spectrum of human preferences.

Key Points

  • Aggregating pairwise comparisons into a single Bradley-Terry ranking forces heterogeneous preferences into a total order and violates basic social-choice desiderata
  • Maximal lotteries aggregate pairwise preferences without structural assumptions, but are highly sensitive to preference heterogeneity and can favor models that underperform on specific tasks or user subpopulations
  • Robust lotteries optimize worst-case performance under plausible shifts in the preference data, yielding more reliable win rate guarantees and a stable set of top-performing models

Merits

Strength

Provides a principled, social-choice-grounded alternative to Bradley-Terry rankings that recovers a stable set of top-performing models

Innovative solution

Introduces the concept of robust lotteries as a novel approach to evaluating language models

Demerits

Limitation

Builds on maximal lotteries, which are highly sensitive to preference heterogeneity and can favor models that severely underperform on specific tasks or user subpopulations; the robust variant's guarantees depend on how well the assumed preference shifts capture that heterogeneity

Implementation challenge

Requires optimizing worst-case performance under plausible shifts in preference data, which may be computationally complex
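To make the shape of that optimization concrete, the sketch below gives one illustrative maximin formulation: pick the lottery that maximizes the worst-case margin guarantee over a finite set of plausible margin matrices, for example one estimated per annotator subgroup or task. This is an assumed formulation for illustration, not the paper's exact uncertainty set or algorithm, and the subgroup matrices are made up.

    # Illustrative sketch (not the authors' formulation): a worst-case lottery
    # over a finite set of plausible skew-symmetric margin matrices.
    import numpy as np
    from scipy.optimize import linprog

    def robust_lottery(margin_matrices):
        """max_{p, v} v  s.t.  M_k^T p >= v for every plausible matrix M_k,
                               p >= 0, sum(p) = 1."""
        n = margin_matrices[0].shape[0]
        c = np.concatenate([np.zeros(n), [-1.0]])   # variables [p, v]; minimize -v
        A_ub, b_ub = [], []
        for M in margin_matrices:
            for j in range(n):
                # (M^T p)_j >= v  <=>  -(M^T p)_j + v <= 0
                A_ub.append(np.concatenate([-M[:, j], [1.0]]))
                b_ub.append(0.0)
        A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
        res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * n + [(None, None)])
        return res.x[:n], -res.fun                  # lottery p, worst-case margin v

    # Two hypothetical annotator subgroups that disagree about three models.
    M1 = np.array([[0.0,  0.2, -0.4], [-0.2, 0.0,  0.4], [ 0.4, -0.4, 0.0]])
    M2 = np.array([[0.0, -0.3,  0.1], [ 0.3, 0.0, -0.2], [-0.1,  0.2, 0.0]])
    p, v = robust_lottery([M1, M2])
    print("robust lottery:", np.round(p, 3), "worst-case margin:", round(v, 3))

The number of constraints grows with the number of models times the number of plausible preference matrices, which hints at why richer uncertainty sets can make the optimization computationally demanding.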

Expert Commentary

The article presents a novel and timely response to the limitations of Bradley-Terry rankings in evaluating language models. Building on maximal lotteries, which aggregate pairwise preferences without imposing assumptions on their structure, the authors introduce robust lotteries that guard against the sensitivity of maximal lotteries to preference heterogeneity. This approach has significant implications for the development of more reliable AI evaluation metrics and for accommodating diverse human preferences in AI development. At the same time, the dependence of the guarantees on the assumed set of preference shifts, and the potential computational cost of the worst-case optimization, highlight the need for further research. As the AI ecosystem continues to evolve, the adoption of robust lotteries can contribute to a more reliable and pluralistic evaluation process.

Recommendations

  • Further research is needed to develop more effective optimization methods for worst-case performance under plausible shifts in preference data
  • The development of robust lotteries should be integrated into AI evaluation frameworks to promote more reliable and diverse model evaluations
