Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework
arXiv:2603.04409v1 Announce Type: new
Abstract: The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants, stratified across 22 demographic groups in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model with post-stratification to census data, and our analysis reveals three key insights. (1) We establish a clear performance hierarchy in which google/gemini-2.5-pro ranks first overall, with a 95.6% posterior probability of being the top-ranked model. (2) We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. (3) We quantify the vast difference in discriminative power across evaluation dimensions: ambiguous qualities like Trust, Ethics & Safety show a 65% tie rate, in stark contrast to the decisive 10% tie rate for Overall Winner. Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
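The paper's hierarchical Bayesian implementation is not reproduced here, but the core of a Bradley-Terry-Davidson model is the Davidson likelihood: a Bradley-Terry comparison extended with a tie parameter, which is exactly what lets the framework distinguish decisive dimensions (10% ties) from ambiguous ones (65% ties). The sketch below shows that likelihood alone; the strength values and tie parameter are illustrative, not estimates from the paper.

```python
import math

def btd_probs(pi_i: float, pi_j: float, nu: float):
    """Davidson (tie-aware) Bradley-Terry probabilities for one comparison.

    pi_i, pi_j: positive strength parameters for models i and j.
    nu: non-negative tie parameter; larger nu means ties are more likely.
    Returns (P(i wins), P(tie), P(j wins)), which sum to 1.
    """
    tie_term = nu * math.sqrt(pi_i * pi_j)
    z = pi_i + pi_j + tie_term
    return pi_i / z, tie_term / z, pi_j / z

# Illustrative comparison: model i twice as "strong" as model j.
# With a small tie parameter the outcome is usually decisive; raising
# nu shifts probability mass from wins/losses toward ties.
win, tie, loss = btd_probs(2.0, 1.0, 0.2)
```

In the full model, the strengths would be given hierarchical priors and allowed to vary by demographic group, which is how preference heterogeneity across, e.g., age bands is captured.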
Executive Summary
The article introduces the HUMAINE framework for evaluating large language models (LLMs) based on human preference, addressing limitations in existing methods. It presents a multidimensional, demographically aware approach, assessing 28 state-of-the-art models across five human-centric dimensions. The analysis reveals a performance hierarchy, significant preference heterogeneity, and varying discriminative power across evaluation dimensions, emphasizing the need for a more nuanced perspective in LLM evaluation.
Key Points
- ▸ Introduction of the HUMAINE framework for multidimensional, demographically aware LLM evaluation
- ▸ Establishment of a clear performance hierarchy among 28 state-of-the-art models, with google/gemini-2.5-pro ranked first overall (95.6% posterior probability of being top)
- ▸ Discovery of significant preference heterogeneity, with user age as the primary demographic axis of disagreement
- ▸ Quantification of large differences in discriminative power across dimensions, from a 65% tie rate for Trust, Ethics & Safety to 10% for Overall Winner
Merits
Comprehensive Evaluation Approach
The HUMAINE framework provides a comprehensive and nuanced approach to LLM evaluation, considering multiple dimensions and demographic factors.
Large-Scale Data Collection
The collection of multi-turn conversations from 23,404 participants across 22 demographic groups is a significant strength, offering a rich dataset for analysis.
Demerits
Potential Biases in Post-Stratification
Post-stratification to census data corrects for imbalance on the measured demographic cells, but it can introduce bias if those cells fail to capture the factors that actually drive preference, or if participants within a cell differ systematically (e.g. through self-selection) from the wider population that cell represents.
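The mechanics of post-stratification, and why cell choice matters, are simple to illustrate: cell-level estimates are reaveraged using census population shares instead of the sample's own shares. All cells, shares, and preference values below are invented for illustration, not taken from the paper.

```python
# Each demographic cell maps to (share of the sample, estimated
# preference for some model A within that cell). Illustrative only.
sample = {
    "18-34": (0.50, 0.70),
    "35-54": (0.30, 0.55),
    "55+":   (0.20, 0.40),
}

# Hypothetical census shares for the same cells.
census_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

# Naive estimate: weight each cell by its share of the sample.
raw = sum(share * pref for share, pref in sample.values())

# Post-stratified estimate: weight each cell by its census share.
poststrat = sum(census_share[c] * pref for c, (_, pref) in sample.items())

# Because younger users are oversampled and prefer model A more,
# the raw estimate (0.595) overstates the post-stratified one (0.5425).
```

Note what this correction cannot do: if the 55+ participants who sign up for the study differ from typical 55+ users, reweighting by age share alone will not remove that bias.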
Limited Generalizability
The study focuses on the US and UK, which may limit the generalizability of the findings to other regions or cultural contexts.
Expert Commentary
The article presents a significant contribution to the field of LLM evaluation, highlighting the importance of considering human preference and demographic factors in AI development. The HUMAINE framework offers a comprehensive and nuanced approach, which can inform the development of more effective and user-friendly LLMs. However, it is crucial to address potential limitations and biases, ensuring the generalizability and fairness of the evaluation approach. The implications of this research extend beyond the technical realm, emphasizing the need for regulatory frameworks and standards that prioritize demographically aware and multidimensional evaluation in AI development.
Recommendations
- ✓ Future research should focus on addressing potential biases and limitations in the HUMAINE framework, exploring its applicability in diverse cultural and regional contexts.
- ✓ Developers and regulators should prioritize the integration of demographically aware and multidimensional evaluation approaches in AI development, ensuring fairness, transparency, and accountability.