Quantifying construct validity in large language model evaluations
arXiv:2602.15532v1 Announce Type: new Abstract: The LLM community often reports benchmark results as if they are synonymous with general model capabilities. However, benchmarks can have problems that distort performance, like test set contamination and annotator error. How can we know that a benchmark is a reliable indicator of some capability that we want to measure? This question concerns the construct validity of LLM benchmarks, and it requires separating benchmark results from capabilities when we model and predict LLM performance. Both social scientists and computer scientists propose formal models - latent factor models and scaling laws - for identifying the capabilities underlying benchmark scores. However, neither technique is satisfactory for construct validity. Latent factor models ignore scaling laws, and as a result, the capabilities they extract often proxy model size. Scaling laws ignore measurement error, and as a result, the capabilities they extract are both uninterpretable and overfit to the observed benchmarks. This thesis presents the structured capabilities model, the first model to extract interpretable and generalisable capabilities from a large collection of LLM benchmark results. I fit this model and its two alternatives on a large sample of results from the OpenLLM Leaderboard. Structured capabilities outperform latent factor models on parsimonious fit indices, and exhibit better out-of-distribution benchmark prediction than scaling laws. These improvements are possible because neither existing approach separates model scale from capabilities in the appropriate way. Model scale should inform capabilities, as in scaling laws, and these capabilities should inform observed results up to measurement error, as in latent factor models. In combining these two insights, structured capabilities demonstrate better explanatory and predictive power for quantifying construct validity in LLM evaluations.
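The combined structure the abstract describes (model scale informs a latent capability, and the capability informs each benchmark score only up to measurement error) can be sketched as a small generative model. The snippet below is a minimal illustration under assumed forms: a single capability factor, a linear link in log scale, and a sigmoid mapping to accuracies. None of these choices, nor the variable names, come from the thesis itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_benchmarks = 200, 6

# Hypothetical model sizes: log10 of parameter count.
log_scale = rng.uniform(8, 11, n_models)

# Scaling-law link: scale informs the latent capability,
# up to a model-specific deviation (some models punch above their size).
capability = -3.0 + 0.4 * log_scale + rng.normal(0, 0.1, n_models)

# Factor-model link: each benchmark reflects the capability through its own
# loading and difficulty, and is observed only up to measurement error.
loadings = rng.uniform(0.5, 1.5, n_benchmarks)
difficulty = rng.normal(0, 0.5, n_benchmarks)
error = rng.normal(0, 0.2, (n_models, n_benchmarks))
logits = difficulty + np.outer(capability, loadings) + error

# Observed leaderboard-style accuracies in (0, 1).
scores = 1 / (1 + np.exp(-logits))
print(scores.shape)  # (200, 6)
```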
Executive Summary
This article presents the structured capabilities model, a novel approach to quantifying construct validity in large language model evaluations. The author argues that existing methods fail to separate model scale from capabilities in the appropriate way: latent factor models ignore scaling laws, so the capabilities they extract often proxy model size, while scaling laws ignore measurement error, so the capabilities they extract are uninterpretable and overfit to the observed benchmarks. The structured capabilities model combines the two insights, letting scale inform capabilities and letting capabilities inform observed results up to measurement error. Fitted to a large sample of results from the OpenLLM Leaderboard, it outperforms latent factor models on parsimonious fit indices (one such index is sketched below) and beats scaling laws on out-of-distribution benchmark prediction.
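The "parsimonious fit indices" mentioned above are criteria that reward likelihood while penalising free parameters. The abstract does not name which indices the thesis uses, so the sketch below illustrates the idea with BIC and made-up numbers only.

```python
import numpy as np

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion (lower is better).
    The parameter-count penalty lets a restricted, structured model
    beat a looser factor model that fits the data slightly better."""
    return n_params * np.log(n_obs) - 2.0 * log_likelihood

# Hypothetical comparison: similar fit, far fewer parameters.
print(bic(-950.0, n_params=20, n_obs=1200))  # structured capabilities, ~2041.8
print(bic(-940.0, n_params=60, n_obs=1200))  # unrestricted factors,    ~2305.4
```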
Key Points
- ▸ The LLM community often reports benchmark results as if they were synonymous with general model capabilities, even though problems such as test set contamination and annotator error can distort measured performance.
- ▸ Existing methods fail to separate model scale from capabilities: latent factor models extract capabilities that often proxy model size, while scaling laws extract capabilities that are uninterpretable and overfit to the observed benchmarks.
- ▸ The structured capabilities model combines the two insights: model scale informs capabilities, as in scaling laws, and capabilities inform observed results only up to measurement error, as in latent factor models, yielding a more accurate and interpretable basis for quantifying construct validity.
Merits
Improve Construct Validity
The structured capabilities model yields interpretable capability estimates, enabling researchers to quantify how reliably a benchmark reflects the capability it is intended to measure.
Enhance Predictive Power
The model exhibits better out-of-distribution benchmark prediction than scaling laws, supporting more accurate forecasts of LLM performance on benchmarks and at model scales outside the fitting sample (see the sketch below).
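To make the out-of-distribution claim concrete: fit on one regime of models and predict benchmark scores in another, for example at larger scales. The snippet below is a minimal sketch of that evaluation loop under the same assumed generative structure as the earlier sketch; the thesis's actual estimator and evaluation split are not described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_scores(log_scale):
    # Assumed structure: scale -> capability -> benchmark accuracy + noise.
    capability = -3.0 + 0.4 * log_scale + rng.normal(0, 0.1, log_scale.size)
    logits = 0.2 + 1.1 * capability + rng.normal(0, 0.2, log_scale.size)
    return 1 / (1 + np.exp(-logits))

# Out-of-distribution split: fit on smaller models, predict larger ones.
log_scale_train = rng.uniform(8, 10, 150)
log_scale_test = rng.uniform(10, 11, 50)
y_train, y_test = simulate_scores(log_scale_train), simulate_scores(log_scale_test)

# Fit a sigmoid-in-log-scale curve (scaling-law style) and extrapolate.
logit = lambda p: np.log(p / (1 - p))
slope, intercept = np.polyfit(log_scale_train, logit(y_train), 1)
pred = 1 / (1 + np.exp(-(slope * log_scale_test + intercept)))

print("held-out RMSE:", np.sqrt(np.mean((pred - y_test) ** 2)))
```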
Demerits
Limited Evaluation Dataset
The model is evaluated on a large sample of results from the OpenLLM Leaderboard, but it is unclear whether the findings generalize to other leaderboards, benchmark suites, or model families.
Lack of Theoretical Justification
The article does not provide a detailed theoretical justification for the structured capabilities model, making it difficult to understand the underlying assumptions and mechanisms.
Expert Commentary
The structured capabilities model provides a promising new approach for evaluating and validating LLMs. However, further research is needed to fully understand the underlying assumptions and mechanisms, and to evaluate the model's performance on a wider range of datasets and LLMs. Additionally, the article highlights the importance of considering construct validity in LLM evaluations, and the need for more nuanced and accurate measures of performance. Overall, the article makes a significant contribution to the field of LLM research and development.
Recommendations
- ✓ Develop and evaluate the structured capabilities model on a wider range of datasets and LLMs to improve its robustness and generalizability.
- ✓ Provide a detailed theoretical justification for the structured capabilities model, including a clear explanation of the underlying assumptions and mechanisms.