Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

Michael Hardy

arXiv:2603.04820v1 Abstract: Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed-effects meta-regression. We quantitatively illustrate that the level of difficulty for human experts to perform the task of scoring written work of children has no observed statistical effect on LLM performance. Particularly, we show that some scoring tasks measured as the easiest by human scorers were the hardest for LLMs. Whether by poor implementation by thoughtful researchers or patterns traceable to autoregressive training, on average decoder-only architectures underperform encoders by 0.37--a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology on successful scoring such as tokenizer vocabulary size, which exhibits diminishing returns--potentially due to undertrained tokens. Findings argue for systems design which better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
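
To make the headline metric concrete, here is a minimal sketch of computing QWK between a human rater and a model using scikit-learn's `cohen_kappa_score`. The score vectors are invented for illustration and are not data from the paper:

```python
# Minimal sketch: Quadratic Weighted Kappa (QWK) between a human rater
# and a model. The scores below are made-up examples on a 0-3 rubric.
from sklearn.metrics import cohen_kappa_score

human_scores = [0, 1, 2, 2, 3, 1, 0, 3, 2, 1]  # hypothetical human ratings
model_scores = [0, 1, 2, 3, 3, 2, 0, 2, 2, 1]  # hypothetical LLM ratings

# weights="quadratic" penalizes disagreements by the squared distance
# between score categories, which is what makes this kappa "quadratic".
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # 1.0 = perfect agreement, 0 = chance-level
```

Because disagreement is penalized quadratically, a 0.37 gap in QWK between architectures reflects a large difference in how far model scores drift from human scores, not merely how often they differ.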

Executive Summary

This article presents a meta-analytic study of automated short-answer scoring with Large Language Models (LLMs). The study pools 890 results from a systematic review of the LLM short-answer scoring literature, modeling Quadratic Weighted Kappa (QWK) with mixed-effects meta-regression, and finds that decoder-only architectures underperform encoder models by 0.37 QWK on average, a substantial difference in agreement with human scorers. It also shows that some scoring tasks rated easiest by human scorers are among the hardest for LLMs, and that task difficulty for human experts has no observed statistical effect on LLM performance. The authors argue for systems design that anticipates the known statistical shortcomings of autoregressive models, and they report experiments demonstrating racial bias in high-stakes education contexts. The findings carry significant implications for the development of LLM-based scoring systems.

Key Points

  • Decoder-only architectures underperform encoder models by 0.37 in Quadratic Weighted Kappa (QWK) on average
  • Some scoring tasks rated easiest by human scorers are among the hardest for LLMs, and human-judged difficulty has no observed effect on LLM performance
  • LLMs demonstrate racial bias in high-stakes education contexts
  • Systems design should anticipate the known statistical shortcomings of autoregressive models, including wording and tokenization sensitivity (see the tokenizer sketch after this list)
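
As a concrete illustration of the wording and tokenization sensitivity the paper reports, the sketch below shows how a paraphrase with child-like spelling fragments into different subword tokens. The tokenizer (GPT-2's, as a stand-in) and the example answers are assumptions for illustration, not the paper's setup:

```python
# Illustrative sketch of wording/tokenization sensitivity: two paraphrases
# of the same student answer can tokenize very differently, one mechanism
# by which surface wording changes what the model actually sees.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model choice

answers = [
    "Plants make food from sunlight through photosynthesis.",
    "Thru photosynthesis, plants make food frm sunlight.",  # child-like spelling
]

for text in answers:
    tokens = tokenizer.tokenize(text)
    print(len(tokens), tokens)
# Misspellings tend to split into more, rarer subword tokens, so two answers
# with identical meaning can present very different inputs to the scorer.
```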

Merits

Statistical Rigor

The study employs a mixed-effects meta-regression over 890 results, giving its conclusions more statistical weight than any single-system evaluation and allowing moderators such as architecture and tokenizer vocabulary size to be quantified.
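
To show the shape of such an analysis, here is a hedged sketch of a mixed-effects meta-regression over per-result QWK values using statsmodels. The file name, column names, and moderators are hypothetical and do not reflect the authors' actual pipeline:

```python
# Hedged sketch of a mixed-effects meta-regression over reported QWK values,
# in the spirit of the paper's design. All names here are placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("qwk_results.csv")  # hypothetical: one row per reported QWK

# Fixed effects: architecture (encoder vs decoder) and log vocabulary size.
# Random intercepts group the individual results by their originating study.
model = smf.mixedlm(
    "qwk ~ C(architecture) + log_vocab_size",
    data=df,
    groups=df["study_id"],
)
fit = model.fit()
print(fit.summary())
```

A full meta-regression would also weight each result by its sampling variance; this sketch omits that step for brevity.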

Practical Implications

The decoder-versus-encoder gap and the tokenizer findings bear directly on how LLM-based scoring systems should be designed, and they motivate further research before such systems are deployed in high-stakes settings.

Demerits

Limited Generalizability

A meta-analysis is only as representative as the studies it pools; the findings may not generalize to scoring systems, languages, or rubrics outside the reviewed literature, and further research is needed to validate the results.

Lack of Transparency

Although the abstract notes that code and data are available, the methodological details summarized here (such as study inclusion criteria and moderator coding) are not spelled out, which limits how readily the analysis can be reproduced from this article alone.

Expert Commentary

The study provides a comprehensive examination of LLM performance on short-answer scoring. Its most striking result is the dissociation between human and model difficulty: tasks that human scorers find easiest can be the hardest for LLMs, which undercuts the assumption that LLM scoring quality can be predicted from human scoring effort. Together with the reported racial bias, this argues for systems design that anticipates the statistical shortcomings of autoregressive models, and for policymakers to weigh these limitations before deploying LLM-based scoring in education, prioritizing transparency, accountability, and fairness.

Recommendations

  • Validate LLM-based scoring systems for accuracy and audit them for wording sensitivity and demographic bias before high-stakes deployment
  • Prioritize transparency, accountability, and fairness in the development of LLM-based scoring systems

Sources

  • arXiv:2603.04820v1