Academic

From Feature-Based Models to Generative AI: Validity Evidence for Constructed Response Scoring

arXiv:2603.19280v1. Abstract: The rapid advancements in large language models and generative artificial intelligence (AI) capabilities are making their broad application in the high-stakes testing context more likely. Use of generative AI in the scoring of constructed responses is particularly appealing because it reduces the effort required for handcrafting features in traditional AI scoring and might even outperform those methods. The purpose of this paper is to highlight the differences in the feature-based and generative AI applications in constructed response scoring systems and propose a set of best practices for the collection of validity evidence to support the use and interpretation of constructed response scores from scoring systems using generative AI. We compare the validity evidence needed in scoring systems using human ratings, feature-based natural language processing AI scoring engines, and generative AI. The evidence needed in the generative AI context is more extensive than in the feature-based scoring context because of the lack of transparency and other concerns unique to generative AI such as consistency. Constructed response score data from a large corpus of independent argumentative essays written by 6-12th grade students demonstrate the collection of validity evidence for different types of scoring systems and highlight the numerous complexities and considerations when making a validity argument for these scores.

Executive Summary

This article examines the shift from feature-based models to generative AI in constructed response scoring and argues that scores from generative AI systems demand a more comprehensive validation framework. The authors propose best practices for collecting validity evidence and illustrate the complexity of the task with a large corpus of argumentative essays written by 6-12th grade students. They emphasize challenges specific to generative AI, notably its lack of transparency and its potential inconsistency across scorings. The findings bear directly on the design and deployment of high-stakes testing systems in educational settings and offer a starting framework for establishing the reliability and validity of AI-generated scores.
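Concretely, one core piece of validity evidence shared across all three system types the paper compares is human-machine agreement. As an illustration (not code from the paper), the sketch below computes quadratic weighted kappa, a standard agreement statistic for ordinal essay scores, assuming integer ratings on a hypothetical 1-6 rubric:

```python
# Illustrative sketch: human-machine agreement as validity evidence.
# The scores below are made up and NOT from the paper's corpus.
from sklearn.metrics import cohen_kappa_score

human_scores   = [3, 4, 2, 5, 4, 3, 1, 6, 4, 2]  # human rater
machine_scores = [3, 4, 3, 5, 5, 3, 2, 6, 4, 2]  # AI scoring engine

# Quadratic weighting penalizes distant disagreements more heavily than
# adjacent ones, which suits ordinal rubric scores.
qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {qwk:.3f}")
```

Agreement statistics like this are necessary for any scoring system, but as the authors argue they are not sufficient in the generative AI context, where transparency and consistency also need direct evidence.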

Key Points

  • The shift from feature-based to generative AI in constructed response scoring raises concerns about validity and reliability.
  • Generative AI systems require more extensive validity evidence than feature-based engines because of their lack of transparency and concerns such as scoring consistency (see the sketch after this list).
  • The authors propose best practices for collecting this evidence and demonstrate them with score data from human raters, a feature-based NLP scoring engine, and a generative AI system.
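The consistency concern in the second point can be probed empirically by re-scoring the same responses several times. The sketch below is an illustration, not the paper's procedure; score_essay() is a hypothetical stand-in for a generative model call, stubbed here so the example runs:

```python
# Illustrative consistency check for a generative AI scorer.
import random
import statistics

def score_essay(essay: str) -> int:
    """Hypothetical generative-AI scoring call (stubbed with noise)."""
    return random.choice([3, 3, 3, 4])  # simulates occasional score drift

def consistency_report(essays: list[str], n_runs: int = 5) -> None:
    """Re-score each essay n_runs times and report stability statistics."""
    for essay in essays:
        scores = [score_essay(essay) for _ in range(n_runs)]
        identical = all(s == scores[0] for s in scores)
        print(f"scores={scores} identical={identical} "
              f"spread={max(scores) - min(scores)} "
              f"sd={statistics.pstdev(scores):.2f}")

consistency_report(["Essay text A...", "Essay text B..."])
```

A feature-based engine is deterministic for a given response, so repeated-scoring evidence of this kind is largely specific to the generative AI context.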

Merits

Strength in Addressing Critical Gap

The article addresses a critical gap in research on AI-generated scores by offering a concrete framework for collecting the validity and reliability evidence such scores require in high-stakes use.

Robust Methodology

The study grounds its argument empirically, using a large corpus of independent argumentative essays from 6-12th grade students to demonstrate the complexities of collecting validity evidence across human, feature-based, and generative AI scoring systems.

Demerits

Limitation in Generalizability

The findings may not generalize to all educational settings: the dataset consists solely of independent argumentative essays by 6-12th grade students, leaving other genres, grade levels, and item types untested.

Need for Further Research

While the article provides a framework for collecting validity evidence, further research is needed to understand the full implications of generative AI for high-stakes testing systems.

Expert Commentary

The article makes a significant contribution by giving practitioners a concrete framework for establishing the validity and reliability of AI-generated scores. Its acknowledged limitations, and the open questions it leaves, underscore how complex the issue remains. As generative AI capabilities evolve, the testing community will need to keep refining best practices for their use in high-stakes settings; this article is a credible starting point for that work, with direct implications for educational policies and guidelines.

Recommendations

  • Develop and implement policies and guidelines for the use of generative AI in high-stakes testing systems.
  • Conduct further research on the implications of generative AI for student outcomes and educational equity.

Sources

Original: arXiv - cs.AI