Optimizing In-Context Demonstrations for LLM-based Automated Grading
arXiv:2603.00465v1 Announce Type: new Abstract: Automated assessment of open-ended student responses is a critical capability for scaling personalized feedback in education. While large language models (LLMs) have shown promise in grading tasks via in-context learning (ICL), their reliability is heavily dependent on the selection of few-shot exemplars and the construction of high-quality rationales. Standard retrieval methods typically select examples based on semantic similarity, which often fails to capture the subtle decision boundaries required for rubric adherence. Furthermore, manually crafting the expert rationales needed to guide these models can be a significant bottleneck. To address these limitations, we introduce GUIDE (Grading Using Iteratively Designed Exemplars), a framework that reframes exemplar selection and refinement in automated grading as a boundary-focused optimization problem. GUIDE operates on a continuous loop of selection and refinement, employing novel contrastive operators to identify "boundary pairs" that are semantically similar but possess different grades. We enhance exemplars by generating discriminative rationales that explicitly articulate why a response receives a specific score to the exclusion of adjacent grades. Extensive experiments across datasets in physics, chemistry, and pedagogical content knowledge demonstrate that GUIDE significantly outperforms standard retrieval baselines. By focusing the model's attention on the precise edges of the rubric, our approach shows exceptionally robust gains on borderline cases and improved rubric adherence. GUIDE paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
Executive Summary
This study proposes GUIDE, a novel framework for optimizing in-context demonstrations in large language model (LLM)-based automated grading. GUIDE addresses the limitations of standard retrieval methods by reframing exemplar selection and refinement as a boundary-focused optimization problem. The framework operates on a continuous loop of selection and refinement, employing contrastive operators to identify 'boundary pairs' that are semantically similar but possess different grades. Extensive experiments demonstrate that GUIDE significantly outperforms standard retrieval baselines, achieving exceptionally robust gains on borderline cases and improved rubric adherence. This approach paves the way for trusted, scalable assessment systems that align closely with human pedagogical standards.
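Although the paper does not publish its prompt template, the role of discriminative rationales in an ICL grading prompt can be sketched as follows. All function names, field names, and the prompt layout here are hypothetical illustrations under the summary's description, not the authors' actual API or format.

```python
def format_exemplar(question, response, score, rationale):
    """Render one few-shot exemplar. Per the paper's idea, the rationale
    should state why this score applies *and* why adjacent scores do not."""
    return (f"Question: {question}\n"
            f"Response: {response}\n"
            f"Score: {score}\n"
            f"Rationale: {rationale}\n")

def build_grading_prompt(rubric, exemplars, target):
    """Assemble a grading prompt: rubric, then refined exemplars with
    discriminative rationales, then the ungraded response (hypothetical layout)."""
    parts = [f"Rubric:\n{rubric}\n"]
    parts += [format_exemplar(**ex) for ex in exemplars]
    parts.append(f"Question: {target['question']}\n"
                 f"Response: {target['response']}\n"
                 "Score:")
    return "\n".join(parts)

# Toy usage: one exemplar whose rationale rules out the adjacent grade.
prompt = build_grading_prompt(
    rubric="3 = correct force diagram with units; 2 = correct diagram, no units.",
    exemplars=[{
        "question": "Draw the forces on a sliding block.",
        "response": "Gravity 9.8 N down, normal force up, friction opposing motion.",
        "score": 3,
        "rationale": "Receives 3, not 2, because magnitudes carry units (N).",
    }],
    target={"question": "Draw the forces on a hanging mass.",
            "response": "Gravity down, tension up."},
)
print(prompt)
```

The point of the hedged sketch is structural: each exemplar pairs a score with a rationale that contrasts it against neighboring scores, which is what the paper means by "discriminative."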
Key Points
- ▸ GUIDE reframes exemplar selection and refinement as a boundary-focused optimization problem.
- ▸ The framework employs contrastive operators to identify 'boundary pairs' with different grades.
- ▸ GUIDE yields especially robust gains on borderline cases and improves rubric adherence.
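The contrastive selection step behind these points can be sketched in a few lines: given embedded responses, a "boundary pair" is any pair whose embedding similarity exceeds a threshold while their grades differ. The toy embeddings, threshold value, and data layout below are assumptions for illustration, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_boundary_pairs(responses, sim_threshold=0.9):
    """Return id pairs of responses that are semantically close
    (cosine similarity >= sim_threshold) yet carry different grades."""
    pairs = []
    for i in range(len(responses)):
        for j in range(i + 1, len(responses)):
            a, b = responses[i], responses[j]
            if a["grade"] != b["grade"] and cosine(a["emb"], b["emb"]) >= sim_threshold:
                pairs.append((a["id"], b["id"]))
    return pairs

# Toy corpus: 3-d vectors stand in for sentence-encoder embeddings.
corpus = [
    {"id": "r1", "grade": 2, "emb": [0.90, 0.10, 0.00]},
    {"id": "r2", "grade": 3, "emb": [0.88, 0.12, 0.02]},  # near r1, different grade
    {"id": "r3", "grade": 0, "emb": [0.00, 0.20, 0.95]},  # far from both
]
print(mine_boundary_pairs(corpus))  # → [('r1', 'r2')]
```

Only r1 and r2 qualify: they sit close together in embedding space but straddle a grade boundary, which is exactly the kind of pair that plain similarity-based retrieval would conflate.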
Merits
Robustness and Scalability
GUIDE's ability to achieve robust gains on borderline cases and improved rubric adherence demonstrates its potential to scale personalized feedback in education.
Pedagogical Alignment
Boundary-focused optimization enables GUIDE to align closely with human pedagogical standards.
Demerits
Implementation Complexity
The continuous loop of selection and refinement may require significant computational resources and expertise to implement.
Data Requirements
The framework relies on high-quality exemplars and rationales, which may be difficult to obtain or generate.
Expert Commentary
The study's framework, GUIDE, represents a significant advancement in the field of automated grading. By reframing exemplar selection and refinement as a boundary-focused optimization problem, GUIDE addresses the limitations of standard retrieval methods and achieves exceptional gains on borderline cases and rubric adherence. However, the implementation complexity and data requirements of the framework may pose challenges for widespread adoption. Nevertheless, the study's findings have far-reaching implications for education policy and practice, highlighting the potential for trusted and scalable assessment systems that align closely with human pedagogical standards.
Recommendations
- ✓ Future research should focus on reducing GUIDE's implementation complexity and developing more efficient methods for generating high-quality exemplars and rationales.
- ✓ Policymakers should consider the study's findings when developing education policies related to assessment and grading.