
CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

arXiv:2603.11957v1

Abstract: Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
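The calibration step named in the abstract, post-hoc temperature scaling, is a standard technique and easy to illustrate. The sketch below is not the paper's code: it fits a single scalar temperature T on held-out validation logits by minimizing negative log-likelihood, leaving the model weights untouched. The five grade levels and the random data are placeholders.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    # Negative log-likelihood of the temperature-scaled softmax.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    # Post-hoc calibration: search for the scalar T > 0 that minimizes
    # NLL on a held-out split; predictions (argmax) are unchanged.
    result = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded",
                             args=(val_logits, val_labels))
    return result.x

# Illustrative usage with synthetic data standing in for grader logits.
rng = np.random.default_rng(0)
val_logits = rng.normal(size=(256, 5))   # 5 grade levels, assumed
val_labels = rng.integers(0, 5, size=256)
T = fit_temperature(val_logits, val_labels)
print(f"fitted temperature: {T:.2f}")    # T > 1 softens an overconfident model
```

After fitting, dividing all test-time logits by T yields the calibrated probabilities that the routing step thresholds.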

Executive Summary

The article introduces CHiL(L)Grader, an automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. This lets the model recognize when its predictions are trustworthy: high-confidence cases are graded automatically, while uncertain ones are deferred to human graders. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80), demonstrating the value of uncertainty quantification for reliable AI-assisted grading.
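The expert-level bar above is stated in quadratic weighted kappa (QWK), an agreement metric that penalizes large grade disagreements more heavily than near-misses. It can be computed with scikit-learn; the grade arrays below are invented for illustration.

```python
from sklearn.metrics import cohen_kappa_score

# QWK compares model grades with human reference grades on an ordinal
# scale; 1.0 is perfect agreement, 0.0 is chance-level agreement.
human_grades = [0, 1, 2, 3, 3, 4, 2, 1, 0, 4]
model_grades = [0, 1, 2, 3, 4, 4, 2, 1, 1, 4]

qwk = cohen_kappa_score(human_grades, model_grades, weights="quadratic")
print(f"QWK = {qwk:.3f}")  # the paper's acceptance bar is QWK >= 0.80
```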

Key Points

  • CHiL(L)Grader incorporates calibrated confidence estimation into a human-in-the-loop workflow
  • The framework automates only high-confidence predictions and routes uncertain cases to human graders (see the routing sketch after this list)
  • CHiL(L)Grader adapts to evolving rubrics and unseen questions through continual learning
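The routing rule itself is simple once confidences are calibrated. The sketch below shows generic confidence-based selective prediction, not the paper's implementation; the 0.8 threshold is an assumption, since the abstract does not state how the operating point is chosen.

```python
import numpy as np

def route(probs, threshold=0.8):
    # Selective prediction: accept the model's grade when calibrated
    # confidence clears the threshold, otherwise defer to a human.
    confidence = probs.max(axis=1)      # top-class probability
    predictions = probs.argmax(axis=1)
    auto_mask = confidence >= threshold
    return auto_mask, predictions

# Illustrative usage: two confident rows are auto-graded, one is deferred.
probs = np.array([[0.95, 0.03, 0.02],
                  [0.10, 0.85, 0.05],
                  [0.40, 0.35, 0.25]])
auto, preds = route(probs)
for i, (a, p) in enumerate(zip(auto, preds)):
    print(f"response {i}: {'auto grade ' + str(p) if a else 'route to human'}")
```

Sweeping the threshold trades coverage against quality, which is consistent with the reported 35-65% automation range across datasets.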

Merits

Improved Reliability

CHiL(L)Grader's confidence-based routing ensures that only trustworthy predictions are automated, reducing the risk of errors

Adaptability

The framework's ability to adapt to evolving rubrics and unseen questions makes it suitable for dynamic educational environments
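The abstract also reports that each correction cycle strengthens the grader. The sketch below is a generic correction-cycle loop, not the paper's algorithm: deferred responses come back with teacher grades, accumulate in a buffer, and periodically trigger a model update. The fine_tune hook is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class CorrectionCycle:
    # Deferred responses plus their teacher-assigned grades accumulate
    # here until there are enough to justify a model update.
    buffer: list = field(default_factory=list)
    batch_size: int = 32

    def record(self, response, teacher_grade):
        self.buffer.append((response, teacher_grade))
        if len(self.buffer) >= self.batch_size:
            batch, self.buffer = self.buffer, []
            fine_tune(batch)

def fine_tune(batch):
    # Hypothetical hook: one update pass over teacher-corrected examples
    # (e.g. a parameter-efficient fine-tune of the grading model).
    print(f"fine-tuning on {len(batch)} teacher-corrected examples")

cycle = CorrectionCycle(batch_size=2)
cycle.record("Photosynthesis produces glucose.", 4)
cycle.record("Plants eat sunlight.", 2)  # second correction triggers an update
```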

Demerits

Dependence on Human Feedback

CHiL(L)Grader's reliance on human feedback for correction cycles may limit its scalability and efficiency

Limited Generalizability

The framework's performance may not generalize to other domains or types of assessments without further training and validation

Expert Commentary

The introduction of CHiL(L)Grader marks a meaningful step toward reliable AI-assisted grading systems. By incorporating calibrated confidence estimation into a human-in-the-loop workflow, the framework addresses a critical limitation of instruction-tuned models: their tendency toward overconfidence, which makes fully autonomous grading unsafe in high-stakes settings. The reported results, including the 0.347 QWK gap between accepted and rejected predictions, support the effectiveness of this approach and underline the importance of uncertainty quantification in AI-assisted assessment. As AI-assisted education tools become increasingly prevalent, the need for transparent and reliable systems like CHiL(L)Grader will only grow.

Recommendations

  • Further research is needed to explore the scalability and generalizability of CHiL(L)Grader's approach
  • Education policymakers should prioritize the development and validation of reliable AI-assisted education tools, incorporating uncertainty quantification and human oversight

Sources

  • arXiv:2603.11957v1 (https://arxiv.org/abs/2603.11957)