
Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks


Rudra Jadhav, Janhavi Danve, Sonalika Shaw

arXiv:2603.18765v1 (Announce Type: new)

Abstract: As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs -- LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) -- were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen's d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale -- penalties comparable to the difference between a B+ and C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.
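The abstract reports effect sizes as Cohen's d, which standardizes the gap between mean scores by the pooled standard deviation. As a minimal sketch of how such an effect size is computed (the scores below are hypothetical illustrations, not the paper's data):

```python
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d using the standard pooled-standard-deviation formula."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

# Hypothetical 10-point grades: original responses vs. informal rewrites
original = [9, 8, 9, 8, 9, 8]
informal = [7, 7, 8, 6, 7, 7]
d = cohens_d(original, informal)  # large positive d = original scored higher
```

A d of 0.64 means the score distributions overlap substantially but shift by roughly two-thirds of a standard deviation; a d above 4, as reported for some Essay/Writing conditions, means the distributions barely overlap at all.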

Executive Summary

This study examines implicit grading bias in large language models (LLMs) when they evaluate student responses. The researchers constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing) and had two state-of-the-art open-source LLMs (LLaMA 3.3 70B and Qwen 2.5 72B) grade them. The results reveal statistically significant grading bias in Essay/Writing tasks across all three perturbation types, with informal language and non-native phrasing drawing the largest penalties. The findings suggest that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions. These results have significant implications for the equitable deployment of LLM-based grading systems and highlight the need for bias auditing protocols before institutional adoption.

Key Points

  • LLM graders were audited for implicit style bias across Math, Programming, and Essay tasks; the bias is concentrated in Essay/Writing, while Math and Programming show minimal bias.
  • The bias is particularly pronounced in Essay/Writing tasks, with informal language and non-native phrasing receiving significant penalties.
  • The bias persists despite explicit counter-bias instructions in the grading prompt.
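The last point is striking: the models were explicitly told to ignore style, yet penalized it anyway. The paper's exact prompt is not reproduced here, but a hypothetical prompt of the kind described (1-10 scale, content-only instruction) might look like this; the function name and wording are illustrative assumptions:

```python
def build_grading_prompt(question: str, response: str) -> str:
    """Hypothetical grading prompt with an explicit counter-bias
    instruction, modeled on the study's description (not its exact text)."""
    return (
        "You are grading a student response on a 1-10 scale.\n"
        "Evaluate content correctness ONLY. Disregard writing style, "
        "grammar mistakes, informal tone, and non-native phrasing.\n\n"
        f"Question: {question}\n"
        f"Student response: {response}\n\n"
        "Reply with a single integer from 1 to 10."
    )
```

The study's finding is that even with an instruction like this in the prompt, style-based score penalties remained statistically significant in Essay/Writing tasks.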

Merits

Strength in methodology

The study uses a controlled dataset in which content is held constant across perturbation conditions, and evaluates two state-of-the-art open-source LLMs, which strengthens the comparability and robustness of the results.

Relevance to educational settings

The study's findings have significant implications for the deployment of LLM-based grading systems in educational settings, highlighting the need for bias auditing protocols.

Demerits

Limited generalizability

The study's findings may not generalize to other LLMs or educational contexts, requiring further research to establish the scope of the issue.

Dependence on specific grading prompts

The study's results may be influenced by the specific grading prompts used, which could affect the LLMs' behavior and the results obtained.

Expert Commentary

This study represents a crucial step in understanding the limitations and challenges associated with the use of LLMs in educational settings. The findings highlight the need for a more nuanced approach to AI-powered grading, one that acknowledges the potential for bias and takes steps to mitigate its impact. The study's results also underscore the importance of developing robust auditing protocols to ensure the fairness and equity of LLM-based grading systems. As the use of LLMs continues to grow, it is essential that researchers, policymakers, and educators work together to address these challenges and ensure that AI-powered grading systems are used in a way that benefits all students, regardless of their background or abilities.

Recommendations

  • Future research should investigate the scope of LLM grading bias and its potential impact on student outcomes.
  • Developers and educators should work together to create more transparent and explainable AI-powered grading systems, enabling more effective bias mitigation and auditing.
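The recommended bias auditing could start from a simple paired comparison: grade each response in its original and perturbed form, then test whether the perturbation shifts scores. A minimal sketch of such a protocol (hypothetical grades; the paper's own audit procedure may differ):

```python
from statistics import mean, stdev

def paired_bias_audit(original_scores, perturbed_scores):
    """Sketch of a bias audit: mean score penalty introduced by a
    surface-level perturbation, plus a paired t-statistic for it.
    (Compare the t-statistic against a t-table for a p-value.)"""
    diffs = [o - p for o, p in zip(original_scores, perturbed_scores)]
    n = len(diffs)
    mean_penalty = mean(diffs)
    t_stat = mean_penalty / (stdev(diffs) / n ** 0.5)
    return mean_penalty, t_stat

# Hypothetical grades before/after an informal-language rewrite
orig = [9, 8, 9, 7, 8, 9, 8, 8]
pert = [7, 7, 8, 5, 6, 8, 6, 7]
penalty, t = paired_bias_audit(orig, pert)
```

An institution adopting an LLM grader could run this kind of audit per subject and per perturbation type before deployment, mirroring the study's finding that bias is subject-dependent.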
