BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation
arXiv:2602.23580v1 Announce Type: new

Abstract: In the field of educational assessment, automated scoring systems increasingly rely on deep learning and large language models (LLMs). However, these systems face significant risks of bias amplification, where model prediction gaps between student groups become larger than those observed in training data. This issue is especially severe for underrepresented groups such as English Language Learners (ELLs), as models may inherit and further magnify existing disparities in the data. We identify that this issue is closely tied to representation bias: the scarcity of minority (high-scoring ELL) samples makes models trained with empirical risk minimization favor majority (non-ELL) linguistic patterns. Consequently, models tend to under-predict even those ELL students who demonstrate comparable domain knowledge but use different linguistic patterns, thereby undermining the fairness of automated scoring outcomes. To mitigate this, we propose BRIDGE, a Bias-Reducing Inter-group Data GEneration framework designed for low-resource assessment settings. Instead of relying on the limited minority samples, BRIDGE synthesizes high-scoring ELL samples by "pasting" construct-relevant (i.e., rubric-aligned knowledge and evidence) content from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns. We further introduce a discriminator model to ensure the quality of synthetic samples. Experiments on California Science Test (CAST) datasets demonstrate that BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance. Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.
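The abstract defines bias amplification as the model's inter-group prediction gap exceeding the gap already present in the training labels. A minimal sketch of that quantity, with illustrative function names and toy data (not from the paper):

```python
# Quantify bias amplification: how much wider is the model's non-ELL vs. ELL
# gap than the gap already present in the human-assigned labels?
# All names and numbers here are illustrative, not the paper's definitions.

def group_mean(scores, groups, target):
    vals = [s for s, g in zip(scores, groups) if g == target]
    return sum(vals) / len(vals)

def bias_amplification(labels, preds, groups):
    """Positive values mean the model widens the non-ELL vs. ELL score gap."""
    label_gap = group_mean(labels, groups, "non-ELL") - group_mean(labels, groups, "ELL")
    pred_gap = group_mean(preds, groups, "non-ELL") - group_mean(preds, groups, "ELL")
    return pred_gap - label_gap

# Toy example: human labels show a 0.5-point gap, model predictions a 1.0-point gap.
groups = ["non-ELL", "non-ELL", "ELL", "ELL"]
labels = [3.0, 4.0, 3.0, 3.0]   # human gap = 3.5 - 3.0 = 0.5
preds  = [3.5, 4.5, 3.0, 3.0]   # model gap = 4.0 - 3.0 = 1.0
print(bias_amplification(labels, preds, groups))  # → 0.5
```

A value of 0.5 here means the model added half a score point to the disparity beyond what the data contained, which is exactly the failure mode BRIDGE targets.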
Executive Summary
The article proposes a novel approach, BRIDGE, to mitigate bias amplification in automated scoring systems for English Language Learners (ELLs). By generating synthetic high-scoring ELL samples through inter-group data augmentation, BRIDGE reduces prediction bias while maintaining overall scoring performance. The framework synthesizes construct-relevant content from high-scoring non-ELL samples into authentic ELL linguistic patterns, ensuring fairness in large-scale assessments.
Key Points
- ▸ Bias amplification in automated scoring systems affects underrepresented groups like ELLs
- ▸ BRIDGE framework generates synthetic high-scoring ELL samples through inter-group data augmentation
- ▸ Experiments on California Science Test datasets demonstrate reduced prediction bias and maintained scoring performance
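The generation loop described above (paste construct-relevant content from high-scoring non-ELL responses into ELL linguistic patterns, then gate with a discriminator) can be sketched as follows. This is a hypothetical outline, not the paper's implementation: the extraction, restyling, and discriminator steps are stand-in stubs (in practice these would likely be LLM- or classifier-based), and all names are illustrative.

```python
# Hypothetical sketch of a BRIDGE-style inter-group augmentation loop.
# Each step is a placeholder for a learned component described in the paper.

def extract_construct_content(response):
    # Stand-in: in practice, extract rubric-aligned knowledge and evidence
    # spans from a high-scoring non-ELL response.
    return response["text"]

def render_in_ell_style(content, ell_exemplar):
    # Stand-in: in practice, rewrite `content` using the authentic
    # linguistic patterns of an ELL exemplar response.
    return content + " [restyled after ELL exemplar: " + ell_exemplar["text"] + "]"

def discriminator_accepts(candidate, min_len=10):
    # Stand-in quality gate: a trained discriminator would judge whether the
    # synthetic sample preserves the construct and reads authentically.
    return len(candidate) >= min_len

def bridge_augment(high_non_ell, ell_exemplars):
    """Synthesize high-scoring ELL-style samples from non-ELL donors."""
    synthetic = []
    for donor in high_non_ell:
        for exemplar in ell_exemplars:
            content = extract_construct_content(donor)
            candidate = render_in_ell_style(content, exemplar)
            if discriminator_accepts(candidate):
                # The donor's (high) score is carried over to the new sample.
                synthetic.append({"text": candidate, "score": donor["score"]})
    return synthetic
```

The key design point is that scores travel with the construct-relevant content (from the non-ELL donor), while surface linguistic form comes from real ELL writing, so the scoring model sees high scores paired with minority linguistic patterns it would otherwise rarely encounter.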
Merits
Effective Bias Reduction
BRIDGE successfully reduces prediction bias for high-scoring ELL students
Cost-Effective Solution
The method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution
Demerits
Limited Generalizability
The framework has been validated only on California Science Test (CAST) datasets; its effectiveness in other assessment settings and subject domains remains untested
Expert Commentary
The proposed BRIDGE framework offers a promising solution to mitigate bias amplification in automated scoring systems. By leveraging inter-group data augmentation, BRIDGE addresses the scarcity of minority samples and promotes fairness in large-scale assessments. However, further research is needed to ensure the framework's generalizability and applicability to diverse assessment settings. The article's findings have significant implications for promoting equity in educational assessments and highlight the importance of considering fairness in AI-driven decision-making systems.
Recommendations
- ✓ Further research on the generalizability of the BRIDGE framework to diverse assessment settings
- ✓ Exploration of the framework's applicability to other underrepresented groups beyond ELLs