
From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

arXiv:2603.06424v1 Announce Type: new Abstract: Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we present a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task 2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods; the best configuration, integrating k-SFT and RAG, achieves the strongest overall results with an F1-score of 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2 and shows promising potential for auto-grading writing tasks. Code is publicly available at https://github.com/MinhNguyenDS/LLM_AES-EnL2
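Of the four paradigms, DPO is the least self-explanatory: rather than imitating gold scores token by token, the model is trained to prefer better grading responses over worse ones relative to a frozen reference model. The sketch below shows the standard DPO objective (Rafailov et al., 2023) in PyTorch; it is illustrative only, and the variable names and beta value are assumptions rather than details from the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective (Rafailov et al., 2023); not the paper's code.

    Each argument is a tensor of per-sequence log-probabilities, shape (batch,).
    `beta` controls how far the policy may drift from the reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred completions;
    # logsigmoid keeps the objective numerically stable.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In an AES setting, the "chosen" completion would be a grading output closer to the examiner's band, and the "rejected" one a worse-scoring alternative.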

Executive Summary

This study presents a comparative analysis of four major Large Language Model (LLM)-based Automated Essay Scoring (AES) paradigms for English as a Second Language (L2) writing. All four approaches are evaluated on a single unified benchmark, IELTS Writing Task 2, and the comparison reveals clear accuracy-cost-robustness trade-offs across methods. The best configuration, integrating knowledge-based Supervised Fine-Tuning (k-SFT) and Retrieval-Augmented Generation (RAG), achieves the strongest overall results with an F1-score of 93%. This work offers the first unified empirical comparison of modern LLM-based AES strategies for English L2, with direct implications for auto-grading writing tasks.
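The abstract does not spell out the winning RAG pipeline, but the usual pattern in essay scoring is to retrieve already-scored essays that resemble the one under evaluation and supply them to the scorer as graded exemplars. Below is a minimal retrieval sketch with sentence-transformers; the essay bank, embedding model choice, and function names are illustrative assumptions, not the paper's setup.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical scored-essay bank; the paper's actual retrieval corpus
# and embedding model are not specified in the abstract.
bank = [
    {"essay": "Some people believe that ...", "band": 6.5},
    {"essay": "In recent decades, technology ...", "band": 7.5},
]

model = SentenceTransformer("all-MiniLM-L6-v2")
bank_emb = model.encode([e["essay"] for e in bank], convert_to_tensor=True)

def retrieve_exemplars(new_essay: str, k: int = 2):
    """Return the k scored essays most similar to the new essay."""
    query_emb = model.encode(new_essay, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, bank_emb, top_k=k)[0]
    return [bank[h["corpus_id"]] for h in hits]
```

The retrieved (essay, band) pairs would then be formatted into the grading prompt, turning zero-shot scoring into retrieval-grounded few-shot scoring.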

Key Points

  • Four major LLM-based AES paradigms, usually studied in isolation, are compared head-to-head for L2 English writing.
  • All four approaches are evaluated on a single unified benchmark, IELTS Writing Task 2 (the encoder-based baseline is sketched after this list).
  • The comparison reveals clear accuracy-cost-robustness trade-offs across methods and identifies the best configuration, k-SFT combined with RAG.
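The first of the four paradigms, encoder-based classification fine-tuning, treats band prediction as ordinary sequence classification over discretized band labels. A minimal sketch with Hugging Face transformers follows; the backbone, band bucketing, and hyperparameters are assumptions for illustration, since the abstract does not specify them.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed label scheme: IELTS bands bucketed into discrete classes.
BANDS = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(BANDS))

# Toy training data standing in for a real scored-essay corpus.
train = Dataset.from_dict({
    "text": ["Some people argue that ...", "Technology has changed ..."],
    "label": [BANDS.index(6.0), BANDS.index(7.0)],
}).map(lambda batch: tokenizer(batch["text"], truncation=True,
                               max_length=512, padding="max_length"),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aes-encoder", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train,
)
trainer.train()
```

Treating half-band scores as unordered classes is the simplest formulation; regression or ordinal heads are common alternatives in AES work.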

Merits

Strength in Methodology

By holding the benchmark, IELTS Writing Task 2, fixed while varying only the scoring paradigm, the study isolates the contribution of each technique, enabling a like-for-like comparison that prior single-method evaluations could not provide.

Implications for Auto-Grading

The study has direct implications for building auto-grading systems for writing tasks, particularly for English L2, and shows that LLM-based AES strategies can improve both the efficiency and the accuracy of assessment.
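The cheapest paradigm in the comparison, zero-shot prompting, needs no training at all: the grader is just a rubric-bearing prompt. Below is a minimal sketch against an OpenAI-compatible API; the model name, prompt wording, and output parsing are assumptions for illustration, not the paper's configuration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC_PROMPT = (
    "You are an IELTS Writing Task 2 examiner. Score the essay below on the "
    "official band scale (0-9, half bands allowed) and reply with only the "
    "numeric band.\n\nEssay:\n{essay}"
)

def zero_shot_band(essay: str, model: str = "gpt-4o-mini") -> float:
    """Ask the model for a band score with no exemplars (zero-shot)."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep scoring as deterministic as the API allows
        messages=[{"role": "user",
                   "content": RUBRIC_PROMPT.format(essay=essay)}],
    )
    # Real code should validate the reply before parsing it as a number.
    return float(resp.choices[0].message.content.strip())
```

Few-shot prompting extends this by inserting scored exemplars into the prompt, and RAG variants select those exemplars by similarity, as sketched earlier.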

Code Availability

The researchers release their code publicly (https://github.com/MinhNguyenDS/LLM_AES-EnL2), facilitating reproducibility and follow-up research.

Demerits

Limited Generalizability

The study focuses on a single benchmark, IELTS Writing Task 2, and a single target language, which may limit how well the findings transfer to other prompts, proficiency levels, exam formats, and languages.

Lack of Human Evaluation

The study evaluates essay quality with automated agreement metrics (F1 against gold band scores), which may not capture the nuances of human evaluation, such as feedback quality or borderline judgments.
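A related measurement caveat: F1 treats IELTS bands as unordered classes, so an off-by-half-band error counts the same as an off-by-three-band error. Quadratic weighted kappa (QWK), the traditional AES agreement metric, penalizes larger disagreements more heavily. The sketch below contrasts the two with scikit-learn on hypothetical band predictions (the numbers are invented, not from the paper).

```python
from sklearn.metrics import cohen_kappa_score, f1_score

# Hypothetical gold and predicted IELTS bands, encoded as class indices.
y_true = [2, 3, 3, 4, 5, 2, 3]
y_pred = [2, 3, 4, 4, 5, 2, 2]

# F1 treats every misclassification equally, regardless of band distance.
print("weighted F1:", f1_score(y_true, y_pred, average="weighted"))

# QWK penalizes predictions that land further from the gold band more.
print("QWK:", cohen_kappa_score(y_true, y_pred, weights="quadratic"))
```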

Expert Commentary

This study makes a significant contribution to the field of LLM-based AES. Comparing four paradigms on a single benchmark yields a like-for-like analysis of their relative merits that the prior one-technique-at-a-time literature could not offer. Its limitations, chiefly the absence of human evaluation and the single-benchmark scope, should be acknowledged and addressed in future work. Even so, the findings underline the potential of LLM-based AES to improve assessment efficiency and accuracy in English L2 writing, with consequences for language education policy and the design of AI-powered assessment tools.

Recommendations

  • Future research should extend the comparison to other languages, benchmarks, and learner populations to improve the generalizability of the results.
  • The use of human evaluation and feedback mechanisms should be explored to provide a more comprehensive assessment of essay quality and LLM-based AES performance.
