Are LLMs Ready to Replace Bangla Annotators?
arXiv:2602.16241v1 Abstract: Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators--especially for low-resource and identity-sensitive settings--remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality--smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
Executive Summary
The article 'Are LLMs Ready to Replace Bangla Annotators?' investigates the reliability and bias of Large Language Models (LLMs) as zero-shot annotators for Bangla hate speech, a task fraught with challenges due to low-resource settings and identity sensitivity. The study benchmarks 17 LLMs, revealing significant annotator bias and instability in model judgments. Contrary to expectations, larger models do not consistently outperform smaller, task-aligned models in annotation quality. The findings underscore the need for rigorous evaluation before deploying LLMs for sensitive annotation tasks in low-resource languages.
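To make the setup concrete, the sketch below shows what zero-shot annotation of a single Bangla comment might look like. The prompt wording, label set, and the `call_llm` placeholder are illustrative assumptions, not the paper's actual prompts or framework.

```python
# Minimal zero-shot annotation sketch; the prompt and label set are
# illustrative assumptions, and call_llm is a hypothetical placeholder.
LABELS = ["hate", "not_hate"]

PROMPT_TEMPLATE = (
    "You are annotating Bangla social-media comments for hate speech.\n"
    "Reply with exactly one label: hate or not_hate.\n\n"
    "Comment: {text}\n"
    "Label:"
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whichever LLM API or local model is used."""
    raise NotImplementedError("plug in a provider client or local model here")

def annotate(text: str) -> str:
    """Return a zero-shot label for a single Bangla comment."""
    reply = call_llm(PROMPT_TEMPLATE.format(text=text)).strip().lower()
    # Fall back conservatively if the model answers off-format.
    return reply if reply in LABELS else "not_hate"
```

Repeating this call across models, and across runs of the same model, is what exposes the disagreement and instability the study reports.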
Key Points
- LLMs exhibit substantial bias and instability as zero-shot annotators for Bangla hate speech.
- Increased model scale does not guarantee improved annotation quality.
- Smaller, task-aligned models often perform more consistently than larger models; one way to measure such run-to-run consistency is sketched after this list.
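A minimal sketch of one way to quantify that run-to-run instability, reusing the hypothetical `annotate` helper from the earlier sketch; the paper's own metrics may differ.

```python
from collections import Counter

def stability(labels: list[str]) -> float:
    """Fraction of repeated runs that agree with the majority label.

    1.0 means the model returns the same label every time; values close
    to chance level indicate unstable, effectively random judgments.
    """
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical usage: annotate the same comment several times (e.g. with
# temperature > 0) and check how often the model contradicts itself.
# runs = [annotate(comment) for _ in range(10)]
# print(stability(runs))
```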
Merits
Comprehensive Benchmarking
The study systematically evaluates 17 LLMs using a unified framework, providing a robust comparison of their performance in a low-resource language context.
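A minimal sketch of what such a unified harness could look like: every model is scored by the same code on the same comments, so differences reflect the models rather than the pipeline. The function below is an illustrative assumption, not the paper's framework, and uses plain accuracy against human gold labels for brevity.

```python
# Hedged sketch of a unified benchmarking loop (not the paper's actual
# framework): one scoring routine, one test set, many models.
from typing import Callable

def benchmark(models: dict[str, Callable[[str], str]],
              comments: list[str],
              gold: list[str]) -> dict[str, float]:
    """Return per-model accuracy against human gold labels.

    `models` maps a model name to a callable that labels one comment;
    both names and callables are placeholders for whatever providers
    are actually evaluated.
    """
    scores = {}
    for name, annotate_fn in models.items():
        preds = [annotate_fn(c) for c in comments]
        scores[name] = sum(p == g for p, g in zip(preds, gold)) / len(gold)
    return scores
```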
Identification of Bias and Instability
The research highlights critical issues of bias and instability in LLM annotations, which are particularly relevant for sensitive tasks like hate speech detection.
Task-Specific Insights
The findings offer valuable insights into the performance of LLMs in low-resource languages, emphasizing the importance of task alignment over model size.
Demerits
Limited Generalizability
The study focuses solely on Bangla hate speech, which may limit the generalizability of the findings to other languages or annotation tasks.
Evaluation Framework
The evaluation framework, while unified, may not capture all nuances of annotator bias and instability, potentially leading to incomplete conclusions.
Model Diversity
The selection of 17 LLMs may not represent the full spectrum of available models, potentially biasing the results towards certain types of models.
Expert Commentary
The article presents a timely and critical examination of the capabilities and limitations of LLMs as annotators for sensitive tasks in low-resource languages. The findings are particularly relevant given the increasing reliance on AI for tasks that require high levels of accuracy and fairness. The revelation that larger models do not necessarily perform better than smaller, task-aligned models challenges the prevailing assumption that scale is the primary determinant of model performance. This study serves as a cautionary tale, emphasizing the need for careful evaluation and consideration of task-specific requirements before deploying LLMs. The identification of bias and instability in model judgments further underscores the importance of ethical considerations in AI development and deployment. As AI systems continue to be integrated into various domains, such rigorous evaluations will be crucial in ensuring their reliability and fairness.
Recommendations
- Conduct comprehensive evaluations of LLMs for specific tasks, considering both model size and task alignment.
- Develop more nuanced evaluation metrics that capture the complexities of bias and instability in AI systems; a hedged sketch of one such metric follows below.
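As one hedged example of a more nuanced metric, the sketch below computes the gap in false-positive rates across identity groups, a simple way to surface whether a model over-flags benign mentions of some groups. The record fields and the choice of false-positive-rate gap are illustrative assumptions, not metrics taken from the paper.

```python
# Hedged sketch: false-positive-rate gap across identity groups as one
# concrete bias signal; field names are illustrative assumptions.
from collections import defaultdict

def fpr_gap(records: list[dict]) -> float:
    """Largest spread in false-positive rate across identity groups.

    Each record needs 'group' (identity group referenced), 'gold'
    (human label) and 'pred' (LLM label). A large gap means the model
    over-flags benign mentions of some groups relative to others.
    """
    flagged = defaultdict(int)   # benign comments labelled "hate", per group
    benign = defaultdict(int)    # all benign comments, per group
    for r in records:
        if r["gold"] == "not_hate":
            benign[r["group"]] += 1
            if r["pred"] == "hate":
                flagged[r["group"]] += 1
    rates = [flagged[g] / benign[g] for g in benign]
    return max(rates) - min(rates) if rates else 0.0
```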