Are LLMs Ready to Replace Bangla Annotators?
arXiv:2602.16241v1 Abstract: Large Language Models (LLMs) are increasingly used as automated annotators to scale dataset creation, yet their reliability as unbiased annotators--especially for low-resource and identity-sensitive settings--remains poorly understood. In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences. We conduct a systematic benchmark of 17 LLMs using a unified evaluation framework. Our analysis uncovers annotator bias and substantial instability in model judgments. Surprisingly, increased model scale does not guarantee improved annotation quality--smaller, more task-aligned models frequently exhibit more consistent behavior than their larger counterparts. These results highlight important limitations of current LLMs for sensitive annotation tasks in low-resource languages and underscore the need for careful evaluation before deployment.
Executive Summary
The article 'Are LLMs Ready to Replace Bangla Annotators?' investigates the reliability and bias of Large Language Models (LLMs) as zero-shot annotators for Bangla hate speech, a task fraught with challenges due to low-resource settings and identity sensitivity. The study benchmarks 17 LLMs, revealing significant annotator bias and instability in model judgments. Contrary to expectations, larger models do not consistently outperform smaller, task-aligned models in annotation quality. The findings underscore the need for rigorous evaluation before deploying LLMs for sensitive annotation tasks in low-resource languages.
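To make the setup concrete, the sketch below shows what zero-shot annotation of a single Bangla comment might look like. The prompt wording, label set, and the `call_llm` placeholder are illustrative assumptions, not the paper's actual prompts or framework.

```python
# Minimal zero-shot annotation sketch; the prompt and label set are
# illustrative assumptions, and call_llm is a hypothetical placeholder.
LABELS = ["hate", "not_hate"]

PROMPT_TEMPLATE = (
    "You are annotating Bangla social-media comments for hate speech.\n"
    "Reply with exactly one label: hate or not_hate.\n\n"
    "Comment: {text}\n"
    "Label:"
)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whichever LLM API or local model is used."""
    raise NotImplementedError("plug in a provider client or local model here")

def annotate(text: str) -> str:
    """Return a zero-shot label for a single Bangla comment."""
    reply = call_llm(PROMPT_TEMPLATE.format(text=text)).strip().lower()
    # Fall back conservatively if the model answers off-format.
    return reply if reply in LABELS else "not_hate"
```

Repeating this call across models, and across runs of the same model, is what exposes the disagreement and instability the study reports.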
Key Points
- LLMs exhibit substantial bias and instability as zero-shot annotators for Bangla hate speech.
- Increased model scale does not guarantee improved annotation quality.
- Smaller, task-aligned models often perform more consistently than larger models; one way to measure such run-to-run consistency is sketched after this list.
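A minimal sketch of one way to quantify that run-to-run instability, reusing the hypothetical `annotate` helper from the earlier sketch; the paper's own metrics may differ.

```python
from collections import Counter

def stability(labels: list[str]) -> float:
    """Fraction of repeated runs that agree with the majority label.

    1.0 means the model returns the same label every time; values close
    to chance level indicate unstable, effectively random judgments.
    """
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical usage: annotate the same comment several times (e.g. with
# temperature > 0) and check how often the model contradicts itself.
# runs = [annotate(comment) for _ in range(10)]
# print(stability(runs))
```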
Merits
Comprehensive Benchmarking
The study systematically evaluates 17 LLMs using a unified framework, providing a robust comparison of their performance in a low-resource language context.
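A minimal sketch of what such a unified harness could look like: every model is scored by the same code on the same comments, so differences reflect the models rather than the pipeline. The function below is an illustrative assumption, not the paper's framework, and uses plain accuracy against human gold labels for brevity.

```python
# Hedged sketch of a unified benchmarking loop (not the paper's actual
# framework): one scoring routine, one test set, many models.
from typing import Callable

def benchmark(models: dict[str, Callable[[str], str]],
              comments: list[str],
              gold: list[str]) -> dict[str, float]:
    """Return per-model accuracy against human gold labels.

    `models` maps a model name to a callable that labels one comment;
    both names and callables are placeholders for whatever providers
    are actually evaluated.
    """
    scores = {}
    for name, annotate_fn in models.items():
        preds = [annotate_fn(c) for c in comments]
        scores[name] = sum(p == g for p, g in zip(preds, gold)) / len(gold)
    return scores
```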
Identification of Bias and Instability
The research highlights critical issues of bias and instability in LLM annotations, which are particularly relevant for sensitive tasks like hate speech detection.
Task-Specific Insights
The findings offer valuable insights into the performance of LLMs in low-resource languages, emphasizing the importance of task alignment over model size.
Demerits
Limited Generalizability
The study focuses solely on Bangla hate speech, which may limit the generalizability of the findings to other languages or annotation tasks.
Evaluation Framework
The evaluation framework, while unified, may not capture all nuances of annotator bias and instability, potentially leading to incomplete conclusions.
Model Diversity
The selection of 17 LLMs may not represent the full spectrum of available models, potentially biasing the results towards certain types of models.
Expert Commentary
The article presents a timely and critical examination of the capabilities and limitations of LLMs as annotators for sensitive tasks in low-resource languages. The findings are particularly relevant given the increasing reliance on AI for tasks that require high levels of accuracy and fairness. The revelation that larger models do not necessarily perform better than smaller, task-aligned models challenges the prevailing assumption that scale is the primary determinant of model performance. This study serves as a cautionary tale, emphasizing the need for careful evaluation and consideration of task-specific requirements before deploying LLMs. The identification of bias and instability in model judgments further underscores the importance of ethical considerations in AI development and deployment. As AI systems continue to be integrated into various domains, such rigorous evaluations will be crucial in ensuring their reliability and fairness.
Recommendations
- Conduct comprehensive evaluations of LLMs for specific tasks, considering both model size and task alignment.
- Develop more nuanced evaluation metrics that capture the complexities of bias and instability in AI systems; a hedged sketch of one such metric follows below.
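As one hedged example of a more nuanced metric, the sketch below computes the gap in false-positive rates across identity groups, a simple way to surface whether a model over-flags benign mentions of some groups. The record fields and the choice of false-positive-rate gap are illustrative assumptions, not metrics taken from the paper.

```python
# Hedged sketch: false-positive-rate gap across identity groups as one
# concrete bias signal; field names are illustrative assumptions.
from collections import defaultdict

def fpr_gap(records: list[dict]) -> float:
    """Largest spread in false-positive rate across identity groups.

    Each record needs 'group' (identity group referenced), 'gold'
    (human label) and 'pred' (LLM label). A large gap means the model
    over-flags benign mentions of some groups relative to others.
    """
    flagged = defaultdict(int)   # benign comments labelled "hate", per group
    benign = defaultdict(int)    # all benign comments, per group
    for r in records:
        if r["gold"] == "not_hate":
            benign[r["group"]] += 1
            if r["pred"] == "hate":
                flagged[r["group"]] += 1
    rates = [flagged[g] / benign[g] for g in benign]
    return max(rates) - min(rates) if rates else 0.0
```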