YoNER: A New Yorùbá Multi-domain Named Entity Recognition Dataset
arXiv:2604.05624v1 Announce Type: new Abstract: Named Entity Recognition (NER) is a foundational NLP task, yet research in Yorùbá has been constrained by limited and domain-specific resources. Existing resources, such as MasakhaNER (a manually annotated news-domain corpus) and WikiAnn (automatically created from Wikipedia), are valuable but restricted in domain coverage. To address this gap, we present YoNER, a new multi-domain Yorùbá NER dataset that extends entity coverage beyond news and Wikipedia. The dataset comprises about 5,000 sentences and 100,000 tokens collected from five domains: Bible, Blogs, Movies, Radio broadcast, and Wikipedia, annotated with three entity types: Person (PER), Organization (ORG), and Location (LOC), following CoNLL-style guidelines. Annotation was conducted manually by three native Yorùbá speakers, with an inter-annotator agreement of over 0.70, ensuring high quality and consistency. We benchmark several transformer encoder models using cross-domain experiments with MasakhaNER 2.0, and we also assess the effect of few-shot in-domain data using YoNER and cross-lingual setups with English datasets. Our results show that African-centric models outperform general multilingual models for Yorùbá, but cross-domain performance drops substantially, particularly for the blog and movie domains. Furthermore, we observe that closely related formal domains, such as news and Wikipedia, transfer more effectively. In addition, we introduce a new Yorùbá-specific language model (OyoBERT) that outperforms multilingual models in in-domain evaluation. We publicly release the YoNER dataset and pretrained OyoBERT models to support future research on Yorùbá natural language processing.
Executive Summary
The article introduces YoNER, a pioneering multi-domain Yorùbá Named Entity Recognition (NER) dataset designed to bridge critical gaps in existing NLP resources for low-resource languages. Spanning five domains (Bible, Blogs, Movies, Radio broadcasts, and Wikipedia), the dataset comprises about 5,000 sentences and 100,000 tokens, annotated manually by native speakers for Person, Organization, and Location entities. The authors demonstrate that YoNER enables rigorous cross-domain and cross-lingual benchmarking, revealing that African-centric models outperform general multilingual models for Yorùbá NER, though performance degrades significantly in informal domains such as blogs and movies. The study also introduces OyoBERT, a Yorùbá-specific language model that surpasses multilingual baselines in in-domain evaluations. The dataset and model are publicly released, offering transformative potential for advancing NLP research in African languages.
Key Points
- ▸ YoNER is the first multi-domain Yorùbá NER dataset, addressing the limitations of domain-specific corpora like MasakhaNER and WikiAnn.
- ▸ The dataset includes about 5,000 sentences and 100,000 tokens across five diverse domains, annotated manually by three native Yorùbá speakers with high inter-annotator agreement (above 0.70).
- ▸ Benchmarking with cross-domain experiments against MasakhaNER 2.0 reveals that African-centric models outperform general multilingual models for Yorùbá NER, but cross-domain performance drops markedly in informal domains (e.g., blogs, movies).
- ▸ The introduction of OyoBERT, a Yorùbá-specific language model, demonstrates superior in-domain performance compared to multilingual baselines.
- ▸ The authors publicly release YoNER and OyoBERT to foster further research in Yorùbá and low-resource language NLP.
Merits
Novelty and Scope
YoNER is the first multi-domain Yorùbá NER dataset, addressing a critical gap in NLP resources for low-resource languages. The inclusion of diverse domains (e.g., Bible, Blogs, Movies) ensures broader applicability and realism in NLP tasks.
Rigorous Annotation and Quality Control
The dataset was manually annotated by three native Yorùbá speakers with an inter-annotator agreement exceeding 0.70, ensuring high-quality and consistent annotations. This rigor sets a benchmark for future NLP datasets in African languages.
Technical Innovation and Benchmarking
The introduction of OyoBERT and the comprehensive benchmarking against multilingual models (e.g., XLM-R, mBERT) provide valuable insights into the performance of African-centric models for Yorùbá NER, highlighting the superiority of language-specific models in low-resource settings.
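NER benchmarks of this kind are conventionally scored with entity-level F1: a prediction counts only if the full span and type match the gold annotation. The sketch below shows one common way to compute it under strict BIO conventions; it is an assumption-labeled illustration, since the paper's exact evaluation script is not described here.

```python
def bio_spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence.

    Strict BIO: spans must start with B-; a dangling I- without a
    matching B- is ignored. `end` is exclusive.
    """
    spans = []
    start = etype = None
    for i, tag in enumerate(tags + ["O"]):  # "O" sentinel flushes a trailing span
        if start is not None and (not tag.startswith("I-") or tag[2:] != etype):
            spans.append((start, i, etype))
            start = etype = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def entity_f1(gold, pred):
    """Micro-averaged entity-level F1 over lists of tagged sentences."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        gs, ps = set(bio_spans(g)), set(bio_spans(p))
        tp += len(gs & ps)   # exact span+type matches
        fp += len(ps - gs)   # predicted entities not in gold
        fn += len(gs - ps)   # gold entities the model missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
print(round(entity_f1(gold, pred), 3))  # the missed LOC halves recall
```

Exact-match scoring is deliberately unforgiving: a boundary error on a multi-token name costs both a false positive and a false negative, which is part of why informal domains with noisy entity boundaries show the large drops reported above.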
Open Science and Reproducibility
The public release of YoNER and OyoBERT fosters reproducibility, collaboration, and further innovation in Yorùbá NLP, aligning with global efforts to decolonize AI and expand linguistic diversity in NLP research.
Demerits
Limited Entity Types
YoNER focuses on only three entity types (Person, Organization, Location), which may restrict its utility for applications requiring finer-grained entity recognition (e.g., medical, legal, or temporal entities).
Domain-Specific Performance Gaps
The study highlights significant performance drops in informal domains like blogs and movies, suggesting that YoNER may not fully capture the linguistic variability or noise present in such domains.
Scalability Concerns
The dataset size (5,000 sentences, 100,000 tokens) is relatively small compared to high-resource language datasets, which may limit its applicability for large-scale or production-level NLP systems.
Cross-Lingual Transfer Limitations
While the study explores cross-lingual setups with English datasets, the lack of detailed analysis on transfer learning from other African languages (e.g., Hausa, Swahili) may leave questions about the broader generalizability of the findings.
Expert Commentary
The introduction of YoNER represents a significant milestone in the advancement of NLP for low-resource African languages. The authors have meticulously curated a multi-domain dataset that not only addresses a critical gap in existing resources but also provides a robust benchmarking framework for evaluating NER models in Yorùbá. The manual annotation process, with high inter-annotator agreement, ensures the reliability of the dataset, while the introduction of OyoBERT demonstrates the tangible benefits of language-specific models over general multilingual approaches. However, the study also highlights persistent challenges, particularly in informal domains where performance lags, underscoring the need for further research in domain adaptation and noise handling. The public release of YoNER and OyoBERT is commendable and aligns with global efforts to decolonize AI, ensuring that African languages are not left behind in the AI revolution. This work sets a new standard for NLP research in low-resource languages and should inspire similar initiatives for other African languages.
Recommendations
- ✓ Expand YoNER to include additional entity types (e.g., temporal, medical, or legal entities) to enhance its utility for specialized NLP applications.
- ✓ Conduct further research to improve cross-domain performance, particularly in informal domains like blogs and movies, through techniques such as domain adaptation, data augmentation, or noise-aware modeling.
- ✓ Explore transfer learning from other African languages (e.g., Hausa, Swahili) to assess the generalizability of OyoBERT and YoNER, and to foster collaboration across linguistic communities.
- ✓ Increase the size of YoNER to include more diverse and representative data, particularly from underrepresented domains or dialects, to improve scalability and robustness.
- ✓ Develop community-driven annotation guidelines and tools to ensure the sustainability and continuous improvement of YoNER, involving native speakers and linguists in the process.
Sources
Original: arXiv - cs.CL