Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov

arXiv:2602.18964v1. Abstract: Sarcasm detection poses a fundamental challenge in computational semantics, requiring models to resolve disparities between literal and intended meaning. The challenge is amplified in low-resource languages where annotated datasets are scarce or nonexistent. We present Yor-Sarc, the first gold-standard dataset for sarcasm detection in Yorùbá, a tonal Niger-Congo language spoken by over 50 million people. The dataset comprises 436 instances annotated by three native speakers from diverse dialectal backgrounds using an annotation protocol specifically designed for Yorùbá sarcasm by taking culture into account. This protocol incorporates context-sensitive interpretation and community-informed guidelines and is accompanied by a comprehensive analysis of inter-annotator agreement to support replication in other African languages. Substantial to almost perfect agreement was achieved (Fleiss' κ = 0.7660; pairwise Cohen's κ = 0.6732–0.8743), with 83.3% unanimous consensus. One annotator pair achieved almost perfect agreement (κ = 0.8743; 93.8% raw agreement), exceeding a number of reported benchmarks for English sarcasm research works. The remaining 16.7% majority-agreement cases are preserved as soft labels for uncertainty-aware modelling. Yor-Sarc (https://github.com/toheebadura/yor-sarc) is expected to facilitate research on semantic interpretation and culturally informed NLP for low-resource African languages.
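
The abstract says that majority-agreement cases are kept as soft labels, but it does not say how they are encoded. The sketch below shows one plausible encoding, assuming binary labels (1 = sarcastic, 0 = not sarcastic) and three votes per instance. The function name `soft_label` and the toy votes are hypothetical and are not taken from the Yor-Sarc repository.

```python
from collections import Counter


def soft_label(votes):
    """Return (majority_label, share_of_sarcastic_votes) for one instance.

    votes: list of 0/1 labels, one per annotator (1 = sarcastic).
    Assumes an odd number of annotators, so a binary vote cannot tie.
    """
    counts = Counter(votes)
    majority = counts.most_common(1)[0][0]
    p_sarcastic = counts[1] / len(votes)
    return majority, p_sarcastic


# Hypothetical toy votes from three annotators (not real Yor-Sarc data).
toy_votes = [
    [1, 1, 1],  # unanimous: sarcastic
    [0, 0, 0],  # unanimous: not sarcastic
    [1, 1, 0],  # majority: sarcastic
    [0, 1, 0],  # majority: not sarcastic
]

labels = [soft_label(v) for v in toy_votes]
for votes, (hard, soft) in zip(toy_votes, labels):
    print(votes, "-> hard label:", hard, "soft label:", round(soft, 3))
```

Under this encoding a unanimous instance gets a soft label of 0 or 1, while a two-to-one split gets 1/3 or 2/3, which an uncertainty-aware model can use as a training target instead of the hard majority label.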

Executive Summary

The article introduces Yor-Sarc, a gold-standard dataset for sarcasm detection in Yorùbá, a low-resource African language. The dataset, comprising 436 instances annotated by three native speakers from diverse dialectal backgrounds, achieves substantial to almost perfect inter-annotator agreement. The study highlights the importance of culturally informed annotation protocols and provides a foundation for research in semantic interpretation and NLP for low-resource African languages.

Key Points

  • Introduction of Yor-Sarc, the first gold-standard dataset for sarcasm detection in Yorùbá.
  • High inter-annotator agreement with Fleiss' κ = 0.7660 and pairwise Cohen's κ ranging from 0.6732 to 0.8743.
  • Culturally informed annotation protocol designed specifically for Yorùbá sarcasm.
  • Potential for facilitating research in semantic interpretation and culturally informed NLP for low-resource African languages.
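
The agreement statistics listed above can be reproduced from any table of per-annotator labels. The following is a minimal sketch using the standard definitions of Cohen's κ (two raters) and Fleiss' κ (several raters); it is not the authors' code, and the labels `a1`, `a2`, `a3` are hypothetical toy data rather than Yor-Sarc annotations.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labelled the same items."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of the two raters' marginal label rates.
    p_expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def fleiss_kappa(items):
    """Fleiss' kappa; items is a list of per-item label tuples, one label per rater."""
    n_items = len(items)
    n_raters = len(items[0])
    totals = Counter()
    per_item_agreement = []
    for item in items:
        counts = Counter(item)
        totals.update(counts)
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item_agreement.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_bar = sum(per_item_agreement) / n_items
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return (p_bar - p_expected) / (1 - p_expected)


# Hypothetical toy labels for ten items (1 = sarcastic, 0 = not sarcastic).
a1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
a2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 1]
a3 = [1, 0, 0, 0, 1, 0, 1, 0, 1, 1]

annotators = {"A1": a1, "A2": a2, "A3": a3}
pairwise = {
    (x, y): cohen_kappa(annotators[x], annotators[y])
    for x, y in combinations(annotators, 2)
}
items = list(zip(a1, a2, a3))
overall = fleiss_kappa(items)
unanimous = sum(len(set(item)) == 1 for item in items) / len(items)

print("pairwise Cohen's kappa:", pairwise)
print("Fleiss' kappa:", overall)
print("unanimous share:", unanimous)
```

Both functions divide by one minus the chance agreement, so they are undefined when every rater uses a single label for every item; a production version would need to handle that case.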

Merits

Culturally Informed Annotation

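A key strength is the annotation protocol, which was designed specifically for Yorùbá sarcasm and incorporates context-sensitive interpretation and community-informed guidelines. This makes the labels more likely to reflect how sarcasm is expressed and understood by speakers of the language.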

High Inter-Annotator Agreement

The reported agreement (Fleiss' κ = 0.7660, pairwise Cohen's κ from 0.6732 to 0.8743, and 83.3% unanimous consensus) supports the reliability of the labels and makes the dataset a credible resource for future research.
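
The wording "substantial" and "almost perfect" matches the commonly used Landis and Koch interpretation bands for κ. Assuming that is the scale the abstract has in mind (it does not name one), the reported values can be mapped to bands as sketched below; the function name `agreement_band` is hypothetical.

```python
def agreement_band(kappa):
    """Map a kappa value to its Landis and Koch (1977) descriptive band."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, name in bands:
        if kappa <= upper:
            return name
    raise ValueError("kappa cannot exceed 1")


# Values reported in the abstract.
reported = {
    "Fleiss' kappa": 0.7660,
    "lowest pairwise Cohen's kappa": 0.6732,
    "highest pairwise Cohen's kappa": 0.8743,
}
bands = {name: agreement_band(value) for name, value in reported.items()}
for name, band in bands.items():
    print(name, "->", band)
```

These bands are a convention rather than a statistical test, so they are best read as a rough description of the agreement level.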

Foundation for Low-Resource Language Research

Yor-Sarc sets a precedent for similar efforts in other low-resource African languages, addressing a critical gap in the field of NLP.

Demerits

Limited Dataset Size

The dataset comprises only 436 instances, which is small for training or fine-tuning models and may limit how well results obtained on it generalize.

Potential Bias in Annotation

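Three annotators, even from diverse dialectal backgrounds, cannot represent every Yorùbá-speaking community. Pairwise Cohen's κ ranged from 0.6732 to 0.8743, so some annotator pairs disagreed noticeably more than others, and individual interpretation may still shape the labels.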

Focus on a Single Language

While Yor-Sarc is a significant contribution, its focus on a single language may limit its immediate applicability to other African languages.

Expert Commentary

Yor-Sarc is a notable step for NLP in low-resource African languages. Its culturally informed annotation protocol and documented inter-annotator agreement offer a useful model for dataset creation in this domain, and the dataset addresses a clear gap in resources for sarcasm detection and semantic interpretation. However, the small dataset size and the possibility of annotator bias warrant caution when generalizing from results obtained on it. Future research should expand the dataset and apply the same approach to other African languages. The study also underscores the importance of cultural context in NLP tasks, a factor that is often overlooked but matters for accurate results. Overall, Yor-Sarc is a valuable contribution toward more inclusive and culturally sensitive NLP research.

Recommendations

  • Expand the Yor-Sarc dataset to include a more diverse range of sarcastic expressions and contexts to enhance its robustness and generalizability.
  • Replicate the study's methodology in other low-resource African languages to create similar gold-standard datasets, fostering a more inclusive NLP landscape.

Sources