MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

arXiv:2602.21608v1

Abstract: Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce. Existing sentiment and sarcasm models largely focus on monolingual English or high-resource languages and struggle with transliteration variation, cultural references, and intra-sentential language switching. To address this gap, we introduce MixSarc, the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. The dataset contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation. We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting. Results show strong performance on humor detection but substantial degradation on sarcasm, offense, and vulgarity due to class imbalance and pragmatic complexity. Zero-shot models achieve competitive micro-F1 scores but low exact match accuracy. Further analysis reveals that over 42% of negative sentiment instances in an external dataset exhibit sarcastic characteristics. MixSarc provides a foundational resource for culturally aware NLP and supports more reliable multi-label modeling in code-mixed environments.

Executive Summary

This article introduces MixSarc, a Bangla-English code-mixed corpus for implicit meaning identification that addresses the scarcity of such resources for South Asian social media. The corpus comprises 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. Transformer-based models benchmarked on it perform strongly on humor detection but degrade substantially on sarcasm, offense, and vulgarity, which the authors attribute to class imbalance and pragmatic complexity. Zero-shot large language models reach competitive micro-F1 scores but low exact match accuracy. The authors position MixSarc as a foundational resource for culturally aware NLP and for more reliable multi-label modeling in code-mixed environments.

Key Points

  • MixSarc is the first publicly available Bangla-English code-mixed corpus for implicit meaning identification
  • The corpus comprises 9,087 manually annotated sentences, labeled for humor, sarcasm, offensiveness, and vulgarity
  • Zero-shot large language models achieve competitive micro-F1 scores on the corpus but low exact match accuracy
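
The abstract reports that zero-shot models reach competitive micro-F1 scores but low exact match accuracy. The two metrics can diverge because micro-F1 gives credit for every individual label decision, while exact match only counts a sentence when all four labels are right. The sketch below illustrates this on invented toy data; the label order, the function names, and the example rows are assumptions made for illustration and are not taken from the paper.

```python
# Illustrative only: toy label vectors, not data from MixSarc.
# Assumed label order: humor, sarcasm, offensive, vulgar.
LABELS = ["humor", "sarcasm", "offensive", "vulgar"]


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP/FP/FN over every (sentence, label) pair."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def exact_match(y_true, y_pred):
    """Fraction of sentences whose full label vector is predicted correctly."""
    if not y_true:
        return 0.0
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


y_true = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
y_pred = [
    [1, 0, 0, 0],  # misses sarcasm
    [1, 0, 0, 0],  # fully correct
    [0, 1, 0, 0],  # misses offensive
    [1, 0, 0, 0],  # misses vulgar
]

print(f"micro-F1:    {micro_f1(y_true, y_pred):.3f}")
print(f"exact match: {exact_match(y_true, y_pred):.3f}")
```

In this toy case three of four sentences each miss a single label, so most label decisions are correct while only one sentence matches in full.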

Merits

Strength of the corpus

The corpus is the first publicly available Bangla-English code-mixed corpus for implicit meaning identification, addressing the scarcity of such resources for South Asian social media.

Cultural awareness

The study highlights the importance of culturally aware NLP systems, which can better understand and handle code-mixed text, reducing the risk of misinterpretation and miscommunication.

Novel approach

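The study pairs a carefully built corpus (targeted social media collection, systematic filtering, and multi-annotator validation) with a two-sided evaluation: transformer-based models are benchmarked, and zero-shot large language models are evaluated under structured prompting.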
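
The abstract says zero-shot large language models are evaluated under structured prompting, but this summary does not reproduce the prompt. As a rough, hypothetical sketch of what a structured multi-label prompt and its parser might look like, the code below asks for a fixed JSON object and falls back to all-zero labels when the reply is malformed. The wording of the prompt, the key names, and both function names are assumptions, not the authors' setup, and no model is actually called.

```python
import json

# Hypothetical label keys; the paper's exact label names may differ.
LABELS = ("humor", "sarcasm", "offensive", "vulgar")


def build_prompt(sentence: str) -> str:
    """Build a zero-shot prompt that constrains the reply to a JSON object."""
    keys = ", ".join(f'"{name}"' for name in LABELS)
    return (
        "You are labeling a Bangla-English code-mixed sentence.\n"
        "For each label, answer 1 if it applies and 0 if it does not.\n"
        f"Reply with a single JSON object with exactly these keys: {keys}.\n"
        f"Sentence: {sentence}\n"
    )


def parse_labels(raw: str) -> dict:
    """Parse a model reply into {label: 0 or 1}; malformed replies map to all zeros."""
    empty = {name: 0 for name in LABELS}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return empty
    if not isinstance(data, dict):
        return empty
    return {name: 1 if data.get(name) in (1, True, "1") else 0 for name in LABELS}


# A made-up reply standing in for a model response.
reply = '{"humor": 1, "sarcasm": 1, "offensive": 0, "vulgar": 0}'
print(build_prompt("<code-mixed sentence goes here>"))
print(parse_labels(reply))
```

Constraining the output to a fixed schema makes every reply comparable label by label, which is what per-label metrics such as micro-F1 require.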

Demerits

Class imbalance

Performance is strong on humor detection but degrades substantially on sarcasm, offense, and vulgarity, which the authors attribute in part to class imbalance. More annotated examples of the rarer labels would likely help.
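
One common way to quantify and partly offset label imbalance in a multi-label setting is to weight each label's positive examples by the ratio of negatives to positives. The source does not say whether the authors used any such reweighting, so the sketch below is an assumption offered for illustration only; the toy rows, label names, and the `label_pos_weights` helper are invented.

```python
# Illustrative only: invented rows, not the MixSarc label distribution.
LABELS = ["humor", "sarcasm", "offensive", "vulgar"]


def label_pos_weights(rows, labels):
    """Return {label: negatives / positives} for each label column.

    A weight above 1 means positives are rarer than negatives for that label.
    Labels with no positive examples fall back to a neutral weight of 1.0.
    """
    n = len(rows)
    weights = {}
    for j, name in enumerate(labels):
        pos = sum(row[j] for row in rows)
        neg = n - pos
        weights[name] = neg / pos if pos else 1.0
    return weights


rows = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]

for name, weight in label_pos_weights(rows, LABELS).items():
    print(f"{name:10s} pos_weight = {weight:.2f}")
```

Ratios of this kind follow the same convention as the `pos_weight` argument of a weighted binary cross-entropy loss, so rarer labels contribute more per positive example during training.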

Pragmatic complexity

The authors also attribute the degradation on sarcasm, offense, and vulgarity to pragmatic complexity. Transliteration variation, cultural references, and intra-sentential language switching make implicit meaning in code-mixed text hard for existing models to recover.

Limited generalizability

The study's findings may not be generalizable to other languages or code-mixed environments, highlighting the need for more research and cross-lingual comparisons.

Expert Commentary

The study is a significant contribution to implicit meaning identification in code-mixed text. MixSarc gives researchers and developers a valuable resource, while the benchmark results expose the limits of current NLP systems on sarcasm, offense, and vulgarity. Potential applications span social media, customer service, and healthcare. The results provide a foundation for future work on culturally aware NLP systems, which the study presents as essential for handling diverse social media environments.

Recommendations

  • Future research should develop more sophisticated approaches to implicit meaning identification in code-mixed text, particularly for sarcasm, offense, and vulgarity, where class imbalance and pragmatic complexity degrade performance.
  • Researchers and developers should prioritize culturally aware NLP systems, which are essential for handling diverse social media environments.
