MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification

arXiv:2602.21608v1. Abstract: Bangla-English code-mixing is widespread across South Asian social media, yet resources for implicit meaning identification in this setting remain scarce. Existing sentiment and sarcasm models largely focus on monolingual English or high-resource languages and struggle with transliteration variation, cultural references, and intra-sentential language switching. To address this gap, we introduce MixSarc, the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. The dataset contains 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. We construct the corpus through targeted social media collection, systematic filtering, and multi-annotator validation. We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting. Results show strong performance on humor detection but substantial degradation on sarcasm, offense,

Kazi Samin Yasar Alam, Md Tanbir Chowdhury, Tamim Ahmed, Ajwad Abrar, Md Rafid Haque · February 27, 2026

#cs.CL

Executive Summary

This article introduces MixSarc, a Bangla-English code-mixed corpus for implicit meaning identification, addressing the scarcity of such resources for South Asian social media. The corpus comprises 9,087 manually annotated sentences labeled for humor, sarcasm, offensiveness, and vulgarity. Models benchmarked on it perform strongly on humor detection but degrade substantially on sarcasm, offense, and vulgarity, a gap attributed to class imbalance and pragmatic complexity. The study also evaluates zero-shot large language models under structured prompting, highlights their limitations in code-mixed environments, and positions MixSarc as a foundational resource for culturally aware NLP and more reliable multi-label modeling.

Key Points

▸ MixSarc is a novel Bangla-English code-mixed corpus for implicit meaning identification
▸ The corpus comprises 9,087 manually annotated sentences, labeled for humor, sarcasm, offensiveness, and vulgarity
▸ The study highlights the limitations of zero-shot large language models in code-mixed environments
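
To make the multi-label annotation scheme concrete, here is a minimal sketch of how several annotators' judgments on one sentence could be merged into a single four-label record. The label keys, the `aggregate` helper, the majority-vote rule, and the tie-breaking choice are all assumptions for illustration; the source does not describe how the MixSarc annotators' decisions were actually reconciled.

```python
# Hypothetical label keys for the four categories named in the article.
LABELS = ("humor", "sarcasm", "offensive", "vulgar")


def aggregate(annotations):
    """Majority-vote each label across annotators.

    `annotations` is a list of dicts, one per annotator, mapping each
    label name to 0 or 1. A label is kept only when strictly more than
    half of the annotators marked it, so ties resolve to 0.
    """
    n = len(annotations)
    return {
        label: int(sum(a[label] for a in annotations) * 2 > n)
        for label in LABELS
    }


# Toy votes from three annotators on one sentence (not real corpus data).
votes = [
    {"humor": 1, "sarcasm": 1, "offensive": 0, "vulgar": 0},
    {"humor": 1, "sarcasm": 0, "offensive": 0, "vulgar": 0},
    {"humor": 1, "sarcasm": 1, "offensive": 1, "vulgar": 0},
]
gold = aggregate(votes)
print(gold)
```

Because each label is voted on independently, a sentence can end up with several labels at once, which is what makes the task multi-label rather than multi-class.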

Merits

Strength of the corpus

According to the authors, MixSarc is the first publicly available Bangla-English code-mixed corpus for implicit meaning identification. It addresses the scarcity of resources for this task in South Asian social media.

Cultural awareness

The study highlights the importance of culturally aware NLP systems, which can better understand and handle code-mixed text, reducing the risk of misinterpretation and miscommunication.

Novel approach

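The study pairs multi-annotator validation during corpus construction with two kinds of evaluation: benchmarking transformer-based models and testing zero-shot large language models under structured prompting. Structured prompting serves as the evaluation protocol for the zero-shot models; the abstract does not claim that it improves their accuracy.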
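
As an illustration of what structured prompting for zero-shot labeling might look like, the sketch below builds a prompt that asks for a fixed JSON schema and then validates the reply. The prompt wording, the label keys, and the helpers `build_prompt` and `parse_response` are hypothetical; the source does not give the authors' actual prompts, and no model is called here.

```python
import json

# Hypothetical label keys for the four categories named in the article.
LABELS = ("humor", "sarcasm", "offensive", "vulgar")


def build_prompt(sentence):
    """Build a prompt that constrains the reply to one JSON object."""
    keys = ", ".join(f'"{label}"' for label in LABELS)
    return (
        "You are labeling Bangla-English code-mixed social media text.\n"
        f"Return only a JSON object with the keys {keys}, each set to 0 or 1.\n"
        f"Sentence: {sentence}\n"
        "JSON:"
    )


def parse_response(text):
    """Extract the label dict from a model reply, or None if malformed."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    labels = {}
    for label in LABELS:
        if obj.get(label) not in (0, 1):
            return None  # missing key or out-of-range value
        labels[label] = int(obj[label])
    return labels


prompt = build_prompt("<code-mixed sentence here>")
reply = 'Sure: {"humor": 1, "sarcasm": 0, "offensive": 0, "vulgar": 0}'
print(parse_response(reply))
```

Rejecting malformed replies instead of guessing keeps the evaluation honest: a reply that cannot be parsed can be counted as a failure rather than silently mapped to a label.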

Demerits

Class imbalance

The labels are imbalanced: models detect humor well but struggle with sarcasm, offense, and vulgarity, which suggests that the under-represented classes need more data and annotation effort.
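
One common way to compensate for this kind of imbalance in multi-label training is to up-weight the positive examples of rare labels by the negative-to-positive ratio. The sketch below computes those ratios; it is a general technique rather than something the source says the authors used, and the `positive_weights` helper and the toy rows are assumptions that do not reflect the real MixSarc label distribution.

```python
def positive_weights(rows, labels):
    """Return the negative-to-positive count ratio for each label.

    A ratio above 1 means positives are rarer than negatives, so a
    weighted binary cross-entropy loss would count them more heavily.
    Labels with no positives fall back to a neutral weight of 1.0.
    """
    weights = {}
    for label in labels:
        pos = sum(row[label] for row in rows)
        neg = len(rows) - pos
        weights[label] = neg / pos if pos else 1.0
    return weights


# Toy rows only: humor is frequent, sarcasm is rare.
rows = [
    {"humor": 1, "sarcasm": 0},
    {"humor": 1, "sarcasm": 0},
    {"humor": 0, "sarcasm": 1},
    {"humor": 1, "sarcasm": 0},
]
weights = positive_weights(rows, ("humor", "sarcasm"))
print(weights)
```

In this toy set the rare sarcasm label receives a weight of 3.0 while the frequent humor label receives about 0.33, so errors on sarcasm positives would cost more during training.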

Pragmatic complexity

Implicit meaning in code-mixed text depends on transliteration variation, cultural references, and intra-sentential language switching. This pragmatic complexity makes the text easy to misinterpret and calls for more sophisticated NLP systems.

Limited generalizability

The study's findings may not be generalizable to other languages or code-mixed environments, highlighting the need for more research and cross-lingual comparisons.

Expert Commentary

The study is a significant contribution to implicit meaning identification in code-mixed text. MixSarc gives researchers and developers a valuable resource, while the benchmark results expose the limits of current NLP systems on sarcasm, offense, and vulgarity. Potential applications include social media, customer service, and healthcare. The results also provide a foundation for future work on culturally aware NLP systems for diverse social media environments.

Recommendations

✓ Future research should develop more sophisticated approaches to implicit meaning identification in code-mixed text, and should examine whether better prompting and additional annotated data can close the gap for zero-shot large language models.
✓ Researchers and developers should prioritize culturally aware NLP systems, which are essential for effective communication in diverse social media environments.

Sources

arXiv - cs.CL
