Retrieval Augmented Enhanced Dual Co-Attention Framework for Target Aware Multimodal Bengali Hateful Meme Detection


Raihan Tanvir, Md. Golam Rabiul Alam

arXiv:2602.19212v1. Abstract: Hateful content on social media increasingly appears as multimodal memes that combine images and text to convey harmful narratives. In low-resource languages such as Bengali, automated detection remains challenging due to limited annotated data, class imbalance, and pervasive code-mixing. To address these issues, we augment the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA), improving both class balance and semantic diversity. We propose the Enhanced Dual Co-attention Framework (xDORA), integrating vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling to learn robust cross-modal representations. Building on these embeddings, we develop a FAISS-based k-nearest neighbor classifier for non-parametric inference and introduce RAG-Fused DORA, which incorporates retrieval-driven contextual reasoning. We further evaluate LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings. Experiments on the extended dataset show that xDORA (CLIP + XLM-R) achieves macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, while RAG-Fused DORA improves performance to 0.79 and 0.74, yielding gains over the DORA baseline. The FAISS-based classifier performs competitively and demonstrates robustness for rare classes through semantic similarity modeling. In contrast, LLaVA exhibits limited effectiveness in few-shot settings, with only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning. These findings demonstrate the effectiveness of supervised, retrieval-augmented, and non-parametric multimodal frameworks for addressing linguistic and cultural complexities in low-resource hate speech detection.

Executive Summary

The paper presents the Enhanced Dual Co-attention Framework (xDORA) for detecting hateful memes in Bengali, a low-resource language. The authors address limited annotated data, class imbalance, and pervasive code-mixing by augmenting the Bengali Hateful Memes (BHM) dataset with semantically aligned samples from the Multimodal Aggression Dataset in Bengali (MIMOSA). The framework fuses vision encoders (CLIP, DINOv2) and multilingual text encoders (XGLM, XLM-R) via weighted attention pooling, and pairs the resulting embeddings with a FAISS-based k-nearest neighbor classifier for non-parametric inference. The authors also introduce RAG-Fused DORA, which adds retrieval-driven contextual reasoning. On the extended dataset, xDORA (CLIP + XLM-R) reaches macro-average F1-scores of 0.78 for hateful meme identification and 0.71 for target entity detection, which RAG-Fused DORA improves to 0.79 and 0.74. The study also evaluates LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings, finding it of limited effectiveness on code-mixed Bengali content without fine-tuning.

Key Points

  • Augmentation of the BHM dataset with MIMOSA samples to improve class balance and semantic diversity.
  • Integration of vision and text encoders via weighted attention pooling in the xDORA framework.
  • Use of a FAISS-based k-nearest neighbor classifier for non-parametric inference.
  • Introduction of RAG-Fused DORA for retrieval-driven contextual reasoning.
  • Evaluation of LLaVA under zero-shot, few-shot, and retrieval-augmented prompting settings.
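As a rough illustration of the non-parametric route above, the sketch below classifies a fused meme embedding by majority vote over its nearest labelled neighbours. All data here is synthetic, and exact NumPy cosine search stands in for the FAISS index so the snippet is self-contained (in practice one would use something like `faiss.IndexFlatIP` over L2-normalised vectors); the paper's actual embeddings, choice of k, and index type are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # fused embedding dimension (toy value, not the paper's)

# Synthetic "memory": fused image-text embeddings with hateful/non-hateful labels
memory = rng.standard_normal((200, d)).astype("float32")
labels = rng.integers(0, 2, size=200)

def knn_classify(query: np.ndarray, k: int = 5) -> int:
    """Majority vote over the k most cosine-similar stored embeddings."""
    m = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = m @ q                 # cosine similarity to every stored sample
    top = np.argsort(-sims)[:k]  # indices of the k nearest neighbours
    return int(np.bincount(labels[top]).argmax())

pred = knn_classify(rng.standard_normal(d).astype("float32"))
```

Because inference is a vote over stored examples rather than a learned decision boundary, rare classes can still be predicted whenever a query lands near their few stored embeddings, which matches the robustness the paper attributes to this classifier.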

Merits

Innovative Framework

The xDORA framework integrates vision encoders (CLIP, DINOv2) with multilingual text encoders (XGLM, XLM-R), leveraging weighted attention pooling to learn robust cross-modal representations.
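A minimal sketch of what weighted attention pooling over modality embeddings might look like (toy vectors and hand-set attention logits; the paper's actual fusion architecture, dimensions, and learned parameters are not reproduced here):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_attention_pool(embs: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Fuse modality embeddings with softmax-normalised attention weights.

    embs:   (num_modalities, dim) stack of per-modality embeddings
    scores: (num_modalities,) unnormalised attention logits (learned in practice)
    """
    weights = softmax(scores)
    return (weights[:, None] * embs).sum(axis=0)

vision_emb = np.full(4, 2.0)  # stand-in for a CLIP image embedding
text_emb = np.full(4, 4.0)    # stand-in for an XLM-R text embedding
fused = weighted_attention_pool(np.stack([vision_emb, text_emb]),
                                np.array([0.0, 0.0]))
# equal logits reduce the pooling to a simple average of the two modalities
```

The point of learning the logits rather than fixing them is that the model can lean on the image when the text is uninformative (and vice versa), which is central to memes where the harmful meaning arises from the image-text combination.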

Effective Data Augmentation

The augmentation of the BHM dataset with MIMOSA samples effectively addresses the challenges of limited annotated data and class imbalance in low-resource languages.

Robust Performance

Relative to xDORA (CLIP + XLM-R), RAG-Fused DORA raises macro-average F1 from 0.78 to 0.79 for hateful meme identification and from 0.71 to 0.74 for target entity detection, with gains over the DORA baseline, demonstrating the value of retrieval-driven contextual reasoning.
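One plausible shape for retrieval-driven contextual reasoning is sketched below: retrieve the most similar labelled memes from a memory store and format them as extra evidence for a downstream classifier or prompt. Everything here, including the memory store, similarity measure, and context format, is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8  # toy embedding dimension

# Synthetic retrieval memory: one embedding plus metadata per stored meme
store_embs = rng.standard_normal((50, dim))
store_meta = [{"caption": f"meme {i}", "label": int(rng.integers(0, 2))}
              for i in range(50)]

def retrieve(query_emb: np.ndarray, k: int = 3) -> list:
    """Return metadata of the k most cosine-similar stored memes."""
    m = store_embs / np.linalg.norm(store_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    top = np.argsort(-(m @ q))[:k]
    return [store_meta[i] for i in top]

def build_context(query_emb: np.ndarray) -> str:
    """Format retrieved neighbours as context for a classifier or LLM prompt."""
    lines = [f"- {ex['caption']} (label={ex['label']})"
             for ex in retrieve(query_emb)]
    return "Similar labelled memes:\n" + "\n".join(lines)

ctx = build_context(rng.standard_normal(dim))
```

Supplying retrieved, labelled near-neighbours gives the model concrete cultural and linguistic precedents for an ambiguous meme, which is where the reported gains on target entity detection would plausibly come from.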

Demerits

Limited Effectiveness of LLaVA

LLaVA shows limited effectiveness in few-shot settings and only modest improvements under retrieval augmentation, highlighting constraints of pretrained vision-language models for code-mixed Bengali content without fine-tuning.

Dependency on Data Augmentation

The effectiveness of the proposed framework is heavily dependent on the quality and relevance of the augmented data, which may not always be available or applicable in other low-resource languages.

Complexity of Implementation

The implementation of the xDORA and RAG-Fused DORA frameworks is complex and may require significant computational resources and expertise, limiting its accessibility for broader application.

Expert Commentary

The article presents a rigorous and well-reasoned approach to addressing the challenges of hateful meme detection in low-resource languages. The integration of vision and text encoders via weighted attention pooling in the xDORA framework is a notable advancement, demonstrating the potential of multimodal frameworks in improving the detection of hateful content. The augmentation of the BHM dataset with MIMOSA samples effectively addresses the issues of limited annotated data and class imbalance, providing a more robust and diverse dataset for training. The introduction of RAG-Fused DORA further enhances the framework's performance, highlighting the benefits of retrieval-driven contextual reasoning. However, the limited effectiveness of LLaVA in few-shot settings underscores the constraints of pretrained vision-language models in handling code-mixed content without fine-tuning. This limitation suggests the need for further research and development in this area. Overall, the study makes a significant contribution to the field of multimodal hate speech detection, offering valuable insights and practical implications for both researchers and policymakers.

Recommendations

  • Further research should explore the applicability of the proposed framework to other low-resource languages, adapting the data augmentation techniques and retrieval-augmented generation methods as needed.
  • Investigation into the fine-tuning of pretrained vision-language models for code-mixed content could enhance their effectiveness in few-shot settings, addressing the limitations highlighted in the study.
