M-QUEST -- Meme Question-Understanding Evaluation on Semantics and Toxicity

Stefano De Giorgis, Ting-Chih Chen, Filip Ilievski

arXiv:2603.03315v1

Abstract: Internet memes are a powerful form of online communication, yet their nature and reliance on commonsense knowledge make toxicity detection challenging. Identifying key features for meme interpretation and understanding is a crucial task. Previous work has focused on some elements contributing to the meaning, such as the Textual dimension via OCR, the Visual dimension via object recognition, upper layers of meaning like the Emotional dimension, and Toxicity detection via proxy variables such as hate speech detection and sentiment analysis. Nevertheless, there is still no overall architecture that can formally identify the elements contributing to the meaning of a meme and be used in the sense-making process. In this work, we present a semantic framework and a corresponding benchmark for automatic knowledge extraction from memes. First, we identify the necessary dimensions to understand and interpret a meme: Textual material, Visual material, Scene, Background Knowledge, Emotion, Semiotic Projection, Analogical Mapping, Overall Intent, Target Community, and Toxicity Assessment. Second, the framework guides a semi-automatic process of generating a benchmark with commonsense question-answer pairs about meme toxicity assessment and its underlying reason. The resulting benchmark, M-QUEST, consists of 609 question-answer pairs for 307 memes. Third, we evaluate eight open-source large language models on their ability to correctly solve M-QUEST. Our results show that current models' commonsense reasoning capabilities for toxic meme interpretation vary depending on the dimension and architecture. Models with instruction tuning and reasoning capabilities significantly outperform the others, though pragmatic inference questions remain challenging. We release code, benchmark, and prompts to support future research intersecting multimodal content safety and commonsense reasoning.

Executive Summary

This article presents M-QUEST, a semantic framework and benchmark for automatic knowledge extraction from memes. The framework identifies 10 dimensions necessary to understand and interpret memes, ranging from Textual material and Visual material to Overall Intent, Target Community, and Toxicity Assessment. A semi-automatic process guided by the framework yields a benchmark of 609 question-answer pairs for 307 memes, and eight open-source large language models are evaluated on it. The results show that current models' commonsense reasoning capabilities vary depending on the dimension and architecture: models with instruction tuning and reasoning capabilities significantly outperform the others, yet pragmatic inference questions remain challenging. The authors release code, benchmark, and prompts to support future research on multimodal content safety and commonsense reasoning.
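To make the framework's shape concrete, here is a minimal Python sketch of how one benchmark item could be represented. Only the ten dimension names and the counts (609 pairs, 307 memes) come from the abstract. The `QAItem` record, its field names, and the multiple-choice answer format are assumptions for illustration; the released benchmark may be organized differently.

```python
from dataclasses import dataclass
from enum import Enum


class Dimension(Enum):
    """The ten interpretation dimensions listed in the abstract."""
    TEXTUAL_MATERIAL = "Textual material"
    VISUAL_MATERIAL = "Visual material"
    SCENE = "Scene"
    BACKGROUND_KNOWLEDGE = "Background Knowledge"
    EMOTION = "Emotion"
    SEMIOTIC_PROJECTION = "Semiotic Projection"
    ANALOGICAL_MAPPING = "Analogical Mapping"
    OVERALL_INTENT = "Overall Intent"
    TARGET_COMMUNITY = "Target Community"
    TOXICITY_ASSESSMENT = "Toxicity Assessment"


@dataclass(frozen=True)
class QAItem:
    # Hypothetical record layout, not the benchmark's actual schema.
    meme_id: str
    dimension: Dimension
    question: str
    choices: tuple[str, ...]  # assumed multiple-choice format
    answer_index: int         # index of the gold answer in `choices`

    def is_correct(self, predicted_index: int) -> bool:
        """Return True when the predicted choice matches the gold answer."""
        return predicted_index == self.answer_index


# Counts reported in the abstract: 609 question-answer pairs for 307 memes.
PAIRS, MEMES = 609, 307
pairs_per_meme = PAIRS / MEMES  # roughly two questions per meme on average

# Made-up example item, for illustration only.
example = QAItem(
    meme_id="meme-0001",
    dimension=Dimension.OVERALL_INTENT,
    question="What is the most plausible intent behind this meme?",
    choices=("To inform", "To mock a group", "To advertise"),
    answer_index=1,
)
```

Tagging each item with a single dimension is a design choice of this sketch; it keeps per-dimension scoring simple, but nothing in the abstract confirms that questions map one-to-one onto dimensions.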

Key Points

  • M-QUEST is a semantic framework and benchmark for automatic knowledge extraction from memes.
  • The framework identifies 10 dimensions necessary to understand and interpret memes.
  • Models with instruction tuning and reasoning capabilities significantly outperform the others, but pragmatic inference questions remain challenging.

Merits

Comprehensive Approach

The study takes a comprehensive approach to understanding and interpreting memes by identifying 10 dimensions and generating a benchmark for automatic knowledge extraction.

Practical Contributions

The release of code, benchmark, and prompts supports future research on multimodal content safety and commonsense reasoning.

Insights into Commonsense Reasoning

The study provides insights into the commonsense reasoning capabilities of current large language models and highlights the challenges in toxic meme interpretation.
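The finding that performance varies by dimension implies a per-dimension breakdown of scores. The sketch below shows one plausible way to compute such a breakdown, assuming each model answer is scored by exact match against a gold label. The function name `accuracy_by_dimension`, the triple layout, and the toy records are all hypothetical; the records are invented values, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Compute exact-match accuracy per dimension.

    `records` is an iterable of (dimension, gold, predicted) triples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, gold, predicted in records:
        total[dimension] += 1
        if gold == predicted:
            correct[dimension] += 1
    # Every key in `total` has at least one record, so no division by zero.
    return {dimension: correct[dimension] / total[dimension] for dimension in total}


# Toy, invented records purely to exercise the function.
toy_records = [
    ("Emotion", "A", "A"),
    ("Emotion", "B", "B"),
    ("Overall Intent", "C", "A"),
    ("Overall Intent", "D", "D"),
    ("Toxicity Assessment", "A", "A"),
]

scores = accuracy_by_dimension(toy_records)
for dimension, accuracy in sorted(scores.items()):
    print(f"{dimension}: {accuracy:.2f}")
```

Reporting one score per dimension, rather than a single aggregate, is what lets a weakness on pragmatic questions stay visible instead of being averaged away.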

Demerits

Limited Model Evaluation

The study evaluates only eight open-source large language models, which may not be representative of the broader range of models available, including proprietary ones.

Focus on Toxicity Detection

The study focuses primarily on toxicity detection, which may not capture the full range of meme interpretation challenges.

Methodological Limitations

The benchmark is produced by a semi-automatic generation process, which may introduce biases and limitations into the question-answer pairs.

Expert Commentary

The M-QUEST framework and benchmark make a significant contribution to the field of multimodal content safety and commonsense reasoning. The study highlights the challenges in toxic meme interpretation and provides insights into the commonsense reasoning capabilities of current large language models. The released code, benchmark, and prompts support future research and the development of more effective content moderation tools and strategies. However, future research should address the study's limitations: the small set of evaluated models, the narrow focus on toxicity detection, and the possible biases of its benchmark construction.

Recommendations

  • Develop more comprehensive and diverse benchmarks for automatic knowledge extraction from memes.
  • Further investigate the limitations and biases of the semi-automatically generated question-answer pairs and explore alternative methods for generating benchmarks.
