BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages

arXiv:2603.00634v1 Announce Type: new Abstract: Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a few high-resource languages, leaving low-resource linguistic communities without robust defense tools. We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated content (79K+ samples across 71 languages). BLUFF uniquely covers both high-resource "big-head" (20) and low-resource "long-tail" (59) languages, addressing critical gaps in multilingual research on detecting false and synthetic content. Our dataset features four content types (human-written, LLM-generated, LLM-translated, and hybrid human-LLM text), bidirectional translation (English↔X), 39 textual modification techniques (36 manipulation tactics for fake news, 3 AI-editing strategies for real news), and varying edit intensities generated using 19 diverse LLMs. We present AXL-CoI (Adversarial Cross-Lingual Agentic Chain-of-Interactions), a novel multi-agentic framework for controlled fake/real news generation, paired with mPURIFY, a quality filtering pipeline ensuring dataset integrity. Experiments reveal state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource versus high-resource languages. BLUFF provides the research community with a multilingual benchmark, extensive linguistic-oriented benchmark evaluation, comprehensive documentation, and open-source tools to advance equitable falsehood detection. Dataset and code are available at: https://jsl5710.github.io/BLUFF/

Executive Summary

The article introduces BLUFF, a comprehensive benchmark for detecting false and synthetic content across 79 languages, addressing a critical gap in multilingual research. The dataset comprises over 202K samples, combining human-written fact-checked content with LLM-generated text, and covers both high-resource and low-resource languages. The study finds that state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource languages relative to high-resource ones, underscoring the need for equitable falsehood detection.

Key Points

  • BLUFF is a multilingual benchmark for detecting false and synthetic content
  • The dataset covers 79 languages, including 59 low-resource languages
  • State-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource languages
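The headline finding above is a gap in F1 between language tiers. As a minimal sketch of how such a degradation figure can be computed, the snippet below averages per-language F1 for high-resource and long-tail languages and reports the relative drop. The scores are illustrative placeholders, not numbers from the BLUFF paper.

```python
# Hypothetical per-language F1 scores for a detector; the values below
# are illustrative stand-ins, not results reported by BLUFF.
high_resource_f1 = {"en": 0.91, "fr": 0.89, "zh": 0.88}
long_tail_f1 = {"am": 0.70, "yo": 0.66, "km": 0.68}

def mean(scores):
    return sum(scores.values()) / len(scores)

def f1_degradation(high, low):
    """Relative F1 drop of low-resource vs. high-resource languages, in percent."""
    hi, lo = mean(high), mean(low)
    return 100 * (hi - lo) / hi

gap = f1_degradation(high_resource_f1, long_tail_f1)
print(f"F1 degradation: {gap:.1f}%")
```

Reporting the drop as a relative percentage (rather than an absolute difference) keeps the figure comparable across detectors that operate at different baseline F1 levels.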

Merits

Comprehensive Dataset

The BLUFF dataset is extensive, covering a wide range of languages and content types, making it a valuable resource for researchers.

Novel Framework

The AXL-CoI framework and mPURIFY pipeline provide a controlled and quality-filtered approach to generating and evaluating fake and real news.
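The paper does not detail mPURIFY's individual filters here, so the sketch below is a hypothetical quality-filtering chain in the same spirit: deduplicate generated samples and keep only those passing every check. The specific heuristics (`length_ok`, `not_truncated`) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical quality-filtering chain in the spirit of mPURIFY;
# every check below is an illustrative stand-in, not the paper's method.
def length_ok(sample, min_chars=50):
    # Drop fragments too short to be a meaningful news sample.
    return len(sample["text"]) >= min_chars

def not_truncated(sample):
    # Reject generations that end mid-sentence (crude heuristic).
    return sample["text"].rstrip().endswith((".", "!", "?"))

def purify(samples, filters=(length_ok, not_truncated)):
    """Keep deduplicated samples that pass every filter."""
    seen, kept = set(), []
    for s in samples:
        key = s["text"].strip()
        if key in seen:  # exact-duplicate removal
            continue
        seen.add(key)
        if all(f(s) for f in filters):
            kept.append(s)
    return kept

raw = [
    {"text": "A verified news report about flooding in the region."},
    {"text": "A verified news report about flooding in the region."},  # duplicate
    {"text": "short"},                                                 # too short
    {"text": "An LLM draft that was cut off mid"},                     # truncated
]
print(len(purify(raw)))  # 1 sample survives
```

A filter chain like this is easy to extend, e.g. with a language-identification check to confirm a generation is actually in the requested target language.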

Demerits

Limited Generalizability

The reported performance degradation on low-resource languages may be specific to the BLUFF dataset and its generation pipeline; whether the gap holds on other datasets or in real-world deployment remains to be validated.

Expert Commentary

The BLUFF benchmark is a significant contribution to the field of multilingual research, highlighting the critical need for equitable falsehood detection. The study's findings have important implications for the development of more robust AI models and the need for policymakers to address language bias in AI. However, further research is needed to address the limitations of the study and ensure that the findings are generalizable to real-world scenarios. The BLUFF dataset and framework provide a valuable resource for researchers to build upon and advance the field of multilingual falsehood detection.

Recommendations

  • Further research is needed to develop more robust and equitable falsehood detection models that can perform well across languages and datasets.
  • Policymakers should prioritize addressing language bias in AI and ensuring that AI systems are designed and developed with inclusivity and equity in mind.
