
MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust Check-Worthiness Detection Models

arXiv:2602.16298v1 (Announce Type: new)

Abstract: Large Language Models (LLMs) are beginning to reshape how media professionals verify information, yet automated support for detecting check-worthy claims, a key step in the fact-checking process, remains limited. We introduce the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection spanning 16 languages, 7 topical domains, and 2 writing styles. It consists of 123,722 samples, evenly distributed between noisy (informal) and structured (formal) texts, with balanced representation of check-worthy and non-check-worthy classes across all languages. To probe robustness, we also introduce an equally balanced out-of-distribution evaluation set of 27,761 samples in 4 additional languages. To provide baselines, we benchmark 3 common fine-tuned multilingual transformers against a diverse set of 15 commercial and open LLMs under zero-shot settings. Our findings show that fine-tuned models consistently outperform zero-shot LLMs on claim classification and show strong out-of-distribution generalization across languages, domains, and styles. MultiCW provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs on the check-worthy claim detection task.
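
To make the fine-tuned baseline condition concrete, here is a minimal sketch of fine-tuning a multilingual transformer for binary check-worthiness classification. The model choice (xlm-roberta-base), hyperparameters, and toy examples are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: fine-tune a multilingual transformer for binary
# check-worthiness classification. Model, hyperparameters, and the toy
# examples are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "xlm-roberta-base"  # assumed multilingual baseline

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Toy stand-ins for MultiCW rows: (text, label) with 1 = check-worthy.
train_texts = ["The unemployment rate fell to 3.5% last month.",
               "Good morning everyone, lovely weather today!"]
train_labels = [1, 0]

class ClaimDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels for the HuggingFace Trainer."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True,
                             max_length=128, return_tensors="pt")
        self.labels = torch.tensor(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        return {"input_ids": self.enc["input_ids"][i],
                "attention_mask": self.enc["attention_mask"][i],
                "labels": self.labels[i]}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="multicw-ckpt", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ClaimDataset(train_texts, train_labels),
)
trainer.train()
```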

Executive Summary

The article introduces the Multi-Check-Worthy (MultiCW) dataset, a balanced multilingual benchmark for check-worthy claim detection. It comprises 123,722 samples across 16 languages, 7 topical domains, and 2 writing styles, evenly split between check-worthy and non-check-worthy classes. An additional out-of-distribution evaluation set of 27,761 samples in 4 further languages supports robustness testing. The authors benchmark 3 fine-tuned multilingual transformers against 15 commercial and open Large Language Models (LLMs) under zero-shot settings, showing that the fine-tuned models consistently outperform zero-shot LLMs and generalize well out of distribution across languages, domains, and styles. MultiCW thus provides a rigorous multilingual resource for advancing automated fact-checking and enables systematic comparisons between fine-tuned models and cutting-edge LLMs, with clear relevance to countering misinformation and disinformation.
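
For contrast with the fine-tuning sketch above, the zero-shot LLM condition can be approximated by simply prompting a chat model. The prompt wording and model name below are assumptions; the paper's exact prompts are not given here.

```python
# Hedged sketch of the zero-shot LLM condition: ask a chat model whether a
# sentence contains a check-worthy factual claim. Prompt wording and model
# name are assumptions, not the paper's protocol.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def zero_shot_check_worthy(text: str, model: str = "gpt-4o-mini") -> bool:
    prompt = (
        "Does the following sentence contain a factual claim worth "
        "fact-checking? Answer with exactly 'yes' or 'no'.\n\n"
        f"Sentence: {text}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for classification
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")

print(zero_shot_check_worthy("The city cut its education budget by 40%."))
```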

Key Points

  • Introduction of the Multi-Check-Worthy (MultiCW) dataset for check-worthy claim detection
  • Balanced multilingual benchmark with 123,722 samples across 16 languages, 7 topical domains, and 2 writing styles
  • Out-of-distribution evaluation set of 27,761 samples in 4 additional languages for robustness testing and benchmarking of fine-tuned models (a quick balance-audit sketch follows this list)
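
Because the dataset's headline property is its balance, a user would likely want to verify the per-language class and style proportions on download. The sketch below assumes a hypothetical CSV release with "language", "style", and "label" columns; the actual schema is not specified here.

```python
# Quick balance audit of the kind MultiCW claims: per-language class and
# style proportions. File name and column names ("language", "style",
# "label") are hypothetical placeholders for the release's real schema.
import pandas as pd

df = pd.read_csv("multicw_train.csv")  # hypothetical file name

# Proportion of each label within every (language, style) group;
# a balanced dataset should show values close to 0.5 everywhere.
balance = df.groupby(["language", "style"])["label"].value_counts(normalize=True)
print(balance.unstack(fill_value=0).round(3))
```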

Merits

Strength

The MultiCW dataset provides a rigorous multilingual resource for advancing automated fact-checking, enabling systematic comparisons between fine-tuned models and cutting-edge LLMs.

Robustness

The out-of-distribution evaluation set, covering 4 languages absent from the training data, shows that models fine-tuned on MultiCW generalize across languages, domains, and styles rather than overfitting to the training distribution.
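
A minimal version of this out-of-distribution check is to score a trained classifier on the held-out languages with macro F1, a standard metric for balanced binary benchmarks. The labels and predictions below are toy stand-ins; the paper's exact metric and splits may differ.

```python
# Sketch of the OOD evaluation: macro F1 on a held-out-language split.
# y_true/y_pred are toy stand-ins for gold labels and model predictions.
from sklearn.metrics import classification_report, f1_score

y_true = [1, 0, 1, 1, 0, 0]  # gold check-worthiness labels (OOD split)
y_pred = [1, 0, 0, 1, 0, 1]  # model predictions on the same split

print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred,
                            target_names=["not check-worthy", "check-worthy"]))
```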

Demerits

Limitation

The dataset covers 20 languages in total (16 in training, 4 out-of-distribution) but may not account for low-resource languages or dialects outside that set, potentially limiting its applicability in some contexts.

Scalability

The dataset's size, while substantial, may not be sufficient to capture the full complexity of real-world check-worthy claim detection scenarios.

Expert Commentary

The MultiCW dataset represents a significant step forward in the development of robust check-worthy claim detection models. Its balanced multilingual design and out-of-distribution evaluation set provide a rigorous testing ground for fine-tuned models and cutting-edge LLMs. As misinformation and disinformation continue to threaten public discourse, the dataset's implications for automated fact-checking are particularly timely. Its limitations, such as potential biases in the data and scalability concerns, must still be weighed carefully, but MultiCW offers a valuable resource for researchers and developers seeking to advance check-worthy claim detection.

Recommendations

  • Future research should prioritize the development of fine-tuned models that can generalize across languages, domains, and styles, leveraging the MultiCW dataset for robustness testing and benchmarking.
  • The dataset's implications for fact-checking and misinformation detection should be explored in the context of policy and regulation, informing decisions regarding online content moderation and media literacy initiatives.
