Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning

arXiv:2602.23351v1 Abstract: The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data: how people communicate about visual content by default omits the tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5, and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being web-scale and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and the number of languages does not result in the emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training-data curation methods, rather than counting on scale for the emergence of reasoning capabilities.

Executive Summary

The article 'Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning' investigates the limitations of Vision-Language Models (VLMs) in reasoning tasks. The authors argue that reporting bias in training data, where tacit information is often omitted, leads to poor performance in spatial, temporal, negation, and counting reasoning. Despite the scale of data and model size, these reasoning skills do not emerge by default. The study demonstrates that intentional curation of training data, including annotations that capture tacit information, significantly improves performance. The findings underscore the importance of thoughtful data curation over relying solely on scale for enhancing model capabilities.
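To make the reporting-bias claim concrete, here is a minimal sketch (not the authors' methodology) of one way to estimate how often a caption corpus surfaces skill-relevant language. The marker lists and toy corpus are invented for illustration; the paper's pragmatics-based analysis goes well beyond such surface counts.

```python
import re
from collections import Counter

# Hypothetical lexical markers for the four skills the paper studies.
SKILL_MARKERS = {
    "spatial": {"left", "right", "above", "below", "behind", "beside", "between"},
    "temporal": {"before", "after", "while", "during", "then", "yesterday"},
    "negation": {"no", "not", "without", "none", "never", "nobody"},
    "counting": {"two", "three", "four", "five", "six", "seven", "eight"},
}

def skill_coverage(captions):
    """Fraction of captions containing at least one marker for each skill."""
    hits = Counter()
    for caption in captions:
        tokens = re.findall(r"[a-z0-9]+", caption.lower())
        token_set = set(tokens)
        for skill, markers in SKILL_MARKERS.items():
            # Bare numerals also count as counting language.
            if token_set & markers or (
                skill == "counting" and any(t.isdigit() for t in tokens)
            ):
                hits[skill] += 1
    return {skill: hits[skill] / len(captions) for skill in SKILL_MARKERS}

# Toy corpus: typical web captions versus one exhaustive description.
corpus = [
    "at the game today!",
    "best day ever with the squad",
    "a photo of 37 people standing behind a field",
]
print(skill_coverage(corpus))  # only the last caption registers any skill
```

Applied at web scale, a count like this would show how rarely captions state spatial layout, timing, absences, or exact quantities, which is the sense in which reporting bias starves the supervision signal.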

Key Points

  • Reporting bias in training data leads to insufficient representation of key reasoning skills in VLMs.
  • Scaling data size, model size, or the number of training languages does not automatically produce these reasoning skills.
  • Intentional curation of training data, with annotations that capture tacit information, is effective at instilling these skills (see the sketch after this list).
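That third point can be made concrete with a hypothetical caption-enrichment step. The `Annotation` class, `enrich_caption` helper, and field names below are invented for illustration and are not the paper's annotation interface; the idea is simply to turn tacit facts (counts, spatial layout, absences) into explicit supervision text.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Tacit facts an annotator (or detector) records for one image."""
    object_counts: dict       # e.g. {"people": 37}
    spatial_relations: list   # e.g. [("people", "behind", "field")]
    absences: list            # e.g. ["ball"]

def enrich_caption(caption: str, ann: Annotation) -> str:
    """Append explicit statements of tacit information to a web caption."""
    facts = [f"there are {n} {obj}" for obj, n in ann.object_counts.items()]
    facts += [f"the {a} are {rel} the {b}" for a, rel, b in ann.spatial_relations]
    facts += [f"there is no {obj}" for obj in ann.absences]
    return caption.rstrip("!. ") + ". " + ". ".join(facts) + "."

ann = Annotation(
    object_counts={"people": 37},
    spatial_relations=[("people", "behind", "field")],
    absences=["ball"],
)
print(enrich_caption("at the game today!", ann))
# -> at the game today. there are 37 people. the people are behind the field. there is no ball.
```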

Merits

Empirical Rigor

The study provides a rigorous empirical analysis of VLMs, supported by curated benchmarks and detailed examination of popular models.

Theoretical Insight

The article draws on theories from pragmatics to explain the limitations of VLMs, offering a novel perspective on the role of reporting bias.

Practical Implications

The findings offer actionable insights for improving VLM training, emphasizing the need for intentional data curation over scale.

Demerits

Scope Limitation

The study focuses on a limited set of reasoning skills, which may not encompass the full spectrum of reasoning capabilities required for VLMs.

Generalizability

The conclusions are based on specific models and benchmarks, which may not be generalizable to all VLMs or reasoning tasks.

Expert Commentary

The article presents a compelling argument that the limitations of Vision-Language Models in reasoning tasks are deeply rooted in the reporting bias present in their training data. The authors' thorough analysis of popular models like OpenCLIP, LLaVA-1.5, and Molmo, through the lens of pragmatics, provides a fresh perspective on why scaling data and model size does not automatically lead to improved reasoning capabilities. The study's findings are particularly significant as they challenge the prevalent belief in the AI community that larger datasets and more complex models will inherently solve complex reasoning tasks. Instead, the authors demonstrate that intentional curation of training data, with a focus on capturing tacit information, is crucial for enhancing model performance. This insight is not only academically valuable but also has profound practical implications for AI developers and policymakers. The study underscores the need for a more nuanced approach to AI development, one that prioritizes data quality and representation over sheer scale. As AI continues to integrate into various aspects of society, ensuring that models are trained on comprehensive and unbiased data will be essential for their ethical and effective deployment.

Recommendations

  • Future research should explore additional reasoning skills and their representation in training data to provide a more holistic understanding of VLM capabilities; lightweight probes like the sketch after this list are one starting point.
  • Developers should invest in creating and utilizing annotated datasets that capture tacit information to improve the reasoning skills of VLMs.
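As a starting point for such probing, the sketch below runs a minimal negation check against an OpenCLIP checkpoint, following the standard open_clip usage pattern. The checkpoint name comes from the open_clip model zoo, "photo.jpg" is a placeholder path, and this is an illustrative probe rather than one of the paper's benchmarks.

```python
import torch
import open_clip
from PIL import Image

# Standard open_clip loading; ViT-B-32 trained on LAION-2B.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # placeholder path
texts = tokenizer(["a photo with a dog", "a photo with no dog"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Softmax over the two prompts; scores near 50/50 on a clearly dog
    # (or dog-free) photo suggest the model is not reading the negation.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)
```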
