Egocentric Bias in Vision-Language Models
Abstract (arXiv:2602.15892v1)
Visual perspective taking--inferring how the world appears from another's viewpoint--is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit--models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
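The abstract's task description is compact, so a small sketch may help make it concrete. The character mapping, function names, and three-way scoring below are illustrative assumptions rather than the paper's actual stimulus set or protocol; the key idea is that a 180-degree in-plane rotation both reverses character order and rotates each glyph, so a response matching the unrotated stimulus is exactly the egocentric (camera-viewpoint) error the study measures.

```python
# Minimal sketch of a FlipSet-style item, assuming the task works as the
# abstract describes. ROT180, rotate_180, and score_response are
# hypothetical names; the real benchmark's character set is not shown here.

ROT180 = {  # glyphs that remain legible after a 180-degree rotation
    "b": "q", "q": "b", "d": "p", "p": "d",
    "n": "u", "u": "n", "6": "9", "9": "6",
    "0": "0", "8": "8", "o": "o", "s": "s", "x": "x", "z": "z",
}

def rotate_180(text: str) -> str:
    """Appearance of `text` after a 180-degree in-plane rotation:
    character order reverses and each glyph is individually rotated."""
    return "".join(ROT180[c] for c in reversed(text))

def score_response(stimulus: str, response: str) -> str:
    """Classify an answer to 'what does the other agent see?'."""
    if response == rotate_180(stimulus):
        return "correct"       # simulated the other agent's viewpoint
    if response == stimulus:
        return "egocentric"    # reproduced the camera viewpoint
    return "other"

assert rotate_180("bud") == "pnq"
print(score_response("bud", "bud"))  # -> "egocentric"
```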
Executive Summary
This article introduces FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models (VLMs). Evaluating 103 VLMs, the study finds systematic egocentric bias: most models perform below chance, and roughly three-quarters of their errors reproduce the camera viewpoint rather than the other agent's. Control experiments expose a compositional deficit: models handle theory-of-mind and mental-rotation tasks in isolation, yet fail when the two must be combined, indicating that current VLMs cannot bind social awareness to spatial operations. The authors conclude that this reflects fundamental limitations in model-based spatial reasoning and position FlipSet as a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
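To make the headline numbers concrete, here is a hedged sketch of how per-item scores (such as those from the classifier above) might roll up into the model-level statistics the summary cites. The function name and the 0.5 chance level are assumptions, not the paper's code; the actual chance level depends on the answer format.

```python
# Illustrative aggregation into the study's two headline quantities:
# accuracy relative to chance, and the egocentric share of errors.

from collections import Counter

def summarize(scores: list[str], chance: float = 0.5) -> dict:
    counts = Counter(scores)
    n = len(scores)
    errors = n - counts["correct"]
    return {
        "accuracy": counts["correct"] / n,
        "below_chance": counts["correct"] / n < chance,
        "egocentric_error_share": counts["egocentric"] / errors if errors else 0.0,
    }

print(summarize(["egocentric", "egocentric", "correct", "egocentric", "other"]))
# accuracy 0.2 (below chance); egocentric share of errors 0.75
```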
Key Points
- Egocentric bias is a fundamental limitation in current vision-language models.
- FlipSet provides a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT).
- Current VLMs lack the mechanisms needed to bind social awareness to spatial operations.
Strengths
The evaluation spans 103 VLMs, giving an unusually broad picture of how current models handle perspective taking. FlipSet itself is a significant contribution: by reducing the task to 180-degree rotations of 2D character strings, it isolates spatial transformation from 3D scene complexity and makes the source of failure diagnosable.
Limitations
The benchmark targets Level-2 VPT with deliberately simplified 2D stimuli, so the findings may not generalize to richer 3D scenes or more complex perspective-taking tasks. In addition, the 103 evaluated VLMs, while a large sample, represent a single snapshot that may not reflect the broader and rapidly evolving model landscape.
Expert Commentary
The study's most telling result is the dissociation it exposes: models score well on theory of mind and on mental rotation in isolation, yet fail catastrophically when the two must be combined. This localizes the deficit to integration rather than to either component skill, which is a sharper diagnosis than a generic "VLMs are bad at spatial reasoning." The focus on Level-2 VPT with simplified 2D stimuli bounds how far the findings generalize, but within that scope FlipSet offers an unusually clean diagnostic, and its results carry clear implications for building VLMs capable of model-based spatial reasoning.
Recommendations
- Researchers should prioritize architectures and training methods that integrate social reasoning with spatial operations, the binding this study finds missing in current VLMs.
- Developers and evaluators should build more cognitively grounded testbeds, in the spirit of FlipSet, to diagnose perspective-taking capabilities in multimodal systems.