Egocentric Bias in Vision-Language Models
Abstract (arXiv:2602.15892v1)
Visual perspective taking--inferring how the world appears from another's viewpoint--is foundational to social cognition. We introduce FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models. The task requires simulating 180-degree rotations of 2D character strings from another agent's perspective, isolating spatial transformation from 3D scene complexity. Evaluating 103 VLMs reveals systematic egocentric bias: the vast majority perform below chance, with roughly three-quarters of errors reproducing the camera viewpoint. Control experiments expose a compositional deficit--models achieve high theory-of-mind accuracy and above-chance mental rotation in isolation, yet fail catastrophically when integration is required. This dissociation indicates that current VLMs lack the mechanisms needed to bind social awareness to spatial operations, suggesting fundamental limitations in model-based spatial reasoning. FlipSet provides a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
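The abstract's task description is compact, so a small sketch may help make it concrete. The character mapping, function names, and three-way scoring below are illustrative assumptions rather than the paper's actual stimulus set or protocol; the key idea is that a 180-degree in-plane rotation both reverses character order and rotates each glyph, so a response matching the unrotated stimulus is exactly the egocentric (camera-viewpoint) error the study measures.

```python
# Minimal sketch of a FlipSet-style item, assuming the task works as the
# abstract describes. ROT180, rotate_180, and score_response are
# hypothetical names; the real benchmark's character set is not shown here.

ROT180 = {  # glyphs that remain legible after a 180-degree rotation
    "b": "q", "q": "b", "d": "p", "p": "d",
    "n": "u", "u": "n", "6": "9", "9": "6",
    "0": "0", "8": "8", "o": "o", "s": "s", "x": "x", "z": "z",
}

def rotate_180(text: str) -> str:
    """Appearance of `text` after a 180-degree in-plane rotation:
    character order reverses and each glyph is individually rotated."""
    return "".join(ROT180[c] for c in reversed(text))

def score_response(stimulus: str, response: str) -> str:
    """Classify an answer to 'what does the other agent see?'."""
    if response == rotate_180(stimulus):
        return "correct"       # simulated the other agent's viewpoint
    if response == stimulus:
        return "egocentric"    # reproduced the camera viewpoint
    return "other"

assert rotate_180("bud") == "pnq"
print(score_response("bud", "bud"))  # -> "egocentric"
```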
Executive Summary
This article introduces FlipSet, a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT) in vision-language models (VLMs). Evaluating 103 VLMs, the study finds systematic egocentric bias: most models perform below chance, and roughly three-quarters of their errors reproduce the camera viewpoint rather than the other agent's. Control experiments expose a compositional deficit: models handle theory-of-mind and mental-rotation tasks in isolation, yet fail when the two must be combined, indicating that current VLMs cannot bind social awareness to spatial operations. The authors conclude that this reflects fundamental limitations in model-based spatial reasoning and position FlipSet as a cognitively grounded testbed for diagnosing perspective-taking capabilities in multimodal systems.
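To make the headline numbers concrete, here is a hedged sketch of how per-item scores (such as those from the classifier above) might roll up into the model-level statistics the summary cites. The function name and the 0.5 chance level are assumptions, not the paper's code; the actual chance level depends on the answer format.

```python
# Illustrative aggregation into the study's two headline quantities:
# accuracy relative to chance, and the egocentric share of errors.

from collections import Counter

def summarize(scores: list[str], chance: float = 0.5) -> dict:
    counts = Counter(scores)
    n = len(scores)
    errors = n - counts["correct"]
    return {
        "accuracy": counts["correct"] / n,
        "below_chance": counts["correct"] / n < chance,
        "egocentric_error_share": counts["egocentric"] / errors if errors else 0.0,
    }

print(summarize(["egocentric", "egocentric", "correct", "egocentric", "other"]))
# accuracy 0.2 (below chance); egocentric share of errors 0.75
```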
Key Points
- Egocentric bias is a fundamental limitation in current vision-language models.
- FlipSet provides a diagnostic benchmark for Level-2 visual perspective taking (L2 VPT).
- Current VLMs lack the mechanisms needed to bind social awareness to spatial operations.
Strengths
The evaluation spans 103 VLMs, giving an unusually broad picture of how current models handle perspective taking. FlipSet itself is a significant contribution: by reducing the task to 180-degree rotations of 2D character strings, it isolates spatial transformation from 3D scene complexity and makes the source of failure diagnosable.
Limitations
The benchmark targets Level-2 VPT with deliberately simplified 2D stimuli, so the findings may not generalize to richer 3D scenes or more complex perspective-taking tasks. In addition, the 103 evaluated VLMs, while a large sample, represent a single snapshot that may not reflect the broader and rapidly evolving model landscape.
Expert Commentary
The study's most telling result is the dissociation it exposes: models score well on theory of mind and on mental rotation in isolation, yet fail catastrophically when the two must be combined. This localizes the deficit to integration rather than to either component skill, which is a sharper diagnosis than a generic "VLMs are bad at spatial reasoning." The focus on Level-2 VPT with simplified 2D stimuli bounds how far the findings generalize, but within that scope FlipSet offers an unusually clean diagnostic, and its results carry clear implications for building VLMs capable of model-based spatial reasoning.
Recommendations
- Researchers should prioritize architectures and training methods that integrate social reasoning with spatial operations, the binding this study finds missing in current VLMs.
- Developers and evaluators should build more cognitively grounded testbeds, in the spirit of FlipSet, to diagnose perspective-taking capabilities in multimodal systems.