Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
arXiv:2602.23898v1 Announce Type: cross Abstract: Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word-order perturbations and descriptor-deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
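The two ablations named in the abstract, word-order perturbation and descriptor-deletion sufficiency, can be illustrated with a minimal sketch. The paper's exact protocol is not given here, so the function names, the example expression, and the probing logic below are assumptions: the underlying idea is that a model scoring equally well on shuffled expressions, or on expressions with a key descriptor removed, is likely exploiting shortcuts rather than parsing the language.

```python
import random

def shuffle_words(expression: str, seed: int = 0) -> str:
    """Word-order perturbation: permute the tokens of a referring
    expression while keeping its bag of words intact."""
    words = expression.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def drop_descriptor(expression: str, descriptor: str) -> str:
    """Descriptor-deletion probe: remove one descriptor token to test
    whether the remaining words still suffice to locate the target."""
    return " ".join(w for w in expression.split() if w != descriptor)

# Hypothetical expression with a negation facet, as in Ref-Adv.
expr = "the mug that is not red to the left of the laptop"
print(shuffle_words(expr))           # same words, scrambled order
print(drop_descriptor(expr, "red"))  # color descriptor removed
```

If accuracy is unchanged under either perturbation, the expression's linguistic structure was not actually needed, which is exactly the shortcut behavior Ref-Adv is built to expose.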
Executive Summary
This article introduces Ref-Adv, a benchmark designed to evaluate the visual reasoning and grounding capabilities of multimodal large language models (MLLMs) on referring expression comprehension (REC). By pairing linguistically nontrivial expressions with only the information needed to uniquely identify the target, Ref-Adv suppresses shortcut solutions that bypass genuine text understanding and visual reasoning. The study argues that existing benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg are weak tests of these abilities: models that score well on them drop markedly on Ref-Adv, revealing a reliance on shortcuts and underscoring the need for more challenging benchmarks to guide future research.
Key Points
- ▸ Ref-Adv is a novel benchmark for evaluating the visual reasoning and grounding capabilities of MLLMs in referring expression comprehension tasks.
- ▸ The study highlights the limitations of existing benchmarks in testing visual reasoning and grounding.
- ▸ Despite strong results on standard benchmarks, contemporary MLLMs drop markedly on Ref-Adv, indicating reliance on shortcuts rather than genuine visual reasoning and grounding.
Merits
Ref-Adv's Novel Approach
Ref-Adv's design effectively suppresses shortcut solutions, providing a more challenging evaluation of MLLMs' visual reasoning and grounding capabilities.
Comprehensive Evaluation
The study evaluates a broad suite of contemporary MLLMs on Ref-Adv and pairs the results with ablations and an in-depth failure analysis, giving a thorough assessment of where these models succeed and where they fall short.
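REC evaluations of this kind conventionally score a prediction as correct when its bounding box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. Whether Ref-Adv uses this exact threshold is not stated in the abstract, so the sketch below shows the standard recipe rather than the paper's specific protocol.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width/height, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predicted boxes matching ground truth at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```

Under this metric, a benchmark with hard distractors is punishing by design: a box on the wrong but plausible object scores near-zero IoU, so shortcut-driven guesses translate directly into accuracy drops.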
Demerits
Limited Dataset Size
The Ref-Adv dataset may be limited in size, which could constrain the generalizability of the results and the breadth of conclusions that can be drawn about MLLM performance.
Dependence on Specific Tasks
Ref-Adv targets referring expression comprehension specifically, so its findings may not generalize to other visual reasoning and grounding tasks.
Expert Commentary
The findings have significant implications for building more robust MLLMs with stronger visual reasoning and grounding. Ref-Adv is a valuable contribution precisely because it resists shortcut solutions, and its failure analysis gives concrete direction for model development. That said, the limitations noted above, namely the task-specific design and the possibly limited dataset size, temper how broadly the conclusions apply, and future work should test whether the observed gaps persist on other grounding tasks. Overall, the study underscores the need for evaluation benchmarks that genuinely probe visual reasoning rather than rewarding shortcuts.
Recommendations
- ✓ Future research should focus on developing more challenging evaluation metrics and benchmarks for visual reasoning and grounding in MLLMs.
- ✓ Developers of MLLMs should prioritize the creation of more robust and effective models with improved visual reasoning and grounding capabilities.