Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
arXiv:2602.23898v1 Announce Type: cross Abstract: Referring Expression Comprehension (REC) links language to region-level visual perception. Standard benchmarks (RefCOCO, RefCOCO+, RefCOCOg) have progressed rapidly with multimodal LLMs but remain weak tests of visual reasoning and grounding: (i) many expressions are very short, leaving little reasoning demand; (ii) images often contain few distractors, making the target easy to find; and (iii) redundant descriptors enable shortcut solutions that bypass genuine text understanding and visual reasoning. We introduce Ref-Adv, a modern REC benchmark that suppresses shortcuts by pairing linguistically nontrivial expressions with only the information necessary to uniquely identify the target. The dataset contains referring expressions on real images, curated with hard distractors and annotated with reasoning facets including negation. We conduct comprehensive ablations (word-order perturbations and descriptor-deletion sufficiency) to show that solving Ref-Adv requires reasoning beyond simple cues, and we evaluate a broad suite of contemporary multimodal LLMs on Ref-Adv. Despite strong results on RefCOCO, RefCOCO+, and RefCOCOg, models drop markedly on Ref-Adv, revealing reliance on shortcuts and gaps in visual reasoning and grounding. We provide an in-depth failure analysis and aim for Ref-Adv to guide future work on visual reasoning and grounding in MLLMs.
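The two ablations named in the abstract, word-order perturbation and descriptor-deletion sufficiency, can be illustrated with a minimal sketch. The paper's exact protocol is not given here, so the function names, the example expression, and the probing logic below are assumptions: the underlying idea is that a model scoring equally well on shuffled expressions, or on expressions with a key descriptor removed, is likely exploiting shortcuts rather than parsing the language.

```python
import random

def shuffle_words(expression: str, seed: int = 0) -> str:
    """Word-order perturbation: permute the tokens of a referring
    expression while keeping its bag of words intact."""
    words = expression.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

def drop_descriptor(expression: str, descriptor: str) -> str:
    """Descriptor-deletion probe: remove one descriptor token to test
    whether the remaining words still suffice to locate the target."""
    return " ".join(w for w in expression.split() if w != descriptor)

# Hypothetical expression with a negation facet, as in Ref-Adv.
expr = "the mug that is not red to the left of the laptop"
print(shuffle_words(expr))           # same words, scrambled order
print(drop_descriptor(expr, "red"))  # color descriptor removed
```

If accuracy is unchanged under either perturbation, the expression's linguistic structure was not actually needed, which is exactly the shortcut behavior Ref-Adv is built to expose.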
Executive Summary
This article introduces Ref-Adv, a benchmark designed to evaluate the visual reasoning and grounding capabilities of multimodal large language models (MLLMs) on referring expression comprehension (REC). By pairing linguistically nontrivial expressions with only the information needed to uniquely identify the target, Ref-Adv suppresses shortcut solutions that bypass genuine text understanding and visual reasoning. The study argues that existing benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg are weak tests of these abilities: models that score well on them drop markedly on Ref-Adv, revealing a reliance on shortcuts and underscoring the need for more challenging benchmarks to guide future research.
Key Points
- ▸ Ref-Adv is a novel benchmark for evaluating the visual reasoning and grounding capabilities of MLLMs in referring expression comprehension tasks.
- ▸ The study highlights the limitations of existing benchmarks in testing visual reasoning and grounding.
- ▸ Despite strong results on standard benchmarks, contemporary MLLMs drop markedly on Ref-Adv, indicating reliance on shortcuts rather than genuine visual reasoning and grounding.
Merits
Ref-Adv's Novel Approach
Ref-Adv's design effectively suppresses shortcut solutions, providing a more challenging evaluation of MLLMs' visual reasoning and grounding capabilities.
Comprehensive Evaluation
The study evaluates a broad suite of contemporary MLLMs on Ref-Adv and pairs the results with ablations and an in-depth failure analysis, giving a thorough assessment of where these models succeed and where they fall short.
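REC evaluations of this kind conventionally score a prediction as correct when its bounding box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. Whether Ref-Adv uses this exact threshold is not stated in the abstract, so the sketch below shows the standard recipe rather than the paper's specific protocol.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width/height, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predicted boxes matching ground truth at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)
```

Under this metric, a benchmark with hard distractors is punishing by design: a box on the wrong but plausible object scores near-zero IoU, so shortcut-driven guesses translate directly into accuracy drops.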
Demerits
Limited Dataset Size
The Ref-Adv dataset may be limited in size, which could constrain the generalizability of the results and the breadth of conclusions that can be drawn about MLLM performance.
Dependence on Specific Tasks
Ref-Adv targets referring expression comprehension specifically, so its findings may not generalize to other visual reasoning and grounding tasks.
Expert Commentary
The findings have significant implications for building more robust MLLMs with stronger visual reasoning and grounding. Ref-Adv is a valuable contribution precisely because it resists shortcut solutions, and its failure analysis gives concrete direction for model development. That said, the limitations noted above, namely the task-specific design and the possibly limited dataset size, temper how broadly the conclusions apply, and future work should test whether the observed gaps persist on other grounding tasks. Overall, the study underscores the need for evaluation benchmarks that genuinely probe visual reasoning rather than rewarding shortcuts.
Recommendations
- ✓ Future research should focus on developing more challenging evaluation metrics and benchmarks for visual reasoning and grounding in MLLMs.
- ✓ Developers of MLLMs should prioritize the creation of more robust and effective models with improved visual reasoning and grounding capabilities.