Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

arXiv:2603.18662v1

Abstract: Geometric reasoning inherently requires "thinking with constructions" -- the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.
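The abstract's second pilot-study insight links valid constructions to lower reasoning perplexity. As a rough illustration of what that quantity measures, the sketch below computes perplexity from per-token log-probabilities using the standard definition (the exponential of the mean negative log-likelihood). The log-probability values are invented for illustration and are not results from the paper; how the authors actually score reasoning steps is not stated in the abstract.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical log-probabilities a model assigns to the same reasoning
# tokens, scored without and with an auxiliary construction in context.
# These numbers are made up purely to show the direction of the effect.
without_aux = [math.log(0.25)] * 4
with_aux = [math.log(0.5)] * 4

print(perplexity(without_aux))
print(perplexity(with_aux))
```

Under this definition, a construction that makes the subsequent reasoning tokens more predictable lowers perplexity, which is one way to read the "entropy reducer" claim.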

Executive Summary

The paper presents a framework that lets Multimodal Large Language Models (MLLMs) perform visual-text interleaved chain-of-thought geometric reasoning. The authors introduce GeoAux-Bench, a benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Their pilot study finds that interleaved visual-textual aids outperform single-modality counterparts and that valid constructions correlate with reduced reasoning perplexity. Building on this, they propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. The approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. The results suggest that learning when to add a construction, and not only how, matters for geometric reasoning in MLLMs.

Key Points

  • Introduction of GeoAux-Bench, a benchmark for visual-text interleaved chain-of-thought geometric reasoning
  • Demonstration of the effectiveness of interleaved visual-textual aids in geometric reasoning
  • Proposition of Action Applicability Policy Optimization (A2PO) for mastering strategic construction in MLLMs
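The abstract says A2PO uses counterfactual sampling to distinguish necessary from redundant constructions, but it does not give the reward formula. The following is a minimal sketch of one way such a signal could be built, under stated assumptions: the function names, the `margin` threshold, and the `bonus` value are all hypothetical and are not taken from the paper. The idea is to compare success rates with and without the construction, then reward the policy when its choice matches whether the construction actually helped.

```python
from typing import Callable


def estimate_success(rollout: Callable[[bool], bool], use_aux: bool, n: int) -> float:
    """Fraction of n sampled rollouts that reach the correct answer."""
    return sum(rollout(use_aux) for _ in range(n)) / n


def shaped_reward(correct: bool, used_aux: bool, p_with: float, p_without: float,
                  margin: float = 0.1, bonus: float = 0.5) -> float:
    """Correctness reward plus a hypothetical timing term.

    A construction counts as "necessary" when it raises the estimated
    success rate by more than `margin` over the counterfactual without it.
    The policy earns `bonus` when its choice matches that label and loses
    `bonus` when it builds a redundant aid or skips a necessary one.
    """
    base = 1.0 if correct else 0.0
    necessary = (p_with - p_without) > margin
    return base + bonus if used_aux == necessary else base - bonus


# Usage with assumed success rates (illustrative values only).
print(shaped_reward(correct=True, used_aux=True, p_with=0.8, p_without=0.3))
print(shaped_reward(correct=True, used_aux=True, p_with=0.5, p_without=0.5))
```

In this toy version the timing term is symmetric; the paper's Adaptive Reward Shaping also regulates construction quality, which this sketch does not attempt to model.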

Merits

Strength in addressing a critical limitation of existing MLLMs

Existing MLLMs are largely confined to passive inference over static diagrams. The paper targets the missing skill directly: knowing when and how to construct an effective visual aid during geometric reasoning.

Novel approach to visual-text interleaved chain-of-thought geometric reasoning

The framework interleaves textual construction steps with visual updates rather than relying on a single modality, and the pilot study reports that this interleaving outperforms text-only or image-only aids.

Demerits

Limited generalizability to other domains

The paper evaluates geometric reasoning only, so it is unclear whether the proposed approach generalizes to other domains.

Dependence on high-quality training data

The approach relies on problems whose textual construction steps are aligned with ground-truth visual updates. Data of that quality may not be available outside a curated benchmark such as GeoAux-Bench.

Expert Commentary

The paper makes a useful contribution to research on multimodal large language models. Its central idea, visual-text interleaved chain-of-thought for geometry, is supported by a new benchmark and by a pilot study showing that interleaved aids outperform single-modality ones. The proposed reinforcement learning paradigm, Action Applicability Policy Optimization (A2PO), addresses not only how to build a construction but when one is worth building, and the reported 3.51% gain over strong baselines suggests this selectivity helps. Two caveats remain: the method depends on training data with aligned textual and visual construction steps, and its applicability beyond geometry has not been shown. Further research is needed to establish how far the approach extends.

Recommendations

  • Future research should test whether the proposed approach generalizes beyond geometry to other tasks that combine visual and textual reasoning.
  • The authors should investigate the use of transfer learning to adapt the proposed approach to different geometric reasoning tasks.
