ViewFusion: Structured Spatial Thinking Chains for Multi-View Reasoning
arXiv:2603.06024v1 Announce Type: new Abstract: Multi-view spatial reasoning remains difficult for current vision-language models. Even when multiple viewpoints are available, models often underutilize cross-view relations and instead rely on single-image shortcuts, leading to fragile performance on viewpoint transformation and occlusion-sensitive cases. We present ViewFusion, a two-stage framework that explicitly separates cross-view spatial pre-alignment from question answering. In the first stage, the model performs deliberate spatial pre-thinking to infer viewpoint relations and spatial transformations across views, forming an intermediate workspace that goes beyond a simple re-description. In the second stage, the model conducts question-driven reasoning conditioned on this workspace to produce the final prediction. We train ViewFusion with synthetic reasoning supervision followed by reinforcement learning using GRPO, which improves answer correctness while stabilizing the intended two-stage generation behavior. On MMSI-Bench, ViewFusion improves accuracy by 5.3% over Qwen3-VL-4B-Instruct, with the largest gains on examples that require genuine cross-view alignment.
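The abstract's two-stage generation (spatial pre-thinking, then workspace-conditioned answering) can be sketched as a simple prompting pipeline. This is a minimal illustration, not the paper's implementation: the `generate` callable, the helper name, and both prompt wordings are assumptions for a generic chat-style VLM.

```python
def viewfusion_answer(generate, images, question):
    """Hypothetical two-stage sketch: generate(images, prompt) -> str is an
    assumed interface to a chat-style vision-language model."""
    # Stage 1: question-independent spatial pre-thinking. The output serves
    # as the intermediate cross-view "workspace" described in the abstract.
    workspace = generate(
        images,
        "Relate the viewpoints in these images: camera positions, shared "
        "landmarks, and the spatial transformation between views.",
    )
    # Stage 2: question-driven reasoning conditioned on that workspace,
    # rather than on the raw images alone.
    answer = generate(
        images,
        f"Cross-view workspace:\n{workspace}\n\n"
        f"Question: {question}\n"
        "Answer the question using the workspace above.",
    )
    return answer
```

The design point the sketch captures is that stage 1 runs before the question is consulted, so the workspace cannot collapse into a single-image shortcut tailored to the question.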
Executive Summary
The article introduces ViewFusion, a two-stage framework for multi-view spatial reasoning that improves the performance of vision-language models. By explicitly separating cross-view spatial pre-alignment from question answering, ViewFusion achieves a 5.3% accuracy gain over the Qwen3-VL-4B-Instruct baseline on the MMSI-Bench benchmark. The two-stage approach enables deliberate spatial pre-thinking followed by question-driven reasoning, making it particularly effective on examples that require genuine cross-view alignment.
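The abstract notes that after supervised training on synthetic reasoning traces, ViewFusion is refined with GRPO (Group Relative Policy Optimization). The core of GRPO is scoring each sampled rollout relative to its own group rather than against a learned value baseline; a minimal sketch of that advantage computation follows. The function name and the binary-correctness reward in the example are assumptions, not the paper's code.

```python
def group_relative_advantages(rewards, eps=1e-6):
    """Hypothetical sketch of GRPO's group-relative advantage: normalize each
    rollout's reward by the mean and std of its sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # eps guards against division by zero when all rewards in a group tie.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled answers to one question, reward 1.0 iff the final
# answer is correct (an assumed reward; the paper's reward may differ).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because advantages are centered within each group, only relative answer quality drives the policy update, which fits the abstract's claim that GRPO improves correctness while stabilizing the intended two-stage generation behavior.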
Key Points
- ▸ ViewFusion is a two-stage framework for multi-view spatial reasoning
- ▸ The framework separates cross-view spatial pre-alignment from question answering
- ▸ ViewFusion improves accuracy by 5.3% over the Qwen3-VL-4B-Instruct baseline on MMSI-Bench
Merits
Improved Accuracy
ViewFusion's two-stage approach leads to improved accuracy, particularly for examples that require genuine cross-view alignment.
Effective Use of Cross-View Relations
The framework's explicit separation of cross-view spatial pre-alignment from question answering enables more effective use of cross-view relations.
Demerits
Complexity
The two-stage approach may increase the complexity of the model, potentially leading to longer training times and increased computational requirements.
Expert Commentary
The introduction of ViewFusion marks a significant advancement in the field of multi-view spatial reasoning. By explicitly addressing the limitations of existing vision-language models, the authors have developed a framework that can effectively leverage cross-view relations to improve performance. The two-stage approach is particularly noteworthy, as it enables deliberate spatial pre-thinking and question-driven reasoning. However, further research is needed to fully explore the potential of ViewFusion and its applications in various domains.
Recommendations
- ✓ Further research is needed to explore the potential of ViewFusion in various domains, such as robotics and autonomous vehicles.
- ✓ The development of more efficient training methods is necessary to reduce the complexity and computational requirements of the two-stage approach.