
Recursive Belief Vision Language Model


Vaidehi Bagaria, Bijo Sebastian, Nirav Patel

arXiv:2602.20659v1 Announce Type: new Abstract: Current vision-language-action (VLA) models struggle with long-horizon manipulation under partial observability. Most existing approaches remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs). This leads to loss of task progress, action repetition under perceptual aliasing, and high inference latency. Semantic reasoning alone is not the primary bottleneck in long-horizon manipulation. Instead, VLAs lack persistent, action-conditioned state representations and exhibit limited temporal and physical reasoning, making them ill-suited for multi-stage control. This paper introduces RB-VLA, a belief-centric architecture trained with self-supervised world-model objectives that maintains a compact latent state encoding task-relevant history, dynamics, and object interactions. Queried once for high-level intent, the VLM provides task specification, while the belief tracks task progress and enables phase-aware, causally grounded control under partial observability without storing raw observations or scaling memory with time. The belief and intent jointly condition a diffusion policy for robust closed-loop execution. RB-VLA outperforms prior VLAs on long-horizon benchmarks, achieving 52.5% and 37.5% higher success on multi-stage pick-and-place and stacking tasks, respectively, compared to π0. It also reduces inference latency by up to 5x relative to baselines and eliminates memory growth across timesteps observed in existing VLAs. Ablations show that the belief module is the primary driver of performance, increasing success rates from 32.5% to 77.5%. These results demonstrate the effectiveness of belief-based state representations for long-horizon VLA policies.

Executive Summary

The article introduces the Recursive Belief Vision Language Model (RB-VLA), a belief-centric architecture that addresses the limitations of current vision-language-action (VLA) models. Trained with self-supervised world-model objectives, RB-VLA maintains a compact latent state that encodes task-relevant history, dynamics, and object interactions. The VLM is queried once for high-level intent, while the belief state supports phase-aware, causally grounded control under partial observability without storing raw observations or growing memory over time. Belief and intent jointly condition a diffusion policy for robust closed-loop execution. On long-horizon benchmarks, RB-VLA improves success rates substantially over prior VLAs and cuts inference latency, underscoring the value of belief-based state representations for long-horizon VLA policies.

Key Points

  • RB-VLA introduces a belief-centric architecture that maintains a persistent, action-conditioned latent state for VLA models
  • The belief is trained with self-supervised world-model objectives to encode task-relevant history, dynamics, and object interactions
  • The VLM is queried once for high-level intent; belief and intent jointly condition a diffusion policy
  • On long-horizon benchmarks, RB-VLA achieves 52.5% and 37.5% higher success on multi-stage pick-and-place and stacking tasks than π0, and reduces inference latency by up to 5x
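The belief-centric loop described in the key points can be sketched in a few lines: a fixed-size latent state is updated recurrently from each observation and action, and the policy is conditioned on that state plus a one-time intent embedding. This is an illustrative toy, not the paper's architecture; the dimensions, random weights, and tanh update rule below are all assumptions, and the actual belief module and diffusion policy are learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper does not specify these.
OBS_DIM, ACT_DIM, BELIEF_DIM, INTENT_DIM = 16, 4, 32, 8

# Random projections stand in for learned parameters.
W_obs = rng.normal(scale=0.1, size=(BELIEF_DIM, OBS_DIM))
W_act = rng.normal(scale=0.1, size=(BELIEF_DIM, ACT_DIM))
W_rec = rng.normal(scale=0.1, size=(BELIEF_DIM, BELIEF_DIM))

def update_belief(belief, obs, action):
    """Action-conditioned recurrent update: the belief summarizes
    history in a fixed-size vector, so no raw observations are
    stored and memory does not grow with the horizon."""
    return np.tanh(W_rec @ belief + W_obs @ obs + W_act @ action)

def policy_context(belief, intent):
    """Belief and high-level intent jointly condition the policy;
    in RB-VLA a diffusion policy denoises actions from this context."""
    return np.concatenate([belief, intent])

intent = rng.normal(size=INTENT_DIM)   # queried once from the VLM
belief = np.zeros(BELIEF_DIM)          # compact latent state
for t in range(100):                   # long horizon, constant memory
    obs = rng.normal(size=OBS_DIM)     # placeholder observation
    action = rng.normal(size=ACT_DIM)  # placeholder executed action
    belief = update_belief(belief, obs, action)

ctx = policy_context(belief, intent)
print(ctx.shape)  # (40,) — context size is fixed regardless of horizon
```

The key property the sketch illustrates is the last comment: however many timesteps elapse, the policy's conditioning vector stays the same size, which is what lets the architecture avoid the per-timestep memory growth of observation-driven VLAs.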

Merits

Strength in addressing current limitations

RB-VLA directly targets the failure modes of current VLA models: loss of task progress, action repetition under perceptual aliasing, and the high inference latency caused by repeated VLM queries.

Improved performance on long-horizon tasks

The model achieves 52.5% and 37.5% higher success rates on multi-stage pick-and-place and stacking tasks, respectively, compared to π0, outperforming prior VLAs.

Reduced inference latency and memory growth

RB-VLA reduces inference latency by up to 5x relative to baselines and eliminates the per-timestep memory growth seen in existing VLAs, making it a more efficient and scalable solution.

Demerits

Limited evaluation on diverse tasks

The authors focus on multi-stage pick-and-place and stacking tasks, and it is unclear how RB-VLA performs on more diverse tasks or real-world applications.

Lack of interpretability and explainability

While the ablations quantify the belief module's contribution (raising success from 32.5% to 77.5%), how the belief represents task phase and drives individual decisions is not analyzed, limiting the model's interpretability and explainability.

Expert Commentary

The introduction of RB-VLA marks a significant advance for VLA models. The combination of self-supervised world-model objectives with a belief-centric architecture is a promising approach to the persistent-state and latency limitations of current VLAs. However, further research is needed to analyze the learned belief representation and to evaluate performance on more diverse tasks and real-world deployments. The findings have important implications for long-horizon manipulation and for robotics systems more broadly.

Recommendations

  • Future research should focus on evaluating RB-VLA on more diverse tasks and real-world applications.
  • The authors should provide more detailed insights into the model's decision-making process and the role of the belief module in task success rates.
