PyVision-RL: Forging Open Agentic Vision Models via RL

arXiv:2602.20739v1 Announce Type: new Abstract: Reinforcement learning for agentic multimodal models often suffers from interaction collapse, where models learn to reduce tool usage and multi-turn reasoning, limiting the benefits of agentic behavior. We introduce PyVision-RL, a reinforcement learning framework for open-weight multimodal models that stabilizes training and sustains interaction. Our approach combines an oversampling-filtering-ranking rollout strategy with an accumulative tool reward to prevent collapse and encourage multi-turn tool use. Using a unified training pipeline, we develop PyVision-Image and PyVision-Video for image and video understanding. For video reasoning, PyVision-Video employs on-demand context construction, selectively sampling task-relevant frames during reasoning to significantly reduce visual token usage. Experiments show strong performance and improved efficiency, demonstrating that sustained interaction and on-demand visual processing are critical for scalable multimodal agents.

Executive Summary

The article introduces PyVision-RL, a reinforcement learning framework designed to stabilize training and promote sustained interaction in open-weight agentic multimodal models. By combining an oversampling-filtering-ranking rollout strategy with an accumulative tool reward, PyVision-RL prevents interaction collapse and encourages multi-turn tool use. A unified training pipeline yields two models, PyVision-Image and PyVision-Video, for image and video understanding; PyVision-Video additionally employs on-demand context construction, sampling only task-relevant frames during reasoning to reduce visual token usage. The reported results show strong performance and improved efficiency, highlighting the importance of sustained interaction and on-demand visual processing for scalable multimodal agents.
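The abstract does not spell out how the oversampling-filtering-ranking rollout strategy works internally. As a rough mental model only, one plausible reading is: generate more rollouts than needed, discard degenerate trajectories that never invoke a tool (the collapse mode the paper describes), then rank the survivors by reward and keep the best. The sketch below is illustrative; the `Rollout` type, the `min_tool_calls` filter, and all thresholds are assumptions, not details from the paper.

```python
# Hypothetical sketch of an oversampling-filtering-ranking rollout step.
# All names and thresholds are illustrative assumptions, not from the paper.
from dataclasses import dataclass


@dataclass
class Rollout:
    reward: float      # task reward for this trajectory
    tool_calls: int    # number of tool invocations in the trajectory


def select_rollouts(rollouts, k, min_tool_calls=1):
    """Oversample, filter out tool-free (collapsed) trajectories,
    then rank the rest by reward and keep the top k."""
    filtered = [r for r in rollouts if r.tool_calls >= min_tool_calls]
    ranked = sorted(filtered, key=lambda r: r.reward, reverse=True)
    return ranked[:k]


# The highest-reward rollout (0.9) is dropped because it used no tools.
pool = [Rollout(0.9, 0), Rollout(0.7, 3), Rollout(0.4, 2), Rollout(0.8, 1)]
best = select_rollouts(pool, k=2)
print([r.reward for r in best])  # [0.8, 0.7]
```

The key design point in this reading is that filtering happens before ranking, so a high-reward but tool-free trajectory cannot crowd out multi-turn behavior during training.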

Key Points

  • PyVision-RL framework for reinforcement learning in multimodal models
  • Oversampling-filtering-ranking rollout strategy and accumulative tool reward
  • Application to image and video understanding with improved efficiency

Merits

Improved Efficiency

The PyVision-RL framework demonstrates improved efficiency in video reasoning by selectively sampling task-relevant frames during reasoning, reducing visual token usage.
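The abstract describes on-demand context construction only at a high level. A minimal sketch of the idea, under the assumption that the model produces some per-frame relevance signal during reasoning: instead of feeding uniformly sampled frames, select only the frames scored most relevant to the task, within a token budget. The `relevance` mapping and `budget` parameter here are hypothetical stand-ins.

```python
# Illustrative sketch of on-demand frame selection for video reasoning.
# `relevance` (frame index -> score) is a hypothetical stand-in for
# whatever task-relevance signal the model produces mid-reasoning.
def on_demand_frames(num_frames, relevance, budget):
    """Return up to `budget` frame indices, in temporal order,
    chosen by descending task relevance rather than uniform sampling."""
    ranked = sorted(range(num_frames),
                    key=lambda i: relevance.get(i, 0.0),
                    reverse=True)
    return sorted(ranked[:budget])


# 10-frame clip, budget of 2 frames: only the two most relevant survive.
print(on_demand_frames(10, {0: 0.1, 3: 0.2, 5: 0.9, 9: 0.8}, budget=2))  # [5, 9]
```

Because the visual context scales with `budget` rather than with clip length, this is where the reduction in visual token usage would come from.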

Sustained Interaction

The framework promotes sustained interaction and multi-turn tool use, preventing interaction collapse and encouraging more effective agentic behavior.
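The paper pairs its rollout strategy with an "accumulative tool reward," which the abstract does not define precisely. One simple reading, sketched below with invented parameters: a small bonus accrues with each tool call and is added to the task reward, optionally capped so the model cannot farm tool calls for reward. The `bonus` and `cap` values are assumptions for illustration.

```python
# Hypothetical sketch of an accumulative tool reward.
# `bonus` and `cap` are invented illustrative values, not from the paper.
def accumulative_tool_reward(task_reward, tool_calls, bonus=0.05, cap=0.2):
    """Add a per-tool-call bonus (capped) to the task reward, so that
    multi-turn tool use is rewarded instead of collapsing to zero calls."""
    return task_reward + min(bonus * tool_calls, cap)


print(accumulative_tool_reward(1.0, 3))   # 1.15 (three calls, under the cap)
print(accumulative_tool_reward(1.0, 10))  # 1.2  (bonus saturates at the cap)
```

A cap (or any saturating shape) matters in such a scheme: without it, the reward would incentivize gratuitous tool calls rather than genuinely useful multi-turn interaction.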

Demerits

Complexity

The PyVision-RL framework may introduce additional complexity in the training pipeline, potentially requiring more computational resources and expertise.

Expert Commentary

The PyVision-RL framework represents a significant advancement in multimodal learning, addressing a critical challenge in reinforcement learning. By promoting sustained interaction and on-demand visual processing, PyVision-RL has the potential to enable more efficient and effective multimodal models. However, the complexity of the framework may require careful consideration and optimization to ensure scalability and applicability in real-world scenarios. Further research is needed to explore the broader implications and applications of PyVision-RL, particularly in areas like human-computer interaction and autonomous systems.

Recommendations

  • Further research on optimizing the PyVision-RL framework for scalability and efficiency
  • Exploration of applications in real-world scenarios, such as robotics and human-computer interaction