Controllable Exploration in Hybrid-Policy RLVR for Multi-Modal Reasoning
arXiv:2602.20197v1 Announce Type: new Abstract: Reinforcement Learning with verifiable rewards (RLVR) has emerged as a primary learning paradigm for enhancing the reasoning capabilities of multi-modal …
Zhuoxu Huang, Mengxi Jia, Hao Sun, Xuelong Li, Jungong Han
6 views