DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
arXiv:2603.01106v1 Announce Type: new Abstract: Reinforcement learning (RL) with group relative policy optimization (GRPO) has become a widely adopted approach for enhancing the reasoning capabilities of multimodal large language models (MLLMs). While GRPO enables long-chain reasoning without a critic, it often suffers from sparse rewards on difficult problems and advantage vanishing when group-level rewards are too consistent for overly easy or hard problems. Existing solutions (sample expansion, selective utilization, and indirect reward design) often fail to maintain enough variance in within-group reward distributions to yield clear optimization signals. To address this, we propose DIVA-GRPO, a difficulty-adaptive variant advantage method that adjusts variant difficulty distributions from a global perspective. DIVA-GRPO dynamically assesses problem difficulty, samples variants with appropriate difficulty levels, and calculates advantages across local and global groups using difficulty-weighted and normalized scaling. This alleviates reward sparsity and advantage vanishing while improving training stability. Extensive experiments on six reasoning benchmarks demonstrate that DIVA-GRPO outperforms existing approaches in training efficiency and reasoning performance. Code: https://github.com/Siaaaaaa1/DIVA-GRPO
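The advantage-vanishing problem the abstract describes follows directly from how GRPO normalizes rewards within a group. The sketch below is a minimal illustration of that failure mode, not the paper's implementation: when all rollouts in a group receive the same reward (a problem that is too easy or too hard), the group mean equals every reward and all advantages collapse to zero, leaving no gradient signal.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage: (r - group mean) / (group std + eps).

    This is the standard critic-free GRPO normalization; eps guards
    against division by zero when the group rewards are identical.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Mixed outcomes within a group give informative, non-zero advantages.
mixed = grpo_advantages([1, 1, 0, 0])

# Uniform rewards (overly easy or overly hard problem) make every
# advantage zero: the "advantage vanishing" the abstract refers to.
vanished = grpo_advantages([0, 0, 0, 0])
```

DIVA-GRPO's variant sampling aims precisely at keeping groups out of the uniform-reward regime shown in the second call.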
Executive Summary
This article proposes DIVA-GRPO, a novel variant advantage method that enhances the reasoning capabilities of multimodal large language models (MLLMs). By dynamically assessing problem difficulty and adjusting variant difficulty distributions, DIVA-GRPO addresses the limitations of existing group relative policy optimization (GRPO) methods, which often suffer from sparse rewards and advantage vanishing. The proposed approach improves training stability and outperforms existing methods on six reasoning benchmarks. While promising, the article leaves DIVA-GRPO's generalizability and scalability to more complex tasks underexplored. The code release and experimental results demonstrate the approach's potential, but future work should address adapting DIVA-GRPO to diverse domains and applications.
Key Points
- ▸ DIVA-GRPO is a difficulty-adaptive variant advantage method for multimodal LLMs
- ▸ The approach dynamically assesses problem difficulty and adjusts variant difficulty distributions
- ▸ DIVA-GRPO improves training stability and outperforms existing GRPO methods on six reasoning benchmarks
Merits
Strength in addressing reward sparsity and advantage vanishing
DIVA-GRPO effectively alleviates reward sparsity and advantage vanishing by adjusting variant difficulty distributions, leading to improved training stability and performance.
Improvements in training efficiency and reasoning performance
Experimental results demonstrate that DIVA-GRPO outperforms existing GRPO methods in training efficiency and reasoning performance on six reasoning benchmarks.
Demerits
Limited exploration of generalizability and scalability
The article primarily focuses on DIVA-GRPO's performance on six reasoning benchmarks and does not extensively explore its generalizability and scalability to more complex tasks and diverse domains.
Dependence on problem difficulty assessment
The effectiveness of DIVA-GRPO relies on accurate problem difficulty assessment, which may be challenging to achieve in complex or dynamic environments.
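To make this dependence concrete, a common proxy for problem difficulty in group-based RL is the empirical pass rate over a group of rollouts. The sketch below is a hypothetical illustration of that idea (function names and thresholds are our own, not from the paper): it also flags the uniform-reward groups for which variant re-sampling would be needed.

```python
def estimate_difficulty(num_correct, group_size):
    """Difficulty proxy: 1 - empirical pass rate over the rollout group.

    0.0 means every rollout succeeded (easy); 1.0 means none did (hard).
    Noisy for small groups, which is one source of the assessment risk
    noted above.
    """
    return 1.0 - num_correct / group_size

def needs_variant(num_correct, group_size):
    """Uniform-reward groups (all correct or all wrong) yield vanishing
    advantages, so they are the candidates for sampling easier or
    harder problem variants."""
    return num_correct == 0 or num_correct == group_size
```

A misestimated pass rate, e.g. from a small or unrepresentative group, would steer variant sampling toward the wrong difficulty band, which is the failure mode this demerit points at.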
Expert Commentary
The article presents a promising approach to addressing the limitations of existing GRPO methods, and DIVA-GRPO demonstrates strong performance on six reasoning benchmarks. However, further study of the approach's generalizability and scalability is necessary to fully understand its potential. The work also underscores the need for continued research in multimodal reasoning for large language models, particularly around variant advantage methods. The code release and experimental results provide a solid foundation for future work, and the findings are relevant to anyone training reasoning-focused MLLMs with reinforcement learning.
Recommendations
- ✓ Future research should focus on adapting DIVA-GRPO to diverse domains and applications, including more complex tasks and dynamic environments.
- ✓ The development of variant advantage methods, such as DIVA-GRPO, should be continued and expanded to address the limitations of existing GRPO approaches.