RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

arXiv:2602.21628v1 Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a prevailing paradigm for enhancing reasoning in Multimodal Large Language Models (MLLMs). However, relying solely on outcome supervision risks reward hacking, where models learn spurious reasoning patterns to satisfy final answer checks. While recent rubric-based approaches offer fine-grained supervision signals, they suffer from high computational costs of instance-level generation and inefficient training dynamics caused by treating all rubrics as equally learnable. In this paper, we propose Stratified Rubric-based Curriculum Learning (RuCL), a novel framework that reformulates curriculum learning by shifting the focus from data selection to reward design. RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence. By dynamically adjusting rubric weights during training, RuCL guides the model from mastering foundational perception to tackling advanced logical reasoning. Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.

Executive Summary

This article introduces Stratified Rubric-Based Curriculum Learning (RuCL), a framework that enhances reasoning in Multimodal Large Language Models (MLLMs) by reformulating curriculum learning around reward design rather than data selection. RuCL generates generalized rubrics for broad applicability, stratifies them according to the model's competence, and dynamically adjusts rubric weights during training. Experiments on visual reasoning benchmarks show a +7.83% average improvement over the Qwen2.5-VL-7B model, reaching a state-of-the-art accuracy of 60.06%. These results suggest RuCL can mitigate the reward hacking that arises when MLLMs are trained with outcome supervision alone.

Key Points

  • RuCL reformulates curriculum learning through reward design to enhance reasoning in MLLMs
  • RuCL generates generalized rubrics for broad applicability and stratifies them based on the model's competence
  • RuCL yields a +7.83% average improvement over the Qwen2.5-VL-7B model on visual reasoning benchmarks, with a state-of-the-art accuracy of 60.06%
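The core mechanism, as described in the abstract, is a curriculum expressed through reward weights: rubrics are grouped into strata (foundational perception vs. advanced reasoning), and the weight on each stratum shifts as training progresses. The sketch below is an illustrative reconstruction of that idea; the function names, strata labels, and linear schedule are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of RuCL-style dynamic rubric weighting.
# Strata names and the linear schedule are illustrative assumptions.

def rubric_weights(progress: float) -> dict[str, float]:
    """Shift reward weight from foundational to advanced rubric strata
    as training progresses (progress in [0, 1])."""
    assert 0.0 <= progress <= 1.0
    w_foundational = 1.0 - progress   # e.g. visual perception rubrics
    w_advanced = progress             # e.g. logical reasoning rubrics
    total = w_foundational + w_advanced
    return {"perception": w_foundational / total,
            "reasoning": w_advanced / total}

def stratified_reward(scores: dict[str, float], progress: float) -> float:
    """Weighted sum of per-stratum rubric scores under the current schedule."""
    weights = rubric_weights(progress)
    return sum(weights[s] * scores[s] for s in weights)
```

Early in training (`progress` near 0) the reward is dominated by perception rubrics, so the model is pushed to master grounding first; later the reasoning stratum dominates. The actual weight schedule in the paper is competence-based rather than a fixed linear ramp.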

Merits

Strength in Addressing Reward Hacking

RuCL tackles the issue of reward hacking by generating generalized rubrics and stratifying them based on the model's competence, guiding the model from foundational perception to advanced logical reasoning.
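Reward hacking arises when the only training signal is whether the final answer matches; blending that outcome check with fine-grained rubric scores removes the shortcut. The snippet below is a minimal sketch of such a blended reward, assuming a simple averaging of rubric grades and a mixing coefficient `alpha` that are not specified in the article.

```python
# Illustrative sketch (not the paper's code): blend a verifiable outcome
# reward with fine-grained rubric rewards, so a response cannot score
# highly by matching the final answer through spurious reasoning alone.

def combined_reward(answer_correct: bool,
                    rubric_scores: list[float],
                    alpha: float = 0.5) -> float:
    """Mix outcome and process supervision.

    alpha is an assumed mixing coefficient; rubric_scores are per-rubric
    grades in [0, 1] assessing the quality of the reasoning trace."""
    outcome = 1.0 if answer_correct else 0.0
    process = (sum(rubric_scores) / len(rubric_scores)
               if rubric_scores else 0.0)
    return alpha * outcome + (1.0 - alpha) * process
```

Under this scheme a correct answer reached via a poorly graded reasoning trace earns only a partial reward, which is the failure mode outcome-only RLVR cannot distinguish.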

Improved Accuracy on Visual Reasoning Benchmarks

RuCL achieves a state-of-the-art accuracy of 60.06% on visual reasoning benchmarks, demonstrating its potential to overcome the limitations of outcome supervision and reward hacking in MLLMs.

Demerits

Residual Computational Costs of Rubric Generation

Although RuCL's generalized rubrics are designed to avoid the instance-level generation costs that burden prior rubric-based approaches, generating, stratifying, and dynamically reweighting rubrics still adds overhead, which may hinder practical application at the scale of large MLLMs.

Expert Commentary

RuCL represents a meaningful advance in MLLM training, addressing two known weaknesses of outcome-only supervision: reward hacking and the treatment of all rubrics as equally learnable. The stratified, competence-based curriculum lets the model progress from foundational perception to advanced logical reasoning, and the reported state-of-the-art accuracy on visual reasoning benchmarks supports the approach. The shift from data selection to reward design may also generalize to other RLVR settings. That said, the computational overhead of rubric generation and stratification remains a concern and warrants further investigation before RuCL can be considered practical at scale.

Recommendations

  • Further research is needed to quantify the computational overhead of rubric generation and stratification, and to develop strategies that mitigate it.
  • The development of more advanced and adaptable MLLMs, enabled by RuCL, should be prioritized to ensure their practical application in various domains.