To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models
arXiv:2602.12566v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). RLVR can achieve expert-level performance in specific domains such as coding or math. When a general, multi-domain expert-level model is required, the collaboration of RLVR across different domains must be considered carefully. Current state-of-the-art models mainly employ two training paradigms for multi-domain RLVR: mixed multi-task RLVR, and separate RLVR followed by model merging. However, most prior work does not provide a detailed comparison and analysis of these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, and instruction following) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find that RLVR across domains exhibits little mutual interference, and that reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of these mutual gains from the perspectives of weight-space geometry, model prediction behavior, and information constraints. The project is named M2RL (Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning), and the homepage is at https://github.com/mosAI25/M2RL
Executive Summary
The article 'To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models' explores the effectiveness of two training paradigms for multi-domain Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Models (LLMs). The study compares mixed multi-task RLVR and separate RLVR followed by model merging, using high-level tasks such as math, coding, science, and instruction following. The research finds minimal mutual interference across domains and synergistic effects in reasoning-intensive domains, analyzing these findings through weight space geometry, model prediction behavior, and information constraints. The project, named M2RL, aims to provide insights into optimizing multi-domain training strategies for LLMs.
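To make the second paradigm concrete, the sketch below illustrates "separate RLVR followed by model merging" in its simplest form: each domain expert is trained independently, and the resulting parameter sets are combined by (optionally weighted) averaging. The parameter names and values are purely illustrative and not taken from the paper; production merging operates over full model state dicts and often uses more sophisticated schemes than uniform averaging.

```python
def merge_experts(experts, weights=None):
    """Average a list of parameter dicts into one merged model.

    experts: list of dicts mapping parameter name -> value.
    weights: optional per-expert coefficients (defaults to uniform).
    """
    if weights is None:
        weights = [1.0 / len(experts)] * len(experts)
    merged = {}
    for name in experts[0]:
        merged[name] = sum(w * e[name] for w, e in zip(weights, experts))
    return merged


# Toy "experts" standing in for separately RLVR-trained domain models.
math_expert = {"layer.w": 1.2, "layer.b": -0.4}
code_expert = {"layer.w": 0.8, "layer.b": 0.0}

merged = merge_experts([math_expert, code_expert])
print(round(merged["layer.w"], 6), round(merged["layer.b"], 6))
```

The mixed multi-task paradigm, by contrast, needs no merge step at all: a single model is trained once on the union of all domains' data, trading merge-time flexibility for a shared optimization trajectory.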
Key Points
- ▸ Comparison of mixed multi-task RLVR and separate RLVR followed by model merging.
- ▸ Minimal mutual interference across domains in multi-domain RLVR.
- ▸ Synergistic effects observed in reasoning-intensive domains.
- ▸ Analysis of internal mechanisms through weight space geometry, model prediction behavior, and information constraints.
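One way to probe the weight-space-geometry perspective mentioned above is to represent each domain's RLVR update as a "task vector" (fine-tuned weights minus base weights) and measure the cosine similarity between domains: near-orthogonal task vectors are a common explanation for low cross-domain interference under merging. The sketch below is a minimal illustration with made-up numbers, not the paper's actual analysis.

```python
import math


def task_vector(base, expert):
    """Difference between fine-tuned and base parameters."""
    return [e - b for b, e in zip(base, expert)]


def cosine(u, v):
    """Cosine similarity between two flat parameter vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy base model and two domain-specialized checkpoints (illustrative only).
base = [0.0, 0.0, 0.0, 0.0]
math_ft = [0.5, 0.0, 0.1, 0.0]
code_ft = [0.0, 0.4, 0.0, 0.2]

tv_math = task_vector(base, math_ft)
tv_code = task_vector(base, code_ft)

# Each domain's update touches disjoint coordinates here, so the task
# vectors are orthogonal (cosine similarity 0): merging them would not
# overwrite either domain's changes.
print(round(cosine(tv_math, tv_code), 3))
```

In this toy case the two updates live in disjoint coordinates, so their cosine similarity is exactly zero; real task vectors are only approximately orthogonal, which is consistent with the "few interferences, some synergy" finding.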
Merits
Comprehensive Comparison
The study provides a detailed and rigorous comparison of two prevalent training paradigms in multi-domain RLVR, addressing a gap in the current literature.
Extensive Experimental Design
The research employs a wide range of high-level tasks and open-source datasets, ensuring the robustness and generalizability of the findings.
Mechanistic Insights
The analysis of mutual gains from multiple perspectives offers valuable insights into the underlying mechanisms of multi-domain RLVR.
Demerits
Limited Scope of Domains
The study focuses on a specific set of high-level tasks, which may not fully represent the diversity of potential domains in LLMs.
Potential Bias in Datasets
The use of open-source datasets may introduce biases that could affect the generalizability of the results.
Complexity of Analysis
The detailed analysis of internal mechanisms, while insightful, may be complex and challenging to replicate or build upon.
Expert Commentary
The article presents a well-structured and thorough investigation into the comparative effectiveness of mixed multi-task RLVR and separate RLVR followed by model merging. The study's rigorous experimental design and comprehensive analysis provide valuable insights into the dynamics of multi-domain training in LLMs. The findings of minimal mutual interference and synergistic effects in reasoning-intensive domains are particularly noteworthy, as they challenge some conventional assumptions about multi-task learning. The mechanistic analysis, while complex, offers a deeper understanding of the underlying processes, which is crucial for advancing the field. However, the study's focus on a specific set of high-level tasks and the potential biases in open-source datasets warrant caution in generalizing the results. Overall, this research makes a significant contribution to the literature on multi-domain RLVR and provides practical guidance for both researchers and practitioners in the field of AI.
Recommendations
- ✓ Future research should expand the scope of domains to include a more diverse range of tasks, ensuring the generalizability of the findings.
- ✓ Investigation into the potential biases in open-source datasets and the development of methods to mitigate these biases would enhance the robustness of the results.