UpSkill: Mutual Information Skill Learning for Structured Response Diversity in LLMs

arXiv:2602.22296v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of large language models (LLMs) on mathematics and programming tasks, but standard approaches that optimize single-attempt accuracy can inadvertently suppress response diversity across repeated attempts, narrowing exploration and overlooking underrepresented strategies. We introduce UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to LLMs for optimizing pass@k correctness. We propose a novel reward that we implement within Group Relative Policy Optimization (GRPO): a token-level mutual information (MI) reward that encourages trajectory specificity to the skill variable z. Experiments on GSM8K with three open-weight models, Llama 3.1-8B, Qwen 2.5-7B, and R1-Distilled-Qwen2.5-Math-1.5B, show that UpSkill improves multi-attempt metrics on the stronger base models, yielding mean gains of ~3% in pass@k for both Qwen and Llama without degrading pass@1. Additionally, we find both empirical and theoretical evidence that improvements in pass@k are closely tied to the mutual information objective.

Executive Summary

This article presents UpSkill, a training-time method that adapts Mutual Information Skill Learning (MISL) to large language models (LLMs) to optimize pass@k correctness. The authors propose a novel token-level mutual information reward, implemented within Group Relative Policy Optimization (GRPO), that encourages each sampled trajectory to be specific to its skill variable, improving multi-attempt metrics on the stronger base models. On GSM8K, UpSkill yields mean gains of ~3% in pass@k for both Qwen 2.5-7B and Llama 3.1-8B without degrading pass@1. The work underscores the importance of response diversity in LLMs, and the authors provide both empirical and theoretical evidence tying the pass@k improvements to the mutual information objective.

Key Points

  • UpSkill adapts Mutual Information Skill Learning (MISL) to LLMs for optimizing pass@k correctness.
  • Proposes a novel token-level mutual information (MI) reward within Group Relative Policy Optimization (GRPO).
  • Improves multi-attempt metrics on stronger base models, yielding a mean gain of ~3% in pass@k for Qwen and Llama.
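For readers less familiar with the multi-attempt metric, pass@k is conventionally estimated with the unbiased combinatorial formula of Chen et al. (2021): generate n attempts per problem, count the c correct ones, and compute the probability that a random size-k subset contains at least one success. The paper's exact evaluation code is not reproduced here; this is a minimal sketch of the standard estimator:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    probability that at least one of k attempts, drawn without
    replacement from n samples of which c are correct, succeeds."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every
        # size-k subset must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this estimator, pass@1 reduces to plain accuracy (c / n), while pass@k for k > 1 rewards a model whose attempts are diverse enough that at least one succeeds.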

Merits

Strength in Response Diversity

The authors effectively address the issue of response diversity in LLMs, which is a significant limitation of standard approaches that optimize single-attempt accuracy.

Methodological Contributions

The novel reward mechanism proposed within GRPO is a significant methodological contribution, demonstrating the potential for MISL in LLMs.
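The paper's exact reward formulation is not reproduced in this summary; the sketch below only illustrates the general shape of a MISL/DIAYN-style variational MI bonus combined with GRPO's group-relative normalization. The function names, the weighting coefficient `beta`, and the per-trajectory (rather than per-token) granularity are illustrative assumptions, not the authors' implementation:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each sampled response's
    reward is normalized by the mean and std of its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

def mi_bonus(logq_z_given_traj, logp_z):
    """MISL/DIAYN-style variational MI reward:
    log q(z | trajectory) - log p(z). Positive when the trajectory
    makes its skill code z easy for a discriminator to identify,
    i.e. when responses under different z are distinguishable."""
    return logq_z_given_traj - logp_z

def combined_rewards(task_rewards, logq_values, logp_z, beta=0.1):
    """Hypothetical combination: verifiable task reward plus a
    weighted MI bonus, then group-normalized for the policy update."""
    return grpo_advantages([r + beta * mi_bonus(lq, logp_z)
                            for r, lq in zip(task_rewards, logq_values)])
```

The design intuition is that the MI bonus pushes trajectories conditioned on different skill codes apart, while group normalization keeps the update scale-invariant, so diversity is rewarded without abandoning the verifiable task signal.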

Demerits

Scalability Concerns

The computational requirements of implementing MISL in LLMs may be substantial, potentially limiting its scalability to more complex models.

Evaluation Metrics

The authors rely on a narrow set of evaluation metrics (pass@k, pass@1), which may not capture the full range of benefits offered by UpSkill.

Expert Commentary

The authors' work makes a compelling case for mutual information objectives in LLMs, highlighting their potential to improve both response diversity and pass@k correctness. The methodological contributions are significant, particularly the novel token-level MI reward implemented within GRPO. While scalability remains a concern, the gains observed on the stronger base models suggest a promising path forward. As the field evolves, it is essential to prioritize research that addresses the limitations of existing approaches, such as the suppression of response diversity under single-attempt optimization. This work is a useful step in that direction, and its implications are likely to extend to other settings where multiple attempts are feasible, such as mathematical reasoning and program synthesis.

Recommendations

  • Future research should prioritize the development of more efficient and scalable methods for implementing MISL in LLMs.
  • The evaluation of UpSkill on a broader range of LLMs and applications is necessary to fully understand its potential benefits and limitations.
