
Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution

Jacob Dineen, Aswin RRV, Zhikun Xu, Ben Zhou

arXiv:2604.03472v1 Announce Type: new Abstract: Co-evolutionary self-play, where one language model generates problems and another solves them, promises autonomous curriculum learning without human supervision. In practice, the proposer quickly converges to a narrow distribution of problems that satisfy the reward function. This diversity collapse renders the curriculum uninformative for the solver, stalling the co-evolutionary loop. We introduce vocabulary dropout, a random mask applied to the proposer's output logits during both policy training and curriculum generation, as a lightweight mechanism to sustain diversity. The mask is hard and non-stationary, preventing the proposer from locking into fixed token sequences. Training Qwen3-4B and Qwen3-8B on mathematical reasoning via R-Zero, we find that vocabulary dropout sustains proposer diversity across lexical, semantic, and functional metrics throughout training, and yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. Our findings suggest that explicit action-space constraints, analogous to the structural role that game rules play in classical self-play, can help sustain productive co-evolution in language. Vocabulary dropout is one simple instantiation of this principle.

Executive Summary

The paper addresses a critical challenge in co-evolutionary self-play for large language models (LLMs): the proposer converges to a narrow distribution of problems, collapsing curriculum diversity and stalling solver improvement. The authors propose vocabulary dropout, a random hard mask applied to the proposer's output logits during both policy training and curriculum generation, to sustain diversity across lexical, semantic, and functional dimensions. Applied to mathematical reasoning with Qwen3 models via R-Zero, the method yields solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks. The work underscores the role of explicit action-space constraints in autonomous curriculum learning, drawing an analogy to game rules in classical self-play, and offers a lightweight, scalable mechanism for sustaining productive co-evolution.

Key Points

  • Co-evolutionary self-play in LLMs often suffers from 'diversity collapse,' where proposers generate repetitive problems, stalling solver improvement.
  • Vocabulary dropout introduces random, hard, and non-stationary masking of proposer output logits, preventing fixation on narrow token sequences and sustaining diversity.
  • Empirical validation on Qwen3-4B and Qwen3-8B (trained on mathematical reasoning via R-Zero) demonstrates sustained proposer diversity and solver improvements averaging +4.4 points at 8B, with the largest gains on competition-level benchmarks.
  • The method is lightweight and does not require architectural modifications, making it accessible for widespread adoption.
  • The work highlights the structural role of action-space constraints in autonomous curriculum learning, analogous to game rules in classical self-play.
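The abstract specifies only that the mask over the proposer's logits is hard (dropped tokens cannot be sampled at all) and non-stationary (resampled rather than fixed). The paper's exact implementation is not reproduced here; the sketch below illustrates the idea under those two properties, with all names (`sample_vocab_mask`, `p_drop`) and the per-problem resampling schedule being illustrative assumptions:

```python
import numpy as np

def sample_vocab_mask(vocab_size: int, p_drop: float, rng: np.random.Generator) -> np.ndarray:
    """Hard binary mask over the vocabulary: True = token kept.
    Each token id is dropped independently with probability p_drop."""
    return rng.random(vocab_size) >= p_drop

def apply_vocab_dropout(logits: np.ndarray, keep: np.ndarray) -> np.ndarray:
    """Set logits of dropped tokens to -inf so they receive zero
    sampling probability after the softmax."""
    masked = logits.copy()
    masked[..., ~keep] = -np.inf
    return masked

# Toy example: a 32-token vocabulary with a 30% dropout rate.
rng = np.random.default_rng(0)
logits = rng.normal(size=32)
keep = sample_vocab_mask(32, p_drop=0.3, rng=rng)
masked = apply_vocab_dropout(logits, keep)

# Softmax over masked logits: dropped tokens get exactly zero probability.
probs = np.exp(masked - masked[keep].max())
probs /= probs.sum()
```

Because a fresh mask would be drawn for each generated problem (the non-stationarity the abstract emphasizes), the proposer cannot lock onto a fixed token sequence that always survives the mask, which is the mechanism the authors credit for sustaining diversity.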

Merits

Novelty and Theoretical Rigor

The introduction of vocabulary dropout as an explicit action-space constraint to sustain diversity in co-evolutionary self-play is a novel contribution. The authors draw a compelling analogy to game rules in classical self-play, grounding their approach in established theoretical frameworks.

Empirical Robustness and Scalability

The method demonstrates consistent performance improvements across model sizes (4B and 8B parameters), suggesting the technique scales. The +4.4-point average improvement at 8B, concentrated on competition-level benchmarks, underscores its practical utility.

Simplicity and Accessibility

Vocabulary dropout is a lightweight, architecture-agnostic technique that requires minimal computational overhead or modification to existing pipelines. This simplicity enhances its appeal for immediate adoption in diverse LLM training regimes.

Demerits

Narrow Task Focus

The empirical validation is confined to mathematical reasoning via R-Zero, which limits the generalizability of the findings. The observed diversity and performance gains may not translate seamlessly to other domains, such as coding, creative writing, or scientific discovery.

Hyperparameter Sensitivity

The effectiveness of vocabulary dropout likely depends on the masking parameters (e.g., the dropout rate and the mask-resampling interval). The paper does not extensively explore robustness to these hyperparameters, leaving open questions about optimal configurations.

Lack of Comparative Baselines

While the method shows clear improvements over the implied baseline (standard co-evolutionary self-play without masking), the paper does not compare vocabulary dropout against alternative diversity-enhancing techniques, such as reward shaping, human-designed curricula, or other exploration strategies.

Expert Commentary

The authors present a compelling solution to a fundamental challenge in co-evolutionary self-play for LLMs. The introduction of vocabulary dropout is both elegant and theoretically grounded, drawing a direct analogy to the structural constraints imposed by game rules in classical self-play. This analogy elevates the work beyond a mere technical contribution, framing it within a broader discourse on structured constraints in AI alignment and governance. The empirical results are robust and suggest that the technique scales with model size, which is critical for real-world adoption. However, the narrow focus on mathematical reasoning tasks leaves open questions about generalizability. Future work should explore the applicability of vocabulary dropout to other domains and compare it against alternative diversity-enhancing techniques. The methodology for evaluating diversity is particularly noteworthy, as it provides a framework for quantifying a critical yet often overlooked aspect of LLM training. Overall, this work represents a significant step forward in autonomous curriculum learning and warrants further investigation.

Recommendations

  • Expand empirical validation beyond mathematical reasoning to domains such as coding, creative writing, and scientific discovery to assess generalizability.
  • Conduct hyperparameter sensitivity analyses to determine optimal configurations for vocabulary dropout, including dropout rates, non-stationarity intervals, and masking strategies.
  • Develop comparative benchmarks against alternative diversity-enhancing techniques, such as reward-shaping, curriculum learning with human-designed curricula, or other exploration strategies, to contextualize the performance gains.
  • Explore the integration of vocabulary dropout with other co-evolutionary techniques, such as adversarial self-play or multi-agent systems, to evaluate its efficacy in more complex training regimes.
  • Investigate the theoretical foundations of vocabulary dropout, including its relationship to exploration strategies in reinforcement learning and its potential to mitigate reward hacking or mode collapse in LLM training.

Sources

Original: arXiv - cs.CL