
PRISM: Demystifying Retention and Interaction in Mid-Training


Bharat Runwal, Ashish Agrawal, Anurag Roy, Rameswar Panda

arXiv:2603.17074v1 Announce Type: new Abstract: We present PRISM, a comprehensive empirical study of mid-training design choices for large language models. Through controlled experiments across seven base models spanning four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters, we show that mid-training on approximately 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code, and +6 to +13 points on science benchmarks while preserving general performance. The full PRISM to RL pipeline improves macro-average across six reasoning benchmarks from under 12 to 29-42 (a 3-4x improvement), whereas RL applied directly to most of the base models remains substantially less effective, with AIME scores near zero. Data composition matters most at mid-training, not RL: including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix produces less than 2 point differences. Mechanistically, mid-training densely restructures over 90% of model weights, while RL makes sparse, front-loaded refinements to approximately 5% of parameters. Representation analysis (CKA) confirms that RL consistently preserves mid-training's representational geometry (over 0.998 CKA) across architectures. Crucially, RL applies identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. Our results demonstrate that retention-aware mid-training is highly effective for reliable reasoning enhancement and provide practical guidance for designing robust mid-training pipelines.
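The abstract's representation analysis relies on CKA to show that RL preserves mid-training's representational geometry (over 0.998 similarity). As a rough sketch of what that metric measures, here is the standard linear CKA formulation applied to two activation matrices; the paper's exact protocol (which layers, which inputs) is not specified here, so this is illustrative only:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (samples, features)."""
    # Center each feature column before comparing representations.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # HSIC-based similarity: ||Y^T X||_F^2 normalized by the self-similarities.
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return hsic / (norm_x * norm_y)

# A representation compared with itself scores exactly 1, and the score is
# unchanged under any orthogonal rotation of the feature axes.
rng = np.random.default_rng(0)
acts = rng.standard_normal((64, 16))
q, _ = np.linalg.qr(rng.standard_normal((16, 16)))  # random orthogonal matrix
print(round(linear_cka(acts, acts), 6))      # 1.0
print(round(linear_cka(acts, acts @ q), 6))  # 1.0
```

The invariance to orthogonal transformations is what makes CKA a natural choice for asking whether RL rotated or genuinely reshaped the geometry laid down by mid-training.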

Executive Summary

The article presents PRISM, a controlled empirical study of mid-training design choices for large language models, spanning seven base models from four families at scales of 3B to 24B parameters. Mid-training on roughly 27B high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 on code, and +6 to +13 on science benchmarks while preserving general performance, and the full PRISM-to-RL pipeline lifts the macro-average across six reasoning benchmarks from under 12 to 29-42. The study emphasizes that data composition matters most at the mid-training stage rather than during RL, and that retention-aware mid-training is what makes subsequent RL effective. The results offer practical guidance for designing robust mid-training pipelines for reasoning-focused language models.

Key Points

  • PRISM is a controlled empirical study of mid-training design choices spanning seven base models, four families (Granite, LLaMA, Mistral, Nemotron-H), two architecture types, and scales from 3B to 24B parameters.
  • Mid-training on roughly 27B high-quality tokens yields gains of +15 to +40 points on math, +5 to +12 on code, and +6 to +13 on science benchmarks while preserving general performance.
  • Data composition matters most at mid-training, not reinforcement learning (RL): including science data during mid-training unlocks +17 to +28 point GPQA-Diamond gains during RL, while changing the RL mix shifts results by under 2 points.
  • Mechanistically, mid-training densely restructures over 90% of model weights, whereas RL makes sparse, front-loaded refinements to roughly 5% of parameters while preserving mid-training's representational geometry (CKA above 0.998).
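The dense-versus-sparse contrast reported in the abstract (mid-training restructures over 90% of weights, RL touches about 5%) suggests a simple checkpoint diagnostic. The sketch below is a hypothetical helper, not from the paper: it measures what fraction of parameters moved by more than a tolerance between two checkpoints, modeled here as plain dicts of NumPy arrays:

```python
import numpy as np

def changed_fraction(before, after, tol=1e-6):
    """Fraction of parameters that moved by more than `tol` between checkpoints."""
    changed, total = 0, 0
    for name, w_before in before.items():
        delta = np.abs(after[name] - w_before)
        changed += int((delta > tol).sum())
        total += delta.size
    return changed / total

# Toy checkpoints: a "dense" update shifts every weight, a "sparse" one very few.
rng = np.random.default_rng(1)
base = {"layer0": rng.standard_normal((8, 8)), "layer1": rng.standard_normal((8,))}
dense = {k: v + 0.01 for k, v in base.items()}   # every parameter shifts
sparse = {k: v.copy() for k, v in base.items()}
sparse["layer0"][0, :4] += 0.01                  # only 4 of 72 parameters shift
print(changed_fraction(base, dense))   # 1.0
print(changed_fraction(base, sparse))  # ~0.056
```

A dense mid-training-style update scores near 1.0, while a sparse RL-style refinement scores near the touched fraction; the paper's actual measurement methodology may differ.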

Merits

Strength

The study takes a systematic, controlled-experiment approach to mid-training design choices, a significant contribution to the field of large language models. Covering seven base models from four families, two architecture types (dense Transformer and attention-Mamba hybrid), and scales from 3B to 24B parameters strengthens the validity and generalizability of the findings.

Strength

The study identifies data composition as the decisive factor at mid-training: including science data at that stage unlocks large downstream RL gains on GPQA-Diamond, whereas changing the RL data mix barely moves results.

Strength

The study distills its findings into practical guidance for designing robust, retention-aware mid-training pipelines, directly usable by practitioners building reasoning-capable language models.

Demerits

Limitation

The study focuses primarily on large language models and may not be directly applicable to other types of machine learning models. Further research is needed to investigate the generalizability of the findings to other domains.

Limitation

The study relies on a limited number of benchmarks, which may not represent the full range of possible applications and use cases for large language models.

Limitation

The study does not investigate the potential risks and challenges associated with mid-training pipelines, such as the potential for overfitting or the need for significant computational resources.

Expert Commentary

The study's systematic, controlled-experiment design across multiple base models and architectures is a significant contribution, lending its conclusions validity and generalizability. Its central mechanistic finding is striking: RL applies nearly identical weight changes regardless of starting point, yet only succeeds on mid-trained models, consistent with mid-training placing the model in a configuration from which RL can effectively improve performance. That said, the focus on large language models and a fixed benchmark suite means further research is needed before the findings can be assumed to transfer to other domains. The emphasis on data composition during mid-training, with RL serving mainly to refine an already well-positioned model, points to mid-training as the primary lever for reliable reasoning enhancement.

Recommendations

  • Further research is needed to investigate the generalizability of the findings to other domains and to explore the potential risks and challenges associated with mid-training pipelines.
  • Researchers should investigate the use of mid-training pipelines in other applications, such as computer vision and speech recognition, to better understand the potential benefits and limitations of this approach.
  • Developers should consider the importance of data composition during mid-training when designing and deploying large language models, and should explore the use of RL as a means of refining mid-trained models to improve performance.
