Do Post-Training Algorithms Actually Differ? A Controlled Study Across Model Scales Uncovers Scale-Dependent Ranking Inversions
arXiv:2603.19335v1 Announce Type: new Abstract: Post-training alignment has produced dozens of competing algorithms -- DPO, SimPO, KTO, GRPO, and others -- yet practitioners lack controlled comparisons to guide algorithm selection. We present OXRL, a unified framework implementing 51 post-training algorithms with identical infrastructure, enabling the first large-scale apples-to-apples evaluation. Our study spans 8 algorithms across 4 model scales (0.5B--7B), 3 evaluation domains, and a 20-variant DPO taxonomy (100 runs at 1.5B, 5 seeds each), totaling $\sim$240 training runs on H100 GPUs. Three headline findings emerge. (1)~Algorithm rankings are unstable across scale: at 1.5B, online RL (SGRPO) tops all methods at 58.0\%~$\pm$0.57 on GSM8K; by 7B, the worst small-scale method (SimPO) becomes the best (85.8\%), a complete ranking inversion driven by model scale rather than LoRA regularization (confirmed via 2$\times$2 factorial). (2)~Loss function modifications yield negligible gains: none of 20 DPO variants significantly outperform vanilla DPO after Bonferroni correction; the sole significant outlier, SimPO, is worse ($-$11.5~pp, $p < 10^{-4}$). (3)~Algorithm leverage is task-specific: the 19.3~pp GSM8K spread collapses to 0.54~pp on MATH ($36\times$) and 0.47~pp on general-domain benchmarks ($41\times$), confirming that algorithm choice matters primarily within the training distribution. These findings yield a hierarchy of leverage for practitioners: model scale (${\sim}$50~pp) $\gg$ training paradigm (${\sim}$10~pp) $\gg$ online vs.\ offline (${\sim}$9~pp) $\gg$ loss function (${\sim}$1~pp). We release all code, configs, and evaluation data as a living community benchmark.
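Finding (2) turns on the contrast between DPO and SimPO. As a minimal single-example sketch of where the two losses differ (the β, γ, lengths, and log-probability values below are illustrative assumptions, not the paper's settings): DPO anchors its implicit reward to a reference model's log-probabilities, while SimPO is reference-free, length-normalizes the policy log-probability, and subtracts a target margin γ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO: implicit reward is the policy-vs-reference log-ratio,
    # computed for the preferred (w) and rejected (l) responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(sigmoid(margin))

def simpo_loss(logp_w, logp_l, len_w, len_l, beta=2.0, gamma=0.5):
    # SimPO: reference-free, length-normalized reward with a target margin.
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -math.log(sigmoid(margin))

# Illustrative sequence log-probabilities and lengths.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
print(simpo_loss(-10.0, -12.0, 20, 24))
```

Both losses reward a wider preferred-minus-rejected gap, but SimPO's length normalization and missing reference anchor change the optimization geometry, which is one plausible reason its relative standing moves so sharply with model scale.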
Executive Summary
This article presents OXRL, a unified framework implementing 51 post-training algorithms on identical infrastructure, together with a controlled study of 8 of them across four model scales (0.5B–7B), three evaluation domains, and a 20-variant DPO taxonomy. The results show that algorithm rankings are unstable across scale and that model scale dominates performance, far exceeding the influence of loss-function modifications or the online/offline distinction. The findings yield a hierarchy of leverage for practitioners: model scale first, then training paradigm, then online vs. offline training, and loss function last. The study's methodology provides valuable guidance for selecting post-training algorithms, and the release of all code, configs, and evaluation data as a living community benchmark is a significant contribution to the field.
Key Points
- Algorithm rankings invert with scale: online RL (SGRPO) leads at 1.5B (58.0% on GSM8K), while SimPO, the worst small-scale method, becomes the best at 7B (85.8%).
- Model scale dominates performance (~50 pp of leverage), far exceeding loss-function modifications (~1 pp) or the online/offline choice (~9 pp).
- Algorithm leverage is task-specific: the 19.3 pp GSM8K spread collapses to under 1 pp on MATH and on general-domain benchmarks, so algorithm choice matters primarily within the training distribution.
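The ranking-inversion claim can be quantified as a rank correlation between scales. Below is a minimal pure-Python sketch using Kendall's tau; note that only SGRPO's 58.0 at 1.5B and SimPO's 85.8 at 7B come from the abstract, and the other scores are placeholder values chosen to illustrate a complete inversion.

```python
from itertools import combinations

def kendall_tau(scores_a, scores_b):
    """Kendall rank correlation over the methods shared by two score dicts.
    +1 means identical ordering; -1 means a complete ranking inversion."""
    keys = list(scores_a)
    concordant = discordant = 0
    for i, j in combinations(keys, 2):
        s = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(keys) * (len(keys) - 1) / 2
    return (concordant - discordant) / pairs

# Illustrative GSM8K scores (percent); only 58.0 and 85.8 are from the abstract.
at_1p5b = {"SGRPO": 58.0, "DPO": 55.0, "SimPO": 46.5}
at_7b   = {"SGRPO": 80.0, "DPO": 83.0, "SimPO": 85.8}
print(kendall_tau(at_1p5b, at_7b))  # -1.0: every pairwise ordering flips
```

A tau of -1 across scales is exactly the "complete ranking inversion" the abstract describes; intermediate values would indicate partial reshuffling.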
Merits
Comprehensive Comparison
The OXRL framework implements 51 post-training algorithms with identical infrastructure, and the study evaluates 8 of them across four model scales, three evaluation domains, and ~240 training runs, enabling an apples-to-apples comparison of how different factors affect algorithm performance.
Methodological Rigor
The authors employ a rigorous methodology: a unified framework with identical infrastructure, 5 seeds per configuration, Bonferroni-corrected significance testing over the 20-variant DPO taxonomy, and a 2×2 factorial design to separate the effect of model scale from LoRA regularization.
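The Bonferroni step used to compare the 20 DPO variants against vanilla DPO can be sketched as follows. The p-values below are hypothetical placeholders, except that SimPO's clearing the threshold mirrors the abstract's reported p < 10⁻⁴.

```python
def bonferroni_survivors(p_values, alpha=0.05):
    """Return the comparisons whose p-value clears the corrected
    threshold alpha / m, where m is the number of comparisons."""
    threshold = alpha / len(p_values)
    return {name for name, p in p_values.items() if p < threshold}

# 20 variant-vs-vanilla comparisons; all p-values are illustrative
# except SimPO's, which the abstract reports as p < 1e-4.
p_values = {f"variant_{i}": 0.10 for i in range(19)}
p_values["SimPO"] = 1e-5
print(bonferroni_survivors(p_values))  # {'SimPO'}: the sole significant outlier
```

With 20 comparisons at alpha = 0.05, each individual test must reach p < 0.0025, which is why only one variant (and that one in the negative direction) survives correction.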
Living Community Benchmark
The release of all code, configs, and evaluation data as a living community benchmark is a significant contribution to the field, enabling future researchers to build upon the study's findings and further develop post-training algorithms.
Demerits
Limited Generalizability
The study fully evaluates only 8 algorithms at scales up to 7B, so the results may not generalize to other post-training families, larger frontier-scale models, or domains outside the three evaluated.
Computational Resource Intensity
The evaluation, totaling ~240 training runs on H100 GPUs, is expensive to reproduce, which limits replicability for researchers with modest compute budgets.
Expert Commentary
This study makes a significant contribution to the field of post-training alignment by showing that algorithm rankings are not stable properties of the methods themselves but depend strongly on model scale. The practical upshot for practitioners is clear: effort spent on loss-function variants yields little (~1 pp) relative to scaling the base model (~50 pp), and gains from algorithm choice concentrate within the training distribution. However, the study's focus on a specific set of algorithms and scales up to 7B may limit generalizability, and the compute cost of the evaluation constrains replication by resource-limited groups.
Recommendations
- Future researchers should investigate why post-training algorithms behave differently across model scales, developing explanations that can predict an algorithm's performance from scale and training paradigm rather than requiring exhaustive sweeps.
- Researchers should study the transferability of post-training gains across domains and applications, given the finding that algorithm choice matters primarily within the training distribution.
Sources
Original: arXiv - cs.LG