Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs
arXiv:2604.06298v1
Abstract: Recent alignment work on Large Language Models (LLMs) suggests preference optimization can improve reasoning by shifting probability mass toward better solutions. We test this claim in a resource-constrained setting by applying GRPO with LoRA to Small Language Models (SLMs, up to 3B parameters) for math reasoning on the GSM8K and MATH datasets, with difficulty-stratified analyses. As problem difficulty increases, accuracy plateaus, revealing a capacity boundary: GRPO primarily reshapes output preferences without reliably improving hardest-tier solving. Consistent with this, training GRPO only on lower-difficulty problems matches full-dataset accuracy across difficulty tiers while using only ~45% of the training steps, indicating diminishing returns from harder samples in this regime. We also find a cross-dataset generalization effect: GSM8K-trained GRPO achieves higher accuracy on the numeric subset of MATH than MATH-trained GRPO, exceeding it by ~5% at 1.5B and ~3% at 3B. We show that the best achievable gains depend strongly on the base model's prior reasoning competence and the dataset's difficulty profile.
Executive Summary
This article, "Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs," critically examines the efficacy of preference optimization, specifically GRPO with LoRA, in enhancing the reasoning capabilities of Small Language Models (SLMs, up to 3B parameters) on math reasoning tasks (the GSM8K and MATH datasets). The authors find that while preference optimization can improve performance on easier problems, accuracy plateaus as problem difficulty increases, suggesting a fundamental capacity boundary for SLMs. Training GRPO solely on lower-difficulty problems achieves accuracy comparable to full-dataset training while using only about 45% of the training steps. The study also uncovers a cross-dataset generalization effect: GSM8K-trained GRPO outperforms MATH-trained GRPO on the numeric subset of MATH, by roughly 5% at 1.5B and 3% at 3B. Ultimately, the work highlights that the effectiveness of preference optimization is highly dependent on the base model's inherent competence and the dataset's difficulty distribution.
Key Points
- ▸ GRPO with LoRA on SLMs (up to 3B) for math reasoning shows diminishing returns from hard samples.
- ▸ Accuracy plateaus on harder problems, indicating a capacity boundary for SLMs, where GRPO primarily reshapes output preferences rather than fundamentally improving solving ability.
- ▸ Training GRPO only on lower-difficulty problems yields accuracy comparable to full-dataset training while using only ~45% of the training steps.
- ▸ A notable cross-dataset generalization effect exists: GSM8K-trained GRPO outperforms MATH-trained GRPO on the numeric subset of MATH, by ~5% at 1.5B and ~3% at 3B.
- ▸ Achievable gains from preference optimization are strongly contingent on the base model's prior reasoning competence and the dataset's difficulty profile.
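The "reshaping output preferences" claim is easier to see against GRPO's core update rule: rewards for a group of completions sampled for the same problem are normalized by the group's mean and standard deviation, so the gradient pushes probability mass toward the relatively better samples rather than teaching new skills. The sketch below shows only this standard group-relative advantage computation, under the usual GRPO formulation; it is not the authors' training code, and the clipped policy-gradient loss and KL penalty of full GRPO are omitted.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each completion's reward
    against the mean and standard deviation of its sampling group.
    The full GRPO objective (clipped ratio, KL penalty) is omitted."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]

# Four completions for one problem, binary correctness rewards:
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # → [1.0, -1.0, -1.0, 1.0]
```

Note that when every completion in a group gets the same reward (e.g. a hardest-tier problem the model never solves), the advantages are all zero, so the update carries no signal, one intuition for why hard samples can contribute little in this regime.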
Merits
Rigorous Difficulty Stratification
The use of difficulty-stratified analyses provides granular insights into the performance limitations of SLMs under preference optimization, moving beyond aggregate metrics.
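Mechanically, a difficulty-stratified analysis amounts to grouping solve outcomes by tier before averaging, which is what exposes a plateau that an aggregate accuracy number would hide. A minimal sketch, with invented tier labels and outcomes (not the paper's data):

```python
from collections import defaultdict

def accuracy_by_tier(results):
    """Aggregate solve accuracy per difficulty tier.
    `results` is a list of (tier, solved) pairs; tiers are
    illustrative labels, not the paper's actual stratification."""
    totals, correct = defaultdict(int), defaultdict(int)
    for tier, solved in results:
        totals[tier] += 1
        correct[tier] += int(solved)
    return {t: correct[t] / totals[t] for t in sorted(totals)}

print(accuracy_by_tier([(1, True), (1, True), (2, True), (2, False), (3, False)]))
# → {1: 1.0, 2: 0.5, 3: 0.0}
```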
Computational Efficiency Insight
Identifying that training on easier samples can achieve comparable results with fewer steps is a significant contribution to resource optimization in SLM alignment.
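In practice this efficiency result reduces to a simple curation step: filter the training pool to the lower difficulty tiers before running GRPO. The sketch below assumes a hypothetical per-problem `difficulty` field (1 = easiest); the paper's actual tier scheme and cutoff may differ.

```python
def easy_subset(problems, max_tier=2):
    """Keep only lower-difficulty problems for GRPO training.
    `difficulty` is a hypothetical integer tier label (1 = easiest);
    the paper's actual stratification scheme may differ."""
    return [p for p in problems if p["difficulty"] <= max_tier]

pool = [
    {"id": "a", "difficulty": 1},
    {"id": "b", "difficulty": 3},
    {"id": "c", "difficulty": 2},
]
print([p["id"] for p in easy_subset(pool)])  # → ['a', 'c']
```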
Novel Cross-Dataset Generalization
The discovery of GSM8K-trained GRPO outperforming MATH-trained GRPO on specific subsets is an intriguing finding that warrants further investigation into dataset characteristics and transfer learning.
Clear Capacity Boundary Delineation
The article effectively demonstrates a 'capacity boundary' for SLMs, emphasizing that preference optimization isn't a panacea for inherent model limitations in complex reasoning.
Demerits
Limited Scope of SLMs
The findings are restricted to SLMs (up to 3B parameters). While valuable, their generalizability to larger LLMs, where capacity boundaries may sit higher or behave differently, is not directly addressed.
Specific Task Domain
The focus solely on math reasoning (GSM8K, MATH) might limit the universal applicability of 'diminishing returns' for hard samples across other complex reasoning domains (e.g., legal, medical, scientific).
Mechanism of 'Reshaping Preferences' Unexplored
While stating GRPO 'reshapes output preferences,' the article doesn't delve deeply into the precise mechanisms or cognitive shifts (if any) occurring within the model, beyond probability mass redistribution.
Absence of Comparison to Other Alignment Methods
The study focuses exclusively on GRPO. A comparative analysis with other preference optimization techniques or reinforcement learning from human feedback (RLHF) variants could provide broader context.
Expert Commentary
This article offers a timely and incisive contribution to the burgeoning literature on LLM alignment, particularly for the often-overlooked small language models. The central finding that 'hard samples yield diminishing returns' for GRPO-tuned SLMs is not merely an empirical observation but a profound insight into the fundamental architectural and parametric limitations of these models. It suggests that preference optimization, while effective at refining output distributions, cannot fundamentally imbue a model with reasoning capabilities it inherently lacks. The distinction between 'reshaping preferences' and 'improving solving' is critical and hints at the need for architectural innovations or much larger scale to truly tackle extreme difficulty. The efficiency gains from training on easier samples are a practical triumph, offering a blueprint for sustainable AI development in resource-constrained settings. This work compels a re-evaluation of data curation strategies and sets a realistic expectation for SLM performance, moving beyond the 'bigger is always better' paradigm.
Recommendations
- ✓ Future research should investigate the generalizability of these findings to other complex reasoning domains (e.g., scientific discovery, legal analysis) and different types of alignment algorithms beyond GRPO.
- ✓ Explore the 'why' behind the cross-dataset generalization effect, delving into the specific features or reasoning patterns learned from GSM8K that transfer effectively to MATH's numeric subset.
- ✓ Conduct ablation studies to dissect the precise contribution of LoRA in this GRPO-tuned SLM setup, and compare against full fine-tuning or other parameter-efficient methods.
- ✓ Investigate whether hybrid alignment strategies (e.g., combining GRPO with knowledge distillation or symbolic reasoning modules) could overcome the identified capacity boundary for harder problems in SLMs.
Sources
Original: arXiv - cs.LG