Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs
arXiv:2604.06298v1 Announce Type: new Abstract: Recent alignment work on Large Language Models (LLMs) suggests preference optimization can improve reasoning by shifting probability mass toward better …
Suraj Yadav, Siddharth Yadav, Parth Goyal
19 views