Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR
arXiv:2602.12642v1 (Announce Type: new)
Abstract: Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this …
Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung