Stabilizing Reinforcement Learning for Diffusion Language Models
arXiv:2603.06743v1 (announce type: new)
Abstract: Group Relative Policy Optimization (GRPO) is highly effective for post-training autoregressive (AR) language models, yet its direct application to diffusion …
Jianyuan Zhong, Kaibo Wang, Ding Ding, Zijin Feng, Haoli Bai, Yang Xiang, Jiacheng Sun, Qiang Xu