SiMPO: Measure Matching for Online Diffusion Reinforcement Learning
arXiv:2603.10250v1
Abstract: A commonly used family of RL algorithms for diffusion policies conducts softmax reweighting over the behavior policy, which usually induces …
Haitong Ma, Chenxiao Gao, Tianyi Chen, Na Li, Bo Dai