Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL
arXiv:2603.19470v1 Announce Type: new Abstract: Off-policy problems, such as policy staleness and training-inference mismatch, have become a major bottleneck for training stability and further exploration …
Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang