Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
arXiv:2602.21346v1 Announce Type: cross
Abstract: Recent advances in alignment techniques such as Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), and Direct Preference Optimization (DPO) …
Mengxuan Hu, Vivek V. Datla, Anoop Kumar, Zihan Guan, Sheng Li, Alfy Samuel, Daben Liu
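
The abstract is truncated, so the paper's alignment-weighting scheme is not shown here. For context, below is a minimal PyTorch sketch of the *standard* DPO objective (Rafailov et al., 2023) that the title builds on; the `weights` argument is a hypothetical hook illustrating where a per-example weight could enter, and is not the authors' actual method.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, weights=None):
    """Standard DPO loss; not the paper's alignment-weighted variant.

    Each *_logps tensor holds per-example sequence log-probabilities
    log pi(y|x), summed over response tokens, with shape (batch,).
    `weights` is a hypothetical per-example weight (batch,) showing one
    place an alignment-based weighting could plug in.
    """
    # Log-ratios of policy vs. reference model for chosen/rejected responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Implicit reward margin, scaled by the temperature beta.
    logits = beta * (chosen_logratios - rejected_logratios)

    # Per-example negative log-sigmoid of the margin.
    losses = -F.logsigmoid(logits)

    if weights is not None:
        losses = losses * weights  # hypothetical alignment weights

    return losses.mean()
```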