Learning Ordinal Probabilistic Reward from Preferences
arXiv:2602.12660v1 Announce Type: new

Abstract: Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative …
Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo, Jiaming Li, Jing Luo, Qiyao Wang, Min Yang