Reinforcement Learning from Multi-Source Imperfect Preferences: Best-of-Both-Regimes Regret
arXiv:2603.20453v1 Announce Type: new Abstract: Reinforcement learning from human feedback (RLHF) replaces hard-to-specify rewards with pairwise trajectory preferences, yet regret-oriented theory often assumes that preference …
Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami
6 views