Discovering Implicit Large Language Model Alignment Objectives
arXiv:2602.15338v1 Announce Type: cross

Abstract: Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical …
Edward Chen, Sanmi Koyejo, Carlos Guestrin