When Do LLM Preferences Predict Downstream Behavior?
arXiv:2602.18971v1 Announce Type: new Abstract: Preference-driven behavior in LLMs may be a necessary precondition for AI misalignment such as sandbagging: models cannot strategically pursue misaligned …
Katarina Slama, Alexandra Souly, Dishank Bansal, Henry Davidson, Christopher Summerfield, Lennart Luettgau
3 views