Pressure, What Pressure? Sycophancy Disentanglement in Language Models via Reward Decomposition
arXiv:2604.05279v1
Abstract: Large language models exhibit sycophancy, the tendency to shift their stated positions toward perceived user preferences or authority cues regardless …
Muhammad Ahmed Mohsin, Ahsan Bilal, Muhammad Umer, Emily Fox