Three Concrete Challenges and Two Hopes for the Safety of Unsupervised Elicitation
arXiv:2602.20400v1
Abstract: To steer language models towards truthful outputs on tasks which are beyond human capability, previous work has suggested training models …
Callum Canavan, Aditya Shrivastava, Allison Qi, Jonathan Michala, Fabien Roger