Training Agents to Self-Report Misbehavior
arXiv:2602.22303v1
Abstract: Frontier AI agents may pursue hidden goals while concealing their pursuit from oversight. Alignment training aims to prevent such behavior …
Bruce W. Lee, Chen Yueh-Han, Tomek Korbak