The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
arXiv:2602.15515v1
Abstract: Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks …
Mohammad Taufeeque, Stefan Heimersheim, Adam Gleave, Chris Cundy