Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails
arXiv:2603.18280v1 Announce Type: new Abstract: Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer …
Gregory N. Frank
5 views