Safe Transformer: An Explicit Safety Bit For Interpretable And Controllable Alignment
arXiv:2603.06727v1 Announce Type: new Abstract: Current safety alignment methods encode safe behavior implicitly within model parameters, creating a fundamental opacity: we cannot easily inspect why …
Jingyuan Feng, Andrew Gambardella, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
8 views