Sparse Autoencoders are Capable LLM Jailbreak Mitigators
arXiv:2602.12418v1 Announce Type: cross Abstract: Jailbreak attacks remain a persistent threat to large language model safety. We propose Context-Conditioned Delta Steering (CC-Delta), an SAE-based defense …
Yannick Assogba, Jacopo Cortellazzi, Javier Abad, Pau Rodriguez, Xavier Suau, Arno Blaas
16 views