Improving Robustness In Sparse Autoencoders via Masked Regularization
arXiv:2604.06495v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone …
Vivek Narayanaswamy, Kowshik Thopalli, Bhavya Kailkhura, Wesam Sakla
18 views