Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes
arXiv:2603.04426v1
Abstract: Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines, while matching the Non-SAE-based. Our results demonstrate that crosscoders remain a powerful tool for model diffing.
Executive Summary
The article introduces Delta-Crosscoder, a crosscoder-based approach to model diffing that identifies how fine-tuning changes a model's internal representations. It combines BatchTopK sparsity with a delta-based loss and an implicit contrastive signal from paired activations on matched inputs. Across 10 model organisms (Gemma, LLaMA, and Qwen models with 1B-9B parameters), it isolates latent directions causally responsible for fine-tuned behaviors and enables mitigation of those behaviors, outperforming SAE-based baselines and matching non-SAE-based methods. The study argues that crosscoders remain a useful tool for model diffing, including in narrow fine-tuning regimes where earlier formulations struggle.
Key Points
- ▸ Delta-Crosscoder combines BatchTopK sparsity with a delta-based loss to identify changes in model representations
- ▸ The approach prioritizes directions that change between the base and fine-tuned models and incorporates an implicit contrastive signal from paired activations on matched inputs
- ▸ Across 10 model organisms, Delta-Crosscoder outperforms SAE-based baselines and matches non-SAE-based methods
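The two ingredients named in the key points can be sketched in a few lines. The abstract does not give the exact objective, so the code below is only an illustration under stated assumptions: `batch_topk` keeps the `k * batch_size` largest latent activations across the whole batch (the usual reading of BatchTopK), and `delta_crosscoder_loss` is a hypothetical loss that adds a weighted penalty on how well the reconstructed difference between models matches the true activation difference on paired inputs. The function names, the `delta_weight` parameter, and the form of the delta term are assumptions, not the paper's definitions.

```python
from typing import List

Matrix = List[List[float]]


def batch_topk(latents: Matrix, k: int) -> Matrix:
    """Keep the k * batch_size largest positive activations in the batch; zero the rest."""
    budget = k * len(latents)
    # Rank every (value, row, col) entry across the whole batch, largest first.
    ranked = sorted(
        ((value, row, col)
         for row, acts in enumerate(latents)
         for col, value in enumerate(acts)),
        reverse=True,
    )
    sparse = [[0.0] * len(acts) for acts in latents]
    for value, row, col in ranked[:budget]:
        if value > 0.0:
            sparse[row][col] = value
    return sparse


def decode(latent_row: List[float], decoder: Matrix) -> List[float]:
    """Linear decoder: decoder has shape (n_latents, d_model)."""
    d_model = len(decoder[0])
    return [
        sum(latent_row[j] * decoder[j][i] for j in range(len(decoder)))
        for i in range(d_model)
    ]


def squared_error(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def delta_crosscoder_loss(x_base: Matrix, x_ft: Matrix, latents: Matrix,
                          dec_base: Matrix, dec_ft: Matrix,
                          delta_weight: float = 1.0) -> float:
    """Hypothetical loss: per-model reconstruction error plus a weighted delta term.

    x_base[i] and x_ft[i] are activations of the two models on the SAME input,
    which is where the paired, implicitly contrastive signal comes from.
    """
    total = 0.0
    for xb, xf, z in zip(x_base, x_ft, latents):
        rb, rf = decode(z, dec_base), decode(z, dec_ft)
        true_delta = [f - b for f, b in zip(xf, xb)]
        recon_delta = [f - b for f, b in zip(rf, rb)]
        total += squared_error(xb, rb) + squared_error(xf, rf)
        total += delta_weight * squared_error(true_delta, recon_delta)
    return total / len(x_base)
```

Note that `batch_topk` spends its budget per batch rather than per sample, so one input can keep several active latents while another keeps none. Raising `delta_weight` makes errors in the direction of change between the two models cost more than errors shared by both.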
Merits
Robustness to Narrow Fine-Tuning
Delta-Crosscoder handles narrow fine-tuning regimes, where behavioral changes are localized and asymmetric and where existing crosscoder formulations struggle.
Demerits
Limited Applicability to Other Domains
The evaluation targets narrow fine-tuning of language models (Gemma, LLaMA, Qwen; 1B-9B parameters), so it remains unclear how well Delta-Crosscoder generalizes to other fine-tuning regimes, model scales, or modalities.
Expert Commentary
Delta-Crosscoder is a meaningful step forward for model diffing in narrow fine-tuning regimes. Its ability to isolate latent directions causally responsible for fine-tuned behaviors matters for building more transparent and explainable AI models. Further research is needed to establish how far the method extends beyond the 10 model organisms evaluated. The findings also support continued investment in explainability and transparency research, and the development of regulatory frameworks that prioritize these values.
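The claim that an isolated latent direction can be used for mitigation can be made concrete with a small sketch. The source does not describe the paper's mitigation procedure, so the code below assumes plain directional ablation: the component of an activation along a chosen direction is projected out. The names `unit` and `ablate_direction` are hypothetical helpers for illustration, and the direction would in practice be a decoder direction of a latent flagged by the crosscoder.

```python
import math
from typing import List


def unit(vector: List[float]) -> List[float]:
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def ablate_direction(activation: List[float], direction: List[float]) -> List[float]:
    """Remove the component of `activation` that lies along `direction`.

    Assumption: mitigation is modeled as projecting out one direction;
    the paper may use a different intervention.
    """
    d = unit(direction)
    coefficient = sum(a * b for a, b in zip(activation, d))
    return [a - coefficient * b for a, b in zip(activation, d)]


# The activation [3, 4] loses its component along the second axis.
cleaned = ablate_direction([3.0, 4.0], [0.0, 2.0])
```

After ablation the activation is orthogonal to the chosen direction, so any downstream behavior that depends only on that direction no longer receives a signal from it. Whether this is enough to remove a fine-tuned behavior is an empirical question that the sketch does not settle.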
Recommendations
- ✓ Future studies should investigate whether Delta-Crosscoder applies beyond the language models evaluated here, for example to computer vision models, and to fine-tuning regimes other than narrow fine-tuning
- ✓ Researchers and practitioners should consider integrating Delta-Crosscoder into their model diffing workflows to improve the transparency and explainability of AI models