Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes

arXiv:2603.04426v1. Abstract: Model diffing methods aim to identify how fine-tuning changes a model's internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines, while matching the Non-SAE-based. Our

Aly Kassem, Thomas Jiralerspong, Negar Rostamzadeh, Golnoosh Farnadi · March 7, 2026

#cs.LG #cs.AI

Executive Summary

The article introduces Delta-Crosscoder, a crosscoder-based approach to model diffing that identifies how fine-tuning changes a model's internal representations. It combines BatchTopK sparsity with a delta-based loss and, according to the abstract, outperforms SAE-based baselines while matching non-SAE-based methods at isolating the latent directions responsible for fine-tuned behaviors. The approach is evaluated across 10 model organisms and is reported to enable effective mitigation of those behaviors. The study targets narrow fine-tuning regimes, where existing crosscoder formulations struggle.
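The abstract does not give the form of the delta-based loss, so the following is only a minimal sketch of one plausible reading: a standard two-model reconstruction term plus a penalty on how well the reconstructions preserve the difference between fine-tuned and base activations. The function names and the `delta_weight` parameter are hypothetical and are not taken from the paper.

```python
def squared_error(target, estimate):
    """Sum of squared differences between two equal-length vectors."""
    return sum((t - e) ** 2 for t, e in zip(target, estimate))


def delta_weighted_loss(base_act, tuned_act, base_recon, tuned_recon,
                        delta_weight=1.0):
    """Reconstruction loss plus a hypothetical penalty on the activation delta.

    Assumption: this is an illustrative objective, not the paper's actual loss.
    """
    # Usual crosscoder term: reconstruct each model's activation.
    recon = (squared_error(base_act, base_recon)
             + squared_error(tuned_act, tuned_recon))
    # Delta term: compare the true change between models with the
    # change implied by the two reconstructions.
    true_delta = [t - b for t, b in zip(tuned_act, base_act)]
    recon_delta = [t - b for t, b in zip(tuned_recon, base_recon)]
    return recon + delta_weight * squared_error(true_delta, recon_delta)
```

Under this reading, a dictionary that reconstructs both models well but ignores the small difference between them is penalized again through the delta term, which is the intuition behind prioritizing directions that change.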

Key Points

▸ Delta-Crosscoder combines BatchTopK sparsity with a delta-based loss to identify changes in model representations
▸ The approach prioritizes directions that change between models and incorporates an implicit contrastive signal
▸ Delta-Crosscoder outperforms SAE-based baselines and matches non-SAE-based methods across 10 model organisms
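BatchTopK sparsity, named in the first key point, can be illustrated with a short sketch. The idea is that latent activations compete across the whole batch rather than within each sample: the largest k × batch_size values are kept and the rest are zeroed. The pure-Python function below is an illustration under that assumption and is not the authors' implementation; the name `batch_topk` and the list-of-lists layout are chosen here for clarity.

```python
def batch_topk(activations, k):
    """Keep the k * batch_size largest activations across the batch.

    `activations` is a list of rows (one row of latent activations per
    sample). All other entries are set to zero.
    """
    batch_size = len(activations)
    budget = k * batch_size
    # Flatten to (value, row, col) so entries compete across the whole batch.
    flat = [(value, r, c)
            for r, row in enumerate(activations)
            for c, value in enumerate(row)]
    flat.sort(key=lambda item: item[0], reverse=True)
    sparse = [[0.0] * len(row) for row in activations]
    for value, r, c in flat[:budget]:
        if value > 0:  # never keep non-positive activations
            sparse[r][c] = value
    return sparse


batch = [[0.9, 0.1, 0.8],
         [0.2, 0.05, 0.3]]
print(batch_topk(batch, k=1))  # [[0.9, 0.0, 0.8], [0.0, 0.0, 0.0]]
```

With k=1 and two samples the budget is two active latents, and both go to the first sample. The number of active latents per sample can therefore vary while the batch average stays at k.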

Merits

Robustness to Narrow Fine-Tuning

Delta-Crosscoder's ability to handle narrow fine-tuning regimes, where behavioral changes are localized and asymmetric, is a significant strength.

Demerits

Limited Applicability to Other Domains

The study's focus on model diffing in narrow fine-tuning regimes may limit the generalizability of Delta-Crosscoder to other domains or applications.

Expert Commentary

The introduction of Delta-Crosscoder marks a significant advancement in model diffing, particularly in the context of narrow fine-tuning regimes. The approach's ability to isolate latent directions causally responsible for fine-tuned behaviors has important implications for the development of more transparent and explainable AI models. However, further research is needed to fully explore the potential of Delta-Crosscoder and its applications in various domains. The study's findings also underscore the need for ongoing investment in explainability and transparency research, as well as the development of regulatory frameworks that prioritize these values.
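The summary does not say how mitigation is carried out once a causally relevant latent direction has been found. One common intervention in interpretability work is to project that direction out of an activation vector, sketched below as an assumption rather than as the paper's procedure. The function name `ablate_direction` is hypothetical.

```python
import math


def ablate_direction(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the activation onto the unit direction.
    coeff = sum(a * u for a, u in zip(activation, unit))
    # Subtract that component, leaving the orthogonal remainder.
    return [a - coeff * u for a, u in zip(activation, unit)]


print(ablate_direction([3.0, 4.0], [0.0, 2.0]))  # [3.0, 0.0]
```

If the fine-tuned behavior weakens when such a direction is removed and the rest of the model's behavior is largely unchanged, that supports the claim that the direction is causally involved.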

Recommendations

✓ Future studies should investigate the applicability of Delta-Crosscoder to other domains and applications beyond the language models evaluated here, such as computer vision
✓ Researchers and practitioners should consider integrating Delta-Crosscoder into their model diffing workflows to improve the transparency and explainability of AI models

Sources

arXiv - cs.LG

Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes


