SAE as a Crystal Ball: Interpretable Features Predict Cross-domain Transferability of LLMs without Training
arXiv:2603.02908v1 Announce Type: new
Abstract: In recent years, pre-trained large language models have achieved remarkable success across diverse tasks. Besides the pivotal role of self-supervised pre-training, their effectiveness in downstream applications also depends critically on the post-training process, which adapts models to task-specific data and objectives. However, this process inevitably introduces model shifts that can influence performance in different domains, and how such shifts transfer remains poorly understood. To open up the black box, we propose the SAE-based Transferability Score (STS), a new metric that leverages sparse autoencoders (SAEs) to forecast post-training transferability. Taking supervised fine-tuning as an example, STS identifies shifted dimensions in SAE representations and calculates their correlations with downstream domains, enabling reliable estimation of transferability *before* fine-tuning. Extensive experiments across multiple models and domains show that STS accurately predicts the transferability of supervised fine-tuning, achieving Pearson correlation coefficients above 0.7 with actual performance changes. Beyond this, we take an initial step toward extending STS to reinforcement learning. We believe that STS can serve as an interpretable tool for guiding post-training strategies in LLMs. Code is available at https://github.com/PKU-ML/STS.
Executive Summary
This article proposes the SAE-based Transferability Score (STS), a novel metric that uses sparse autoencoders to predict the post-training transferability of pre-trained large language models (LLMs) across downstream applications. By identifying shifted dimensions in SAE representations and correlating them with target domains, STS enables reliable estimation of transferability before fine-tuning. The authors validate STS through extensive experiments across multiple models and domains, reporting Pearson correlation coefficients above 0.7 with actual performance changes. This work has significant implications for optimizing post-training strategies in LLMs and sheds light on the relationship between model shifts and domain transferability. The authors' initial step toward extending STS to reinforcement learning, and its potential as an interpretable tool for guiding post-training strategies, make this research particularly noteworthy.
Key Points
- ▸ The SAE-based Transferability Score (STS) is proposed as a novel metric for predicting post-training transferability in LLMs.
- ▸ STS leverages sparse autoencoders to identify shifted dimensions and calculate correlations with downstream domains.
- ▸ Extensive experiments demonstrate the effectiveness of STS in predicting transferability across multiple models and domains.
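The core recipe sketched in the key points — find the SAE feature dimensions most shifted by post-training, then measure how those shifts align with a target domain's activation profile — can be illustrated with a minimal sketch. This is not the paper's exact method (see the linked repository for that); the inputs, the mean-activation shift measure, and the top-k selection are all simplifying assumptions made here for illustration.

```python
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def sts_score(base_means, shifted_means, domain_means, top_k=3):
    """Illustrative STS-style score (assumed inputs, not the paper's exact form).

    base_means:    per-SAE-feature mean activation of the base model
    shifted_means: per-feature mean activation under the post-training shift
    domain_means:  per-feature mean activation on target-domain data
    """
    # 1. Measure how much each SAE dimension shifted.
    shift = [abs(s - b) for b, s in zip(base_means, shifted_means)]
    # 2. Keep only the top-k most shifted dimensions.
    top = sorted(range(len(shift)), key=lambda i: shift[i])[-top_k:]
    # 3. Correlate shift magnitude with the domain's activation profile:
    #    a high score suggests post-training moved features the domain relies on.
    return pearson([shift[i] for i in top], [domain_means[i] for i in top])
```

Under this reading, a strongly positive score predicts that fine-tuning will help the target domain, while a score near zero or negative predicts weak or adverse transfer — which is what the paper's reported Pearson correlations with actual performance changes would validate.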
Merits
Strength
The proposed STS metric offers a novel and interpretable approach to understanding model transferability, which is crucial for optimizing post-training strategies in LLMs.
Demerits
Limitation
The authors focus primarily on supervised fine-tuning and take only an initial step toward extending STS to reinforcement learning; a more comprehensive evaluation of STS across training paradigms is warranted.
Expert Commentary
The article presents a timely and thought-provoking contribution to the field of LLMs and transfer learning. The proposed STS metric offers a promising approach to understanding model transferability, which is essential for optimizing post-training strategies in diverse downstream applications. While the authors' focus on supervised fine-tuning is understandable, further evaluation of STS in various training paradigms, including reinforcement learning, is necessary to demonstrate its generalizability and robustness. Nevertheless, the authors' ambition to develop an interpretable tool for guiding post-training strategies in LLMs is laudable and warrants further investigation. The implications of this research are significant, and we can expect to see more attention devoted to model interpretability and transfer learning in the coming years.
Recommendations
- ✓ Future research should aim to evaluate STS in various training paradigms, including reinforcement learning, to demonstrate its generalizability and robustness.
- ✓ The authors should explore the potential applications of STS in real-world scenarios, such as optimizing post-training strategies in LLMs for natural language processing tasks or computer vision applications.