Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
arXiv:2602.23197v1 Announce Type: new Abstract: Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.
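To make the setting concrete, here is a minimal NumPy sketch of how a single linear attention head can perform in-context learning on a linear regression task. The identity key/query parameterization and the label-carrying value path are illustrative assumptions, not the paper's exact construction; under them, the head's output amounts to one gradient-descent step on the in-context least-squares loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 10_000

# In-context demonstrations (x_i, y_i) drawn from one linear task y = w^T x.
w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w

# Linear attention scores each demonstration against the query with an
# unnormalized inner product (no softmax); the value path carries the labels.
# With identity key/query matrices (an illustrative choice), the output is
# x_q^T (X^T y / n): one gradient step on the in-context squared loss from zero.
x_q = rng.normal(size=d)
scores = X @ x_q            # <W_K x_i, W_Q x_q> with W_K = W_Q = I
pred = (y * scores).mean()  # value path averages label-weighted scores
```

As the number of demonstrations grows, the sample covariance of the inputs concentrates around the identity, so `pred` approaches the ground-truth response `w @ x_q`.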
Executive Summary
This article presents a theoretical analysis, carried out in linear attention models, of how fine-tuning large language models affects their in-context learning ability. The authors demonstrate that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. They also show that incorporating an auxiliary few-shot loss enhances in-context learning on the target task at the expense of in-context learning on tasks not seen during fine-tuning. The study empirically validates these theoretical results, offering new insight into how fine-tuning objectives reshape attention parameters.
Key Points
- ▸ Fine-tuning all attention parameters can degrade in-context learning
- ▸ Restricting updates to the value matrix preserves in-context learning while improving zero-shot performance
- ▸ Incorporating an auxiliary few-shot loss enhances in-context learning on the target task at the expense of in-context learning on tasks not seen during fine-tuning
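The value-only restriction in the second point can be sketched as follows: freeze the key/query path, which implements the in-context matching mechanism, and run gradient descent only on the value matrix against a zero-shot target. This is a toy NumPy illustration under assumed shapes and an assumed squared-error objective, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 32

# Frozen key/query path: pools the context into a feature vector h.
# Keeping these fixed leaves the in-context matching mechanism untouched.
W_K, W_Q = np.eye(d), np.eye(d)           # frozen throughout fine-tuning
ctx = rng.normal(size=(n, d))
x_q = rng.normal(size=d)
scores = (ctx @ W_K) @ (W_Q @ x_q)
h = (ctx * scores[:, None]).mean(axis=0)  # pooled input to the value path

# Fine-tune only the value matrix W_V on a zero-shot target t, by plain
# gradient descent on the squared error ||W_V h - t||^2.
W_V = np.eye(d)
t = rng.normal(size=d)
lr = 0.01
for _ in range(500):
    grad = 2.0 * np.outer(W_V @ h - t, h)  # d/dW_V ||W_V h - t||^2
    W_V -= lr * grad

pred = W_V @ h  # fits the zero-shot target; W_K, W_Q never change
```

Because the objective is quadratic in `W_V` alone, the updates converge to a value matrix that fits the zero-shot target exactly, while the frozen key/query matrices preserve how the model attends to in-context demonstrations.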
Merits
Strength in theoretical analysis
The study provides a comprehensive theoretical analysis of linear attention models, shedding new light on the relationship between fine-tuning objectives and attention parameters.
Empirical validation
The study empirically validates its theoretical results, providing strong evidence for its conclusions.
Demerits
Limited scope
The study focuses exclusively on linear attention models and may not generalize to other types of attention mechanisms.
Assumes idealized fine-tuning scenario
The study assumes an idealized fine-tuning scenario, which may not reflect real-world fine-tuning practices.
Expert Commentary
This study makes a valuable contribution to the literature on attention mechanisms in natural language processing. The theoretical analysis, backed by empirical validation, yields findings with clear implications for the design of attention-based models. However, the focus on linear attention and an idealized fine-tuning scenario may limit generalizability to real-world applications. Nevertheless, the results meaningfully advance our understanding of why fine-tuning can erode in-context learning, and they can inform practical choices about which parameters to update when fine-tuning large language models.
Recommendations
- ✓ Future studies should investigate the generalizability of the study's findings to other types of attention mechanisms.
- ✓ Researchers should explore the implications of the study's results for real-world fine-tuning practices.