Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
arXiv:2602.23197v1 Announce Type: new Abstract: Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We empirically validate our theoretical results.
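To make the setting concrete, here is a minimal NumPy sketch of how a single linear attention head can perform in-context learning on a linear regression task. The identity key/query parameterization and the label-carrying value path are illustrative assumptions, not the paper's exact construction; under them, the head's output amounts to one gradient-descent step on the in-context least-squares loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 10_000

# In-context demonstrations (x_i, y_i) drawn from one linear task y = w^T x.
w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w

# Linear attention scores each demonstration against the query with an
# unnormalized inner product (no softmax); the value path carries the labels.
# With identity key/query matrices (an illustrative choice), the output is
# x_q^T (X^T y / n): one gradient step on the in-context squared loss from zero.
x_q = rng.normal(size=d)
scores = X @ x_q            # <W_K x_i, W_Q x_q> with W_K = W_Q = I
pred = (y * scores).mean()  # value path averages label-weighted scores
```

As the number of demonstrations grows, the sample covariance of the inputs concentrates around the identity, so `pred` approaches the ground-truth response `w @ x_q`.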
Executive Summary
This article presents a theoretical analysis, carried out in linear attention models, of how fine-tuning large language models affects their in-context learning ability. The authors demonstrate that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. They also show that incorporating an auxiliary few-shot loss enhances in-context learning on the target task at the expense of in-context learning on tasks not seen during fine-tuning. The study empirically validates these theoretical results, offering new insight into how fine-tuning objectives reshape attention parameters.
Key Points
- ▸ Fine-tuning all attention parameters can degrade in-context learning
- ▸ Restricting updates to the value matrix preserves in-context learning while improving zero-shot performance
- ▸ Incorporating an auxiliary few-shot loss enhances in-context learning on the target task at the expense of in-context learning on tasks not seen during fine-tuning
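The value-only restriction in the second point can be sketched as follows: freeze the key/query path, which implements the in-context matching mechanism, and run gradient descent only on the value matrix against a zero-shot target. This is a toy NumPy illustration under assumed shapes and an assumed squared-error objective, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 32

# Frozen key/query path: pools the context into a feature vector h.
# Keeping these fixed leaves the in-context matching mechanism untouched.
W_K, W_Q = np.eye(d), np.eye(d)           # frozen throughout fine-tuning
ctx = rng.normal(size=(n, d))
x_q = rng.normal(size=d)
scores = (ctx @ W_K) @ (W_Q @ x_q)
h = (ctx * scores[:, None]).mean(axis=0)  # pooled input to the value path

# Fine-tune only the value matrix W_V on a zero-shot target t, by plain
# gradient descent on the squared error ||W_V h - t||^2.
W_V = np.eye(d)
t = rng.normal(size=d)
lr = 0.01
for _ in range(500):
    grad = 2.0 * np.outer(W_V @ h - t, h)  # d/dW_V ||W_V h - t||^2
    W_V -= lr * grad

pred = W_V @ h  # fits the zero-shot target; W_K, W_Q never change
```

Because the objective is quadratic in `W_V` alone, the updates converge to a value matrix that fits the zero-shot target exactly, while the frozen key/query matrices preserve how the model attends to in-context demonstrations.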
Merits
Strength in theoretical analysis
The study provides a comprehensive theoretical analysis of linear attention models, shedding new light on the relationship between fine-tuning objectives and attention parameters.
Empirical validation
The study empirically validates its theoretical results, providing strong evidence for its conclusions.
Demerits
Limited scope
The study focuses exclusively on linear attention models and may not generalize to other types of attention mechanisms.
Assumes idealized fine-tuning scenario
The study assumes an idealized fine-tuning scenario, which may not reflect real-world fine-tuning practices.
Expert Commentary
This study makes a valuable contribution to the literature on attention mechanisms in natural language processing. The theoretical analysis, backed by empirical validation, yields findings with clear implications for the design of attention-based models. However, the focus on linear attention and an idealized fine-tuning scenario may limit generalizability to real-world applications. Nevertheless, the results meaningfully advance our understanding of why fine-tuning can erode in-context learning, and they can inform practical choices about which parameters to update when fine-tuning large language models.
Recommendations
- ✓ Future studies should investigate the generalizability of the study's findings to other types of attention mechanisms.
- ✓ Researchers should explore the implications of the study's results for real-world fine-tuning practices.