
Improving Sparse Memory Finetuning


Satyam Goyal, Anirudh Kanchi, Garv Shah, Prakhar Gupta

arXiv:2604.05248v1

Abstract: Large Language Models (LLMs) are typically static after training, yet real-world applications require continual adaptation to new knowledge without degrading existing capabilities. Standard approaches to updating models, such as full finetuning or parameter-efficient methods (e.g., LoRA), face a fundamental problem: catastrophic forgetting. Because they modify shared dense representations, they cause interference across tasks. Sparse Memory Finetuning (SMF) offers a promising alternative by localizing updates to a small subset of parameters in explicit memory layers. In this work, we present an open-source pipeline to retrofit existing pretrained models (Qwen-2.5-0.5B) with sparse memory modules, enabling effective continual learning on consumer hardware. We extend prior work by introducing a theoretically grounded slot-selection mechanism based on Kullback-Leibler (KL) divergence, which prioritizes memory updates for tokens that are informationally "surprising" relative to a background distribution. Our experiments demonstrate that the retrofitted models acquire new factual knowledge with minimal forgetting of held-out capabilities, validating the sparse update hypothesis in a practical setting.

Executive Summary

The article presents an innovative approach to addressing catastrophic forgetting in Large Language Models (LLMs) through Sparse Memory Finetuning (SMF). By retrofitting existing models (e.g., Qwen-2.5-0.5B) with explicit sparse memory modules, the authors demonstrate a practical solution for continual learning on consumer hardware. The key innovation lies in a theoretically grounded slot-selection mechanism based on KL divergence, which prioritizes memory updates for tokens with high informational novelty relative to a background distribution. Experimental results indicate that the retrofitted models can acquire new factual knowledge while preserving held-out capabilities, validating the hypothesis that sparse updates mitigate interference across tasks. This work bridges the gap between theoretical continual learning principles and real-world deployment constraints.

Key Points

  • Sparse Memory Finetuning (SMF) introduces explicit memory layers to localize updates, reducing interference and catastrophic forgetting in LLMs.
  • A novel slot-selection mechanism grounded in KL divergence prioritizes memory updates for 'surprising' tokens, enhancing the model's adaptability to novel information.
  • The approach is validated experimentally on Qwen-2.5-0.5B, demonstrating effective continual learning on consumer hardware with minimal degradation of existing capabilities.
  • The open-source pipeline enables retrofitting existing pretrained models, making the method accessible and practical for broader adoption.

Merits

Theoretical Rigor

The work leverages KL divergence for slot selection, providing a mathematically sound foundation for prioritizing memory updates based on informational novelty.
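The abstract does not give the exact formulation, but the core idea can be sketched as follows: score each token position by the KL divergence between the model's predictive distribution at that position and a fixed background distribution, then update only the memory slots associated with the highest-scoring ("most surprising") positions. All names and the toy numbers below are illustrative, not taken from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the vocabulary."""
    return sum(pi * math.log(max(pi, eps) / max(qi, eps))
               for pi, qi in zip(p, q))

def select_surprising_positions(token_dists, background, k):
    """Rank token positions by KL(model || background) and return
    the indices of the k most 'surprising' positions."""
    scores = [kl_divergence(p, background) for p in token_dists]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy example: 4 token positions over a 3-word vocabulary.
background = [0.5, 0.3, 0.2]
token_dists = [
    [0.5, 0.3, 0.2],    # matches background -> zero surprise
    [0.1, 0.1, 0.8],    # far from background -> high surprise
    [0.4, 0.4, 0.2],
    [0.05, 0.05, 0.9],  # most surprising
]
print(select_surprising_positions(token_dists, background, k=2))  # -> [3, 1]
```

Positions whose predictive distribution matches the background contribute nothing new and are skipped, which is what concentrates updates on genuinely novel information.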

Practical Accessibility

The release of an open-source pipeline for retrofitting models democratizes the approach, making it feasible for researchers and practitioners to implement without specialized hardware.

Mitigation of Catastrophic Forgetting

By localizing updates to sparse memory layers, the method addresses the core challenge of interference in continual learning, preserving existing capabilities while acquiring new knowledge.
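The abstract does not describe the memory architecture itself, but the localization principle can be sketched with a hypothetical key-value memory: a query selects its k best-matching slots, and the gradient step touches only those slot values, leaving the remaining slots (and the dense backbone) untouched. The slot count, dimensions, and update rule below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical memory layer: N slots, each a key/value pair of dimension d.
N, d, k = 64, 8, 4
keys = rng.normal(size=(N, d))
values = rng.normal(size=(N, d))

def sparse_memory_update(query, target, lr=0.1):
    """Update only the k slots whose keys best match the query,
    leaving the other N - k slots untouched (no dense interference)."""
    scores = keys @ query                        # similarity of query to each slot key
    top = np.argsort(scores)[-k:]                # indices of the k best-matching slots
    values[top] += lr * (target - values[top])   # local update restricted to those slots
    return top

touched = sparse_memory_update(rng.normal(size=d), rng.normal(size=d))
print(len(touched))  # only k = 4 of the 64 slots were modified
```

Because each update writes to a handful of slots, knowledge stored elsewhere in the memory, and all shared dense weights, remain intact, which is exactly the mechanism by which SMF avoids interference.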

Scalability

The approach is validated on a small model (Qwen-2.5-0.5B), suggesting potential scalability to larger models with further optimization.

Demerits

Limited Model Scope

The experiments focus on a single model architecture (Qwen-2.5-0.5B), leaving open questions about generalization to other architectures (e.g., decoder-only vs. encoder-decoder) or larger models.

Computational Overhead

While more efficient than full finetuning, the addition of sparse memory layers introduces overhead in terms of memory usage and inference latency, which may impact real-time applications.

Dependence on Background Distribution

The effectiveness of the KL divergence-based slot selection relies heavily on the choice of the background distribution, which may not always be straightforward to define or may vary across tasks.

Evaluation Scope

The experiments emphasize factual knowledge acquisition and held-out capability preservation but do not extensively evaluate performance on complex reasoning tasks or multilingual settings.

Expert Commentary

The authors present a compelling case for sparse memory finetuning as a viable alternative to dense parameter updates in LLMs. The theoretical grounding of the slot-selection mechanism using KL divergence is particularly noteworthy, as it provides a principled way to identify and prioritize novel information for memory updates. This approach not only mitigates catastrophic forgetting but also offers a practical solution for continual learning on consumer hardware, a critical step toward making LLMs more adaptable in real-world settings.

However, the work raises several questions for further exploration. For instance, how does the method perform in settings with high task overlap, or where the background distribution is ill-defined? Additionally, while the open-source pipeline is a significant strength, its real-world applicability will depend on the ease of integration with existing model architectures and deployment pipelines. Future work could explore the scalability of SMF to larger models and more complex tasks, as well as its robustness to adversarial or noisy inputs. Overall, this work represents a meaningful contribution to the field of continual learning in LLMs, bridging the gap between theoretical advances and practical deployment challenges.

Recommendations

  • Conduct further experiments to evaluate SMF across a broader range of model architectures (e.g., larger LLMs, encoder-decoder models) and tasks (e.g., reasoning, multilingual settings) to assess generalization and robustness.
  • Develop guidelines or best practices for selecting the background distribution in the KL divergence-based slot-selection mechanism, particularly in domains where prior knowledge or task distributions are uncertain.
  • Explore hybrid approaches that combine SMF with other parameter-efficient finetuning methods (e.g., LoRA) to further optimize the trade-off between memory usage, computational efficiency, and performance.
  • Investigate the ethical implications of continual learning in LLMs, including potential biases introduced during model updates, and establish frameworks for auditing and mitigating these risks.
  • Expand the open-source pipeline to include pre-trained sparse memory modules for a wider variety of model families, reducing the barrier to adoption for practitioners.

Sources

Original: arXiv - cs.LG