CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
arXiv:2603.06610v1 Abstract: Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use case of leveraging third-party pre-trained models, and forgetting is typically understood as a loss of parametric or factual knowledge. We argue that this accuracy-centric view is insufficient for modern foundation models and instead define forgetting as systematic model drift that degrades behavior and user experience. In this context, we introduce CapTrack, a capability-centric framework for analyzing forgetting in LLMs that combines a behavioral taxonomy with an evaluation suite built on established benchmarks and targeted adaptations. Using CapTrack, we conduct a large-scale empirical study across post-training algorithms, domains, and model families, including models up to 80B parameters. We find that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.
Executive Summary
This study introduces CapTrack, a capability-centric framework for evaluating forgetting induced by post-training in large language models (LLMs). The authors argue that the traditional accuracy-centric view of forgetting is insufficient and instead propose a behavioral taxonomy for assessing model drift that degrades behavior and user experience. They conduct a large-scale empirical analysis across post-training algorithms, domains, and model families, including models up to 80B parameters. The findings reveal that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. These results carry significant implications for the development and deployment of LLMs, highlighting the need for more comprehensive evaluation frameworks and mitigation strategies.
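The paper's scoring pipeline is not reproduced here, but the core measurement, comparing a base model against its post-trained variant per capability category, can be sketched. The following is a minimal Python illustration assuming per-capability benchmark scores are already computed; the capability names are taken from the abstract, while the `DriftReport` structure and the relative-drift formula are our assumptions rather than the authors' exact formulation.

```python
from dataclasses import dataclass

# Capability categories named in the abstract (parametric knowledge,
# robustness, default behaviors); a real taxonomy would be richer.
CAPABILITIES = ["parametric_knowledge", "robustness", "default_behaviors"]


@dataclass
class DriftReport:
    capability: str
    base_score: float
    post_score: float
    relative_drift: float  # negative = forgetting, positive = improvement


def capability_drift(base: dict[str, float],
                     post: dict[str, float]) -> list[DriftReport]:
    """Compare a base model with its post-trained variant per capability.

    Scores are benchmark metrics in [0, 1], aggregated per capability;
    drift is expressed relative to the base model's score.
    """
    reports = []
    for cap in CAPABILITIES:
        b, p = base[cap], post[cap]
        drift = (p - b) / b if b > 0 else 0.0
        reports.append(DriftReport(cap, b, p, drift))
    return reports


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    base = {"parametric_knowledge": 0.62, "robustness": 0.55,
            "default_behaviors": 0.70}
    post = {"parametric_knowledge": 0.60, "robustness": 0.41,
            "default_behaviors": 0.52}
    for r in capability_drift(base, post):
        print(f"{r.capability}: {r.relative_drift:+.1%}")
```

A per-capability relative measure like this makes the paper's headline finding legible: a post-trained model can hold its factual scores nearly flat while robustness and default behaviors drift sharply.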
Key Points
- ▸ CapTrack evaluates post-training forgetting in LLMs through a behavioral taxonomy paired with an evaluation suite built on established benchmarks.
- ▸ Forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors.
- ▸ Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities (compared in the sketch below).
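The contrast drawn in the last point can be made concrete by aggregating per-capability drift per post-training method. The sketch below assumes drift values shaped like the output of `capability_drift` above; the method names mirror the abstract, but all numbers and the -10% flagging threshold are illustrative, not results from the paper.

```python
from statistics import mean

# Hypothetical relative-drift values for two post-training methods,
# keyed by the capability categories used above. Values are made up
# to mimic the abstract's qualitative finding.
drift_by_method = {
    "instruction_fine_tuning": {"parametric_knowledge": -0.03,
                                "robustness": -0.25,
                                "default_behaviors": -0.22},
    "preference_optimization": {"parametric_knowledge": -0.01,
                                "robustness": -0.06,
                                "default_behaviors": +0.02},
}

# Order methods from most to least conservative (mean drift closest to
# zero) and flag capabilities with pronounced degradation.
for method, drifts in sorted(drift_by_method.items(),
                             key=lambda kv: abs(mean(kv[1].values()))):
    flagged = [cap for cap, d in drifts.items() if d < -0.10]
    print(f"{method}: mean drift {mean(drifts.values()):+.1%}, "
          f"pronounced drift in {', '.join(flagged) if flagged else 'none'}")
```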
Merits
Strength in Methodology
The authors develop a comprehensive framework for evaluating forgetting in LLMs, addressing a significant gap in the field.
Strength in Rigor
The study conducts a large-scale empirical analysis across various post-training algorithms, domains, and model families, providing robust findings.
Demerits
Limitation in Generalizability
The findings span several model families and post-training setups but may not transfer to all LLMs or training regimes; the authors themselves report that differences across model families persist and that no universal mitigation emerges, so further research is needed.
Limitation in Scalability
Running the full evaluation suite across post-training algorithms, domains, and model families, including models up to 80B parameters, is computationally intensive, which may limit the framework's practical applicability.
Expert Commentary
The CapTrack framework represents a significant advance in the evaluation of forgetting induced by LLM post-training. By shifting the focus from an accuracy-centric view to a behavioral taxonomy of capability drift, the authors provide a more comprehensive picture of how post-training reshapes model behavior. While the study's findings are robust, the limitations in generalizability and scalability point to the need for further research. As the field evolves, it will be essential to address the impact of forgetting on user experience and default behavior so that LLMs are developed and deployed responsibly.
Recommendations
- ✓ Future research should focus on making capability-centric evaluation frameworks such as CapTrack more scalable and computationally efficient to facilitate widespread adoption.
- ✓ LLM developers and researchers should prioritize the development of more targeted optimization strategies to mitigate forgetting, particularly in instruction fine-tuning and preference optimization.