
Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning

arXiv:2602.19612v1 Announce Type: new Abstract: Machine Unlearning (MU) enables Large Language Models (LLMs) to remove unsafe or outdated information. However, existing work assumes that all facts are equally forgettable and largely ignores whether the forgotten knowledge originates from pretraining or supervised fine-tuning (SFT). In this paper, we introduce DUAL (Dual Unlearning Evaluation across Training Stages), a benchmark of 28.6k Wikidata-derived triplets annotated with fact popularity using Wikipedia link counts and LLM-based salience scores. Our experiments show that pretrained and SFT models respond differently to unlearning. An SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, while direct unlearning on pretrained models remains unstable and prone to relearning or catastrophic forgetting.

Executive Summary

The article 'Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning' examines Machine Unlearning (MU) in Large Language Models (LLMs), focusing on how a fact's origin, pretraining versus supervised fine-tuning (SFT), affects the unlearning process. The study introduces DUAL, a benchmark of 28.6k Wikidata-derived triplets annotated with fact-popularity metrics. The experiments show that an SFT step on the forget data yields smoother forgetting, more stable tuning, and 10-50% higher retention, whereas unlearning applied directly to pretrained models is unstable and prone to relearning or catastrophic forgetting. The work underscores the importance of accounting for where knowledge originates when designing unlearning procedures.
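The annotation scheme described above, combining Wikipedia link counts with LLM-based salience scores into a popularity signal, can be sketched as a simple scoring step. The function name, the log-scaling, the normalization cap, and the 50/50 weighting below are illustrative assumptions, not the paper's actual pipeline:

```python
import math

def popularity_score(link_count: int, llm_salience: float, weight: float = 0.5) -> float:
    """Combine a Wikipedia inbound-link count with an LLM salience score
    (assumed to lie in [0, 1]) into a single popularity value in [0, 1].

    The log-compression and equal weighting are illustrative choices; the
    paper does not specify how (or whether) the two signals are combined.
    """
    # Log-compress raw link counts so a few hub pages don't dominate.
    log_links = math.log1p(link_count)
    # Normalize against an assumed cap of ~1e6 inbound links.
    norm_links = min(log_links / math.log1p(1_000_000), 1.0)
    return weight * norm_links + (1 - weight) * llm_salience

# Annotate a Wikidata-style (subject, relation, object) triplet
# with a popularity field (IDs here are arbitrary examples).
triplet = {"subject": "Q937", "relation": "P166", "object": "Q38104"}
triplet["popularity"] = popularity_score(link_count=25_000, llm_salience=0.9)
```

Log-compressing the link counts is one plausible way to keep extremely popular pages from saturating the scale, so that the salience dimension still contributes for well-linked entities.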

Key Points

  • Introduction of DUAL benchmark for evaluating unlearning across training stages.
  • Differential impact of pretraining and SFT on unlearning effectiveness.
  • SFT models show smoother forgetting and higher retention compared to pretrained models.

Merits

Comprehensive Benchmark

The introduction of the DUAL benchmark provides a robust and annotated dataset for evaluating unlearning, which is a significant contribution to the field.

Empirical Evidence

The study presents empirical evidence showing the differential impact of pretraining and SFT on unlearning, which is crucial for understanding the nuances of MU.
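One way to read the reported "10-50% higher retention" is as the fraction of retain-set performance a model preserves after unlearning. The metric and the numbers below are illustrative assumptions for exposition; the paper's exact evaluation protocol may differ:

```python
def retention(retain_acc_before: float, retain_acc_after: float) -> float:
    """Fraction of retain-set accuracy preserved after unlearning.

    This definition (simple before/after ratio) is an illustrative
    assumption, not necessarily the metric used in the paper.
    """
    if retain_acc_before == 0:
        return 0.0
    return retain_acc_after / retain_acc_before

# Hypothetical numbers: a model given an SFT step on the forget data
# keeps more of its retain-set accuracy than direct unlearning does.
direct_unlearn = retention(0.80, 0.48)  # 60% of accuracy retained
sft_then_unlearn = retention(0.80, 0.72)  # 90% of accuracy retained

# Relative improvement from the SFT step, under these assumed numbers.
improvement = (sft_then_unlearn - direct_unlearn) / direct_unlearn
```

Under these made-up accuracies, the SFT-first model retains 50% more relative performance, which falls inside the 10-50% range the abstract reports.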

Practical Insights

The findings offer practical insights into improving the stability and effectiveness of unlearning in LLMs, which can be directly applied in real-world scenarios.

Demerits

Limited Scope

The study focuses primarily on Wikidata-derived triplets, which may not be representative of all types of knowledge or unlearning scenarios.

Generalizability

The results may not be generalizable to all LLMs or unlearning techniques, as the study is based on specific models and methods.

Data Bias

The use of Wikipedia link counts and LLM-based salience scores for annotating fact popularity may introduce biases that could affect the validity of the findings.

Expert Commentary

'Anatomy of Unlearning: The Dual Impact of Fact Salience and Model Fine-Tuning' is a meaningful advance for Machine Unlearning. The DUAL benchmark, together with empirical evidence that pretrained and SFT models respond differently to unlearning, clarifies how the origin of knowledge shapes forgetting, a prerequisite for building more stable and effective unlearning mechanisms. That said, the reliance on Wikidata-derived triplets, and on Wikipedia link counts and LLM-based salience scores for annotation, may limit generalizability and introduce bias, and warrants further investigation. Overall, the work strengthens ongoing efforts to improve the interpretability, ethics, and privacy of AI systems and provides a solid foundation for future studies in this area.

Recommendations

  • Future research should explore the applicability of the findings to a broader range of knowledge types and unlearning scenarios.
  • Developers should consider incorporating the insights from this study into their unlearning mechanisms to improve model stability and effectiveness.
