
Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning


Jyotin Goel, Souvik Maji, Pratik Mazumder

arXiv:2602.17546v1

Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.

Executive Summary

This article summarizes a training framework that adapts regularization in response to estimated safety risk, enabling instruction-following language models to remain aligned throughout fine-tuning. Risk is estimated with either a judge-based Safety Critic that scores training batches or an activation-based predictor, a lightweight classifier trained on intermediate model activations; updates deemed higher risk are constrained to stay close to a safe reference policy, while lower-risk updates proceed normally. Empirical results show that adaptive regularization with either risk signal lowers attack success rate relative to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. The study thus offers a principled mechanism for maintaining safety without sacrificing utility, with clear relevance to high-stakes applications such as healthcare and finance.
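The abstract does not give the exact training objective, but the mechanism it describes, penalizing divergence from a safe reference policy in proportion to estimated risk, can be sketched in a minimal form. The function below is an illustrative instantiation, not the paper's implementation: it assumes a KL penalty between the current and reference output distributions and a linear risk gate with a hypothetical threshold and maximum coefficient.

```python
import numpy as np

def adaptive_reg_loss(task_loss, policy_logits, ref_logits, risk,
                      lam_max=1.0, threshold=0.5):
    """Combine a task loss with a risk-gated KL penalty to a safe reference policy.

    risk is a score in [0, 1] for the current batch, e.g. from a judge-based
    Safety Critic or an activation probe. Batches below `threshold` train
    normally; above it, the penalty ramps linearly up to `lam_max`.
    (Illustrative form only; the paper's exact schedule is not specified
    in the abstract.)
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(policy_logits)   # current policy over the vocabulary
    q = softmax(ref_logits)      # frozen safe reference policy

    # Mean per-example KL(p || q); small epsilon guards the logs
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1).mean()

    # Risk gate: zero below threshold, ramping to lam_max at risk = 1
    lam = lam_max * max(0.0, (risk - threshold) / (1.0 - threshold))
    return task_loss + lam * kl
```

Under this form, low-risk batches incur no penalty at all, which is one plausible way to realize the paper's claim that lower-risk updates "proceed with standard training" while utility is preserved.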

Key Points

  • Adaptive regularization framework for maintaining safety during fine-tuning
  • Use of Safety Critic and risk predictor for safety risk estimation
  • Empirical results demonstrate improved safety without sacrificing utility
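The second risk signal above, the activation-based predictor, is described only as a lightweight classifier over intermediate model activations. A minimal sketch of one such probe is given below, using plain logistic regression trained by gradient descent; the class name, layer choice, and training details are assumptions for illustration, as the abstract does not specify the architecture.

```python
import numpy as np

class ActivationRiskProbe:
    """Lightweight logistic-regression probe over intermediate activations.

    Illustrative sketch: maps a (batch, dim) matrix of activation vectors
    to per-example harm-risk scores in [0, 1]. The paper's exact probe
    architecture and feature layer are not given in the abstract.
    """

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def _sigmoid(self, z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_risk(self, acts):
        # acts: (batch, dim) activation vectors -> risk scores in [0, 1]
        return self._sigmoid(acts @ self.w + self.b)

    def fit(self, acts, labels, epochs=200):
        # labels: 1 = harmful intent, 0 = benign; gradient descent on log loss
        for _ in range(epochs):
            grad = self.predict_risk(acts) - labels
            self.w -= self.lr * acts.T @ grad / len(labels)
            self.b -= self.lr * grad.mean()
```

Because the probe reads pre-generation activations, its score is available before any tokens are produced, which is consistent with the paper's claim of adding no inference-time cost when the signal is used only during training.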

Merits

Robust Safety Mechanism

The framework's adaptive nature enables models to remain aligned throughout fine-tuning, ensuring robust safety in the face of benign or adversarial updates.

Improved Utility

The study demonstrates that adaptive regularization preserves downstream performance, making it a practical solution for real-world applications.

Principled Mechanism

The proposed approach provides a principled mechanism for maintaining safety without sacrificing utility, addressing a critical gap in existing defenses.

Demerits

Limited Generalizability

The results may not generalize beyond the model families and attack scenarios evaluated, so further investigation is needed to confirm the framework's effectiveness in other domains and settings.

Dependence on Risk Estimation

The framework's protection is only as good as its risk estimates: a Safety Critic or activation probe that misses harmful batches leaves those updates unregularized, and accurate risk estimation may be hard to achieve for complex tasks or diverse datasets.

Expert Commentary

The proposed framework is a meaningful contribution to AI safety: by coupling training-time risk estimation with adaptive regularization, it keeps models aligned throughout fine-tuning rather than relying on static defenses or post-hoc filtering, and it does so without the safety-utility trade-off that limits prior approaches. Its main open questions mirror the demerits above: generalization beyond the evaluated model families and attack scenarios, and the practical difficulty of accurate risk estimation on complex or diverse training data. The approach nonetheless has clear implications for deploying fine-tunable language models in high-stakes settings, and its findings are relevant to emerging policy around fine-tuning APIs. Future research should probe the framework's robustness in diverse settings and explore its applicability to other AI domains.

Recommendations

  • Further investigation into the framework's generalizability and effectiveness in diverse settings
  • Exploration of the framework's applicability to other AI domains and high-stakes applications
