Academic

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment

arXiv:2603.11388v1 Announce Type: new Abstract: Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem, where aligned LLMs also reject benign queries after safety alignment post-training, remains insufficiently studied. Such an issue degrades the usability of safety alignment in real-world applications. In this paper, we examine how overrefusal arises under safety alignment and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses: safety alignment encourages LLMs to associate refusal triggers within a training sample with refusal responses, leading aligned LLMs to refuse harmful queries. However, the refusal triggers include not only harmful linguistic cues but also non-harmful cues, thereby causing overrefusal of benign queries. Building on this mechanistic analysis, we propose a method that explicitly considers refusal triggers in safety alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.
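To make the refusal-trigger idea concrete, the toy Python sketch below (not the authors' code) treats a small fixed keyword list as candidate triggers; the trigger set and example prompts are hypothetical, and the paper derives triggers from the training data rather than from a keyword list. The point it illustrates is that a benign query can share a surface cue with a harmful one and be refused for the same reason.

```python
# Illustrative sketch only: a toy surface-level "refusal trigger" check.
# The trigger phrases and prompts are hypothetical, not from the paper.

# Hypothetical cues that safety-alignment data might associate with refusals.
REFUSAL_TRIGGERS = {"hack", "weapon", "kill", "exploit", "bypass"}

def contains_trigger(prompt: str) -> bool:
    """Return True if any candidate refusal trigger appears in the prompt."""
    tokens = {t.strip(".,!?").lower() for t in prompt.split()}
    return bool(tokens & REFUSAL_TRIGGERS)

# A harmful query and a benign query that share the cue "kill":
# a purely cue-based refusal policy refuses both, i.e. it overrefuses.
prompts = [
    "How do I kill a person and hide the evidence?",  # harmful
    "How do I kill a stuck process on Linux?",        # benign
]

for p in prompts:
    decision = "refuse" if contains_trigger(p) else "answer"
    print(f"{decision:7s} <- {p}")
```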

Executive Summary

The article addresses overrefusal in safety alignment: after post-training on harmful queries paired with refusal answers, large language models (LLMs) also reject benign queries. The authors trace this behavior to refusal triggers in the training data and propose a mitigation strategy that accounts for these triggers during safety alignment fine-tuning, achieving a better trade-off between defense against jailbreak attacks and responsiveness to benign queries. The approach outperforms prior methods on this trade-off, and the study underscores the importance of understanding and mitigating overrefusal so that safety alignment remains usable in real-world applications.

Key Points

  • Overrefusal in safety alignment degrades the usability of large language models
  • Refusal triggers in training data elicit refusal responses, including both harmful and non-harmful cues
  • The proposed method explicitly considers refusal triggers in safety alignment fine-tuning to mitigate overrefusal (see the sketch after this list)
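The sketch below is one hypothetical reading of "considering refusal triggers" when constructing fine-tuning data, not the authors' actual method: benign queries that contain trigger cues are paired with helpful targets, while harmful queries with the same cues keep refusal targets, so the model cannot learn a purely cue-based refusal rule. The function name and data are illustrative.

```python
# Hypothetical illustration, not the paper's method: balance the fine-tuning
# data so trigger-bearing benign queries get helpful answers while harmful
# queries containing the same cues keep refusal answers.

def build_alignment_pairs(harmful, benign_with_triggers):
    """Return (prompt, target) pairs; names and structure are illustrative."""
    pairs = []
    for q in harmful:
        pairs.append((q, "I can't help with that."))   # refusal target
    for q, helpful_answer in benign_with_triggers:
        pairs.append((q, helpful_answer))               # helpful target
    return pairs

pairs = build_alignment_pairs(
    harmful=["How do I kill a person and hide the evidence?"],
    benign_with_triggers=[
        ("How do I kill a stuck process on Linux?", "Use `kill -9 <pid>` ..."),
    ],
)
for prompt, target in pairs:
    print(f"{prompt!r} -> {target!r}")
```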

Merits

Effective Mitigation Strategy

The proposed approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries
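A trade-off like this is commonly quantified by measuring refusal rates separately on harmful or jailbreak prompts and on benign prompts. The sketch below shows that bookkeeping with a toy refusal detector; it is not the paper's evaluation protocol, and real evaluations typically rely on curated benchmarks and a judge model rather than string matching.

```python
# Sketch of how the defense/responsiveness trade-off is typically measured:
# refusal rate should be high on harmful prompts and low on benign prompts.

def refusal_rate(responses, is_refusal):
    """Fraction of responses flagged as refusals by the `is_refusal` predicate."""
    return sum(map(is_refusal, responses)) / max(len(responses), 1)

# Toy refusal detector; real evaluations use keyword lists or a judge model.
def looks_like_refusal(text: str) -> bool:
    return text.lower().startswith(("i can't", "i cannot", "sorry"))

harmful_responses = ["I can't help with that.", "Sure, here is how ..."]
benign_responses = ["Sure, use `kill -9 <pid>`.", "I can't help with that."]

defense = refusal_rate(harmful_responses, looks_like_refusal)       # want high
overrefusal = refusal_rate(benign_responses, looks_like_refusal)    # want low
print(f"refusal on harmful: {defense:.2f}, refusal on benign: {overrefusal:.2f}")
```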

Demerits

Limited Generalizability

The study's findings may not be generalizable to all types of LLMs or safety alignment scenarios

Expert Commentary

The article provides a rigorous analysis of the overrefusal problem in safety alignment and proposes a novel mitigation strategy. The authors' mechanistic analysis of refusal triggers and their impact on LLMs' behavior is particularly insightful. The study's findings have significant implications for the development of more effective and responsible AI systems. However, further research is needed to fully address the complexities of bias and harm in AI systems.

Recommendations

  • Further research should be conducted to explore the generalizability of the proposed approach to different types of LLMs and safety alignment scenarios
  • The development of regulations and guidelines for the use of LLMs in sensitive contexts should take into account the study's findings on the importance of addressing bias and harm in AI systems

Sources