
Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

Yuxiao Lu, Lin Xu, Yang Sun, Wenjun Li, Jie Shi

arXiv:2603.03323v1 Announce Type: cross

Abstract: Large language models (LLMs) aligned for safety often suffer from over-refusal: the tendency to reject seemingly toxic but benign prompts by misclassifying them as toxic. This behavior undermines models' helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluation across diverse benchmarks shows that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.

Executive Summary

This article summarizes a paper that addresses over-refusal in safety-aligned large language models (LLMs): the tendency to reject benign prompts that merely look toxic. The authors introduce DCR (Discernment via Contrastive Refinement), an alignment stage that runs before standard safety training and teaches the model to distinguish truly toxic prompts from superficially toxic ones. Both theoretical analysis and empirical results indicate that DCR reduces over-refusal while preserving the safety benefits of alignment and causing minimal degradation of general capabilities, making it a more principled and robust direction for safety alignment. The work has practical implications for deploying LLMs that are both safe and reliably helpful.
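
The trade-off at the heart of the paper can be made concrete with a toy measurement. The sketch below is not from the paper; the prompt sets, refusal flags, and numbers are illustrative assumptions. It shows the two rates a mitigation method must move in opposite directions: refusals on seemingly toxic but benign prompts (over-refusal, which should fall) and refusals on genuinely toxic prompts (safety, which should stay high).

```python
# Toy sketch of the over-refusal vs. safety trade-off (illustrative,
# not the paper's benchmark code). Each flag marks whether the model
# refused one prompt; in practice flags would come from a refusal
# classifier or keyword heuristic, which we assume here for brevity.

def refusal_rate(flags):
    """Fraction of responses flagged as refusals."""
    return sum(flags) / len(flags)

# Hypothetical per-prompt refusal flags on two disjoint prompt sets.
seemingly_toxic = [True, False, False, False]  # benign prompts that look toxic
truly_toxic = [True, True, True, False]        # genuinely harmful prompts

over_refusal = refusal_rate(seemingly_toxic)  # want this LOW  (here 0.25)
safety = refusal_rate(truly_toxic)            # want this HIGH (here 0.75)
print(f"over-refusal: {over_refusal:.2f}, safety refusal rate: {safety:.2f}")
```

A method that lowers the first number by simply refusing less often will also lower the second; the paper's claim is that discernment training lowers over-refusal without that side effect.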

Key Points

  • Over-refusal arises because toxic and seemingly toxic prompts exert an ambiguous, entangled influence on the model's learning dynamics.
  • DCR (Discernment via Contrastive Refinement) adds an alignment stage before standard safety training to disentangle these two classes of prompts.
  • By improving the model's capacity to distinguish truly toxic prompts from superficially toxic ones, DCR reduces over-refusal while preserving safety (a minimal sketch of the idea follows this list).
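
The contrastive idea behind DCR can be illustrated with a minimal sketch. The code below is a hypothetical stand-in, not the authors' implementation: the margin value, the random embeddings, and the exact loss form are all assumptions. It pulls same-class prompt representations together and pushes genuinely toxic representations at least a margin away from superficially toxic but benign ones, the kind of separation a refusal decision can then rely on.

```python
# Hypothetical contrastive-refinement objective (a sketch, not DCR itself).
import torch
import torch.nn.functional as F

def contrastive_refinement_loss(toxic_emb, benign_emb, margin=1.0):
    """Margin-based contrastive loss over prompt embeddings.

    toxic_emb:  (N, d) embeddings of genuinely toxic prompts
    benign_emb: (M, d) embeddings of superficially toxic, benign prompts
    """
    toxic = F.normalize(toxic_emb, dim=-1)
    benign = F.normalize(benign_emb, dim=-1)
    # For unit vectors, cosine distance = 1 - cosine similarity.
    # Pull each class together: mean pairwise distance within a class.
    intra = (1 - toxic @ toxic.T).mean() + (1 - benign @ benign.T).mean()
    # Push the classes apart: hinge penalty whenever a cross-class
    # pair is closer than `margin`.
    cross = torch.clamp(margin - (1 - toxic @ benign.T), min=0).mean()
    return intra + cross

# Toy usage with random stand-ins for encoder outputs.
toxic = torch.randn(8, 64, requires_grad=True)
benign = torch.randn(8, 64, requires_grad=True)
loss = contrastive_refinement_loss(toxic, benign)
loss.backward()  # in a real setup the gradients would update the encoder
```

In a real pipeline the embeddings would come from the LLM's own hidden states for each prompt, and this stage would run before standard safety alignment, as the paper describes.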

Merits

Improved Safety Alignment

By explicitly separating truly toxic prompts from superficially toxic ones, DCR reduces over-refusal while preserving the model's ability to reject genuinely harmful content.

Robustness and Generalizability

DCR causes minimal degradation of general capabilities, suggesting the approach generalizes beyond the narrow safety objective and offers a more principled foundation for alignment.

Demerits

Limited Evaluation Benchmarks

Although the paper reports evaluation across diverse benchmarks, those benchmarks may not cover all deployment contexts, which limits how far the reported results can be assumed to generalize.

Dependence on Data Quality

The effectiveness of contrastive refinement likely depends on the quality of the contrastive training data, i.e., how cleanly it separates truly toxic prompts from superficially toxic ones, which can be difficult to curate in practice.

Expert Commentary

The paper makes a significant contribution to AI safety by targeting a failure mode that directly limits the usefulness of aligned models. Framing over-refusal as a discernment problem, and addressing it in a dedicated stage before alignment, offers a more principled and robust direction than post-hoc fixes such as data augmentation or activation steering. That said, the limited scope of the evaluation benchmarks and the method's dependence on well-curated contrastive data remain open challenges. Overall, the paper makes a compelling case for adopting DCR-style discernment training as a standard component of safety alignment.

Recommendations

  • Future work should evaluate DCR across a wider range of benchmarks and domains to assess its generalizability and robustness.
  • Work on safety alignment should prioritize the open challenges identified here: curating high-quality contrastive training data and building broader evaluation benchmarks.
