Improving Safety Alignment via Balanced Direct Preference Optimization

arXiv:2603.22829v1 Announce Type: new. Abstract: With the rapid development and widespread application of Large Language Models (LLMs), their potential safety risks have attracted widespread attention. Reinforcement Learning from Human Feedback (RLHF) has been adopted to enhance the safety performance of LLMs. As a simple and effective alternative to RLHF, Direct Preference Optimization (DPO) is widely used for safety alignment. However, safety alignment still suffers from severe overfitting, which limits its actual performance. This paper revisits the overfitting phenomenon from the perspective of the model's comprehension of the training data. We find that the Imbalanced Preference Comprehension phenomenon exists between responses in preference pairs, which compromises the model's safety performance. To address this, we propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates optimization strength between preferred and dispreferred responses based on mutual information. A series of experimental results show that B-DPO can enhance the safety capability while maintaining the competitive general capabilities of LLMs on various mainstream benchmarks compared to state-of-the-art methods. Warning: This paper contains examples of harmful texts, and reader discretion is recommended.

Executive Summary

This paper addresses a critical challenge in the safety alignment of Large Language Models (LLMs) by identifying and mitigating the overfitting problem in Direct Preference Optimization (DPO). The authors identify an 'Imbalanced Preference Comprehension' phenomenon, in which the model comprehends the preferred and dispreferred responses in a preference pair unevenly, leading to compromised safety performance. To counter this, they propose Balanced Direct Preference Optimization (B-DPO), which adaptively modulates the optimization strength applied to each side of a preference pair using mutual information. Experimental results indicate that B-DPO improves safety across benchmark evaluations while preserving general LLM capabilities. The work is notable for its analytical depth and practical applicability, offering a novel mechanism to enhance safety without sacrificing utility.

Key Points

  • Identification of Imbalanced Preference Comprehension as a cause of overfitting in DPO
  • Proposal of B-DPO as a novel adaptive optimization framework using mutual information
  • Empirical validation showing improved safety alignment without loss of general performance

Merits

Novelty

B-DPO introduces a statistically grounded adaptive mechanism based on mutual information, offering a differentiated approach to safety alignment.
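To make the idea concrete, the following is a minimal sketch of a DPO-style loss with separate weights on the preferred and dispreferred log-ratio terms. With both weights fixed at 1 it reduces to standard DPO; the abstract indicates that B-DPO would instead set these weights adaptively from a mutual-information-based measure of the model's comprehension of each response. That measure, and the function and parameter names below, are illustrative assumptions, not the paper's exact formulation.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def balanced_dpo_loss(logp_chosen: float, logp_rejected: float,
                      ref_logp_chosen: float, ref_logp_rejected: float,
                      beta: float = 0.1,
                      w_chosen: float = 1.0, w_rejected: float = 1.0) -> float:
    """Per-pair DPO loss with separate weights on each log-ratio term.

    logp_* are log-probabilities of the responses under the policy being
    trained; ref_logp_* are the same quantities under the frozen reference
    model. Standard DPO corresponds to w_chosen = w_rejected = 1.0; an
    adaptive scheme (as B-DPO's abstract describes) would supply these
    weights per pair -- the weighting rule itself is assumed here.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen       # log pi/pi_ref, preferred
    rejected_ratio = logp_rejected - ref_logp_rejected  # log pi/pi_ref, dispreferred
    margin = beta * (w_chosen * chosen_ratio - w_rejected * rejected_ratio)
    return -math.log(sigmoid(margin))
```

For example, when the policy matches the reference on both responses the margin is zero and the loss is log 2, the usual DPO starting point; down-weighting the better-comprehended side shrinks its contribution to the margin, which is the rebalancing intuition the paper's mechanism formalizes.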

Demerits

Risk of Harmful Content Exposure

The paper contains examples of harmful texts, raising ethical concerns regarding dissemination and reader discretion.

Expert Commentary

The paper makes a significant contribution to the field of safety alignment by diagnosing a subtle yet impactful issue—Imbalanced Preference Comprehension—that undermines the efficacy of current DPO-based approaches. The proposed B-DPO framework is intellectually rigorous, leveraging information-theoretic principles to rebalance the learning dynamics between preferred and dispreferred data. This is a meaningful step forward in refining safety-oriented training paradigms. However, the ethical dimension of including harmful content in the paper cannot be ignored. While the research is methodologically sound, the inclusion of such content necessitates careful editorial oversight and contextual framing to mitigate unintended consequences. From a legal and academic perspective, the work demonstrates a commendable alignment between technical innovation and safety-critical objectives, yet it also underscores the necessity for broader discourse on the responsibilities of researchers when disseminating sensitive material.

Recommendations

  1. Researchers should consider transparent editorial guidelines for content inclusion, particularly when dealing with sensitive or harmful examples.
  2. Adoption of B-DPO in real-world safety alignment pipelines should be evaluated in conjunction with independent ethical review boards to ensure responsible deployment.

Sources

Original: arXiv - cs.AI