Fail-Closed Alignment for Large Language Models

Zachary Coalson, Beth Sohler, Aiden Gabriel, Sanghyun Hong

arXiv:2602.16977v1

Abstract: We identify a structural weakness in current large language model (LLM) alignment: modern refusal mechanisms are fail-open. While existing approaches encode refusal behaviors across multiple latent features, suppressing a single dominant feature (via prompt-based jailbreaks) can cause alignment to collapse, leading to unsafe generation. Motivated by this, we propose fail-closed alignment as a design principle for robust LLM safety: refusal mechanisms should remain effective even under partial failures via redundant, independent causal pathways. We present a concrete instantiation of this principle: a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions, forcing the model to reconstruct safety along new, independent subspaces. Across four jailbreak attacks, we achieve the strongest overall robustness while mitigating over-refusal and preserving generation quality, with small computational overhead. Our mechanistic analyses confirm that models trained with our method encode multiple, causally independent refusal directions that prompt-based jailbreaks cannot suppress simultaneously, providing empirical support for fail-closed alignment as a principled foundation for robust LLM safety.
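To ground the fail-open weakness the abstract describes, here is a minimal sketch of single-direction refusal ablation in a transformer's residual stream, the kind of mechanism prompt-based jailbreaks exploit. The direction is estimated as a difference of means over harmful versus harmless activations; the function names, tensor shapes, and fixed-layer assumption are illustrative, not taken from the paper.

```python
# Sketch of single-direction refusal ablation (hypothetical helpers).
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction separating harmful from harmless
    prompts; inputs are (n_prompts, d_model) residual-stream activations
    collected at one layer and token position (an illustrative choice)."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate(acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of each activation along `direction`.
    Suppressing this single direction is, roughly, what a successful
    prompt-based jailbreak achieves when alignment is fail-open."""
    coeffs = acts @ direction                    # (n_prompts,)
    return acts - coeffs.unsqueeze(-1) * direction
```

If refusal is mediated almost entirely by this one direction, removing it defeats alignment end to end, which is exactly the failure mode the paper's fail-closed principle targets.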

Executive Summary

The article proposes a novel approach to large language model (LLM) safety, introducing the concept of fail-closed alignment. The authors identify a structural weakness in current LLM alignment: refusal mechanisms are fail-open, so suppressing a single dominant refusal feature can collapse safety entirely. To address this, they propose a progressive alignment framework that iteratively identifies and ablates previously learned refusal directions, forcing the model to reconstruct safety along new, independent subspaces. Evaluated against four jailbreak attacks, the approach achieves the strongest overall robustness while mitigating over-refusal and preserving generation quality. The article provides empirical support for fail-closed alignment as a principled foundation for robust LLM safety, with implications for the development of safer and more reliable LLMs.

Key Points

  • Fail-closed alignment is proposed as a design principle for robust LLM safety, where refusal mechanisms remain effective even under partial failures.
  • A progressive alignment framework is introduced to iteratively identify and ablate previously learned refusal directions, forcing the model to reconstruct safety along new, independent subspaces (see the sketch after this list).
  • Across four jailbreak attacks, the approach achieves the strongest overall robustness while mitigating over-refusal and preserving generation quality.
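Here is a hypothetical sketch of the progressive alignment loop described above: each round finds the model's current refusal direction, adds it to an ablation set, and fine-tunes with that set projected out, so refusal must be rebuilt in a fresh subspace. `collect_activations` and `safety_finetune` are assumed stand-ins, and `refusal_direction` is the difference-of-means estimator from the earlier sketch; none of this is the authors' implementation.

```python
# Hypothetical outer loop for progressive alignment.
import torch

def progressive_alignment(model, harmful_prompts, harmless_prompts,
                          n_rounds: int = 4):
    directions = []
    for _ in range(n_rounds):
        h = collect_activations(model, harmful_prompts)    # (n, d_model)
        s = collect_activations(model, harmless_prompts)   # (n, d_model)
        d = refusal_direction(h, s)                        # unit vector
        directions.append(d)
        # Fine-tune while ablating every direction found so far, forcing
        # the model to encode refusal along a new, independent subspace.
        model = safety_finetune(model, ablate=directions)
    return model, directions

def pairwise_cosine(directions: list[torch.Tensor]) -> torch.Tensor:
    """Mechanistic check: off-diagonal values near zero would support the
    claim that the learned refusal directions are mutually independent."""
    D = torch.stack(directions)
    D = D / D.norm(dim=1, keepdim=True)
    return D @ D.T
```

The independence check matters because fail-closed behavior only follows if a jailbreak cannot suppress all of the learned directions at once; highly correlated directions would collapse together.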

Merits

Addresses a critical issue

The article identifies a significant weakness in current LLM alignment and proposes a novel approach to address it, providing a principled foundation for robust LLM safety.

Empirical support

The authors provide empirical evidence from evaluations against four jailbreak attacks, demonstrating that their approach achieves the strongest overall robustness while mitigating over-refusal.

Computationally efficient

The proposed approach has a small computational overhead, making it a practical solution for large-scale LLM safety applications.

Demerits

Limited generalizability

The article's findings are based on a specific LLM architecture and dataset, and it is unclear whether the approach will generalize to other LLMs and domains.

Dependence on specific threat models

The proposed approach is designed to address specific jailbreak attacks, and it is unclear whether it will be effective against other types of threats or vulnerabilities.

Lack of transparency

The article does not give a clear account of how the progressive alignment framework operates internally, making the results challenging to replicate or extend.

Expert Commentary

The article presents a novel approach to LLM safety, addressing a critical issue in the field. The approach is not without limitations, however, and further research is needed to fully understand its scope. The proposed progressive alignment framework shows promise, but its generalizability and its dependence on specific threat models require further investigation. Additionally, the article highlights the need for more transparent and explainable AI systems, particularly in applications where safety and reliability are critical. Overall, the article contributes to the ongoing discussion on LLM safety and provides a valuable starting point for further research.

Recommendations

  • Further investigation is needed to fully understand the potential and limitations of the proposed approach.
  • The article's findings should be replicated and extended to other LLM architectures and datasets to assess the approach's generalizability.
  • The development of more transparent and explainable AI systems is essential for ensuring the reliability and safety of LLMs.
