Academic

Detoxifying LLMs via Representation Erasure-Based Preference Optimization

arXiv:2602.23391v1 Abstract: Large language models (LLMs) trained on web-scale data can produce toxic outputs, raising concerns for safe deployment. Prior defenses, based on applications of DPO, NPO, and similar algorithms, reduce the likelihood of harmful continuations, but not robustly so: they are vulnerable to adversarial prompting and easily undone by fine-tuning-based relearning attacks. Indeed, research has shown that these edits to the model are superficial: linear probing reveals that harmful "directions" remain present in representations. To address this, we propose Representation Erasure-based Preference Optimization (REPO), which reformulates detoxification as a token-level preference problem. Using a novel objective with preference data, we force the representations of toxic continuations to converge toward their benign counterparts. Our mechanistic analysis reveals that this granular approach is critical: unlike baselines, REPO induces deep, localized edits to toxicity-encoding neurons while preserving general model utility. Exhaustive evaluations show that REPO achieves state-of-the-art robustness, stopping sophisticated threats, including relearning attacks and enhanced GCG jailbreaks, where existing representation- and output-based methods fail.
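
The probing claim is easy to illustrate. Below is a minimal sketch (not from the paper) of how one might test whether a "toxicity direction" remains linearly decodable after detoxification: train a logistic probe on hidden states from toxic versus benign continuations and check its accuracy. The hidden states here are synthetic stand-ins; in practice they would be extracted from a fixed layer of the defended model.

```python
# Minimal sketch of linear probing for a residual "toxicity direction".
# The hidden states below are synthetic stand-ins; in practice they would
# be extracted from a fixed layer of the defended model on toxic vs.
# benign continuations. Not the paper's protocol.
import torch

torch.manual_seed(0)
d_model = 768

toxic_h = torch.randn(512, d_model) + 0.5   # cluster shifted along some direction
benign_h = torch.randn(512, d_model)

X = torch.cat([toxic_h, benign_h])
y = torch.cat([torch.ones(512), torch.zeros(512)])

probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(X).squeeze(-1), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = ((probe(X).squeeze(-1) > 0).float() == y).float().mean()

# High probe accuracy after a defense has been applied indicates the toxic
# direction is still linearly decodable, i.e. the edit was superficial.
print(f"probe accuracy: {acc.item():.2f}")
```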

Executive Summary

This paper proposes detoxifying large language models (LLMs) by reformulating detoxification as a token-level preference problem. The method, Representation Erasure-based Preference Optimization (REPO), trains on preference data with an objective that forces the representations of toxic continuations to converge toward their benign counterparts. The authors report state-of-the-art robustness against sophisticated threats, including relearning attacks and enhanced GCG jailbreaks, and attribute it to deep, localized edits to toxicity-encoding neurons that preserve general model utility. This addresses a known weakness of existing defenses, whose edits are superficial: linear probing shows that harmful directions survive in the representations and can be reactivated. If the results generalize, REPO could meaningfully improve the safe deployment of LLMs.
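
The abstract does not spell out the objective, but its description, a token-level preference term plus representation convergence, suggests a loss of roughly the following shape. This is a speculative sketch, not the paper's actual formulation: the token-level DPO-style margin, the MSE alignment term, and the hyperparameters `beta` and `lam` are all assumptions.

```python
# Speculative sketch of a REPO-style objective: a token-level DPO-like
# preference term plus a representation-alignment ("erasure") term.
# The paper's actual loss is not given in the abstract; the margin form,
# the MSE term, and the hyperparameters beta/lam are all assumptions.
import torch
import torch.nn.functional as F

def repo_style_loss(logp_benign, logp_toxic,          # (B, T) policy log-probs per token
                    ref_logp_benign, ref_logp_toxic,  # (B, T) frozen-reference log-probs
                    h_toxic, h_benign,                # (B, T, D) hidden states, chosen layer
                    beta=0.1, lam=1.0):
    # Per-token preference margin: prefer the benign continuation over the
    # toxic one, relative to the reference model (DPO-style, but token-level).
    margin = beta * ((logp_benign - ref_logp_benign)
                     - (logp_toxic - ref_logp_toxic))
    pref_loss = -F.logsigmoid(margin).mean()

    # Erasure term: pull toxic-continuation representations toward their
    # benign counterparts, treating the benign side as a fixed target.
    align_loss = F.mse_loss(h_toxic, h_benign.detach())

    return pref_loss + lam * align_loss

# Shape-level smoke test with random tensors.
B, T, D = 2, 8, 16
logps = [torch.randn(B, T) for _ in range(4)]
h_tox = torch.randn(B, T, D, requires_grad=True)
h_ben = torch.randn(B, T, D)
loss = repo_style_loss(*logps, h_tox, h_ben)
loss.backward()
print(loss.item())
```

In a real training loop the log-probabilities would come from the policy and a frozen reference model, and the hidden states from matched positions in the toxic and benign continuations.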

Key Points

  • REPO reformulates detoxification as a token-level preference problem, pulling the representations of toxic continuations toward their benign counterparts.
  • Mechanistic analysis shows REPO makes deep, localized edits to toxicity-encoding neurons while preserving general model utility.
  • REPO reports state-of-the-art robustness, withstanding relearning attacks and enhanced GCG jailbreaks that defeat existing representation- and output-based methods.

Merits

Robustness

REPO withstands sophisticated threats, including fine-tuning-based relearning attacks and enhanced GCG jailbreaks, where existing representation- and output-based methods fail.
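
Relearning robustness is typically measured by fine-tuning the defended model on a small toxic set and re-scoring it; the outline below shows the shape of such an evaluation. All names (`load_defended_model`, `finetune`, `toxicity_score`) are hypothetical placeholders, not the paper's harness.

```python
# Illustrative outline of a relearning-attack evaluation, not the paper's
# exact protocol. `load_defended_model`, `finetune`, and `toxicity_score`
# are hypothetical placeholders for an evaluator's own harness.
def relearning_attack_eval(load_defended_model, finetune, toxicity_score,
                           toxic_finetune_set, n_steps=100):
    model = load_defended_model()
    before = toxicity_score(model)                 # toxicity right after the defense
    finetune(model, toxic_finetune_set, n_steps)   # adversary's few-step fine-tune
    after = toxicity_score(model)                  # did toxicity come back?
    # A robust defense keeps `after` close to `before`; a superficial one
    # sees toxicity rebound after only a few gradient steps.
    return before, after
```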

Localized Edits

REPO induces deep, localized edits to toxicity-encoding neurons, preserving general model utility.
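
One common way to operationalize "toxicity-encoding neurons", sketched below with synthetic activations, is to rank units by their mean activation difference on toxic versus benign text. This is a generic attribution heuristic offered for intuition; the paper's actual localization method is not described in the abstract.

```python
# Generic attribution heuristic for locating candidate "toxicity-encoding"
# neurons: rank MLP units by mean activation difference on toxic vs. benign
# text. Offered for intuition; the paper's localization method is not
# described in the abstract. Activations here are synthetic.
import torch

torch.manual_seed(0)
d_mlp = 3072

acts_toxic = torch.randn(1000, d_mlp)
acts_toxic[:, :5] += 2.0                      # pretend units 0-4 encode toxicity
acts_benign = torch.randn(1000, d_mlp)

diff = acts_toxic.mean(dim=0) - acts_benign.mean(dim=0)
top = torch.topk(diff.abs(), k=5).indices
print("candidate toxicity neurons:", sorted(top.tolist()))

# A "deep, localized" edit changes the behavior of exactly these units;
# a superficial defense leaves them intact and only suppresses logits.
```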

Granular Approach

The token-level granularity of REPO is, per the authors' mechanistic analysis, what separates it from sequence-level baselines: it erases toxic representations rather than merely suppressing toxic outputs.

Demerits

High Computational Requirements

Token-level preference optimization is likely to be costly: it needs paired toxic/benign continuations and, in DPO-style formulations, extra forward passes through a frozen reference model, so REPO may demand significant compute and data-curation effort.

Limited Generalizability

The effectiveness of REPO may not extend beyond the LLM architectures and datasets on which it was evaluated.

Expert Commentary

The authors make a substantive contribution to LLM detoxification. Where prior defenses such as DPO- and NPO-based fine-tuning merely lower the probability of toxic continuations, REPO targets the underlying representations, and the paper's mechanistic analysis credits this granularity for its deep, localized edits to toxicity-encoding neurons. The robustness results against relearning attacks and enhanced GCG jailbreaks are the strongest evidence that these edits are more than superficial. The main open questions are cost and scope: token-level preference optimization may be computationally demanding, and the results may not generalize across architectures and datasets. Future work should address these limitations and test REPO on a broader range of models and toxicity benchmarks.

Recommendations

  • Further research is needed to explore the applicability of REPO to other LLM architectures and datasets.
  • Developing more efficient computational methods for REPO could significantly improve its practicality.
  • Regulatory frameworks for LLM safety and governance should be updated to account for advanced detoxification techniques such as representation-level editing.
