Enhancing Instruction Following of LLMs via Activation Steering with Dynamic Rejection

Minjae Kang, Jaehyung Kim

arXiv:2603.06745v1 Announce Type: new Abstract: Large Language Models (LLMs), despite advances in instruction tuning, often fail to follow complex user instructions. Activation steering techniques aim to mitigate this by manipulating model internals, but they carry a risk of oversteering, where excessive emphasis on the instruction degrades task accuracy and overall text quality. To address this, we introduce DIRECTER (Dynamic rejection steering), a novel steering method that dynamically modulates steering strength by scaling the KV cache, without requiring an extra dataset. DIRECTER couples steering with a plausibility-guided decoding loop, which adaptively adjusts steering strength at each step by comparing the steered output distribution to the original. If the steered output is deemed implausible, steering strength is progressively weakened. This strength modulation is guided by a lightweight, one-time attention sensitivity analysis that ranks layers by their influence on model representations. Extensive evaluations show that DIRECTER significantly enhances instruction-following capabilities across diverse benchmarks, improving accuracy by up to 6.5% over baselines without the common trade-offs in generation quality or task fidelity. The proposed dynamic, plausibility-guided control during activation steering further demonstrates its potential as a general mechanism for mitigating oversteering that is compatible with existing baselines.

Executive Summary

This study proposes DIRECTER, an activation steering method that enhances instruction following in Large Language Models (LLMs) while mitigating oversteering. DIRECTER dynamically modulates steering strength by scaling the KV cache and couples this with a plausibility-guided decoding loop. Across diverse benchmarks it outperforms baselines, improving accuracy by up to 6.5% without compromising generation quality or task fidelity. The findings suggest that dynamic, plausibility-guided control can serve as a general mechanism for mitigating oversteering that is compatible with existing baselines, addressing a persistent challenge in natural language processing.
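The summary describes steering strength as a scalar applied to the KV cache at decoding time. The paper's implementation is not reproduced here, so the following is a hypothetical sketch using NumPy arrays in place of real attention caches; the per-layer `(key, value)` pair layout and the single scalar `alpha` are assumptions for illustration.

```python
import numpy as np

def scale_kv_cache(past_key_values, alpha):
    """Scale every cached key/value tensor by steering strength alpha.

    past_key_values: list of per-layer (key, value) array pairs, as in
    common transformer decoding loops. alpha = 1.0 leaves the cache
    untouched; values below 1.0 weaken the steering signal.
    """
    return [(alpha * k, alpha * v) for k, v in past_key_values]

# Two dummy layers with unit-valued caches (shape: heads x seq x dim).
cache = [(np.ones((2, 4, 8)), np.ones((2, 4, 8))) for _ in range(2)]
scaled = scale_kv_cache(cache, 0.5)
print(scaled[0][0][0, 0, 0])  # 0.5
```

Scaling the cache rather than retraining is what makes the method training-free: the modulation happens entirely at inference time.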

Key Points

  • DIRECTER is a novel activation steering method that dynamically modulates steering strength by scaling the KV cache.
  • The method incorporates a plausibility-guided decoding loop to adaptively adjust steering strength at each step.
  • DIRECTER outperforms baselines across diverse benchmarks, improving accuracy by up to 6.5% without compromising generation quality or task fidelity.
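The plausibility-guided adjustment in the second bullet can be sketched as a per-step check. This is a hypothetical reconstruction, not the paper's code: the threshold `tau`, the decay factor, and the use of the original model's probability of the steered top token as the plausibility score are all illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def update_alpha(orig_logits, steered_logits, alpha, tau=0.05, decay=0.5):
    """One step of plausibility-guided strength control (sketch).

    If the token favored by the steered distribution is implausible
    under the original distribution (probability below tau), steering
    strength alpha is progressively weakened by `decay`; otherwise the
    current strength is kept. tau, decay, and this exact plausibility
    score are assumptions, not the paper's published criterion.
    """
    p_orig = softmax(orig_logits)
    top = int(np.argmax(steered_logits))
    if p_orig[top] < tau:        # steered choice deemed implausible
        return alpha * decay     # progressively weaken steering
    return alpha                 # plausible: keep current strength

orig = np.array([5.0, 0.0, 0.0, 0.0])
disagree = np.array([0.0, 0.0, 0.0, 5.0])  # steering favors a rare token
agree = np.array([5.0, 0.0, 0.0, 0.0])
print(update_alpha(orig, disagree, 1.0))  # 0.5 (weakened)
print(update_alpha(orig, agree, 1.0))     # 1.0 (unchanged)
```

Comparing the steered distribution against the original at every step is what lets the method back off before oversteering degrades output quality, rather than applying a fixed strength throughout generation.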

Merits

Strength

The proposed method delivers a significant improvement in instruction-following capability, with accuracy gains of up to 6.5% over baselines. Moreover, DIRECTER achieves this without compromising generation quality or task fidelity, making it attractive for real-world applications.

Demerits

Limitation

Although the evaluation spans diverse benchmarks, it remains unclear whether the results generalize to other domains or tasks. Furthermore, the method may demand significant computational resources and implementation expertise.

Expert Commentary

DIRECTER offers a promising approach to mitigating oversteering, a known failure mode of activation steering. Its training-free design, coupling KV cache scaling with a plausibility-guided decoding loop and a one-time attention sensitivity analysis, is notable for adding adaptive control without an extra dataset. The reported gains of up to 6.5% over baselines, achieved without degrading generation quality or task fidelity, make this a valuable contribution to instruction following in LLMs. That said, generalizability beyond the evaluated benchmarks remains an open question, and the method's computational cost and implementation complexity deserve closer analysis. The dynamic, plausibility-guided control mechanism nevertheless warrants further investigation as a general safeguard compatible with existing steering baselines.

Recommendations

  • Further research is necessary to investigate the generalizability of the proposed method to other domains and tasks.
  • The proposed method should be evaluated in real-world applications to assess its effectiveness and scalability.