Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering

Purva Chiniya, Kevin Scaria, Sagar Chaturvedi

arXiv:2604.05179v1 Announce Type: new Abstract: Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over-refuse benign queries and degrade user experience. Previous work on jailbreak and prompt-injection detection, such as GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines an acceptance anchor token ("Sure") and a refusal anchor token ("Sorry"), tightening the decision boundary and significantly lowering false positives. In the mitigation stage, if a prompt is flagged, GCD pre-injects one or two refusal tokens ("Sorry, I can't...") before autoregressive decoding resumes, guaranteeing first-token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 10% vs. the strongest decoding-only baseline, adds 15-20 ms of latency on average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8x7B, and Qwen-2-7B, and requires only 20 demonstration templates.

Executive Summary

The article introduces Gradient-Controlled Decoding (GCD), a novel, training-free guardrail mechanism designed to mitigate jailbreak and prompt-injection attacks in Large Language Models (LLMs) while addressing the over-refusal problem of existing defenses. By leveraging dual-anchor steering—combining acceptance ('Sure') and refusal ('Sorry') tokens—GCD tightens the decision boundary, reducing false positives by 52% compared to GradSafe at comparable recall rates. The method guarantees first-token safety by pre-injecting refusal tokens if a prompt is flagged, ensuring deterministic harm mitigation regardless of sampling strategies. With minimal latency (15-20 ms on V100 instances) and broad model transferability (e.g., LLaMA-2-7B, Mixtral-8x7B), GCD demonstrates robust performance across ToxicChat, XSTest-v2, and AdvBench datasets, requiring only 20 demonstration templates for implementation.

Key Points

  • GCD introduces a training-free, dual-anchor steering mechanism (acceptance 'Sure' and refusal 'Sorry' tokens) to tighten the decision boundary in LLM safety guardrails, reducing false positives by 52% compared to GradSafe while maintaining comparable recall.
  • The method guarantees first-token safety by pre-injecting refusal tokens ('Sorry, I can't...') if a prompt is flagged, ensuring deterministic harm mitigation regardless of subsequent sampling strategies or decoding parameters.
  • GCD demonstrates low latency (15-20 ms on V100 instances), broad model transferability (e.g., LLaMA-2-7B, Mixtral-8x7B, Qwen-2-7B), and strong performance across multiple benchmarks (ToxicChat, XSTest-v2, AdvBench) with minimal template requirements (20 demonstrations).
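The abstract does not spell out the scoring rule, but a dual-anchor detector in the GradSafe family can be sketched as follows: pair the prompt with each anchor token, compare the resulting loss gradients (by cosine similarity) against a reference gradient direction derived from known unsafe prompts, and flag the prompt when the acceptance-anchor similarity exceeds the refusal-anchor similarity by a threshold. The function names, the margin-based rule, and the toy gradient vectors below are illustrative assumptions, not the paper's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dual_anchor_score(grad_sure, grad_sorry, unsafe_ref):
    """Hypothetical dual-anchor margin: how much more 'unsafe-like' the
    gradient looks under the acceptance anchor ("Sure") than under the
    refusal anchor ("Sorry")."""
    return cosine(grad_sure, unsafe_ref) - cosine(grad_sorry, unsafe_ref)

def flag_prompt(grad_sure, grad_sorry, unsafe_ref, tau=0.0):
    """Flag the prompt as unsafe when the margin exceeds the threshold tau."""
    return dual_anchor_score(grad_sure, grad_sorry, unsafe_ref) > tau

# Toy 2-D vectors standing in for gradients from a real backward pass.
unsafe_ref = [1.0, 0.0]
flag_prompt([0.9, 0.1], [0.1, 0.9], unsafe_ref)  # unsafe-looking: True
flag_prompt([0.1, 0.9], [0.9, 0.1], unsafe_ref)  # benign-looking: False
```

In the real method the gradients would come from a backward pass of the model's loss on the prompt paired with each anchor token; the two-dimensional vectors above only exercise the decision rule.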

Merits

Novel Dual-Anchor Mechanism

The introduction of paired acceptance and refusal anchor tokens represents a significant advancement over single-anchor systems like GradSafe, improving robustness by tightening the decision boundary and reducing false positives without sacrificing recall.

Training-Free and Lightweight

GCD operates without requiring retraining or fine-tuning of LLMs, relying instead on minimal demonstration templates (20) and gradient-controlled steering, making it computationally efficient and easy to deploy across diverse models and settings.

Deterministic First-Token Safety Guarantee

By pre-injecting refusal tokens when a prompt is flagged, GCD ensures that harmful content is blocked from the outset, addressing a critical vulnerability in existing decoding-only defenses that lack such guarantees.

Broad Applicability and Transferability

The method’s low latency and compatibility with multiple models (e.g., LLaMA-2-7B, Mixtral-8x7B, Qwen-2-7B) suggest strong potential for widespread adoption in both research and industry applications.

Demerits

Limited Generalization Beyond Tested Benchmarks

While GCD performs well on ToxicChat, XSTest-v2, and AdvBench, its effectiveness against novel or highly sophisticated attack vectors (e.g., multi-modal jailbreaks, adversarial in-context learning) remains untested and may require further validation.

Dependency on Anchor Token Selection

The efficacy of GCD relies heavily on the choice of anchor tokens ('Sure' and 'Sorry'), which may not generalize across languages, cultural contexts, or model-specific tokenization schemes, potentially limiting its universality.

Latency Concerns in Real-Time Applications

Although the reported latency (15-20 ms) is low for batch processing, real-time applications with strict latency constraints (e.g., high-frequency chatbots or streaming services) may require additional optimization to meet performance thresholds.

Ethical and Misuse Risks in Refusal Injection

Pre-injecting refusal tokens could inadvertently reinforce overly cautious behavior in LLMs, leading to user dissatisfaction or 'over-defensiveness' in benign but edge-case queries, which warrants careful tuning and monitoring.

Expert Commentary

The introduction of Gradient-Controlled Decoding (GCD) represents a notable advancement in the field of LLM safety guardrails, particularly in addressing the persistent challenges of jailbreak attacks and over-refusal. By leveraging a dual-anchor steering mechanism, GCD not only tightens the decision boundary between safe and unsafe prompts but also provides a deterministic guarantee of harm mitigation through first-token refusal injection. This approach is both elegant and pragmatic, offering a training-free solution that is computationally lightweight and broadly transferable across models. The empirical results—demonstrating a 52% reduction in false positives and up to a 10% reduction in attack success rates—are compelling and suggest that GCD could become a gold standard for safety guardrails in deployed LLMs. However, the reliance on specific anchor tokens and the potential for over-defensiveness in benign edge cases warrant careful consideration. Future work should explore the generalization of GCD to novel attack vectors and languages, as well as its integration with other safety mechanisms to create layered defense strategies. Overall, GCD sets a new benchmark for training-free, deterministic LLM safety, and its adoption could significantly enhance the robustness of AI systems in high-stakes applications.

Recommendations

  • Conduct further testing to validate GCD’s effectiveness against novel and multi-modal attack vectors, ensuring its robustness in evolving threat landscapes.
  • Expand research on anchor token selection and cultural generalization to ensure GCD’s applicability across diverse linguistic and contextual settings.
  • Integrate GCD with other safety mechanisms (e.g., post-hoc filters, adversarial training) to create a multi-layered defense strategy that balances determinism with adaptability.
  • Develop guidelines for tuning GCD’s refusal injection thresholds to minimize over-defensiveness in benign queries while maintaining high safety standards, particularly in user-facing applications.
  • Explore the ethical implications of pre-injected refusal tokens, including their potential to reinforce overly cautious behavior in LLMs, and establish best practices for monitoring and mitigating unintended consequences.

Sources

Original: arXiv - cs.CL