Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Xiangwen Wang, Ananth Balashankar, Varun Chandrasekaran

arXiv:2603.11149v1 Announce Type: new Abstract: Large language models remain vulnerable to jailbreak attacks, yet we still lack a systematic understanding of how jailbreak success scales with attacker effort across methods, model families, and harm types. We initiate a scaling-law framework for jailbreaks by treating each attack as a compute-bounded optimization procedure and measuring progress on a shared FLOPs axis. Our systematic evaluation spans four representative jailbreak paradigms, covering optimization-based attacks, self-refinement prompting, sampling-based selection, and genetic optimization, across multiple model families and scales on a diverse set of harmful goals. We investigate scaling laws that relate attacker budget to attack success score by fitting a simple saturating exponential function to FLOPs--success trajectories, and we derive comparable efficiency summaries from the fitted curves. Empirically, prompting-based paradigms tend to be the most compute-efficient compared to optimization-based methods. To explain this gap, we cast prompt-based updates into an optimization view and show via a same-state comparison that prompt-based attacks more effectively optimize in prompt space. We also show that attacks occupy distinct success--stealthiness operating points with prompting-based methods occupying the high-success, high-stealth region. Finally, we find that vulnerability is strongly goal-dependent: harms involving misinformation are typically easier to elicit than other non-misinformation harms.

Executive Summary

This paper presents a systematic scaling analysis of jailbreak attacks on large language models, framing each attack as a compute-bounded optimization procedure measured on a shared FLOPs axis. The authors compare four representative jailbreak paradigms across multiple model families and scales, finding that prompting-based methods are the most compute-efficient. They also show that attacks occupy distinct success-stealthiness operating points, with prompting-based methods occupying the high-success, high-stealth region. Finally, the study finds that vulnerability is goal-dependent, with misinformation harms typically easier to elicit than non-misinformation harms. This research contributes to a deeper understanding of jailbreak attacks and their mitigation.
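The abstract describes fitting "a simple saturating exponential function" to FLOPs-success trajectories and deriving efficiency summaries from the fitted curves. The paper does not spell out the exact functional form, so the sketch below is a hedged illustration of one plausible choice, success(C) = s_max * (1 - exp(-C / tau)), fit to synthetic data standing in for one attack's measured trajectory; the variable names and the "compute to reach 90% of the asymptote" summary are assumptions for illustration, not the authors' definitions.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical saturating-exponential form for a FLOPs-success curve:
# success(C) = s_max * (1 - exp(-C / tau)), with s_max the asymptotic
# success score and tau a characteristic compute scale, fit per attack.
def saturating_exp(flops, s_max, tau):
    return s_max * (1.0 - np.exp(-flops / tau))

# Synthetic trajectory standing in for one attack's measured data points.
rng = np.random.default_rng(0)
flops = np.logspace(12, 16, 20)
success = saturating_exp(flops, 0.8, 1e14) + rng.normal(0, 0.02, 20)

(s_max_hat, tau_hat), _ = curve_fit(
    saturating_exp, flops, success, p0=[0.5, 1e13], maxfev=10000
)

# One possible efficiency summary derived from the fitted curve:
# the compute needed to reach 90% of the asymptotic success score,
# i.e. C such that 1 - exp(-C / tau) = 0.9, so C = tau * ln(10).
flops_to_90 = tau_hat * np.log(10.0)
print(f"fitted s_max={s_max_hat:.2f}, tau={tau_hat:.2e}, C90={flops_to_90:.2e}")
```

With such fits in hand, attacks can be compared on a shared axis: a smaller tau (or C90) indicates a more compute-efficient paradigm, independent of its asymptotic success level.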

Key Points

  • The authors propose a scaling-law framework that relates attacker compute budget (FLOPs) to attack success
  • Prompting-based methods are the most compute-efficient among the compared paradigms
  • Attacks occupy distinct success-stealthiness operating points, with prompting-based methods in the high-success, high-stealth region
  • Vulnerability is goal-dependent: misinformation harms are easier to elicit than other harm types

Merits

Strength

The study provides a comprehensive and systematic evaluation of jailbreak attacks, covering multiple model families and scales. The proposed scaling-law framework offers a unified approach to evaluating attack success.

Demerits

Limitation

The study focuses on a limited set of jailbreak paradigms, which may not be representative of all possible attack methods. Additionally, the evaluation of model families and scales may not be exhaustive.

Expert Commentary

The study's systematic approach to evaluating jailbreak attacks provides valuable insights into the vulnerabilities of large language models, and the proposed scaling-law framework offers a promising direction for future research. However, the study's limitations should be acknowledged: further research is needed to evaluate jailbreak paradigms beyond the four studied here. The findings have significant implications for both practical defenses and policy development.

Recommendations

  • Future research should aim to expand the evaluation of jailbreak paradigms to include a broader range of methods and models.
  • Developing more robust defenses against jailbreak attacks requires a deeper understanding of the underlying vulnerabilities; the proposed scaling-law framework offers a valuable starting point for that work.