Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover
arXiv:2603.11331v1 | Announce Type: new
Abstract: Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can …
Indranil Halder, Annesya Banerjee, Cengiz Pehlevan