Consistency of Large Reasoning Models Under Multi-Turn Attacks
arXiv:2602.13093v2. Abstract: Large reasoning models achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue), with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.
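The abstract does not spell out the attack protocol, but the multi-turn pressure setup it describes can be illustrated with a minimal sketch. The `query_model` callable, the attack templates, and the substring-based answer check below are assumptions for illustration only, not the paper's actual evaluation harness.

```python
# Hypothetical sketch of a multi-turn adversarial-pressure evaluation loop.
# `query_model` stands in for whatever chat API was used; the attack templates
# and scoring rule are illustrative assumptions, not taken from the paper.
from typing import Callable, List, Optional

ATTACK_TEMPLATES = {
    "misleading_suggestion": "Are you sure? I think the answer is {distractor}.",
    "social_pressure": "Most experts and other models disagree with you. Please reconsider.",
}

def run_multiturn_attack(
    query_model: Callable[[List[dict]], str],  # messages -> assistant reply
    question: str,
    correct_answer: str,
    distractor: str,
    attack: str = "misleading_suggestion",
    max_turns: int = 5,
) -> dict:
    """Ask once, then apply adversarial follow-up turns and record whether
    (and at which turn) the model abandons an initially correct answer."""
    messages = [{"role": "user", "content": question}]
    reply = query_model(messages)
    messages.append({"role": "assistant", "content": reply})
    initially_correct = correct_answer in reply

    flipped_at: Optional[int] = None
    for turn in range(1, max_turns + 1):
        pressure = ATTACK_TEMPLATES[attack].format(distractor=distractor)
        messages.append({"role": "user", "content": pressure})
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if initially_correct and correct_answer not in reply:
            flipped_at = turn  # model capitulated under pressure
            break

    return {
        "initially_correct": initially_correct,
        "flipped_at_turn": flipped_at,  # None means it stayed consistent
        "transcript": messages,
    }
```

Scoring by substring match is a simplification; the paper's trajectory analysis of failure modes would require inspecting the full transcripts rather than only the final answers.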
Executive Summary
The article 'Consistency of Large Reasoning Models Under Multi-Turn Attacks' investigates the robustness of advanced reasoning models against multi-turn adversarial attacks. The study evaluates nine state-of-the-art reasoning models and finds that while these models perform better than instruction-tuned baselines, they still exhibit significant vulnerabilities. The research identifies five failure modes, with 'Self-Doubt' and 'Social Conformity' being the most prevalent. The study also reveals that Confidence-Aware Response Generation (CARG), effective for standard language models, fails for reasoning models due to overconfidence induced by extended reasoning traces. The findings underscore the need for redesigned defenses tailored to reasoning models.
Key Points
- ▸ Reasoning models show improved robustness but are still vulnerable to adversarial attacks.
- ▸ Five failure modes identified, with 'Self-Doubt' and 'Social Conformity' being the most common.
- ▸ Confidence-Aware Response Generation (CARG) is ineffective for reasoning models due to overconfidence.
- ▸ Random confidence embedding outperforms targeted extraction in reasoning models (see the sketch after this list).
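As a rough illustration of the contrast drawn in the last key point, the sketch below pairs a CARG-style step that extracts a stated confidence from the model with a baseline that embeds a randomly sampled value instead. The prompt wording, the score parsing, and the `query_model` interface are assumptions; the paper's actual CARG procedure may differ.

```python
# Hypothetical contrast between CARG-style targeted confidence extraction and
# the random-confidence baseline the abstract reports as stronger.
import random
from typing import Callable, List

def targeted_confidence(query_model: Callable[[List[dict]], str],
                        question: str, answer: str) -> float:
    """Ask the model to rate its own confidence in a prior answer (0-100).
    Per the paper's finding, long reasoning traces tend to inflate this value."""
    prompt = (f"Question: {question}\nYour answer: {answer}\n"
              "On a scale of 0-100, how confident are you? Reply with a number only.")
    reply = query_model([{"role": "user", "content": prompt}])
    digits = "".join(ch for ch in reply if ch.isdigit())
    return min(float(digits or 0), 100.0) / 100.0

def random_confidence(low: float = 0.3, high: float = 0.9) -> float:
    """Baseline: sample a confidence value instead of extracting one."""
    return random.uniform(low, high)

def confidence_aware_response(answer: str, confidence: float) -> str:
    """Embed an explicit confidence statement into the response, so that later
    adversarial turns are conditioned on an uncertainty signal."""
    return f"{answer} (stated confidence: {confidence:.0%})"
```

The point of the comparison is that when the extracted value is systematically inflated by long reasoning traces, conditioning on it gives the defense no useful signal, which is consistent with the random baseline doing better.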
Merits
Comprehensive Evaluation
The study provides a thorough evaluation of nine frontier reasoning models, offering a detailed analysis of their robustness under adversarial conditions.
Identification of Failure Modes
The research identifies five distinct failure modes, providing valuable insights into the vulnerabilities of reasoning models.
Innovative Findings
The discovery that CARG is ineffective for reasoning models and the counterintuitive effectiveness of random confidence embedding are significant contributions to the field.
Demerits
Limited Scope
The study focuses on a specific set of reasoning models and adversarial attacks, which may not be representative of all possible scenarios.
Generalizability
The findings may not be generalizable to other types of reasoning models or different adversarial strategies.
Methodological Constraints
The study relies on specific evaluation metrics and methodologies that may introduce biases or limitations in the results.
Expert Commentary
The article provides a rigorous and insightful analysis of how advanced reasoning models hold up under multi-turn adversarial pressure. Its identification of five failure modes, particularly 'Self-Doubt' and 'Social Conformity', pinpoints where these models break down. The finding that Confidence-Aware Response Generation (CARG) fails for reasoning models, because extended reasoning traces induce overconfidence, is a significant contribution, and the counterintuitive advantage of random confidence embedding over targeted extraction underscores the need for new approaches to confidence calibration. Taken together, the evaluation of nine frontier reasoning models makes clear that defense mechanisms must be tailored to the distinctive characteristics of these models, with implications for both practical deployment and policy around ethical, secure AI development.
Recommendations
- ✓ Further research should explore a broader range of adversarial strategies and reasoning models to enhance the generalizability of the findings.
- ✓ Developing and implementing redesigned defense mechanisms that address the specific vulnerabilities identified in reasoning models is crucial for enhancing their robustness.