Consistency of Large Reasoning Models Under Multi-Turn Attacks
arXiv:2602.13093v2. Abstract: Large reasoning models achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue), with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.
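The abstract does not spell out the attack protocol, but the multi-turn pressure setup it describes can be illustrated with a minimal sketch. The `query_model` callable, the attack templates, and the substring-based answer check below are assumptions for illustration only, not the paper's actual evaluation harness.

```python
# Hypothetical sketch of a multi-turn adversarial-pressure evaluation loop.
# `query_model` stands in for whatever chat API was used; the attack templates
# and scoring rule are illustrative assumptions, not taken from the paper.
from typing import Callable, List, Optional

ATTACK_TEMPLATES = {
    "misleading_suggestion": "Are you sure? I think the answer is {distractor}.",
    "social_pressure": "Most experts and other models disagree with you. Please reconsider.",
}

def run_multiturn_attack(
    query_model: Callable[[List[dict]], str],  # messages -> assistant reply
    question: str,
    correct_answer: str,
    distractor: str,
    attack: str = "misleading_suggestion",
    max_turns: int = 5,
) -> dict:
    """Ask once, then apply adversarial follow-up turns and record whether
    (and at which turn) the model abandons an initially correct answer."""
    messages = [{"role": "user", "content": question}]
    reply = query_model(messages)
    messages.append({"role": "assistant", "content": reply})
    initially_correct = correct_answer in reply

    flipped_at: Optional[int] = None
    for turn in range(1, max_turns + 1):
        pressure = ATTACK_TEMPLATES[attack].format(distractor=distractor)
        messages.append({"role": "user", "content": pressure})
        reply = query_model(messages)
        messages.append({"role": "assistant", "content": reply})
        if initially_correct and correct_answer not in reply:
            flipped_at = turn  # model capitulated under pressure
            break

    return {
        "initially_correct": initially_correct,
        "flipped_at_turn": flipped_at,  # None means it stayed consistent
        "transcript": messages,
    }
```

Scoring by substring match is a simplification; the paper's trajectory analysis of failure modes would require inspecting the full transcripts rather than only the final answers.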
Executive Summary
The article 'Consistency of Large Reasoning Models Under Multi-Turn Attacks' investigates the robustness of advanced reasoning models against multi-turn adversarial attacks. The study evaluates nine state-of-the-art reasoning models and finds that while these models perform better than instruction-tuned baselines, they still exhibit significant vulnerabilities. The research identifies five failure modes, with 'Self-Doubt' and 'Social Conformity' being the most prevalent. The study also reveals that Confidence-Aware Response Generation (CARG), effective for standard language models, fails for reasoning models due to overconfidence induced by extended reasoning traces. The findings underscore the need for redesigned defenses tailored to reasoning models.
Key Points
- ▸ Reasoning models show improved robustness but are still vulnerable to adversarial attacks.
- ▸ Five failure modes identified, with 'Self-Doubt' and 'Social Conformity' being the most common.
- ▸ Confidence-Aware Response Generation (CARG) is ineffective for reasoning models due to overconfidence.
- ▸ Random confidence embedding outperforms targeted extraction in reasoning models (see the sketch after this list).
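As a rough illustration of the contrast drawn in the last key point, the sketch below pairs a CARG-style step that extracts a stated confidence from the model with a baseline that embeds a randomly sampled value instead. The prompt wording, the score parsing, and the `query_model` interface are assumptions; the paper's actual CARG procedure may differ.

```python
# Hypothetical contrast between CARG-style targeted confidence extraction and
# the random-confidence baseline the abstract reports as stronger.
import random
from typing import Callable, List

def targeted_confidence(query_model: Callable[[List[dict]], str],
                        question: str, answer: str) -> float:
    """Ask the model to rate its own confidence in a prior answer (0-100).
    Per the paper's finding, long reasoning traces tend to inflate this value."""
    prompt = (f"Question: {question}\nYour answer: {answer}\n"
              "On a scale of 0-100, how confident are you? Reply with a number only.")
    reply = query_model([{"role": "user", "content": prompt}])
    digits = "".join(ch for ch in reply if ch.isdigit())
    return min(float(digits or 0), 100.0) / 100.0

def random_confidence(low: float = 0.3, high: float = 0.9) -> float:
    """Baseline: sample a confidence value instead of extracting one."""
    return random.uniform(low, high)

def confidence_aware_response(answer: str, confidence: float) -> str:
    """Embed an explicit confidence statement into the response, so that later
    adversarial turns are conditioned on an uncertainty signal."""
    return f"{answer} (stated confidence: {confidence:.0%})"
```

The point of the comparison is that when the extracted value is systematically inflated by long reasoning traces, conditioning on it gives the defense no useful signal, which is consistent with the random baseline doing better.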
Merits
Comprehensive Evaluation
The study provides a thorough evaluation of nine frontier reasoning models, offering a detailed analysis of their robustness under adversarial conditions.
Identification of Failure Modes
The research identifies five distinct failure modes, providing valuable insights into the vulnerabilities of reasoning models.
Innovative Findings
The discovery that CARG is ineffective for reasoning models and the counterintuitive effectiveness of random confidence embedding are significant contributions to the field.
Demerits
Limited Scope
The study focuses on a specific set of reasoning models and adversarial attacks, which may not be representative of all possible scenarios.
Generalizability
The findings may not be generalizable to other types of reasoning models or different adversarial strategies.
Methodological Constraints
The study relies on specific evaluation metrics and methodologies that may introduce biases or limitations in the results.
Expert Commentary
The article provides a rigorous and insightful analysis of how advanced reasoning models hold up under multi-turn adversarial pressure. Its identification of five failure modes, particularly 'Self-Doubt' and 'Social Conformity', pinpoints where these models break down. The finding that Confidence-Aware Response Generation (CARG) fails for reasoning models, because extended reasoning traces induce overconfidence, is a significant contribution, and the counterintuitive advantage of random confidence embedding over targeted extraction underscores the need for new approaches to confidence calibration. Taken together, the evaluation of nine frontier reasoning models makes clear that defense mechanisms must be tailored to the distinctive characteristics of these models, with implications for both practical deployment and policy around ethical, secure AI development.
Recommendations
- ✓ Further research should explore a broader range of adversarial strategies and reasoning models to enhance the generalizability of the findings.
- ✓ Developing and implementing redesigned defense mechanisms that address the specific vulnerabilities identified in reasoning models is crucial for enhancing their robustness.