Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
arXiv:2603.03332v1 Announce Type: new Abstract: Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of five CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count (3B to 1.5T; parameter counts of closed models are assumed), testing their ability to complete mathematical reasoning tasks despite perturbations injected at different points in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (20-30% loss even for the largest models); ExtraSteps incur minimal accuracy degradation (0-6%) regardless of scale; Sycophancy produces modest effects (7% loss for small models); and SkippedSteps cause intermediate damage (15% loss). Scaling relationships follow power-law patterns, with model size serving as a protective factor against some perturbations but offering limited defense against dimensional reasoning tasks. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available at https://github.com/Mystic-Slice/CoTPerturbation.
Executive Summary
This study examines the robustness of Large Language Models (LLMs) to Chain-of-Thought (CoT) perturbations, organized as a structured taxonomy of five perturbation types. The authors present a comprehensive empirical evaluation of 13 LLMs spanning three orders of magnitude in parameter count, testing their ability to complete mathematical reasoning tasks under different perturbation scenarios. The study reveals heterogeneous vulnerability patterns across models and perturbation types, with MathError and UnitConversion being the most challenging. The findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The availability of code and results enables replication and extension of the research.
Key Points
- ▸ The study evaluates 13 LLMs with varying parameter counts, spanning three orders of magnitude.
- ▸ A structured taxonomy of five CoT perturbation types is introduced, including MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps.
- ▸ Heterogeneous vulnerability patterns are observed across models and perturbation types.
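The taxonomy above can be pictured as a small injection harness that corrupts a reasoning chain before handing it back to a model for completion. The sketch below is illustrative only: the function names and the digit-substitution heuristic are assumptions for the sake of example, not the authors' actual implementation (see the linked repository for that).

```python
import re
from enum import Enum

class Perturbation(Enum):
    """The five perturbation types in the paper's taxonomy."""
    MATH_ERROR = "MathError"
    UNIT_CONVERSION = "UnitConversion"
    SYCOPHANCY = "Sycophancy"
    SKIPPED_STEPS = "SkippedSteps"
    EXTRA_STEPS = "ExtraSteps"

def inject_math_error(steps, index, delta=1):
    """Corrupt the first integer in the chosen step by adding `delta`,
    leaving every other step untouched (a hypothetical MathError)."""
    perturbed = list(steps)
    perturbed[index] = re.sub(
        r"\d+", lambda m: str(int(m.group(0)) + delta), perturbed[index], count=1
    )
    return perturbed

def inject_skipped_steps(steps, index):
    """Silently drop one intermediate step (a hypothetical SkippedSteps)."""
    return steps[:index] + steps[index + 1:]

chain = [
    "3 apples + 4 apples = 7 apples",
    "7 apples * 2 = 14 apples",
]
print(inject_math_error(chain, 0))    # first '3' becomes '4'
print(inject_skipped_steps(chain, 0)) # only the second step remains
```

Injecting at different values of `index` corresponds to the paper's evaluation of perturbations placed at different points in the reasoning chain.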
Merits
Strength in Methodology
The study employs a comprehensive and structured approach to evaluating LLM robustness, using a taxonomy of perturbation types and a large set of models with varying parameter counts.
Strength in Findings
The study reveals heterogeneous vulnerability patterns across models and perturbation types, providing valuable insights into LLM robustness and its implications for deployment.
Demerits
Limitation in Generalizability
The study focuses on mathematical reasoning tasks and may not be generalizable to other domains or tasks.
Limitation in Model Selection
The study evaluates a limited set of 13 models, which may not be representative of the broader range of models available.
Expert Commentary
This study makes a significant contribution to the understanding of LLM robustness and its implications for deployment. The findings highlight the importance of task-specific robustness assessments and mitigation strategies, which is crucial for ensuring the reliability and trustworthiness of LLMs in various applications. However, the limited generalizability and model selection may narrow the broader applicability of the results. Nevertheless, the study offers a useful empirical baseline for the development and deployment of LLMs across domains.
Recommendations
- ✓ Future studies should explore the generalizability of the study's findings to other domains and tasks.
- ✓ The development of more robust and reliable LLMs should incorporate task-specific robustness assessments and mitigation strategies.