RAGEN-2: Reasoning Collapse in Agentic RL

arXiv:2604.06268v1. Abstract: RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.

Executive Summary

RAGEN-2 introduces a critical diagnostic for reasoning collapse in multi-turn LLM agents trained with Reinforcement Learning (RL). The authors identify 'template collapse,' a failure mode where models produce seemingly diverse but input-agnostic responses, undetected by conventional entropy metrics. They propose decomposing reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), demonstrating MI's superior correlation with task performance. The paper attributes template collapse to a low signal-to-noise ratio (SNR) mechanism, where weak task gradients allow regularization to erase input-specific reasoning. To mitigate this, SNR-Aware Filtering, using reward variance as a proxy, is introduced to select high-signal prompts, consistently improving input dependence and task performance across various domains.

Key Points

  • Traditional entropy metrics are insufficient to detect 'template collapse' in LLM agents, where responses appear diverse but lack input specificity.
  • RAGEN-2 decomposes reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI).
  • Mutual Information (MI) significantly outperforms entropy as a proxy for reasoning quality, correlating more strongly with final task performance.
  • Template collapse is explained by a low signal-to-noise ratio (SNR) mechanism, where weak task gradients are overwhelmed by regularization, leading to input-agnostic reasoning.
  • SNR-Aware Filtering, using reward variance to select high-signal prompts, effectively addresses template collapse and improves both input dependence and task performance.
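The entropy/MI distinction above can be made concrete with a small sketch. The snippet below treats each rollout as a pair of (input id, discretized reasoning-template id) and computes both quantities from counts; the paper's actual online MI proxies are not specified here, so the discretization into template ids is an illustrative assumption, not the authors' estimator.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (nats) of a count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def diagnose(samples):
    """samples: list of (input_id, template_id) pairs from rollouts.

    Returns (H(Y|X), I(X;Y)) where Y is the discretized reasoning
    template and X the input. H(Y|X) is within-input diversity
    (what entropy tracking sees); I(X;Y) = H(Y) - H(Y|X) is
    cross-input distinguishability.
    """
    by_input = {}
    for x, y in samples:
        by_input.setdefault(x, Counter())[y] += 1
    n = len(samples)
    # H(Y|X): weighted average of per-input template entropies
    h_y_given_x = sum(
        (sum(c.values()) / n) * entropy(c) for c in by_input.values()
    )
    # H(Y): entropy of the marginal template distribution
    h_y = entropy(Counter(y for _, y in samples))
    return h_y_given_x, h_y - h_y_given_x
```

Under template collapse, every input draws from the same template mix: H(Y|X) stays high (entropy looks healthy) while I(X;Y) drops to zero, which is exactly the failure entropy alone cannot see.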

Merits

Novel Diagnostic Metric

The introduction of Mutual Information (MI) as a diagnostic for 'cross-input distinguishability' addresses a critical blind spot in existing evaluation methodologies for LLM agents.
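The decomposition matches a standard information-theoretic identity (a sketch; X denotes the input, Y the reasoning trace, and the paper's online proxies may estimate these quantities differently):

```latex
H(Y) \;=\; \underbrace{H(Y \mid X)}_{\text{within-input diversity (entropy)}} \;+\; \underbrace{I(X;Y)}_{\text{cross-input distinguishability (MI)}}
```

A model in template collapse keeps H(Y|X) high while I(X;Y) approaches zero: traces look varied, but the distribution over traces is nearly independent of the input.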

Robust Empirical Validation

The consistent performance improvements and strong correlations across diverse tasks (planning, math, web navigation, code execution) lend significant credibility to the proposed methods and insights.

Mechanistic Explanation

The SNR mechanism provides a compelling theoretical framework for understanding template collapse, moving beyond mere observation to offer a causal explanation.
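The intuition admits a minimal formalization using a generic baselined policy gradient with a regularization term (notation ours, not the paper's):

```latex
g \;=\; \mathbb{E}\big[(R - \bar{R})\,\nabla_\theta \log \pi_\theta(y \mid x)\big] \;-\; \lambda\,\nabla_\theta \Omega(\theta)
```

When reward variance for a prompt collapses, R − R̄ vanishes and the task term goes to zero, so the regularizer Ω (e.g., a KL or entropy penalty) dominates the update and can smooth away input-specific reasoning, which is the collapse mechanism the abstract describes.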

Practical Intervention

SNR-Aware Filtering offers a lightweight and effective method for practitioners to improve the reasoning quality and task performance of their RL-trained LLM agents.
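The core selection step can be sketched in a few lines. The function below ranks prompts by reward variance across rollouts and keeps the top fraction per iteration; `keep_fraction` and the exact selection rule are illustrative assumptions, not the paper's published settings.

```python
import statistics

def snr_aware_filter(rollout_rewards, keep_fraction=0.5):
    """rollout_rewards: dict mapping prompt_id -> list of per-rollout rewards.

    Keeps the prompts with the highest reward variance, a lightweight
    proxy for gradient signal strength: prompts whose rollouts all get
    the same reward contribute near-zero task gradient.
    """
    variances = {
        p: statistics.pvariance(rs) if len(rs) > 1 else 0.0
        for p, rs in rollout_rewards.items()
    }
    k = max(1, int(len(variances) * keep_fraction))
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:k]
```

A prompt where every rollout succeeds (or every rollout fails) is filtered out, concentrating each training iteration on prompts that actually carry gradient signal.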

Demerits

Computational Cost of MI

While proxies are introduced, calculating true Mutual Information can be computationally intensive, potentially limiting its scalability for very large models or datasets without efficient approximations.

Generality of SNR Mechanism

While compelling, the extent to which the SNR mechanism fully explains all forms of reasoning collapse or instability in diverse RL environments warrants further investigation beyond the specific tasks explored.

Reliance on Reward Variance

Using reward variance as a proxy for 'signal' is intuitive but might not always perfectly capture the true information content or gradient quality in all complex reward landscapes.

Expert Commentary

RAGEN-2 provides an insightful analysis of a subtle yet critical pathology in the current paradigm of RL-trained LLM agents. The identification of 'template collapse' is more than an incremental improvement in diagnostic capability: it exposes a fundamental limitation of entropy as a sole measure of reasoning stability. The decomposition of reasoning quality into within-input diversity and cross-input distinguishability, with Mutual Information as the crucial missing piece, is elegantly conceived and empirically supported. The SNR mechanism offers a compelling theoretical underpinning, explaining *why* such collapse occurs rather than merely documenting that it does. This work is likely to become a standard reference for anyone developing or deploying agentic LLMs, shifting the focus from superficial output diversity to genuine input-dependent reasoning. Its implications extend beyond performance metrics to the trustworthiness and interpretability of advanced AI systems. The proposed intervention, SNR-Aware Filtering, is both practical and theoretically grounded, pairing diagnostic rigor with an actionable solution.

Recommendations

  • Future research should explore the application of MI-based diagnostics to other forms of AI model instability, such as adversarial robustness or out-of-distribution generalization.
  • Investigate more computationally efficient approximations for Mutual Information in very high-dimensional latent spaces to enhance scalability for large-scale production systems.
  • Develop standardized benchmarks and evaluation suites that explicitly incorporate metrics for cross-input distinguishability to foster more robust and reliable agentic AI development.
  • Explore the interplay between different regularization techniques and the SNR mechanism to design training regimes that intrinsically promote input-dependent reasoning and mitigate template collapse.

Sources

Original: arXiv - cs.LG