Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction
arXiv:2603.04783v1 Announce Type: new Abstract: While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We identify the root cause as Contextual Inertia: a phenomenon where models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce Reinforcement Learning with Single-Turn Anchors (RLSTA), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model's superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications.
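The abstract's core idea, scoring a multi-turn rollout against the model's own answer to a single prompt that contains all the information, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `extract_final_answer` heuristic and the exact-match reward are assumptions standing in for whatever answer extraction and comparison the authors actually use.

```python
# Hypothetical sketch of a single-turn-anchor reward. The "anchor" is the
# model's answer to a single-turn prompt containing all constraints up front;
# a multi-turn rollout (constraints revealed incrementally) is rewarded for
# reaching the same final answer.

def extract_final_answer(response: str) -> str:
    """Toy answer extractor: take the last non-empty line, normalized."""
    lines = [ln.strip() for ln in response.splitlines() if ln.strip()]
    return lines[-1].lower() if lines else ""

def anchor_reward(multi_turn_response: str, single_turn_anchor: str) -> float:
    """1.0 when the multi-turn answer matches the anchor, else 0.0.
    A real system might instead use a verifier or a soft similarity score."""
    return float(extract_final_answer(multi_turn_response)
                 == extract_final_answer(single_turn_anchor))

# The anchor comes from a single-turn, full-information prompt; the rollout
# saw the same constraints spread across turns.
anchor = "Combining both constraints, x must be even and below 10.\nx = 8"
rollout = "Updating with the new constraint from turn 2:\nx = 8"
print(anchor_reward(rollout, anchor))  # → 1.0
```

A rollout that ignores the turn-2 correction and sticks with its earlier answer would score 0.0, which is exactly the behavior the reward is meant to penalize.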
Executive Summary
This article proposes a novel training approach, Reinforcement Learning with Single-Turn Anchors (RLSTA), to address the issue of Contextual Inertia in Large Language Models (LLMs) during multi-turn interactions. Contextual Inertia refers to the phenomenon where LLMs rigidly adhere to previous reasoning traces, ignoring new information or corrections. RLSTA leverages the model's single-turn capabilities as stable internal anchors to provide reward signals, enabling the model to break Contextual Inertia and self-calibrate its reasoning based on the latest information. Experiments demonstrate that RLSTA outperforms standard fine-tuning and abstention-based methods, showing strong cross-domain generalization (e.g., math to code) and remaining effective without external verifiers. These results have significant implications for building more robust and adaptive LLMs.
Key Points
- ▸ Contextual Inertia: a phenomenon where LLMs rigidly adhere to previous reasoning traces
- ▸ RLSTA: a novel training approach to address Contextual Inertia
- ▸ Single-turn anchors as reward signals for multi-turn interactions
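To make the third point concrete, here is one plausible way anchor-derived rewards could feed a policy update, using a group-normalized advantage in the style of GRPO-like RL methods for LLMs. The paper does not specify its objective in this summary, so the normalization choice and the toy reward values below are assumptions for illustration only.

```python
# Hypothetical sketch: anchor-match rewards (1.0 = multi-turn answer agrees
# with the single-turn anchor, 0.0 = it does not) turned into per-rollout
# advantages by normalizing within a group of sampled rollouts.

def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize a group of rollout rewards to zero mean, unit std."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Four multi-turn rollouts scored against the same single-turn anchor:
rewards = [1.0, 0.0, 1.0, 1.0]
advs = group_advantages(rewards)
# Rollouts that agree with the anchor receive positive advantage; the one
# that clung to its earlier reasoning receives a negative one, so the
# policy gradient pushes the model toward integrating the latest turn.
```

The key design point is that the reward signal is internal: it comes from the model's own single-turn behavior rather than a human label or external verifier, which is what lets the approach generalize across domains.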
Merits
Strength
RLSTA demonstrates strong cross-domain generalization and outperforms standard fine-tuning and abstention-based methods.
Demerits
Limitation
The method relies on the model's single-turn capabilities being strong enough to serve as reliable anchors; in domains or scenarios where single-turn performance is itself weak, the reward signal may degrade accordingly.
Expert Commentary
The article presents a thought-provoking solution to the issue of Contextual Inertia in LLMs. The proposed RLSTA approach is well-designed and demonstrates impressive results. However, further research is needed to explore the limitations and potential biases of this method. Additionally, the article's focus on single-turn anchors as reward signals raises interesting questions about the role of human feedback and the importance of domain-specific knowledge in training LLMs. Overall, this article is a significant contribution to the field of natural language processing and has the potential to shape the future of LLM development.
Recommendations
- ✓ Further research is needed to explore the limitations and potential biases of RLSTA
- ✓ Investigation of the role of human feedback and domain-specific knowledge in training LLMs using RLSTA