Not All Turns Are Equally Hard: Adaptive Thinking Budgets For Efficient Multi-Turn Reasoning
arXiv:2604.05164v1 Announce Type: new Abstract: As LLM reasoning performance plateaus, improving inference-time compute efficiency is crucial to mitigate overthinking and long thinking traces, even for simple queries. Prior approaches, including length regularization, adaptive routing, and difficulty-based budget allocation, primarily focus on single-turn settings and fail to address the sequential dependencies inherent in multi-turn reasoning. In this work, we formulate multi-turn reasoning as a sequential compute allocation problem and model it as a multi-objective Markov Decision Process. We propose TAB: Turn-Adaptive Budgets, a budget allocation policy trained via Group Relative Policy Optimization (GRPO) that learns to maximize task accuracy while respecting global per-problem token constraints. Consequently, TAB takes the conversation history as input and learns to adaptively allocate smaller budgets to easier turns, saving an appropriate number of tokens for the crucial harder reasoning steps. Our experiments on mathematical reasoning benchmarks demonstrate that TAB achieves a superior accuracy-token tradeoff, saving up to 35% of tokens while maintaining accuracy, over static and off-the-shelf LLM budget baselines. Further, for systems where a plan of all sub-questions is available a priori, we propose TAB All-SubQ, a budget allocation policy that budgets tokens based on the conversation history and all past and future sub-questions, saving up to 40% of tokens over baselines.
Executive Summary
This paper introduces a novel framework, Turn-Adaptive Budgets (TAB), to optimize inference-time compute efficiency in multi-turn reasoning tasks by modeling the problem as a multi-objective Markov Decision Process (MDP). Unlike prior approaches that focus on single-turn settings, TAB dynamically allocates computational resources across conversation turns, reserving tokens for harder reasoning steps while minimizing overthinking in simpler turns. Trained via Group Relative Policy Optimization (GRPO), TAB achieves a superior accuracy-token tradeoff, reducing token usage by up to 35% while maintaining accuracy on mathematical reasoning benchmarks. For scenarios with prior knowledge of sub-questions, TAB All-SubQ further improves efficiency, saving up to 40% of tokens. The work addresses a critical gap in multi-turn reasoning by incorporating sequential dependencies and adaptive compute allocation, offering a scalable solution for enhancing efficiency in large language model (LLM) systems.
Key Points
- ▸ Introduces TAB, a dynamic budget allocation policy for multi-turn reasoning that adapts token allocation based on conversation history and task difficulty.
- ▸ Formulates the problem as a multi-objective MDP, enabling the model to optimize both accuracy and token efficiency under global constraints.
- ▸ Demonstrates significant improvements in token savings (35-40%) over static and off-the-shelf baselines while maintaining or improving task accuracy on mathematical reasoning benchmarks.
- ▸ Proposes TAB All-SubQ, a variant that leverages prior knowledge of sub-questions to further enhance efficiency, achieving up to 40% token savings.
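To make the sequential allocation problem concrete, the following minimal sketch shows one way a turn-level budgeter could spend a global token budget across turns of varying difficulty. This is illustrative only, not the paper's learned policy: `allocate_budget`, the difficulty scores, and the proportional-share heuristic are all hypothetical stand-ins for what TAB learns via GRPO.

```python
# Hypothetical sketch of turn-level budget allocation under a global
# token constraint. The heuristic (share proportional to estimated
# difficulty) is illustrative; TAB learns this mapping instead.

def allocate_budget(difficulty: float, remaining_tokens: int,
                    remaining_turns: int, min_tokens: int = 64) -> int:
    """Give a turn a share of the remaining budget scaled by its
    estimated difficulty in [0, 1], never dropping below a small floor."""
    if remaining_turns <= 0:
        return 0
    fair_share = remaining_tokens / remaining_turns
    # Easy turns (difficulty near 0) get half the fair share;
    # hard turns (difficulty near 1) get up to twice the fair share.
    budget = int(fair_share * (0.5 + 1.5 * difficulty))
    return max(min_tokens, min(budget, remaining_tokens))

# Simulate a 4-turn conversation with a 2000-token global budget.
difficulties = [0.1, 0.2, 0.9, 0.4]   # per-turn difficulty estimates
remaining = 2000
for turn, d in enumerate(difficulties):
    b = allocate_budget(d, remaining, len(difficulties) - turn)
    remaining -= b
    print(f"turn {turn}: difficulty={d:.1f} budget={b} remaining={remaining}")
```

Note how the easy opening turns receive less than an even split, leaving most of the budget for the hard third turn, which is the qualitative behavior the paper reports for TAB.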
Merits
Innovative Problem Framing
The paper effectively frames multi-turn reasoning as a sequential compute allocation problem, addressing a critical gap in existing literature that predominantly focuses on single-turn settings.
Methodological Rigor
The use of a multi-objective MDP and GRPO for training the TAB policy demonstrates a sophisticated and theoretically grounded approach to optimizing inference-time compute efficiency.
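To illustrate the core mechanism of GRPO in general terms (a sketch of the standard algorithm, not the paper's training code): for each problem, a group of rollouts is sampled, and each rollout's advantage is its reward standardized against the group's mean and standard deviation, which removes the need for a learned value function. In TAB's setting the reward would be multi-objective, combining task accuracy with the token cost of the allocated budgets.

```python
# Minimal sketch of GRPO's group-relative advantage computation.
# Illustrative only; in the paper's setup the reward would combine
# task accuracy with a penalty for tokens spent.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Standardize each rollout's reward against its group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts of one problem; a higher reward means a
# correct answer reached with fewer tokens.
advs = group_relative_advantages([1.0, 0.2, 0.8, 0.2])
print([round(a, 3) for a in advs])
```

Rollouts that beat their group's average receive positive advantages and are reinforced; the advantages of a group always sum to (approximately) zero.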
Empirical Validation
The experiments on mathematical reasoning benchmarks provide robust evidence of TAB's superiority over static and off-the-shelf baselines, with substantial token savings and maintained accuracy.
Practical Applicability
The proposed TAB and TAB All-SubQ policies are designed to be deployable in real-world LLM systems, offering a scalable solution for improving efficiency in multi-turn reasoning tasks.
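In deployment, a learned per-turn budget can simply become the maximum-token limit of each generation call. The loop below is a hedged sketch of that integration; `budget_policy` and `generate` are hypothetical stand-ins for the trained TAB policy and an LLM backend, not APIs from the paper.

```python
# Hedged deployment sketch: the policy's per-turn budget caps each
# generation call, and a running counter enforces the global constraint.
# `budget_policy` and `generate` are hypothetical interfaces.

def run_conversation(sub_questions, budget_policy, generate,
                     global_budget=4096):
    history, remaining = [], global_budget
    for q in sub_questions:
        # TAB-style allocation, clipped to what is left of the budget.
        budget = min(budget_policy(history, q), remaining)
        answer = generate(history + [q], max_tokens=budget)
        history += [q, answer]
        remaining -= budget
        if remaining <= 0:
            break
    return history
```

Because the policy only consumes the conversation history and the current sub-question, it can sit in front of any token-limited generation API without modifying the underlying model.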
Demerits
Limited Generalizability
The experiments are primarily conducted on mathematical reasoning benchmarks, which may not fully capture the complexity and variability of other multi-turn reasoning tasks, such as legal or medical reasoning.
Dependency on Prior Knowledge
The TAB All-SubQ variant relies on prior knowledge of sub-questions, which may not always be available in real-world scenarios, limiting its applicability in some contexts.
Computational Overhead
While TAB aims to reduce inference-time compute, the training process for the GRPO-based policy may introduce additional computational overhead, potentially offsetting some of the efficiency gains during deployment.
Benchmark Limitations
The use of mathematical reasoning benchmarks may not fully reflect multi-turn reasoning in domains such as the humanities or social sciences, where answers are open-ended and harder to verify.
Expert Commentary
The authors present a compelling and timely contribution to the field of LLM reasoning, addressing a critical gap in multi-turn inference-time optimization. By framing the problem as a multi-objective MDP and leveraging GRPO for training, they offer a theoretically sound and empirically validated solution that outperforms static and off-the-shelf baselines. The introduction of TAB and TAB All-SubQ demonstrates a nuanced understanding of the tradeoffs between accuracy and efficiency, with significant token savings achieved without sacrificing performance. While the work is primarily validated on mathematical reasoning tasks, the underlying principles are broadly applicable to other domains where multi-turn reasoning is essential. The paper also raises important questions about the scalability and generalizability of the approach, particularly in open-ended domains or in settings where prior knowledge of sub-questions is unavailable. Overall, this work represents a significant step forward in optimizing LLM systems, with far-reaching implications for both academia and industry.
Recommendations
- ✓ Conduct further experiments across diverse multi-turn reasoning benchmarks, including domains such as legal reasoning, medical diagnosis, and humanities, to validate the generalizability of TAB.
- ✓ Explore alternative training methodologies, such as offline reinforcement learning or model-based approaches, to reduce the computational overhead associated with training the GRPO-based policy while maintaining performance.
- ✓ Develop standardized evaluation metrics and benchmarks for multi-turn reasoning tasks to enable fair comparison across methods and facilitate reproducibility.
- ✓ Investigate the integration of TAB with other inference-time optimization techniques, such as speculative decoding or early-exit mechanisms, to further enhance efficiency and performance.
- ✓ Collaborate with industry partners to deploy TAB in real-world LLM systems and gather user feedback to refine the policy for practical applications.
Sources
Original: arXiv - cs.LG