ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue

arXiv:2603.02216v1 (Announce Type: new). Abstract: Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance. This strategy enables more accurate value estimation, while fostering more efficient and diverse exploration. To mitigate the high computational cost of tree-based RL, we introduce two key optimizations: an uncertainty-guided pruning mechanism to minimize the number of rollouts, and an asynchronous search architecture that leverages KV cache reuse to maximize inference throughput. Extensive experiments on three public medical dialogue benchmarks demonstrate that our algorithm significantly outperforms several strong baselines, culminating in the Qwen3-8B model surpassing the much larger GPT-4o ($+0.92\%$ accuracy).
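The abstract's asynchronous search architecture with KV cache reuse can be illustrated with a minimal sketch. This is not the paper's implementation: a dict of memoized asyncio tasks stands in for a real KV cache, and all names (`encode_prefix`, `rollout`, `expand`) and the simulated latency are assumptions for illustration only.

```python
import asyncio

# Hypothetical sketch: sibling rollouts branching from the same tree node
# share their prompt prefix, so the expensive "prefill" is done once and
# reused. A dict of memoized tasks stands in for real KV-cache reuse.

prefix_tasks = {}

async def encode_prefix(prefix):
    # Memoize the prefill as a shared task so all siblings await the
    # same computation instead of re-encoding the prefix.
    if prefix not in prefix_tasks:
        async def prefill():
            await asyncio.sleep(0.01)  # simulated prefill latency
            return f"kv({prefix})"
        prefix_tasks[prefix] = asyncio.ensure_future(prefill())
    return await prefix_tasks[prefix]

async def rollout(prefix, action):
    kv = await encode_prefix(prefix)   # shared across sibling branches
    return f"{kv}+decode({action})"    # per-branch decoding

async def expand(prefix, actions):
    # Asynchronous expansion: all branches of a node are rolled out concurrently.
    return await asyncio.gather(*(rollout(prefix, a) for a in actions))

results = asyncio.run(expand("symptom-history", ["ask_fever", "ask_cough", "order_test"]))
```

The design point is that concurrency and prefix sharing are complementary: `gather` keeps the decoder busy across branches, while the memoized prefill ensures the shared prefix is encoded only once per node.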

Executive Summary

This article proposes a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm for multi-turn medical dialogues. ATPO addresses challenges in aligning large language models (LLMs) for interactive diagnosis by adaptively allocating the rollout budget to states with high uncertainty. This strategy enables more accurate value estimation and fosters efficient, diverse exploration. Key optimizations include an uncertainty-guided pruning mechanism and an asynchronous search architecture with KV cache reuse. Experiments on three public benchmarks demonstrate that ATPO significantly outperforms strong baselines, with the Qwen3-8B model surpassing the much larger GPT-4o by +0.92% accuracy. The findings have significant implications for accurate diagnosis in medical dialogues and highlight the potential of ATPO in other interactive scenarios.

Key Points

  • ATPO addresses challenges in LLMs for multi-turn medical dialogues
  • Uncertainty-aware adaptive tree policy optimization enables accurate value estimation
  • Key optimizations include uncertainty-guided pruning and asynchronous search architecture
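The composite uncertainty metric named above can be sketched as a weighted sum of the absolute Bellman error and the variance of action values at a state. This is a minimal sketch, not the paper's formula: the weights `alpha`/`beta`, the discount `gamma`, and the use of population variance are all assumptions.

```python
import statistics

# Hypothetical sketch of the composite uncertainty score: a weighted sum of
# the absolute one-step Bellman error and the variance of action values.
# Weights, discount, and variance convention are assumptions, not the paper's.

def uncertainty(value_estimate, reward, next_value, q_values,
                alpha=1.0, beta=1.0, gamma=0.99):
    # One-step Bellman residual: |r + gamma * V(s') - V(s)|
    bellman_error = abs(reward + gamma * next_value - value_estimate)
    # Spread of action values Q(s, a) across candidate actions at s
    q_variance = statistics.pvariance(q_values) if len(q_values) > 1 else 0.0
    return alpha * bellman_error + beta * q_variance
```

For example, `uncertainty(0.5, 0.1, 0.6, [0.2, 0.8])` combines a Bellman residual of 0.194 with a Q-variance of 0.09, giving 0.284; states scoring higher under such a metric would receive more of the rollout budget.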

Merits

Strength: Efficient Exploration

ATPO fosters efficient and diverse exploration by concentrating rollouts on high-uncertainty states, which improves diagnostic accuracy in multi-turn medical dialogues.

Strength: Accurate Value Estimation

The uncertainty-aware adaptive allocation of the rollout budget enables more accurate value estimation, addressing the unstable value estimation that hampers PPO in this setting.
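Adaptive budget allocation can be sketched as distributing a fixed number of rollouts across candidate states in proportion to a softmax over their uncertainty scores. The softmax form, temperature, and per-state floor are assumptions for illustration; the paper's actual allocation rule may differ.

```python
import math

# Hypothetical sketch: split a fixed rollout budget across tree states in
# proportion to a softmax over their uncertainty scores, so more uncertain
# states receive more rollouts. Temperature and flooring are assumptions.

def allocate_rollouts(uncertainties, budget, temperature=1.0, min_rollouts=1):
    weights = [math.exp(u / temperature) for u in uncertainties]
    total = sum(weights)
    alloc = [max(min_rollouts, round(budget * w / total)) for w in weights]
    # Trim any rounding overshoot, taking from the least uncertain states first.
    order = sorted(range(len(alloc)), key=lambda i: uncertainties[i])
    i = 0
    while sum(alloc) > budget and i < len(order):
        idx = order[i]
        if alloc[idx] > min_rollouts:
            alloc[idx] -= 1
        else:
            i += 1
    return alloc
```

With `allocate_rollouts([0.0, 1.0, 2.0], 10)`, the most uncertain state receives 7 of the 10 rollouts while the least uncertain still gets its floor of 1, which is the intended contrast with uniform per-state budgets in group-based methods like GRPO.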

Demerits

Limitation: Computational Cost

Tree-based RL is inherently more computationally expensive than single-trajectory rollouts. ATPO's pruning mechanism and asynchronous search architecture mitigate this overhead, but the residual cost may still limit scalability to larger models or longer horizons.
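The uncertainty-guided pruning that mitigates this cost can be sketched as cutting branches whose uncertainty falls below a threshold before any further rollouts are spent on them. The node representation, threshold semantics, and keep-one fallback are assumptions, not the paper's mechanism.

```python
# Hypothetical sketch of uncertainty-guided pruning: branches whose
# uncertainty score falls below a threshold are cut before further rollouts
# are spent on them. Node format and fallback rule are assumptions.

def prune(children, threshold):
    # children: list of (action, uncertainty) pairs at a tree node.
    kept = [(a, u) for a, u in children if u >= threshold]
    # Always keep at least the single most uncertain branch so the search
    # can continue even when every score is below the threshold.
    if not kept and children:
        kept = [max(children, key=lambda c: c[1])]
    return kept
```

Pruning low-uncertainty branches is what lets the adaptive allocation stay within a fixed budget: rollouts saved on confidently valued branches are redirected to the branches where value estimates are still unreliable.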

Limitation: Dependence on Benchmark Datasets

The effectiveness of ATPO is demonstrated on public medical dialogue benchmarks, which may not generalize to other domains or datasets

Expert Commentary

The article presents a novel and promising approach to addressing challenges in LLMs for multi-turn medical dialogues. The uncertainty-aware adaptive tree policy optimization and key optimizations demonstrate the potential of ATPO in improving accuracy and efficiency. However, the high computational cost and dependence on benchmark datasets remain limitations. The implications for medical dialogues and interactive scenarios are significant, highlighting the need for further research and development of uncertainty-aware optimization techniques.

Recommendations

  • 1. Further research on applying ATPO to other interactive scenarios and domains
  • 2. Further development of uncertainty-aware optimization techniques, including uncertainty metrics beyond Bellman error and action-value variance
