Academic

MAPO: Mixed Advantage Policy Optimization for Long-Horizon Multi-Turn Dialogue

arXiv:2603.06194v1 Announce Type: new Abstract: Subjective multi-turn dialogue tasks, such as emotional support, require conversational policies that adapt to evolving user states and optimize long-horizon interaction quality. However, reinforcement learning (RL) for such settings remains challenging due to the absence of reliable process supervision. Outcome-only training collapses credit assignment across turns into a single trajectory-level reward, while naïve turn-level group sampling incurs prohibitive rollout costs in interactive environments. We propose a critic-free and efficient RL algorithm named MAPO that leverages dense process feedback from a judge model and propagates long-horizon effects through Monte Carlo returns. To stabilize optimization, we introduce a mixed advantage estimator that combines turn-level normalization with batch-level normalization, enabling fine-grained yet scalable credit assignment. Across multiple subjective dialogue benchmarks, including EMPA, EmoBench, and EQ-Bench, and model scales ranging from 7B to 32B, our method consistently improves both training stability and final performance over outcome-only GRPO and single-level normalization baselines. On EMPA, we improve rates by up to 9 points and increase dialogue scores by as much as +43.2 over the 7B base model. Despite training only on EMPA-style environments, our approach generalizes well, yielding consistent improvements on unseen emotional-intelligence benchmarks, including up to +4 points on EmoBench and +3.5 on EQ-Bench. Together, these results demonstrate that dense process supervision combined with mixed-level normalization enables effective and scalable RL for subjective, open-ended multi-turn dialogue.

Executive Summary

The article introduces MAPO, a critic-free reinforcement learning algorithm for long-horizon multi-turn dialogue tasks such as emotional support. MAPO combines dense process feedback from a judge model, propagated across turns via Monte Carlo returns, with a mixed advantage estimator that blends turn-level and batch-level normalization. Evaluated on EMPA, EmoBench, and EQ-Bench across model scales from 7B to 32B, it consistently outperforms outcome-only GRPO and single-level normalization baselines, including dialogue-score gains of up to +43.2 over the 7B base model on EMPA.

Key Points

  • MAPO algorithm for long-horizon multi-turn dialogue tasks
  • Leverages dense process feedback from a judge model
  • Mixed advantage estimator for stabilization and credit assignment
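To make the key points concrete, the following is a minimal sketch of the two ingredients the abstract describes: discounted Monte Carlo returns over per-turn judge rewards, and a mixed advantage that interpolates turn-level (within-dialogue) and batch-level normalization. The discount factor `gamma`, the mixing coefficient `lam`, and the exact normalization scopes are assumptions for illustration; the paper's actual formulation may differ.

```python
import numpy as np

def monte_carlo_returns(rewards, gamma=0.95):
    """Propagate per-turn judge rewards backward as discounted returns,
    so early turns receive credit for long-horizon effects.
    gamma is an assumed discount factor, not taken from the paper."""
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mixed_advantage(batch_returns, lam=0.5, eps=1e-8):
    """Blend turn-level and batch-level normalization of returns.
    batch_returns: list of per-dialogue return arrays in one batch.
    lam is an assumed mixing coefficient between the two scopes."""
    flat = np.concatenate(batch_returns)
    batch_mu, batch_sigma = flat.mean(), flat.std() + eps
    advantages = []
    for returns in batch_returns:
        # Turn-level: normalize within a single dialogue's turns.
        turn_norm = (returns - returns.mean()) / (returns.std() + eps)
        # Batch-level: normalize against all turns in the batch.
        batch_norm = (returns - batch_mu) / batch_sigma
        advantages.append(lam * turn_norm + (1 - lam) * batch_norm)
    return advantages
```

The resulting per-turn advantages would then weight the policy-gradient update in place of a single trajectory-level reward, which is the credit-assignment failure the abstract attributes to outcome-only training.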

Merits

Improved Training Stability

MAPO's mixed advantage estimator enables fine-grained credit assignment, leading to more stable training and improved performance.

Scalability

The algorithm's efficiency and critic-free design make it suitable for large-scale models and complex dialogue tasks.

Demerits

Limited Exploration

The reliance on a judge model for process feedback may limit the exploration of novel dialogue strategies and outcomes.

Benchmark Dependence

The evaluation of MAPO is primarily based on specific benchmarks, which may not generalize to all types of dialogue tasks or environments.

Expert Commentary

MAPO marks a meaningful advance in reinforcement learning for long-horizon multi-turn dialogue. By pairing dense process feedback with mixed advantage estimation, it addresses two central obstacles at once: per-turn credit assignment and training instability. Its critic-free, rollout-efficient design also makes it practical at scale, with clear implications for building more engaging and supportive dialogue systems. That said, further research is needed on its reliance on a judge model, whose biases can shape the learned policy, and on evaluation beyond the specific benchmarks used.

Recommendations

  • Further evaluation of MAPO on diverse dialogue tasks and benchmarks
  • Investigation of alternative methods for process feedback and credit assignment
  • Integration of MAPO with other reinforcement learning algorithms and techniques to enhance performance and scalability
