
Decoupling Strategy and Execution in Task-Focused Dialogue via Goal-Oriented Preference Optimization

Jingyi Xu, Xingyu Ren, Zhiqiang You, Yumeng Zhang, Zhoupeng Shou

arXiv:2602.15854v1 Announce Type: cross Abstract: Large language models show potential in task-oriented dialogue systems, yet existing training methods often rely on token-level likelihood or preference optimization, which poorly align with long-horizon task success. To address this, we propose Goal-Oriented Preference Optimization (GOPO), a hierarchical reinforcement learning framework that decouples strategy planning from response generation via an Expert Agent and a Customer Service Agent. The Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while the Customer Service Agent generates responses strictly aligned with the selected strategy. We evaluate GOPO on public benchmarks and e-commerce customer service datasets, and introduce Task-focused Sequential Engagement (TSE), a sequence-level metric derived from real e-commerce interaction data. On the Mgshop dataset, GOPO improves TSE by 7.7% and 10.3% over PPO and Memento, with consistent gains in sequence-level reward and generation quality. Furthermore, a 14B model trained with GOPO achieves 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, respectively. Ablation studies confirm the Expert Agent's critical role in long-horizon optimization. GOPO demonstrates consistent improvements across other datasets as well. This work establishes a new paradigm for task-oriented dialogue systems in commercial scenarios, with code and datasets to be made public.

Executive Summary

This article proposes Goal-Oriented Preference Optimization (GOPO), a framework for task-oriented dialogue that decouples strategy planning from response generation. An Expert Agent optimizes multi-turn goal preferences at the dialogue-trajectory level, while a Customer Service Agent generates responses strictly aligned with the selected strategy. Evaluated on public benchmarks and e-commerce customer service datasets, GOPO yields consistent gains in sequence-level reward, generation quality, and the authors' proposed Task-focused Sequential Engagement (TSE) metric, offering a practical route to better long-horizon task success in commercial dialogue systems.

Key Points

  • Decoupling strategy planning from response generation improves long-horizon task success
  • GOPO pairs an Expert Agent (trajectory-level strategy optimization) with a Customer Service Agent (strategy-aligned response generation)
  • On Mgshop, TSE improves by 7.7% over PPO and 10.3% over Memento, with consistent gains in sequence-level reward and generation quality
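The two-level decoupling described above can be sketched as a per-turn pipeline in which a planner picks a strategy and a separate generator is conditioned on it. This is a minimal illustrative sketch, not the authors' implementation: the agent names mirror the paper, but the placeholder heuristic and string-based "generation" are stand-ins for the RL-trained Expert Agent and the LLM-based Customer Service Agent.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of GOPO's decoupling: an expert policy selects a
# high-level strategy, then a separate generator produces the utterance
# conditioned strictly on that strategy. Names are illustrative only.

@dataclass
class DialogueState:
    history: List[str] = field(default_factory=list)

def expert_agent(state: DialogueState) -> str:
    """Pick a turn-level strategy (in GOPO, an RL-trained policy
    optimized with trajectory-level goal preferences)."""
    # Placeholder heuristic in place of the learned policy.
    if any("refund" in turn for turn in state.history):
        return "resolve_refund"
    return "clarify_intent"

def service_agent(state: DialogueState, strategy: str) -> str:
    """Generate a response conditioned on the chosen strategy
    (in GOPO, an LLM trained to follow the strategy strictly)."""
    return f"[{strategy}] How can I help you further?"

def dialogue_turn(state: DialogueState, user_utterance: str) -> str:
    state.history.append(user_utterance)
    strategy = expert_agent(state)             # planning level
    response = service_agent(state, strategy)  # execution level
    state.history.append(response)
    return response

state = DialogueState()
reply = dialogue_turn(state, "I want a refund for my order.")
```

The key design point is that the planner's action space is small and discrete (strategies), so trajectory-level credit assignment happens at that level rather than over raw tokens.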

Merits

Strength in Long-Horizon Optimization

By decoupling strategy planning from response generation, GOPO lets the Expert Agent optimize preferences at the dialogue-trajectory level, directly targeting long-horizon task success that token-level likelihood and single-response preference objectives capture poorly.
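One way to read "optimizes multi-turn goal preferences at the dialogue-trajectory level" is a DPO-style preference objective applied to whole trajectories rather than single responses. The following is a hedged sketch under that assumption (summed per-turn log-probabilities, sigmoid preference loss); the paper's exact loss may differ, and all function names here are hypothetical.

```python
import math

def trajectory_logprob(turn_logprobs):
    """Sequence-level log-probability: sum of per-turn log-probs."""
    return sum(turn_logprobs)

def trajectory_preference_loss(lp_w_policy, lp_w_ref,
                               lp_l_policy, lp_l_ref, beta=0.1):
    """DPO-style loss on a preferred (w) vs. dispreferred (l) trajectory.

    Each argument is a list of per-turn log-probabilities under the
    current policy or a frozen reference model. This is an illustrative
    reading of a trajectory-level objective, not the paper's exact loss.
    """
    margin = beta * (
        (trajectory_logprob(lp_w_policy) - trajectory_logprob(lp_w_ref))
        - (trajectory_logprob(lp_l_policy) - trajectory_logprob(lp_l_ref))
    )
    # -log(sigmoid(margin)): small when the preferred trajectory's
    # relative log-prob exceeds the dispreferred one's.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy per-turn log-probs for a preferred and a dispreferred trajectory.
loss = trajectory_preference_loss(
    [-1.0, -0.8], [-1.2, -1.0],   # preferred: policy, reference
    [-1.5, -1.4], [-1.2, -1.1],   # dispreferred: policy, reference
)
```

Because the log-probabilities are summed over turns before the preference comparison, the gradient rewards strategies whose benefit only shows up several turns later, which is exactly the long-horizon credit assignment that per-token objectives miss.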

Improved Generation Quality

The Customer Service Agent's strict alignment with the selected strategy leads to improved generation quality and sequence-level reward.

Robustness and Consistency

GOPO demonstrates consistent improvements across different datasets, indicating its robustness and potential for real-world applications.

Demerits

Limited Generalizability

The evaluation of GOPO is primarily conducted on e-commerce datasets, limiting its generalizability to other domains and tasks.

Computational Complexity

The proposed framework may introduce additional computational complexity due to the hierarchical reinforcement learning architecture.

Expert Commentary

The proposed GOPO framework addresses the critical challenge of long-horizon optimization in task-oriented dialogue. Its headline result, a 14B model trained with GOPO achieving 2.7% and 1.5% higher TSE than Qwen-235B and GPT-5.2, suggests that trajectory-level preference optimization can partially substitute for model scale on this task. However, the limited domain coverage of the evaluation and the added cost of the hierarchical reinforcement learning architecture warrant further investigation and optimization before broad deployment in commercial dialogue systems.

Recommendations

  • Further evaluation of GOPO on diverse datasets and tasks to establish its generalizability
  • Investigation of methods to mitigate computational complexity and improve scalability
