HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents
arXiv:2602.16165v1 Announce Type: new Abstract: Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse-reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan-Execute RL framework that explicitly separates high-level planning from low-level execution. HiPER factorizes the policy into a high-level planner that proposes subgoals and a low-level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key technique called hierarchical advantage estimation (HAE), which carefully assigns credit at both the planning and execution levels. By aggregating returns over the execution of each subgoal and coordinating updates across the two levels, HAE provides an unbiased gradient estimator and provably reduces variance compared to flat generalized advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4\% success on ALFWorld and 83.3\% on WebShop with Qwen2.5-7B-Instruct (+6.6\% and +8.3\% over the best prior method), with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.
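The plan-execute factorization described in the abstract can be sketched as a two-level rollout loop: a high-level planner proposes a subgoal, and a low-level executor acts for several environment steps before control returns to the planner. The sketch below is illustrative only; the environment interface, loop limits, and function signatures are assumptions, not HiPER's actual API.

```python
# Illustrative sketch of a hierarchical plan-execute rollout, assuming a
# gym-style environment with reset()/step(); names and limits are hypothetical.
def run_episode(env, planner, executor, max_subgoals=10, max_steps=5):
    """Roll out one episode with a two-level plan-execute policy.

    planner(obs) -> subgoal          (high-level decision, one per segment)
    executor(obs, subgoal) -> action (low-level decision, several per segment)
    """
    obs, trajectory = env.reset(), []
    for _ in range(max_subgoals):
        subgoal = planner(obs)                 # high-level: pick a subgoal
        for _ in range(max_steps):             # low-level: execute it
            action = executor(obs, subgoal)
            obs, reward, done = env.step(action)
            trajectory.append((subgoal, action, reward))
            if done:
                return trajectory
    return trajectory
```

Because rewards are recorded alongside the subgoal that was active when they arrived, the trajectory can later be segmented by subgoal for hierarchical credit assignment.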
Executive Summary
The article proposes HiPER, a hierarchical Plan-Execute reinforcement learning framework for training large language model (LLM) agents on multi-turn decision-making tasks. By separating high-level planning from low-level execution, HiPER mitigates the credit assignment problem in sparse-reward settings. The framework introduces hierarchical advantage estimation (HAE), which yields an unbiased gradient estimator with provably lower variance than flat generalized advantage estimation. Empirically, HiPER achieves state-of-the-art results on challenging interactive benchmarks (97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct). The article argues that explicit hierarchical decomposition is key to scalable RL training of multi-turn LLM agents.
Key Points
- ▸ HiPER separates high-level planning from low-level execution.
- ▸ Hierarchical advantage estimation (HAE) is introduced to align optimization with the structure.
- ▸ HiPER achieves state-of-the-art performance on challenging interactive benchmarks.
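The hierarchical advantage estimation named above can be illustrated with a minimal sketch contrasting it with flat GAE: rewards within each subgoal segment are aggregated into a single segment-level return, so the planner's advantages are computed over the (much shorter) sequence of subgoals rather than over every action step. The segmentation, discounting, and baseline choices here are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative contrast between flat GAE and a segment-aggregated,
# planner-level advantage computation; details are assumed, not HiPER's.
def flat_gae(rewards, values, gamma=0.99, lam=0.95):
    """Standard GAE over the full flat trajectory (baseline for comparison)."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages

def planner_advantages(rewards, segments, planner_values, gamma=0.99):
    """Aggregate each subgoal segment's discounted reward into one
    segment-level return, then compute planner advantages over the
    shorter subgoal sequence instead of the full action sequence."""
    # segments: list of [start, end) action-index ranges, one per subgoal
    seg_returns = [
        sum(gamma ** (t - s) * rewards[t] for t in range(s, e))
        for s, e in segments
    ]
    adv, ret = [], 0.0
    for k in reversed(range(len(segments))):
        seg_len = segments[k][1] - segments[k][0]
        ret = seg_returns[k] + gamma ** seg_len * ret
        adv.append(ret - planner_values[k])
    return list(reversed(adv))
```

With a sparse terminal reward, the planner sees only as many advantage terms as there are subgoals, which is the intuition behind the variance reduction claimed for HAE relative to flat GAE.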
Merits
Strength
HiPER tackles credit assignment in sparse-reward settings head-on: by aggregating returns per subgoal, HAE yields an unbiased gradient estimator with provably lower variance than flat GAE, which stabilizes optimization and improves the efficiency of RL training.
Demerits
Limitation
The framework relies on the assumption of a clear separation between high-level planning and low-level execution, which may not be applicable to all multi-turn decision-making tasks.
Expert Commentary
The article presents a timely and important contribution to the field of reinforcement learning, addressing a critical challenge in the training of large language model agents. The HiPER framework's ability to improve optimization and efficiency in sparse-reward settings is a significant advancement. The empirical results demonstrate the framework's effectiveness on challenging interactive benchmarks. However, the framework's reliance on a clear separation between high-level planning and low-level execution may limit its applicability to certain tasks. Further research is needed to explore the potential generalizability of HiPER to a broader range of tasks and environments.
Recommendations
- ✓ Future research should investigate the extension of HiPER to more complex tasks and environments, including those with unclear separations between planning and execution.
- ✓ The development of more efficient and scalable RL methods like HiPER can inform policy decisions related to the deployment of AI systems in real-world applications.