Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
arXiv:2603.04597v1 Announce Type: new Abstract: Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedbacks are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, achieving 2.2$\times$ improvements in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
Executive Summary
The article proposes GOLF, a reinforcement learning framework that leverages group-level natural language feedback to improve exploration efficiency. By aggregating external critiques with the diverse attempts produced within a rollout group, GOLF generates high-quality refinements that guide targeted exploration, and it jointly optimizes generation and refinement within a unified RL loop. The authors report a 2.2x improvement in sample efficiency over RL methods trained solely on scalar rewards.
Key Points
- ▸ GOLF guides exploration with group-level natural language feedback rather than scalar rewards alone
- ▸ External critiques are aggregated with the group's own diverse attempts to produce actionable refinements
- ▸ Refinements are adaptively injected into training as off-policy scaffolds in sparse-reward regions
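The mechanism in the points above can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: `policy_sample`, `refine_from_feedback`, and `reward_fn` are hypothetical placeholders for the policy, the critique-aggregation step, and the scalar reward, and the "inject when no attempt succeeds" rule is an assumed simplification of GOLF's adaptive injection.

```python
def golf_step(policy_sample, refine_from_feedback, reward_fn, prompts,
              group_size=4, inject_threshold=0.0):
    """One hypothetical GOLF-style data-collection step (sketch only).

    policy_sample(prompt) -> a candidate response from the current policy
    refine_from_feedback(prompt, attempts) -> a refinement aggregated from
        external critiques and the group's own attempts (assumed helper)
    reward_fn(prompt, response) -> scalar reward

    Returns a list of (prompt, response, reward, is_off_policy) tuples.
    """
    batch = []
    for prompt in prompts:
        # (1) Sample a group of on-policy attempts for this prompt.
        attempts = [policy_sample(prompt) for _ in range(group_size)]
        rewards = [reward_fn(prompt, a) for a in attempts]
        batch.extend((prompt, a, r, False) for a, r in zip(attempts, rewards))

        # (2) Sparse-reward region: no attempt cleared the threshold, so
        # aggregate group-level NL feedback into a refinement and inject
        # it into the batch as an off-policy scaffold.
        if max(rewards) <= inject_threshold:
            refined = refine_from_feedback(prompt, attempts)
            batch.append((prompt, refined, reward_fn(prompt, refined), True))
    return batch
```

With toy stubs (every on-policy draft fails, the refinement succeeds), each prompt yields `group_size` on-policy tuples plus one injected off-policy scaffold. A real implementation would then compute group-relative advantages over this mixed batch and update both the generation and refinement behaviors.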
Merits
Improved Exploration Efficiency
GOLF's use of group-level language feedback yields more targeted exploration and better performance, with the paper reporting a 2.2x gain in sample efficiency over scalar-reward-only RL.
Demerits
Limited Generalizability
GOLF's effectiveness may be limited to domains or environments where high-quality natural language feedback is available.
Expert Commentary
The proposed GOLF framework represents a significant advancement in reinforcement learning, as it effectively harnesses the power of group-level language feedback to guide exploration. The adaptive injection of refinements as off-policy scaffolds is a particularly innovative aspect of the framework, allowing for targeted guidance in sparse-reward regions. However, further research is needed to fully understand the limitations and potential applications of GOLF, particularly in domains with limited access to high-quality language feedback.
Recommendations
- ✓ Further investigation into the generalizability of GOLF across different domains and environments
- ✓ Exploration of the potential applications of GOLF in real-world problems with sparse rewards