
VAM: Verbalized Action Masking for Controllable Exploration in RL Post-Training -- A Chess Case Study

Zhicheng Zhang, Ziyan Wang, Yali Du, Fei Fang

arXiv:2602.16833v1 Announce Type: new Abstract: Exploration remains a key bottleneck for reinforcement learning (RL) post-training of large language models (LLMs), where sparse feedback and large action spaces can lead to premature collapse into repetitive behaviors. We propose Verbalized Action Masking (VAM), which verbalizes an action mask in the prompt and enforces that the model outputs an action from the masked set. Building on this interface, we introduce iterative action-space pruning: if the target action is not sampled, we remove valid sampled actions from the mask and resample under the reduced candidate set, repeating until the target is sampled or a fixed budget is exhausted. We study VAM in chess and evaluate it under two training regimes: an engine-play regime that generates states via play against an engine opponent and a fixed-dataset regime that trains from a fixed dataset of positions with verifier scores. Across held-out chess puzzles and full-game play measured by average centipawn loss (ACPL), VAM improves learning efficiency and final performance over strong baselines, highlighting verbalized masking as a practical mechanism for controllable exploration in LLM RL post-training.

Executive Summary

This article introduces Verbalized Action Masking (VAM), an exploration method for reinforcement learning (RL) post-training of large language models (LLMs). VAM verbalizes an action mask in the prompt and constrains the model to output an action from the masked set; when the target action is not sampled, an iterative pruning step removes valid but incorrect samples from the mask and resamples under the reduced candidate set until the target is drawn or a fixed budget is exhausted. The authors apply VAM to a chess case study, evaluating it under two training regimes: engine-play, which generates states through play against an engine opponent, and fixed-dataset, which trains on a fixed set of positions with verifier scores. On held-out chess puzzles and on full-game play measured by average centipawn loss (ACPL), VAM improves learning efficiency and final performance over strong baselines, positioning verbalized masking as a practical mechanism for controllable exploration in LLM RL post-training.
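The iterative action-space pruning described above can be illustrated with a minimal sketch. All names here are assumptions for illustration, not the paper's code: `sample_action` stands in for the masked LLM sampler, and the loop simply shrinks the candidate set each time a valid but non-target action is drawn.

```python
import random

def iterative_pruning(candidates, target, sample_action, budget=5):
    """Resample under a shrinking action mask until the target action is
    sampled or the budget is exhausted (illustrative sketch only)."""
    mask = list(candidates)
    for _ in range(budget):
        action = sample_action(mask)  # model is constrained to the masked set
        if action == target:
            return action, mask       # target found under the current mask
        if action in mask and len(mask) > 1:
            mask.remove(action)       # prune the valid-but-wrong sample
    return None, mask                 # budget exhausted without the target

# Toy usage: a uniform random sampler over the masked set stands in for
# the LLM. With 4 candidates and budget 10, every wrong draw shrinks the
# mask, so the target is guaranteed to be sampled within the budget.
moves = ["e2e4", "d2d4", "g1f3", "c2c4"]
found, remaining = iterative_pruning(moves, "g1f3",
                                     lambda m: random.choice(m), budget=10)
```

Because each failed draw removes one wrong candidate, the mask collapses to the target after at most `len(candidates) - 1` misses, which is why a modest budget suffices in this toy setting.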

Key Points

  • VAM is a novel exploration method for reinforcement learning post-training of large language models.
  • VAM verbalizes an action mask in the prompt and constrains the model to output an action from the masked set.
  • The authors apply VAM to a chess case study, evaluating its performance under two training regimes: engine-play and fixed-dataset.

Merits

Strength

The study demonstrates VAM's effectiveness in chess, a realistic and challenging environment, which highlights its potential for real-world applications.

Strength

The authors provide a thorough evaluation of VAM under two training regimes, engine-play and fixed-dataset, which helps to establish the robustness of the method.

Strength

The study highlights the importance of controllable exploration in reinforcement learning post-training, which is a critical bottleneck in the field.

Demerits

Limitation

The study focuses on a single domain, chess, which may limit the generalizability of the results to other domains.

Limitation

One of the two training regimes relies on a fixed dataset of positions, which may not capture the full range of states VAM would encounter in interactive, real-world applications.

Expert Commentary

The study presents a significant contribution to the field of reinforcement learning post-training, particularly in the context of large language models. The evaluation of VAM under two training regimes provides a comprehensive understanding of its effectiveness and robustness. However, the study's focus on a single domain and reliance on a fixed-dataset regime may limit its generalizability. Nevertheless, the results suggest that VAM has the potential to improve learning efficiency and final performance in reinforcement learning post-training, which could lead to significant advances in real-world applications.

Recommendations

  • Future studies should investigate the generalizability of VAM to other domains and scenarios, including real-world applications.
  • Researchers should explore the potential applications of VAM in other areas, such as recommender systems and decision-making under uncertainty.