Overcoming Valid Action Suppression in Unmasked Policy Gradient Algorithms
arXiv:2603.09090v1 Announce Type: new Abstract: In reinforcement learning environments with state-dependent action validity, action masking consistently outperforms penalty-based handling of invalid actions, yet existing theory only shows that masking preserves the policy gradient theorem. We identify a distinct failure mode of unmasked training: it systematically suppresses valid actions at states the agent has not yet visited. This occurs because gradients pushing down invalid actions at visited states propagate through shared network parameters to unvisited states where those actions are valid. We prove that for softmax policies with shared features, when an action is invalid at visited states but valid at an unvisited state $s^*$, the probability $\pi(a \mid s^*)$ is bounded by exponential decay due to parameter sharing and the zero-sum identity of softmax logits. This bound reveals that entropy regularization trades off between protecting valid actions and sample efficiency, a tradeoff that masking eliminates. We validate empirically that deep networks exhibit the feature alignment condition required for suppression, and experiments on Craftax, Craftax-Classic, and MiniHack confirm the predicted exponential suppression and demonstrate that feasibility classification enables deployment without oracle masks.
Executive Summary
This article addresses the suppression of valid actions by unmasked policy gradient algorithms in reinforcement learning environments with state-dependent action validity. The authors identify a failure mode in which gradients that push down invalid actions at visited states propagate through shared network parameters to unvisited states, suppressing actions that are valid there. For softmax policies with shared features, they prove that the probability of such a valid action at an unvisited state is bounded by an exponentially decaying quantity. Experiments on Craftax, Craftax-Classic, and MiniHack confirm the predicted suppression and show that feasibility classification enables deployment without oracle masks. The findings help explain why action masking consistently outperforms penalty-based handling of invalid actions.
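The mechanism can be sketched with the standard log-softmax gradient. The parameterization below (logits $z_a(s) = w_a^\top \phi(s)$ with features $\phi$ shared across actions) is an illustrative assumption, not necessarily the paper's exact construction:

$$\frac{\partial}{\partial z_b}\log\pi(a \mid s) = \mathbf{1}[a=b] - \pi(b \mid s), \qquad \sum_b \frac{\partial}{\partial z_b}\log\pi(a \mid s) = 0.$$

The zero-sum identity on the right means any decrease in an invalid action's logit is matched by increases elsewhere. Under the linear head, an update $\Delta w_a$ computed at a visited state $s$ shifts the logit at an unvisited state $s^*$ by $\Delta z_a(s^*) = \Delta w_a^\top \phi(s^*) \propto \phi(s)^\top \phi(s^*)$, so aligned features transport the penalty from $s$ to $s^*$ even though $s^*$ is never observed.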
Key Points
- ▸ Unmasked policy gradient algorithms suppress valid actions at unvisited states due to parameter sharing and gradient propagation.
- ▸ The probability of a valid action at an unvisited state is bounded by exponential decay.
- ▸ Feasibility classification enables deployment without oracle masks.
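As a toy illustration of the suppression mechanism (not the paper's construction), the sketch below trains a linear softmax policy with weights shared across states and applies penalty-style REINFORCE updates that push down an invalid action at a single visited state. The feature vectors, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

# Toy linear softmax policy pi(a|s) = softmax(W @ phi(s)); the weight
# matrix W is shared across states, mirroring shared network parameters.
n_actions, d = 3, 4
W = np.zeros((n_actions, d))

# Hypothetical features: visited state s and unvisited state s_star are
# feature-aligned (phi_s . phi_star > 0), the alignment condition the
# paper identifies as required for suppression.
phi_s = np.array([1.0, 0.5, 0.0, 0.2])
phi_star = np.array([0.9, 0.6, 0.1, 0.1])

def pi(W, phi):
    z = W @ phi
    e = np.exp(z - z.max())           # numerically stable softmax
    return e / e.sum()

invalid_a = 2                          # invalid at s, but valid at s_star
p_before = pi(W, phi_star)[invalid_a]

# Penalty-style REINFORCE updates at the visited state only: reward -1
# for taking the invalid action gives W += lr * (-1) * d log pi(a|s)/dW,
# where d log pi(a|s)/dW = (onehot(a) - pi(.|s)) phi(s)^T.
lr = 0.1
for _ in range(200):
    p = pi(W, phi_s)
    grad_log = (np.eye(n_actions)[invalid_a] - p)[:, None] * phi_s[None, :]
    W += lr * (-1.0) * grad_log

p_after = pi(W, phi_star)[invalid_a]
print(f"pi(a|s*) before: {p_before:.3f}, after: {p_after:.2e}")
```

Although `s_star` is never visited, `pi(a|s_star)` collapses toward zero: each update made at `s` moves the logit at `s_star` in proportion to `phi_s @ phi_star`. Masking the invalid action instead of penalizing it would leave that logit untouched.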
Merits
Strength
The article provides a clear explanation of the suppression phenomenon and its underlying causes, offering a significant contribution to the field of reinforcement learning.
Demerits
Limitation
The analysis assumes a softmax policy with a shared-feature architecture, which may limit how far the theoretical results generalize to other policy parameterizations and network configurations.
Expert Commentary
This article provides a thorough analysis of valid-action suppression in unmasked policy gradient algorithms. The proof that the probability of a valid action at an unvisited state decays exponentially is a notable result, grounding in the zero-sum structure of softmax logits a gap between masking and penalty-based training that had previously been observed only empirically. The experiments on Craftax, Craftax-Classic, and MiniHack offer strong evidence for the predicted suppression, and the feasibility-classification approach is a promising path to deployment when oracle masks are unavailable. These findings are relevant to the design of any policy gradient system operating in environments with state-dependent action validity.
Recommendations
- ✓ Future research should investigate the suppression phenomenon in other policy and network configurations to further generalize the findings.
- ✓ The development of more effective reinforcement learning algorithms that can handle invalid actions without suppressing valid actions is a critical area of research.