
Territory Paint Wars: Diagnosing and Mitigating Failure Modes in Competitive Multi-Agent PPO

Diyansha Singh

arXiv:2604.04983v1 Announce Type: new Abstract: We present Territory Paint Wars, a minimal competitive multi-agent reinforcement learning environment implemented in Unity, and use it to systematically investigate failure modes of Proximal Policy Optimisation (PPO) under self-play. A first agent trained for $84{,}000$ episodes achieves only $26.8\%$ win rate against a uniformly-random opponent in a symmetric zero-sum game. Through controlled ablations we identify five implementation-level failure modes -- reward-scale imbalance, missing terminal signal, ineffective long-horizon credit assignment, unnormalised observations, and incorrect win detection -- each of which contributes critically to this failure in this setting. After correcting these issues, we uncover a distinct emergent pathology: competitive overfitting, where co-adapting agents maintain stable self-play performance while generalisation win rate collapses from $73.5\%$ to $21.6\%$. Critically, this failure is undetectable via standard self-play metrics: both agents co-adapt equally, so the self-play win rate remains near $50\%$ throughout the collapse. We propose a minimal intervention -- opponent mixing, where $20\%$ of training episodes substitute a fixed uniformly-random policy for the co-adaptive opponent -- which mitigates competitive overfitting and restores generalisation to $77.1\%$ ($\pm 12.6\%$, $10$ seeds) without population-based training or additional infrastructure. We open-source Territory Paint Wars to provide a reproducible benchmark for studying competitive MARL failure modes.

Executive Summary

This paper introduces Territory Paint Wars, a minimal competitive multi-agent reinforcement learning (MARL) environment, to diagnose failure modes in Proximal Policy Optimization (PPO) under self-play conditions. Despite extensive training, an agent achieves only a 26.8% win rate against a random opponent, revealing five critical implementation-level issues: reward-scale imbalance, missing terminal signals, ineffective long-horizon credit assignment, unnormalized observations, and incorrect win detection. Correcting these flaws uncovers a novel pathology—competitive overfitting—where self-play metrics mask a catastrophic collapse in generalization performance (from 73.5% to 21.6%). The authors propose a simple yet effective mitigation—opponent mixing—which restores generalization to 77.1% without requiring population-based training. By open-sourcing the environment, the study provides a reproducible benchmark for studying MARL failure modes, advancing both methodological rigor and practical insights in competitive AI systems.

Key Points

  • Failure of PPO in self-play settings despite extensive training, achieving only 26.8% win rate against a random opponent.
  • Identification of five critical implementation-level failure modes: reward-scale imbalance, missing terminal signals, ineffective long-horizon credit assignment, unnormalized observations, and incorrect win detection.
  • Discovery of competitive overfitting, where co-adapting agents maintain stable self-play performance (50% win rate) while generalization collapses (from 73.5% to 21.6%).
  • Proposal of opponent mixing (substituting 20% of training episodes with a fixed random policy) as a minimal intervention to mitigate competitive overfitting and restore generalization to 77.1%.
  • Open-sourcing of Territory Paint Wars to enable reproducible research on competitive MARL failure modes.
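The opponent-mixing intervention described above is simple enough to sketch directly. The snippet below is a minimal illustration, not the paper's actual code: the 5-action space and the `select_opponent` / `random_policy` names are assumptions for the sake of the example; only the 20% mixing probability comes from the paper.

```python
import random

MIX_PROB = 0.2  # fraction of training episodes using the fixed random opponent (from the paper)

def random_policy(observation, action_space_n=5):
    """Fixed uniformly-random opponent: ignores the observation entirely."""
    return random.randrange(action_space_n)

def select_opponent(learned_opponent_policy):
    """Opponent mixing: with probability MIX_PROB, substitute the fixed
    random policy for the co-adaptive (learned) opponent for this episode."""
    if random.random() < MIX_PROB:
        return random_policy
    return learned_opponent_policy
```

Because the substitution happens per episode, the agent still spends 80% of its training against the co-adapting opponent, which is why the method needs no extra infrastructure.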

Merits

Novel Benchmark for MARL Research

The introduction of Territory Paint Wars provides a minimal, reproducible, and controlled environment for studying competitive MARL failure modes, addressing a critical gap in the literature where such environments are often proprietary or overly complex.

Systematic Diagnosis of Implementation-Level Failures

The paper conducts rigorous ablations to isolate five critical implementation-level issues in PPO that significantly impair performance in self-play, offering actionable insights for practitioners and researchers.
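To make one of the five failure modes concrete: "unnormalised observations" is typically fixed with a running mean/variance normaliser applied before observations reach the policy network. The class below is an illustrative Welford-style sketch of that standard fix; the paper's actual implementation may differ.

```python
import numpy as np

class RunningNorm:
    """Running mean/variance normalisation of observations.

    Illustrates the standard remedy for the 'unnormalised observations'
    failure mode; an assumption, not the paper's exact code.
    """
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps  # avoids division by zero before the first update

    def update(self, x):
        # Welford-style incremental update of mean and population variance.
        delta = x - self.mean
        self.count += 1
        self.mean = self.mean + delta / self.count
        self.var = self.var + (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)
```

In practice such a normaliser is updated on every observation seen during rollout collection, so the statistics track the visitation distribution as the policy changes.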

Discovery of Competitive Overfitting

The identification of competitive overfitting—a previously undocumented pathology where co-adapting agents fail to generalize despite stable self-play performance—advances the theoretical understanding of MARL dynamics and highlights the limitations of standard evaluation metrics.
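Because self-play win rate stays near 50% throughout the collapse, detecting competitive overfitting requires a separate generalisation probe against a fixed, non-adapting opponent. A minimal sketch of such a probe follows; `play_episode` is an assumed environment hook (returning True iff the first policy wins), and the 5-action space is likewise an assumption.

```python
import random

def evaluate_vs_random(agent_policy, play_episode, n_episodes=200):
    """Generalisation probe: win rate against a FIXED uniformly-random opponent.

    Self-play metrics cannot reveal competitive overfitting (both agents
    co-adapt equally), so this probe must be run alongside them.
    `play_episode(policy_a, policy_b)` is an assumed hook that runs one
    episode and returns True iff policy_a wins.
    """
    random_opponent = lambda obs: random.randrange(5)
    wins = sum(play_episode(agent_policy, random_opponent)
               for _ in range(n_episodes))
    return wins / n_episodes
```

Logging this probe periodically during training would make the 73.5% → 21.6% collapse reported in the paper visible as it happens, rather than only after the fact.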

Minimal and Effective Mitigation Strategy

The opponent mixing intervention is elegant in its simplicity, requiring no additional infrastructure or population-based training, yet effectively restores generalization performance, demonstrating practical utility.

Open-Source Contribution

By open-sourcing Territory Paint Wars, the authors enable reproducibility and further research, fostering collaboration and accelerating progress in the field of competitive MARL.

Demerits

Limited Generalizability of Findings

The study focuses on a minimal and symmetric zero-sum game, which may not fully capture the complexity of real-world competitive environments. The generalizability of the identified failure modes and the proposed mitigation to more complex or asymmetric settings remains to be validated.

Dependence on PPO and Unity

The analysis is constrained to PPO and the Unity-based environment, which may limit the applicability of the findings to other reinforcement learning algorithms or simulation frameworks. A more diverse set of algorithms and environments would strengthen the conclusions.

Opponent Mixing as a Partial Solution

While opponent mixing effectively mitigates competitive overfitting, it introduces a trade-off by reducing the agent's adaptability to co-adapting opponents. The long-term effects of this intervention on the agent's competitive performance in dynamic environments are not fully explored.

Lack of Comparative Analysis with Population-Based Methods

The paper does not compare opponent mixing with established population-based training methods (e.g., PSRO or league-based self-play), which are commonly used to address similar issues in MARL. A comparative analysis would provide a clearer understanding of the relative merits of the proposed approach.

Expert Commentary

This paper makes a significant contribution to the field of competitive multi-agent reinforcement learning by systematically diagnosing implementation-level failure modes in PPO and uncovering a novel pathology, competitive overfitting, that has critical implications for both research and practice. The authors' rigorous ablations reveal that seemingly minor implementation details, such as reward scaling and terminal signals, can have outsized effects on performance, underscoring the need for meticulous engineering in RL systems.

The discovery of competitive overfitting is particularly insightful, as it challenges the conventional wisdom that stable self-play metrics are sufficient indicators of robustness. This phenomenon highlights the brittleness of co-adapting agents and the importance of incorporating diverse, non-adaptive opponents into evaluation protocols.

The proposed opponent-mixing intervention is notable for its simplicity, offering a practical solution without the complexity of population-based methods. However, the study's focus on a minimal environment and a single algorithm (PPO) may limit the generalizability of its findings, and further research is needed to validate these insights across a broader range of algorithms and settings. Overall, this work advances both the theoretical understanding and the practical toolkit of competitive MARL, setting a high standard for rigor and reproducibility in the field.

Recommendations

  • Extend the analysis to include a broader range of reinforcement learning algorithms (e.g., SAC, TD3) and more complex environments to validate the generalizability of the identified failure modes and mitigation strategies.
  • Develop standardized evaluation protocols for competitive MARL that incorporate diverse and non-adaptive opponents to detect pathologies like competitive overfitting, ensuring more robust and reliable agent performance.
  • Investigate the long-term effects of opponent mixing on agent adaptability and explore hybrid approaches that combine opponent mixing with population-based training to balance generalization and adaptability.
  • Encourage the open-sourcing of additional minimal MARL environments to diversify the benchmark landscape and enable cross-environment validation of findings.
  • Explore the application of the diagnosed failure modes and mitigation strategies to real-world competitive scenarios, such as autonomous vehicle coordination or cybersecurity simulations, to assess their practical utility.

Sources

Original: arXiv - cs.LG