GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
arXiv:2602.12316v1
Abstract
Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 2,009 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
Executive Summary
The article introduces GT-HarmBench, a benchmark designed to evaluate AI safety risks in multi-agent environments using game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt, and Chicken. The benchmark comprises 2,009 high-stakes scenarios drawn from realistic AI risk contexts in the MIT AI Risk Repository, against which the authors evaluate 15 frontier models. The study finds that agents choose socially beneficial actions only 62% of the time, frequently leading to harmful outcomes. It also measures sensitivity to game-theoretic prompt framing and ordering, and demonstrates that game-theoretic interventions can improve socially beneficial outcomes by up to 18%. The benchmark aims to provide a standardized testbed for studying alignment in multi-agent environments.
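To make the game-theoretic framing concrete, the sketch below shows one way such scenarios can be encoded as 2x2 payoff matrices, with the socially beneficial joint action identified as the welfare-maximizing outcome. This is a minimal Python illustration; the payoff values and function names are assumptions chosen for exposition, not GT-HarmBench's actual encoding.

```python
# Hypothetical sketch: encoding the three named 2x2 games and flagging
# the socially beneficial joint action as the one maximizing total payoff.
# Payoff values are illustrative, not taken from GT-HarmBench.

# GAMES[name][(row_action, col_action)] = (row_payoff, col_payoff)
# "C" = cooperate / coordinate, "D" = defect / deviate.
GAMES = {
    # Defecting dominates individually, but mutual cooperation maximizes welfare.
    "prisoners_dilemma": {
        ("C", "C"): (3, 3), ("C", "D"): (0, 5),
        ("D", "C"): (5, 0), ("D", "D"): (1, 1),
    },
    # Two equilibria; hunting the stag together is welfare-maximizing but risky.
    "stag_hunt": {
        ("C", "C"): (4, 4), ("C", "D"): (0, 3),
        ("D", "C"): (3, 0), ("D", "D"): (3, 3),
    },
    # Mutual defection (the "crash") is the catastrophic outcome.
    "chicken": {
        ("C", "C"): (3, 3), ("C", "D"): (1, 4),
        ("D", "C"): (4, 1), ("D", "D"): (0, 0),
    },
}

def socially_beneficial(game: dict) -> tuple:
    """Return the joint action maximizing the sum of both players' payoffs."""
    return max(game, key=lambda joint: sum(game[joint]))

for name, game in GAMES.items():
    print(name, "->", socially_beneficial(game))
# prisoners_dilemma -> ('C', 'C'); stag_hunt -> ('C', 'C'); chicken -> ('C', 'C')
```

Under this total-welfare criterion, mutual cooperation is the socially beneficial joint action in all three structures, even though individual incentives (dominant defection in the Prisoner's Dilemma, the risk-dominant equilibrium in the Stag Hunt) pull agents away from it, which is precisely the tension the benchmark probes.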
Key Points
- ▸ Introduction of GT-HarmBench, a benchmark for evaluating AI safety in multi-agent environments.
- ▸ Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes (see the scoring sketch after this list).
- ▸ Game-theoretic interventions improve socially beneficial outcomes by up to 18%.
- ▸ Benchmark includes 2,009 high-stakes scenarios based on game-theoretic structures.
- ▸ Provides a standardized testbed for studying alignment in multi-agent environments.
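For readers who want to see how the headline rates compose, here is a hedged scoring sketch that computes the fraction of socially beneficial choices per condition and the uplift from an intervention. The record schema, field names, and toy data are illustrative assumptions, not the benchmark's actual evaluation code.

```python
# Hypothetical scoring sketch: given per-scenario records of whether a
# model's chosen action was socially beneficial, compute the headline
# rate and the uplift from an intervention condition. Schema and data
# are illustrative assumptions, not GT-HarmBench's actual pipeline.
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    condition: str     # "baseline" or "intervention"
    beneficial: bool   # did the agent pick the socially beneficial action?

def rate(results: list, condition: str) -> float:
    """Fraction of socially beneficial choices under one condition."""
    subset = [r for r in results if r.condition == condition]
    return sum(r.beneficial for r in subset) / len(subset)

# Toy data mirroring the paper's reported numbers: a 62% baseline rate
# and an intervention condition showing an 18-point uplift.
toy = (
    [Result("m", "baseline", i < 62) for i in range(100)]
    + [Result("m", "intervention", i < 80) for i in range(100)]
)
base, interv = rate(toy, "baseline"), rate(toy, "intervention")
print(f"baseline: {base:.0%}, intervention: {interv:.0%}, "
      f"uplift: {interv - base:+.0%}")
# baseline: 62%, intervention: 80%, uplift: +18%
```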
Merits
Comprehensive Benchmark
GT-HarmBench provides a comprehensive and standardized testbed for evaluating AI safety in multi-agent environments, addressing a significant gap in current AI safety benchmarks.
Realistic Scenarios
The scenarios are drawn from realistic AI risk contexts, making the benchmark highly relevant to practical applications.
Effective Interventions
The study demonstrates the effectiveness of game-theoretic interventions in improving socially beneficial outcomes, providing actionable insights for AI developers.
Demerits
Limited Model Diversity
The study tests only 15 frontier models, which may not be representative of the broader range of AI models in use.
Potential Bias in Scenarios
The scenarios are based on game-theoretic structures, which may introduce biases that could affect the generalizability of the findings.
Focus on High-Stakes Scenarios
The focus on high-stakes scenarios may not capture the full spectrum of AI safety risks, potentially limiting the applicability of the benchmark.
Expert Commentary
The introduction of GT-HarmBench represents a significant advancement in the field of AI safety, particularly in the context of multi-agent environments. The benchmark addresses a critical gap in current AI safety evaluations by focusing on scenarios that involve coordination and conflict among multiple agents. The findings highlight substantial reliability gaps in current AI models, emphasizing the need for more robust and standardized evaluation methods. The study's demonstration of the effectiveness of game-theoretic interventions provides valuable insights for both AI developers and policymakers.
However, the limitations regarding model diversity and the potential bias in scenarios should be carefully considered. Future research should aim to expand the benchmark to include a more diverse range of models and scenarios, ensuring its applicability across different AI systems and contexts. Overall, GT-HarmBench is a valuable tool for advancing our understanding of AI safety in multi-agent environments and guiding the development of safer and more reliable AI systems.
Recommendations
- ✓ Expand the benchmark to include a more diverse range of AI models to ensure broader applicability.
- ✓ Investigate the potential biases in the scenarios and develop methods to mitigate these biases.
- ✓ Explore the use of GT-HarmBench in real-world applications to validate its effectiveness in practical settings.
- ✓ Encourage further research on game-theoretic interventions to enhance their effectiveness in promoting socially beneficial outcomes.