GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory
arXiv:2602.12316v1
Abstract
Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 2,009 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
Executive Summary
The article introduces GT-HarmBench, a benchmark designed to evaluate AI safety risks in multi-agent environments using game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt, and Chicken. The benchmark comprises 2,009 high-stakes scenarios drawn from realistic AI risk contexts in the MIT AI Risk Repository, against which the authors evaluate 15 frontier models. The study finds that agents choose socially beneficial actions only 62% of the time, frequently leading to harmful outcomes. It also measures sensitivity to game-theoretic prompt framing and ordering, and demonstrates that game-theoretic interventions can improve socially beneficial outcomes by up to 18%. The benchmark aims to provide a standardized testbed for studying alignment in multi-agent environments.
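To make the game-theoretic framing concrete, the sketch below shows one way such scenarios can be encoded as 2x2 payoff matrices, with the socially beneficial joint action identified as the welfare-maximizing outcome. This is a minimal Python illustration; the payoff values and function names are assumptions chosen for exposition, not GT-HarmBench's actual encoding.

```python
# Hypothetical sketch: encoding the three named 2x2 games and flagging
# the socially beneficial joint action as the one maximizing total payoff.
# Payoff values are illustrative, not taken from GT-HarmBench.

# GAMES[name][(row_action, col_action)] = (row_payoff, col_payoff)
# "C" = cooperate / coordinate, "D" = defect / deviate.
GAMES = {
    # Defecting dominates individually, but mutual cooperation maximizes welfare.
    "prisoners_dilemma": {
        ("C", "C"): (3, 3), ("C", "D"): (0, 5),
        ("D", "C"): (5, 0), ("D", "D"): (1, 1),
    },
    # Two equilibria; hunting the stag together is welfare-maximizing but risky.
    "stag_hunt": {
        ("C", "C"): (4, 4), ("C", "D"): (0, 3),
        ("D", "C"): (3, 0), ("D", "D"): (3, 3),
    },
    # Mutual defection (the "crash") is the catastrophic outcome.
    "chicken": {
        ("C", "C"): (3, 3), ("C", "D"): (1, 4),
        ("D", "C"): (4, 1), ("D", "D"): (0, 0),
    },
}

def socially_beneficial(game: dict) -> tuple:
    """Return the joint action maximizing the sum of both players' payoffs."""
    return max(game, key=lambda joint: sum(game[joint]))

for name, game in GAMES.items():
    print(name, "->", socially_beneficial(game))
# prisoners_dilemma -> ('C', 'C'); stag_hunt -> ('C', 'C'); chicken -> ('C', 'C')
```

Under this total-welfare criterion, mutual cooperation is the socially beneficial joint action in all three structures, even though individual incentives (dominant defection in the Prisoner's Dilemma, the risk-dominant equilibrium in the Stag Hunt) pull agents away from it, which is precisely the tension the benchmark probes.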
Key Points
- ▸ Introduction of GT-HarmBench, a benchmark for evaluating AI safety in multi-agent environments.
- ▸ Across 15 frontier models, agents choose socially beneficial actions in only 62% of cases, frequently leading to harmful outcomes (see the scoring sketch after this list).
- ▸ Game-theoretic interventions improve socially beneficial outcomes by up to 18%.
- ▸ Benchmark includes 2,009 high-stakes scenarios based on game-theoretic structures.
- ▸ Provides a standardized testbed for studying alignment in multi-agent environments.
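For readers who want to see how the headline rates compose, here is a hedged scoring sketch that computes the fraction of socially beneficial choices per condition and the uplift from an intervention. The record schema, field names, and toy data are illustrative assumptions, not the benchmark's actual evaluation code.

```python
# Hypothetical scoring sketch: given per-scenario records of whether a
# model's chosen action was socially beneficial, compute the headline
# rate and the uplift from an intervention condition. Schema and data
# are illustrative assumptions, not GT-HarmBench's actual pipeline.
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    condition: str     # "baseline" or "intervention"
    beneficial: bool   # did the agent pick the socially beneficial action?

def rate(results: list, condition: str) -> float:
    """Fraction of socially beneficial choices under one condition."""
    subset = [r for r in results if r.condition == condition]
    return sum(r.beneficial for r in subset) / len(subset)

# Toy data mirroring the paper's reported numbers: a 62% baseline rate
# and an intervention condition showing an 18-point uplift.
toy = (
    [Result("m", "baseline", i < 62) for i in range(100)]
    + [Result("m", "intervention", i < 80) for i in range(100)]
)
base, interv = rate(toy, "baseline"), rate(toy, "intervention")
print(f"baseline: {base:.0%}, intervention: {interv:.0%}, "
      f"uplift: {interv - base:+.0%}")
# baseline: 62%, intervention: 80%, uplift: +18%
```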
Merits
Comprehensive Benchmark
GT-HarmBench provides a comprehensive and standardized testbed for evaluating AI safety in multi-agent environments, addressing a significant gap in current AI safety benchmarks.
Realistic Scenarios
The scenarios are drawn from realistic AI risk contexts, making the benchmark highly relevant to practical applications.
Effective Interventions
The study demonstrates the effectiveness of game-theoretic interventions in improving socially beneficial outcomes, providing actionable insights for AI developers.
Demerits
Limited Model Diversity
The study tests only 15 frontier models, which may not be representative of the broader range of AI models in use.
Potential Bias in Scenarios
The scenarios are based on game-theoretic structures, which may introduce biases that could affect the generalizability of the findings.
Focus on High-Stakes Scenarios
The focus on high-stakes scenarios may not capture the full spectrum of AI safety risks, potentially limiting the applicability of the benchmark.
Expert Commentary
The introduction of GT-HarmBench represents a significant advancement in the field of AI safety, particularly in the context of multi-agent environments. The benchmark addresses a critical gap in current AI safety evaluations by focusing on scenarios that involve coordination and conflict among multiple agents. The findings highlight substantial reliability gaps in current AI models, emphasizing the need for more robust and standardized evaluation methods. The study's demonstration of the effectiveness of game-theoretic interventions provides valuable insights for both AI developers and policymakers.
However, the limitations regarding model diversity and the potential bias in scenarios should be carefully considered. Future research should aim to expand the benchmark to include a more diverse range of models and scenarios, ensuring its applicability across different AI systems and contexts. Overall, GT-HarmBench is a valuable tool for advancing our understanding of AI safety in multi-agent environments and guiding the development of safer and more reliable AI systems.
Recommendations
- ✓ Expand the benchmark to include a more diverse range of AI models to ensure broader applicability.
- ✓ Investigate the potential biases in the scenarios and develop methods to mitigate these biases.
- ✓ Explore the use of GT-HarmBench in real-world applications to validate its effectiveness in practical settings.
- ✓ Encourage further research on game-theoretic interventions to enhance their effectiveness in promoting socially beneficial outcomes.