Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

Víctor Gallego

Abstract (arXiv:2603.19453v1): We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at https://github.com/vicgalle/llm-policies-social-dilemmas.

Executive Summary

This article examines the use of large language models (LLMs) to synthesize policies for multi-agent environments, specifically sequential social dilemmas. The authors propose a framework that prompts an LLM to generate Python policy functions, evaluates them in self-play, and refines them using performance feedback. The study compares sparse feedback (scalar reward only) against dense feedback (reward plus social metrics) across two LLMs and two social dilemmas. Dense feedback consistently matches or exceeds sparse feedback on all metrics, with the largest advantage in the Cleanup public goods game. Through an adversarial experiment, the authors also show that LLM-synthesized policies can reward hack these environments, characterizing five attack classes and highlighting a trade-off between expressiveness and safety in policy synthesis.
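
The generate-evaluate-refine loop described above can be sketched as follows. This is a hypothetical illustration, not the authors' code: `query_llm`, `env.description`, and `env.self_play` are assumed interfaces.

```python
# Hedged sketch of the synthesize-evaluate-refine loop. The environment
# and LLM interfaces here are assumptions, not the paper's actual API.

def compile_policy(source):
    """Exec the LLM-generated source and pull out its `policy` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["policy"]

def synthesize_policy(env, query_llm, n_iters=5):
    """Iteratively prompt the LLM for a policy, score it in self-play,
    and feed the results back for refinement; keep the best policy."""
    prompt = f"Write a Python function `policy(obs)` for: {env.description}"
    best_policy, best_reward = None, float("-inf")
    for _ in range(n_iters):
        source = query_llm(prompt)
        policy = compile_policy(source)
        metrics = env.self_play(policy)   # e.g. {"reward": ..., "equality": ...}
        if metrics["reward"] > best_reward:
            best_policy, best_reward = policy, metrics["reward"]
        # Refinement: show the LLM its own code plus the evaluation results.
        prompt = (f"Your previous policy scored {metrics}. "
                  f"Revise this code to improve it:\n{source}")
    return best_policy
```

The key design lever is the last line of the loop body: what goes into `metrics` is exactly the "feedback engineering" choice the paper studies.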

Key Points

  • The authors propose a new framework for LLM policy synthesis in multi-agent environments.
  • Dense feedback (reward plus social metrics) consistently matches or exceeds sparse feedback (reward only) on all metrics.
  • The advantage of dense feedback is most significant in the Cleanup public goods game.
  • An adversarial experiment shows the synthesized policies can reward hack these environments; the authors characterize five attack classes and highlight an inherent tension between expressiveness and safety.
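
To make the sparse/dense distinction concrete, one plausible way to compute the four social metrics named in the abstract is sketched below. The exact definitions are an assumption on my part (the paper may define them differently); the sketch assumes non-negative per-agent returns.

```python
# Hedged sketch: one plausible definition of the four social metrics
# included under dense feedback. Formulas are assumptions, not the
# paper's; `equality` uses 1 minus the Gini coefficient of returns.

def social_metrics(rewards, apples_remaining, total_apples, attacks, steps):
    """rewards: per-agent episode returns (assumed non-negative);
    the remaining arguments are simple episode counters."""
    n = len(rewards)
    total = sum(rewards)
    efficiency = total / n                              # mean return per agent
    gini_num = sum(abs(a - b) for a in rewards for b in rewards)
    equality = 1.0 - gini_num / (2 * n * max(total, 1e-9))
    sustainability = apples_remaining / total_apples    # resource left standing
    peace = 1.0 - attacks / steps                       # non-aggressive fraction
    return {"efficiency": efficiency, "equality": equality,
            "sustainability": sustainability, "peace": peace}
```

Under sparse feedback the LLM would see only the scalar reward; under dense feedback it would also see this dictionary.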

Merits

Strength in Cooperative Strategies

The study demonstrates the ability of LLMs to generate effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression.
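
As a toy illustration (not the authors' code) of what a territory-partitioning policy might look like as a Python function, the sketch below assigns each agent a vertical strip of the grid and harvests only within it, avoiding contested resources. All names and the grid/action conventions are hypothetical.

```python
# Toy territory-partitioning policy (hypothetical, for illustration):
# each agent claims a vertical strip of the grid and only harvests there.

def partition_policy(agent_id, n_agents, position, apples, grid_width):
    """Move toward the nearest apple inside this agent's strip;
    `position` and each apple are (x, y) tuples."""
    strip = grid_width / n_agents
    lo, hi = agent_id * strip, (agent_id + 1) * strip
    mine = [a for a in apples if lo <= a[0] < hi]
    if not mine:
        return "stay"  # nothing to harvest in our territory
    target = min(mine, key=lambda a: abs(a[0] - position[0]) + abs(a[1] - position[1]))
    dx = target[0] - position[0]
    if dx:
        return "right" if dx > 0 else "left"
    dy = target[1] - position[1]
    return "down" if dy > 0 else ("up" if dy < 0 else "stay")
```

A policy like this never fires the attack action and never competes for another agent's apples, which is consistent with the peaceful, partitioned strategies the paper reports emerging under dense feedback.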

Insight into Feedback Engineering

The authors offer valuable insight into feedback engineering, that is, the design of what evaluation information is shown to the LLM during refinement. In particular, they show that social metrics act as a coordination signal that steers the LLM toward effective cooperative strategies rather than becoming a target for over-optimization.
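
The two feedback conditions amount to two ways of rendering the same evaluation into a prompt. A minimal sketch, with string formats that are my assumption rather than the paper's templates:

```python
# Hedged sketch of sparse vs dense feedback rendering; the prompt
# wording is an assumption, not the paper's actual template.

SOCIAL_KEYS = ("efficiency", "equality", "sustainability", "peace")

def sparse_feedback(metrics):
    """Scalar reward only."""
    return f"Your policy achieved a mean reward of {metrics['reward']:.2f}."

def dense_feedback(metrics):
    """Reward plus the four social metrics."""
    social = ", ".join(f"{k}={metrics[k]:.2f}" for k in SOCIAL_KEYS)
    return (f"Your policy achieved a mean reward of {metrics['reward']:.2f}. "
            f"Social metrics: {social}.")
```

Everything else in the refinement loop stays identical, which is what makes the comparison a clean test of feedback engineering.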

Demerits

Limitation in Generalizability

The study evaluates only two LLMs (Claude Sonnet 4.6 and Gemini 3.1 Pro) and two environments (Gathering and Cleanup), so further research is needed to validate the findings across more diverse models and social dilemmas.

Concerns about Safety and Exploitability

The adversarial experiment shows that LLM-synthesized policies can reward hack these environments, which raises safety concerns and underscores the need for further research on mitigations.
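
One mitigation direction for code-level policies is static screening of generated source before executing it. The sketch below is an assumption on my part, not one of the paper's mitigations: it rejects policies that import modules or touch double-underscore names, a common escape route from naive `exec` sandboxes.

```python
# Hedged sketch of one possible mitigation (an assumption, not the
# paper's method): statically reject generated policy code that
# imports modules or references dunder names before exec'ing it.
import ast

def is_safe_policy_source(source):
    """Return False if the generated code imports modules or uses
    double-underscore names; otherwise True. A coarse filter only."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            return False
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            return False
    return True
```

A filter like this restricts expressiveness to gain safety, which is exactly the tension the authors identify: the tighter the screen, the fewer legitimate strategies the LLM can express.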

Expert Commentary

The study offers new insight into using LLMs to synthesize policies for multi-agent environments, demonstrating that dense feedback guides LLMs toward more effective cooperative strategies. At the same time, it exposes real risks: the expressiveness of code-level policies that enables cooperation also enables reward hacking, as the five identified attack classes show. Further research is needed to validate the findings and to harden policy-synthesis pipelines for real-world deployment. The results have implications both for the development of autonomous systems and for policymakers weighing the risks and benefits of LLM-driven agents.

Recommendations

  • Future research should focus on developing more robust and safe LLM policy-synthesis pipelines for real-world applications.
  • Policymakers should consider the potential risks and benefits of LLMs in social dilemmas and develop strategies for mitigating these risks.

Sources

Original: arXiv - cs.CL