Academic

Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

arXiv:2603.10476v1 Announce Type: new

Abstract: The alignment of large language models (LLMs) has progressed substantially in single-agent settings through paradigms such as RLHF and Constitutional AI, with recent work exploring scalable alternatives such as RLAIF and evolving alignment objectives. However, these approaches remain limited in multi-stakeholder settings, where conflicting values arise and deliberative negotiation capabilities are required. This work proposes a multi-agent negotiation-based alignment framework that aligns LLMs to Collective Agency (CA), an existing alignment objective introduced to promote the continual expansion of agency, while simultaneously improving conflict-resolution capability. To enable scalable training, two self-play instances of the same LLM, assigned opposing personas, engage in structured turn-based dialogue to synthesize mutually beneficial solutions. We generate synthetic moral-dilemma prompts and conflicting persona pairs, and optimize the policy via RLAIF using GRPO with an external LLM reward model. While rewards are computed from CA scores assigned to the final completion, gradients are applied to dialogue tokens to directly improve deliberative interaction dynamics. Experiments show that the resulting model achieves CA alignment comparable to a single-agent baseline while substantially improving conflict-resolution performance without degrading general language capabilities. These results suggest that negotiation-driven deliberation training provides a practical path toward LLMs that better support collective decision-making in value-conflict scenarios.

Executive Summary

This article presents a negotiation-based framework for aligning large language models (LLMs) in multi-stakeholder settings. Two self-play instances of the same LLM, assigned opposing personas, engage in structured turn-based dialogue to synthesize mutually beneficial solutions, thereby improving conflict-resolution capability. The authors optimize the policy via Reinforcement Learning from AI Feedback (RLAIF), using GRPO with an external LLM reward model, and demonstrate that the resulting model achieves Collective Agency (CA) alignment comparable to a single-agent baseline while substantially improving conflict-resolution performance. This work has significant implications for the development of LLMs that support collective decision-making in value-conflict scenarios. Computing rewards from the final completion while applying gradients to the dialogue tokens themselves, so that deliberative interaction dynamics are trained directly, is a notable contribution.
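To make the training setup concrete, the sketch below shows what the self-play negotiation loop could look like. It is a minimal illustration assuming a generic chat-completion call; `llm_generate`, the persona prompts, and the six-turn budget are hypothetical stand-ins, not the authors' exact protocol.

```python
def llm_generate(system_prompt: str, transcript: list[str]) -> str:
    """Hypothetical stand-in for a call to the shared policy LLM."""
    return f"[reply conditioned on {len(transcript)} prior turns]"


def negotiate(dilemma: str, persona_a: str, persona_b: str,
              max_turns: int = 6) -> dict:
    """Two instances of the same policy, given opposing personas,
    alternate turns and then synthesize a joint resolution."""
    transcript: list[str] = []
    personas = [persona_a, persona_b]
    for turn in range(max_turns):
        stake = personas[turn % 2]
        system = (f"You are negotiating the dilemma: {dilemma}\n"
                  f"Your stake: {stake}\n"
                  "Seek a mutually beneficial resolution.")
        transcript.append(llm_generate(system, transcript))
    # Final completion: the single synthesis turn whose CA score
    # later serves as the episode reward.
    resolution = llm_generate(
        f"Propose a resolution to '{dilemma}' that expands the agency "
        "of both parties.", transcript)
    return {"dialogue": transcript, "resolution": resolution}


result = negotiate("Allocate one research grant between two labs",
                   persona_a="lead of lab A", persona_b="lead of lab B")
print(result["resolution"])
```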

Key Points

  • The proposed framework enables LLMs to engage in structured turn-based dialogue to synthesize mutually beneficial solutions.
  • The policy is optimized via RLAIF, using GRPO with an external LLM reward model, yielding improved conflict-resolution capabilities (a minimal sketch of the group-relative update follows this list).
  • The resulting model matches a single-agent baseline on CA alignment without degrading general language capabilities.
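As referenced above, here is a minimal sketch of GRPO's group-relative credit assignment, assuming CA scores on a 0-10 scale (the paper's exact scale is not given here):

```python
import statistics


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages as used in GRPO: each rollout's reward is
    normalized against the group sampled for the same prompt, removing the
    need for a learned value baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# CA scores for a group of four negotiation rollouts on one dilemma.
print(grpo_advantages([7.0, 4.5, 8.0, 5.5]))
# -> roughly [0.56, -1.30, 1.30, -0.56]
```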

Merits

Strength in multi-stakeholder settings

The proposed framework directly targets a gap left by single-agent alignment paradigms such as RLHF and Constitutional AI: multi-stakeholder settings where values conflict and deliberative negotiation capabilities are required.

Improved conflict-resolution capabilities

The reported experiments show that the resulting model substantially improves conflict-resolution performance without degrading general language capabilities.

Demerits

Scalability limitations

The framework is trained on two-party self-play over synthetic moral-dilemma prompts and persona pairs; it is unclear whether the learned negotiation behavior transfers to settings with more than two stakeholders or to messier real-world value conflicts, which could limit its practical applicability.

Dependence on an LLM reward model

The policy is optimized via RLAIF, with rewards produced by an external LLM judge scoring Collective Agency. Any bias, inconsistency, or miscalibration in that judge propagates directly into the alignment objective, and no human feedback is in the loop to correct it; a sketch of this dependency follows.
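To illustrate the dependency, here is a minimal sketch of how a scalar CA reward might be elicited from an external judge. The rubric wording, the 0-10 scale, and `judge_llm` are assumptions for illustration, not the paper's actual prompt.

```python
import re

# Hypothetical rubric; the paper's actual judging prompt is not reproduced here.
CA_RUBRIC = (
    "Rate the following resolution on Collective Agency from 0 to 10: "
    "does it expand the agency of all stakeholders rather than privileging "
    "one side? Answer with a single number.\n\nResolution:\n{resolution}"
)


def judge_llm(prompt: str) -> str:
    """Placeholder for the external LLM reward-model call."""
    return "8"


def ca_reward(resolution: str) -> float:
    """Parse a scalar CA score from the judge's reply. Any bias or
    parsing failure in the judge flows directly into the training signal."""
    reply = judge_llm(CA_RUBRIC.format(resolution=resolution))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else 0.0
```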

Expert Commentary

The proposed framework represents a significant step toward LLMs that can navigate complex multi-stakeholder settings. The approach has limitations, notably unproven scalability beyond two-party negotiation and reliance on an external LLM reward model, but applying policy gradients to the dialogue tokens themselves, rather than only to the final answer, is a genuinely interesting way to train deliberative behavior directly (sketched below). The work has significant implications for the development of value-aligned AI systems and highlights the need for policymakers to develop guidelines for the multi-stakeholder settings in which LLMs will operate.
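The token-level credit assignment highlighted above can be sketched as a simple loss mask: the scalar reward comes from the final resolution's CA score, but the policy-gradient loss covers every token the policy generated during the dialogue. The role labels below are an illustrative assumption about how turns might be tagged.

```python
def dialogue_loss_mask(token_roles: list[str]) -> list[float]:
    """1.0 where the policy-gradient loss applies, 0.0 elsewhere.
    In self-play both negotiators are the same policy, so all generated
    dialogue turns receive gradient signal, not just the final answer."""
    trainable = {"dialogue", "resolution"}
    return [1.0 if role in trainable else 0.0 for role in token_roles]


# Prompt/system tokens are masked out; generated turns are trained.
roles = ["prompt", "dialogue", "prompt", "dialogue", "resolution"]
print(dialogue_loss_mask(roles))  # [0.0, 1.0, 0.0, 1.0, 1.0]
```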

Recommendations

  • Future research should scale the framework from two-party self-play to larger or more complex multi-stakeholder settings, which will likely require more sample-efficient optimization than group-based rollouts over full dialogues.
  • The authors should explore the applicability of the proposed framework to other AI systems, including those that operate in more complex or dynamic environments.
