Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization

arXiv:2603.02701v1 Announce Type: new Abstract: Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.

Executive Summary

The article introduces Graph-GRPO, a novel framework addressing longstanding challenges in communication topology optimization for multi-agent systems powered by Large Language Models. Traditional approaches suffer from gradient variance and credit assignment issues because they rely on absolute rewards from single-sample evaluations. Graph-GRPO mitigates these by introducing Group Relative Policy Optimization, which samples multiple diverse graphs per query and scores each edge by its graph's performance relative to the other graphs in the group, thereby normalizing rewards and improving signal clarity. Experimental results on reasoning and code generation tasks demonstrate superior training stability and the discovery of previously obscured communication pathways. This represents a meaningful advancement in scalable, robust MAS topology learning.
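The core normalization step described above can be sketched concisely. The following is a minimal illustration of group-relative advantages, not the paper's exact formula: it assumes each query yields a group of scalar rewards (e.g., binary correctness) and standardizes them within the group.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of rewards to zero mean and unit variance.

    Each sampled topology's advantage reflects how it performed
    relative to the other graphs sampled for the *same* query, so a
    uniformly easy query (all successes) or a uniformly hard one
    (all failures) contributes no spurious learning signal.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four topologies sampled for one query, binary correctness rewards:
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Successful topologies receive positive advantage, failures negative;
# an all-success or all-failure group would yield near-zero advantages.
```

This is why the method is robust to task-difficulty variance: the advantage depends only on within-group contrast, not on the absolute reward scale.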

Key Points

  • Introduction of Group Relative Policy Optimization as a novel mechanism
  • Mitigation of gradient variance via relative reward normalization
  • Empirical validation on real-world benchmarks showing superiority over baselines
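To make the edge-level credit assignment concrete, here is one plausible sketch under stated assumptions: the graph policy is parameterized as independent Bernoulli edge probabilities (the paper's exact parameterization is not given in the abstract), and each edge is credited by the group-normalized advantage of the graphs in which it appears.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_topologies(edge_probs, group_size):
    """Sample a group of communication graphs from independent
    Bernoulli edge probabilities (a hypothetical parameterization,
    used here only for illustration)."""
    draws = rng.random((group_size, *edge_probs.shape))
    return (draws < edge_probs).astype(float)

def edge_advantages(graphs, rewards, eps=1e-8):
    """Credit each edge with the mean group-relative advantage of the
    sampled graphs containing it: a coarse estimate of how much that
    edge contributed within this group."""
    r = np.asarray(rewards, dtype=float)
    adv = (r - r.mean()) / (r.std() + eps)   # group-relative advantages
    # Weight each sampled adjacency matrix by its graph's advantage
    # and average over the group -> a per-edge credit signal.
    return np.tensordot(adv, graphs, axes=1) / len(r)

probs = np.full((3, 3), 0.5)                 # 3 agents, illustrative policy
group = sample_topologies(probs, group_size=8)
credit = edge_advantages(group, rewards=[1, 0, 1, 1, 0, 1, 0, 1])
```

Edges that recur in above-average graphs accumulate positive credit, which is the fine-grained signal that single-sample absolute rewards cannot provide.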

Merits

Innovation

Graph-GRPO introduces a statistically grounded, scalable solution to persistent credit assignment and variance issues in MAS topology learning.

Empirical Support

Extensive experiments validate the efficacy of the proposed method across multiple domains, indicating strong generalizability.

Demerits

Complexity

The requirement to sample multiple diverse graphs per query may introduce computational overhead in large-scale deployments.

Generalizability Concern

Results are currently validated on reasoning and code generation benchmarks; applicability to other LLM-based MAS domains remains to be confirmed.

Expert Commentary

Graph-GRPO represents a paradigm shift in how multi-agent systems evaluate communication topology. By shifting from absolute to relative evaluation—leveraging group-level comparisons to inform edge-level learning—the authors effectively circumvent the classic pitfalls of single-sample gradient variance and ambiguous credit assignment. This approach aligns with recent advances in contrastive learning and relative evaluation frameworks in machine learning, suggesting a broader applicability beyond MAS. The elegance lies in its simplicity: instead of attempting to optimize absolute outcomes, it optimizes relative performance within a sampled cohort, which introduces robustness against noise and enhances learning signal quality. Notably, the experiments are compelling, particularly the identification of ‘critical communication pathways’ previously masked by reward noise—this is a significant empirical contribution. While computational cost may be a practical concern, the tradeoff appears justified given the qualitative gains in stability and interpretability. This work sets a new benchmark for topology learning in agent-based systems and warrants replication across diverse LLM architectures and application domains.

Recommendations

  • Adopt Graph-GRPO as a baseline for new MAS topology optimization projects involving LLM agents.
  • Investigate scalability of the sampling mechanism on distributed architectures and explore hybrid variants combining relative and absolute reward signals.
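One way to explore the hybrid variant suggested above is to blend the group-relative advantage with a centered absolute reward. The mixing weight `alpha` and the fixed baseline below are illustrative choices, not part of the paper:

```python
import numpy as np

def hybrid_advantage(rewards, alpha=0.5, baseline=0.5, eps=1e-8):
    """Hypothetical hybrid signal: interpolate between group-relative
    advantages (robust to difficulty variance) and a centered absolute
    reward (preserves a global notion of success)."""
    r = np.asarray(rewards, dtype=float)
    relative = (r - r.mean()) / (r.std() + eps)  # within-group contrast
    absolute = r - baseline                      # global success signal
    return alpha * relative + (1.0 - alpha) * absolute

mixed = hybrid_advantage([1.0, 0.0], alpha=0.5)
```

At `alpha=1.0` this reduces to the pure group-relative scheme; at `alpha=0.0` it recovers an absolute-reward baseline, so the parameter directly controls the tradeoff the recommendation asks about.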
