From Flat Logs to Causal Graphs: Hierarchical Failure Attribution for LLM-based Multi-Agent Systems
arXiv:2602.23701v1 Announce Type: new Abstract: LLM-powered Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in complex domains but suffer from inherent fragility and opaque failure mechanisms. Existing failure attribution methods, whether relying on direct prompting, costly replays, or supervised fine-tuning, typically treat execution logs as flat sequences. This linear perspective fails to disentangle the intricate causal links inherent to MAS, leading to weak observability and ambiguous responsibility boundaries. To address these challenges, we propose CHIEF, a novel framework that transforms chaotic trajectories into a structured hierarchical causal graph. It then employs hierarchical oracle-guided backtracking to efficiently prune the search space via synthesized virtual oracles. Finally, it implements counterfactual attribution via a progressive causal screening strategy to rigorously distinguish true root causes from propagated symptoms. Experiments on the Who&When benchmark show that CHIEF outperforms eight strong and state-of-the-art baselines on both agent- and step-level accuracy. Ablation studies further confirm the critical role of each proposed module.
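The first stage of the pipeline, lifting a flat execution log into a hierarchical causal graph, can be illustrated with a minimal sketch. This is not the authors' implementation; the `Step` record and the read/write artifact model are assumptions made for illustration. The idea is to link each step to the earlier steps whose outputs it consumed, and to aggregate those step-level edges into agent-level edges, giving the two-level (agent → step) view the abstract describes.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    agent: str                                 # agent that produced this step
    index: int                                 # position in the flat log
    reads: set = field(default_factory=set)    # artifacts this step consumed
    writes: set = field(default_factory=set)   # artifacts this step produced

def build_causal_graph(log):
    """Derive step-level and agent-level causal edges from a flat log."""
    producers = {}       # artifact -> index of the step that last wrote it
    step_edges = {}      # step index -> set of parent step indices
    agent_edges = set()  # (parent_agent, child_agent) pairs
    for step in log:
        # A step causally depends on whichever step last produced each input.
        parents = {producers[a] for a in step.reads if a in producers}
        step_edges[step.index] = parents
        for p in parents:
            if log[p].agent != step.agent:
                agent_edges.add((log[p].agent, step.agent))
        for a in step.writes:
            producers[a] = step.index
    return step_edges, agent_edges
```

For a toy log where a planner writes a plan, a coder consumes it to write code, and a tester consumes the code, the sketch yields step edges 0 → 1 → 2 and agent edges (planner, coder) and (coder, tester), rather than a flat three-entry sequence.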
Executive Summary
This article presents CHIEF, a framework for hierarchical failure attribution in Large Language Model (LLM)-based Multi-Agent Systems (MAS). CHIEF addresses the limitations of existing methods by transforming chaotic execution logs into structured hierarchical causal graphs, employing oracle-guided backtracking to prune the search space, and implementing counterfactual attribution via progressive causal screening. On the Who&When benchmark, CHIEF outperforms eight strong and state-of-the-art baselines on both agent- and step-level accuracy, and ablation studies confirm that each module is critical to its performance. CHIEF has significant implications for the development and deployment of LLM-based MAS, enabling more robust and transparent failure analysis.
Key Points
- ▸ CHIEF transforms chaotic execution logs into structured causal graphs
- ▸ Hierarchical oracle-guided backtracking efficiently prunes the search space
- ▸ Counterfactual attribution via progressive causal screening rigorously distinguishes true root causes
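The second and third key points can be sketched together as one loop. The oracle stage keeps only the steps a synthesized checker flags as faulty, pruning everything else from the search space; the screening stage then tests each surviving candidate counterfactually, attributing the failure only to a step whose correction alone repairs the run. The `oracle_ok` and `replay_with_fix` callables below are hypothetical stand-ins for the paper's virtual oracles and replay mechanism, not the authors' API.

```python
def attribute_failure(trace, oracle_ok, replay_with_fix):
    """Return the index of the root-cause step, or None if no single
    step's correction repairs the run.

    trace           -- ordered list of step records from a failed run
    oracle_ok       -- callable(step) -> bool; True if the synthesized
                       oracle accepts the step's output (prune it)
    replay_with_fix -- callable(trace, i) -> bool; True if the run
                       succeeds once step i's output is corrected
    """
    # 1. Oracle-guided backtracking: only steps the oracle rejects
    #    survive as candidate causes; accepted steps are pruned.
    candidates = [i for i, step in enumerate(trace) if not oracle_ok(step)]
    # 2. Progressive counterfactual screening: test candidates from the
    #    earliest flagged step forward. A propagated symptom fails this
    #    test, because fixing it alone does not repair the run.
    for i in candidates:
        if replay_with_fix(trace, i):
            return i       # true root cause
    return None
```

The distinction between root cause and symptom is carried entirely by the counterfactual replay: both kinds of step look faulty to the oracle, but only correcting the root cause flips the run's outcome.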
Merits
Strength in Addressing Complexity
CHIEF effectively addresses the complexity of LLM-based MAS by incorporating hierarchical causal graph analysis and counterfactual attribution, making it a significant improvement over existing methods.
Improved Accuracy
CHIEF demonstrates superior performance over state-of-the-art baselines on both agent- and step-level accuracy, indicating its practical utility in real-world applications.
Enhanced Transparency
CHIEF's hierarchical failure attribution enables more transparent analysis of LLM-based MAS, facilitating better understanding and improvement of these complex systems.
Demerits
Potential Computational Burden
The hierarchical oracle-guided backtracking and counterfactual attribution components of CHIEF may introduce nontrivial computational overhead, which could limit its scalability in large-scale deployments.
Dependence on Log and Oracle Quality
CHIEF's attributions may be sensitive to the fidelity of the execution logs and the synthesized virtual oracles it relies on, which could affect the framework's generalizability and robustness in real-world scenarios.
Expert Commentary
CHIEF represents a significant advancement in the field of LLM-based MAS, addressing critical limitations of existing failure attribution methods. Its combination of hierarchical causal modeling and counterfactual screening has the potential to change how we design, debug, and deploy these complex systems. However, the potential computational burden and the input-quality dependencies noted above deserve careful consideration, as they may limit the framework's scalability and generalizability. As CHIEF continues to evolve, it is likely to have far-reaching implications for explainable AI, causal reasoning in complex systems, and the responsible deployment of LLM-based MAS.
Recommendations
- ✓ Further research is needed to evaluate the scalability and generalizability of CHIEF in large-scale applications and diverse domains.
- ✓ The development of tools and methodologies for interpreting and visualizing the causal graphs generated by CHIEF would facilitate its practical adoption and enhance its transparency.