
GREPO: A Benchmark for Graph Neural Networks on Repository-Level Bug Localization

arXiv:2602.13921v1 Announce Type: new Abstract: Repository-level bug localization, the task of identifying where code must be modified to fix a bug, is a critical software engineering challenge. Standard Large Language Models (LLMs) are often unsuitable for this task due to context window limitations that prevent them from processing entire code repositories. As a result, various retrieval methods are commonly used, including keyword matching, text similarity, and simple graph-based heuristics such as Breadth-First Search. Graph Neural Networks (GNNs) offer a promising alternative due to their ability to model complex, repository-wide dependencies; however, their application has been hindered by the lack of a dedicated benchmark. To address this gap, we introduce GREPO, the first GNN benchmark for repository-scale bug localization tasks. GREPO comprises 86 Python repositories and 47294 bug-fixing tasks, providing graph-based data structures ready for direct GNN processing. Our evaluation of various GNN architectures shows outstanding performance compared to established information retrieval baselines. This work highlights the potential of GNNs for bug localization and establishes GREPO as a foundational resource for future research. The code is available at https://github.com/qingpingmo/GREPO.
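
The abstract mentions keyword matching and Breadth-First Search as common retrieval heuristics. The following is a minimal sketch of how the two could be combined, not code from GREPO: the toy call graph, the node names, and the helper functions `keyword_seeds` and `bfs_candidates` are all hypothetical and chosen only for illustration.

```python
from collections import deque


def keyword_seeds(graph, issue_text):
    """Pick seed nodes whose dotted name shares a token with the issue text."""
    words = set(issue_text.lower().split())
    return [node for node in graph if words & set(node.lower().split("."))]


def bfs_candidates(graph, seeds, max_hops=2):
    """Breadth-first expansion from the seeds, up to max_hops edges away.

    Returns (node, hop_distance) pairs sorted by distance, then by name.
    """
    dist = {s: 0 for s in seeds if s in graph}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return sorted(dist.items(), key=lambda kv: (kv[1], kv[0]))


# Hypothetical toy call graph: each key calls the functions in its list.
toy_graph = {
    "cli.main": ["parser.parse", "io.load"],
    "parser.parse": ["parser.tokenize"],
    "parser.tokenize": ["utils.strip"],
    "io.load": ["utils.strip"],
    "utils.strip": [],
}

seeds = keyword_seeds(toy_graph, "parse crashes on empty input")
ranked = bfs_candidates(toy_graph, seeds, max_hops=2)
print(ranked)
```

A heuristic like this ranks candidates purely by hop distance from the seeds, with no learned notion of which dependencies matter for a given bug.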

Executive Summary

The article introduces GREPO, a novel benchmark for evaluating Graph Neural Networks (GNNs) in repository-level bug localization. This task is crucial for software engineering because it involves identifying where in a code repository modifications are needed to fix a bug. Standard Large Language Models (LLMs) are often unsuitable because their context windows are too small to process entire repositories, so retrieval methods such as keyword matching, text similarity, and Breadth-First Search are commonly used instead. GNNs can model repository-wide dependencies, but until now there has been no dedicated benchmark for them. GREPO fills this gap with a dataset of 86 Python repositories and 47,294 bug-fixing tasks, structured as graph-based data ready for GNN processing. The authors' evaluation of various GNN architectures on GREPO reports better performance than established information retrieval baselines, highlighting the potential of GNNs in bug localization. The benchmark is intended to serve as a foundational resource for future research in this area.
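
To make "graph-based data ready for GNN processing" more concrete, here is a simplified sketch of the neighborhood aggregation at the core of message-passing GNNs. It is an illustration under stated assumptions, not GREPO's data format or any of the evaluated architectures: the node names, the two-dimensional feature vectors, and the function `mean_aggregate` are hypothetical, and a real GNN layer would also apply learned weights and a nonlinearity.

```python
def mean_aggregate(features, edges):
    """One round of weight-free mean aggregation over undirected edges.

    Each node's new vector is the mean of its own vector and its neighbors'.
    """
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    updated = {}
    for node, own in features.items():
        group = [own] + [features[n] for n in neighbors[node]]
        updated[node] = [
            sum(vec[i] for vec in group) / len(group) for i in range(len(own))
        ]
    return updated


# Hypothetical code graph: a file, a function it contains, and a callee.
features = {
    "a.py": [1.0, 0.0],
    "a.f": [0.0, 1.0],
    "b.g": [0.0, 0.0],
}
edges = [("a.py", "a.f"), ("a.f", "b.g")]

out = mean_aggregate(features, edges)
print(out)
```

Stacking several such rounds lets information travel across multiple hops, which is how a GNN can take repository-wide dependencies into account when scoring a node.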

Key Points

  • Introduction of GREPO benchmark for GNN-based bug localization.
  • GREPO comprises 86 Python repositories and 47,294 bug-fixing tasks.
  • In the authors' evaluation, GNN architectures outperform established information retrieval baselines on GREPO.
  • GREPO is designed to facilitate future research in graph-based bug localization.

Merits

Comprehensive Dataset

GREPO provides a large and diverse dataset specifically tailored for GNN-based bug localization, addressing a significant gap in the field.

Superior Performance

The authors' evaluation reports that GNN architectures outperform established information retrieval baselines, demonstrating their potential in complex bug localization tasks.
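
Comparisons between rankers of this kind are commonly scored with top-k metrics. The sketch below shows two such metrics as an assumption about how a comparison could be scored; the summary above does not state which metrics the GREPO authors use, and the function names and example data are hypothetical.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any of the top-k ranked items is a true fix location, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of true fix locations found in the top-k (ranked has no duplicates)."""
    if not relevant:
        return 0.0
    found = sum(1 for item in ranked[:k] if item in relevant)
    return found / len(relevant)


# Hypothetical ranking from some localizer, and the locations the fix touched.
ranked = ["parser.parse", "io.load", "utils.strip", "cli.main"]
relevant = {"utils.strip", "parser.tokenize"}

print(hit_at_k(ranked, relevant, 1))
print(hit_at_k(ranked, relevant, 3))
print(recall_at_k(ranked, relevant, 3))
```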

Foundation for Future Research

GREPO serves as a valuable resource for researchers, enabling further exploration and development of GNN applications in software engineering.

Demerits

Limited Scope

The benchmark is currently limited to Python repositories, which may not fully represent the diversity of programming languages used in software development.

Data Quality

The abstract does not describe how the bug-fixing tasks were collected or validated, so their quality and representativeness could limit how well the findings generalize.

Computational Resources

Training and evaluating GNNs on large-scale repositories require significant computational resources, which may limit accessibility for some researchers.

Expert Commentary

The introduction of GREPO marks a significant advancement in the field of bug localization, addressing a critical need for scalable and effective methods to handle repository-level tasks. The benchmark's comprehensive dataset and the demonstrated superiority of GNNs over traditional retrieval methods underscore the potential of graph-based approaches in software engineering. However, the current focus on Python repositories and the computational demands of GNNs present challenges that need to be addressed to ensure the broad applicability and accessibility of these methods. Future research should aim to expand the dataset to include a wider range of programming languages and optimize GNN models for efficiency. Additionally, the integration of GREPO into educational and industrial practices could foster a more robust understanding and application of GNNs in bug localization, ultimately enhancing the quality and reliability of software systems.

Recommendations

  • Expand GREPO to include repositories from multiple programming languages to enhance the benchmark's generalizability.
  • Develop more efficient GNN architectures to reduce computational requirements and improve accessibility for researchers and practitioners.
