BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

arXiv:2603.03194v1 Announce Type: new Abstract: Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.

Executive Summary

This article introduces BeyondSWE, a benchmark of 500 real-world instances that evaluates code agents on challenges beyond single-repository bug fixing: cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. The authors find a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. They also develop SearchSWE, a framework that integrates deep search with coding abilities, but find that search augmentation yields inconsistent gains and can sometimes degrade performance. The study offers both a realistic, challenging evaluation benchmark and a flexible framework for advancing research on more capable code agents, and its findings suggest that current models struggle to emulate developer-like workflows that interleave search and reasoning during coding.

Key Points

  • BeyondSWE is a comprehensive benchmark for code agents that addresses real-world challenges
  • Even frontier models plateau below 45% success, and no single model performs consistently across task types
  • SearchSWE integrates deep search with coding abilities, but search augmentation yields inconsistent gains and can even degrade performance
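
The per-task-type inconsistency noted above can be made concrete with a small scoring sketch. This is illustrative only: the setting names and result records below are hypothetical, not drawn from the benchmark's data or harness.

```python
from collections import defaultdict

# Hypothetical evaluation records for one model: (setting, resolved) pairs.
# The four settings loosely mirror BeyondSWE's task types (names are made up).
results = [
    ("cross_repo", True), ("cross_repo", False),
    ("domain_specialized", True), ("domain_specialized", True),
    ("dependency_migration", False), ("dependency_migration", False),
    ("full_repo_generation", True), ("full_repo_generation", False),
]

def success_by_setting(records):
    """Compute the resolution rate separately for each task setting."""
    totals, wins = defaultdict(int), defaultdict(int)
    for setting, resolved in records:
        totals[setting] += 1
        wins[setting] += resolved  # bool counts as 0/1
    return {s: wins[s] / totals[s] for s in totals}

rates = success_by_setting(results)
# A single aggregate score can mask exactly the inconsistency the paper
# reports: here the overall rate is 0.5, yet per-setting rates range
# from 0.0 to 1.0.
overall = sum(resolved for _, resolved in results) / len(results)
```

Reporting per-setting rates alongside the aggregate is what lets a benchmark like this expose models that look strong overall but collapse on one task type.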

Merits

Strengths of the Study

The study introduces a comprehensive benchmark that assesses code agents' capabilities in addressing real-world challenges, providing a realistic and challenging evaluation framework for advancing research on more capable code agents.

Methodological Rigor

The study uses a well-designed experimental approach, with 500 real-world instances across four distinct settings, to investigate the role of external knowledge and the limitations of existing models.

Theoretical Contributions

The study offers a new framework, SearchSWE, that integrates deep search with coding abilities, providing a flexible approach for advancing research on code agents.
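
The abstract describes SearchSWE as interleaving deep search with coding, the way a developer alternates between attempting a fix and consulting documentation. A minimal sketch of such a loop is shown below; the function names and control flow are entirely hypothetical and not the paper's actual API.

```python
def solve_with_search(task, code_step, search_step, max_rounds=3):
    """Hypothetical interleaved loop: attempt a patch, and on failure
    consult external search before retrying (SearchSWE-style sketch).

    code_step(task, context) -> (patch, ok): one coding attempt.
    search_step(task, patch) -> str: retrieved external knowledge.
    """
    context = []
    for _ in range(max_rounds):
        patch, ok = code_step(task, context)
        if ok:
            return patch
        # Retrieved context is appended blindly, which hints at why
        # search augmentation can degrade performance: irrelevant or
        # misleading results pollute later attempts.
        context.append(search_step(task, patch))
    return None
```

The paper's finding that gains from search are inconsistent suggests the hard part is not wiring up such a loop but deciding *when* to search and *what* retrieved content to trust.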

Demerits

Limitations of the Study

The study evaluates only existing models; it does not propose new agent architectures or training approaches that could close the capability gap it identifies.

Scalability of the Benchmark

The study uses a relatively small dataset of 500 real-world instances, which may not be sufficient to fully capture the complexity of real-world coding tasks.

Generalizability of the Results

The study only evaluates code agents on a specific set of tasks and settings, which may not be representative of the broader range of coding tasks and environments.

Expert Commentary

The study makes a valuable contribution to the field of code agents, highlighting the limitations of existing models and the need for more comprehensive evaluations. The BeyondSWE benchmark and the SearchSWE framework together offer a flexible foundation for advancing research on code agents. However, the study's limitations, such as the modest dataset size and limited generalizability, should be addressed in future work. The findings also have implications for software development and maintenance: researchers and policymakers should prioritize the development of more capable code agents.

Recommendations

  • Future research should focus on developing new architectures and approaches that can address the limitations of current code agents
  • Investment in code agent research and development is necessary to advance the field and improve software development and maintenance
