
SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale

arXiv:2602.23866v1 Announce Type: cross Abstract: Software engineering (SWE) agents are improving rapidly, with recent gains largely driven by reinforcement learning (RL). However, RL training is constrained by the scarcity of large-scale task collections with reproducible execution environments and reliable test suites. Although a growing number of benchmarks have emerged, datasets suitable for training remain limited in scale and diversity or often target a limited set of high-resource language ecosystems. We introduce SWE-rebench V2, a language-agnostic automated pipeline for harvesting executable real-world SWE tasks and constructing RL training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, and filters unsound instances using an ensemble of LLM judges, validated against human-verified SWE-bench annotations. Using this pipeline, we construct a dataset of 32,000+ tasks spanning 20 languages and 3,600+ repositories, with pre-built images for reproducible execution. To further scale training data, we additionally release 120,000+ tasks with installation instructions, fail-to-pass tests and rich metadata, where the problem statement is generated based on the original pull request description. We validate the collected instances through a diagnostic study that covers a subset of tasks in five programming languages across seven popular models, and provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. We release the datasets, the collection and execution code, and associated artifacts to enable large-scale training of SWE agents across diverse languages and repositories.

Executive Summary

The article introduces SWE-rebench V2, a language-agnostic automated pipeline for collecting executable real-world software engineering tasks and constructing reinforcement learning training environments at scale. The pipeline synthesizes repository-specific installation and test procedures via an interactive setup agent, filters unsound instances with an ensemble of LLM judges, and ships pre-built images for reproducible execution. The authors release 32,000+ validated tasks spanning 20 languages and 3,600+ repositories, plus a further 120,000+ tasks with installation instructions, fail-to-pass tests, and rich metadata, and they provide instance-level metadata that flags common confounders such as overly restrictive tests and underspecified descriptions. The datasets and collection code are released to enable large-scale training of software engineering agents across diverse languages and repositories.

Key Points

  • SWE-rebench V2 is a language-agnostic automated pipeline for collecting software engineering tasks.
  • The pipeline synthesizes repository-specific installation and test procedures.
  • The released data comprises 32,000+ validated tasks, plus 120,000+ additional tasks, spanning 20 languages and 3,600+ repositories.
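The LLM-judge filtering mentioned in the abstract can be sketched as a simple vote over independent verdicts. The judge interface below is hypothetical (the paper does not specify its prompts or API); the sketch only shows the ensemble idea of keeping an instance when enough judges deem it sound.

```python
def ensemble_filter(instances, judges, min_votes: int = 2):
    """Keep only instances that at least `min_votes` judges accept.

    Each judge is a callable: instance -> bool (hypothetical interface,
    standing in for an LLM call that returns a soundness verdict).
    """
    kept = []
    for inst in instances:
        votes = sum(1 for judge in judges if judge(inst))
        if votes >= min_votes:
            kept.append(inst)
    return kept
```

Requiring agreement among several judges trades recall for precision, which matters when unsound instances would corrupt the RL reward signal.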

Merits

Strength in Scalability

SWE-rebench V2 collects software engineering tasks at a scale well beyond prior training datasets, making it a valuable resource for reinforcement learning training.

Language-Agnostic Approach

The pipeline's language-agnostic approach enables the collection of tasks across diverse languages, making it a more comprehensive resource for software engineering research.

Demerits

Limited Generalizability

The study's reliance on existing repositories and pull requests may limit the generalizability of the collected tasks to real-world software engineering scenarios.

Potential for Biased Task Selection

The use of language models to filter unsound instances may introduce bias in the task selection process, potentially excluding certain types of tasks or scenarios.

Expert Commentary

SWE-rebench V2 represents a notable advance for software engineering research, particularly for reinforcement learning. Its ability to harvest large, diverse collections of executable tasks is a major strength, and its language-agnostic design improves on benchmarks confined to a few high-resource ecosystems. However, reliance on existing repositories and pull requests may limit how well the collected tasks generalize to other real-world scenarios, and LLM-based filtering of unsound instances may bias task selection. Despite these limitations, the released datasets and collection code should enable more robust and effective reinforcement learning for software engineering and support further research and innovation in this area.

Recommendations

  • Future studies should focus on further evaluating the generalizability of the collected tasks and exploring alternative methods for task filtering and selection.
  • The development of SWE-rebench V2 should be followed by a broader effort to establish comprehensive benchmarks for software engineering research and practice, incorporating a diverse range of tasks and scenarios.
