EnterpriseGym Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
arXiv:2602.16179v1 Announce Type: new Abstract: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce Corecraft, the first environment in EnterpriseGym, Surge AI's suite of agentic RL environments. Corecraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on τ²-Bench Retail, and +6.8% on Toolathlon (Pass@1). We believe three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.
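The training recipe named in the abstract can be sketched in a few lines: GRPO normalizes each rollout's reward against the mean and standard deviation of its sampling group, and the resulting advantage feeds a PPO-style clipped surrogate. The paper's exact adaptive-clipping rule is not described in this article, so the asymmetric clip bounds below (`eps_low`, `eps_high`) are an illustrative assumption, not the authors' implementation.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages (GRPO): normalize each rollout's
    reward by the mean and std of its own sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for one token/step.

    Asymmetric bounds stand in for the paper's unspecified
    'adaptive clipping'; the pessimistic min() is standard."""
    clipped_ratio = max(min(ratio, 1.0 + eps_high), 1.0 - eps_low)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With binary task rewards (pass/fail against a rubric), the group normalization means a passing rollout in a mostly-failing group receives a large positive advantage, which is what makes hard, rarely-solved tasks informative.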
Executive Summary
This article presents EnterpriseGym Corecraft, a high-fidelity reinforcement learning environment for training AI agents on realistic enterprise work. The authors show that a model trained in this environment (GLM 4.6, trained with GRPO) improves not only on held-out in-distribution tasks, rising from 25.37% to 36.76% pass rate after one epoch, but also on out-of-distribution tool-use benchmarks. They attribute these gains to environment quality, diversity, and realism, and their results contribute concrete evidence to the ongoing discussion of how environment design shapes generalizable agent capabilities.
Key Points
- ▸ The introduction of EnterpriseGym Corecraft, a high-fidelity reinforcement learning environment for training AI agents
- ▸ Demonstration of generalizable capabilities in models trained on this environment
- ▸ Importance of environment quality, diversity, and realism in achieving generalizable agent capabilities
Merits
Strength in Environment Design
The authors have built a detailed, realistic simulation of a customer support organization (over 2,500 entities across 14 entity types, exposed through 23 tools), enabling robust testing and evaluation of agentic models on multi-step, domain-specific work.
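Evaluation in this environment is all-or-nothing: per the abstract, a task counts as solved only when every expert-authored rubric criterion is satisfied. The paper's rubric schema is not published in this article, so the sketch below assumes a hypothetical interface where each criterion is a boolean predicate over the agent's transcript.

```python
def rubric_reward(transcript, criteria):
    """All-or-nothing task reward: 1.0 only if every expert-authored
    rubric criterion passes, else 0.0.

    `criteria` is a list of boolean predicates over the transcript;
    this predicate interface is an assumption for illustration."""
    return 1.0 if all(check(transcript) for check in criteria) else 0.0
```

This conjunctive scoring explains why frontier models solve fewer than 30% of tasks: a single missed criterion zeroes out an otherwise competent rollout, and it also yields a clean binary signal for reward computation during RL training.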
Contribution to Generalizability Research
This study offers evidence that environment quality and realism, not only model scale or algorithm choice, drive generalizable agent capabilities, as demonstrated by the transfer of in-environment training gains to out-of-distribution benchmarks.
Demerits
Limited Model Selection
The article evaluates a single base model (GLM 4.6) trained with a single optimization technique (GRPO with adaptive clipping), which may limit how broadly the results generalize across model families and training methods.
Lack of Comparative Analysis
The study does not compare Corecraft against other reinforcement learning environments or training setups, making it difficult to assess the environment's effectiveness relative to existing alternatives.
Expert Commentary
This study makes a meaningful contribution to agentic AI research by demonstrating that environment quality, diversity, and realism can produce capabilities that transfer beyond the training distribution. The limitations noted above, namely the single base model and the absence of cross-environment comparisons, temper the conclusions, but the reported out-of-distribution gains on BFCL Parallel, τ²-Bench Retail, and Toolathlon are compelling. The implications extend to domains such as customer support, healthcare, and finance, where agents must execute multi-step, domain-specific workflows. As the field evolves, sustained investment in environment design and development will be essential to building AI systems that operate effectively in real-world scenarios.
Recommendations
- ✓ Future research should focus on developing a diverse range of high-fidelity environments that can simulate various real-world scenarios, allowing for more comprehensive evaluation and comparison of AI models.
- ✓ The development of more advanced optimization techniques and model architectures should be prioritized, particularly those that can effectively leverage the strengths of high-fidelity environments like EnterpriseGym Corecraft.