EnterpriseGym Corecraft: Training Generalizable Agents on High-Fidelity RL Environments
arXiv:2602.16179v1 Announce Type: new Abstract: We show that training AI agents on high-fidelity reinforcement learning environments produces capabilities that generalize beyond the training distribution. We introduce Corecraft, the first environment in EnterpriseGym, Surge AI's suite of agentic RL environments. Corecraft is a fully operational enterprise simulation of a customer support organization, comprising over 2,500 entities across 14 entity types with 23 unique tools, designed to measure whether AI agents can perform the multi-step, domain-specific work that real jobs demand. Frontier models such as GPT-5.2 and Claude Opus 4.6 solve fewer than 30% of tasks when all expert-authored rubric criteria must be satisfied. Using this environment, we train GLM 4.6 with Group Relative Policy Optimization (GRPO) and adaptive clipping. After a single epoch of training, the model improves from 25.37% to 36.76% task pass rate on held-out evaluation tasks. More importantly, these gains transfer to out-of-distribution benchmarks: +4.5% on BFCL Parallel, +7.4% on τ²-Bench Retail, and +6.8% on Toolathlon (Pass@1). We believe three environment properties are consistent with the observed transfer: task-centric world building that optimizes for diverse, challenging tasks; expert-authored rubrics enabling reliable reward computation; and enterprise workflows that reflect realistic professional patterns. Our results suggest that environment quality, diversity, and realism are key factors enabling generalizable agent capabilities.
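The training recipe named in the abstract can be sketched in a few lines: GRPO normalizes each rollout's reward against the mean and standard deviation of its sampling group, and the resulting advantage feeds a PPO-style clipped surrogate. The paper's exact adaptive-clipping rule is not described in this article, so the asymmetric clip bounds below (`eps_low`, `eps_high`) are an illustrative assumption, not the authors' implementation.

```python
import math

def grpo_advantages(rewards):
    """Group-relative advantages (GRPO): normalize each rollout's
    reward by the mean and std of its own sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8  # avoid division by zero
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate for one token/step.

    Asymmetric bounds stand in for the paper's unspecified
    'adaptive clipping'; the pessimistic min() is standard."""
    clipped_ratio = max(min(ratio, 1.0 + eps_high), 1.0 - eps_low)
    return min(ratio * advantage, clipped_ratio * advantage)
```

With binary task rewards (pass/fail against a rubric), the group normalization means a passing rollout in a mostly-failing group receives a large positive advantage, which is what makes hard, rarely-solved tasks informative.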
Executive Summary
This article presents EnterpriseGym Corecraft, a high-fidelity reinforcement learning environment for training AI agents on realistic enterprise work. The authors show that a model trained in this environment (GLM 4.6, trained with GRPO) improves not only on held-out in-distribution tasks, rising from 25.37% to 36.76% pass rate after one epoch, but also on out-of-distribution tool-use benchmarks. They attribute these gains to environment quality, diversity, and realism, and their results contribute concrete evidence to the ongoing discussion of how environment design shapes generalizable agent capabilities.
Key Points
- ▸ The introduction of EnterpriseGym Corecraft, a high-fidelity reinforcement learning environment for training AI agents
- ▸ Demonstration of generalizable capabilities in models trained on this environment
- ▸ Importance of environment quality, diversity, and realism in achieving generalizable agent capabilities
Merits
Strength in Environment Design
The authors have built a detailed, realistic simulation of a customer support organization (over 2,500 entities across 14 entity types, exposed through 23 tools), enabling robust testing and evaluation of agentic models on multi-step, domain-specific work.
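Evaluation in this environment is all-or-nothing: per the abstract, a task counts as solved only when every expert-authored rubric criterion is satisfied. The paper's rubric schema is not published in this article, so the sketch below assumes a hypothetical interface where each criterion is a boolean predicate over the agent's transcript.

```python
def rubric_reward(transcript, criteria):
    """All-or-nothing task reward: 1.0 only if every expert-authored
    rubric criterion passes, else 0.0.

    `criteria` is a list of boolean predicates over the transcript;
    this predicate interface is an assumption for illustration."""
    return 1.0 if all(check(transcript) for check in criteria) else 0.0
```

This conjunctive scoring explains why frontier models solve fewer than 30% of tasks: a single missed criterion zeroes out an otherwise competent rollout, and it also yields a clean binary signal for reward computation during RL training.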
Contribution to Generalizability Research
This study offers evidence that environment quality and realism, not only model scale or algorithm choice, drive generalizable agent capabilities, as demonstrated by the transfer of in-environment training gains to out-of-distribution benchmarks.
Demerits
Limited Model Selection
The article evaluates a single base model (GLM 4.6) trained with a single optimization technique (GRPO with adaptive clipping), which may limit how broadly the results generalize across model families and training methods.
Lack of Comparative Analysis
The study does not compare Corecraft against other reinforcement learning environments or training setups, making it difficult to assess the environment's effectiveness relative to existing alternatives.
Expert Commentary
This study makes a meaningful contribution to agentic AI research by demonstrating that environment quality, diversity, and realism can produce capabilities that transfer beyond the training distribution. The limitations noted above, namely the single base model and the absence of cross-environment comparisons, temper the conclusions, but the reported out-of-distribution gains on BFCL Parallel, τ²-Bench Retail, and Toolathlon are compelling. The implications extend to domains such as customer support, healthcare, and finance, where agents must execute multi-step, domain-specific workflows. As the field evolves, sustained investment in environment design and development will be essential to building AI systems that operate effectively in real-world scenarios.
Recommendations
- ✓ Future research should focus on developing a diverse range of high-fidelity environments that can simulate various real-world scenarios, allowing for more comprehensive evaluation and comparison of AI models.
- ✓ The development of more advanced optimization techniques and model architectures should be prioritized, particularly those that can effectively leverage the strengths of high-fidelity environments like EnterpriseGym Corecraft.