CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models
arXiv:2602.17684v1 Announce Type: cross Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
Executive Summary
The proposed CodeScaler model aims to improve the scalability of code large language models by introducing an execution-free reward model. This approach enables reinforcement learning training and test-time inference without relying on high-quality test cases, thereby addressing a significant limitation of existing methods. CodeScaler improves Qwen3-8B-Base by an average of +11.72 points across five coding benchmarks, outperforming binary execution-based RL by +1.82 points, and matches unit-test-based test-time scaling while cutting latency roughly 10-fold. Its effectiveness extends beyond code: on RM-Bench it surpasses existing reward models in the general and reasoning domains as well.
Key Points
- ▸ CodeScaler introduces an execution-free reward model for scaling code large language models
- ▸ The model is trained on curated preference data derived from verified code problems
- ▸ CodeScaler surpasses existing reward models on RM-Bench in the code domain (+3.3 points) and in general and reasoning domains (+2.7 points on average)
Merits
Improved Scalability
CodeScaler enables scalable reinforcement learning on synthetic datasets without requiring high-quality test cases
Enhanced Performance
CodeScaler outperforms binary execution-based RL and achieves comparable results to unit test approaches
Reduced Latency
CodeScaler provides a 10-fold reduction in latency compared to unit test approaches
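The latency win comes from how execution-free test-time scaling works: instead of running every sampled candidate against unit tests in a sandbox, a reward model scores each candidate in a single forward pass and the top-scoring one is returned. A best-of-N sketch follows; the `reward_fn` here is a toy stand-in (a trained reward model like CodeScaler would take its place), and the interface is assumed, not taken from the paper.

```python
def best_of_n(prompt: str, candidates: list[str], reward_fn) -> str:
    """Execution-free test-time scaling: score every sampled candidate
    with the reward model and return the highest-scoring one. No test
    execution is needed, so scoring cost is one RM forward pass each."""
    scores = [reward_fn(prompt, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]

# Toy stand-in reward: prefer shorter solutions. A real deployment
# would call the trained reward model here instead.
toy_rm = lambda prompt, code: -len(code)
print(best_of_n("reverse a list xs",
                ["list(reversed(xs))  # verbose", "xs[::-1]"],
                toy_rm))  # xs[::-1]
```

By contrast, a unit-test reranker must compile and execute each of the N candidates, which is where the reported ~10x latency gap plausibly originates.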
Demerits
Limited Generalizability
CodeScaler's reported gains are confined to the evaluated benchmarks; its effectiveness on unseen languages, domains, and real-world codebases remains unverified
Dependence on Curated Data
CodeScaler relies on carefully curated preference data, which may be time-consuming and expensive to obtain
Expert Commentary
CodeScaler represents a significant advancement in the field of code large language models, addressing a critical limitation of existing methods. The model's execution-free reward mechanism enables scalable reinforcement learning and test-time inference, making it an attractive solution for various industries. However, its effectiveness and generalizability must be carefully evaluated, and potential limitations and biases must be addressed. As AI-powered coding tools become increasingly prevalent, it is essential to consider the broader implications of CodeScaler and similar models on the workforce, ethics, and fairness.
Recommendations
- ✓ Further research is needed to evaluate the generalizability and limitations of CodeScaler
- ✓ Developers and regulatory bodies should consider the potential implications of CodeScaler on AI ethics, fairness, and transparency