CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models
arXiv:2602.17684v1 Announce Type: cross Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
Executive Summary
The proposed CodeScaler model aims to improve the scalability of code large language models by introducing an execution-free reward model. This approach enables reinforcement learning training and test-time inference without relying on high-quality test cases, thereby addressing a significant limitation of existing methods. CodeScaler improves Qwen3-8B-Base by an average of +11.72 points across five coding benchmarks, outperforming binary execution-based RL by +1.82 points, and matches unit-test-based test-time scaling while cutting latency roughly 10-fold. Its effectiveness extends beyond code: on RM-Bench it surpasses existing reward models in the general and reasoning domains as well.
Key Points
- ▸ CodeScaler introduces an execution-free reward model for scaling code large language models
- ▸ The model is trained on curated preference data derived from verified code problems
- ▸ CodeScaler surpasses existing reward models on RM-Bench in the code domain (+3.3 points) and in general and reasoning domains (+2.7 points on average)
Merits
Improved Scalability
CodeScaler enables scalable reinforcement learning on synthetic datasets without requiring high-quality test cases
Enhanced Performance
CodeScaler outperforms binary execution-based RL and achieves comparable results to unit test approaches
Reduced Latency
CodeScaler provides a 10-fold reduction in latency compared to unit test approaches
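The latency win comes from how execution-free test-time scaling works: instead of running every sampled candidate against unit tests in a sandbox, a reward model scores each candidate in a single forward pass and the top-scoring one is returned. A best-of-N sketch follows; the `reward_fn` here is a toy stand-in (a trained reward model like CodeScaler would take its place), and the interface is assumed, not taken from the paper.

```python
def best_of_n(prompt: str, candidates: list[str], reward_fn) -> str:
    """Execution-free test-time scaling: score every sampled candidate
    with the reward model and return the highest-scoring one. No test
    execution is needed, so scoring cost is one RM forward pass each."""
    scores = [reward_fn(prompt, c) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]

# Toy stand-in reward: prefer shorter solutions. A real deployment
# would call the trained reward model here instead.
toy_rm = lambda prompt, code: -len(code)
print(best_of_n("reverse a list xs",
                ["list(reversed(xs))  # verbose", "xs[::-1]"],
                toy_rm))  # xs[::-1]
```

By contrast, a unit-test reranker must compile and execute each of the N candidates, which is where the reported ~10x latency gap plausibly originates.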
Demerits
Limited Generalizability
CodeScaler's reported gains are confined to the evaluated benchmarks; its effectiveness on unseen languages, domains, and real-world codebases remains unverified
Dependence on Curated Data
CodeScaler relies on carefully curated preference data, which may be time-consuming and expensive to obtain
Expert Commentary
CodeScaler represents a significant advancement in the field of code large language models, addressing a critical limitation of existing methods. The model's execution-free reward mechanism enables scalable reinforcement learning and test-time inference, making it an attractive solution for various industries. However, its effectiveness and generalizability must be carefully evaluated, and potential limitations and biases must be addressed. As AI-powered coding tools become increasingly prevalent, it is essential to consider the broader implications of CodeScaler and similar models on the workforce, ethics, and fairness.
Recommendations
- ✓ Further research is needed to evaluate the generalizability and limitations of CodeScaler
- ✓ Developers and regulatory bodies should consider the potential implications of CodeScaler on AI ethics, fairness, and transparency