
TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models

arXiv:2602.15449v1 Announce Type: new Abstract: Large Language Models (LLMs) are changing the coding paradigm, known as vibe coding, yet synthesizing algorithmically sophisticated and robust code still remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability, with less capable models achieving greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.
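
The abstract's core mechanism, a per-problem four-tier test suite whose tiers balance the reward signal, can be pictured with a minimal sketch. The tier names (basic, intermediate, complex, edge) come from the paper; the data structures, equal tier weights, and weighted pass-rate reward below are illustrative assumptions, not the authors' released implementation:

```python
from dataclasses import dataclass, field

# Tier names follow the paper; everything else here is an assumption.
TIERS = ("basic", "intermediate", "complex", "edge")

@dataclass
class TestCase:
    tier: str          # one of TIERS
    inputs: tuple      # positional arguments for the candidate function
    expected: object   # expected return value

@dataclass
class TestSuite:
    cases: list = field(default_factory=list)

    def reward(self, candidate, weights=None):
        """Weighted pass rate per tier, so a flood of easy test cases
        cannot dominate the reward signal (a hypothetical scheme)."""
        weights = weights or {t: 1.0 / len(TIERS) for t in TIERS}
        total = 0.0
        for tier in TIERS:
            tier_cases = [c for c in self.cases if c.tier == tier]
            if not tier_cases:
                continue  # tiers with no cases contribute nothing
            passed = sum(
                1 for c in tier_cases
                if _safe_call(candidate, c.inputs) == c.expected
            )
            total += weights[tier] * passed / len(tier_cases)
        return total

def _safe_call(fn, args):
    """Run a candidate safely; crashes count as failures."""
    try:
        return fn(*args)
    except Exception:
        return None
```

Under this scheme a candidate that passes every basic test but fails every edge test earns at most half the reward of one that is robust across tiers, which is exactly the imbalance the paper argues unweighted test counts fail to capture.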

Executive Summary

The article proposes TAROT, a test-driven, capability-adaptive curriculum for Reinforcement Fine-Tuning (RFT) of Large Language Models (LLMs) on code generation. For each problem, TAROT constructs a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, it decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection among curriculum policies. Experiments show that the optimal curriculum depends on a model's inherent capability: less capable models gain more from an easy-to-hard progression, while stronger models do better under a hard-first curriculum. By adaptively tailoring the curriculum to a model's capability, TAROT consistently improves the functional correctness and robustness of generated code.
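
The capability-conditioned selection described above can be sketched as follows. This is a minimal illustration, assuming capability is probed via a pass rate on held-out problems and that difficulty estimates are available per problem; the probe, the threshold, and both function names are hypothetical, while the easy-to-hard versus hard-first finding itself comes from the paper:

```python
def choose_curriculum(probe_pass_rate, threshold=0.5):
    """Pick a curriculum policy from an estimate of model capability,
    not from raw reward scores. The probe and threshold are illustrative;
    the paper's finding is only that weaker models benefit from
    easy-to-hard ordering and stronger models from hard-first."""
    return "easy_to_hard" if probe_pass_rate < threshold else "hard_first"

def order_problems(problems, policy):
    """problems: list of (problem_id, estimated_difficulty) pairs.
    Returns the training order implied by the chosen policy."""
    by_difficulty = sorted(problems, key=lambda p: p[1])
    return by_difficulty if policy == "easy_to_hard" else by_difficulty[::-1]
```

The key design point is that `choose_curriculum` never sees training rewards: curriculum progression is driven by an external capability estimate, which is what lets the same pipeline serve both weak and strong models.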

Key Points

  • TAROT constructs a four-tier test suite for curriculum design and evaluation
  • The approach decouples curriculum progression from raw reward scores
  • Experiments show the optimal curriculum depends on model capability: easy-to-hard for weaker models, hard-first for stronger ones

Merits

Novel Approach to Curriculum Design

TAROT's four-tier test suite and capability-conditioned evaluation provide a novel approach to curriculum design, addressing the challenges in synthesizing algorithmically sophisticated and robust code with LLMs.

Improved Code Generation

The proposed method has the potential to improve the functional correctness and robustness of generated code, making it more reliable and efficient.

Demerits

Limited Experimental Scope

The article's experimental results are limited to a specific set of models and tasks, which may not be representative of the broader range of LLMs and code generation tasks.

Dependence on Model Capability

Because the optimal curriculum is tied to a model's inherent capability, TAROT's gains depend on assessing that capability accurately; a misjudged capability level could yield a poorly matched curriculum, particularly for less capable models.

Expert Commentary

The article presents a well-motivated approach to curriculum design and evaluation for RFT-based code generation. Its central empirical finding, that the best curriculum ordering depends on a model's inherent capability, is both actionable and reproducible, with code and data publicly released. That said, the limited experimental scope and the reliance on accurate capability assessment mean further evaluation across model families and task distributions is needed before the method's generality can be established.

Recommendations

  • Further experimentation and evaluation are necessary to fully understand the potential and limitations of TAROT
  • The proposed method should be applied to a broader range of LLMs and code generation tasks to validate its generalizability
