dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models

arXiv:2603.18806v1 Announce Type: new Abstract: Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass of a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks. Results show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Moreover, dTRPO exhibits strong training efficiency due to its offline, single-forward nature, and achieves improved generation efficiency through high-quality outputs.

Executive Summary

The article proposes Trajectory Reduction Policy Optimization (dTRPO), a novel policy optimization approach for Diffusion Large Language Models (dLLMs). dTRPO cuts the cost of trajectory probability calculation via two proven results: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of the ratio for the intermediate diffusion states, and (ii) the probability of the full trajectory can be estimated with a single forward pass over a re-masked final state. Together these enable scaled-up offline policy training and yield performance gains of up to 9.6% on STEM tasks, 4.3% on coding tasks, and 3.0% on instruction-following tasks, alongside strong training efficiency and improved generation efficiency. The findings have significant implications for building more efficient and effective dLLMs, particularly in applications requiring high-quality outputs.
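The single-forward estimate described above can be illustrated with a toy sketch. Everything here is hypothetical: the paper's model interface is not shown in this digest, so `toy_denoiser`, the `MASK` token, and the tiny vocabulary are stand-ins chosen purely for illustration.

```python
import math

MASK = "<m>"  # placeholder mask token (assumed, not from the paper)

def toy_denoiser(seq):
    """Stand-in for one forward pass of a dLLM: for each masked
    position, return a probability distribution over a tiny vocabulary.
    A real model would condition on the unmasked context; this toy
    returns a fixed distribution to keep the sketch self-contained."""
    probs = {"a": 0.5, "b": 0.3, "c": 0.2}
    return {i: probs for i, tok in enumerate(seq) if tok == MASK}

def single_forward_logprob(final_seq):
    """Estimate the full-trajectory log-probability from ONE forward
    pass: re-mask every generated position of the final state and sum
    the model's log-probabilities of the original tokens there."""
    remasked = [MASK] * len(final_seq)   # re-masked final state
    dists = toy_denoiser(remasked)       # the single forward pass
    return sum(math.log(dists[i][tok]) for i, tok in enumerate(final_seq))

logp = single_forward_logprob(["a", "b", "a"])
```

The point of the sketch is the cost profile: instead of one forward pass per denoising step along the whole trajectory, a single pass over the re-masked final state suffices for the estimate.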

Key Points

  • dTRPO reduces the cost of trajectory probability calculation for dLLMs
  • Two results drive the reduction: an unbiased probability ratio over newly unmasked tokens (under reference policy regularization) and single-forward estimation of a re-masked final state
  • dTRPO achieves substantial performance gains across various tasks and exhibits strong training efficiency
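The first key point can be sketched as follows. This is an illustrative toy, not the paper's implementation: the function name, the dict-based per-token log-probability interface, and the set representation of masked positions are all assumptions.

```python
import math

def step_ratio(logp_new, logp_old, masked_before, masked_after):
    """Policy probability ratio pi_theta / pi_old for one denoising
    step, restricted to the positions unmasked during that step.
    logp_new / logp_old map position -> per-token log-probability
    under the current and old policies respectively."""
    newly_unmasked = masked_before - masked_after
    return math.exp(sum(logp_new[i] - logp_old[i] for i in newly_unmasked))

# One step unmasks positions 0 and 2; position 1 stays masked.
ratio = step_ratio(
    logp_new={0: math.log(0.6), 2: math.log(0.5)},
    logp_old={0: math.log(0.3), 2: math.log(0.5)},
    masked_before={0, 1, 2},
    masked_after={1},
)
```

Restricting the ratio to the newly unmasked tokens is what avoids scoring the full intermediate state at every step; the abstract's claim is that under reference policy regularization this restricted ratio is an unbiased estimate of the full intermediate-state ratio.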

Merits

Strength

dTRPO's offline, single-forward nature enables significant training efficiency improvements

Strength

dTRPO improves generation efficiency by producing higher-quality outputs

Strength

dTRPO achieves substantial performance gains across various tasks

Demerits

Limitation

The approach may not generalize well to other types of language models beyond dLLMs

Limitation

The computational cost of re-masked final state estimation may be high for large models

Limitation

The approach requires careful tuning of hyperparameters to achieve optimal performance

Expert Commentary

The article presents a well-researched and carefully implemented approach to policy optimization for dLLMs. The newly-unmasked-token probability ratio and the single-forward re-masked state estimate are the key innovations, and together they make large-scale offline training of dLLMs tractable. The substantial performance gains achieved by dTRPO across tasks are impressive and demonstrate the potential of this approach. That said, limitations remain: the method may not transfer to language models outside the diffusion paradigm, re-masked final state estimation may still be costly for very large models, and careful hyperparameter tuning appears necessary. Nevertheless, the findings are significant and carry important implications for the development of more efficient and effective dLLMs.

Recommendations

  • Future research should explore the application of dTRPO to other types of language models
  • Careful tuning of hyperparameters is necessary to achieve optimal performance with dTRPO

Sources