Discrete Flow Matching Policy Optimization
arXiv:2604.06491v1 Announce Type: new Abstract: We introduce Discrete flow Matching policy Optimization (DoMinO), a unified framework for Reinforcement Learning (RL) fine-tuning of Discrete Flow Matching (DFM) models under a broad class of policy gradient methods. Our key idea is to view the DFM sampling procedure as a multi-step Markov Decision Process. This perspective provides a simple and transparent reformulation of fine-tuning reward maximization as a robust RL objective. Consequently, it not only preserves the original DFM samplers but also avoids biased auxiliary estimators and likelihood surrogates used by many prior RL fine-tuning methods. To prevent policy collapse, we also introduce new total-variation regularizers to keep the fine-tuned distribution close to the pretrained one. Theoretically, we establish an upper bound on the discretization error of DoMinO and tractable upper bounds for the regularizers. Experimentally, we evaluate DoMinO on regulatory DNA sequence design. DoMinO achieves stronger predicted enhancer activity and better sequence naturalness than the previous best reward-driven baselines. The regularization further improves alignment with the natural sequence distribution while preserving strong functional performance. These results establish DoMinO as a useful framework for controllable discrete sequence generation.
Executive Summary
The article introduces Discrete Flow Matching Policy Optimization (DoMinO), a novel framework for fine-tuning Discrete Flow Matching (DFM) models with Reinforcement Learning (RL). By conceptualizing DFM sampling as a multi-step Markov Decision Process, DoMinO reformulates reward maximization as an RL objective, thereby avoiding common pitfalls such as biased auxiliary estimators while preserving the original DFM samplers. The framework incorporates total-variation regularizers to mitigate policy collapse and maintain distributional fidelity to the pretrained model. Theoretical contributions include discretization error bounds and tractable bounds for the regularizers. Empirical validation on regulatory DNA sequence design demonstrates DoMinO's superior performance in enhancer activity and sequence naturalness, with regularization further enhancing alignment to natural distributions while sustaining functional efficacy. This positions DoMinO as a promising approach for controllable discrete sequence generation.
Key Points
- ▸ DoMinO unifies DFM fine-tuning with RL by framing DFM sampling as a multi-step MDP.
- ▸ The approach avoids biased auxiliary estimators and likelihood surrogates, preserving original DFM samplers.
- ▸ Total-variation regularizers are introduced to prevent policy collapse and maintain fidelity to pretrained distributions.
- ▸ Theoretical analysis provides upper bounds for discretization error and regularizers.
- ▸ Experimental validation on regulatory DNA sequence design shows improved enhancer activity and sequence naturalness, with regularization balancing performance and naturalness.
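The MDP framing summarized above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the paper's implementation: `policy_probs` mimics a DFM denoiser, and `reward` mimics a predicted-activity score. The point it demonstrates is that each denoising step becomes an MDP action whose log-probability enters a REINFORCE-style policy-gradient loss, with no likelihood surrogate required.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LENGTH, STEPS = 4, 8, 5  # toy DNA alphabet, sequence length, sampler steps

def policy_probs(x, t, theta):
    # Stand-in for the DFM denoiser: a per-position categorical over tokens.
    logits = theta[t] + 0.0 * x[:, None]            # shape (LENGTH, VOCAB)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def sample_trajectory(theta):
    """Run the multi-step sampler, recording per-step log-probs (the MDP view)."""
    x = rng.integers(0, VOCAB, size=LENGTH)         # fully noised start state
    logps = []
    for t in range(STEPS):                          # each step is one MDP action
        probs = policy_probs(x, t, theta)
        x = np.array([rng.choice(VOCAB, p=p) for p in probs])
        logps.append(np.log(probs[np.arange(LENGTH), x]).sum())
    return x, np.array(logps)

def reward(x):
    # Hypothetical sequence-level reward, e.g. a predicted-activity score.
    return float((x == 0).mean())

def reinforce_loss(theta, n_samples=16):
    # Score-function estimator over the whole trajectory:
    # grad E[R] ~ E[ R(x) * sum_t grad log pi(a_t | s_t) ].
    losses = [-reward(x) * lp.sum()
              for x, lp in (sample_trajectory(theta) for _ in range(n_samples))]
    return float(np.mean(losses))

theta = rng.normal(size=(STEPS, LENGTH, VOCAB))
loss = reinforce_loss(theta)
```

Because the sampler itself is the policy, the trajectory used for the gradient is exactly the one the pretrained DFM sampler would produce, which is the sense in which the original samplers are preserved.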
Merits
Conceptual Elegance and Unification
Framing DFM sampling as a multi-step MDP is a highly insightful and elegant conceptual leap, providing a robust foundation for integrating DFM with policy gradient RL methods. This unification avoids ad-hoc solutions often seen in prior work.
Robustness and Bias Reduction
The explicit avoidance of biased auxiliary estimators and likelihood surrogates is a significant strength, contributing to more reliable and stable fine-tuning. This directly addresses a known vulnerability in many RL fine-tuning paradigms.
Distributional Fidelity and Stability
The introduction of total-variation regularizers is a pragmatic and theoretically sound mechanism to prevent policy collapse and ensure the fine-tuned distribution remains close to the pretrained one. This is crucial for maintaining the 'naturalness' or domain-specific characteristics of generated sequences.
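As a concrete illustration of this mechanism (a minimal sketch, not the paper's actual estimator; the per-step distributions and the weight `lam` are assumptions for exposition), a TV penalty can anchor each fine-tuned sampling distribution to its pretrained counterpart:

```python
import numpy as np

def tv_distance(p, q):
    """Total-variation distance between two categorical distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def regularized_objective(sample_reward, tuned_steps, pretrained_steps, lam=0.1):
    # Reward minus a per-step TV penalty that keeps the fine-tuned policy
    # close to the pretrained one; lam trades reward against naturalness.
    penalty = sum(tv_distance(p, q) for p, q in zip(tuned_steps, pretrained_steps))
    return sample_reward - lam * penalty

# Identical per-step policies incur no penalty; the first shifted step below
# costs a TV distance of 0.75 on a 4-token alphabet.
base = [[0.25, 0.25, 0.25, 0.25]] * 3
shifted = [[1.0, 0.0, 0.0, 0.0]] + base[1:]
```

Because TV is bounded in [0, 1] per step, the penalty cannot dominate the objective arbitrarily, which is one intuition for why such regularizers stabilize fine-tuning against policy collapse.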
Theoretical Rigor
The establishment of upper bounds on discretization error and tractable bounds for regularizers demonstrates a strong theoretical underpinning, providing confidence in the method's stability and predictability.
Empirical Efficacy and Practical Application
Successful application to regulatory DNA sequence design, achieving superior performance in both functional activity and naturalness, highlights the practical utility and real-world applicability of DoMinO in a complex biological domain.
Demerits
Computational Cost of TV Regularization
While theoretically sound, total-variation regularization can be computationally intensive, especially for high-dimensional discrete spaces. The 'tractable upper bounds' do not necessarily imply computational efficiency in practice, which could limit scalability for extremely long or complex sequences.
Generality of MDP Formulation
While the MDP formulation is elegant for DFM, its direct applicability to other discrete generative models (e.g., autoregressive models without a 'flow' interpretation) might not be as straightforward, potentially limiting the 'unified framework' claim to DFM variants.
Sensitivity to Hyperparameters
RL methods, including policy gradient, are often sensitive to hyperparameters (e.g., learning rates, regularization weights). The practical fine-tuning of DoMinO might require significant tuning effort, which is not extensively discussed.
Scope of Experimental Validation
While regulatory DNA sequence design is a compelling application, evaluation across a broader range of discrete sequence generation tasks (e.g., text, code, other molecular designs) would further solidify the claims of general utility and robustness.
Expert Commentary
DoMinO represents a significant advancement in the challenging landscape of controllable discrete sequence generation, particularly by bridging the often-disparate fields of flow matching and reinforcement learning. The core insight of re-framing DFM sampling as a multi-step MDP is not merely an incremental improvement but a foundational conceptual shift that elegantly unifies these paradigms. This robust formulation inherently tackles issues of bias and instability that plague many reward-driven fine-tuning methods. The theoretical grounding, particularly the careful consideration of discretization error and regularization, elevates DoMinO beyond a heuristic approach. The experimental results in regulatory DNA design are compelling, showcasing a rare combination of enhanced functional performance and preservation of distributional naturalness, which is often a difficult trade-off. From a legal and ethical perspective, such precise control over the generation of functional biological sequences intensifies the urgency for robust regulatory oversight and IP frameworks. While computational scalability of TV regularization might be a practical consideration for extremely large sequence spaces, the theoretical elegance and empirical performance make DoMinO a pivotal contribution, setting a new benchmark for fine-tuning discrete generative models.
Recommendations
- ✓ Conduct further research into computationally efficient approximations or variants of total-variation regularization to enhance scalability for very high-dimensional discrete spaces.
- ✓ Explore the applicability of DoMinO to other discrete sequence generation domains beyond biological design, such as text generation (e.g., legal document synthesis), code generation, or material design, to validate its generality.
- ✓ Investigate the interpretability aspects of DoMinO's generated sequences, particularly in scientific contexts, to move beyond 'what works' to 'why it works', potentially through feature attribution methods or sensitivity analysis.
- ✓ Develop robust open-source implementations and benchmarks to facilitate broader adoption, comparative studies, and further innovation within the research community.
Sources
Original: arXiv - cs.LG