Duality Models: An Embarrassingly Simple One-step Generation Paradigm
arXiv:2602.17682v1 Announce Type: new Abstract: Consistency-based generative models like Shortcut and MeanFlow achieve impressive results via a target-aware design for solving the Probability Flow ODE (PF-ODE). Typically, such methods introduce a target time $r$ alongside the current time $t$ to modulate outputs between a local multi-step derivative ($r = t$) and a global few-step integral ($r = 0$). However, the conventional "one input, one output" paradigm enforces a partition of the training budget, often allocating a significant portion (e.g., 75% in MeanFlow) solely to the multi-step objective for stability. This separation forces a trade-off: allocating sufficient samples to the multi-step objective leaves the few-step generation undertrained, which harms convergence and limits scalability. To this end, we propose Duality Models (DuMo) via a "one input, dual output" paradigm. Using a shared backbone with dual heads, DuMo simultaneously predicts velocity $v_t$ and flow-map $u_t$ from a single input $x_t$. This applies geometric constraints from the multi-step objective to every sample, bounding the few-step estimation without separating training objectives, thereby significantly improving stability and efficiency. On ImageNet 256 $\times$ 256, a 679M Diffusion Transformer with SD-VAE achieves a state-of-the-art (SOTA) FID of 1.79 in just 2 steps. Code is available at: https://github.com/LINs-lab/DuMo
Executive Summary
This article proposes a novel 'one input, dual output' paradigm, dubbed Duality Models (DuMo), for consistency-based generative models. DuMo leverages a shared backbone with dual heads to simultaneously predict the velocity and the flow-map from a single input, applying the geometric constraints of the multi-step objective to every sample without partitioning the training budget between objectives. This significantly improves stability and efficiency: a 679M Diffusion Transformer with SD-VAE achieves state-of-the-art results on ImageNet 256x256 in just two sampling steps. DuMo's efficacy lies in resolving, rather than merely balancing, the trade-off between the multi-step and few-step objectives, enabling efficient convergence and scalability.
Key Points
- ▸ Duality Models (DuMo) employ a 'one input, dual output' paradigm for consistency-based generative models.
- ▸ DuMo uses a shared backbone with dual heads to predict velocity and flow-map from a single input.
- ▸ The approach significantly improves stability and efficiency by applying geometric constraints without separating training objectives.
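The 'one input, dual output' idea can be sketched in a few lines. The snippet below is purely illustrative and is not the authors' implementation: the backbone is a toy one-layer network, and all names and dimensions (`D_IN`, `D_HID`, `W_trunk`, etc.) are hypothetical. It only shows the structural point that one forward pass through a shared trunk yields both the velocity $v_t$ and the flow-map $u_t$ via two separate heads.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
D_IN, D_HID = 8, 16

# Shared-backbone weights (a single trunk reused by both heads),
# plus two independent output heads.
W_trunk = rng.standard_normal((D_IN + 2, D_HID)) * 0.1  # +2 for times t and r
W_v = rng.standard_normal((D_HID, D_IN)) * 0.1          # velocity head
W_u = rng.standard_normal((D_HID, D_IN)) * 0.1          # flow-map head


def dumo_forward(x_t, t, r):
    """Toy 'one input, dual output' pass: a single forward through the
    shared trunk produces BOTH predictions, so every training sample can
    feed both the multi-step and the few-step objective."""
    inp = np.concatenate([x_t, [t, r]])  # condition on current and target time
    h = np.tanh(inp @ W_trunk)           # shared backbone features
    v_t = h @ W_v                        # local velocity prediction
    u_t = h @ W_u                        # global flow-map prediction
    return v_t, u_t


x_t = rng.standard_normal(D_IN)
v, u = dumo_forward(x_t, t=0.5, r=0.0)
```

In a real model the trunk would be the Diffusion Transformer and the heads would be small projection layers, but the control flow (one input, two outputs) is the same.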
Merits
Unified Objective
DuMo's 'one input, dual output' paradigm unifies the multi-step and few-step objectives, eliminating the need for separate training budgets and improving stability and efficiency.
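Schematically, unifying the objectives means a single sample contributes a gradient to both heads at once, instead of being assigned to one loss or the other. The toy function below illustrates only that structural point; the actual DuMo targets are derived from the PF-ODE, and the placeholder arguments here are hypothetical.

```python
import numpy as np


def combined_loss(v_pred, u_pred, v_target, u_target):
    """Toy unified objective: the same sample supplies both residuals,
    so no fixed split of the training budget (e.g. 75%/25%) is needed.
    Targets are placeholders; the paper's targets come from the PF-ODE."""
    loss_v = np.mean((v_pred - v_target) ** 2)  # multi-step (velocity) term
    loss_u = np.mean((u_pred - u_target) ** 2)  # few-step (flow-map) term
    return loss_v + loss_u


rng = np.random.default_rng(1)
v_pred, u_pred = rng.standard_normal(4), rng.standard_normal(4)
zero_loss = combined_loss(v_pred, u_pred, v_pred, u_pred)
```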
Improved Scalability
By applying geometric constraints to every sample, DuMo enables efficient convergence and scalability, making it suitable for large-scale generative modeling tasks.
Demerits
Overreliance on Shared Backbone
DuMo's reliance on a shared backbone may limit its adaptability to different generative modeling tasks, requiring significant modifications to accommodate diverse objectives and constraints.
Computational Complexity
The dual-head architecture and geometric constraints may introduce additional computational complexity, potentially hindering practical applications with limited resources.
Expert Commentary
The article presents a well-structured and well-executed exploration of the Duality Models paradigm, leveraging a novel 'one input, dual output' approach to improve stability and efficiency in consistency-based generative models. While the reliance on a shared backbone and the added cost of the dual-head architecture are potential limitations, the findings are highly relevant to recent advances in generative modeling for computer vision. The proposed approach may enable more scalable and efficient generative models for real-world applications, with implications for AI research and development.
Recommendations
- ✓ Future research should investigate the adaptability of DuMo to different generative modeling tasks and objectives, as well as its potential applications in other domains, such as natural language processing or robotics.
- ✓ The development of more efficient and scalable generative models, such as DuMo, may require significant modifications to existing architectures and training procedures, highlighting the need for innovative and interdisciplinary approaches to AI research.