Controllable Image Generation with Composed Parallel Token Prediction
arXiv:2604.05730v1. Abstract: Conditional discrete generative models struggle to faithfully compose multiple input conditions. To address this, we derive a theoretically-grounded formulation for composing discrete probabilistic generative processes, with masked generation (absorbing diffusion) as a special case. Our formulation enables precise specification of novel combinations and numbers of input conditions that lie outside the training data, with concept weighting enabling emphasis or negation of individual conditions. In synergy with the richly compositional learned vocabulary of VQ-VAE and VQ-GAN, our method attains a $63.4\%$ relative reduction in error rate compared to the previous state-of-the-art, averaged across 3 datasets (positional CLEVR, relational CLEVR and FFHQ), simultaneously obtaining an average absolute FID improvement of $-9.58$. Meanwhile, our method offers a $2.3\times$ to $12\times$ real-time speed-up over comparable methods, and is readily applied to an open pre-trained discrete text-to-image model for fine-grained control of text-to-image generation.
Executive Summary
The paper 'Controllable Image Generation with Composed Parallel Token Prediction' introduces a method for improving the compositional capabilities of conditional discrete generative models in image generation. By deriving a theoretically grounded formulation for composing multiple input conditions, the authors address a key limitation of existing models: the formulation permits precise specification of novel condition combinations, including numbers of conditions not seen during training, and supports concept weighting for emphasis or negation of individual conditions. Built on the learned token vocabulary of VQ-VAE and VQ-GAN, the method achieves a 63.4% relative reduction in error rate across three datasets (positional CLEVR, relational CLEVR, and FFHQ) and an average absolute FID improvement of -9.58 (i.e., a lower FID), while running 2.3x to 12x faster than comparable methods and remaining applicable to pre-trained discrete text-to-image models for fine-grained control. This work advances both the theoretical and practical sides of controllable image generation.
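To make the mechanism concrete, the sketch below illustrates one plausible reading of composed parallel token prediction: at each masked-decoding step, per-condition token distributions are combined in log-probability space with per-condition weights (positive weights emphasize a condition, negative weights negate it), and the most confident masked positions are filled in parallel. This is a minimal sketch under assumed details; the function names, the product-of-experts style combination, and the confidence-based unmasking schedule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def _log_softmax(z):
    # Normalize logits into log-probabilities along the vocabulary axis.
    return z - np.logaddexp.reduce(z, axis=-1, keepdims=True)

def compose_logits(per_cond_logits, weights, uncond_logits):
    """Combine per-condition token logits in log-probability space (assumed
    product-of-experts style composition):
        log p(x | c_1..c_n) ~= log p(x) + sum_i w_i * (log p(x | c_i) - log p(x))
    Positive w_i emphasizes condition i; negative w_i negates it."""
    base = _log_softmax(uncond_logits)
    composed = base.copy()
    for logits, w in zip(per_cond_logits, weights):
        composed += w * (_log_softmax(logits) - base)
    return composed

def parallel_masked_decode(predictor, conditions, weights, seq_len, vocab_size,
                           mask_id, num_steps=8, rng=None):
    """Illustrative MaskGIT-style parallel decoding with composed conditions.
    `predictor(tokens, condition)` is a hypothetical stand-in for the trained
    token predictor; it must return logits of shape (seq_len, vocab_size), and
    `condition=None` requests the unconditional prediction."""
    rng = rng or np.random.default_rng(0)
    tokens = np.full(seq_len, mask_id)
    for step in range(num_steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        per_cond = [predictor(tokens, c) for c in conditions]   # one pass per condition
        uncond = predictor(tokens, None)                        # unconditional pass
        probs = np.exp(_log_softmax(compose_logits(per_cond, weights, uncond)))
        sampled = np.array([rng.choice(vocab_size, p=p / p.sum()) for p in probs])
        confidence = probs[np.arange(seq_len), sampled]
        # Reveal a growing fraction of the remaining masked positions, most confident first.
        n_reveal = max(1, int(masked.sum() * (step + 1) / num_steps))
        reveal = np.argsort(-confidence * masked)[:n_reveal]
        tokens[reveal] = sampled[reveal]
    return tokens
```

Decoding a growing fraction of positions per step is what makes the prediction parallel: the number of model forward passes scales with the number of steps and conditions rather than with the token-sequence length, which is a plausible source of the reported speed-ups over comparable methods.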
Key Points
- Introduces a theoretically grounded formulation for composing discrete probabilistic generative processes, recovering masked generation (absorbing diffusion) as a special case.
- Enables precise specification of novel combinations and numbers of input conditions, with concept weighting for emphasis or negation of individual conditions.
- Achieves state-of-the-art results, with a 63.4% relative reduction in error rate and an average absolute FID improvement of -9.58 across three datasets, alongside a 2.3x to 12x speed-up over comparable methods.
- Applies directly to pre-trained discrete text-to-image models, enabling fine-grained control of text-to-image generation.
Merits
Theoretical Rigor
The paper derives a mathematically sound formulation for composing discrete generative processes, one that recovers masked generation (absorbing diffusion) as a special case and provides a robust foundation for further research.
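The derivation itself is not reproduced in this summary; for orientation, the display below writes out the weighted product-of-experts form that compositions of this kind typically take, where positive weights $w_i$ emphasize a condition and negative weights negate it. The notation is an assumption for illustration rather than a quotation of the paper's equations.

```latex
% Assumed form of the weighted composition over a discrete token sequence x
% given conditions c_1, ..., c_n (illustrative; not quoted from the paper):
p(x \mid c_1, \dots, c_n) \;\propto\; p(x) \prod_{i=1}^{n}
    \left( \frac{p(x \mid c_i)}{p(x)} \right)^{w_i}
\quad\Longleftrightarrow\quad
\log p(x \mid c_1, \dots, c_n) = \log p(x)
    + \sum_{i=1}^{n} w_i \bigl( \log p(x \mid c_i) - \log p(x) \bigr) + \mathrm{const.}
```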
Performance Improvements
The method achieves substantial reductions in error rates (63.4%) and improvements in FID (-9.58) across multiple datasets, demonstrating its effectiveness in enhancing image generation quality and controllability.
Computational Efficiency
The approach offers significant speed-ups (2.3x to 12x) over comparable methods, making it more accessible and scalable for real-world applications without compromising performance.
Practical Applicability
The method is readily applicable to pre-trained discrete text-to-image models, enabling fine-grained control and broadening its utility in industrial and academic settings.
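As a hedged illustration of this point, the snippet below runs the decoding sketch from earlier in this summary on a stand-in predictor; in a real pipeline the predictor would wrap a pre-trained discrete text-to-image model's masked-token transformer, and the output tokens would be decoded to pixels by that model's VQ decoder. All names and values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SEQ, MASK_ID = 1024, 256, 1024   # assumed 16x16 grid of VQ tokens

def predictor(tokens, condition):
    # Stand-in for a real conditional masked-token predictor: returns random
    # logits so the example runs without a pre-trained model.
    return rng.normal(size=(SEQ, VOCAB))

conditions = ["a portrait photo", "wearing round glasses", "a beard"]
weights    = [1.0,                 1.5,                    -1.0]   # emphasize glasses, negate beard

# Reuses parallel_masked_decode from the earlier sketch.
tokens = parallel_masked_decode(predictor, conditions, weights,
                                seq_len=SEQ, vocab_size=VOCAB, mask_id=MASK_ID)
# In practice `tokens` would be reshaped to the VQ grid and decoded to pixels
# by the pre-trained model's VQ-GAN / VQ-VAE decoder.
```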
Demerits
Limited Generalization to Non-Discrete Models
The formulation is tailored for discrete generative models (e.g., VQ-VAE, VQ-GAN), which may limit its applicability to continuous or hybrid generative models like diffusion models or GANs.
Dependency on Rich Vocabulary
The method's performance heavily relies on the richness and compositionality of the learned vocabulary of VQ-VAE/VQ-GAN, which may not be universally available across all models or datasets.
Complexity of Concept Weighting
While concept weighting enables fine-grained control, the complexity of specifying and optimizing these weights may pose challenges for non-expert users or in scenarios requiring automated control.
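One way such automation could look, sketched below under assumed interfaces, is a small grid search over candidate weights that keeps the setting best satisfying the conditions; `generate` and `score` are hypothetical stand-ins for sampling with given weights and for a condition-satisfaction metric (for example a classifier or CLIP-style scorer), not part of the paper.

```python
import itertools
import numpy as np

def grid_search_weights(generate, score, conditions, grid=(-1.0, 0.5, 1.0, 1.5)):
    """Brute-force search over concept weights (illustrative only; the cost grows
    as len(grid) ** len(conditions), so this suits only a handful of conditions)."""
    best_weights, best_score = None, -np.inf
    for candidate in itertools.product(grid, repeat=len(conditions)):
        s = score(generate(conditions, list(candidate)), conditions)
        if s > best_score:
            best_weights, best_score = list(candidate), s
    return best_weights, best_score
```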
Expert Commentary
This paper represents a significant advancement in the field of conditional generative models, particularly in addressing the longstanding challenge of composing multiple input conditions. The theoretical formulation is both elegant and rigorous, extending the foundational work on masked generation and discrete probabilistic processes. The empirical results are compelling, with substantial improvements in error rates and FID, demonstrating the method's superiority over state-of-the-art approaches. The computational efficiency gains are particularly noteworthy, as they address a critical bottleneck in deploying such models in real-world applications. However, the reliance on the compositional richness of VQ-VAE/VQ-GAN vocabularies may limit its applicability in some contexts, and the complexity of concept weighting could hinder widespread adoption among non-expert users. Additionally, the ethical implications of fine-grained control—such as the potential for misuse in generating deceptive or biased content—warrant careful consideration. Overall, this work sets a new benchmark for controllable image generation and opens avenues for further research in discrete generative modeling and its intersections with continuous processes.
Recommendations
- For researchers, further exploration of the formulation's applicability to continuous or hybrid generative models (e.g., diffusion models) could broaden its impact and address current limitations in generalization.
- For practitioners, user-friendly interfaces or automated tools for concept weighting could make the method's fine-grained control accessible to non-expert users in design, marketing, and other creative industries.
- For policymakers, proactive engagement with stakeholders to establish ethical guidelines and safeguards for controllable image generation is essential to mitigate risks such as deepfake proliferation, misinformation, and bias amplification while fostering innovation.
- For developers of pre-trained models, integrating the proposed method into existing pipelines (e.g., Stable Diffusion, DALL-E) could unlock new capabilities for users and broaden the practical utility and adoption of controllable image generation tools.
Sources
Original: arXiv - cs.LG