OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
arXiv:2602.12304v1 Announce Type: cross Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model's ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.
Executive Summary
The article 'OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model' introduces a novel approach to video customization that synchronizes both video identity and audio timbre. The proposed framework, OmniCustom, leverages a DiT-based model to generate videos that maintain the identity of a reference image while imitating the timbre of a reference audio clip, with spoken content specified through textual prompts. The model employs separate reference identity and audio LoRA modules and uses a contrastive learning objective to enhance identity and timbre preservation. Trained on a large-scale, high-quality audio-visual human dataset, OmniCustom demonstrates superior performance in generating consistent audio-video content.
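The abstract states that identity and timbre control run through separate LoRA modules attached to the self-attention layers of the base DiT. The exact ranks, scaling, and layer placement are not disclosed, but the generic LoRA mechanism they build on can be sketched in a few lines of NumPy (all names and dimensions below are illustrative, not taken from the paper):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Frozen projection W plus a trainable low-rank update (LoRA).

    x : (seq_len, d_in)  input tokens (e.g. video or audio latents)
    W : (d_in, d_out)    frozen base weight, e.g. a self-attention query projection
    A : (d_in, r)        trainable down-projection, rank r << d_in
    B : (r, d_out)       trainable up-projection, initialized to zeros
    """
    r = A.shape[1]
    scale = alpha / r
    # Base path stays frozen; only the A/B pair is trained per reference condition.
    return x @ W + scale * (x @ A @ B)

rng = np.random.default_rng(0)
d_in, d_out, r, seq = 64, 64, 4, 10
x = rng.standard_normal((seq, d_in))
W = rng.standard_normal((d_in, d_out)) * 0.02
A = rng.standard_normal((d_in, r)) * 0.02
B = np.zeros((r, d_out))  # zero init: the adapter starts as a no-op

out = lora_forward(x, W, A, B)
assert out.shape == (seq, d_out)
assert np.allclose(out, x @ W)  # before training, output equals the frozen path
```

In OmniCustom's setup there would be two such adapter sets, one keyed to the reference image identity and one to the reference audio timbre, but how they are combined at inference is not specified in the abstract.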
Key Points
- Introduction of a novel task: sync audio-video customization.
- Proposal of OmniCustom, a DiT-based framework for joint audio-video generation.
- Use of separate reference identity and audio LoRA modules for control.
- Implementation of a contrastive learning objective to enhance identity and timbre preservation.
- Training on a large-scale, high-quality audio-visual human dataset.
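The contrastive objective pairs the standard flow-matching regression loss with a term that contrasts reference-conditioned flow predictions (positives) against unconditioned ones (negatives). The abstract does not give the exact formulation; one plausible margin-style sketch, with hypothetical weight `lam` and `margin` hyperparameters, is:

```python
import numpy as np

def contrastive_flow_loss(v_cond, v_uncond, v_target, lam=0.1, margin=1.0):
    """Flow matching plus a contrastive penalty (assumed form, not the paper's).

    v_cond   : flow predicted WITH reference image/audio conditions (positive)
    v_uncond : flow predicted WITHOUT reference conditions (negative)
    v_target : ground-truth flow-matching target velocity
    The positive prediction is pulled toward the target; the negative one is
    pushed away, hinged at `margin` so the penalty stays bounded.
    """
    pos = np.mean((v_cond - v_target) ** 2)    # standard flow-matching term
    neg = np.mean((v_uncond - v_target) ** 2)  # should stay large
    return pos + lam * max(0.0, margin - neg)

rng = np.random.default_rng(1)
v_t = rng.standard_normal(128)
near = v_t + 0.01 * rng.standard_normal(128)  # conditioned prediction near target
far = rng.standard_normal(128)                # unconditioned prediction far away

# Loss is low when the conditioned branch tracks the target and the
# unconditioned branch does not; swapping the roles raises it.
assert contrastive_flow_loss(near, far, v_t) < contrastive_flow_loss(far, near, v_t)
```

The intended effect, per the abstract, is that the model learns to attribute identity and timbre to the reference inputs rather than to the text prompt alone.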
Merits
Innovative Approach
The article introduces a novel task of synchronizing audio and video customization, addressing a gap in current methods that focus solely on video identity.
Advanced Framework
OmniCustom's use of a DiT-based model and separate LoRA modules for identity and timbre control represents a significant advancement in the field.
Effective Training
The training on a large-scale, high-quality dataset ensures robust performance and generalization of the model.
Demerits
Complexity
The complexity of the model and the need for extensive training data may limit its accessibility and practical application.
Zero-Shot Limitation
While the model claims zero-shot capability, real-world performance may vary, and additional fine-tuning might be required for specific use cases.
Expert Commentary
The article presents a significant advancement in the field of audio-video customization, addressing a critical gap in current methodologies. The introduction of OmniCustom, a framework that synchronizes video identity and audio timbre, is a notable contribution. The use of separate LoRA modules and contrastive learning objectives enhances the model's ability to preserve identity and timbre, making it a robust tool for various applications. However, the complexity of the model and the ethical implications of deepfake technology cannot be overlooked. The practical applications of OmniCustom are vast, ranging from entertainment to virtual reality, but they come with the responsibility of ensuring ethical use and data privacy. The article's findings underscore the need for ongoing research and policy development to harness the benefits of such technologies while mitigating potential risks.
Recommendations
- Further research to simplify the model and make it more accessible for practical applications.
- Development of robust detection and mitigation strategies to address the potential misuse of deepfake technology.
- Establishment of regulatory frameworks and ethical guidelines to ensure the responsible use of audio-video customization technologies.