OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model
arXiv:2602.12304v1 Announce Type: cross Abstract: Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model's ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity.
Executive Summary
The article 'OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model' introduces a novel approach to video customization that synchronizes both video identity and audio timbre. The proposed framework, OmniCustom, leverages a DiT-based model to generate videos that maintain the identity of a reference image while imitating the timbre of a reference audio clip, with spoken content specified through textual prompts. The model employs separate reference identity and audio LoRA modules and uses a contrastive learning objective to enhance identity and timbre preservation. Trained on a large-scale, high-quality audio-visual human dataset, OmniCustom demonstrates superior performance in generating consistent audio-video content.
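The abstract states that identity and timbre control run through separate LoRA modules attached to the self-attention layers of the base DiT. The exact ranks, scaling, and layer placement are not disclosed, but the generic LoRA mechanism they build on can be sketched in a few lines of NumPy (all names and dimensions below are illustrative, not taken from the paper):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Frozen projection W plus a trainable low-rank update (LoRA).

    x : (seq_len, d_in)  input tokens (e.g. video or audio latents)
    W : (d_in, d_out)    frozen base weight, e.g. a self-attention query projection
    A : (d_in, r)        trainable down-projection, rank r << d_in
    B : (r, d_out)       trainable up-projection, initialized to zeros
    """
    r = A.shape[1]
    scale = alpha / r
    # Base path stays frozen; only the A/B pair is trained per reference condition.
    return x @ W + scale * (x @ A @ B)

rng = np.random.default_rng(0)
d_in, d_out, r, seq = 64, 64, 4, 10
x = rng.standard_normal((seq, d_in))
W = rng.standard_normal((d_in, d_out)) * 0.02
A = rng.standard_normal((d_in, r)) * 0.02
B = np.zeros((r, d_out))  # zero init: the adapter starts as a no-op

out = lora_forward(x, W, A, B)
assert out.shape == (seq, d_out)
assert np.allclose(out, x @ W)  # before training, output equals the frozen path
```

In OmniCustom's setup there would be two such adapter sets, one keyed to the reference image identity and one to the reference audio timbre, but how they are combined at inference is not specified in the abstract.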
Key Points
- Introduction of a novel task: sync audio-video customization.
- Proposal of OmniCustom, a DiT-based framework for joint audio-video generation.
- Use of separate reference identity and audio LoRA modules for control.
- Implementation of a contrastive learning objective to enhance identity and timbre preservation.
- Training on a large-scale, high-quality audio-visual human dataset.
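The contrastive objective pairs the standard flow-matching regression loss with a term that contrasts reference-conditioned flow predictions (positives) against unconditioned ones (negatives). The abstract does not give the exact formulation; one plausible margin-style sketch, with hypothetical weight `lam` and `margin` hyperparameters, is:

```python
import numpy as np

def contrastive_flow_loss(v_cond, v_uncond, v_target, lam=0.1, margin=1.0):
    """Flow matching plus a contrastive penalty (assumed form, not the paper's).

    v_cond   : flow predicted WITH reference image/audio conditions (positive)
    v_uncond : flow predicted WITHOUT reference conditions (negative)
    v_target : ground-truth flow-matching target velocity
    The positive prediction is pulled toward the target; the negative one is
    pushed away, hinged at `margin` so the penalty stays bounded.
    """
    pos = np.mean((v_cond - v_target) ** 2)    # standard flow-matching term
    neg = np.mean((v_uncond - v_target) ** 2)  # should stay large
    return pos + lam * max(0.0, margin - neg)

rng = np.random.default_rng(1)
v_t = rng.standard_normal(128)
near = v_t + 0.01 * rng.standard_normal(128)  # conditioned prediction near target
far = rng.standard_normal(128)                # unconditioned prediction far away

# Loss is low when the conditioned branch tracks the target and the
# unconditioned branch does not; swapping the roles raises it.
assert contrastive_flow_loss(near, far, v_t) < contrastive_flow_loss(far, near, v_t)
```

The intended effect, per the abstract, is that the model learns to attribute identity and timbre to the reference inputs rather than to the text prompt alone.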
Merits
Innovative Approach
The article introduces a novel task of synchronizing audio and video customization, addressing a gap in current methods that focus solely on video identity.
Advanced Framework
OmniCustom's use of a DiT-based model and separate LoRA modules for identity and timbre control represents a significant advancement in the field.
Effective Training
The training on a large-scale, high-quality dataset ensures robust performance and generalization of the model.
Demerits
Complexity
The complexity of the model and the need for extensive training data may limit its accessibility and practical application.
Zero-Shot Limitation
While the model claims zero-shot capability, real-world performance may vary, and additional fine-tuning might be required for specific use cases.
Expert Commentary
The article presents a significant advancement in the field of audio-video customization, addressing a critical gap in current methodologies. The introduction of OmniCustom, a framework that synchronizes video identity and audio timbre, is a notable contribution. The use of separate LoRA modules and contrastive learning objectives enhances the model's ability to preserve identity and timbre, making it a robust tool for various applications. However, the complexity of the model and the ethical implications of deepfake technology cannot be overlooked. The practical applications of OmniCustom are vast, ranging from entertainment to virtual reality, but they come with the responsibility of ensuring ethical use and data privacy. The article's findings underscore the need for ongoing research and policy development to harness the benefits of such technologies while mitigating potential risks.
Recommendations
- Further research to simplify the model and make it more accessible for practical applications.
- Development of robust detection and mitigation strategies to address the potential misuse of deepfake technology.
- Establishment of regulatory frameworks and ethical guidelines to ensure the responsible use of audio-video customization technologies.