Sounding Video Generator: A Unified Framework for Text-guided Sounding Video Generation

by Jiawei Liu, et al.

As a combination of visual and audio signals, video is inherently multi-modal. However, existing video generation methods are primarily intended for the synthesis of visual frames, whereas audio signals in realistic videos are disregarded. In this work, we concentrate on the rarely investigated problem of text-guided sounding video generation and propose the Sounding Video Generator (SVG), a unified framework for generating realistic videos along with audio signals. Specifically, we present SVG-VQGAN to transform visual frames and audio mel-spectrograms into discrete tokens. SVG-VQGAN applies a novel hybrid contrastive learning method to model inter-modal and intra-modal consistency and improve the quantized representations. A cross-modal attention module is employed to extract associated features of visual frames and audio signals for contrastive learning. Then, a Transformer-based decoder is used to model associations between texts, visual frames, and audio signals at the token level for auto-regressive sounding video generation. AudioSetCap, a human-annotated text-video-audio paired dataset, is produced for training SVG. Experimental results demonstrate the superiority of our method when compared with existing text-to-video generation methods as well as audio generation methods on the Kinetics and VAS datasets.
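To make the tokenization step concrete, the following is a minimal sketch, not the authors' implementation. It assumes that quantization means mapping each feature vector to its nearest codebook entry, and that the text, visual, and audio tokens are concatenated into one sequence with disjoint id ranges before being passed to the auto-regressive decoder. The function names `quantize` and `build_sequence`, the offsets, and the modality ordering are all hypothetical and are not taken from the abstract.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(features: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Assumption: nearest-neighbour lookup under squared Euclidean distance,
    as in a generic vector-quantized autoencoder.
    """
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def build_sequence(
    text_tokens: List[int],
    video_tokens: List[int],
    audio_tokens: List[int],
    video_offset: int,
    audio_offset: int,
) -> List[int]:
    """Concatenate the three modalities into a single token sequence.

    Assumption: each modality gets its own id range via an offset, and the
    order is text, then video, then audio (hypothetical layout).
    """
    shifted_video = [t + video_offset for t in video_tokens]
    shifted_audio = [t + audio_offset for t in audio_tokens]
    return list(text_tokens) + shifted_video + shifted_audio


# Toy usage with a 3-entry codebook of 2-dimensional vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frame_features = [[0.1, -0.1], [0.9, 1.2]]
audio_features = [[0.1, 1.9]]

video_ids = quantize(frame_features, codebook)
audio_ids = quantize(audio_features, codebook)
sequence = build_sequence([5, 6], video_ids, audio_ids, 100, 200)
```

In this layout an auto-regressive decoder would predict each token from the ones before it, so the text prefix conditions both the visual and the audio tokens.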




Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

Recently, video object segmentation (VOS) referred by multi-modal signal...

FoleyGen: Visually-Guided Audio Generation

Recent advancements in audio generation have been spurred by the evoluti...

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Generating a video given the first several static frames is challenging ...

An Initial Exploration: Learning to Generate Realistic Audio for Silent Video

Generating realistic audio effects for movies and other media is a chall...

Synchronizing Audio-Visual Film Stimuli in Unity (version 5.5.1f1): Game Engines as a Tool for Research

Unity is a software specifically designed for the development of video g...

Collaborative Learning to Generate Audio-Video Jointly

There have been a number of techniques that have demonstrated the genera...

A Multi-modal Deep Learning Model for Video Thumbnail Selection

Thumbnail is the face of online videos. The explosive growth of videos b...
