Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale...
Pre-trained image-text models, such as CLIP, have demonstrated the stro...
We study joint video and language (VL) pre-training to enable cross-moda...
We study the joint learning of image-to-text and text-to-image generatio...
In this paper we focus on landscape animation, which aims to generate ti...
Vision-Language Pre-training (VLP) aims to learn multi-modal representat...