Engaging video comments play an important role in video social media, as...
Training deep generative models usually requires a large amount of data....
The exploration of the latent space in StyleGANs and GAN inversion exemp...
Large Pre-trained Transformers exhibit an intriguing capacity for in-con...
Improving the generalization capabilities of general-purpose robotic age...
We propose a novel framework for learning high-level cognitive capabilit...
Due to the rapid development of computing hardware resources and the dra...
Different speaker recognition challenges have been held to assess the sp...
This report describes the SJTU-AISPEECH system for the Voxceleb Speaker...
The pre-trained image-text models, like CLIP, have demonstrated the stro...
In this paper, we provide the technical report of Ego4D natural language ...
This technical report describes the SJTU X-LANCE Lab system for the thre...
Vision Transformer has shown great visual representation power in substa...
We study joint video and language (VL) pre-training to enable cross-moda...
A creative image-and-text generative AI system mimics humans' extraordin...
We study the joint learning of image-to-text and text-to-image generatio...
In this paper we focus on landscape animation, which aims to generate ti...
The defect detection task can be regarded as a realistic scenario of obj...
Vision-Language Pre-training (VLP) aims to learn multi-modal representat...
We study joint learning of Convolutional Neural Network (CNN) and Transf...
This paper presents a Multitask Multilingual Multimodal Pre-trained mode...
We propose Pixel-BERT to align image pixels with text by deep multi-moda...
A storyboard is a sequence of images to illustrate a story containing mu...
We propose to boost VQA by leveraging more powerful feature extractors b...
Endoscopic diagnosis is an important means for gastric polyp detection. ...
We study weakly-supervised object detection (WSOD) which plays a vita...
Contextual reasoning is essential to understand events in long untrimmed...
Automatic generation of natural language from images has attracted exten...