Self-supervised methods have shown remarkable progress in learning high-...
Transformers have become the primary backbone of the computer vision
com...
Animating virtual avatars to make co-speech gestures facilitates various...
In this paper, we consider the task of unsupervised object discovery in
...
In this paper, we propose a novel learning scheme for self-supervised vi...
We present a new benchmark dataset, Sapsucker Woods 60 (SSW60), for adva...
Contrastive learning has shown promising potential in self-supervised
sp...
This paper focuses on self-supervised video representation learning. Mos...
Generating speech-consistent body and gesture movements is a long-standi...
The task of audio-visual sound source localization has been well studied...
Audiovisual scenes are pervasive in our daily life. It is commonplace fo...
In light of the success of contrastive learning in the image domain, cur...
A recent work from Bello shows that training and scaling strategies may ...
The crux of self-supervised video representation learning is to build ge...
Few-shot action recognition aims to recognize novel action classes (quer...
We present a framework for learning multimodal representations from unla...
Building instance segmentation models that are data-efficient and can ha...
Discriminatively localizing sounding objects in cocktail-party, i.e., mi...
The task of spatial-temporal action detection has attracted increasing
a...
How to visually localize multiple sound sources in unconstrained videos ...
Along with the development of the modern smart city, human-centric video...
Reliable and accurate 3D object detection is a necessity for safe autono...
Semantic scene parsing is suffering from the fact that pixel-level
annot...
Raindrops adhered to a glass window or camera lens can severely hamper t...