Implicit neural representations (INR) have gained increasing attention i...
Video understanding tasks take many forms, from action detection to visu...
Contrastive Language-Image Pretraining (CLIP) has demonstrated impressiv...
We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mas...
Weakly-supervised temporal action localization aims to recognize and loc...
Video transformers have achieved impressive results on major video
recog...
We study the training of Vision Transformers for semi-supervised image
c...
The standard way of training video models entails sampling at each itera...
Self-attention learns pairwise interactions via dot products to model
lo...
One central question for video action recognition is how to model motion...
Retrieval networks are essential for searching and indexing. Compared to...
Recognizing objects from subcategories with very subtle differences rema...
In this paper, we propose Spatio-TEmporal Progressive (STEP) action
dete...
We cast visual retrieval as a regression problem by posing triplet loss ...
We present a self-supervised approach using spatio-temporal signals betw...
We propose a novel deep neural network architecture for the challenging
...
Sparsity learning with known grouping structures has received considerab...
In recent years, Deep Learning has been successfully applied to multimod...