The ultimate goal for foundation models is realizing task-agnostic, i.e....
Streaming video clips with large-scale video tokens impede vision
transf...
Human visual recognition is a sparse process, where only a few salient v...
Contrastive learning methods train visual encoders by comparing views fr...
Scale is the primary factor for building a powerful foundation model tha...
The relation modeling between actors and scene context advances video ac...
Although the pre-trained Vision Transformers (ViTs) achieved great succe...
Pre-training video transformers on extra large-scale datasets is general...
Vision Transformers (ViTs) take all the image patches as tokens and cons...
Frame sampling is a fundamental problem in video action recognition due ...
Temporal modeling still remains challenging for action recognition in vi...