We present a novel model for Tracking Any Point (TAP) that effectively t...
We propose a novel multimodal video benchmark - the Perception Test - to...
Attention-based models are appealing for multimodal processing because i...
Generic motion understanding from video involves not only tracking objec...
Videos contain far more information than still images and hold the poten...
Experience and reasoning occur across multiple temporal scales: millisec...
Self-supervised methods have achieved remarkable success in transfer
lea...
We present a general-purpose framework for image modelling and vision ta...
The promise of self-supervised learning (SSL) is to leverage large amoun...
General perception systems such as Perceivers can process arbitrary
moda...
Real-world data is high-dimensional: a book, image, or musical performan...
Much of the recent progress in 3D vision has been driven by the developm...
The ability to learn universal audio representations that can solve dive...
The recently-proposed Perceiver model obtains good results on several do...
Self-supervised pretraining has been shown to yield powerful representat...
Biological systems understand the world by simultaneously processing
hig...
We describe the 2020 edition of the DeepMind Kinetics human action datas...
This paper describes the AVA-Kinetics localized human actions video data...
There are thousands of actively spoken languages on Earth, but a single
...
The objective of this paper is to be able to separate a video into its
n...
We describe an extension of the DeepMind Kinetics human action dataset f...
Serverless cloud computing handles virtually all the system administrati...
We introduce the Action Transformer model for recognizing and localizing...
True video understanding requires making sense of non-lambertian scenes ...
We describe an extension of the DeepMind Kinetics human action dataset f...
We introduce a simple baseline for action localization on the AVA datase...
We introduce a class of causal video understanding models that aims to
i...
The paucity of videos in current action classification datasets (UCF-101...
We describe the DeepMind Kinetics human action video dataset. The datase...
Hierarchical feature extractors such as Convolutional Networks (ConvNets...
The dominant paradigm for feature learning in computer vision relies on
...
We address the task of predicting pose for objects of unannotated object...
While data has certainly taken the center stage in computer vision in re...
All that structure from motion algorithms "see" are sets of 2D points. W...
Object reconstruction from a single image -- in the wild -- is a problem...
We propose a mid-level image segmentation framework that combines multip...