Attention-based models are appealing for multimodal processing because i...
Generic motion understanding from video involves not only tracking objec...
General perception systems such as Perceivers can process arbitrary
moda...
The ability to learn universal audio representations that can solve dive...
We present a multimodal framework to learn general audio representations...
Most successful self-supervised learning methods are trained to align th...
Fine-grained estimation of galaxy merger stages from observations is a k...
The rapid progress in artificial intelligence (AI) and machine learning ...
Videos are a rich source of multi-modal supervision. In this work, we le...
In our everyday lives and social interactions we often try to perceive t...
Understanding where people are looking is an informative social cue. In ...
We introduce a saliency-based distortion layer for convolutional neural
...
Widely used in news, business, and educational media, infographics are
h...
In this paper, we explore neural network models that learn to associate
...
We introduce the problem of visual hashtag discovery for infographics:
e...
Following the gaze of people inside videos is an important signal for
un...