Existing emotion prediction benchmarks contain coarse emotion labels whi...
Long-term activity forecasting is an especially challenging research pro...
In egocentric action recognition a single population model is typically
...
We propose a self-supervised approach for learning to perform audio sour...
Recent self-supervised approaches have used large-scale image-text datas...
We introduce the task of spatially localizing narrated interactions in
v...
Large-scale dissemination of disinformation online intended to mislead o...
Given a video and a sentence, the goal of weakly-supervised video moment...
Many real-world tasks require models to compare images along multiple
si...
Shouldn't language and vision features be treated equally in vision-lang...