In this paper, we propose an autonomous information seeking visual quest...
Observing the close relationship among panoptic, semantic and instance
s...
If you ask a human to describe an image, they might do so in a thousand
...
Detecting actions in untrimmed videos should not be limited to a small,
...
In this paper, we propose an end-to-end Retrieval-Augmented Visual Langu...
Traditional automated metrics for evaluating conditional natural languag...
While there have been significant gains in the field of automated video
...
In this paper, we present a transformer-based learning framework for 3D ...
Videos found on the Internet are paired with pieces of text, such as tit...
Automatic video captioning aims to train models to generate text descrip...
Detecting objects in 3D LiDAR data is a core technology for autonomous
d...
This paper describes the AVA-Kinetics localized human actions video data...
State-of-the-art methods for video action recognition commonly use an
en...
We propose TAL-Net, an improved approach to temporal action localization...
This paper introduces a video dataset of spatio-temporally localized Ato...