In this paper, we propose a novel cross-modal distillation method, calle...
Transformer-based visual trackers have demonstrated significant progress...
While language-guided image manipulation has made remarkable progress, t...
Vision transformers have shown great success due to their high model cap...
In this paper, we present a new sequence-to-sequence learning framework ...
Image token removal is an efficient augmentation strategy for reducing t...
Vision transformer (ViT) has recently drawn great attention in computer ...
PointNet++ is one of the most influential neural architectures for point...
Vision Transformer (ViT) models have recently drawn much attention in co...
Vision Transformer has shown great visual representation power in substa...
In this paper, we propose to learn an Unsupervised Single Object Tracker...
Relative position encoding (RPE) is important for transformers to capture...
Recently, pure transformer-based models have shown great potential for ...
Vision-Language Pre-training (VLP) aims to learn multi-modal representat...
Object tracking has achieved significant progress over the past few year...
Despite remarkable progress achieved, most neural architecture search (N...
In this paper, we present a new tracking architecture with an encoder-de...
We address the problem of retrieving a specific moment from an untrimmed...
One-shot weight sharing methods have recently drawn great attention in n...
Most of the current action localization methods follow an anchor-based p...
The encoding of the target in object tracking moves from the coarse boun...
Recently, differentiable architecture search has drawn great attention du...
Anchor-based Siamese trackers have achieved remarkable advancements in a...
In this report, we introduce the Winner method for HACS Temporal Action ...
We address the problem of retrieving a specific moment from an untrimmed...