Recent advances in 3D scene representation and novel view synthesis have...
Deep neural networks (DNNs) usually fail to generalize well to outside o...
Recent advances on text-to-image generation have witnessed the rise of
d...
Video temporal dynamics is conventionally modeled with 3D spatial-tempor...
In this paper, we propose a novel deep architecture tailored for 3D poin...
Recent progress on 2D object detection has featured Cascade RCNN, which
...
Outlier detection tasks have been playing a critical role in AI safety. ...
Multi-scale Vision Transformer (ViT) has emerged as a powerful backbone ...
Prior works have proposed several strategies to reduce the computational...
Motion, as the uniqueness of a video, has been critical to the developme...
Comprehending the rich semantics in an image and ordering them in lingui...
Recent high-performing Human-Object Interaction (HOI) detection techniqu...
This paper presents an overview and comparative analysis of our systems
...
Human actions are typically of combinatorial structures or patterns, i.e...
Vision-language pre-training has been an emerging and fast-developing
re...
Live video broadcasting normally requires a multitude of skills and expe...
Our work reveals a structured shortcoming of the existing mainstream
sel...
Mainstream state-of-the-art domain generalization algorithms tend to
pri...
Self-supervised learning (SSL) has recently become the favorite among fe...
BERT-type structure has led to the revolution of vision-language pre-tra...
Localizing text instances in natural scenes is regarded as a fundamental...
With the rise and development of deep learning over the past decade, the...
Unsupervised learning is just at a tipping point where it could really t...
Transformer with self-attention has led to the revolutionizing of natura...
Despite having impressive vision-language (VL) pretraining with BERT-bas...
This paper explores useful modifications of the recent development in
co...
A steady momentum of innovations and breakthroughs has convincingly push...
Single shot detectors that are potentially faster and simpler than two-s...
In this work, we present Auto-captions on GIF, which is a new large-scal...
Region sampling or weighting is significantly important to the success o...
Unsupervised domain adaptation has received significant attention in rec...
Recent progress on fine-grained visual recognition and visual question
a...
This notebook paper presents an overview and comparative analysis of our...
It is always well believed that parsing an image into constituent visual...
The problem of distance metric learning is mostly considered from the
pe...
Unsupervised image-to-image translation is the task of translating an im...
It has been well recognized that modeling object-to-object relations wou...
It is always well believed that Binary Neural Networks (BNNs) could
dras...
Image paragraph generation is the task of producing a coherent story (us...
This notebook paper presents an overview and comparative analysis of our...
This notebook paper presents an overview and comparative analysis of our...
It is well believed that video captioning is a fundamental but challengi...
Image captioning has received significant attention with remarkable
impr...
Rendering synthetic data (e.g., 3D CAD-rendered images) to generate
anno...
In this paper, we introduce a new idea for unsupervised domain adaptatio...
It is always well believed that modeling relationships between objects w...
In this paper, we introduce the new ideas of augmenting Convolutional Ne...
Hashing has been a widely-adopted technique for nearest neighbor search ...
Automatically describing a video with natural language is regarded as a
...
We are creating multimedia contents everyday and everywhere. While autom...