Multi-scale features are of great importance in encoding objects with sc...
A surge of interest has emerged in utilizing Transformers in diverse vis...
The problem of deep long-tailed learning, a prevalent challenge in the r...
The remarkable positive impact of Deep Neural Networks on many Artificia...
Considering the computation complexity, we propose a Guided Hybrid
Quant...
Vision transformers (ViTs) have achieved impressive results on various
c...
The last several years have witnessed remarkable progress in
video-and-l...
We present Perceiver-VL, a vision-and-language framework that efficientl...
In this paper, we propose an accurate yet fast small object detection me...
Facial expression is an essential factor in conveying human emotional st...
Training an effective video-and-language model intuitively requires mult...
The goal of this work is to build flexible video-language models that ca...
We introduce an audiovisual method for long-range text-to-video retrieva...
Convolutional Neural Network (CNN), which mimics human visual perception...
Dual encoders and cross encoders have been widely used for image-text
re...
In most video platforms, such as Youtube, and TikTok, the played videos
...
Given a reference object of an unknown type in an image, human observers...
The microvascular invasion (MVI) is a major prognostic factor in
hepatoc...
We introduce mTVR, a large-scale multilingual video moment retrieval dat...
Detecting customized moments and highlights from videos given natural
la...
Video understanding relies on perceiving the global content and modeling...
Most existing video-and-language (VidL) research focuses on a single dat...
With large-scale pre-training, the past two years have witnessed signifi...
The canonical approach to video-and-language learning (e.g., video quest...
Existing methods for vision-and-language learning typically require desi...
The emergence of Artificial Intelligence (AI) driven Keyword Spotting (K...
Given a video with aligned dialogue, people can often infer what is more...
Due to the high energy consumption and scalability challenges of deep
le...
Generating multi-sentence descriptions for videos is one of the most
cha...
We introduce a new multimodal retrieval task - TV show Retrieval (TVR), ...
We present the task of Spatio-Temporal Video Question Answering, which
r...
Recent years have witnessed an increasing interest in image-based
questi...
In this paper, we introduce a selective zero-shot classification problem...
As an effective way of metric learning, triplet loss has been widely use...