The aim of audio-visual segmentation (AVS) is to precisely differentiate...
We study the image-based geolocalization problem that aims to locate
gro...
In this work, we propose a new transformer-based regularization to bette...
We present TransNormerLLM, the first linear attention-based Large Langua...
Length extrapolation has attracted considerable attention recently since...
Relative positional encoding is widely used in vanilla and linear
transf...
Sequence modeling has important applications in natural language process...
The Segment Anything Model (SAM) has demonstrated exceptional performanc...
We explore a new task for audio-visual-language modeling called fine-gra...
Self-supervised audio-visual source localization aims to locate sound-so...
Audio-Visual Video Parsing is a task to predict the events that occur in...
Linear transformers aim to reduce the quadratic space-time complexity of...
Vision Transformers have achieved impressive performance in video
classi...
Recently, numerous efficient Transformers have been proposed to reduce t...
Fine-tuning is widely applied in image classification tasks as a transfe...
The self-attention mechanism, successfully employed with the transformer...
Vision transformers have shown great success on numerous computer vision...
Conformer has shown a great success in automatic speech recognition (ASR...
We propose a new video camouflaged object detection (VCOD) framework tha...
Transformer has shown great successes in natural language processing,
co...
This work studies the task of glossification, of which the aim is to em
...
The task of semi-supervised video object segmentation (VOS) has been gre...
Existing RGB-D saliency detection models do not explicitly encourage RGB...
Camouflaged object detection (COD) aims to segment camouflaged objects h...
Attention has been proved to be an efficient mechanism to capture long-r...
Single-image super-resolution (SR) and multi-frame SR are two ways to su...
Two-view structure-from-motion (SfM) is the cornerstone of 3D reconstruc...
Visual and audio signals often coexist in natural environments, forming
...
Video deblurring models exploit consecutive frames to remove blurs from
...
A depth map can be represented by a set of learned bases and can be
effi...
In this paper, we propose a new global geometry constraint for depth
com...
Although deep learning-based methods have dominated stereo matching
lead...
Learning matching costs has been shown to be critical to the success of ...
To reduce the human efforts in neural network design, Neural Architectur...
Existing deep learning methods for image deblurring typically train mode...
In this paper, we present LidarStereoNet, the first unsupervised Lidar-s...
Unsupervised deep learning for optical flow computation has achieved
pro...
This paper proposes an original problem of stereo computation from a
sin...
This paper is concerned with the problem of how to better exploit 3D
geo...
Deep Learning based stereo matching methods have shown great successes a...
Camera shake or target movement often leads to undesired blur effects in...
Exiting deep-learning based dense stereo matching methods often rely on
...
Feature tracking is a fundamental problem in computer vision, with
appli...