This paper presents a comprehensive survey of the taxonomy and evolution...
Avoiding synthesizing specific visual concepts is an essential challenge...
In this paper, we study the denoising diffusion probabilistic model (DDP...
Generative AI has made significant strides in computer vision, particula...
Despite the promising progress in multi-modal tasks, current large multi...
Multimodal summarization with multimodal output (MSMO) has emerged as a ...
We present a unified framework for camera-space 3D hand pose estimation ...
Model merging (e.g., via interpolation or task arithmetic) fuses multipl...
We develop the contour integral method for numerically solving the Feynm...
Spatial control is a core capability in controllable image generation. A...
The most recent efforts in video matting have focused on eliminating tri...
This study explores the concept of equivariance in vision-language found...
We propose MM-REACT, a system paradigm that integrates ChatGPT with a po...
3D photography renders a static image into a video with appealing 3D vis...
We present X-Decoder, a generalized decoding model that can predict pixe...
This paper presents a Generative RegIon-to-Text transformer, GRiT, for o...
We present Mesh Pre-Training (MPT), a new pre-training framework that le...
The image captioning task is typically realized by an auto-regressive me...
Contrastive language-image pre-training (CLIP) serves as a de-facto stan...
This paper surveys vision-language pre-training (VLP) methods for multim...
Large language models (LLMs) show impressive abilities via few-shot prom...
Masked visual modeling (MVM) has recently been proven effective for visu...
In this paper, we present NUWA-Infinity, a generative model for infinite...
Vision-language (VL) pre-training has recently received considerable att...
Unified vision-language frameworks have greatly advanced in recent years...
We present GLIPv2, a grounded VL understanding model, that serves both l...
In this paper, we design and train a Generative Image-to-text Transforme...
We present a cross-modal Transformer-based framework, which jointly enco...
Recent state-of-the-art computer vision systems are trained from natural...
Human-Object Interaction (HOI) recognition is challenging due to two fac...
We propose DEFR, a DEtection-FRee method to recognize Human-Object Inter...
Tremendous progress has been made in recent years in developing better i...
We initiate the first empirical study on the use of MLP architectures fo...
This paper presents a grounded language-image pre-training (GLIP) model ...
The canonical approach to video captioning dictates a caption generation...
A great challenge in video-language (VidL) modeling lies in the disconne...
In recent years, we have witnessed a significant performance boost in the ...
In this paper, we propose UNICORN, a vision-language (VL) model that uni...
Automated visual understanding of our diverse and open world demands com...
In this paper, we propose a single UniFied transfOrmer (UFO), which is c...
Vision-and-language (VL) pre-training has proven to be highly effective ...
Knowledge-based visual question answering (VQA) involves answering quest...
We introduce the task of open-vocabulary visual instance search (OVIS). ...
This paper revisits human-object interaction (HOI) recognition at image ...
This paper presents an end-to-end semi-supervised object detection appro...
Most existing video-and-language (VidL) research focuses on a single dat...
Despite exciting progress in pre-training for visual-linguistic (VL) rep...
We present a graph-convolution-reinforced transformer, named Mesh Grapho...
This paper presents a detection-aware pre-training (DAP) approach, which...
Recent advances in computer vision take advantage of adversarial data au...