This paper presents a comprehensive survey of the taxonomy and evolution...
In this paper, we study the denoising diffusion probabilistic model (DDP...
Generative AI has made significant strides in computer vision, particula...
Multimodal summarization with multimodal output (MSMO) has emerged as a
...
Spatial control is a core capability in controllable image generation.
A...
This study explores the concept of equivariance in vision-language found...
We propose MM-REACT, a system paradigm that integrates ChatGPT with a po...
In this paper, we propose the semantic graph Transformer (SGT) for the 3...
3D photography renders a static image into a video with appealing 3D vis...
This paper presents a Generative RegIon-to-Text transformer, GRiT, for o...
Image captioning aims to describe an image with a natural language sente...
Large language models (LLMs) show impressive abilities via few-shot
prom...
In this work, we explore neat yet effective Transformer-based frameworks...
In this paper, we design and train a Generative Image-to-text Transforme...
In this study, we aim to predict the plausible future action steps given...
In recent years, we have witnessed significant performance boost in the ...
In this paper, we propose UNICORN, a vision-language (VL) model that uni...
In this paper, we propose a single UniFied transfOrmer (UFO), which is
c...
Knowledge-based visual question answering (VQA) involves answering quest...
In this paper, we present a neat yet effective transformer-based framewo...
In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and...
Inspired by the human ability to infer emotions from body language, we
p...
Multimodal machine translation (MMT), which mainly focuses on enhancing
...
We improve one-stage visual grounding by addressing current limitations ...
Multi-modal neural machine translation (NMT) aims to translate source
se...
Weakly supervised phrase grounding aims at learning region-phrase
corres...
In this paper, we study tracking by language that localizes the target b...
We propose a simple, fast, and accurate one-stage approach to visual
gro...
Human body part parsing refers to the task of predicting the semantic
se...
As an intuitive way of expression emotion, the animated Graphical Interc...
Scene graph generation refers to the task of automatically mapping an im...
Action recognition with 3D skeleton sequences is becoming popular due to...
Convolutional Neural Networks (CNN) have been successfully applied to
au...