Text-conditional diffusion models are able to generate high-fidelity ima...
Existing multimodal conditional image synthesis (MCIS) methods generate
...
Multi-modal reasoning in visual question answering (VQA) has witnessed r...
Recently, significant progress has been made in masked image modeling to...
In this work, we explore neat yet effective Transformer-based frameworks...
We present a method that achieves state-of-the-art results on challengin...
Most current image captioning models typically generate captions from le...
Generating natural language descriptions for videos, i.e., video caption...
Visual attention not only improves the performance of image captioners, ...
We focus on the task of grounding referring expressions in images, e.g.,...
With the maturity of visual detection techniques, we are more ambitious ...
Grounding natural language in images, such as localizing "the black dog ...
Grounding natural language in images essentially requires composite visu...
Many vision-language tasks can be reduced to the problem of sequence
pre...