Much of the previous work towards digital agents for graphical user
inte...
Recent text-to-image generation models like DreamBooth have made remarka...
Large language models have demonstrated an emergent capability in answer...
Large-scale multi-modal pre-training models such as CLIP and PaLI exhibi...
While language Models store a massive amount of world knowledge implicit...
Research on text-to-image generation has witnessed significant progress ...
The ability to read and reason about texts in an image is often lacking ...
We investigate ways to compose complex concepts in texts from primitive ...
We analyze the grounded SCAN (gSCAN) benchmark, which was recently propo...
Object frequencies in daily scenes follow a long-tailed distribution. Ma...
Identifying a short segment in a long video that semantically matches a ...
Visual Semantic Embedding (VSE) is a dominant approach for vision-langua...
Learning to fuse vision and language information and representing them i...
Continual learning systems will interact with humans, with each other, a...
Learning to follow instructions is of fundamental importance to autonomo...
We propose a learning model for the task of visual storytelling. The mai...
Model-agnostic meta-learners aim to acquire meta-learned parameters from...
Visual recognition in real-world requires handling long-tailed and even
...
The ability to transfer in reinforcement learning is key towards buildin...
Providing systems the ability to relate linguistic and visual content is...
Gradient-based meta-learners such as MAML are able to learn a meta-prior...
Learning with limited data is a key challenge for visual recognition.
Fe...
Standard image captioning tasks such as COCO and Flickr30k are factual,
...
Visual data and text data are composed of information at multiple
granul...
We study three general multi-task learning (MTL) approaches on 11 sequen...
We investigate the problem of cross-dataset adaptation for visual questi...
We propose a novel probabilistic model for visual question answering (Vi...
Visual data such as images and videos contain a rich source of structure...
Training robust deep video representations has proven to be much more
ch...
Visual question answering (QA) has attracted a lot of attention lately, ...
Semantic segmentation requires a detailed labeling of image pixels by ob...
Objects appear to scale differently in natural images. This fact require...
Semantic segmentation requires a detailed labeling of image pixels by ob...
Rich semantic relations are important in a variety of visual recognition...