Automatically determining whether a text and a corresponding image are
s...
Large language models have demonstrated an emergent capability in answer...
We propose Video Localized Narratives, a new form of multimodal video
an...
Creativity is an indispensable part of human cognition and also an inher...
Effective scaling and a flexible task interface enable large language mo...
The ability to read and reason about texts in an image is often lacking ...
Visual Question Answering (VQA) has been primarily studied through the l...
Visual Question Answering (VQA) has benefited from increasingly sophisti...
The availability of large-scale image captioning and visual question
ans...
Object frequencies in daily scenes follow a long-tailed distribution. Ma...
Existing image retrieval systems use text queries to provide a natural a...
Image captioning involves identifying semantic concepts in the scene and...
We propose Localized Narratives, an efficient way to collect image capti...
Object detection plays an important role in current solutions to vision ...
Zero-shot learning (ZSL) enables solving a task without the need to see ...
We study three general multi-task learning (MTL) approaches on 11 sequen...
Deep convolutional networks are well-known for their high computational ...
Leveraging class semantic descriptions and examples of known objects,
ze...
Given semantic descriptions of object classes, zero-shot learning aims t...