Recent empirical evidence indicates that transformer based in-context
le...
We study how vision-language models trained on Internet-scale data can b...
We propose Video Localized Narratives, a new form of multimodal video
an...
Recent research in robust optimization has shown an overfitting-like
phe...
Effective scaling and a flexible task interface enable large language mo...
The ability to read and reason about texts in an image is often lacking ...
Visual Question Answering (VQA) has been primarily studied through the l...
Research in massively multilingual image captioning has been severely
ha...
Visual Question Answering (VQA) has benefited from increasingly sophisti...
Dense video captioning aims to identify the events of interest in an inp...
With the increasing abundance of pretrained models in recent years, the
...
We describe an efficient hierarchical method to compute attention in the...
Despite recent advances in its theoretical understanding, there still re...
The availability of large-scale image captioning and visual question
ans...
Existing image retrieval systems use text queries to provide a natural a...
Image captioning models generally lack the capability to take into accou...
Learning specific hands-on skills such as cooking, car maintenance, and ...
Recent advances in automatic evaluation metrics for text have shown that...
Sequence generation models trained with teacher-forcing suffer from issu...
Models based on the Transformer architecture have achieved better accura...
Image captioning involves identifying semantic concepts in the scene and...
Multi-sentence summarization is a well studied problem in NLP, while
gen...
Cross-modal language generation tasks such as image captioning are direc...
Pretraining from unlabelled web videos has quickly become the de-facto m...
We propose Localized Narratives, an efficient way to collect image capti...
Human ratings are currently the most accurate way to assess the quality ...
Instructional videos get high-traffic on video sharing platforms, and pr...
Increasing model size when pretraining natural language representations ...
Neural models for abstractive summarization tend to achieve the best
per...
Automatic image captioning has improved significantly in the last few ye...
Object detection plays an important role in current solutions to vision ...
An image caption should fluently present the essential information in a ...
Supervised training of abstractive language generation models results in...
We introduce a new multi-modal task for computer systems, posed as a com...
We present a family of neural-network--inspired models for computing
con...
We present a dual contribution to the task of machine reading-comprehens...
Morpho-syntactic lexicons provide information about the morphological an...