We propose LENS, a modular approach for tackling computer vision problem...
Large multimodal models trained on natural documents, which interleave i...
The volume of scientific output is creating an urgent need for automated...
Learned representations of scientific documents can serve as valuable in...
The vast scale and open-ended nature of knowledge graphs (KGs) make
expl...
Training and inference with large neural models is expensive. However, f...
We present a novel task and dataset for evaluating the ability of vision...
Vision-and-Language (V+L) pre-training models have achieved tremendous
s...
State-of-the-art vision and vision-and-language models rely on large-sca...
Performance on the most commonly used Visual Question Answering dataset ...
A crucial component for the scene text based reasoning required for Text...
Optimal Mass Transport (OMT) is a well studied problem with a variety of...
We propose UniT, a Unified Transformer model to simultaneously learn the...
A major challenge in fine-tuning deep learning models for automatic
summ...
We introduce a learning-based approach for room navigation using semanti...
This work proposes a new challenge set for multimodal classification,
fo...
Numerous recent works have proposed pretraining generic visio-linguistic...
Image descriptions can help visually impaired people to quickly understa...
Many visual scenes contain text that carries crucial information, and it...
In the last year, new models and methods for pretraining and transfer
le...
Studies have shown that a dominant class of questions asked by visually
...
In model-based reinforcement learning, the agent interleaves between mod...
Learning when to communicate and doing that effectively is essential in
...
We propose a new recurrent generative model for generating images from t...
In this work, we explore the ability of artificial neural networks to ju...
Finding patterns in data and being able to retrieve information from tho...