We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for...
We introduce OpenFlamingo, a family of autoregressive vision-language mo...
The prevalence of large-scale multimodal datasets presents unique challe...
Automatically determining whether a text and a corresponding image are s...
One of the exciting capabilities of recent language models for dialog is...
Large multimodal datasets have been instrumental in recent breakthroughs...
Figures of speech such as metaphors, similes, and idioms allow language ...
Weird, unusual, and uncanny images pique the curiosity of observers beca...
A core process in human cognition is analogical mapping: the ability to ...
While vision-and-language models perform well on tasks such as visual qu...
Masked language modeling (MLM) is one of the key sub-tasks in vision-lan...
Recent works have shown that supervised models often exploit data artifa...