We present ImageBind, an approach to learn a joint embedding across six ...
This paper revisits the standard pretrain-then-finetune paradigm used in...
Transformer-based architectures have become competitive across a variety...
Prior work has studied different visual modalities in isolation and deve...
Model pre-training is a cornerstone of modern visual recognition systems...
Vision transformer (ViT) models exhibit substandard optimizability. In p...
Multi-modal reasoning systems rely on a pre-trained object detector to e...
In this work we analyze strategies for convolutional neural network scal...
Recently, self-supervised learning methods like MoCo, SimCLR, BYOL and S...