Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

07/21/2023
by Mayug Maniparambil, et al.

Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by providing good performance on downstream datasets. VLMs are 0-shot adapted to a downstream dataset by designing prompts that are relevant to that dataset. Such prompt engineering relies on domain expertise and a validation dataset. Meanwhile, recent developments in generative pretrained models like GPT-4 mean they can be used as advanced internet search tools and can be prompted to provide visual information in any desired structure. In this work, we show that GPT-4 can be used to generate text that is visually descriptive and how this text can be used to adapt CLIP to downstream tasks. We show considerable improvements in 0-shot transfer accuracy on specialized fine-grained datasets like EuroSAT (∼7%) and DTD (∼7%) over CLIP's default prompt. We also design a simple few-shot adapter that learns to choose the best possible sentences to construct generalizable classifiers that outperform the recently proposed CoCoOp by ∼2% on average on fine-grained datasets. We will release the code, prompts, and auxiliary text dataset upon acceptance.
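The core zero-shot recipe the abstract describes is straightforward: ask GPT-4 for visually descriptive sentences about each class, embed those sentences with CLIP's text encoder, and aggregate the embeddings into a per-class classifier weight. Below is a minimal sketch of that idea using the openai/CLIP package; the description strings, model choice, and image path are illustrative stand-ins, not the paper's released prompts or code.

```python
# Sketch: zero-shot CLIP classification using visually descriptive sentences per class
# (hand-written stand-ins for GPT-4 output) instead of the default "a photo of a {class}".

import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical GPT-4-style visual descriptions for two EuroSAT-like classes.
descriptions = {
    "annual crop land": [
        "a satellite photo of neatly arranged rectangular farm fields",
        "regular plots of cultivated land with visible furrows",
    ],
    "forest": [
        "a satellite photo of dense, dark green tree cover",
        "an unbroken canopy of trees with irregular texture",
    ],
}

with torch.no_grad():
    # Build one classifier weight per class by averaging the normalized
    # embeddings of its descriptive sentences.
    class_weights = []
    for cls, sentences in descriptions.items():
        tokens = clip.tokenize(sentences).to(device)
        text_feats = model.encode_text(tokens)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        class_weights.append(text_feats.mean(dim=0))
    class_weights = torch.stack(class_weights)
    class_weights = class_weights / class_weights.norm(dim=-1, keepdim=True)

    # Zero-shot prediction: cosine similarity between the image embedding
    # and each class's averaged description embedding.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_feat = model.encode_image(image)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ class_weights.T
    prediction = logits.argmax(dim=-1).item()
    print(list(descriptions)[prediction])
```

The few-shot adapter mentioned in the abstract would go one step further: rather than averaging the sentence embeddings uniformly, it would learn from a handful of labeled images which sentences to weight most heavily when constructing each class's classifier.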
