Semi-Supervised Image Captioning with CLIP

06/26/2023
by   Chuanyang Jin, et al.
0

Image captioning, a fundamental task in vision-language understanding, seeks to generate accurate natural language descriptions for provided images. The CLIP model, with its rich semantic features learned from a large corpus of image-text pairs, is well-suited for this task. In this paper, we present a two-stage semi-supervised image captioning approach that exploits the potential of CLIP encoding. Our model comprises a CLIP visual encoder, a mapping network, and a language model for text generation. In the initial stage, we train the model using a small labeled dataset by contrasting the generated captions with the ground truth captions. In the subsequent stage, we continue the training using unlabeled images, aiming to maximize the image-caption similarity based on CLIP embeddings. Remarkably, despite utilizing less than 2 COCO-captions, our approach delivers a performance comparable to state-of-the-art models trained on the complete dataset. Furthermore, the captions generated by our approach are more distinctive, informative, and in line with human preference.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset