Universal Captioner: Inducing Content-Style Separation in Vision-and-Language Model Training
While captioning models have obtained compelling results in describing natural images, there is a growing effort to increase their ability to deal with real-world concepts. In this paper, we address the task of generating fluent descriptions by training on a non-uniform combination of data sources containing both human- and automatically-collected captions. To this end, we propose a model that induces a separation between content and descriptive style by incorporating stylistic parameters and keywords extracted from large-scale multi-modal models as pivotal data. In terms of visual features, our model avoids the need for object detectors and relies on grid-like features, trained with a single prompt language modeling objective. Experimentally, we consistently outperform existing methods in terms of caption quality and the ability to describe out-of-domain concepts. Finally, our model obtains a new state of the art on both COCO and nocaps.
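To make the described conditioning concrete, the following is a minimal sketch of the general idea: a decoder attends over grid-like visual features and is conditioned on a style indicator plus a keyword prompt, trained with a single language-modeling loss. All names, dimensions, the toy vocabulary, and the two-style setup are illustrative assumptions, not the paper's actual implementation.

    # Hypothetical sketch: style/keyword-conditioned captioning on grid features.
    import torch
    import torch.nn as nn

    VOCAB, D, STYLE_TOKENS = 1000, 256, 2  # toy sizes; e.g. clean vs. web-collected style

    class PromptCaptioner(nn.Module):
        def __init__(self):
            super().__init__()
            self.tok_emb = nn.Embedding(VOCAB, D)
            self.style_emb = nn.Embedding(STYLE_TOKENS, D)   # descriptive-style indicator
            self.vis_proj = nn.Linear(512, D)                # project grid features (no detector)
            layer = nn.TransformerDecoderLayer(D, nhead=4, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=2)
            self.lm_head = nn.Linear(D, VOCAB)

        def forward(self, grid_feats, style_id, prompt_and_caption):
            # grid_feats: (B, N, 512) grid-like visual features
            # style_id:   (B,) index of the source/style of the training caption
            # prompt_and_caption: (B, T) keyword prompt followed by the caption tokens
            memory = self.vis_proj(grid_feats)
            tokens = self.tok_emb(prompt_and_caption)
            style = self.style_emb(style_id).unsqueeze(1)    # prepend style embedding
            seq = torch.cat([style, tokens], dim=1)
            L = seq.size(1)
            causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            out = self.decoder(seq, memory, tgt_mask=causal)
            return self.lm_head(out)

    # Toy training step with a single cross-entropy (language modeling) objective.
    model = PromptCaptioner()
    B, N, T = 2, 49, 12
    grid_feats = torch.randn(B, N, 512)
    style_id = torch.randint(0, STYLE_TOKENS, (B,))
    tokens = torch.randint(0, VOCAB, (B, T))
    logits = model(grid_feats, style_id, tokens)
    # Predict each token from the preceding ones (the prepended style slot shifts targets by one).
    loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), tokens.reshape(-1))
    loss.backward()

At inference time, the same style indicator can be fixed to the "clean" style so that descriptions remain fluent while keywords transfer out-of-domain content; this usage pattern is an assumption of the sketch, stated here only to show how the content-style separation would be exploited.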