PreSTU: Pre-Training for Scene-Text Understanding

09/12/2022
by Jihyung Kil, et al.

The ability to read and reason about text in an image is often lacking in vision-and-language (V&L) models. How can we learn V&L models that exhibit strong scene-text understanding (STU)? In this paper, we propose PreSTU, a simple pre-training recipe specifically designed for scene-text understanding. PreSTU combines a simple OCR-aware pre-training objective with a large-scale image-text dataset annotated with off-the-shelf OCR signals. We empirically demonstrate the superiority of this pre-training objective on TextVQA, TextCaps, ST-VQA, and VizWiz-VQA. We also study which factors affect STU performance, highlighting the importance of image resolution and dataset scale during pre-training.
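The abstract does not spell out the form of the OCR-aware objective, but one common realization is to condition the model on part of the detected scene text and train it to generate the rest. The sketch below illustrates that idea; the function name, prompt format, and fixed split ratio are assumptions for illustration only, not the paper's exact recipe.

```python
def make_ocr_pretraining_example(image_id, ocr_tokens, split_ratio=0.5):
    """Build one (input, target) pair for an OCR-aware pre-training step.

    A prefix of the scene-text tokens is appended to the model input and the
    model is trained to generate the remaining tokens. The prompt format and
    split are illustrative assumptions, not the paper's specification.
    """
    k = int(len(ocr_tokens) * split_ratio)
    prefix, remainder = ocr_tokens[:k], ocr_tokens[k:]
    model_input = {
        "image": image_id,                   # image pixels would be fed here
        "text": "ocr: " + " ".join(prefix),  # hypothetical text prompt
    }
    target = " ".join(remainder)             # model generates the unseen OCR text
    return model_input, target


if __name__ == "__main__":
    # OCR tokens as produced by an off-the-shelf OCR system.
    ocr = ["GRAND", "CENTRAL", "STATION", "TRACK", "12"]
    inp, tgt = make_ocr_pretraining_example("img_0001.jpg", ocr)
    print(inp)   # {'image': 'img_0001.jpg', 'text': 'ocr: GRAND CENTRAL'}
    print(tgt)   # STATION TRACK 12
```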
