Video-Text Retrieval by Supervised Multi-Space Multi-Grained Alignment

02/19/2023
by Yimu Wang, et al.

Recent progress in video-text retrieval has been driven largely by better representation learning. In this paper, we present SUMA, a novel multi-space, multi-grained supervised learning framework that learns an aligned representation space shared between video and text for video-text retrieval. The shared aligned space is initialized with a finite number of concept clusters, each of which refers to a number of basic concepts (words). Because text data is available, we can update the shared aligned space in a supervised manner using the proposed similarity and alignment losses. Moreover, to enable multi-grained alignment, we incorporate frame representations to better model the video modality and to compute both fine-grained and coarse-grained similarity. Extensive experiments on several video-text retrieval benchmarks show that SUMA, benefiting from the learned shared aligned space and the multi-grained similarity, outperforms existing methods.
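
To make the idea of coarse-grained versus fine-grained similarity concrete, the following is a minimal NumPy sketch and not the SUMA implementation. It assumes cosine similarity, mean pooling over frames for the coarse-grained score, and max pooling over frame-level scores for the fine-grained score; the abstract does not specify any of these choices. The function names and the `cluster_centers` projection are hypothetical and used only for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis` (eps avoids division by zero)."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def coarse_similarity(frame_emb, text_emb):
    """Video-level score: cosine similarity between the mean-pooled
    frame embeddings and the text embedding (assumed pooling choice)."""
    video_emb = l2_normalize(frame_emb.mean(axis=0))
    return float(video_emb @ l2_normalize(text_emb))


def fine_similarity(frame_emb, text_emb):
    """Frame-level score: the best cosine similarity between any single
    frame and the text embedding (assumed aggregation choice)."""
    sims = l2_normalize(frame_emb) @ l2_normalize(text_emb)
    return float(sims.max())


def to_aligned_space(emb, cluster_centers):
    """Hypothetical projection into a shared space: one activation per
    concept cluster, given by cosine similarity to that cluster's center."""
    return l2_normalize(emb) @ l2_normalize(cluster_centers).T


# Toy example: two frames and one sentence embedding in a 2-D space.
frames = np.array([[1.0, 0.0], [0.0, 1.0]])
text = np.array([1.0, 0.0])
centers = np.eye(2)  # two hypothetical concept clusters

coarse = coarse_similarity(frames, text)
fine = fine_similarity(frames, text)
text_aligned = to_aligned_space(text, centers)
```

In this toy case the fine-grained score is higher than the coarse-grained one because a single frame matches the text exactly, while mean pooling dilutes that match with the unrelated frame.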
