Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search
Text-based person search is a challenging task that aims to search pedestrian images with the same identity from the image gallery given a query text description. In recent years, text-based person search has made good progress, and state-of-the-art methods achieve superior performance by learning local fine-grained correspondence between images and texts. However, the existing methods explicitly extract image parts and text phrases from images and texts by hand-crafted split or external tools and then conduct complex cross-modal local matching. Moreover, the existing methods seldom consider the problem of information inequality between modalities caused by image-specific information. In this paper, we propose an efficient joint Information and Semantic Alignment Network (ISANet) for text-based person search. Specifically, we first design an image-specific information suppression module, which suppresses image background and environmental factors by relation-guide localization and channel attention filtration respectively. This design can effectively alleviate the problem of information inequality and realize the information alignment between images and texts. Secondly, we propose an implicit local alignment module to adaptively aggregate image and text features to a set of modality-shared semantic topic centers, and implicitly learn the local fine-grained correspondence between images and texts without additional supervision information and complex cross-modal interactions. Moreover, a global alignment is introduced as a supplement to the local perspective. Extensive experiments on multiple databases demonstrate the effectiveness and superiority of the proposed ISANet.
READ FULL TEXT