Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation

by Zhao Yang, et al.

Referring image segmentation segments an image from a language expression. With the aim of producing high-quality masks, existing methods often adopt iterative learning approaches that rely on RNNs or stacked attention layers to refine vision-language features. Despite their complexity, RNN-based methods are subject to specific encoder choices, while attention-based methods offer limited gains. In this work, we introduce a simple yet effective alternative for progressively learning discriminative multi-modal features. The core idea of our approach is to leverage a continuously updated query as the representation of the target object and, at each iteration, to strengthen multi-modal features strongly correlated with the query while weakening less related ones. As the query is initialized by language features and successively updated by object features, our algorithm gradually shifts from being localization-centric to segmentation-centric. This strategy enables the incremental recovery of missing object parts and/or removal of extraneous parts through iteration. Compared to its counterparts, our method is more versatile: it can be plugged into prior methods straightforwardly and consistently brings improvements. Experimental results on the challenging datasets of RefCOCO, RefCOCO+, and G-Ref demonstrate its advantage over state-of-the-art methods.
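The iterative idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the similarity measure, the gating function, or the query update rule, so the cosine similarity, the sigmoid gate scaled to (0, 2), the softmax-weighted pooling, and the names `refine_features`, `num_iters`, and `temperature` are all assumptions made for illustration only.

```python
import numpy as np


def refine_features(features, language_query, num_iters=3, temperature=5.0):
    """Iteratively re-weight multi-modal features against an evolving query.

    features:        (N, C) array, one C-dim multi-modal feature per position.
    language_query:  (C,) array used to initialize the query.
    Returns the refined (N, C) features and the final (C,) query.

    Hypothetical sketch: the gate and the query update are assumptions,
    not the method from the paper.
    """
    feats = np.asarray(features, dtype=float).copy()
    query = np.asarray(language_query, dtype=float).copy()
    eps = 1e-8

    for _ in range(num_iters):
        # Cosine similarity between every position and the current query.
        unit_feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
        unit_query = query / (np.linalg.norm(query) + eps)
        sim = unit_feats @ unit_query  # shape (N,), values in [-1, 1]

        # Assumed gate in (0, 2): values above 1 strengthen a position,
        # values below 1 weaken it.
        gate = 2.0 / (1.0 + np.exp(-temperature * sim))
        feats = feats * gate[:, None]

        # Assumed query update: softmax-weighted pooling of the refined
        # features, so the query drifts from language toward object features.
        attn = np.exp(temperature * (sim - sim.max()))
        attn /= attn.sum()
        query = attn @ feats

    return feats, query
```

In this toy version, positions aligned with the query grow in magnitude while opposed positions shrink, and the query after the first iteration is built from image-side features rather than the language vector alone, which loosely mirrors the shift from localization to segmentation described in the abstract.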




