MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation

by Zizhang Li, et al.

Referring image segmentation is a typical multi-modal task that aims to generate a binary mask for the referent described by a given language expression. Prior work adopts a bimodal solution, treating image and language as two modalities within an encoder-fusion-decoder pipeline. However, this pipeline is sub-optimal for the target task for two reasons. First, it fuses only the high-level features produced separately by uni-modal encoders, which hinders sufficient cross-modal learning. Second, the uni-modal encoders are pre-trained independently, which introduces an inconsistency between the uni-modal pre-training tasks and the target multi-modal task. In addition, this pipeline often ignores or makes little use of intuitively beneficial instance-level features. To address these problems, we propose MaIL, a more concise encoder-decoder pipeline with a Mask-Image-Language trimodal encoder. Specifically, MaIL unifies the uni-modal feature extractors and their fusion model into a single deep modality interaction encoder, facilitating sufficient feature interaction across modalities. MaIL also avoids the second limitation directly, since uni-modal encoders are no longer needed. Moreover, for the first time, we introduce instance masks as an additional modality, which explicitly strengthens instance-level features and promotes finer segmentation results. The proposed MaIL sets a new state of the art on all frequently used referring image segmentation datasets, including RefCOCO, RefCOCO+, and G-Ref, with significant gains. Code will be released soon.
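To make the idea of a single trimodal encoder more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that image and mask are split into patches, that each modality's tokens receive a modality-type embedding, and that all tokens are concatenated into one sequence over which self-attention runs. The patch size, the embedding width `DIM`, the random weights, the single-channel image, and every function name here are hypothetical choices made for illustration only.

```python
import math
import random

random.seed(0)
DIM = 8  # hypothetical embedding width


def rand_matrix(rows, cols):
    """Random weights standing in for learned parameters."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def patchify(grid, patch):
    """Split an H x W grid (list of rows) into flattened patch x patch blocks."""
    h, w = len(grid), len(grid[0])
    return [
        [grid[r + i][c + j] for i in range(patch) for j in range(patch)]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]


def project(vec, weight):
    """Linear projection: vec (length n) times weight (n x DIM)."""
    return [sum(v * weight[k][d] for k, v in enumerate(vec)) for d in range(DIM)]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def build_joint_sequence(image, mask, word_ids, params, patch=2):
    """Embed each modality, tag it with a modality-type vector, and concatenate."""
    img_tokens = [add(project(p, params["img_proj"]), params["type"][0])
                  for p in patchify(image, patch)]
    mask_tokens = [add(project(p, params["mask_proj"]), params["type"][1])
                   for p in patchify(mask, patch)]
    txt_tokens = [add(params["word_emb"][w], params["type"][2]) for w in word_ids]
    return txt_tokens + img_tokens + mask_tokens


def self_attention(tokens):
    """Single-head attention with identity Q/K/V over the whole joint sequence."""
    scale = math.sqrt(DIM)
    out, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        out.append([sum(wt * v[d] for wt, v in zip(row, tokens)) for d in range(DIM)])
    return out, weights


# Hypothetical toy inputs: a 4x4 single-channel image, one instance mask, three word ids.
params = {
    "img_proj": rand_matrix(4, DIM),   # 2x2 patch -> DIM
    "mask_proj": rand_matrix(4, DIM),  # 2x2 patch -> DIM
    "word_emb": rand_matrix(10, DIM),  # vocabulary of 10 made-up words
    "type": rand_matrix(3, DIM),       # image, mask, language type vectors
}
image = [[random.random() for _ in range(4)] for _ in range(4)]
mask = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
word_ids = [3, 7, 1]

tokens = build_joint_sequence(image, mask, word_ids, params)
fused, attn = self_attention(tokens)
print(len(tokens), len(fused[0]))
```

Because the three token groups share one sequence, each attention row in `attn` spreads weight over language, image, and mask tokens at once. This is the contrast with an encoder-fusion-decoder pipeline, where the modalities would first pass through separate encoders and meet only at a later fusion stage.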




