With ever-increasing parameters and computation, vision-language pre-tra...
Given the long textual product information and the product image, Multi-...
Deep neural networks often suffer from poor generalization due to comple...
Prompt tuning is a parameter-efficient way to deploy large-scale pre-tra...
Pre-trained language models (PLMs) have played an increasing role in mul...
Recently, growing interest has been aroused in extending the multimodal ...
In this paper, we study teacher-student learning from the perspective of...
Semi-supervised object detection (SSOD) is a research hot spot in comput...
Parameter-efficient transfer learning (PETL) is an emerging research spo...
In this paper, we study the local visual modeling with grid features for...
Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding ...
Deep neural networks often suffer from poor generalization caused by com...
Visible-infrared person re-identification (VI-ReID) is a task of matchin...
Most of the existing work in one-stage referring expression comprehensio...
Despite the exciting performance, Transformer is criticized for its exce...
Pixel synthesis is a promising research paradigm for image generation, w...
In this paper, we propose a simple yet universal network termed SeqTR fo...
In this paper, we discover two factors that inhibit POMs from achieving ...
In this paper, we are committed to establishing a unified and end-to-en...
Referring expression comprehension (REC) and segmentation (RES) are two ...
Referring Expression Comprehension (REC) is an emerging research spot in...