Zhengyuan Yang

research

∙ 09/18/2023

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

This paper presents a comprehensive survey of the taxonomy and evolution...

0 Chunyuan Li, et al. ∙

research

∙ 07/27/2023

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models

In this paper, we study the denoising diffusion probabilistic model (DDP...

0 Xin Yuan, et al. ∙

research

∙ 06/30/2023

DisCo: Disentangled Control for Referring Human Dance Generation in Real World

Generative AI has made significant strides in computer vision, particula...

0 Tan Wang, et al. ∙

research

∙ 06/07/2023

MultiSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos

Multimodal summarization with multimodal output (MSMO) has emerged as a ...

0 Jielin Qiu, et al. ∙

research

∙ 04/13/2023

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

Spatial control is a core capability in controllable image generation. A...

4 Jaemin Cho, et al. ∙

research

∙ 03/25/2023

Equivariant Similarity for Vision-Language Foundation Models

This study explores the concept of equivariance in vision-language found...

0 Tan Wang, et al. ∙

research

∙ 03/20/2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

We propose MM-REACT, a system paradigm that integrates ChatGPT with a po...

0 Zhengyuan Yang, et al. ∙

research

∙ 03/20/2023

Revisiting Transformer for Point Cloud-based 3D Scene Graph Generation

In this paper, we propose the semantic graph Transformer (SGT) for the 3...

0 Changsheng Lv, et al. ∙

research

∙ 02/21/2023

Learning 3D Photography Videos via Self-supervised Diffusion on Single Images

3D photography renders a static image into a video with appealing 3D vis...

0 Xiaodong Wang, et al. ∙

research

∙ 12/01/2022

GRiT: A Generative Region-to-text Transformer for Object Understanding

This paper presents a Generative RegIon-to-Text transformer, GRiT, for o...

0 Jialian Wu, et al. ∙

research

∙ 11/15/2022

PromptCap: Prompt-Guided Task-Aware Image Captioning

Image captioning aims to describe an image with a natural language sente...

10 Yushi Hu, et al. ∙

research

∙ 10/17/2022

Prompting GPT-3 To Be Reliable

Large language models (LLMs) show impressive abilities via few-shot prom...

0 Chenglei Si, et al. ∙

research

∙ 06/14/2022

TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer

In this work, we explore neat yet effective Transformer-based frameworks...

19 Jiajun Deng, et al. ∙

research

∙ 05/27/2022

GIT: A Generative Image-to-text Transformer for Vision and Language

In this paper, we design and train a Generative Image-to-text Transforme...

14 Jianfeng Wang, et al. ∙

research

∙ 01/18/2022

Cross-modal Contrastive Distillation for Instructional Activity Anticipation

In this study, we aim to predict the plausible future action steps given...

0 Zhengyuan Yang, et al. ∙

research

∙ 11/24/2021

Scaling Up Vision-Language Pre-training for Image Captioning

In recent years, we have witnessed significant performance boost in the ...

0 Xiaowei Hu, et al. ∙

research

∙ 11/23/2021

Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling

In this paper, we propose UNICORN, a vision-language (VL) model that uni...

7 Zhengyuan Yang, et al. ∙

research

∙ 11/19/2021

UFO: A UniFied TransfOrmer for Vision-Language Representation Learning

In this paper, we propose a single UniFied transfOrmer (UFO), which is c...

0 Jianfeng Wang, et al. ∙

research

∙ 09/10/2021

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

Knowledge-based visual question answering (VQA) involves answering quest...

0 Zhengyuan Yang, et al. ∙

research

∙ 04/17/2021

TransVG: End-to-End Visual Grounding with Transformers

In this paper, we present a neat yet effective transformer-based framewo...

0 Jiajun Deng, et al. ∙

research

∙ 12/08/2020

TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and...

0 Zhengyuan Yang, et al. ∙

research

∙ 10/30/2020

Pose-based Body Language Recognition for Emotion and Psychiatric Symptom Interpretation

Inspired by the human ability to infer emotions from body language, we p...

0 Zhengyuan Yang, et al. ∙

research

∙ 09/04/2020

Dynamic Context-guided Capsule Network for Multimodal Machine Translation

Multimodal machine translation (MMT), which mainly focuses on enhancing ...

0 Huan Lin, et al. ∙

research

∙ 08/03/2020

Improving One-stage Visual Grounding by Recursive Sub-query Construction

We improve one-stage visual grounding by addressing current limitations ...

2 Zhengyuan Yang, et al. ∙

research

∙ 07/17/2020

A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation

Multi-modal neural machine translation (NMT) aims to translate source se...

0 Yongjing Yin, et al. ∙

research

∙ 07/03/2020

Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation

Weakly supervised phrase grounding aims at learning region-phrase corres...

0 Liwei Wang, et al. ∙

research

∙ 12/13/2019

Grounding-Tracking-Integration

In this paper, we study tracking by language that localizes the target b...

10 Zhengyuan Yang, et al. ∙

research

∙ 08/18/2019

A Fast and Accurate One-Stage Approach to Visual Grounding

We propose a simple, fast, and accurate one-stage approach to visual gro...

1 Zhengyuan Yang, et al. ∙

research

∙ 07/30/2019

Weakly Supervised Body Part Parsing with Pose based Part Priors

Human body part parsing refers to the task of predicting the semantic se...

1 Zhengyuan Yang, et al. ∙

research

∙ 04/27/2019

Human-Centered Emotion Recognition in Animated GIFs

As an intuitive way of expressing emotion, the animated Graphical Interc...

0 Zhengyuan Yang, et al. ∙

research

∙ 11/26/2018

Attentive Relational Networks for Mapping Images to Scene Graphs

Scene graph generation refers to the task of automatically mapping an im...

0 Mengshi Qi, et al. ∙

research

∙ 01/31/2018

Action Recognition with Visual Attention on Skeleton Images

Action recognition with 3D skeleton sequences is becoming popular due to...

0 Zhengyuan Yang, et al. ∙

research

∙ 01/20/2018

End-to-end Multi-Modal Multi-Task Vehicle Control for Self-Driving Cars with Visual Perception

Convolutional Neural Networks (CNN) have been successfully applied to au...

0 Zhengyuan Yang, et al. ∙

