How to efficiently transform large language models (LLMs) into instructi...
Contrastive learning-based vision-language pre-training approaches, such...
We present Mono-STAR, the first real-time 3D reconstruction system that
...
Video recognition has been dominated by the end-to-end learning paradigm...
Despite the success of fully-supervised human skeleton sequence modeling...
Group Activity Recognition (GAR) detects the activity performed by a gro...
The visual world naturally exhibits a long-tailed distribution of open
c...
In previous work, we have proposed the Audio-Visual Scene-Aware Dialog (...
Inspired by the success of BERT, several multimodal representation learn...
Transformer has been widely adopted in Neural Machine Translation (NMT)
...
In contrast with previous approaches where information flows only toward...
Several multi-modality representation learning approaches such as LXMERT...
The Audio-Visual Scene-aware Dialog (AVSD) task requires an agent to ind...
There has been growing attention on fairness considerations recently,
es...
A number of cross-lingual transfer learning approaches based on neural
n...
We present a simple method that achieves unexpectedly superior performan...
We design a new connectivity pattern for the U-Net architecture. Given
s...
In this paper, we propose quantized densely connected U-Nets for efficie...