Self-supervised visual pretraining has shown significant progress recent...
Audio Visual Scene-aware Dialog (AVSD) is a task to generate responses w...
Computational audio analysis has become a central issue in associated ar...
Building a good speech recognition system usually requires large amounts...
Multimodalities provide promising performance than unimodality in most t...
Speech recognition technologies are gaining enormous popularity in vario...
Recent studies in sequence-to-sequence learning demonstrate that RNN
enc...