Cross-modal retrieval (CMR) has been extensively applied in various doma...
Most existing sandstorm image enhancement methods are based on tradition...
This paper integrates graph-to-sequence into an end-to-end text-to-speec...
Generating realistic talking faces is a complex and widely discussed tas...
The rise of the phenomenon of the "right to be forgotten" has prompted
r...
Voice conversion is a method that allows for the transformation of speak...
Music Emotion Recognition involves the automatic identification of emoti...
In the realm of Large Language Models, the balance between instruction d...
Voice conversion as the style transfer task applied to speech, refers to...
Chinese Automatic Speech Recognition (ASR) error correction presents
sig...
Conversational Question Answering (CQA) is a challenging task that aims ...
In order to construct or extend entity-centric and event-centric knowled...
There has been significant progress in emotional Text-To-Speech (TTS)
sy...
In recent Text-to-Speech (TTS) systems, a neural vocoder often generates...
Deep neural retrieval models have amply demonstrated their power but
est...
Deep neural networks have achieved remarkable performance in retrieval-b...
Because of predicting all the target tokens in parallel, the
non-autoreg...
Recent expressive text to speech (TTS) models focus on synthesizing emot...
Music genre classification has been widely studied in past few years for...
Zero-shot information extraction (IE) aims to build IE systems from the
...
The recent emergence of joint CTC-Attention model shows significant
impr...
Recent advances in pre-trained language models have improved the perform...
Most previous neural text-to-speech (TTS) methods are mainly based on
su...
Metaverse expands the physical world to a new dimension, and the physica...
Recovering the masked speech frames is widely applied in speech
represen...
In this paper, we proposed Adapitch, a multi-speaker TTS method that mak...
Since the beginning of the COVID-19 pandemic, remote conferencing and
sc...
Buddhism is an influential religion with a long-standing history and pro...
Nonparallel multi-domain voice conversion methods such as the StarGAN-VC...
One-shot voice conversion (VC) with only a single target speaker's speec...
Non-parallel many-to-many voice conversion remains an interesting but
ch...
Time-domain Transformer neural networks have proven their superiority in...
Although deep Neural Networks (DNNs) have achieved tremendous success in...
In this paper, we investigated a speech augmentation based unsupervised
...
Low resource automatic speech recognition (ASR) is a useful but thorny t...
Singing voice synthesis is a generative task that involves multi-dimensi...
Recently, synthesizing personalized speech by text-to-speech (TTS)
appli...
Metaverse has stretched the real world into unlimited space. There will ...
Metaverse is an interactive world that combines reality and virtuality, ...
Incomplete utterance rewriting (IUR) has recently become an essential ta...
Any-to-any voice conversion problem aims to convert voices for source an...
Multi-speaker text-to-speech (TTS) using a few adaption data is a challe...
Voice Conversion(VC) refers to changing the timbre of a speech while
ret...
This paper investigates a novel task of talking face video generation so...
Large-scale deep neural networks (DNNs) such as convolutional neural net...
Predicting the altered acoustic frames is an effective way of self-super...
Self-attention models have been successfully applied in end-to-end speec...
Slot filling and intent detection have become a significant theme in the...
Several domains own corresponding widely used feature extractors, such a...
Recent neural vocoders usually use a WaveNet-like network to capture the...