Audio deepfake detection is an emerging topic in the artificial intellig...
In this paper, we propose a novel self-distillation method for fake spee...
Text-to-speech (TTS) and voice conversion (VC) are two different tasks b...
Text-based speech editing allows users to edit speech by intuitively cut...
Current end-to-end code-switching Text-to-Speech (TTS) can already gener...
Recently, pioneering research works have proposed a large number of acousti...
Traditional vocoders have the advantages of high synthesis efficienc...
The text-based speech editor allows the editing of speech through intuit...
Audio deepfake detection is an emerging topic, which was included in the...
End-to-end singing voice synthesis (SVS) is attractive due to the avoida...
Transducer-based models, such as RNN-Transducer and transformer-transduc...
The autoregressive (AR) models, such as attention-based encoder-decoder ...
Attention-based encoder-decoder (AED) models have achieved promising per...
Recurrent neural networks (RNNs) have shown significant improvements in ...
The joint training framework for speech enhancement and recognition meth...
Despite the recent significant advances witnessed in end-to-end (E2E) AS...
Non-autoregressive transformer models have achieved extremely fast infer...
Although attention-based end-to-end models have achieved promising perfo...
Monaural speech dereverberation is a very challenging task because no sp...
In this paper, we propose an end-to-end post-filter method with deep att...
Multi-channel deep clustering (MDC) has achieved good performance for ...
For most of the attention-based sequence-to-sequence models, the decoder...
Because an attention-based sequence-to-sequence speech (Seq2Seq) recogni...
In a typical voice conversion system, prior works utilize various acoust...
Recurrent neural network transducers (RNN-T) have been successfully appl...
Deep clustering (DC) and utterance-level permutation invariant training ...
Neural end-to-end TTS can generate very high-quality synthesized speech,...
Integrating an external language model into a sequence-to-sequence speec...
To improve the performance of far-field speech recognition, th...
This paper focuses on two key problems for audio-visual emotion recognit...