Current talking face generation methods mainly focus on speech-lip synch...
The purpose of multi-object tracking (MOT) is to continuously track and ...
The task of synthetic speech generation is to generate language content ...
Neural text-to-speech (TTS) generally consists of cascaded architecture ...
We previously proposed contextual spelling correction (CSC) to correct t...
We introduce a language modeling approach for text to speech synthesis (...
Using a text description as prompt to guide the generation of text or im...
Humans usually compose music by organizing elements according to the mus...
While previous speech-driven talking face generation methods have made s...
Current text to speech (TTS) systems usually leverage a cascaded acousti...
This paper proposes a new "decompose-and-edit" paradigm for the text-bas...
Binaural audio plays a significant role in constructing immersive augmen...
Text to speech (TTS) has made rapid progress in both academia and indust...
Adaptive text to speech (TTS) can synthesize new voices in zero-shot sce...
Contextual biasing is an important and challenging task for end-to-end a...
Denoising diffusion probabilistic models (diffusion models for short) re...
We propose a novel robust and efficient Speech-to-Animation (S2A) approa...
In the development of neural text-to-speech systems, model pre-training ...
It's challenging to customize transducer-based automatic speech recognit...
While recent text to speech (TTS) models perform very well in synthesizi...
Text to speech (TTS) is widely used to synthesize personal voice for a t...
Custom voice, a specific text to speech (TTS) service in commercial spee...
Mean opinion score (MOS) is a popular subjective metric to assess the qu...
Text to speech (TTS) has been broadly used to synthesize natural and int...
While neural-based text to speech (TTS) models can synthesize natural an...
Speech synthesis (text to speech, TTS) and recognition (automatic speech...
Because of its streaming nature, recurrent neural network transducer (RN...
Transformer-based text to speech (TTS) model (e.g., Transformer TTS <cit...
Advanced text to speech (TTS) models such as FastSpeech can synthesize s...
To speed up the inference of neural speech synthesis, non-autoregressive...
Non-autoregressive (NAR) models generate all the tokens of a sequence in...
Attention-based encoder-decoder models have achieved impressive results fo...
Neural network based end-to-end text to speech (TTS) has significantly i...
Text to speech (TTS) and automatic speech recognition (ASR) are two dual...
Grapheme-to-phoneme (G2P) conversion is an important task in automatic s...
Although end-to-end neural text-to-speech (TTS) methods (such as Tacotro...