Text language models have shown remarkable zero-shot capability in gener...
There has been a growing interest in using end-to-end acoustic models fo...
We present the latest iteration of the voice conversion challenge (VCC) ...
Hidden-unit BERT (HuBERT) is a widely used self-supervised learning (SSL...
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderbo...
Most speech translation models rely heavily on parallel data, whi...
Large language models (LLMs) have exhibited remarkable capabilities acro...
It is known that direct speech-to-speech translation (S2ST) models...
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitat...
Singing voice synthesis (SVS), as a specific task for generating the voc...
Multilingual Automatic Speech Recognition (ASR) models have extended the...
The network architecture of end-to-end (E2E) automatic speech recognitio...
This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EU...
Spoken language understanding (SLU) is a task aiming to extract high-lev...
We present the SUPERB challenge at SLT 2022, which aims at learning self...
Compressing self-supervised models has become increasingly necessary, as...
Beam search, which is the dominant ASR decoding algorithm for end-to-end...
Although Transformers have gained success in several speech processing t...
Self-Supervised Learning (SSL) models have been successfully applied in ...
Deep learning based singing voice synthesis (SVS) systems have been demo...
Transfer learning has proven to be crucial in advancing the state of spe...
Speech processing systems currently do not support the vast majority of ...
This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS)...
This paper describes the ESPnet-ST group's IWSLT 2021 submission in the ...
Self-supervised learning (SSL) has proven vital for advancing research i...
"Transcription bottlenecks", created by a shortage of effective human tr...
Target-speaker speech recognition aims to recognize target-speaker speec...
In this study, we present recent developments on ESPnet: End-to-End Spee...
Neural network (NN) based singing voice synthesis (SVS) systems requ...
Mispronunciation detection is an essential component of the Computer-Ass...