Blockwise self-attentional encoder models have recently emerged as one p...
Non-autoregressive (NAR) modeling has gained significant interest in spe...
Audio-visual representation learning aims to develop systems with human-...
Text language models have shown remarkable zero-shot capability in gener...
Collecting audio-text pairs is expensive; however, it is much easier to ...
Previous Multimodal Information based Speech Processing (MISP) challenge...
We propose a decoder-only language model, VoxtLM, that can perform four ...
Automatic speech recognition (ASR) based on transducers is widely used. ...
Although frame-based models, such as CTC and transducers, have an affini...
Neural speech separation has made remarkable progress and its integratio...
There has been an increased interest in the integration of pretrained sp...
End-to-end speech summarization has been shown to improve performance ov...
The CHiME challenges have played a significant role in the development a...
Self-supervised learning (SSL) has led to great strides in speech proces...
Hidden-unit BERT (HuBERT) is a widely-used self-supervised learning (SSL...
In reverberant conditions with multiple concurrent speakers, each microp...
Self-supervised learning (SSL) of speech has shown impressive results in...
Self-supervised learning (SSL) has achieved notable success in many spee...
We investigate the emergent abilities of the recently proposed web-scale...
Conformer, a convolution-augmented Transformer variant, has become the d...
Speech processing Universal PERformance Benchmark (SUPERB) is a leaderbo...
Most of the speech translation models heavily rely on parallel data, whi...
Recently there have been efforts to introduce new benchmark tasks for sp...
This paper describes our system for the low-resource domain adaptation t...
Most human interactions occur in the form of spoken conversations where ...
Large language models (LLMs) have exhibited remarkable capabilities acro...
We propose FSB-LSTM, a novel long short-term memory (LSTM) based archite...
This paper introduces a novel Token-and-Duration Transducer (TDT) archit...
It has been known that direct speech-to-speech translation (S2ST) models...
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitat...
Transformer-based end-to-end speech recognition has achieved great succe...
The Multi-modal Information based Speech Processing (MISP) challenge aim...
In the last decade of automatic speech recognition (ASR) research, the i...
Self-supervised speech representation learning (SSL) has been shown to be eff...
Multilingual Automatic Speech Recognition (ASR) models have extended the...
Despite rapid advancement in recent years, current speech enhancement mo...
Speech enhancement models have greatly progressed in recent years, but s...
This paper describes our submission to the Second Clarity Enhancement Ch...
To build speech processing methods that can handle speech as naturally a...
Recent Text-to-Speech (TTS) systems trained on reading or acted corpora ...
While neural text-to-speech (TTS) has achieved human-like natural synthe...
The network architecture of end-to-end (E2E) automatic speech recognitio...
Spoken language understanding (SLU) tasks have been studied for many dec...
Self-supervised pre-trained transformers have improved the state of the ...
Direct speech-to-speech translation (S2ST), in which all components can ...
While human evaluation is the most reliable metric for evaluating speech...
This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EU...
We propose TF-GridNet for speech separation. The model is a novel multi-...
Disfluency detection has mainly been solved in a pipeline approach, as p...
We present a unified system to realize one-shot voice conversion (VC) on...