Non-autoregressive (NAR) modeling has gained significant interest in spe...
Recent models such as XLS-R and Whisper have made multilingual speech
te...
We study the problem of overcoming exponential sample complexity in
diff...
Multilingual text-video retrieval methods have improved significantly in...
Beam search, which is the dominant ASR decoding algorithm for end-to-end...
We report on aggressive quantization strategies that greatly accelerate
...
Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) ...
Dialog history plays an important role in spoken language understanding ...
We introduce two techniques, length perturbation and n-best based label
...
The lack of speech data annotated with labels required for spoken langua...
Compared to hybrid automatic speech recognition (ASR) systems that use a...
Intent classifiers are vital to the successful operation of virtual agen...
The goal of spoken language understanding (SLU) systems is to determine ...
Multi-modal learning from video data has seen increased attention recent...
In this paper, we explore self-supervised audio-visual models that learn...
Large-scale distributed training of deep acoustic models plays an import...
We investigate the impact of aggressive low-precision representations of...
When recurrent neural network transducers (RNNTs) are trained using the
...
End-to-end spoken language understanding (SLU) systems that process
huma...
Spoken intent detection has become a popular approach to interface with
...
In our previous work we demonstrated that a single headed attention
enco...
Multimodal self-supervised learning is getting more and more attention a...
We present a comprehensive study on building and adapting RNN transducer...
We investigate a set of techniques for RNN Transducers (RNN-Ts) that wer...
Data privacy and protection is a crucial issue for any automatic speech
...
Transformer networks and self-supervised pre-training have consistently
...
Training an end-to-end (E2E) neural network speech-to-intent (S2I) syste...
An essential component of spoken language understanding (SLU) is slot
fi...
Current methods for learning visually grounded language from videos ofte...
Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchr...
It is generally believed that direct sequence-to-sequence (seq2seq) spee...
There has been huge progress in speech recognition over the last several...
Modern Automatic Speech Recognition (ASR) systems rely on distributed de...
With recent advances in deep learning, considerable attention has been g...
In this paper, we propose and investigate a variety of distributed deep
...
Recent work shows unequal performance of commercial face classification
...
We study the flow of information and the evolution of internal
represent...
We propose a novel online algorithm for training deep feedforward neural...
Direct acoustics-to-word (A2W) models in the end-to-end paradigm have
re...
End-to-end (E2E) systems have achieved competitive results compared to
c...
We study large-scale kernel methods for acoustic modeling in speech
reco...
We study large-scale kernel methods for acoustic modeling and compare to...
Convolutional neural networks (CNNs) are a standard component of many cu...
The computational complexity of kernel methods has often been a major ba...
Hessian-free training has become a popular parallel second order
optim...
Deep Convolutional Neural Networks (CNNs) are more powerful than Deep Ne...