Visually grounded speech systems learn from paired images and their spok...
The recently proposed Joint Energy-based Model (JEM) interprets discrimi...
In recent studies, self-supervised pre-trained models tend to outperform...
Considering the abundance of unlabeled speech data and the high labeling...
Adversarial attacks are a threat to automatic speech recognition (ASR) s...
Adversarial attacks pose a severe security threat to the state-of-the-ar...
Speech systems developed for a particular choice of acoustic domain and ...
The high cost of data acquisition makes Automatic Speech Recognition (AS...
The pervasiveness of intra-utterance Code-switching (CS) in spoken conte...
Typically, unsupervised segmentation of speech into the phone and word-l...
This technical report describes Johns Hopkins University speaker recogni...
Speech emotion recognition is the task of recognizing the speaker's emot...
Capitalization and punctuation are important cues for comprehending writ...
Dialog acts can be interpreted as the atomic units of a conversation, mo...
This paper introduces WaveGrad 2, a non-autoregressive generative model ...
Automatic detection of phoneme or word-like units is one of the core obj...
The ubiquitous presence of machine learning systems in our lives necessi...
Research in automatic speaker recognition (SR) has been undertaken for s...
This paper introduces a novel method to diagnose the source-target atten...
Data augmentation is a widely used strategy for training robust machine ...
The idea of combining multiple languages' recordings to train a single a...
Deep learning based speech denoising still suffers from the challenge of...
Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker vo...
Unsupervised spoken term discovery consists of two tasks: finding the ac...
We investigated an enhancement and a domain adaptation approach to make ...
Only a handful of the world's languages are abundant with the resources ...
Automatic Speech Recognition (ASR) systems introduce word errors, which ...
In this work, we explore the dependencies between speaker recognition an...
Data augmentation is conventionally used to inject robustness in Speaker...
This paper presents the problems and solutions addressed at the JSALT wo...
Recently, very deep transformers have started showing superior performance t...
The task of making speaker verification systems robust to adverse scenar...
Current speaker recognition technology provides great performance with t...
Speaker Verification still suffers from the challenge of generalization ...
BERT, which stands for Bidirectional Encoder Representations from Transf...
This paper presents an unsupervised segment-based method for robust voic...
Automatic measuring of speaker sincerity degree is a novel research prob...
The Multi-target Challenge aims to assess how well current speech techno...
We present JHU's system submission to the ASVspoof 2019 Challenge: Anti-...
We explore training attention-based encoder-decoder ASR for low-resource...
In this paper, we explore several new schemes to train a seq2seq model t...
An attacker may use a variety of techniques to fool an automatic speaker...
The Multitarget Challenge aims to assess how well current speech technol...
In topic identification (topic ID) on real-world unstructured audio, an ...
An ASR system usually does not predict any punctuation or capitalization...
We describe the system our team used during NIST's LoReHLT (Low Resource...
Acoustic unit discovery (AUD) is a process of automatically identifying ...
We investigate different approaches for dialect identification in Arabic...
Learned feature representations and sub-phoneme posteriors from Deep Neu...