Attention-based conditioning methods using variable frame rate for style-robust speaker verification

06/28/2022
by Amber Afshan, et al.

We propose an approach to extract speaker embeddings that are robust to speaking style variations in text-independent speaker verification. Typically, speaker embedding extraction involves training a DNN for speaker classification and using its bottleneck features as speaker representations. Such a network has a pooling layer that transforms frame-level features into utterance-level features by calculating statistics over all frames of an utterance, with equal weighting. Self-attentive embeddings, in contrast, perform weighted pooling in which the weights correspond to the importance of each frame for the speaker classification task. Entropy can capture acoustic variability due to speaking style variations. Hence, an entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer, providing the network with information that can address style effects. This work explores five different approaches to conditioning. The best conditioning approach, concatenation with gating, provided statistically significant improvements over the x-vector baseline in 12/23 tasks and performed on par with the baseline in the remaining 11/23 tasks on the UCLA speaker variability database. It also significantly outperformed self-attention without conditioning in 9/23 tasks and was worse in 1/23. The method also showed significant improvements in the multi-speaker scenarios of SITW.
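
To make the pooling idea concrete, here is a minimal NumPy sketch of attention-weighted statistics pooling with a gated, concatenated conditioning input. It is an illustration under stated assumptions, not the authors' architecture: the abstract does not specify layer shapes, how the gate is computed, or whether the variable frame rate vector has one value per frame, so the per-frame conditioning array `cond`, the gate computed from the frames, and the names `init_params` and `attentive_pool` are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    # Numerically stable softmax over a 1-D array of frame scores.
    e = np.exp(x - x.max())
    return e / e.sum()


def init_params(rng, feat_dim, cond_dim, hidden_dim):
    # Hypothetical random parameters; in practice these would be learned.
    return {
        "W_g": rng.normal(scale=0.1, size=(feat_dim, cond_dim)),
        "b_g": np.zeros(cond_dim),
        "W_a": rng.normal(scale=0.1, size=(feat_dim + cond_dim, hidden_dim)),
        "b_a": np.zeros(hidden_dim),
        "v": rng.normal(scale=0.1, size=hidden_dim),
    }


def attentive_pool(frames, cond, params):
    """Pool frame-level features (T, D) into one utterance-level vector (2D,).

    cond is an assumed per-frame conditioning signal of shape (T, C),
    for example an entropy-derived value per frame.
    """
    # Gate (assumption): frames decide how much conditioning to let through.
    gate = sigmoid(frames @ params["W_g"] + params["b_g"])   # (T, C)
    # Concatenate the gated conditioning signal to the frame features.
    joint = np.concatenate([frames, gate * cond], axis=1)    # (T, D + C)
    # One scalar attention score per frame, normalized over time.
    hidden = np.tanh(joint @ params["W_a"] + params["b_a"])  # (T, H)
    weights = softmax(hidden @ params["v"])                  # (T,)
    # Weighted mean and standard deviation over frames.
    mean = weights @ frames
    var = weights @ (frames ** 2) - mean ** 2
    std = np.sqrt(np.clip(var, 1e-8, None))
    return np.concatenate([mean, std]), weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(50, 8))   # 50 frames, 8-dim features
    cond = rng.random(size=(50, 1))     # one conditioning value per frame
    params = init_params(rng, feat_dim=8, cond_dim=1, hidden_dim=16)
    embedding, weights = attentive_pool(frames, cond, params)
    print(embedding.shape, weights.sum())
```

When every frame receives the same score, the weights become uniform and the result reduces to the equal-weight statistics pooling described above, which is a convenient sanity check for any variant of this layer.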


research
03/29/2018

Attentive Statistics Pooling for Deep Speaker Embedding

This paper proposes attentive statistics pooling for deep speaker embedd...
research
08/08/2020

Variable frame rate-based data augmentation to handle speaking-style variability for automatic speaker verification

The effects of speaking-style variability on automatic speaker verificat...
research
06/28/2022

Learning from human perception to improve automatic speaker verification in style-mismatched conditions

Our prior experiments show that humans and machines seem to employ diffe...
research
02/05/2020

Identification of Indian Languages using Ghost-VLAD pooling

In this work, we propose a new pooling strategy for language identificat...
research
01/28/2020

Masked cross self-attention encoding for deep speaker embedding

In general, speaker verification tasks require the extraction of speaker...
research
11/08/2018

Phonetic-attention scoring for deep speaker features in speaker verification

Recent studies have shown that frame-level deep speaker features can be ...
research
11/10/2020

Supervised attention for speaker recognition

The recently proposed self-attentive pooling (SAP) has shown good perfor...
