Attention-based conditioning methods using variable frame rate for style-robust speaker verification

by Amber Afshan, et al.

We propose an approach to extract speaker embeddings that are robust to speaking style variations in text-independent speaker verification. Typically, speaker embedding extraction involves training a DNN for speaker classification and using the bottleneck features as speaker representations. Such a network has a pooling layer that transforms frame-level features into utterance-level features by calculating statistics over all utterance frames, with equal weighting. In contrast, self-attentive embeddings perform weighted pooling, where the weights correspond to the importance of the frames in a speaker classification task. Entropy can capture acoustic variability due to speaking style variations. Hence, an entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer, providing the network with information that can address style effects. This work explores five different approaches to conditioning. On the UCLA speaker variability database, the best conditioning approach, concatenation with gating, provided statistically significant improvements over the x-vector baseline in 12/23 tasks and performed the same as the baseline in the remaining 11/23 tasks. It also significantly outperformed self-attention without conditioning in 9/23 tasks and was worse in 1/23. The method also showed significant improvements in multi-speaker scenarios of Speakers in the Wild (SITW).
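The difference between equal-weight statistics pooling and attention-weighted pooling can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `stats_pooling` and `attentive_stats_pooling` are hypothetical, and the per-frame `scores` are assumed to come from some attention layer (in the paper's setting, one that could also be conditioned on the variable frame rate vector, which is not modeled here).

```python
import numpy as np

def stats_pooling(frames):
    """Equal-weight pooling: mean and std over all frames, (T, D) -> (2D,)."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

def attentive_stats_pooling(frames, scores):
    """Weighted pooling: a softmax over per-frame scores (T,) gives the weights.

    `scores` is an assumed input; a real system would compute it from the
    frame-level features with a learned attention layer.
    """
    w = np.exp(scores - scores.max())      # numerically stable softmax
    w = w / w.sum()
    mean = (w[:, None] * frames).sum(axis=0)
    var = (w[:, None] * (frames - mean) ** 2).sum(axis=0)
    std = np.sqrt(np.maximum(var, 1e-12))  # floor avoids sqrt of tiny negatives
    return np.concatenate([mean, std])

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 8))         # 200 frames, 8-dim features

# Constant scores give uniform weights, which recovers equal-weight pooling.
uniform = attentive_stats_pooling(frames, np.zeros(200))

# A strongly peaked score makes the pooled mean follow a single frame.
peaked_scores = np.zeros(200)
peaked_scores[3] = 1000.0
peaked = attentive_stats_pooling(frames, peaked_scores)
```

With constant scores the weighted statistics reduce to the plain mean and standard deviation, so equal-weight pooling is the special case in which the attention layer treats every frame as equally important.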



Attentive Statistics Pooling for Deep Speaker Embedding

This paper proposes attentive statistics pooling for deep speaker embedd...

Variable frame rate-based data augmentation to handle speaking-style variability for automatic speaker verification

The effects of speaking-style variability on automatic speaker verificat...

Learning from human perception to improve automatic speaker verification in style-mismatched conditions

Our prior experiments show that humans and machines seem to employ diffe...

Identification of Indian Languages using Ghost-VLAD pooling

In this work, we propose a new pooling strategy for language identificat...

Masked cross self-attention encoding for deep speaker embedding

In general, speaker verification tasks require the extraction of speaker...

Phonetic-attention scoring for deep speaker features in speaker verification

Recent studies have shown that frame-level deep speaker features can be ...

Supervised attention for speaker recognition

The recently proposed self-attentive pooling (SAP) has shown good perfor...
