Pretrained audio neural networks for Speech emotion recognition in Portuguese

10/26/2022
by Marcelo Matheus Gauy, et al.

The goal of speech emotion recognition (SER) is to identify the emotional aspects of speech. The SER challenge for Brazilian Portuguese speech was proposed with short snippets of Portuguese, which are classified as neutral, non-neutral female, or non-neutral male according to paralinguistic elements (laughing, crying, etc.). The dataset contains about 50 minutes of Brazilian Portuguese speech. Because the dataset is small, we investigate whether a combination of transfer learning and data augmentation can produce positive results. To that end, we combine a data augmentation technique called SpecAugment with Pretrained Audio Neural Networks (PANNs) used for transfer learning. The PANNs (CNN6, CNN10 and CNN14) are pretrained on AudioSet, a large dataset containing more than 5000 hours of audio. They were finetuned on the SER dataset, and the model that performed best on the validation set (CNN10) was submitted to the challenge, achieving an F1 score of 0.73, up from the 0.54 of the baselines provided by the challenge. We also tested a Transformer neural architecture pretrained on about 600 hours of Brazilian Portuguese audio data. Transformers, as well as the more complex PANN model (CNN14), fail to generalize to the test set of the SER dataset and do not beat the baseline. Given the limited size of the dataset, the best approach for SER is currently to use PANNs (specifically, CNN6 and CNN10).
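As an illustration of the SpecAugment-style augmentation mentioned in the abstract, the following is a minimal NumPy sketch of frequency and time masking on a spectrogram. It is not the authors' implementation: the function name `spec_augment`, the mask counts, and the mask widths are hypothetical values chosen for the example, and the time-warping step of the original SpecAugment method is omitted.

```python
import numpy as np


def spec_augment(spec, num_freq_masks=2, max_freq_width=8,
                 num_time_masks=2, max_time_width=20, rng=None):
    """Return a masked copy of a (freq_bins, time_frames) spectrogram.

    Random horizontal bands (frequency masks) and vertical bands
    (time masks) are set to zero. All default values are illustrative
    assumptions, not settings taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = spec.copy()  # leave the input array untouched
    n_freq, n_time = out.shape

    # Frequency masking: zero out a random band of consecutive bins.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, min(max_freq_width, n_freq) + 1))
        start = int(rng.integers(0, n_freq - width + 1))
        out[start:start + width, :] = 0.0

    # Time masking: zero out a random band of consecutive frames.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, min(max_time_width, n_time) + 1))
        start = int(rng.integers(0, n_time - width + 1))
        out[:, start:start + width] = 0.0

    return out


if __name__ == "__main__":
    # Dummy log-mel spectrogram: 64 mel bins by 100 frames.
    dummy = np.ones((64, 100), dtype=np.float32)
    augmented = spec_augment(dummy, rng=np.random.default_rng(0))
    print("masked fraction:", float((augmented == 0).mean()))
```

In a training loop, a function like this would typically be applied to each spectrogram on the fly, so that the network sees a differently masked version of every clip at each epoch. That is one reason masking-based augmentation is attractive for a dataset of only about 50 minutes of speech.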
