Indonesian Automatic Speech Recognition with XLSR-53

08/20/2023


This study develops an Indonesian Automatic Speech Recognition (ASR) system using the XLSR-53 pre-trained model, where XLSR stands for cross-lingual speech representations. The pre-trained model is used to significantly reduce the amount of non-English training data required to achieve a competitive Word Error Rate (WER). The study uses 24 hours, 18 minutes, and 1 second of data in total: (1) TITML-IDN, 14 hours and 31 minutes; (2) Magic Data, 3 hours and 33 minutes; and (3) Common Voice, 6 hours, 14 minutes, and 1 second. With a WER of 20%, the model can compete with similar models on the Common Voice test split. The WER can be decreased by around 8% from 20%, improving on previous research and contributing to a better Indonesian ASR with a smaller amount of data.
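For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer's output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the study; the function name `word_error_rate` and the example sentences are illustrative assumptions, and text normalization (casing, punctuation) is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration only; whitespace tokenization
    and the absence of text normalization are assumptions.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One dropped word out of five reference words gives a WER of 0.2 (20%).
wer = word_error_rate("saya suka makan nasi goreng", "saya suka nasi goreng")
print(f"WER = {wer:.0%}")
```

Under this definition, a WER of 20% corresponds to roughly one word error for every five reference words.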
