Investigations on Phoneme-Based End-To-End Speech Recognition
Common end-to-end models like CTC or encoder-decoder-attention models use characters or subword units like BPE as the output labels. We do systematic comparisons between grapheme-based and phoneme-based output labels. These can be single phonemes without context ( 40 labels), or multiple phonemes together in one output label, such that we get phoneme-based subwords. For this purpose, we introduce phoneme-based BPE labels. In further experiments, we extend the phoneme set by auxiliary units to be able to discriminate homophones (different words with same pronunciation). This enables a very simple and efficient decoding algorithm. We perform the experiments on Switchboard 300h and we can show that our phoneme-based models are competitive to the grapheme-based models.
READ FULL TEXT