Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation

06/04/2019
by Elizabeth Salesky, et al.

Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.
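
The frame-averaging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `average_by_phoneme`, the use of NumPy, and the input layout (one feature vector and one phoneme label per frame) are assumptions made for illustration only. How the per-frame labels are produced is outside the scope of this sketch.

```python
from itertools import groupby

import numpy as np


def average_by_phoneme(frames, labels):
    """Collapse runs of consecutive frames that share a label into one vector.

    frames: array-like of shape (num_frames, feature_dim)  (assumed layout)
    labels: sequence of per-frame phoneme labels, len == num_frames
    Returns (averaged, run_labels), where averaged has one row per run.
    """
    frames = np.asarray(frames, dtype=float)
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")

    averaged = []
    run_labels = []
    start = 0
    # groupby yields one group per run of identical consecutive labels,
    # so a label that recurs later in the utterance forms a separate run.
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        averaged.append(frames[start:start + length].mean(axis=0))
        run_labels.append(label)
        start += length

    if not averaged:
        # Empty utterance: return an empty array with matching feature shape.
        return np.empty((0,) + frames.shape[1:]), run_labels
    return np.stack(averaged), run_labels
```

For example, four frames labeled `["AH", "AH", "T", "T"]` collapse into two vectors, one per phoneme run, which is what shortens the source sequence passed to the translation model.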
