End-to-end Speech Recognition with Adaptive Computation Steps
In this paper, we present the Adaptive Computation Steps (ACS) algorithm, which enables end-to-end speech recognition models to dynamically decide how many frames should be processed to predict a linguistic output. The ACS-equipped model follows the classic encoder-decoder framework, but unlike attention-based models, it produces alignments independently at the encoder side using the correlation between adjacent frames. Predictions can therefore be made as soon as sufficient inter-frame information is received, which makes the model applicable to online scenarios. We verify the ACS algorithm on the open-source Mandarin speech corpus AIShell-1, where it achieves a character error rate (CER) of 35.2%, on par with the attention-based model in the online setting. To fully demonstrate the advantage of the ACS algorithm, offline experiments are also conducted, in which our ACS model achieves a 21.6% CER, outperforming the attention-based counterpart. Index Terms: Adaptive Computation Steps, Encoder-Decoder Recurrent Neural Networks, End-to-End Training.
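To illustrate the core idea, below is a minimal sketch of an adaptive halting loop: a per-frame halting score (here derived from the correlation between adjacent frames) is accumulated until it crosses a threshold, at which point the frames seen so far are pooled into one output state. All names (`adaptive_computation_steps`, `toy_halt`), the mean-pooling of chunks, and the sigmoid halting unit are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

def adaptive_computation_steps(frames, halt_score, threshold=0.99):
    """Group encoder frames into variable-length chunks (illustrative sketch).

    Accumulates a per-frame halting probability; once the running sum crosses
    `threshold`, the frames consumed so far are pooled into one output state
    and the accumulator resets. `halt_score` maps a pair of adjacent frames
    to a probability in (0, 1).
    """
    outputs, chunk, acc = [], [], 0.0
    prev = np.zeros_like(frames[0])
    for f in frames:
        p = halt_score(prev, f)   # halting probability from adjacent frames
        acc += p
        chunk.append(f)
        if acc >= threshold:      # enough inter-frame evidence: emit one state
            outputs.append(np.mean(chunk, axis=0))
            chunk, acc = [], 0.0
        prev = f
    return outputs

# Toy halting unit (an assumption): sigmoid of the dot product of
# adjacent frames, so similar consecutive frames halt sooner.
def toy_halt(prev, cur):
    return 1.0 / (1.0 + np.exp(-np.dot(prev, cur)))
```

Because the halting decision depends only on frames already seen, the loop can run incrementally on a stream, which is what makes the approach usable online.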