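RNN-T predicts the current token using three components:
Encoder: Generates a high-level feature representation $h^{enc}_t$ from the acoustic input $x_{1:t}$.
Prediction network: Generates $h^{pred}_u$ based on RNN-T's previous output label $y_{u-1}$.
Joint network: A feed-forward network that combines $h^{enc}_t$ and $h^{pred}_u$ into a distribution over the next output token (a label or blank).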
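One common way to write the three components is sketched below. The symbols ($f^{enc}$, $f^{pred}$, $z_{t,u}$, the weights $U$, $V$, $W_y$, the biases $b_z$, $b_y$, and the nonlinearity $\phi$) are assumptions chosen for illustration and may not match the notation of the original source. A recurrent prediction network also carries state from earlier labels, which is omitted here for brevity.

```latex
\begin{aligned}
h^{enc}_t  &= f^{enc}(x_{1:t}) \\
h^{pred}_u &= f^{pred}(y_{u-1}) \\
z_{t,u}    &= \phi\left(U h^{enc}_t + V h^{pred}_u + b_z\right) \\
P(k \mid t, u) &= \mathrm{softmax}\left(W_y z_{t,u} + b_y\right)_k
\end{aligned}
```

Here $k$ ranges over the output vocabulary plus the blank symbol, so every lattice node $(t, u)$ gets its own distribution.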
Parameters:
x-axis: Speech sequence $x = (x_1, \dots, x_T)$
y-axis: Label sequence $y = (y_0, y_1, \dots, y_U)$, where $y_0$ is a token for start-of-sentence.
Delayed decision/prediction: The green path in the image above. Labels are emitted late, so latency is high. This is a known problem of vanilla RNN-T.
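RNN-T tries to minimize $-\ln P(y \mid x)$, where $P(y \mid x) = \sum_{a \in B^{-1}(y)} P(a \mid x)$
$a$: One of the possible alignment paths
$B$: The mapping from the alignment path $a$ to the label sequence $y$, i.e. $B(a) = y$.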
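The mapping from alignment paths to label sequences can be illustrated with a small sketch. It assumes the usual RNN-T convention that a blank advances one frame and a label advances one position in the label sequence. The names `BLANK`, `collapse_alignment`, and `emission_frames` are hypothetical and chosen for illustration only.

```python
BLANK = "<b>"  # hypothetical blank symbol


def collapse_alignment(path):
    """Drop blanks from an alignment path to recover the label sequence."""
    return [tok for tok in path if tok != BLANK]


def emission_frames(path):
    """Return the frame index at which each label is emitted.

    Each blank advances the frame index by one; labels do not.
    """
    frames, t = [], 0
    for tok in path:
        if tok == BLANK:
            t += 1
        else:
            frames.append(t)
    return frames


# Two alignment paths over T = 4 frames for the same labels (U = 3).
early = ["c", BLANK, "a", "t", BLANK, BLANK, BLANK]
late = [BLANK, BLANK, BLANK, "c", "a", "t", BLANK]

print(collapse_alignment(early), emission_frames(early))
print(collapse_alignment(late), emission_frames(late))
```

Both paths map to the same label sequence, but the second emits every label three frames in. This is the kind of delayed prediction that raises latency.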
The parameters are optimized using the forward-backward algorithm (Alex et al.).
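The forward half of that computation can be sketched as follows: a minimal pure-Python version that sums the probability of all alignment paths over the lattice in log space. The input layout is an assumption made for this example: `log_blank[t][u]` is the log-probability of blank at node `(t, u)`, and `log_label[t][u]` is the log-probability of emitting the next label at that node. The backward pass and the gradients needed for training are not shown.

```python
import math


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rnnt_log_likelihood(log_blank, log_label):
    """Forward pass over the T x (U + 1) lattice.

    log_blank: T rows of U + 1 entries.
    log_label: T rows of U entries.
    Returns the log of the summed probability of all alignment paths.
    """
    T = len(log_blank)
    U = len(log_blank[0]) - 1
    alpha = [[-math.inf] * (U + 1) for _ in range(T)]
    alpha[0][0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t == 0 and u == 0:
                continue
            # Arrive by emitting blank at (t - 1, u), moving one frame right.
            from_blank = alpha[t - 1][u] + log_blank[t - 1][u] if t > 0 else -math.inf
            # Arrive by emitting a label at (t, u - 1), moving one label up.
            from_label = alpha[t][u - 1] + log_label[t][u - 1] if u > 0 else -math.inf
            alpha[t][u] = logaddexp(from_blank, from_label)
    # Every path ends with a final blank from the top-right node.
    return alpha[T - 1][U] + log_blank[T - 1][U]


# Toy lattice: T = 2 frames, U = 1 label, every probability set to 0.5.
p = math.log(0.5)
toy_blank = [[p, p], [p, p]]
toy_label = [[p], [p]]
print(math.exp(rnnt_log_likelihood(toy_blank, toy_label)))
```

In the toy lattice there are two valid paths of three steps each, so the total probability is 2 × 0.5³ = 0.25.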
(WIP)
Summary: Self-alignment pushes the model's alignment to the left, i.e. toward a lower-latency alignment. It was reported to achieve a better accuracy-latency tradeoff than previous methods.
- The blue path indicates the self-alignment path, and the red path is one frame to the left of it.
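The relationship between the two paths can be sketched in a few lines, under two loudly stated assumptions. First, the self-alignment path is taken as given, as a list of per-label emission frames (a hypothetical input, for example obtained from the model's own best alignment). Second, the regularizer is assumed to be the negative log-probability of the labels along the left-shifted path. The exact loss in the original work may differ, and `shift_left` and `left_path_penalty` are illustrative names.

```python
def shift_left(frames):
    """Move every emission one frame earlier, clamping at frame 0."""
    return [max(t - 1, 0) for t in frames]


def left_path_penalty(log_label, frames):
    """Assumed regularizer: negative log-probability of emitting each label
    one frame earlier than in the given alignment.

    log_label[t][u]: log-probability of emitting the (u + 1)-th label
    at lattice node (t, u).
    """
    left = shift_left(frames)
    return -sum(log_label[t][u] for u, t in enumerate(left))


# Hypothetical self-alignment: three labels emitted at frames 1, 2, 2.
self_alignment = [1, 2, 2]
# Toy log-probabilities for 3 frames x 3 labels (made-up values).
toy_log_label = [[-0.1 * (t + 1)] * 3 for t in range(3)]

print(shift_left(self_alignment))
print(left_path_penalty(toy_log_label, self_alignment))
```

Minimizing such a term alongside the RNN-T loss would raise the probability of emitting each label one frame earlier, which is the leftward push described in the summary.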