[Streaming-ASR] RNN Transducer

Chris blog · September 15, 2023

Speech recognition (keyword spotting, ASR)

1. Pros./Cons. of RNN-T

Pros

Better accuracy: Removes the conditional independence assumption made by CTC
Low latency: Can be used in streaming ASR applications
RNN-T outperforms MoChA in latency, inference time, and training stability. (Comparison study from Kim et al.)
The industry tends to choose RNN-T as the dominant streaming E2E model.

Cons

The output prediction tensor is 3D and takes a lot of memory (More detail from Moriya et al.)
Vanilla RNN-T can delay its label predictions, which hurts latency, a critical metric for ASR

2. RNN-T formulation

$P(y_t | x_{1:t}, y_{1:u-1})$

The current token $y_t$ is predicted from:

Previous output tokens $y_{1:u-1}$
Speech sequence $x_{1:t}$

3. RNN-T Structure

Encoder: Generates a high-level feature representation $h_t^{enc}$ from $x_t$
Prediction network: Generates $h_u^{pre}$ based on RNN-T's previous output label $y_{u-1}$
Joint network: A feed-forward network that combines $h_u^{pre}$ and $h_t^{enc}$ as:

$z_{t,u} = \psi(Qh_t^{enc} + Vh_u^{pre}+b_z)$

$h_{t,u} = W_{y}z_{t,u}+b_y$

$P(y_t=k | x_{1:t}, y_{1:u-1})=\mathrm{softmax}(h_{t,u}^k)$

Parameters:
- $Q$ and $V$ are weight matrices
- $\psi$ is a non-linear function (e.g., ReLU or tanh)
- $W_y$ is a second weight matrix, applied to $z_{t,u}$
- $b_z$ and $b_y$ are bias vectors
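
The joint network equations above can be sketched in NumPy as follows. This is only an illustration under assumptions: $\psi$ is taken to be tanh, the sizes `D_enc`, `D_pre`, and `J` are made-up names for the encoder, prediction, and joint hidden dimensions, and random arrays stand in for the real encoder and prediction network outputs.

```python
import numpy as np

def joint_network(h_enc, h_pre, Q, V, b_z, W_y, b_y):
    """Combine encoder and prediction outputs into (T, U, K) log-probabilities."""
    enc_proj = h_enc @ Q.T                      # (T, J): Q h_t^enc for every t
    pre_proj = h_pre @ V.T                      # (U, J): V h_u^pre for every u
    # Broadcasting adds every (t, u) pair; psi is assumed to be tanh here.
    z = np.tanh(enc_proj[:, None, :] + pre_proj[None, :, :] + b_z)  # (T, U, J)
    h = z @ W_y.T + b_y                         # (T, U, K)
    # Numerically stable log-softmax over the token axis.
    h = h - h.max(axis=-1, keepdims=True)
    return h - np.log(np.exp(h).sum(axis=-1, keepdims=True))

rng = np.random.default_rng(0)
T, U, K = 8, 5, 6              # hypothetical sizes
D_enc, D_pre, J = 4, 3, 7      # hypothetical hidden dimensions

h_enc = rng.normal(size=(T, D_enc))   # stand-in for encoder outputs
h_pre = rng.normal(size=(U, D_pre))   # stand-in for prediction network outputs
Q = rng.normal(size=(J, D_enc))
V = rng.normal(size=(J, D_pre))
b_z = rng.normal(size=J)
W_y = rng.normal(size=(K, J))
b_y = rng.normal(size=K)

log_probs = joint_network(h_enc, h_pre, Q, V, b_z, W_y, b_y)
print(log_probs.shape)  # (8, 5, 6)
```

Each `log_probs[t, u]` is a distribution over the $K$ tokens, so one forward pass produces a value for every (time step, label position) pair.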

3.1 Shape of output

$softmax(h_{t,u}^k) \in \mathbb{R}^{T\times U\times K}$

$T$ is the length of the speech sequence
$U$ is the length of the label sequence
$K$ is the number of possible tokens, including special symbols (e.g., start-of-sentence $\langle sos \rangle$, end-of-sentence $\langle eos \rangle$, and the blank symbol)
The output is therefore a 3D tensor, which requires much more memory than the outputs of other E2E models such as CTC and AED.
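
To get a feel for the memory gap, here is a back-of-the-envelope calculation. All sizes are hypothetical (batch size `B`, 500 frames, 100 labels, 4000 tokens, float32), and it counts only the output tensor per minibatch, assuming CTC outputs one distribution per frame and AED one per label.

```python
def output_bytes(*dims, bytes_per_value=4):
    """Size in bytes of a dense tensor with the given dimensions (float32 by default)."""
    n = 1
    for d in dims:
        n *= d
    return n * bytes_per_value

B, T, U, K = 32, 500, 100, 4000   # hypothetical batch, frames, labels, tokens

rnnt = output_bytes(B, T, U, K)   # one distribution per (t, u) pair
ctc = output_bytes(B, T, K)       # one distribution per frame
aed = output_bytes(B, U, K)       # one distribution per label

print(f"RNN-T: {rnnt / 1e9:.3f} GB")  # RNN-T: 25.600 GB
print(f"CTC:   {ctc / 1e9:.3f} GB")   # CTC:   0.256 GB
print(f"AED:   {aed / 1e9:.3f} GB")   # AED:   0.051 GB
```

With these numbers the RNN-T output is $U$ times larger than the CTC output and $T$ times larger than the AED output.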

4. Learnable parameters

Prediction network parameters
Encoder network parameters
$Q$, $V$, $b_z$, $b_y$, $W_y$ from the joint network

5. Alignment Paths

Three possible alignment paths run from the bottom-left corner to the top-right corner of the $T \times U$ grid.
The length of each alignment path: $T + U$.
Horizontal arrow: Advance one time step with a blank label.
Vertical arrow: Emit a non-blank output label and advance one label step (the time step does not change).

x-axis: Speech sequence $x=(x_1,x_2, ..., x_8)$
y-axis: Label sequence $y=(\langle s \rangle, t,e,a,m)$, where $\langle s \rangle$ is the start-of-sentence token.
Delayed decision/prediction: The green path in the figure. Its labels are emitted late, so latency is high. This is the problem with vanilla RNN-T.
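
The difference between an early and a delayed path can be made concrete with a small sketch. The two alignments below are made up for illustration (they are not the paths from the figure), `<b>` is an assumed name for the blank symbol, and the start-of-sentence token is left out because it is never emitted.

```python
BLANK = "<b>"  # assumed name for the blank symbol

def collapse(alignment):
    """Map an alignment path to its label sequence by removing blanks."""
    return [s for s in alignment if s != BLANK]

def emission_frames(alignment):
    """Frame index at which each label is emitted (number of blanks seen so far)."""
    frames, t = [], 0
    for s in alignment:
        if s == BLANK:
            t += 1          # horizontal move: advance one time step
        else:
            frames.append(t)  # vertical move: emit a label at the current frame
    return frames

T = 8  # number of speech frames, as in the x-axis above

# Two hypothetical alignments for "team".
early = ["t", BLANK, "e", BLANK, "a", BLANK, "m"] + [BLANK] * 5
delayed = [BLANK] * 6 + ["t", "e", BLANK, "a", "m", BLANK]

for path in (early, delayed):
    assert collapse(path) == list("team")
    assert path.count(BLANK) == T
    assert len(path) == T + len("team")   # path length T + U

print(emission_frames(early))    # [0, 1, 2, 3]
print(emission_frames(delayed))  # [6, 6, 7, 7]
```

Both paths produce the same label sequence, but the delayed one emits every label several frames later, which is the latency problem described above.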

6. RNN-T Loss

RNN-T tries to minimize $-\ln P(y|x)$ where

$P(y|x) = \sum_{a \in A^{-1}(y)}P(a|x)$

$a$: One of the possible alignment paths
$A$: The mapping from an alignment path $a$ to the label sequence $y$, i.e. $A(a)=y$. $A^{-1}(y)$ is the set of all alignment paths that map to $y$.
The parameters are optimized using the forward-backward algorithm (Alex et al.).
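
The sum over alignment paths can be computed without enumerating them. The sketch below shows only the forward half of the algorithm, under a commonly used lattice convention that is an assumption here: `log_probs` has shape `(T, U + 1, K)` (one extra label position for the start token), token id 0 is the blank, and every path ends with a final blank. A brute-force sum over all paths is included purely as a sanity check.

```python
import itertools
import numpy as np

def rnnt_nll(log_probs, labels, blank=0):
    """-ln P(y|x) via the forward recursion over the (T, U+1) lattice."""
    T, U = log_probs.shape[0], len(labels)
    alpha = np.full((T, U + 1), -np.inf)   # alpha[t, u]: log-prob of reaching (t, u)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:   # arrive by a blank from (t-1, u)
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:   # arrive by emitting labels[u-1] from (t, u-1)
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
    return -(alpha[T - 1, U] + log_probs[T - 1, U, blank])  # final blank

def brute_force_nll(log_probs, labels, blank=0):
    """Same quantity by explicitly summing P(a|x) over every alignment path a."""
    T, U = log_probs.shape[0], len(labels)
    total = -np.inf
    # Choose which of the first T+U-1 steps emit labels; the last step is a blank.
    for label_steps in itertools.combinations(range(T + U - 1), U):
        t = u = 0
        score = 0.0
        for step in range(T + U):
            if step in label_steps:
                score += log_probs[t, u, labels[u]]
                u += 1
            else:
                score += log_probs[t, u, blank]
                t += 1
        total = np.logaddexp(total, score)
    return -total

rng = np.random.default_rng(0)
T, K = 4, 5                      # hypothetical sizes
labels = [1, 3, 2]               # hypothetical token ids (0 is reserved for blank)
logits = rng.normal(size=(T, len(labels) + 1, K))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

print(rnnt_nll(log_probs, labels), brute_force_nll(log_probs, labels))
```

The two printed values should agree, while the recursion visits each lattice cell once instead of walking every path.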

7. Forward-backward Algorithm

7.1 Implementation

(WIP)

7.2 How to improve training efficiency

Loop skewing transformation: The forward/backward probabilities can be vectorized, so the recursions can be computed in a single loop instead of two nested loops.
Function merging: Reduces the training memory cost so that larger minibatches can be used.
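
The idea behind loop skewing can be illustrated on the forward recursion: every cell on the anti-diagonal $t + u = d$ depends only on cells with $t + u = d - 1$, so one whole diagonal can be computed at once. The sketch below is a NumPy illustration of that idea, not the implementation from any particular toolkit. It assumes `log_probs` has shape `(T, U + 1, K)` and that token id 0 is the blank. A nested-loop version is included only for comparison.

```python
import numpy as np

def forward_nested(log_probs, labels, blank=0):
    """Reference forward pass with two nested loops over t and u."""
    T, U = log_probs.shape[0], len(labels)
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U + 1):
            if t > 0:
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t - 1, u] + log_probs[t - 1, u, blank])
            if u > 0:
                alpha[t, u] = np.logaddexp(
                    alpha[t, u], alpha[t, u - 1] + log_probs[t, u - 1, labels[u - 1]])
    return -(alpha[T - 1, U] + log_probs[T - 1, U, blank])

def forward_skewed(log_probs, labels, blank=0):
    """Same recursion with a single loop over anti-diagonals d = t + u."""
    T, U = log_probs.shape[0], len(labels)
    blank_lp = log_probs[:, :, blank]                 # (T, U+1)
    label_lp = log_probs[:, np.arange(U), labels]     # (T, U): log-prob of labels[u] at (t, u)
    alpha = np.full((T, U + 1), -np.inf)
    alpha[0, 0] = 0.0
    for d in range(1, T + U):
        u = np.arange(max(0, d - T + 1), min(d, U) + 1)  # all cells on this diagonal
        t = d - u
        from_blank = np.full(u.shape, -np.inf)
        has_t = t > 0
        from_blank[has_t] = (alpha[t[has_t] - 1, u[has_t]]
                             + blank_lp[t[has_t] - 1, u[has_t]])
        from_label = np.full(u.shape, -np.inf)
        has_u = u > 0
        from_label[has_u] = (alpha[t[has_u], u[has_u] - 1]
                             + label_lp[t[has_u], u[has_u] - 1])
        alpha[t, u] = np.logaddexp(from_blank, from_label)  # whole diagonal at once
    return -(alpha[T - 1, U] + blank_lp[T - 1, U])

rng = np.random.default_rng(1)
T, K = 6, 5                      # hypothetical sizes
labels = [2, 4, 1]               # hypothetical token ids
logits = rng.normal(size=(T, len(labels) + 1, K))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

print(forward_nested(log_probs, labels), forward_skewed(log_probs, labels))
```

The single loop runs $T + U - 1$ times instead of $T \times (U + 1)$, and each iteration is a vector operation, which is what makes the approach attractive on accelerators.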

8. Different Strategies for Alignments

8.1 Constrained alignment

(WIP)

8.2 FastEmit

(WIP)

8.3 Self-alignment

Summary: Self-alignment pushes the model's alignment to the left, toward lower-latency alignments. It was reported to achieve a better accuracy-latency tradeoff than previous methods.

The blue path is the self-alignment path, and the red path is one frame to the left of it.

During training, the method encourages the left-alignment path, pushing the model's alignment to the left.
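
One way to picture the left-alignment path is to shift each label's emission frame one step earlier and reward the model for putting probability there. The sketch below is only a rough reading of that idea, not the published formulation: the emission frames of the self-alignment path are given as input (in practice they would come from the model's own best alignment), and the function names, the `weight` value, and the exact form of the penalty are all assumptions.

```python
import numpy as np

def left_shifted_frames(emission_frames):
    """Move every emission one frame to the left, clamping at frame 0."""
    return np.maximum(np.asarray(emission_frames) - 1, 0)

def left_alignment_penalty(log_probs, labels, emission_frames, weight=0.01):
    """Hypothetical regularizer: negative log-prob of the labels on the left path.

    log_probs: (T, U+1, K) joint network log-probabilities.
    labels: U token ids; emission_frames: frame index where each label is emitted.
    """
    frames = left_shifted_frames(emission_frames)
    u = np.arange(len(labels))
    # Log-probability of emitting labels[u] at lattice cell (frames[u], u).
    return -weight * log_probs[frames, u, labels].sum()

rng = np.random.default_rng(2)
T, K = 8, 6                       # hypothetical sizes
labels = [1, 4, 2, 5]             # hypothetical token ids
logits = rng.normal(size=(T, len(labels) + 1, K))
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

self_alignment = [2, 2, 5, 7]     # hypothetical emission frames of the self-alignment path
print(left_shifted_frames(self_alignment))   # [1 1 4 6]
print(left_alignment_penalty(log_probs, labels, self_alignment))
```

Adding such a term to the RNN-T loss lowers the cost of emitting each label one frame earlier, which nudges the alignment to the left over the course of training.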
