[Paper Review] Seq2seq

Sung Jae Hyuk · October 30, 2023

Summary of Seq2seq

Keyword: Stacked LSTM

Prerequisites / Previous work

  1. RNN

    • Whereas learning in a conventional Feed Forward Neural Network (FFNN) depends only on the current input, an RNN adds a hidden state that reflects past states, and learning proceeds with both.
    • If the input at time $t$ is $x_t$, a hidden state $h_t$ in the middle stores past information, so the output $y_t$ is produced from both $x_t$ and $h_t$.
    • In addition, conventional FFNNs always require a fixed-size input, whereas an RNN handles variable-length input relatively freely because the input is fed in over time.
    • Each vector is computed as follows:
    $$
    \begin{aligned}
    h_t &= \tanh (W_{xh} x_t + W_{hh} h_{t-1} + b_h) \\
    y_t &= W_{hy} h_t + b_y
    \end{aligned}
    $$

    where $W_{xh},\ W_{hh},\ W_{hy}$ are shared at every time step => this makes it easy to run into the vanishing / exploding gradient problem.

    • Because of the hidden vectors that store information from the past, the network learns to selectively retain relevant information, capturing temporal dependencies and structure across multiple time steps when processing sequential data.
  2. LSTM

    • A model proposed to solve the long-term dependency and vanishing/exploding gradient problems, the biggest weaknesses of the RNN (Recurrent Neural Network).

      • Because the same weights are shared across time steps, the gradient becomes uncontrollably small or large once the weights go wrong even once.
      • So we need an independent cell that is responsible for regulating the flow of information.
    • The reason an RNN cannot hold information from the past for long is that the hidden vector quickly dissipates existing information through the non-linearity and the repeated matrix operations.

      • To improve this, the LSTM introduces a cell state that carries information forward without constantly losing it.
    • In addition, rather than simply using the current information as-is, the LSTM regulates the flow of information through gates that decide:

      1. how much information up to the previous point will be forgotten (Forget)
      2. how much current information will be reflected (Input)
      3. how much the final value will be actually used (Output)
    • Specifically, let the input at time $t$ be $x_t$, and let the cell state (long-term) and hidden state (short-term) at time $t$ be $c_t$ and $h_t$.

    • In addition to these three key terms, there are four additional terms inside the gates that serve as the forget, input, output, and core values. Let us call them $f_t,\ i_t,\ o_t,\ g_t$, respectively.

    • Then, the update formulas are as follows:

      $$
      \begin{aligned}
      f_t &= \sigma(W_{xh}^f x_t + W_{hh}^f h_{t-1} + b_h^f) \\
      i_t &= \sigma(W_{xh}^i x_t + W_{hh}^i h_{t-1} + b_h^i) \\
      o_t &= \sigma(W_{xh}^o x_t + W_{hh}^o h_{t-1} + b_h^o) \\
      g_t &= \tanh(W_{xh}^g x_t + W_{hh}^g h_{t-1} + b_h^g) \\
      c_t &= \boldsymbol{f_t \odot c_{t-1} + i_t \odot g_t} \\
      h_t &= o_t \odot \tanh(c_t)
      \end{aligned}
      $$
    • Note that the $c_t$ update does not pass through an activation function, so linearity is preserved along the cell-state path (see the NumPy sketch after this list).

  3. Attention

    • Seq2Seq learning is a framework for mapping one sequence to another sequence.
      • e.g. Machine Translation
    • The usual structure consists of an encoder that receives the input, a decoder that produces the output from it, and a hidden vector connecting the two.
    • Seq2seq models are usually built with LSTMs, but this has two major problems.
    1. The hidden vector connecting the encoder and decoder of the LSTM has a fixed size.
      Because of this, long sentences cannot fit all of their meaning into it, while short sentences are inefficient: capacity and time are wasted even though the meaning is already fully contained.
    2. The computation cannot escape its time dependence; in other words, running an LSTM over a single sentence cannot be parallelized.
    • In the case of the second problem, it is a limitation of the LSTM model and cannot be solved unless the model itself is changed.
    • The first problem can be alleviated somewhat when the sentences are long.
    • When we actually produce the output, we do not rely on the decoder alone; we look at a specific part of the encoder and focus on it for the translation.
    • Based on this idea, while the decoder is producing output, it refers to all the hidden vectors of the encoder and computes where to pay attention in the current situation.
      • We call this mechanism "attention" (see the sketch after this list).
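To make the RNN and LSTM update formulas above concrete, here is a minimal NumPy sketch of a single LSTM cell step (a plain RNN step would keep only a single tanh update for $h_t$). The weight layout and the names `lstm_step`, `W_xh`, `W_hh`, `b_h` are my own illustration, not code from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_xh, W_hh, b_h):
    """One LSTM step following the update formulas above.
    W_xh: (4H, D), W_hh: (4H, H), b_h: (4H,) -- the four gate blocks
    (f, i, o, g) are stacked along the first axis."""
    H = h_prev.shape[0]
    z = W_xh @ x_t + W_hh @ h_prev + b_h   # all four pre-activations at once
    f = sigmoid(z[0:H])                    # forget gate
    i = sigmoid(z[H:2*H])                  # input gate
    o = sigmoid(z[2*H:3*H])                # output gate
    g = np.tanh(z[3*H:4*H])                # candidate ("core") values
    c_t = f * c_prev + i * g               # cell state: no non-linearity on this path
    h_t = o * np.tanh(c_t)                 # hidden state
    return h_t, c_t

# Toy usage: D = input dim, H = hidden dim, random weights just to run the step.
D, H = 8, 16
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):          # a length-5 input sequence
    h, c = lstm_step(x, h, c, W_xh, W_hh, b_h)
```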
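And here is a rough sketch of the attention idea described above: at each decoder step, score every encoder hidden vector against the current decoder state and take a weighted sum as the context. This is generic dot-product attention for illustration, not any specific paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Dot-product attention: weight every encoder hidden vector by its
    relevance to the current decoder state and return the weighted sum."""
    scores = encoder_states @ decoder_state   # (T,) similarity of each source position
    weights = softmax(scores)                 # where to "pay attention"
    context = weights @ encoder_states        # (H,) mixture of encoder states
    return context, weights

# Toy usage: T encoder time steps, hidden size H.
T, H = 6, 16
rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(T, H))      # all hidden vectors of the encoder
decoder_state = rng.normal(size=H)            # current hidden state of the decoder
context, weights = attend(decoder_state, encoder_states)
print(weights.round(2))                       # attention distribution over the source
```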

Problem of Previous work

  1. The hidden vector connecting the encoder and decoder of the LSTM has a fixed size.
    Because of this, long sentences cannot fit all of their meaning into it, while short sentences are inefficient: capacity and time are wasted even though the meaning is already fully contained.
    Even with the attention technique, this cannot be solved for short sentences, as opposed to long ones.
    Also, if a single LSTM fails to map the input into a good hidden vector in the first place, no amount of attention can make use of all the information.

Model & Conclusion

  • Just as the XOR problem of the FFNN was solved with an MLP, this paper tries to solve the problem above with a multi-layered (stacked) LSTM.
  • Compared to the existing single-layer LSTM, the output at each time $t$ becomes the input of the next LSTM layer, and learning proceeds again.
  • A total of 4 layers were stacked for training, and the final result was derived from them.
  • Deep models are often exponentially more efficient at representing some functions than shallow ones.
  • Also, during training, the source sentences in the training set were reversed and fed as input, which produced much better results (see the sketch at the end of this section).
    • This can be explained by a mechanism called "minimal time lag".
    • In general, every hidden vector contains more information about recent inputs than about inputs from the distant past.
    • Although long-term dependencies are said to be resolved through the LSTM's cell state, it cannot be denied that short-term dependencies are still captured more strongly.
    • If the order is not reversed, the first part of the translation usually corresponds to the first part of the source sentence.
    • In other words, capturing this correspondence relies more on long-term dependencies than on short-term ones.
    • However, if you reverse the order of the input, the first part of the translation now corresponds to the last part of the reversed input, i.e., the words fed in most recently.
    • This performs better because many more short-term dependencies are introduced.
    • This idea later develops into the Bi-LSTM, which scans the sentence from the front as well as from the back and grasps both the preceding and following context at the same time.
  • In addition, since sentence length causes problems with time inefficiency and hidden-vector compression, training was conducted with sentences of the same length within each batch, and a different LSTM was trained for each length.
  • The stacked LSTM showed remarkable results, but unfortunately it did not use any attention.
  • I wonder whether they would have gotten much better results if they had applied attention on the last layer.
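As a small illustration of the two ingredients discussed above, a 4-layer stacked LSTM encoder and the reversed source input, here is a PyTorch sketch; the vocabulary size, dimensions, and toy token ids are all made up, and this is not the paper's implementation.

```python
import torch
import torch.nn as nn

# A 4-layer (stacked) LSTM encoder; vocabulary size and dimensions are made up.
vocab_size, emb_dim, hidden_dim, num_layers = 1000, 64, 128, 4
embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers)

# A toy source sentence as token ids; flipping it is the "reversed input" trick.
src = torch.tensor([5, 42, 7, 19, 3])        # (T,)
src_reversed = src.flip(0)                   # feed the words in reverse order

x = embed(src_reversed).unsqueeze(1)         # (T, batch=1, emb_dim)
outputs, (h_n, c_n) = encoder(x)             # h_n, c_n: (num_layers, 1, hidden_dim)

# The top layer's h_n / c_n summarize the (reversed) source sentence and would
# be passed to a decoder LSTM as its initial state.
print(outputs.shape, h_n.shape)
```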