[Paper Review] Seq2seq

Sung Jae Hyuk · October 30, 2023

Summary of Seq2seq

Keyword: Stacked LSTM

Prerequisites / Previous work

  1. RNN

    • Whereas learning in a conventional Feed Forward Neural Network (FFNN) depends only on the current input, an RNN adds a hidden state that reflects past states, and learning proceeds with both.
    • If the input at time $t$ is $x_t$, a hidden state $h_t$ in the middle stores past information, so the output $y_t$ is produced from both $x_t$ and $h_t$.
    • In addition, conventional FFNNs always require a fixed-size input, whereas an RNN handles variable-length input relatively freely because the input is fed in over time.
    • Each vector is computed as follows:
    $$
    \begin{aligned}
    h_t &= \tanh (W_{xh} x_t + W_{hh} h_{t-1} + b_h) \\
    y_t &= W_{hy} h_t + b_y
    \end{aligned}
    $$

    where $W_{xh},\ W_{hh},\ W_{hy}$ are shared at every time step => this makes it easy to run into the vanishing / exploding gradient problem.

    • Because of the hidden vectors that store information from the past, the network learns to selectively retain relevant information, capturing temporal dependencies and structure across multiple time steps when processing sequential data.
  2. LSTM

    • A model proposed to solve the long-term dependency and vanishing/exploding gradient problems, the biggest weaknesses of the RNN (Recurrent Neural Network).

      • Because the same weights are shared across time steps, the gradient becomes uncontrollably small or large once the weights go wrong even once.
      • So we need an independent cell that is responsible for regulating the flow of information.
    • The reason an RNN cannot hold information from the past for long is that the hidden vector quickly dissipates existing information through the non-linearity and the repeated matrix operations.

      • To improve this, the LSTM introduces a cell state that carries information forward without constantly losing it.
    • In addition, rather than simply using the current information as-is, the LSTM regulates the flow of information through gates that decide:

      1. how much information up to the previous point will be forgotten (Forget)
      2. how much current information will be reflected (Input)
      3. how much the final value will be actually used (Output)
    • Specifically, let the input at time $t$ be $x_t$, and let the cell state (long-term) and hidden state (short-term) at time $t$ be $c_t$ and $h_t$.

    • In addition to these three key terms, there are four additional terms inside the gates that serve as the forget, input, output, and core values. Let us call them $f_t,\ i_t,\ o_t,\ g_t$, respectively.

    • Then, the update formulas are as follows:

      $$
      \begin{aligned}
      f_t &= \sigma(W_{xh}^f x_t + W_{hh}^f h_{t-1} + b_h^f) \\
      i_t &= \sigma(W_{xh}^i x_t + W_{hh}^i h_{t-1} + b_h^i) \\
      o_t &= \sigma(W_{xh}^o x_t + W_{hh}^o h_{t-1} + b_h^o) \\
      g_t &= \tanh(W_{xh}^g x_t + W_{hh}^g h_{t-1} + b_h^g) \\
      c_t &= \boldsymbol{f_t \odot c_{t-1} + i_t \odot g_t} \\
      h_t &= o_t \odot \tanh(c_t)
      \end{aligned}
      $$
    • Note that the $c_t$ update does not pass through an activation function, so linearity is preserved along the cell-state path (see the NumPy sketch after this list).

  3. Attention

    • Seq2Seq learning is a framework for mapping one sequence to another sequence.
      • e.g. Machine Translation
    • The usual structure consists of an encoder that receives the input, a decoder that produces the output from it, and a hidden vector connecting the two.
    • Seq2seq models are usually built with LSTMs, but this has two major problems.
    1. The hidden vector connecting the encoder and decoder of the LSTM has a fixed size.
      Because of this, long sentences cannot fit all of their meaning into it, while short sentences are inefficient: capacity and time are wasted even though the meaning is already fully contained.
    2. The computation cannot escape its time dependence; in other words, running an LSTM over a single sentence cannot be parallelized.
    • In the case of the second problem, it is a limitation of the LSTM model and cannot be solved unless the model itself is changed.
    • The first problem can be alleviated somewhat when the sentences are long.
    • When we actually produce the output, we do not rely on the decoder alone; we look at a specific part of the encoder and focus on it for the translation.
    • Based on this idea, while the decoder is producing output, it refers to all the hidden vectors of the encoder and computes where to pay attention in the current situation.
      • We call this mechanism "attention" (see the sketch after this list).
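To make the RNN and LSTM update formulas above concrete, here is a minimal NumPy sketch of a single LSTM cell step (a plain RNN step would keep only a single tanh update for $h_t$). The weight layout and the names `lstm_step`, `W_xh`, `W_hh`, `b_h` are my own illustration, not code from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_xh, W_hh, b_h):
    """One LSTM step following the update formulas above.
    W_xh: (4H, D), W_hh: (4H, H), b_h: (4H,) -- the four gate blocks
    (f, i, o, g) are stacked along the first axis."""
    H = h_prev.shape[0]
    z = W_xh @ x_t + W_hh @ h_prev + b_h   # all four pre-activations at once
    f = sigmoid(z[0:H])                    # forget gate
    i = sigmoid(z[H:2*H])                  # input gate
    o = sigmoid(z[2*H:3*H])                # output gate
    g = np.tanh(z[3*H:4*H])                # candidate ("core") values
    c_t = f * c_prev + i * g               # cell state: no non-linearity on this path
    h_t = o * np.tanh(c_t)                 # hidden state
    return h_t, c_t

# Toy usage: D = input dim, H = hidden dim, random weights just to run the step.
D, H = 8, 16
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):          # a length-5 input sequence
    h, c = lstm_step(x, h, c, W_xh, W_hh, b_h)
```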
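And here is a rough sketch of the attention idea described above: at each decoder step, score every encoder hidden vector against the current decoder state and take a weighted sum as the context. This is generic dot-product attention for illustration, not any specific paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Dot-product attention: weight every encoder hidden vector by its
    relevance to the current decoder state and return the weighted sum."""
    scores = encoder_states @ decoder_state   # (T,) similarity of each source position
    weights = softmax(scores)                 # where to "pay attention"
    context = weights @ encoder_states        # (H,) mixture of encoder states
    return context, weights

# Toy usage: T encoder time steps, hidden size H.
T, H = 6, 16
rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(T, H))      # all hidden vectors of the encoder
decoder_state = rng.normal(size=H)            # current hidden state of the decoder
context, weights = attend(decoder_state, encoder_states)
print(weights.round(2))                       # attention distribution over the source
```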

Problem of Previous work

  1. The hidden vector connecting the encoder and decoder of the LSTM has a fixed size.
    Because of this, long sentences cannot fit all of their meaning into it, while short sentences are inefficient: capacity and time are wasted even though the meaning is already fully contained.
    Even with the attention technique, this cannot be solved for short sentences, as opposed to long ones.
    Also, if a single LSTM fails to map the input into a good hidden vector in the first place, no amount of attention can make use of all the information.

Model & Conclusion

  • Just as the XOR problem of the FFNN was solved with an MLP, this paper tries to solve the problem above with a multi-layered (stacked) LSTM.
  • Compared to the existing single-layer LSTM, the output at each time $t$ becomes the input of the next LSTM layer, and learning proceeds again.
  • A total of 4 layers were stacked for training, and the final result was derived from them.
  • Deep models are often exponentially more efficient at representing some functions than shallow ones.
  • Also, during training, the source sentences in the training set were reversed and fed as input, which produced much better results (see the sketch at the end of this section).
    • This can be explained by a mechanism called "minimal time lag".
    • In general, every hidden vector contains more information about recent inputs than about inputs from the distant past.
    • Although long-term dependencies are said to be resolved through the LSTM's cell state, it cannot be denied that short-term dependencies are still captured more strongly.
    • If the order is not reversed, the first part of the translation usually corresponds to the first part of the source sentence.
    • In other words, capturing this correspondence relies more on long-term dependencies than on short-term ones.
    • However, if you reverse the order of the input, the first part of the translation now corresponds to the last part of the reversed input, i.e., the words fed in most recently.
    • This performs better because many more short-term dependencies are introduced.
    • This idea later develops into the Bi-LSTM, which scans the sentence from the front as well as from the back and grasps both the preceding and following context at the same time.
  • In addition, since sentence length causes problems with time inefficiency and hidden-vector compression, training was conducted with sentences of the same length within each batch, and a different LSTM was trained for each length.
  • The stacked LSTM showed remarkable results, but unfortunately it did not use any attention.
  • I wonder whether they would have gotten much better results if they had applied attention on the last layer.
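As a small illustration of the two ingredients discussed above, a 4-layer stacked LSTM encoder and the reversed source input, here is a PyTorch sketch; the vocabulary size, dimensions, and toy token ids are all made up, and this is not the paper's implementation.

```python
import torch
import torch.nn as nn

# A 4-layer (stacked) LSTM encoder; vocabulary size and dimensions are made up.
vocab_size, emb_dim, hidden_dim, num_layers = 1000, 64, 128, 4
embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=num_layers)

# A toy source sentence as token ids; flipping it is the "reversed input" trick.
src = torch.tensor([5, 42, 7, 19, 3])        # (T,)
src_reversed = src.flip(0)                   # feed the words in reverse order

x = embed(src_reversed).unsqueeze(1)         # (T, batch=1, emb_dim)
outputs, (h_n, c_n) = encoder(x)             # h_n, c_n: (num_layers, 1, hidden_dim)

# The top layer's h_n / c_n summarize the (reversed) source sentence and would
# be passed to a decoder LSTM as its initial state.
print(outputs.shape, h_n.shape)
```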