1. [NLP] Recurrent Neural Network and Language Modeling
- Recurrent Neural Network
Basic structure
Inputs and outputs of RNNs(rolled version)
We usually want to predict a vector at some time steps
How to calculate the hidden state of RNNs
We can process a sequence of vectors by applying a recurrence formula at every time step
The same function and the same set of parameters are used at every time step
The state consists of a single "hidden" vector h
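A minimal sketch of this recurrence, assuming the usual tanh update and NumPy arrays. The function name `rnn_forward` and all sizes are illustrative and do not come from the lecture.

```python
import numpy as np

def rnn_forward(xs, h0, W_hh, W_xh, b):
    """Apply h_t = tanh(W_hh h_{t-1} + W_xh x_t + b) over a sequence of vectors."""
    h = h0
    hs = []
    for x in xs:
        # The same function and the same parameters are reused at every time step.
        h = np.tanh(W_hh @ h + W_xh @ x + b)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
hidden, inp, T = 3, 4, 5  # illustrative sizes
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
W_xh = rng.normal(scale=0.5, size=(hidden, inp))
b = np.zeros(hidden)
xs = rng.normal(size=(T, inp))

hs = rnn_forward(xs, np.zeros(hidden), W_hh, W_xh, b)
print(hs.shape)  # (5, 3): one hidden vector per time step
```

The only state carried from one step to the next is the single vector `h`.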
- Types of RNNs
One-to-one
Standard Neural Networks
One-to-many
Image Captioning
Many-to-one
Sentiment Classification
Many-to-many(Seq2Seq)
Machine translation
Many-to-many
Video classification on frame level
NER, POS
Character-level Language Model
Example of training on the sequence "hello"
Vocabulary : [h, e, l, o]
Example training sequence : "hello"
$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$
$\text{Logit} = W_{hy} h_t + b$
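The training setup for "hello" can be sketched as follows, assuming one-hot inputs, a softmax over the logits, and a cross-entropy loss against the next character. The two bias terms are named `b_h` and `b_y` here only to keep them apart. The weights are random and untrained, and the sizes are illustrative.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]
char_to_ix = {c: i for i, c in enumerate(vocab)}

text = "hello"
inputs = [char_to_ix[c] for c in text[:-1]]   # h, e, l, l
targets = [char_to_ix[c] for c in text[1:]]   # e, l, l, o (the next character)

def one_hot(ix, size):
    v = np.zeros(size)
    v[ix] = 1.0
    return v

rng = np.random.default_rng(0)
V, H = len(vocab), 3  # vocabulary size, hidden size (assumed)
W_xh = rng.normal(scale=0.1, size=(H, V))
W_hh = rng.normal(scale=0.1, size=(H, H))
W_hy = rng.normal(scale=0.1, size=(V, H))
b_h, b_y = np.zeros(H), np.zeros(V)

h = np.zeros(H)
loss = 0.0
for ix, target in zip(inputs, targets):
    x = one_hot(ix, V)
    h = np.tanh(W_hh @ h + W_xh @ x + b_h)     # hidden state update
    logits = W_hy @ h + b_y                    # one score per character
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    loss += -np.log(probs[target])             # cross-entropy for this step

print(inputs, targets)  # [0, 1, 2, 2] [1, 2, 2, 3]
```

With untrained weights the loss stays close to `4 * log(4)`, which is the value for uniform predictions over four characters.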
At test time, sample characters one at a time, feed back to model
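A sketch of test-time sampling, under the same assumptions (one-hot inputs, softmax output). The helper `sample` is hypothetical, and the weights are random, so the generated text is not meaningful.

```python
import numpy as np

vocab = ["h", "e", "l", "o"]

def sample(seed_ix, n_chars, W_xh, W_hh, W_hy, b_h, b_y, rng):
    """Generate n_chars characters after the seed, feeding each sample back in."""
    V = W_xh.shape[1]
    h = np.zeros(W_hh.shape[0])
    ix = seed_ix
    out = [ix]
    for _ in range(n_chars):
        x = np.zeros(V)
        x[ix] = 1.0
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)
        logits = W_hy @ h + b_y
        p = np.exp(logits - logits.max())
        p /= p.sum()
        ix = int(rng.choice(V, p=p))  # sampled character becomes the next input
        out.append(ix)
    return "".join(vocab[i] for i in out)

rng = np.random.default_rng(0)
V, H = 4, 3  # assumed sizes
params = (
    rng.normal(scale=0.1, size=(H, V)),  # W_xh
    rng.normal(scale=0.1, size=(H, H)),  # W_hh
    rng.normal(scale=0.1, size=(V, H)),  # W_hy
    np.zeros(H),                         # b_h
    np.zeros(V),                         # b_y
)
generated = sample(0, 4, *params, rng)
print(generated)  # starts with "h"; the rest is random for untrained weights
```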
Backpropagation through time(BPTT)
Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient
Truncated BPTT: run forward and backward through chunks of the sequence instead of the whole sequence. Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps
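The chunked procedure can be sketched with a scalar RNN and hand-written gradients. Everything here is an assumption made for illustration: the model `h_t = tanh(w * h_{t-1} + u * x_t)`, the squared-error loss, the function names, and the data.

```python
import math

def chunk_forward_backward(h0, xs, ys, w, u):
    """Forward and backward through ONE chunk; h0 is treated as a constant."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    loss = sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

    dw = du = dh = 0.0
    for t in reversed(range(len(xs))):
        dh += hs[t + 1] - ys[t]            # loss gradient at this step
        da = dh * (1.0 - hs[t + 1] ** 2)   # back through tanh
        dw += da * hs[t]
        du += da * xs[t]
        dh = da * w                        # pass the gradient to the previous step
    return loss, dw, du, hs[-1]

def train_truncated(xs, ys, w, u, chunk=4, lr=0.1):
    h = 0.0       # hidden state carried forward across chunks
    total = 0.0
    for s in range(0, len(xs), chunk):
        loss, dw, du, h = chunk_forward_backward(
            h, xs[s:s + chunk], ys[s:s + chunk], w, u
        )
        # Gradients never reach past the chunk boundary, but h still does.
        w -= lr * dw
        u -= lr * du
        total += loss
    return w, u, total

xs = [0.5, -0.3, 0.8, 0.1, -0.6, 0.4, 0.2, -0.1]
ys = [0.2, 0.0, 0.3, 0.1, -0.2, 0.1, 0.1, 0.0]
w, u, total = train_truncated(xs, ys, w=0.5, u=0.5)
```

The hidden state crosses chunk boundaries in the forward direction only, so memory and compute per update are bounded by the chunk length.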
Vanishing/Exploding Gradient Problem in RNN
RNN is excellent, but...
Multiplying by the same matrix ($W_{hh}$) at each time step during backpropagation causes the gradient to vanish or explode
Toy Example
The reason why the vanishing gradient problem is important (reference link)
In the figure, the images represent $W_{hh}$, the numbers are time steps, and fading to gray means the value is approaching 0
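A scalar stand-in makes the effect easy to see. The sketch below assumes a single weight `w` in place of $W_{hh}$ and ignores the tanh derivative, so it only illustrates the repeated multiplication itself.

```python
def backprop_factor(w, steps):
    """Product of `steps` identical factors w, a scalar stand-in for W_hh."""
    g = 1.0
    for _ in range(steps):
        g *= w
    return g

for w in (0.5, 1.0, 1.5):
    print(w, backprop_factor(w, 20))
```

After 20 steps the factor is on the order of 1e-6 for `w = 0.5` and on the order of 1e3 for `w = 1.5`. Only `w = 1.0` keeps the gradient unchanged.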
2. [NLP] LSTM and GRU
- Long Short-Term Memory(LSTM)
What is LSTM(Long Short-Term Memory)?
Long short-term memory
i : Input gate, Whether to write to cell
f : Forget gate, Whether to erase cell
o : Output gate, How much to reveal cell
g : Gate gate, How much to write to cell
A gate exists to control how much information can flow from the cell state
1) Forget gate
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
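2) Generate the candidate information to be added and scale it by the Input gate
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
3) Generate new cell state by adding current information to previous cell state
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$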
4) Generate hidden state by passing cell state to tanh and Output gate
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t \odot \tanh(C_t)$
5) Pass this hidden state to the next time step, and to the output or the next layer if needed
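Steps 1) to 5) can be sketched as a single function, assuming NumPy vectors and the concatenated input `[h_{t-1}, x_t]`. The name `lstm_step`, the sizes, and the random weights are illustrative, and the candidate is written `C_tilde` to keep it apart from the new cell state.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)         # 1) forget gate
    i_t = sigmoid(W_i @ z + b_i)         # 2) input gate
    C_tilde = np.tanh(W_C @ z + b_C)     # 2) candidate information
    C_t = f_t * C_prev + i_t * C_tilde   # 3) new cell state (elementwise)
    o_t = sigmoid(W_o @ z + b_o)         # 4) output gate
    h_t = o_t * np.tanh(C_t)             # 4) new hidden state
    return h_t, C_t                      # 5) both are passed to the next step

rng = np.random.default_rng(0)
H, D = 3, 2  # hidden size, input size (assumed)
Ws = [rng.normal(scale=0.5, size=(H, H + D)) for _ in range(4)]
bs = [np.zeros(H) for _ in range(4)]

h, C = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):
    h, C = lstm_step(x, h, C, *Ws, *bs)
print(h.shape, C.shape)  # (3,) (3,)
```

When the forget gate saturates near 1 and the input gate near 0, the cell state is copied through almost unchanged.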
Gated Recurrent Unit(GRU)
What is GRU?
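$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$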
cf. $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ in LSTM
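The GRU update can be sketched in the same way, following the bias-free equations as written in these notes. The name `gru_step`, the sizes, and the random weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    z_in = np.concatenate([h_prev, x_t])                         # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ z_in)                                    # update gate
    r_t = sigmoid(W_r @ z_in)                                    # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    # Interpolate between the old state and the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

rng = np.random.default_rng(0)
H, D = 3, 2  # hidden size, input size (assumed)
W_z, W_r, W = (rng.normal(scale=0.5, size=(H, H + D)) for _ in range(3))

h = np.zeros(H)
for x in rng.normal(size=(5, D)):
    h = gru_step(x, h, W_z, W_r, W)
print(h.shape)  # (3,)
```

A single gate `z_t` plays the role of both the forget and input gates of the LSTM: `1 - z_t` keeps the old state and `z_t` writes the new candidate.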
Backpropagation in LSTM, GRU
Uninterrupted gradient flow!
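Instead of repeatedly multiplying by $W_{hh}$, the gates build up the needed information through addition, which largely mitigates the gradient vanishing/exploding problem
During backpropagation, the addition acts as a gradient copier, so information can be carried across more time steps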
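A short derivation sketch of why the additive update helps. It treats the gates as constants with respect to $C_{t-1}$, which is a simplification, since the gates also depend on $h_{t-1}$.

```latex
% Vanilla RNN: every backward step multiplies by W_hh (and the tanh derivative)
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}\!\left(1 - h_t^2\right) W_{hh}

% LSTM cell state, gates held fixed: every backward step only scales elementwise by f_t
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t
\quad\Longrightarrow\quad
\frac{\partial C_t}{\partial C_{t-1}} \approx \operatorname{diag}(f_t)

% Across many time steps
\frac{\partial C_T}{\partial C_k} \approx \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
```

As long as the forget gates stay close to 1, the product stays close to the identity, so the gradient reaches early time steps without passing through $W_{hh}$ at every step.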
Summary on RNN/LSTM/GRU
RNNs allow a lot of flexibility in architecture design
Vanilla RNNs are simple but don't work very well
Backward flow of gradient in RNN can explode or vanish
Common to use LSTM or GRU : their additive interactions improve gradient flow
Peer session summary
Related to the lecture content
Other than BPTT, is there a way to mitigate the gradient vanishing/exploding problem while keeping the RNN/LSTM/GRU structure?
truncated-BPTT
Weight initialization: Xavier, Kaiming
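A minimal NumPy sketch of the two initialization schemes, assuming the common uniform variant of Xavier and the normal variant of Kaiming. The function names are illustrative and are not a library API.

```python
import numpy as np

def xavier_uniform(fan_out, fan_in, rng):
    """Xavier (Glorot) uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def kaiming_normal(fan_out, fan_in, rng):
    """Kaiming (He) normal: N(0, std^2) with std = sqrt(2 / fan_in)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_x = xavier_uniform(256, 256, rng)
W_k = kaiming_normal(256, 256, rng)
print(W_x.shape, W_k.shape)  # (256, 256) (256, 256)
```

Both schemes scale the initial weights by the layer width so that activations and gradients start at a reasonable magnitude.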
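Is there a way to ease the difficulty of carrying information from early time steps in an RNN/LSTM/GRU-based language model?
Question 1) When the text input length changes, does the number of RNN cells grow, so that the model structure changes?
Even if the input length changes, the same RNN cell is simply applied repeatedly, so the model structure does not seem to change!
Question 2) What does each RNN output mean?
hidden_state holds the hidden states for all time steps, and h_n holds the hidden state of the last time step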
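The relationship between the two outputs can be sketched with a batched NumPy RNN. The names `hidden_state` and `h_n` follow the notes. The time-first layout, the single unidirectional layer, and the helper `rnn_batched` are assumptions made for illustration.

```python
import numpy as np

def rnn_batched(x, W_hh, W_xh, b):
    """x has shape (seq_len, batch, input_size), i.e. time-first layout."""
    T, B, _ = x.shape
    h = np.zeros((B, W_hh.shape[0]))
    steps = []
    for t in range(T):
        h = np.tanh(h @ W_hh.T + x[t] @ W_xh.T + b)
        steps.append(h)
    hidden_state = np.stack(steps)  # (seq_len, batch, hidden): every time step
    h_n = h[None]                   # (1, batch, hidden): last time step only
    return hidden_state, h_n

rng = np.random.default_rng(0)
T, B, D, H = 5, 2, 4, 3  # illustrative sizes
hidden_state, h_n = rnn_batched(
    rng.normal(size=(T, B, D)),
    rng.normal(scale=0.5, size=(H, H)),
    rng.normal(scale=0.5, size=(H, D)),
    np.zeros(H),
)
print(hidden_state.shape, h_n.shape)  # (5, 2, 3) (1, 2, 3)
```

Under these assumptions `h_n[0]` is the same array as `hidden_state[-1]`.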
Question 3) Why is batch_emb transposed?
So that the computation can run over the time-step dimension
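A small sketch of that transpose. It assumes `batch_emb` has shape (batch, seq_len, emb_dim), which is a guess at the assignment's layout, and that the recurrent layer expects the time dimension first.

```python
import numpy as np

batch_emb = np.zeros((2, 5, 4))            # assumed layout: (batch, seq_len, emb_dim)
time_first = batch_emb.transpose(1, 0, 2)  # (seq_len, batch, emb_dim)

# Iterating over the first axis now yields one time step for the whole batch.
step_shapes = [x_t.shape for x_t in time_first]
print(time_first.shape)  # (5, 2, 4)
```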
Question 4) In the forward function of required assignment 2, do LSTM and GRU need to be handled separately?
Weight initialization is done in the train function, so that probably isn't necessary.