Notes taken while watching the Full Stack Deep Learning lecture.
Examples of sequence problems
Why not use feedforward networks instead?
Problem 1: Variable Length Inputs
Problem 2: Memory Scaling
Problem 3: Overkill
Goal: use a compute_next_h function that preserves gradients -> this is meant to address the vanishing gradients issue.
Main idea: introduce a new "cell state" channel
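Below is a minimal sketch (plain NumPy; the weight layout and the function name compute_next_h are illustrative, not taken from the lecture code) of an LSTM-style step with the extra cell-state channel. The cell state is updated additively, which is what keeps gradients from vanishing as quickly as in a vanilla RNN.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def compute_next_h(x_t, h_prev, c_prev, W, b):
    """One LSTM-style step. W: (hidden+input, 4*hidden), b: (4*hidden,).
    The cell state c is updated additively (f * c_prev + i * g), which is the
    new "cell state" channel that helps preserve gradients over time."""
    z = np.concatenate([h_prev, x_t]) @ W + b
    H = h_prev.shape[0]
    i = sigmoid(z[:H])           # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:])       # candidate cell update
    c_next = f * c_prev + i * g  # additive cell-state update
    h_next = o * np.tanh(c_next)
    return h_next, c_next
```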
What about other RNNs like GRUs?
Key questions for machine learning applications papers
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation (Wu et al., 2016)
1) What problem are they trying to solve?
2) What model architecture was used?
An encoder-decoder architecture in which both the encoder and the decoder are RNNs
Problem 1: using a single layer will underfit the task
Solution 1: stack LSTM layers
Problem 2: Stacked LSTMs are hard to train. They barely work with more than 6 layers.
Solution 2: add residual connections (as in ResNet), i.e., skip connections between the LSTM layers.
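A rough sketch of this stacking-with-residuals idea (PyTorch; the class and parameter names are hypothetical, and this is not the exact GNMT architecture):

```python
import torch.nn as nn

class ResidualStackedLSTM(nn.Module):
    """Stack of single-layer LSTMs with residual (skip) connections between
    layers, roughly in the spirit of GNMT's deep encoder/decoder."""
    def __init__(self, input_size, hidden_size, num_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.LSTM(input_size if i == 0 else hidden_size,
                     hidden_size, batch_first=True)
             for i in range(num_layers)]
        )

    def forward(self, x):          # x: (batch, time, input_size)
        out = x
        for i, lstm in enumerate(self.layers):
            h, _ = lstm(out)
            # residual connection: add the layer input back
            # (skipped for the first layer, whose input size may differ)
            out = h + out if i > 0 else h
        return out
```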
Problem 3: bottleneck between the encoder and decoder when dealing with large amounts of information
Solution 3: Attention (attention is covered in more detail next week.)
Idea: taking each language's word order into account, focus on the region of the sentence that is relevant to the word currently being translated.
How: compute an attention value for each word in the sentence using a relevance score, i.e., measure how relevant every word of the input sentence is to a particular word of the output sentence.
Attention is used in many areas besides translation (e.g., speech recognition and image models).
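A minimal sketch of the relevance-score idea (NumPy, with simple dot-product scoring; GNMT itself uses a learned scoring function):

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """decoder_state: (d,), encoder_states: (T, d).
    Returns attention weights over the T input words and the context vector."""
    scores = encoder_states @ decoder_state   # relevance score for each input word
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()         # softmax -> attention values
    context = weights @ encoder_states        # weighted sum of encoder states
    return weights, context
```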
Summary of GNMT approach
Sequence data does not require recurrent models!
Hence, a convolutional approach can be used for sequence data modeling
WaveNet: A Generative Model for Raw Audio (van den Oord et al., 2016)
Used in Google Assistant and Google Cloud Text-to-Speech (e.g., for call centers)
Main idea: convolutional sequence models
1) What problem are they trying to solve?
2) What model architecture was used?
3) Insights of the solution
An output is produced by looking at a window of the previous hidden layer, and each value in that hidden layer is in turn produced from a window of the layer before it.
Causal convolution: the entire window is from the past. You don't look at any future timesteps to produce the output.
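A small sketch of a causal 1-D convolution (PyTorch; the class name is hypothetical): padding only on the left guarantees that the output at time t never depends on inputs after t.

```python
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at the past: the input is left-padded
    by (kernel_size - 1) steps, so output t never sees inputs after t."""
    def __init__(self, channels, kernel_size=2):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):            # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))  # pad only the left side of the time axis
        return self.conv(x)
```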
4) Challenge of the solution: getting a large receptive field
5) Solution to the challenge: dilated convolutions
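A sketch of stacking dilated causal convolutions (same hypothetical PyTorch style as above): doubling the dilation at each layer makes the receptive field grow exponentially with depth, so a small number of layers can cover a long window of past samples.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    """Causal convolutions with dilations 1, 2, 4, 8, ...
    With kernel_size=2, the receptive field is 2**num_layers timesteps."""
    def __init__(self, channels, num_layers=8, kernel_size=2):
        super().__init__()
        self.convs = nn.ModuleList()
        self.pads = []
        for i in range(num_layers):
            dilation = 2 ** i
            self.pads.append((kernel_size - 1) * dilation)
            self.convs.append(
                nn.Conv1d(channels, channels, kernel_size, dilation=dilation))

    def forward(self, x):                              # x: (batch, channels, time)
        for pad, conv in zip(self.pads, self.convs):
            x = torch.relu(conv(F.pad(x, (pad, 0))))   # left-pad -> causal
        return x
```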
6) What model architecture was used?
7) What dataset was it trained on?
8) How was it trained?
9) What tricks were needed for inference in deployment?
💡READING