![](https://velog.velcdn.com/images/shining_arrow/post/58a258ff-02b0-425e-8073-9e1a13c17733/image.png)
These are my notes from the Full Stack Deep Learning lecture.
📌Transfer Learning in Computer Vision
- Deep neural networks such as ResNet-50 perform very well.
- But they are so large that they overfit on our small dataset.
- Solution: train the NN on ImageNet first, then fine-tune it on our data.
- Result: better performance!
![](https://velog.velcdn.com/images/shining_arrow/post/40743dbc-f5f7-4cd0-b285-b20469e2907b/image.png)
![](https://velog.velcdn.com/images/shining_arrow/post/03bd052f-fad0-46c5-a0d8-98e02c9b196a/image.png)
- Fine-tuning a model: train the model on ImageNet, then replace only the last few layers with new ones.
- Easy to implement in both TensorFlow and PyTorch, as in the sketch below.
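A minimal PyTorch sketch of this recipe (not the lecture's code): load an ImageNet-pretrained ResNet-50 from torchvision, freeze the backbone, and replace the final layer with a new head. The 10-class output size, the batch, and the hyperparameters are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 pretrained on ImageNet.
# (Requires a recent torchvision; older versions use models.resnet50(pretrained=True).)
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze the pretrained backbone so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final classification layer with one sized for our task
# (10 classes is just an assumption for illustration).
num_classes = 10
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Optimize only the parameters of the new head.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch, just to show the shapes.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_classes, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```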
📌Embeddings and Language Models
- Input in NLP: a sequence of words.
- Input to deep learning models: vectors.
- How do we convert words to vectors?
- Idea: one-hot encoding
- Look up each word you need to encode in the dictionary.
- All-zero vector except for a one at the position of that word in the dictionary.
- Problem: scales poorly with vocabulary size.
- Problem: every pair of one-hot vectors is equally distant, which contradicts what we know about word similarity.
- Solution 1: Map the one-hot vectors to dense vectors (see the sketch after this list).
- Solution 2: Learn a language model.
- "Pre-train" for your NLP task by learning a really good word embedding.
- How? Train on a very general task over a large corpus of text (e.g., Wikipedia).
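To make the contrast concrete, here is a small PyTorch sketch (not from the lecture): a one-hot vector is as long as the vocabulary and every pair of words looks equally unrelated, while `nn.Embedding` maps each word index to a short learned dense vector. The toy vocabulary and the embedding dimension are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = ["the", "cat", "dog", "sat"]            # toy vocabulary (assumption)
word_to_idx = {w: i for i, w in enumerate(vocab)}

# One-hot: a |V|-dimensional vector, zero everywhere except at the word's index.
cat_onehot = F.one_hot(torch.tensor(word_to_idx["cat"]), num_classes=len(vocab)).float()
dog_onehot = F.one_hot(torch.tensor(word_to_idx["dog"]), num_classes=len(vocab)).float()
print(torch.dot(cat_onehot, dog_onehot))        # 0.0 -- every pair of words looks equally unrelated

# Dense embedding: a small learned vector per word (dimension 8 is arbitrary here).
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
cat_vec = embedding(torch.tensor(word_to_idx["cat"]))
print(cat_vec.shape)                            # torch.Size([8])
```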
![](https://velog.velcdn.com/images/shining_arrow/post/c74c54f0-5ed6-45d8-aee1-31bb8f4302f3/image.png)
- N-Grams: slide a window of size N through the text, forming a dataset where the task is to predict the last word of each window.
- Skip-Grams: an improvement over N-Grams. Look on both sides of the target word and form multiple training samples from each N-gram.
- How to train faster: replace the multi-class prediction over the whole vocabulary with a binary classification task (negative sampling); a sketch follows this list.
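A rough PyTorch sketch of the skip-gram idea with the binary trick (negative sampling). The toy corpus, window size, embedding dimension, and the single random negative per positive pair are assumptions for illustration, not the original word2vec implementation.

```python
import random
import torch
import torch.nn as nn

corpus = "the quick brown fox jumps over the lazy dog".split()   # toy corpus (assumption)
vocab = sorted(set(corpus))
word_to_idx = {w: i for i, w in enumerate(vocab)}

# Skip-gram pairs: for each center word, the words on both sides within the window are positives.
window = 2
pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((word_to_idx[center], word_to_idx[corpus[j]]))

center_emb = nn.Embedding(len(vocab), 16)    # embeddings for center words
context_emb = nn.Embedding(len(vocab), 16)   # embeddings for context words
optimizer = torch.optim.Adam(list(center_emb.parameters()) + list(context_emb.parameters()), lr=1e-2)
bce = nn.BCEWithLogitsLoss()

for center, context in pairs:
    # Binary task: is (center, context) a real pair (label 1) or a random "negative" pair (label 0)?
    negative = random.randrange(len(vocab))
    contexts = torch.tensor([context, negative])
    labels = torch.tensor([1.0, 0.0])
    logits = (center_emb(torch.tensor([center, center])) * context_emb(contexts)).sum(dim=1)
    loss = bce(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```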
📌NLP's ImageNet moment: ELMO and ULMFit on datasets
- Word2Vec and GloVe embeddings became popular in 2013-14.
- They improved accuracy on many tasks by a large margin.
- But these representations are shallow:
- Only the first layer gets the benefit of having seen all of Wikipedia.
- The rest of the layers are trained only on the task dataset (your data), which is far smaller.
- Why not pre-train more layers too, so the model can disambiguate words, learn grammar, etc.? -> ELMO was the first to do this.
![](https://velog.velcdn.com/images/shining_arrow/post/29c2b7b1-b30c-4758-9684-de4b8e5a06ce/image.png)
![](https://velog.velcdn.com/images/shining_arrow/post/b7d7d784-15fa-4571-b71f-77eb17001df7/image.png)
- SQuAD dataset
![](https://velog.velcdn.com/images/shining_arrow/post/ed792a6e-7478-4f43-80a3-2465afd7bdf7/image.png)
- SNLI dataset
![](https://velog.velcdn.com/images/shining_arrow/post/93a22353-3945-45dd-96d9-1fb233381ea5/image.png)
- GLUE dataset
![](https://velog.velcdn.com/images/shining_arrow/post/18b8dfa0-5a12-4587-9b27-0ed924eb7b88/image.png)
- ULMFit
- Similar to ELMO.
- Took a hacker's approach to deep learning.
- 💡Attention in detail
- Input: sequence of tensors
- Output: sequence of tensors, each one a weighted sum of the input sequence
- Not a learned weight, but a function of x_i and x_j
- How are the attention weights computed? Each input vector is used in three ways:
- 1) Query: compared to every other vector to compute the attention weights for its own output y_i
- 2) Key: compared to every other vector to compute the attention weight w_ij for the output y_j
- 3) Value: summed with the other value vectors to form the result of the attention-weighted sum
- Multiple Heads
- Layer Normalization
- Neural networks work best when inputs to a layer have uniform mean and standard deviation in each dimension.
- A hack that normalizes between layers so activations keep a uniform mean and standard deviation (a self-attention sketch including LayerNorm follows this block).
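A minimal single-head self-attention sketch in PyTorch that follows the query/key/value description above and finishes with a residual connection plus LayerNorm. The dimensions are arbitrary, and the optional causal mask (the masked self-attention used by GPT-style models below) sits behind a flag; this is an illustration, not the lecture's code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Each input vector x_i is projected into a query, a key, and a value.
        self.to_q = nn.Linear(d_model, d_model)
        self.to_k = nn.Linear(d_model, d_model)
        self.to_v = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)  # keeps activations at a uniform mean/std per dimension

    def forward(self, x, causal: bool = False):
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # w_ij: how much position i attends to position j, from comparing q_i with k_j.
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        if causal:
            # GPT-style masked self-attention: position i may not look at positions j > i.
            mask = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        # y_i is the attention-weighted sum of the value vectors.
        y = weights @ v
        return self.norm(x + y)  # residual connection + layer normalization

x = torch.randn(2, 5, 32)            # (batch, sequence length, d_model)
print(SelfAttention(32)(x).shape)    # torch.Size([2, 5, 32])
```

Multi-head attention simply runs several such heads in parallel on smaller per-head dimensions and concatenates their outputs.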
- 💡BERT, GPT-2, DistilBERT, T5
- GPT, GPT-2
- Generative Pre-trained Transformer
- GPT learns to predict the next word in the sequence (generating text), just like ELMO or ULMFit.
- But while ELMO and ULMFit use an embedding layer plus LSTMs, GPT uses an embedding layer plus transformer layers.
- Uses masked (causal) self-attention.
- BERT
- Bidirectional Encoder Representations from Transformers
- Bidirectional, unlike GPT (which is uni-directional).
- Involves pre-training on A LOT of text with 15% of all words masked out.
- Also sometimes predicts whether one sentence follows another.
- T5
- Text-to-Text Transfer Transformer
- Feb 2020
- Evaluated the most recent transfer learning techniques.
- Input and Output are both text strings
- Trained on C4 (Colossal Clean Crawled Corpus), 100x larger than Wikipedia.
- 11B parameters
- SOTA on GLUE, SuperGLUE, SQuAD
- GPT-3
- DistilBERT
- A smaller model is trained to reproduce the output of a larger model (knowledge distillation); a sketch follows the figure below.
- Transformers are growing in size
![](https://velog.velcdn.com/images/shining_arrow/post/10ae5489-d3f3-4440-a06f-31c707e17f41/image.png)
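A generic sketch of the distillation idea behind DistilBERT, where a small student is trained to reproduce the softened output distribution of a large teacher. The stand-in MLP models, the temperature, and the dummy batch are assumptions for illustration, not the actual DistilBERT training setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 10))  # large "teacher" (stand-in)
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))    # smaller "student"
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's distribution (value is an assumption)

x = torch.randn(32, 128)  # dummy batch of inputs

with torch.no_grad():
    teacher_logits = teacher(x)   # teacher is frozen; we only imitate its outputs

student_logits = student(x)
# KL divergence between the softened student and teacher distributions.
loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=-1),
    F.softmax(teacher_logits / temperature, dim=-1),
    reduction="batchmean",
) * temperature**2

optimizer.zero_grad()
loss.backward()
optimizer.step()
```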
📕READINGS
Attention Is All You Need (2017): http://peterbloem.nl/blog/transformers