4. Transformers

These are my notes from the Full Stack Deep Learning lectures.

📌Transfer Learning in Computer Vision

  • Deep neural networks (NNs) such as ResNet-50 do perform well.

  • But they are so large that they overfit on our small datasets.

  • Solution: train the NN on ImageNet first, then fine-tune it.

  • Result: better performance!

  • Fine-tuning: train the model on ImageNet, then replace only the last few layers with new ones for the target task.

  • Easy to implement in TensorFlow and PyTorch (see the sketch below).
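
A minimal fine-tuning sketch in PyTorch, assuming torchvision's ImageNet-pretrained ResNet-50 and a hypothetical 10-class target task:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-50 pre-trained on ImageNet.
model = models.resnet50(pretrained=True)

# Freeze the pre-trained backbone so its weights are not updated.
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a new head for a hypothetical 10-class task.
model.fc = nn.Linear(model.fc.in_features, 10)

# Train only the new head on the target dataset.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```

Training only the new head is the cheapest variant; unfreezing some of the later layers with a small learning rate is a common next step.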

📌Embeddings and Language Models

  • Input in NLP: a sequence of words.
  • Input to deep learning models: vectors.
  • How do we convert words to vectors?
  • Idea: one-hot encoding
    • look up whatever words you need to encode in the dictionary.
    • All-zero vector except one at the position where that word is in the dictionary.
    • Problem: scales poorly with vocab size.
    • Logically violates what we know about word similarity: all one-hot vectors are equally distant from each other.
  • Solution 1) Map the one-hot vectors to dense, lower-dimensional vectors
  • Solution 2) Learn a language model
    • "pre-train" for your NLP task by learning a really good word embedding
    • How? -> train for a very general task on a large corpus of text (e.g., Wikipedia).
    • N-Grams: slide an N-sized window through the text, forming a dataset where the task is to predict the last word of each window.
    • Skip-Grams: an improvement over N-Grams; look at the words on both sides of the target word and form multiple training samples from each N-gram.
    • How to train faster: frame it as binary classification (real word pair vs. randomly sampled pair) instead of multi-class prediction over the whole vocabulary (see the sketch after this list).
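
A rough sketch of the skip-gram idea with binary (negative-sampling-style) training in PyTorch; the corpus, window size, and embedding dimension below are made-up toy values:

```python
import random
import torch
import torch.nn as nn

# Toy corpus and vocabulary (made-up example).
tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}
ids = [vocab[w] for w in tokens]

# Skip-gram pairs: every word within the window of a center word is a positive example.
window = 2
positives = [(ids[i], ids[j])
             for i in range(len(ids))
             for j in range(max(0, i - window), min(len(ids), i + window + 1))
             if i != j]

# Negative sampling: pair each center word with a random word as a fake example.
negatives = [(c, random.randrange(len(vocab))) for c, _ in positives]

# Dense embeddings replace the huge one-hot vectors.
emb_dim = 16
center_emb = nn.Embedding(len(vocab), emb_dim)
context_emb = nn.Embedding(len(vocab), emb_dim)

def pair_logits(pairs):
    c = center_emb(torch.tensor([p[0] for p in pairs]))
    o = context_emb(torch.tensor([p[1] for p in pairs]))
    return (c * o).sum(dim=1)  # dot product -> logit for "is this a real pair?"

# Binary labels (real vs. sampled pair) instead of a |vocab|-way softmax.
logits = torch.cat([pair_logits(positives), pair_logits(negatives)])
labels = torch.cat([torch.ones(len(positives)), torch.zeros(len(negatives))])
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
```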

📌NLP's ImageNet moment: ELMo and ULMFit

  • Word2Vec and GloVe embeddings became popular in 2013-14

  • Significantly improved accuracy on many tasks.

  • But these representations are shallow:

    • only the first layer benefits from seeing all of Wikipedia
    • the rest of the layers are trained only on the task dataset (your data), which is far smaller
  • Why not pre-train more layers to disambiguate words, learn grammar, etc.? -> First done in ELMo
    • SQuAD dataset
    • SNLI dataset
    • GLUE dataset

  • ULMFit

    • similar to ELMo
    • took a hacker's approach to deep learning

📌Transformers

  • 💡Attention in detail
    • Input: sequence of tensors
    • Output: sequence of tensors, each one a weighted sum of the input sequence
    • The attention weight w_ij is not a learned parameter, but a function of x_i and x_j
    • How do we make the weights learnable? Each input vector x_i is used in three ways, each with its own learned projection:
      • 1) Query: compared to every other vector to compute the attention weights for its own output y_i
      • 2) Key: compared to every other vector to compute the attention weight w_ij for output y_j
      • 3) Value: summed with the other vectors to form the result of the attention-weighted sum
    • Multiple Heads: several attention operations run in parallel, each with its own learned projections, and their outputs are combined
    • Layer Normalization
      • Neural networks work best when inputs to a layer have uniform mean and standard deviation in each dimension.
      • A hack: apply normalization between layers so that the inputs keep a uniform mean and standard deviation (see the sketch below)
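
A minimal sketch of single-head self-attention in PyTorch, with learned query/key/value projections, a residual connection, and layer normalization; the batch size, sequence length, and embedding dimension below are hypothetical:

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: each output is a weighted sum of value vectors."""

    def __init__(self, dim):
        super().__init__()
        # Learned projections that give each input its query, key, and value roles.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.norm = nn.LayerNorm(dim)  # keeps activations at uniform mean/std

    def forward(self, x):                 # x: (batch, seq_len, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # scores[i, j]: how much position i attends to position j.
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        weights = scores.softmax(dim=-1)
        y = weights @ v                    # each y_i is an attention-weighted sum of values
        return self.norm(x + y)            # residual connection + layer normalization

# Hypothetical usage: a batch of 2 sequences, 5 tokens each, 16-dim embeddings.
x = torch.randn(2, 5, 16)
print(SelfAttention(16)(x).shape)          # torch.Size([2, 5, 16])
```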

  • 💡BERT, GPT-2, DistilBERT, T5
    • GPT, GPT-2
      • Generative Pre-trained Transformer
      • GPT learns to predict the next word in the sequence (i.e., to generate text), just like ELMo or ULMFit
        • But ELMo and ULMFit use an embedding layer plus LSTMs, while GPT uses an embedding layer plus transformer layers.
      • Uses masked self-attention
    • BERT
      • Bidirectional Encoder Representation from Transformers
      • Bidirectional, unlike GPT, which is uni-directional
      • involves pre-training on A LOT of text w/ 15% of all words masked out
        • also sometimes predicts whether one sentence follows another
    • T5
      • Text-to-Text Transfer Transformer
      • Feb 2020
      • Evaluated most recent transfer learning techniques
      • Input and Output are both text strings
      • Trained on C4 (Colossal Clean Crawled Corpus) - 100x larger than Wikipedia
      • 11B parameters
      • SOTA on GLUE, SuperGLUE, SQuAD
    • GPT-3
    • DistilBERT
      • a smaller model trained to reproduce the output of a larger model (knowledge distillation); see the fill-mask sketch after this list
    • Transformers are growing in size
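
As a small usage sketch, the masked-word objective can be probed directly with a pre-trained model, assuming the Hugging Face transformers library is installed (the model name below is just one published checkpoint):

```python
from transformers import pipeline

# DistilBERT was pre-trained with BERT's masked-word objective,
# so it can fill in a [MASK] token out of the box.
fill_mask = pipeline("fill-mask", model="distilbert-base-uncased")

for prediction in fill_mask("Transformers are growing in [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```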

📕READINGS
  • Attention Is All You Need (2017): https://arxiv.org/abs/1706.03762
  • Peter Bloem, Transformers from scratch: http://peterbloem.nl/blog/transformers
