Transformer - Machine Learning for Visual Understanding 4

13. Transformers Ⅰ

13.1. Word Embedding

A word can be represented as a vector.

  • Word2vec

    • Using a large corpus,
      • Predict the current word from neighboring words (Continuous Bag of Words; CBOW), or
      • Predict the surrounding words given the current one (Skip-gram)
    • Word meaning is determined by its (frequently co-occurring) neighboring words.
    • Word vectors are fitted to maximize the likelihood (sketched in code after this list):
      $L(\theta) = \prod_{t=1}^{T} \prod_{-m \leq j \leq m, j \neq 0} p(w_{t+j} \mid w_t; \theta)$
  • GloVe

    • Global Vectors: global corpus statistics are directly captured.
    • Ratios of co-occurrence probabilities can encode meaning components (loss sketched in code after this list):
      $w_i^\intercal w_j = \log p(i \mid j) \quad J = \sum_{i, j=1}^{V} f(X_{ij}) \left( w_i^\intercal \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$
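
A minimal NumPy sketch of the skip-gram likelihood above. Everything here (vocabulary size, embedding dimension, the toy corpus) is an illustrative assumption, not code from the lecture:

```python
import numpy as np

# Toy skip-gram setup: separate center-word and context-word embedding matrices.
V, d = 10, 8                              # assumed vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W_center = rng.normal(size=(V, d))        # v_w: used when the word is the center word
W_context = rng.normal(size=(V, d))       # u_w: used when the word appears as context

def p_context_given_center(center_id):
    """Softmax over the vocabulary: p(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = W_context @ W_center[center_id]
    scores -= scores.max()                # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def neg_log_likelihood(corpus, window=2):
    """-log L(theta): sum over positions t and window offsets j != 0 of -log p(w_{t+j} | w_t)."""
    nll = 0.0
    for t, center in enumerate(corpus):
        for j in range(-window, window + 1):
            if j == 0 or not (0 <= t + j < len(corpus)):
                continue
            nll -= np.log(p_context_given_center(center)[corpus[t + j]])
    return nll

corpus = [0, 3, 7, 3, 1, 9, 2]            # token ids of a toy corpus
print(neg_log_likelihood(corpus))         # training adjusts W_center/W_context to lower this
```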
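
And a small sketch of the GloVe objective $J$ above, using the weighting function $f(x) = \min((x/x_{\max})^{\alpha}, 1)$ from the GloVe paper; the toy co-occurrence matrix and shapes are assumptions for illustration:

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100.0, alpha=0.75):
    """J = sum_{i,j} f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over nonzero co-occurrences."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):
        f = min((X[i, j] / x_max) ** alpha, 1.0)      # weighting function f(X_ij)
        err = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        J += f * err ** 2
    return J

rng = np.random.default_rng(0)
V, d = 10, 8                                          # toy vocabulary size and dimension
X = rng.integers(0, 5, size=(V, V)).astype(float)     # toy co-occurrence counts
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = np.zeros(V), np.zeros(V)
print(glove_loss(W, W_tilde, b, b_tilde, X))
```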

13.2. Transformers

Attention Is All You Need

In MLPs, CNNs, and RNNs, the output $\hat{y}$ is a weighted sum (plus fixed unary operations) of the input $x$. That is, $\mathrm{W}$ is optimized to best map the input to the output in the training set, in terms of the loss function.

  • Attention function
    $\mathrm{Attention}(Q, K, V)$ = Attention value

  • Self-attention
    Each element learns to refine its own representation by attending to its context (the other elements in the input). More specifically, its new representation is a weighted sum of the elements in the sequence, with weights computed from the content rather than fixed arbitrarily.

    • With the Transformer, we form Query, Key, and Value (see the sketch after this list).
      • From the input tokens $\{x_1, x_2, ..., x_N\}$
      • Each token $x_i$ is mapped to its own Query $Q_i$, Key $K_i$, and Value $V_i$ vectors by a linear transformation.
      • The linear weights $(W_Q, W_K, W_V)$ are the learned parameters, shared by all inputs.
      • $W_Q$ ($W_K$, $W_V$) learns how to represent a vector to serve as a Query (Key, Value) in general.
      • We need another learnable parameter $W_O$, which maps the attention value back to the original space.
      • Each token $x_i$ serves as the Query when we compute the new representation for position $i$.
      • References are all tokens $\{x_1, x_2, ..., x_N\}$ in the input sequence, including $x_i$ itself.
    • The attention step above is repeated multiple times to further contextualize the embeddings.
  • Training the Transformer model

    A dummy token (called the classification token, [CLS]) is appended to the input sequence, and its output embedding is used as the aggregated representation of the sequence.

  • Transformer

    • Multi-head Self-attention
      Having multiple projections to $Q, K, V$ is beneficial, because it allows the model to jointly attend to information from different representation subspaces at different positions.
      • Multiple self-attention heads output multiple attention values $(Z_0, Z_1, ..., Z_{k-1})$. These are simply concatenated, then linearly transformed with $W_O$ back to the original input size.
    • Feed-forward layer
      Each contextualized embedding goes through an additional FC layer. It is applied separately and identically, so there is no cross-token dependency.
    • Residual connection
    • Layer normalization
    • Positional Encoding (see the sketch after this list)
      $PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{\mathrm{model}}}), \quad PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{\mathrm{model}}})$
    • Masked Multi-head Self-attention
      The predictions for position $i$ can depend only on the known outputs at positions less than $i$ (implemented via a causal mask; see the sketch after this list).
    • The decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
    • Decoding steps are repeated until the next word is predicted as [EOS] (End of Sentence).
    • The output sentence may be decoded greedily (always taking the top prediction), or by keeping the top-$k$ partial hypotheses at each step (beam search).
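
A minimal PyTorch sketch of (masked) multi-head self-attention as described in the list above: per-token $Q, K, V$ projections with shared weights, scaled dot-product attention, head concatenation, and $W_O$. Shapes, head count, and names are illustrative assumptions, not the lecture's implementation:

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention: Q, K, V projections, scaled dot-product, W_O."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.W_Q = nn.Linear(d_model, d_model)   # shared by all tokens
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)   # maps concatenated heads back to d_model

    def forward(self, x, causal=False):
        B, N, _ = x.shape
        # Project and split into heads: (B, h, N, d_k)
        q = self.W_Q(x).view(B, N, self.h, self.d_k).transpose(1, 2)
        k = self.W_K(x).view(B, N, self.h, self.d_k).transpose(1, 2)
        v = self.W_V(x).view(B, N, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)          # (B, h, N, N)
        if causal:  # position i cannot attend to positions after i (future tokens)
            mask = torch.triu(torch.ones(N, N, dtype=torch.bool, device=x.device), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        attn = scores.softmax(dim=-1)
        z = attn @ v                                                     # (B, h, N, d_k)
        z = z.transpose(1, 2).contiguous().view(B, N, self.h * self.d_k) # concatenate heads
        return self.W_O(z)

x = torch.randn(2, 10, 512)                      # (batch, tokens, d_model)
out = MultiHeadSelfAttention()(x, causal=True)
print(out.shape)                                 # torch.Size([2, 10, 512])
```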
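
And a small sketch of the sinusoidal positional encoding formula above; the sequence length and model dimension are illustrative:

```python
import torch

def sinusoidal_positional_encoding(max_len=50, d_model=512):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),  PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even dimension indices 2i
    angle = pos / torch.pow(10000.0, two_i / d_model)               # (max_len, d_model / 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe                                                       # added to the token embeddings

print(sinusoidal_positional_encoding().shape)                       # torch.Size([50, 512])
```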

13.3. BERT (Bidirectional Encoder Representations from Transformers)

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

These days, a pre-trained BERT is a default choice for word embeddings (a minimal usage sketch follows).
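
A minimal usage sketch, assuming the Hugging Face transformers library and PyTorch are installed; the model name bert-base-uncased is a common default, not one prescribed by the lecture:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Attention is all you need.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state   # (1, seq_len, 768): one contextual vector per token
cls_embedding = token_embeddings[:, 0]         # embedding of the [CLS] token, used as the aggregate
print(token_embeddings.shape, cls_embedding.shape)
```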

  • BERT

    • Large-scale pre-training of word embeddings using Transformer encoder
    • Self-supervised: no human annotation required
    • Use the encoder only (bi-directional; no causal masking)
  • Training task 1: Masked Language Modeling (MLM)

    • Mask 15% of the tokens at random (substituting them with a special [MASK] token); see the sketch after this list.
    • Classify the output embeddings at these positions over the vocabulary (i.e., predict the original tokens).
  • Training task 2: Next Sentence Prediction (NSP)

    • A binary classification problem, predicting if the two sentences in the input are consecutive or not.
      • Half of training data: two consecutive sentences
      • The other half: two sentences randomly chosen from the corpus
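
As referenced in the MLM item above, a minimal sketch of how training inputs could be corrupted by random masking; the names and toy sentence are illustrative, and the full BERT recipe has additional details beyond simple [MASK] substitution:

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Randomly replace ~15% of tokens with [MASK]; return corrupted input and target positions."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            corrupted[i] = MASK_TOKEN
            targets[i] = tok             # the model must predict the original token at position i
    return corrupted, targets

tokens = "the cat sat on the mat because it was tired".split()
corrupted, targets = mask_tokens(tokens)
print(corrupted)   # which positions are masked varies from run to run
print(targets)     # e.g. {2: 'sat'} -- classified over the vocabulary during training
```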
