13.1. Word Embedding
A word can be represented as a vector.
- **Word2vec**
- Using a large corpus,
- Predict the current word from neighboring words (Continuous Bag of Words; CBOW), or
- Predict the surrounding words given the current one (Skip-gram)
- Word meaning is determined by its (frequently co-occurring) neighboring words.
- Word vectors are fitted to maximize the likelihood:
$$L(\theta) = \prod_{t=1}^{N} \prod_{-m \le j \le m,\ j \ne 0} p(w_{t+j} \mid w_t; \theta)$$
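A minimal sketch of fitting both variants with the gensim library (the toy corpus and hyperparameters are illustrative, not the lecture's setup):

```python
# Minimal sketch: training CBOW and Skip-gram embeddings with gensim.
# The toy corpus and hyperparameters below are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "lies", "on", "the", "rug"],
]

# sg=0 -> CBOW (predict the current word from its neighbors)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-gram (predict the neighbors from the current word)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["cat"].shape)          # (50,)
print(skipgram.wv.most_similar("cat"))   # nearest neighbors by cosine similarity
```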
- **GloVe**
- Global Vectors: global corpus statistics are captured directly.
- Ratios of co-occurrence probabilities can encode meaning components:
$$w_i^\top \tilde{w}_j = \log p(i \mid j)$$
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
![](https://velog.velcdn.com/images/zzwon1212/post/1f517cf4-a8ef-44e1-9245-27bc8c7cf21f/image.png)
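A minimal NumPy sketch of the objective $J$ above (the toy co-occurrence matrix is made up; `x_max` and `alpha` follow the GloVe paper's defaults):

```python
# Minimal NumPy sketch of the GloVe objective J above.
# X is a toy co-occurrence matrix; x_max and alpha follow the paper's defaults.
import numpy as np

V, d = 5, 10                      # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(V, V)).astype(float)   # toy co-occurrence counts

W  = 0.1 * rng.standard_normal((V, d))   # word vectors w_i
Wt = 0.1 * rng.standard_normal((V, d))   # context vectors w~_j
b  = np.zeros(V)                         # biases b_i
bt = np.zeros(V)                         # biases b~_j

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function that down-weights rare and very frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, Wt, b, bt, X):
    J = 0.0
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:              # only observed co-occurrences contribute
                diff = W[i] @ Wt[j] + b[i] + bt[j] - np.log(X[i, j])
                J += f(X[i, j]) * diff ** 2
    return J

print(glove_loss(W, Wt, b, bt, X))
```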
Attention Is All You Need
In MLPs, CNNs, and RNNs, the output $\hat{y}$ is a weighted sum (plus fixed unary operations) of the input $x$. That is, the weights $W$ are optimized to best map the input to the output on the training set, in terms of the loss function.
- **Attention function**
Attention$(Q, K, V)$ = attention value. In the paper this is scaled dot-product attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right) V$.
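A minimal sketch of this scaled dot-product attention in PyTorch (shapes are illustrative):

```python
# Minimal sketch of scaled dot-product attention:
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import math
import torch

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # similarity of each query to each key
    weights = torch.softmax(scores, dim=-1)             # attention weights sum to 1 over the keys
    return weights @ V                                   # attention value: weighted sum of the values

Q = torch.randn(4, 8)   # 4 queries, dimension 8 (illustrative shapes)
K = torch.randn(6, 8)   # 6 keys
V = torch.randn(6, 8)   # 6 values
print(attention(Q, K, V).shape)   # torch.Size([4, 8])
```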
- **Self-attention**
Each element refines its own representation by attending to its context (the other elements in the input). More specifically, it becomes a weighted sum of the elements in the sequence, where the weights are computed by attention rather than being arbitrary learned weights.
![](https://velog.velcdn.com/images/zzwon1212/post/fd518a96-bdce-4e2b-92c4-1257b9950270/image.png)
- With the Transformer, we build Query, Key, and Value vectors.
- From the input tokens $\{x_1, x_2, \ldots, x_N\}$:
- Each token $x_i$ is mapped to its own Query $Q_i$, Key $K_i$, and Value $V_i$ vectors by a linear transformation.
- The linear weights $(W^Q, W^K, W^V)$ are the learned parameters, shared by all inputs.
- $W^Q$ ($W^K$, $W^V$) learns how to represent a vector so that it can serve as a Query (Key, Value) in general.
- We need another learnable parameter $W^O$, which maps the attention value back to the original space.
- Each token $x_i$ becomes the Query when we compute the representation of token $i$.
- The references are all tokens $\{x_1, x_2, \ldots, x_N\}$ in the input sequence, including $x_i$ itself.
- The step in the figure above is repeated multiple times to further contextualize the representations (a minimal sketch follows below).
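A minimal single-head self-attention sketch in PyTorch following the bullets above ($W^Q$, $W^K$, $W^V$ project every token, $W^O$ maps the attention value back; dimensions are illustrative):

```python
# Minimal single-head self-attention sketch following the bullets above:
# W_Q, W_K, W_V project every token x_i; W_O maps the attention value back.
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=False)  # shared by all tokens
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)  # back to the original space

    def forward(self, x):                      # x: (N tokens, d_model)
        Q, K, V = self.W_Q(x), self.W_K(x), self.W_V(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(x.size(-1))
        Z = torch.softmax(scores, dim=-1) @ V  # each token attends to all tokens, itself included
        return self.W_O(Z)

x = torch.randn(5, 16)                 # 5 tokens, d_model = 16 (illustrative)
print(SelfAttention(16)(x).shape)      # torch.Size([5, 16])
```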
- **Training the Transformer model**
![](https://velog.velcdn.com/images/zzwon1212/post/5deaf295-798b-4711-be49-0df3bff487ac/image.png)
A dummy token (called the classification token) is appended to the input sequence, and its contextualized output is used as the aggregated embedding.
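A minimal sketch of adding such a learnable classification token (PyTorch; dimensions are illustrative):

```python
# Minimal sketch: add a learnable classification token whose contextualized
# output serves as the aggregated sequence embedding.
import torch
import torch.nn as nn

d_model = 16
cls_token = nn.Parameter(torch.zeros(1, d_model))     # learnable dummy token

tokens = torch.randn(5, d_model)                      # 5 input token embeddings (illustrative)
sequence = torch.cat([cls_token, tokens], dim=0)      # (6, d_model); prepended, as in BERT's [CLS]

# After the Transformer layers, the row at the classification-token position
# is fed to the classifier head, e.g. logits = head(encoder(sequence)[0]).
print(sequence.shape)   # torch.Size([6, 16])
```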
- **Transformer**
![](https://velog.velcdn.com/images/zzwon1212/post/0e41ec22-1948-43a4-aa6f-7f33c785eb3e/image.png)
- Multi-head Self-attention
Having multiple projections to $Q$, $K$, $V$ is beneficial because it allows the model to jointly attend to information from different representation subspaces at different positions.
- Multiple self-attention heads output multiple attention values $(Z_0, Z_1, \ldots, Z_{k-1})$, so we simply concatenate them and then linearly transform the result back to the original input size with $W^O$ (see the encoder-layer sketch after this list).
- Feed-forward layer
Each contextualized embedding goes through an additional FC layer. It is applied separately and identically, so there is no cross-token dependency.
- Residual connection
- Layer normalization
- Positional Encoding
$$PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$$
(A sketch of this encoding follows after this list.)
- Masked Multi-head Self-attention
The predictions for position $i$ can depend only on the known outputs at positions less than $i$; a causal mask enforces this (sketched after this list).
- The decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
- Decoding steps are repeated until the next word is predicted as [EOS] (End of Sentence).
- The output sentence may be chosen greedily (always taking the top prediction at each step), or by keeping the top-$k$ candidate sequences at each step (beam search).
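A minimal encoder-layer sketch in PyTorch, assuming illustrative dimensions: `nn.MultiheadAttention` splits the input into heads, computes their attention values, concatenates them, and applies the output projection internally; the position-wise feed-forward layer, residual connections, and layer normalization follow the bullets above.

```python
# Minimal encoder-layer sketch: multi-head self-attention (heads are computed,
# concatenated, and projected back inside nn.MultiheadAttention), a position-wise
# feed-forward layer, residual connections, and layer normalization.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, num_heads=8, d_ff=256):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(                 # applied to each token separately and identically
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, tokens, d_model)
        attn_out, _ = self.mha(x, x, x)           # Query = Key = Value = x (self-attention)
        x = self.norm1(x + attn_out)              # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))           # residual connection + layer norm
        return x

x = torch.randn(2, 5, 64)                         # batch of 2 sequences, 5 tokens each (illustrative)
print(EncoderLayer()(x).shape)                    # torch.Size([2, 5, 64])
```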
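And a minimal sketch of the sinusoidal positional encoding and of the causal mask used by masked self-attention; such a boolean mask can be passed to `nn.MultiheadAttention` through its `attn_mask` argument (`True` marks positions a query may not attend to).

```python
# Minimal sketch of the sinusoidal positional encoding and of the causal mask
# used by masked self-attention (position i may only attend to positions <= i).
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()      # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()               # even embedding dimensions
    angle = pos / (10000 ** (i / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)      # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angle)      # PE(pos, 2i+1)   (assumes even d_model)
    return pe

def causal_mask(n):
    # True above the diagonal = "do not attend to future positions"
    return torch.triu(torch.ones(n, n), diagonal=1).bool()

print(positional_encoding(max_len=10, d_model=16).shape)   # torch.Size([10, 16])
print(causal_mask(4))
```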
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
![](https://velog.velcdn.com/images/zzwon1212/post/6efb92b5-7aa8-4986-b722-fcf4f89a48fc/image.png)
These days, pre-trained BERT is a default choice for word embeddings.
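A minimal sketch of extracting such contextual embeddings from pre-trained BERT with the Hugging Face `transformers` library:

```python
# Minimal sketch: contextual word embeddings from pre-trained BERT
# via the Hugging Face transformers library.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Attention is all you need.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state        # (1, num_tokens, 768): one vector per token
cls_embedding = embeddings[:, 0]              # the [CLS] token as an aggregated embedding
print(embeddings.shape, cls_embedding.shape)
```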
📙 Lecture