Author: 김건우, Department of Industrial Engineering, UNIST
Contents
Unit 01. Introduction
Unit 02. Transformers and Self-Attention
Unit 03. Image Transformer and Local Self-Attention
Unit 04. Music Transformer and Relative Positional Self-Attention
(Summarization, Q&A, etc.)
The token 'represent' in this sequence is re-expressed as a new vector through the attention computation (explained later)
In the following example, the dimension of K is set to d_k = 4
The output vectors from self-attention have the same dimensionality as the V vectors (equal to d_k = 4 in this example)
Complexity per Layer: the sequence length n is usually smaller than the model dimension d, so self-attention's O(n²·d) per layer compares favorably with a recurrent layer's O(n·d²)
Sequential Operations: O(1) means the whole sequence can be processed in parallel
Maximum Path Length: O(1) means any two positions are directly connected, which resolves long-term dependencies
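For reference, the per-layer comparison table from 'Attention Is All You Need' that these three columns refer to (n = sequence length, d = model dimension, k = convolution kernel size):

| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | O(n²·d) | O(1) | O(1) |
| Recurrent | O(n·d²) | O(n) | O(n) |
| Convolutional | O(k·n·d²) | O(1) | O(log_k n) |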
Self-attention in the encoder receives all the tokens at once and processes them in parallel
Self-attention in the decoder masks future tokens, since seeing them in advance would be cheating
Input tokens to the encoder are first mapped to vectors by a word embedding such as word2vec or GloVe
Unlike an RNN, the Transformer does not receive the input sequence one step at a time, so it has no built-in notion of order; positional encoding is manually added to the input embeddings to supply this position information
Positional encoding is computed with sine and cosine functions of different frequencies
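A minimal sketch of the sinusoidal positional encoding, assuming the standard formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the function name and the use of PyTorch are illustrative choices, not code from the original material.

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal position encodings (d_model assumed even)."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    # 10000^(2i / d_model) for each even index 2i
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions use sin
    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions use cos
    return pe

# The encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```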
Sequence data usually carries several kinds of information. 'Multi-Head Attention' comes from the idea that a single attention layer cannot express all of it, so several attention layers (heads) are used in parallel.
For example, in the sentence 'I kicked the ball', the word 'kicked' relates to 'who', 'did what', and 'to whom'. In the example here, three attention heads represent these different relations of 'kicked'. If only the 'green' head were used, 'kicked' would carry no information about who kicked or what was kicked, so the representation would be limited.
A sequence contains diverse information, so a single attention layer cannot reflect all of it (see the sketch below)
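A minimal multi-head self-attention sketch under that motivation (class and variable names are illustrative): each head projects Q, K, V separately, attends on its own, and the concatenated head outputs are mixed by a final linear layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection each for Q/K/V, logically split into `num_heads` heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # mixes the concatenated heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # scaled dot product per head
        weights = F.softmax(scores, dim=-1)                     # each head attends differently
        out = weights @ v                                       # (batch, num_heads, seq_len, d_head)
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)  # concatenate heads
        return self.w_o(out)
```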
Residual connection
Residual connections allow gradients to flow well during backpropagation
Layer Normalization
Unlike Batch Normalization, which normalizes each feature across the batch, Layer Normalization normalizes across the feature dimension of each individual example (each element of the batch separately)
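A small illustration of the difference (a sketch that omits the learnable scale and bias): for an input of shape (batch, features), batch-norm statistics are computed per feature across the batch, while layer-norm statistics are computed per example across its features.

```python
import torch

x = torch.randn(4, 8)                       # (batch=4, features=8)

# Batch-norm style: statistics per feature, computed across the batch dimension
bn = (x - x.mean(dim=0, keepdim=True)) / (x.std(dim=0, keepdim=True) + 1e-5)

# Layer-norm style: statistics per example, computed across the feature dimension
ln = (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + 1e-5)

print(bn.mean(dim=0))   # ~0 for every feature column
print(ln.mean(dim=1))   # ~0 for every example row
```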
Input size: (sequence length, d_model = 512)
After the first linear transformation (weight mapping d_model → d_ff): (sequence length, d_ff)
After the second linear transformation (weight mapping d_ff → d_model): (sequence length, d_model)
Output size: the same as the input size
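A sketch of the position-wise feed-forward network with these shapes, assuming the base-model values d_model = 512 and d_ff = 2048 from the original paper (the class name is illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class PositionwiseFFN(nn.Module):
    """Two linear layers with a ReLU in between, applied to every position independently."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # maps d_model -> d_ff
        self.linear2 = nn.Linear(d_ff, d_model)   # maps d_ff -> d_model

    def forward(self, x):
        # x: (batch, seq_len, d_model); the output keeps exactly the same shape
        return self.linear2(F.relu(self.linear1(x)))
```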
Attention in Decoder
1) Encoder-Decoder Attention
- K and V come from the encoder's output, while Q comes from the decoder
2) Decoder Self-Attention
- Works the same as the encoder's self-attention module, but masks tokens at later positions
ex) if the decoder's current position is 't', the masked tokens run from 't+1' to the end
- During training, it receives all target tokens at once so that teacher forcing can be used (see the masking sketch below)
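A minimal sketch of the look-ahead mask used in decoder self-attention (illustrative names, single head, no batch dimension): positions after the current step t are set to -inf before the softmax, so they receive zero attention weight even though all tokens are fed in at once.

```python
import torch
import torch.nn.functional as F

def masked_self_attention_weights(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """q, k: (seq_len, d_k). Returns causal attention weights of shape (seq_len, seq_len)."""
    seq_len, d_k = q.shape
    scores = q @ k.t() / d_k ** 0.5
    # Upper-triangular entries (j > i) correspond to future tokens t+1 .. end
    causal_mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(causal_mask, float('-inf'))
    return F.softmax(scores, dim=-1)   # row i only attends to positions <= i

# During training all target tokens are fed at once (teacher forcing);
# the mask alone prevents position t from seeing t+1 onwards.
```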
Unconditional Image Generation
Even for a (32x32) image, the Image Transformer's flattened pixel sequence has length 32x32xRGB(3) = 3072, so global self-attention requires O(3072x3072xd) computation
In practice, this means global self-attention cannot be used for tasks such as super-resolution
Steps for Local 2D Attention
1) Set a query block, which also follows raster-scan order
2) The last generated pixel in the query block is the query
3) Set a memory block that surrounds the query block to the left and above
4) The pixels in the memory block are the keys and values
5) Compute self-attention with these Q, K, V
6) The results then pass through the Transformer decoder's attention and FFNN to generate the output pixels (a simplified sketch follows)
The decoder generates the pixel q' using q as the query pixel
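A much-simplified sketch of the idea, not the paper's exact Local 2D scheme: it only shows, for a sequence in raster-scan order, how each query is restricted to a bounded memory of already-generated positions. All names and the memory size are illustrative, and this version still builds the full score matrix, whereas the real Image Transformer computes attention only inside each block to actually avoid the O(3072x3072) cost.

```python
import torch
import torch.nn.functional as F

def local_self_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                         memory_size: int = 256) -> torch.Tensor:
    """q, k, v: (seq_len, d) in raster-scan order. Query i attends only to
    positions max(0, i - memory_size + 1) .. i (its local memory block)."""
    seq_len, d = q.shape
    pos = torch.arange(seq_len)
    # allowed[i, j] = True if j is inside i's memory block and not in the future
    allowed = (pos[None, :] <= pos[:, None]) & (pos[None, :] > pos[:, None] - memory_size)
    scores = q @ k.t() / d ** 0.5
    scores = scores.masked_fill(~allowed, float('-inf'))
    return F.softmax(scores, dim=-1) @ v   # each query mixes at most `memory_size` values
```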
Experiment Results
Unconditional Music Generation
The self-attention described above gives a weighted average over the sequence, but it does not capture the distance between the Q and K tokens. Music generation must reflect periodicity and repetition, so the distance between Q and K tokens is essential.
To account for the distance between Q and K tokens, the Music Transformer introduces a new attention mechanism
The distance between the query and the key is reflected in the attention weight
**Relative Attention = Multi-head Attention + Convolution**
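A naive sketch of relative self-attention (single head, no batch, O(L²·d) memory rather than the Music Transformer's memory-efficient 'skewing' trick; names are illustrative): a learned embedding for each clipped relative distance j − i is added to the content-based logits before the softmax.

```python
import torch
import torch.nn.functional as F

def relative_self_attention(q, k, v, rel_emb, max_dist: int):
    """q, k, v: (seq_len, d).  rel_emb: (2 * max_dist + 1, d), one learned embedding
    per clipped relative distance in [-max_dist, max_dist]."""
    seq_len, d = q.shape
    pos = torch.arange(seq_len)
    # relative distance j - i, clipped and shifted into the index range [0, 2 * max_dist]
    rel_idx = (pos[None, :] - pos[:, None]).clamp(-max_dist, max_dist) + max_dist
    content_logits = q @ k.t()                                    # usual QK^T term
    rel_logits = torch.einsum('id,ijd->ij', q, rel_emb[rel_idx])  # Q · R_{i,j} term
    weights = F.softmax((content_logits + rel_logits) / d ** 0.5, dim=-1)
    return weights @ v

# Usage sketch:
# rel_emb = torch.randn(2 * 64 + 1, d, requires_grad=True)
# out = relative_self_attention(q, k, v, rel_emb, max_dist=64)
```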
Experiment Results
Reference
https://blog.promedius.ai/transformer/
https://m.blog.naver.com/sogangori/221035995877
https://wikidocs.net/31379
https://velog.io/@tobigs-text1415/Lecture-14-Transformer-and-Self-Attention
https://velog.io/@tobigs-text1314/CS224n-Lecture-14-Transformer-and-Self-Attention
Lecture slides and video for '14. Transformers and Self-Attention For Generative Models', CS224n Winter 2019 seminar series, DSBA Lab, Department of Industrial and Management Engineering, Korea University (presented by 노영빈)
CS224n: Natural Language Processing with Deep Learning, Stanford University
16th cohort: 주지훈
Transformers and Self-Attention
Advantages
Trivial to parallelize, since the entire input sequence can be fed in at once
Solves the long-term dependency problem, since dependencies can be captured regardless of how far apart the words are
Self-Attention procedures
1) Dot product between Q and K
2) Scaling by √d_k
3) Applying the softmax function
4) Weighted sum with V (see the sketch below)
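The four steps as a minimal sketch (the function name is mine, not from the lecture):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k), V: (seq_len, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.t()                   # 1) dot product between Q and K
    scores = scores / d_k ** 0.5         # 2) scale by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)  # 3) softmax over the keys
    return weights @ V                   # 4) weighted sum of V
```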
Encoder Structure
1) Input - word2vec (word embedding)
2) Positional Encoding
3) Self-Attention (explained above)
4) Multi-Head Attention: 8 attention heads learned in parallel with different initial values
5) Residual connection + Layer Normalization
6) Position-wise Feed-Forward Networks (a combined sketch follows)
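Putting steps 3) through 6) together, a minimal encoder-layer sketch in the paper's 'Add & Norm' arrangement; it reuses the MultiHeadAttention and PositionwiseFFN sketches shown earlier, so it is illustrative rather than a complete implementation (dropout and masking are omitted).

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)   # sketch defined earlier
        self.ffn = PositionwiseFFN(d_model, d_ff)            # sketch defined earlier
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x))   # residual connection + layer normalization
        x = self.norm2(x + self.ffn(x))    # residual connection + layer normalization
        return x
```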
Attention in Decoder
1) Encoder-Decoder Attention
2) Decoder Self-Attention
16th cohort: 이승주
RNN
Sequential computation makes parallelization impossible, and the long-term dependency problem remains.
CNN
Parallelizable and strong at local dependencies, but many layers are needed to express long-term dependencies.
Self-Attention
Parallelizable, and every pair of tokens is connected by the shortest possible path, which also solves the long-term dependency problem.
1) Obtain the Q, K, V vectors
2) Perform scaled dot-product attention
3) Concatenate the heads
4) Pass through the FC layer
Transformer
It is built from encoder self-attention, decoder self-attention, and encoder-decoder attention. The Transformer has been applied not only to language but also to many other domains such as images and music.