- convert the input words into a matrix
- each word is mapped to an index
- each index has its own embedding vector (size 512 in this paper; see the sketch after this list)
- the more similar the features are, the closer the embedding vectors are
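A minimal sketch of that lookup, assuming a hypothetical toy vocabulary and a randomly initialized table (in a real model the embedding table is learned):

```python
import numpy as np

# hypothetical toy vocabulary; the words and sentence are only illustrative
vocab = {"i": 0, "love": 1, "transformers": 2}
d_model = 512                                   # embedding size used in the paper

# one 512-dimensional vector per index (random here, learned in practice)
embedding_table = np.random.randn(len(vocab), d_model)

sentence = ["i", "love", "transformers"]
indices = [vocab[w] for w in sentence]          # words -> indices
x = embedding_table[indices]                    # indices -> vectors
print(x.shape)                                  # (3, 512): one row per word
```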
Sequential vs. Parallel
- RNNs and LSTMs process words in sequential order
- the Transformer processes them in parallel
- which makes it faster
- but then it cannot know the position of the words
→ so Positional Encoding is used
Positional Encoding
- Embedding_n + PositionalInfo_n (positional information is added to each embedding)
- 2 rules
- the positional value should be the same regardless of the input
- the positional value should not be too large
- sine & cosine are used for the PE (see the sketch after this list)
→ with enough computing power, a simple summation could also be used for the positional values
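A sketch of the sine/cosine encoding from the paper; note that it satisfies both rules above, since it depends only on the position (not the input) and every value stays in [-1, 1]. The sequence length of 3 is just for illustration:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal PE: even dimensions use sine, odd dimensions use cosine."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(seq_len=3, d_model=512)
# x = x + pe    # added element-wise to the word embeddings
```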
Multi-Head Attention
using attention, higher weights are given to the words that should be focused on
Self-Attention
- the value is weighted by how similar the query and the key are
- query → the word doing the attending / key → the word being attended to
Linear Layers
- Q, K, and V are produced from the input through separate linear layers (see the sketch below)
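A sketch of those projections, with random matrices standing in for the learned weights:

```python
import numpy as np

d_model = 512
x = np.random.randn(3, d_model)      # 3 word embeddings (+ positional encoding)

# random placeholders for the learned projection weights
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = x @ W_q      # queries: the words doing the attending
K = x @ W_k      # keys: the words being attended to
V = x @ W_v      # values: the information mixed together by the attention weights
```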
Attention score
- the attention score is computed with a dot product
- Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, scaled by the square root of the key dimension d_k (see the sketch below)
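A minimal sketch of that formula; the shapes (3 words, d_k = 64) are only illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity of each query to each key
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                            # weighted sum of the values

d_k = 64
Q, K, V = (np.random.randn(3, d_k) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)       # (3, 64)
```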
Multi-Head Attention
- runs self-attention several times in parallel
- so it can focus on several different relationships at once (see the sketch below)
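A sketch of the multi-head version: each head runs the same scaled dot-product attention with its own projections, and the heads are concatenated at the end. The weights here are random placeholders for learned parameters; the head count of 8 matches the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads=8, d_model=512):
    """Run scaled dot-product attention once per head, in parallel, then concatenate."""
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # each head has its own (random here, normally learned) projections,
        # so it can focus on a different relationship between the words
        W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)
    W_o = np.random.randn(d_model, d_model)    # final output projection
    return np.concatenate(heads, axis=-1) @ W_o

x = np.random.randn(3, 512)        # 3 word embeddings (+ positional encoding)
out = multi_head_attention(x)
print(out.shape)                   # (3, 512)
```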