[Paper Review] Transformer: Attention is all you need

gredora · March 2, 2023


Input embedding

  • turn the input into a matrix
    • each word is mapped to an index
    • each index has its own embedding vector (dimension 512 in this paper)
    • the more similar two words are, the closer their embedding vectors are (see the sketch below)
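A minimal sketch of this lookup, assuming a toy vocabulary and a randomly initialized table (in the real model the table is learned):

```python
import numpy as np

# hypothetical toy vocabulary: word -> index
vocab = {"i": 0, "love": 1, "transformers": 2}
d_model = 512                                   # embedding size used in the paper

# embedding table: one d_model-dimensional vector per index
# (random here; learned during training in the real model)
embedding_table = np.random.randn(len(vocab), d_model)

sentence = ["i", "love", "transformers"]
indices = [vocab[w] for w in sentence]          # each word -> its index
input_matrix = embedding_table[indices]         # each index -> its vector

print(input_matrix.shape)                       # (3, 512): one row per word
```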

Sequential, Parallel

  • RNNs and LSTMs process tokens in sequential order
  • the Transformer processes them in parallel
    • which makes it faster
    • but, by itself, it cannot know the position of each word

→ so, Positional Encoding is used

Positional Encoding

  • $Embedding_n + PositionalInfo_n$
  • 2 rules
    1. the positional values should be the same regardless of the input
    2. the positional values should not be too large
  • sine & cosine are used for PE
    • PE is a vector built from sinusoids with different cycles (see the sketch below)

      $PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$
      $PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$
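A small NumPy sketch of these formulas; the function name and `max_len` are only illustrative:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model/2): 0, 2, 4, ...
    angles = pos / np.power(10000, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions
    pe[:, 1::2] = np.cos(angles)               # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512): same positional values for any input sentence
```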

Operation between input embedding and positional encoding

  • summation
    • embedding and positional info get mixed into a single vector, so the model has to spend capacity untangling them (cost problem)
  • concatenation
    • keeps the two kinds of information well balanced (cleanly separated), but the input dimension grows

→ with enough computing power, you can just use summation, which is what the paper does
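A quick sketch of the two options, reusing the `sinusoidal_positional_encoding` sketch above; note how concatenation doubles the dimension while summation keeps it at d_model:

```python
import numpy as np

seq_len, d_model = 3, 512
token_emb = np.random.randn(seq_len, d_model)                 # input embeddings
pos_enc = sinusoidal_positional_encoding(seq_len, d_model)    # from the sketch above

summed = token_emb + pos_enc                                  # shape stays (3, 512)
concatenated = np.concatenate([token_emb, pos_enc], axis=-1)  # (3, 1024): larger model

print(summed.shape, concatenated.shape)
```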

Multi head Attention

using attention, the model gives high weight to the words it should focus on

Self attention

  • the value is weighted by how similar the query and the key are
  • query → source / key → target

Linear Layer

  • the query, key, and value are obtained by passing the input through separate learned linear layers

Attention score

  • the attention scores are computed with a dot product between queries and keys
  • $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$ (the dot products are scaled down by $\sqrt{d_k}$; see the sketch below)
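A minimal NumPy sketch of this scaled dot-product attention (names are illustrative; no masking or batching):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q @ K^T / sqrt(d_k)) @ V"""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V                                    # weighted sum of the values

seq_len, d_k = 4, 64
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 64)
```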

Multi head Attention

  • runs several self-attention heads in parallel
  • so the model can focus on several different relationships at once (see the sketch below)
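A rough sketch of running several heads in parallel, reusing `scaled_dot_product_attention` from above; the projection matrices are random here, whereas in the real model they are learned:

```python
import numpy as np

def multi_head_attention(x, num_heads=8, d_model=512):
    """Split d_model into num_heads heads, attend in each, then recombine."""
    d_head = d_model // num_heads
    # per-head projection matrices for Q, K, V (learned in the real model)
    W_q = np.random.randn(num_heads, d_model, d_head)
    W_k = np.random.randn(num_heads, d_model, d_head)
    W_v = np.random.randn(num_heads, d_model, d_head)
    W_o = np.random.randn(d_model, d_model)               # output projection

    heads = []
    for h in range(num_heads):
        Q, K, V = x @ W_q[h], x @ W_k[h], x @ W_v[h]
        # each head can focus on a different relationship between the tokens
        heads.append(scaled_dot_product_attention(Q, K, V))
    return np.concatenate(heads, axis=-1) @ W_o           # (seq_len, d_model)

x = np.random.randn(4, 512)                               # 4 tokens, d_model = 512
print(multi_head_attention(x).shape)                      # (4, 512)
```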