- convert the input words into a matrix
- each word is mapped to an index
- each index has its own embedding vector (size 512 in this paper; see the sketch after this list)
- the more similar the features are, the closer the embedding vectors are
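A minimal sketch of that lookup, assuming a hypothetical toy vocabulary and a randomly initialized table (in a real model the embedding table is learned):

```python
import numpy as np

# hypothetical toy vocabulary; the words and sentence are only illustrative
vocab = {"i": 0, "love": 1, "transformers": 2}
d_model = 512                                   # embedding size used in the paper

# one 512-dimensional vector per index (random here, learned in practice)
embedding_table = np.random.randn(len(vocab), d_model)

sentence = ["i", "love", "transformers"]
indices = [vocab[w] for w in sentence]          # words -> indices
x = embedding_table[indices]                    # indices -> vectors
print(x.shape)                                  # (3, 512): one row per word
```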
Sequential vs. Parallel
- RNNs and LSTMs process words in sequential order
- the Transformer processes them in parallel
- which makes it faster
- but then it cannot know the position of the words
→ so Positional Encoding is used
Positional Encoding
- Embedding_n + PositionalInfo_n (positional information is added to each embedding)
- 2 rules
- the positional value should be the same regardless of the input
- the positional value should not be too large
- sine & cosine are used for the PE (see the sketch after this list)
→ with enough computing power, a simple summation could also be used for the positional values
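A sketch of the sine/cosine encoding from the paper; note that it satisfies both rules above, since it depends only on the position (not the input) and every value stays in [-1, 1]. The sequence length of 3 is just for illustration:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal PE: even dimensions use sine, odd dimensions use cosine."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(seq_len=3, d_model=512)
# x = x + pe    # added element-wise to the word embeddings
```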
Multi-Head Attention
using attention, higher weights are given to the words that should be focused on
Self-Attention
- the value is weighted by how similar the query and the key are
- query → the word doing the attending / key → the word being attended to
Linear Layers
- Q, K, and V are produced from the input through separate linear layers (see the sketch below)
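A sketch of those projections, with random matrices standing in for the learned weights:

```python
import numpy as np

d_model = 512
x = np.random.randn(3, d_model)      # 3 word embeddings (+ positional encoding)

# random placeholders for the learned projection weights
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q = x @ W_q      # queries: the words doing the attending
K = x @ W_k      # keys: the words being attended to
V = x @ W_v      # values: the information mixed together by the attention weights
```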
Attention score
- the attention score is computed with a dot product
- Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, scaled by the square root of the key dimension d_k (see the sketch below)
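A minimal sketch of that formula; the shapes (3 words, d_k = 64) are only illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity of each query to each key
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                            # weighted sum of the values

d_k = 64
Q, K, V = (np.random.randn(3, d_k) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)       # (3, 64)
```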
Multi-Head Attention
- runs self-attention several times in parallel
- so it can focus on several different relationships at once (see the sketch below)
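A sketch of the multi-head version: each head runs the same scaled dot-product attention with its own projections, and the heads are concatenated at the end. The weights here are random placeholders for learned parameters; the head count of 8 matches the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads=8, d_model=512):
    """Run scaled dot-product attention once per head, in parallel, then concatenate."""
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # each head has its own (random here, normally learned) projections,
        # so it can focus on a different relationship between the words
        W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        heads.append(softmax(Q @ K.T / np.sqrt(d_head)) @ V)
    W_o = np.random.randn(d_model, d_model)    # final output projection
    return np.concatenate(heads, axis=-1) @ W_o

x = np.random.randn(3, 512)        # 3 word embeddings (+ positional encoding)
out = multi_head_attention(x)
print(out.shape)                   # (3, 512)
```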