- RNN
- LSTM

: slow to train

Can we parallelize sequential data?

The input sequence can be passed **in parallel**

There is no concept of a time step

Pass all the words at once and compute their word embeddings at the same time

(An RNN processes the input words one after another.)

In the embedding space, words with similar meanings are located close to each other

Pretrained embedding spaces already exist.

But the same word can have a different meaning in a different sentence!

Positional encoding (PE): a vector that gives context **information based on the position of a word in the sentence**

Sine and cosine functions can be used to generate the PE, but any reasonable function works
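A minimal sketch of the sin/cos option, assuming the common formulation where even dimensions use sine, odd dimensions use cosine, and the frequency base is 10000. The name `positional_encoding` and its parameters are illustrative, not from the notes.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model list of sinusoidal position vectors."""
    pe = []
    for pos in range(max_len):
        row = []
        for i in range(d_model):
            # Each (sin, cos) pair shares one frequency: pair index = i // 2
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

pe = positional_encoding(max_len=4, d_model=6)
print(pe[0])  # position 0: sin(0) = 0.0 and cos(0) = 1.0, alternating
```

The PE vector is added to the word embedding, so the same word gets a different input vector at each position.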

Attention: which part of the input should we focus on?

How relevant is the word 'The' to the other words (big, red, dog) in the same sentence?

Attention vectors (of English) contain **contextual relationships between the words** in the sentence.

A simple feed-forward network is applied to every one of the attention vectors!

On its own, each word's attention vector focuses too much on the word itself.

We want to know the interactions and relationships between words!

➡ **Use multiple attention vectors** for the same word and take a weighted average of them: Multi-Head Attention block

(Q. vectors from different sentence..?)

Attention vectors are fed to the feed-forward network one vector at a time

The attention vectors of different words are **independent** of each other

➡ The feed-forward network can be parallelized!

➡ All words can be passed to the encoder block at the same time, and the output is a set of encoded vectors, one for every word
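A small sketch of why this step parallelizes, assuming a toy two-layer feed-forward network with made-up weights (`W1`, `W2`, and `feed_forward` are hypothetical names). Each vector is transformed without looking at any other vector, so the per-word results do not depend on processing order.

```python
def relu(x):
    return [max(0.0, v) for v in x]

def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# Toy weights, chosen only for illustration
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 1.0]]  # maps 2 -> 3 dimensions
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]     # maps 3 -> 2 dimensions

def feed_forward(x):
    """Apply the same two-layer network to one attention vector."""
    return matvec(W2, relu(matvec(W1, x)))

attention_vectors = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]  # one per word

# One word at a time
sequential = [feed_forward(v) for v in attention_vectors]

# Processing the words in another order gives the same per-word results,
# which is what lets a framework compute all of them at once
reordered = [feed_forward(v) for v in reversed(attention_vectors)]
assert sequential == list(reversed(reordered))
```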

In an English -> French translation task, we feed the output French sentence to the decoder

It generates attention vectors (of French) showing how much each word is related to the others

Attention vectors from both the Encoder (English) and the Decoder (French) are passed to another **Encoder-Decoder Attention block**.

➡ output of this block: attention vectors of all words (English + French)

➡ Each attention vector shows the relationship with the other words, including both languages

➡ English to French word **mapping** happens!

If the decoder could see all the words in the French sentence, there would be no learning; it would just spit out the next word

➡ **mask the input**: each word observes only the previous words and itself
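The mask can be sketched as a lower-triangular matrix, where row i marks which positions word i may look at. This assumes the convention that 1 means "visible" and 0 means "hidden"; `causal_mask` is an illustrative name.

```python
def causal_mask(length):
    """mask[i][j] == 1 if position i may attend to position j (j <= i)."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]

for row in causal_mask(4):
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]
```

Positions marked 0 are given a very negative attention score before softmax, so they end up with a weight close to zero.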

V, K, Q: abstract vectors that extract different components of the input words

We have V, K, Q vectors for every single word

➡ create an attention vector for every word using V, K, Q
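A worked toy example of this step, assuming scaled dot-product attention (softmax of Q·Kᵀ/√d, then a weighted sum of V). The Q, K, V values are made up, and `attention` and `softmax` are illustrative helper names.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return one attention vector per word, plus the attention weights."""
    d = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # similarity of this word's query with every word's key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        # weighted sum of the value vectors
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy Q, K, V for a 3-word sentence (values are made up)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = attention(Q, K, V)
```

Each row of `weights` sums to 1 and says how much that word attends to every word in the sentence; each row of `outputs` is the resulting attention vector.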

There are multiple sets of weight matrices (Wv, Wk, Wq)

➡ multiple attention vectors for every word

➡ multiply by another weight matrix (Wz)

➡ now the feed-forward network is fed only one attention vector per word

Pass each attention vector to a feed-forward unit

: another feed-forward layer

Used to expand the dimension to the number of words in the French vocabulary

Softmax transforms it into a probability distribution

Output: the word with the highest probability of coming next
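A rough sketch of these last steps, assuming a made-up four-word French vocabulary and toy weights (`vocab`, `W`, and `next_word` are hypothetical). The linear layer maps the decoder vector to one score per vocabulary word, softmax turns the scores into probabilities, and the most probable word is picked.

```python
import math

vocab = ["le", "gros", "chien", "rouge"]  # hypothetical tiny vocabulary

# Toy linear layer: one row of weights per vocabulary word (3 -> 4)
W = [
    [0.2, -0.1, 0.0],
    [0.0, 0.3, 0.1],
    [1.0, 0.5, -0.2],
    [-0.3, 0.1, 0.4],
]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def next_word(decoder_vector):
    # linear layer: expand to one score (logit) per vocabulary word
    logits = [sum(w * x for w, x in zip(row, decoder_vector)) for row in W]
    # softmax: turn the scores into a probability distribution
    probs = softmax(logits)
    # pick the index with the highest probability
    best = max(range(len(vocab)), key=lambda i: probs[i])
    return vocab[best], probs

word, probs = next_word([1.0, 2.0, 0.5])
print(word)  # chien
```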

Code reference: https://github.com/hyunwoongko/transformer

Scaled Dot-Product Attention

```
import math

from torch import nn


class ScaleDotProductAttention(nn.Module):
    """
    Compute scaled dot-product attention.

    Query : the sentence we are focusing on (decoder)
    Key   : every sentence to check the relationship with the Query (encoder)
    Value : every sentence, same as Key (encoder)
    """

    def __init__(self):
        super().__init__()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, q, k, v, mask=None, e=1e-12):
        # inputs are 4-dimensional tensors: [batch_size, head, length, d_tensor]
        # `e` is kept in the signature for compatibility; it is not used below
        batch_size, head, length, d_tensor = k.size()

        # 1. dot product of Query with Key^T to compute similarity
        k_t = k.transpose(2, 3)
        score = (q @ k_t) / math.sqrt(d_tensor)  # scaled dot product

        # 2. apply masking (optional): masked positions get a large negative
        #    score so that softmax gives them a weight close to zero
        if mask is not None:
            score = score.masked_fill(mask == 0, -1e9)

        # 3. softmax over the key dimension so each row sums to 1
        score = self.softmax(score)

        # 4. weight the Values by the attention scores
        v = score @ v
        return v, score
```

Multi-head Attention

```
from torch import nn

# ScaleDotProductAttention is the class defined in the previous block


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_head):
        super().__init__()
        self.n_head = n_head
        self.attention = ScaleDotProductAttention()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_concat = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        # 1. apply the learned linear projections (weight matrices)
        q, k, v = self.w_q(q), self.w_k(k), self.w_v(v)

        # 2. split each tensor by the number of heads
        q, k, v = self.split(q), self.split(k), self.split(v)

        # 3. scaled dot-product attention to compute similarity
        out, attention = self.attention(q, k, v, mask=mask)

        # 4. concatenate the heads and pass through the output linear layer
        out = self.concat(out)
        out = self.w_concat(out)

        # 5. visualize attention map
        # TODO : we should implement visualization
        return out

    def split(self, tensor):
        """
        Split tensor by number of heads.

        :param tensor: [batch_size, length, d_model]
        :return: [batch_size, head, length, d_tensor]
        """
        batch_size, length, d_model = tensor.size()
        d_tensor = d_model // self.n_head
        # similar to group convolution (split by number of heads)
        tensor = tensor.view(batch_size, length, self.n_head, d_tensor).transpose(1, 2)
        return tensor

    def concat(self, tensor):
        """
        Inverse of self.split(tensor : torch.Tensor).

        :param tensor: [batch_size, head, length, d_tensor]
        :return: [batch_size, length, d_model]
        """
        batch_size, head, length, d_tensor = tensor.size()
        d_model = head * d_tensor
        tensor = tensor.transpose(1, 2).contiguous().view(batch_size, length, d_model)
        return tensor
```

REF

https://github.com/hyunwoongko/transformer

https://www.youtube.com/watch?v=TQQlZhbC5ps