Transformer-Attention

박찬호 · October 24, 2025

Prior Knowledge

Tensor: Carries Info and Preserves It Through Weights

Tensor: represents information. Even after it passes through weights, it still preserves the previous information.

Weighted sum

The weights in a set add up to a fixed total (typically 1), and each weight can be adjusted to control how much its tensor contributes to the result.
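As a minimal sketch of the weighted sum described above, the snippet below combines three small vectors using weights that sum to 1. The function name `weighted_sum` and the sample numbers are illustrative assumptions, not something taken from the original notes.

```python
def weighted_sum(weights, vectors):
    """Combine vectors into one vector, scaling each by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


# Hypothetical example: three 2-dimensional vectors.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.25, 0.25]  # assumed to sum to 1

result = weighted_sum(weights, vectors)
print(result)  # [0.75, 0.5]
```

Raising one weight (and lowering the others so the total stays the same) pulls the result toward the corresponding vector.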

Inner Product (Dot Product)

Multiply the values of two vectors in each matching dimension and add the products. The higher the correlation between the vectors, the higher the resulting value.
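The inner product can be sketched in a few lines. The vectors below are made-up examples chosen so that one points the same way as the query, one is unrelated, and one points the opposite way; the name `inner_product` is an illustrative choice.

```python
def inner_product(a, b):
    """Multiply matching dimensions of two vectors and add the products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 2.0, 0.0]
similar = [1.0, 2.0, 0.0]     # same direction as the query
unrelated = [0.0, 0.0, 3.0]   # shares no active dimension with the query
opposite = [-1.0, -2.0, 0.0]  # opposite direction

print(inner_product(query, similar))    # 5.0
print(inner_product(query, unrelated))  # 0.0
print(inner_product(query, opposite))   # -5.0
```

Note that the raw inner product also grows with vector length, so it reflects direction and magnitude together.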

Attention

QKV and Attention Logic

Queries, Keys, Values

Wq-Wk-Wv

Wq-Wk (Inner Product) / How similar is each Key to the Query?
Wq-Wk (Weighted Sum) / Exponentiate and normalize the scores into weights
Wv -> Result Value (the weighted sum of the Values)
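The three steps above (inner product, exponentiation and normalization, weighted sum of Values) can be sketched as follows. This is a toy version under stated assumptions: `query`, `keys`, and `values` are taken to be already projected by Wq, Wk, and Wv, the scores are divided by the square root of the dimension as in the standard scaled formulation (the notes do not mention scaling), and all names are illustrative.

```python
import math


def softmax(scores):
    """Exponentiate the scores and normalize them so they sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Single-query attention: score keys, turn scores into weights, mix values."""
    scale = math.sqrt(len(query))  # assumption: standard sqrt(d) scaling
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


# Hypothetical, already-projected vectors.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

weights, output = attention(query, keys, values)
print(weights)  # the first key matches the query, so it gets the larger weight
print(output)   # the result leans toward the first value
```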

Transformer

Limitations of the Inner Product

Cannot distinguish HOMONYMS
Cannot take other inner products into account
Cannot know sequence order or context

Sequence

Positional Encoding

Position Vector Layer
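One common way to build the position vectors is the sinusoidal scheme, sketched below. Whether the original notes mean this scheme or a learned position layer is not stated, so treat the choice as an assumption; the names `positional_encoding` and `add_position` are illustrative.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors: sine on even dimensions, cosine on odd ones."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** (2 * (i // 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_position(embeddings):
    """Add a position vector to each token embedding, element by element."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)] for emb, pos in zip(embeddings, pe)]


# Hypothetical input: the same token embedding at two different positions.
embeddings = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
encoded = add_position(embeddings)
print(encoded[0])  # [0.5, 1.5, 0.5, 1.5]
print(encoded[1])  # differs from encoded[0], so position is now visible
```

Because identical tokens end up with different vectors at different positions, the inner product downstream can pick up on order.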

Context

Self Attention
Context Encoder Layer

Masking

How does the TRANSFORMER work?

Basic Structure: encoder-decoder

BERT (Encoder-only)
GPT (Decoder-only)

In translation, the Encoder output -> Key and Value of the middle Attention layer in the Transformer Decoder
Inferring the Nth answer in the sequence by REFERRING to the values up to the (N-1)th
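The step-by-step inference described above can be sketched as a loop that produces the Nth token from the tokens generated up to N-1. Everything here is a toy: `toy_next_token` is a hypothetical stand-in for a real decoder step (it simply uppercases the source word at the current position), and the `<bos>` / `<eos>` markers are assumed conventions.

```python
def toy_next_token(encoder_memory, generated):
    """Hypothetical decoder step: looks at the encoder output and the tokens so far."""
    position = len(generated) - 1  # tokens produced so far, excluding <bos>
    if position < len(encoder_memory):
        return encoder_memory[position].upper()
    return "<eos>"


def greedy_decode(encoder_memory, max_len=10):
    """Generate one token at a time, each conditioned on all previous tokens."""
    generated = ["<bos>"]
    for _ in range(max_len):
        token = toy_next_token(encoder_memory, generated)
        if token == "<eos>":
            break
        generated.append(token)
    return generated[1:]  # drop the start marker


source = ["i", "like", "tea"]  # stands in for the encoder output
translation = greedy_decode(source)
print(translation)  # ['I', 'LIKE', 'TEA']
```

In a real decoder, the lookup inside `toy_next_token` is replaced by attention over the encoder output plus masked self-attention over the generated tokens.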

Parallelized Training

Problem of Parallelized Training

Every Decoder token can see every other token and its context info, including the ones that come after it.

Solution

Mask the later (future) words.
Give the inner product with each masked Key an extremely NEGATIVE value, so its weight becomes effectively zero.
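A minimal sketch of this masking step, assuming the score matrix (row = query position, column = key position) has already been computed. It uses negative infinity as the extreme negative value; a large finite number such as -1e9 is a common alternative. The function names are illustrative.

```python
import math


def softmax(scores):
    """Exponentiate the scores and normalize them so they sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(scores):
    """Mask future positions (column j > row i) before applying softmax."""
    n = len(scores)
    weights = []
    for i in range(n):
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        weights.append(softmax(masked))
    return weights


# Hypothetical raw scores for a 3-token sequence.
scores = [
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
    [1.0, 2.0, 3.0],
]
weights = causal_attention_weights(scores)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
print(weights[1])  # the third entry is 0.0: the future token is hidden
```

With the mask in place, all positions can still be trained in parallel, because each row only receives weight from positions at or before it.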
