CS224N Lecture 9

Sunmin Lee · April 5, 2023

We want to replace the building block.

Let's look at several building blocks that serve the same purpose but have different characteristics.

Word window
: aggregates local context

  • word window classifiers -> used to represent information about the center word

  • 1D convolution (more on this later!)

  • the computation does not grow with the sequence length

  • word window classifier -> builds a representation of the word that takes its local context into account

  • each local window can be looked at independently to compute the word's hidden representation h

  • layers can be stacked (this looks like an LSTM encoder; see the sketch after this section)

  • if the window is truncated so it cannot see future information, it is the same as a language model's decoder

  • However, long-distance dependencies are still a problem
    - maximum interaction distance = sequence length / window size
      (e.g., with window size 5 and sequence length 100, words at opposite ends need on the order of 20 stacked layers to interact)

    • window size = local n words
    • stacking windows without growing the window size -> information can travel pretty far (but a structural limit remains)

(contextualizer: a layer that contextualizes each word's representation)
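Below is a minimal NumPy sketch of such a word-window contextualizer (the names window_contextualize, W1, W2 are illustrative, not from the lecture). Each output depends only on a fixed-size local window, every position can be computed independently, and stacking layers grows the receptive field without growing the window size.

```python
import numpy as np

def window_contextualize(X, W, b, window=3):
    """Word-window layer: each output h_i depends only on the `window`
    embeddings centered at position i (its local context).
    X: (seq_len, d_in) embeddings, W: (window * d_in, d_out), b: (d_out,)"""
    seq_len, d_in = X.shape
    pad = window // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))       # zero-pad the sequence edges
    H = np.empty((seq_len, W.shape[1]))
    for i in range(seq_len):                   # each position is independent -> parallelizable
        ctx = Xp[i:i + window].reshape(-1)     # concatenate the local window
        H[i] = np.tanh(ctx @ W + b)
    return H

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))                   # toy sequence: 10 words, 4-dim embeddings
W1, b1 = rng.normal(size=(3 * 4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(3 * 8, 8)), np.zeros(8)
H = window_contextualize(window_contextualize(X, W1, b1), W2, b2)  # two stacked layers
print(H.shape)  # (10, 8); with window 3, two layers let each word see 2 neighbors on each side
```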

Attention

  • Query : a word's representation, used to access information from a set of values

  • block structure -> in each layer, every word's embedding attends to the previous layer
    (all words attend to all words in the previous layer)

Attention equation

e_ij = scalar values, not bounded in size
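For reference, the standard dot-product self-attention equations (the form these notes refer to), with $q_i$ the query for word $i$ and $k_j$, $v_j$ the key and value for word $j$:

$$
e_{ij} = q_i^\top k_j, \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}, \qquad
o_i = \sum_j \alpha_{ij}\, v_j
$$

The unbounded scalar scores $e_{ij}$ are normalized by a softmax into weights $\alpha_{ij}$, which are then used to average the value vectors.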

  • Difference from a fully connected layer
    : interaction weights. In a fully connected layer the connection weights are fixed parameters that do not change with the input;
    attention : the interaction comes from the key and query vectors, so the weights depend on the actual content
    the parametrization is different! there is no independent connection weight for every pair of positions
    1) dynamic connectivity
    2) has an inductive bias (it does not connect everything to everything with feed-forward weights)
    (see the sketch below for the contrast)
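A rough NumPy sketch of this contrast (single head, toy dimensions; the names self_attention and fully_connected_mixing are illustrative): the attention weights A are recomputed from the content of X every time, while W_pos is a fixed parameter that connects every position to every position regardless of content.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head dot-product self-attention: dynamic connectivity,
    because the mixing weights A are computed from the content of X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    E = Q @ K.T               # e_ij = q_i . k_j  (scalar, unbounded)
    A = softmax(E, axis=-1)   # alpha_ij: each row sums to 1
    return A @ V              # o_i = sum_j alpha_ij * v_j

def fully_connected_mixing(X, W_pos):
    """Contrast: a fixed position-mixing layer. W_pos (seq_len x seq_len)
    is a learned parameter that does NOT depend on the content of X."""
    return W_pos @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 8))                      # 10 words, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)        # (10, 8)
print(fully_connected_mixing(X, rng.normal(size=(10, 10))).shape)  # (10, 8)
```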
