Day 45: Deep Learning 8, Attention Mechanism

차지예 · July 17, 2025

📚 Analysis of Key Papers on the Attention Mechanism

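1. Bahdanau Attention (2015)

Paper title: Neural Machine Translation by Jointly Learning to Align and Translate

Authors: Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

✅ Purpose of the Paper

In the conventional Encoder–Decoder architecture, the entire source sentence is compressed into a single fixed-length vector, so performance degrades sharply on long sentences.
The paper proposes a model that learns alignment and translation jointly.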

🔧 Methodology

The encoder is a bidirectional RNN.
It produces an annotation vector $h_j$ for each input word position.
To generate each output word $y_i$, the decoder uses the previous state $s_{i-1}$, the previous word $y_{i-1}$, and the context vector $c_i$.
The context vector is a weighted sum of the annotations:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \quad e_{ij} = a(s_{i-1}, h_j)$$

The alignment model $a$ is a feedforward neural network trained jointly with the rest of the system.
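
The context-vector computation above can be sketched in a few lines of plain Python. This is only an illustrative sketch: it assumes the alignment model takes the additive form $e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$, and the function names and toy weight values are hypothetical, chosen just to make the example run.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_score(s_prev, h_j, W_a, U_a, v_a):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j) (assumed one-hidden-layer form)."""
    hidden = []
    for row_w, row_u in zip(W_a, U_a):
        pre = sum(w * x for w, x in zip(row_w, s_prev))
        pre += sum(u * x for u, x in zip(row_u, h_j))
        hidden.append(math.tanh(pre))
    return sum(v * h for v, h in zip(v_a, hidden))

def context_vector(s_prev, annotations, W_a, U_a, v_a):
    """Return (alpha_i, c_i) for a single decoder step i."""
    scores = [additive_score(s_prev, h_j, W_a, U_a, v_a) for h_j in annotations]
    alphas = softmax(scores)  # alpha_ij, sums to 1 over source positions j
    dim = len(annotations[0])
    context = [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]
    return alphas, context

# Toy inputs (hypothetical values): 3 source positions, 2-dim vectors.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
W_a = [[0.1, 0.2], [0.3, -0.1]]
U_a = [[0.4, -0.3], [0.2, 0.5]]
v_a = [1.0, -1.0]

alphas, context = context_vector(s_prev, annotations, W_a, U_a, v_a)
print("attention weights:", alphas)
print("context vector:", context)
```

Because the weights $\alpha_{ij}$ are a softmax, the context vector is always a convex combination of the annotation vectors, which is what makes the attention "soft".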

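🏁 Results

On the WMT’14 English-French translation task, the model achieved higher BLEU scores than the basic RNN Encoder–Decoder baseline.
Performance stays stable on long sentences in particular, and visualizing the soft attention weights confirmed that the model learned linguistically plausible alignments.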

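2. Luong Attention (2015)

Paper title: Effective Approaches to Attention-based Neural Machine Translation

Authors: Minh-Thang Luong, Hieu Pham, Christopher D. Manning

✅ Purpose of the Paper

The paper sets out to compare a range of attention architectures experimentally, going beyond the existing Bahdanau model.
It proposes two variants: global attention, which attends to every source position, and local attention, which attends to a subset of them.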

🔧 Methodology

Both the encoder and the decoder are stacked LSTMs.
Input feeding: the previous attentional vector $\tilde{h}_{t-1}$ is fed as input at the next time step.
Score functions for global attention:

$$\text{score}(h_t, \bar{h}_s) = \begin{cases} h_t^\top \bar{h}_s & \text{(dot)} \\ h_t^\top W_a \bar{h}_s & \text{(general)} \\ v_a^\top \tanh(W_a [h_t ; \bar{h}_s]) & \text{(concat)} \end{cases}$$
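
To make the three score functions concrete, here is a minimal sketch in plain Python. The helper names (`dot`, `matvec`, `score_dot`, and so on) and the toy weights are assumptions introduced for illustration; they are not taken from the paper's code.

```python
import math

def dot(u, w):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, w))

def matvec(M, x):
    """Matrix-vector product, with M given as a list of rows."""
    return [dot(row, x) for row in M]

def score_dot(h_t, h_s):
    # dot: h_t^T h_s (requires equal dimensions, no parameters)
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    # general: h_t^T W_a h_s (one learned matrix)
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    # concat: v_a^T tanh(W_a [h_t ; h_s]); h_t + h_s is list concatenation here
    hidden = [math.tanh(x) for x in matvec(W_a, h_t + h_s)]
    return dot(v_a, hidden)

# Toy inputs (hypothetical values).
h_t = [1.0, 2.0]            # current decoder state
h_s = [0.5, -1.0]           # one encoder state
identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.5, 0.5, -0.5]]
v_a = [1.0, 1.0]

s_dot = score_dot(h_t, h_s)
s_general = score_general(h_t, h_s, identity)
s_concat = score_concat(h_t, h_s, W_concat, v_a)
print(s_dot, s_general, s_concat)
```

With $W_a$ set to the identity matrix, the general score reduces to the dot score, which is one way to see that general is a strict generalization of dot.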

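Local attention (predictive alignment):

$$p_t = S \cdot \text{sigmoid}(v_p^\top \tanh(W_p h_t))$$

$$a_t(s) = \text{align}(h_t, \bar{h}_s) \cdot \exp\left(-\frac{(s - p_t)^2}{2\sigma^2}\right)$$

Here $S$ is the source sentence length and $p_t$ is the predicted aligned position. Local attention performs soft attention only within the window $[p_t - D, p_t + D]$, and the paper sets $\sigma = D/2$.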
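
The predictive local attention step can be sketched as follows. This is a sketch under stated assumptions: positions are 0-indexed, the window is centered on the integer part of $p_t$ and clipped to the sentence boundaries, `align` is a softmax over dot scores inside the window, and $\sigma = D/2$. The function name `local_p_attention` and all toy values are hypothetical.

```python
import math

def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

def local_p_attention(h_t, source_states, W_p, v_p, D):
    """Return (p_t, weights, context) for one decoder step. Requires D >= 1."""
    S = len(source_states)
    # p_t = S * sigmoid(v_p^T tanh(W_p h_t)), strictly between 0 and S.
    hidden = [math.tanh(dot(row, h_t)) for row in W_p]
    p_t = S / (1.0 + math.exp(-dot(v_p, hidden)))
    # Window around p_t, clipped to valid source positions.
    center = int(p_t)
    positions = range(max(0, center - D), min(S - 1, center + D) + 1)
    # align(h_t, h_s): softmax over dot scores inside the window (assumed score).
    scores = [dot(h_t, source_states[s]) for s in positions]
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    # Gaussian centered at p_t favors positions close to the predicted alignment.
    sigma = D / 2.0
    weights = {
        s: (e / total) * math.exp(-((s - p_t) ** 2) / (2.0 * sigma ** 2))
        for s, e in zip(positions, exps)
    }
    dim = len(source_states[0])
    context = [sum(w * source_states[s][d] for s, w in weights.items()) for d in range(dim)]
    return p_t, weights, context

# Toy inputs (hypothetical values): 6 source positions, 2-dim states.
source_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0], [1.0, -1.0]]
h_t = [0.2, 0.4]
W_p = [[0.5, -0.5], [0.3, 0.3]]
v_p = [1.0, 1.0]
D = 2

p_t, weights, context = local_p_attention(h_t, source_states, W_p, v_p, D)
print("predicted position:", p_t)
print("weights by source position:", weights)
```

Note that after the Gaussian is applied the weights no longer sum to 1; the Gaussian only rescales the softmax weights inside the window.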

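🏁 Results

On the WMT’15 English-German task, an ensemble of models reached 25.9 BLEU, the state of the art at the time.
Local attention with predictive alignment edged out global attention while attending only to a window of source positions, and among the score functions, general gave the best results in that local setting.
Attention-based models gained up to 5.0 BLEU over non-attentional systems.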

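3. Scaled Dot-Product Attention (Vaswani et al., 2017)

Paper title: Attention Is All You Need

Authors: Ashish Vaswani et al. (Google Brain/Research)

✅ Purpose of the Paper

To overcome the limited parallelism of RNN-based architectures, the paper proposes the Transformer, a model built on attention alone, with no recurrence.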

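🔧 Methodology

An Encoder–Decoder architecture built from attention layers (self-attention inside the encoder and the decoder, plus encoder-decoder attention) and position-wise feed-forward layers, with no recurrence or convolution.
Each attention head is computed as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) V$$

Several attention heads run in parallel, and their outputs are concatenated and projected:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O$$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$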
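
The two formulas above fit in a short plain-Python sketch. It is meant only as an illustration: matrices are lists of rows, there is no masking, batching, or dropout, and the projection matrices in the example are hypothetical selection matrices chosen so the result is easy to check by hand.

```python
import math

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, one row of output per query."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(Q, K, V, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = [
        attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the head outputs row by row, then apply the output projection.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(Q))]
    return matmul(concat, W_O)

# Toy self-attention (hypothetical values): 3 tokens, d_model = 4, h = 2, d_k = 2.
X = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
first_half = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]   # keeps dims 0-1
second_half = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # keeps dims 2-3
heads = [(first_half,) * 3, (second_half,) * 3]
W_O = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

out = multi_head(X, X, X, heads, W_O)
for row in out:
    print(row)
```

The paper motivates the $1/\sqrt{d_k}$ factor by noting that dot products grow in magnitude with $d_k$, which pushes the softmax into regions with very small gradients. Every head depends only on matrix products over the whole sequence, so all positions can be processed at once, unlike an RNN's step-by-step recurrence.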

Because the model has no recurrence, word order is injected through a positional encoding, defined as:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \quad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
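
The sinusoidal encoding is straightforward to compute directly from the formula. The sketch below is a minimal illustration; the function name `positional_encoding` and the small sizes in the example are assumptions made for readability.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for k in range(0, d_model, 2):
            # k plays the role of 2i, so the exponent 2i/d_model is k/d_model.
            angle = pos / (10000 ** (k / d_model))
            pe[pos][k] = math.sin(angle)          # even dimension 2i
            if k + 1 < d_model:
                pe[pos][k + 1] = math.cos(angle)  # odd dimension 2i+1
    return pe

pe = positional_encoding(max_len=5, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(x, 4) for x in row])
```

Each pair of dimensions shares one frequency, and the frequencies decrease geometrically across dimensions, so low dimensions change quickly with position while high dimensions change slowly.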

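🏁 Results

WMT’14 English-German: 28.4 BLEU; English-French: 41.8 BLEU.
The big model trained in just 3.5 days on 8 GPUs, far cheaper than earlier top-performing models.
The results demonstrate both the parallelism and the strong performance of self-attention.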

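🔍 Comparison of the Three Papers

Item	Bahdanau (2015)	Luong (2015)	Vaswani (2017)
Purpose	Overcome the fixed-length vector bottleneck	Compare different attention architectures	Train with attention alone, without RNNs
Architecture	BiRNN + Soft Attention	Stacked LSTM + Global/Local	Transformer (Self-Attention)
Attention type	Additive (FFNN-based)	Dot/General/Concat + Gaussian	Scaled Dot-Product + Multi-Head
Results	Higher BLEU on long sentences	SOTA at the time (25.9 BLEU, WMT’15 English-German)	Best BLEU (28.4 English-German, 41.8 English-French) + fast training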
