RNN, Attention - Machine Learning for Visual Understanding 3

zzwon1212 · July 11, 2024

11. Recurrent Neural Networks

  • Sequential Data
    The label $y$ may be a single value or a whole sequence, depending on the task.

    • Image Captioning
    • Visual Question Answering (VQA)
    • Visual Dialog (Conversation about an Image)
    • Vision-and-Language Navigation
  • Types of Neural Networks

    • One-to-one
    • Many-to-one
    • One-to-many
    • Many-to-many
    • Sequence-to-sequence
  • Internal State
    At each step, the new internal state is determined by its old state as well as the input (feedback loop).

  • The same function ($f_\mathrm{W}$) and the same set of parameters ($\mathrm{W}$) are used at every time step (see the NumPy sketch after this list):

    $$h_t = f_\mathrm{W}(h_{t-1}, x_t) = \tanh(\mathrm{W}_{hh} h_{t-1} + \mathrm{W}_{xh} x_t)$$
    • For binary classification (many-to-many)
      $\hat{y}_t = \sigma(\mathrm{W}_{hy} h_t)$
    • For regression (many-to-many)
      $\hat{y}_t = \mathrm{W}_{hy} h_t$
  • Multi-layer RNN

  • LSTM (gate equations sketched after this list)

    • cell state
    • forget gate
    • input gate
    • output gate
  • GRU
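A minimal NumPy sketch of the recurrence above, reusing the same $f_\mathrm{W}$ and the same weights at every time step. The sizes and random initialization are illustrative assumptions, not values from the lecture:

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy):
    """One vanilla RNN step: h_t = tanh(W_hh @ h_prev + W_xh @ x_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_hat = W_hy @ h_t  # regression head; wrap in a sigmoid for binary classification
    return h_t, y_hat

# Illustrative sizes (assumptions): 3-dim input, 4-dim hidden state, scalar output
rng = np.random.default_rng(0)
W_hh = rng.normal(size=(4, 4)) * 0.1
W_xh = rng.normal(size=(4, 3)) * 0.1
W_hy = rng.normal(size=(1, 4)) * 0.1

h = np.zeros(4)               # initial hidden state
xs = rng.normal(size=(5, 3))  # a toy sequence of 5 inputs
for x in xs:                  # the same f_W and W are applied at every step
    h, y_hat = rnn_step(h, x, W_hh, W_xh, W_hy)
```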

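For reference, a sketch of the standard LSTM update (the textbook formulation; the lecture's exact notation may differ). The forget, input, and output gates control how the cell state $c_t$ is updated and read out:

$$
\begin{aligned}
f_t &= \sigma(\mathrm{W}_f [h_{t-1}, x_t] + b_f) \quad \text{(forget gate)} \\
i_t &= \sigma(\mathrm{W}_i [h_{t-1}, x_t] + b_i) \quad \text{(input gate)} \\
o_t &= \sigma(\mathrm{W}_o [h_{t-1}, x_t] + b_o) \quad \text{(output gate)} \\
\tilde{c}_t &= \tanh(\mathrm{W}_c [h_{t-1}, x_t] + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$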

12. RNN-based Video Models

Attention Mechanism

  • RNNs suffer from an information loss problem: the whole input sequence must be compressed into a fixed-size hidden state, so information from early steps fades as the sequence grows. Attention addresses this by letting the decoder look at all encoder states directly.

  • Attention Summary (see the sketch after this list)

    • Query: decoder hidden state $s_0$
    • Key, Value: encoder hidden states $\{h_1, h_2, h_3, \dots\}$
    • Attention Value: weighted average of encoder hidden states
      • Weights: similarity to $s_0$ (attention coefficients)
  • Attention-based Video Models

    • MultiLSTM
      • Query: previous hidden state $h_{i-1}$ of LSTM
      • Key, Value: $N$ recent input frame features
      • Attention value: weighted sum of the recent $N$ frame features
    • Visual Attention
      • Spatial attention
        "Where should we focus on the 2D image space to classify the video correctly?"
        Spatial attention provides interpretability.
        • $\mathrm{l}_t$: spatial attention coefficients
        • $\mathrm{X}_t$: the last conv-layer representation of an input image
      • Query: previous hidden state of the last LSTM ($h_{t-1}$)
      • Key, Value: $K \times K$ regional features from input $\mathrm{X}_t$
      • Attention value: weighted sum of region features
        • Weights: proportional to relevance to $h_{t-1}$

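A minimal NumPy sketch of the attention pattern summarized above: score each key against the query, softmax the scores into attention coefficients, and take the weighted average of the values. Dot-product similarity and the toy shapes are assumptions, not the lecture's exact choices; MultiLSTM and spatial attention reuse the same pattern with frame features or $K \times K$ regional features as the keys/values:

```python
import numpy as np

def attention(query, keys, values):
    """Weighted average of values, weighted by query-key similarity.

    query:  (d,)   e.g. decoder state s_0 or previous LSTM state h_{t-1}
    keys:   (T, d) e.g. encoder hidden states {h_1, ..., h_T}
    values: (T, d) often the same states as the keys
    """
    scores = keys @ query                    # dot-product similarity (an assumed choice)
    weights = np.exp(scores - scores.max())  # softmax -> attention coefficients
    weights /= weights.sum()
    return weights @ values, weights         # attention value, coefficients

# Toy example (illustrative sizes): 6 encoder states of dimension 4
rng = np.random.default_rng(0)
h = rng.normal(size=(6, 4))  # encoder hidden states (keys and values)
s0 = rng.normal(size=4)      # decoder hidden state (query)
context, attn = attention(s0, h, h)
```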