LLaVA

YEOM JINSEOP · September 25, 2024

Multi-modal LLMs


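Limitations of LLaVA (as pointed out in LLaVA-1.5)
- Underperforms on academic benchmarks that require short-form answers.
- Tends to answer "yes" to yes/no questions, because that kind of data is underrepresented in its training distribution.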

Architecture
- LLM $f_{\phi}(\cdot)$
  - Vicuna
- Vision encoder $g(\cdot)$
  - pre-trained CLIP visual encoder ViT-L/14
- Linear layer $\mathbf{W}$
  - a trainable projection matrix that projects image features into the word embedding space.
  - the resulting language embedding tokens $\mathbf{H}_v$ have the same dimension as the word embedding space.

To connect image and language representations, LLaVA uses a linear layer, Flamingo uses gated cross-attention, and BLIP-2 uses a Q-Former.
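
The linear connector can be sketched in a few lines of NumPy. This is only a shape-level illustration: random arrays stand in for the CLIP features and the embedded instruction tokens, and the sizes (256 patches, vision width 1024, LLM width 4096) are assumptions chosen to resemble ViT-L/14 and a 7B Vicuna, not values taken from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes (hypothetical, for illustration only).
num_patches, d_vision, d_model = 256, 1024, 4096

# Stand-in for the visual features Z_v = g(X_v) from the frozen vision encoder.
Z_v = rng.standard_normal((num_patches, d_vision), dtype=np.float32)

# Trainable projection matrix W (row-vector convention, so H_v = Z_v @ W).
W = 0.02 * rng.standard_normal((d_vision, d_model), dtype=np.float32)

# Project image features into the word embedding space.
H_v = Z_v @ W                      # shape (256, 4096)

# Stand-in for 12 embedded instruction tokens from the LLM's embedding table.
H_q = rng.standard_normal((12, d_model), dtype=np.float32)

# Visual tokens and text tokens now share one width and can be concatenated
# into a single input sequence for the LLM.
llm_input = np.concatenate([H_v, H_q], axis=0)   # shape (268, 4096)
print(H_v.shape, llm_input.shape)
```

Because the connector is a single matrix multiply, every image patch becomes exactly one visual token. Cross-attention or Q-Former style connectors, in contrast, add their own attention layers between the two models.
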

Training

Stage 1: Pre-training for feature alignment
- goal: align the image features $\mathbf{H}_v$ with the pre-trained LLM word embedding.
- frozen ❄️: visual encoder weights, LLM weights
- updated 🔥: projection layer weights $\mathbf{W}$
- objective: maximize the likelihood of Eq. (3) with trainable parameters $\theta = \mathbf{W}$
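
For reference, Eq. (3) of the LLaVA paper is the autoregressive likelihood of the answer tokens. The equation itself is not reproduced in these notes, so the form below is written from the paper's notation and should be checked against the original. Here $\mathbf{X}_v$ is the image, $\mathbf{X}_{instruct}$ the instruction tokens, $\mathbf{X}_a$ the answer tokens, and $L$ the sequence length:

$$
p(\mathbf{X}_a \mid \mathbf{X}_v, \mathbf{X}_{instruct}) = \prod_{i=1}^{L} p_{\theta}\left(x_i \mid \mathbf{X}_v, \mathbf{X}_{instruct,<i}, \mathbf{X}_{a,<i}\right)
$$

The two training stages share this objective and differ only in which parameters are included in $\theta$.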

Stage 2: Fine-tuning end-to-end
- frozen ❄️: visual encoder weights
- updated 🔥: projection layer weights $\mathbf{W}$, LLM weights
- trainable parameters: $\theta = \{\mathbf{W}, \phi\}$
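
The frozen/updated split of the two stages can be expressed in PyTorch by toggling `requires_grad`. The sketch below is a toy, not the LLaVA codebase: `ToyLLaVA`, `set_stage`, and the tiny `nn.Linear` stand-ins for the vision encoder and the LLM are hypothetical names introduced only for illustration.

```python
import torch
from torch import nn


class ToyLLaVA(nn.Module):
    """Tiny stand-in: three submodules mirroring g(.), W, and f_phi(.)."""

    def __init__(self, d_vision: int = 6, d_model: int = 4):
        super().__init__()
        self.vision_encoder = nn.Linear(8, d_vision)                # stand-in for g(.)
        self.projector = nn.Linear(d_vision, d_model, bias=False)   # W
        self.llm = nn.Linear(d_model, d_model)                      # stand-in for f_phi(.)


def set_stage(model: nn.Module, stage: int) -> list[str]:
    """Freeze or unfreeze submodules for a stage; return trainable parameter names."""
    trainable = {1: ("projector",), 2: ("projector", "llm")}[stage]
    for name, param in model.named_parameters():
        # The first component of the name is the submodule it belongs to.
        param.requires_grad_(name.split(".")[0] in trainable)
    return [n for n, p in model.named_parameters() if p.requires_grad]


model = ToyLLaVA()
stage1 = set_stage(model, 1)   # only W is updated
stage2 = set_stage(model, 2)   # W and the LLM are updated; vision encoder stays frozen
print(stage1)
print(stage2)

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)
```

The vision encoder stays frozen in both stages, so only the projector (stage 1) or the projector plus the LLM (stage 2) ever reach the optimizer.
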