Visual Instruction Tuning (NeurIPS 2023)

박상우·2024년 2월 29일

Paper Review

목록 보기

42/51

Multi-modal vision language model을 만드는 것은 core inspiration
Language는 image content를 describe 함
- 이는 image가 visual signal을 language semantic으로 전환할 수 있음을 뜻함
- 이는 보통의 인간의 소통과 유사
LLM은 wider role을 가짐
- 단 하나의 단점은 text-only 라는 것
우리는 visual instruction-tuning을 제시
- Multimodal instruction-following data: GPT4를 통해 image-text pair의 적절한 intruction following data를 수집
- Lage multimodal models: visual encoder CLIP과 language decoder Vicuna를 결합한 multimodal model 개발
- Multimodal instruction following benchmark: LLaVA-Bench라는 두 benchmark set 개발
- Open-source: 이는 모두 오픈 소스로 공개됨

기존 Instruction following data는 (CC, LAION)은 양이 총체적으로 부족
- Time consuming / less well-defined
GPT의 text annotation tasks를 통해, 우리는 GPT4를 multimodal instruction-following data collector로 활용
image $X_v$ , caption $X_c$ , question $X_q$
이러한 pair는 construct하기 쉬우나, diversity와 in-depth reasoning에서 한계가 존재
이를 해결하기 위해, GPT-4와 ChatGPT를 활용
- image를 text-only GPT의 input으로 활용하기 위해 symbolic representation을 활용
- Caption과 bounding box
- 이를 통해 LLM-recognizable sequence로 encoding
- 우리는 COCO image를 통해 three type의 instruction-following data를 제작

Filtering을 통해 naive한 single-turn instruction-following data 생성
Visual Encoder와 Text Decoder가 freeze된 상태로, projection matrix를 학습
- frozen LLM이 visual token을 잘 이해하도록 하는 것

Visual Encoder는 항상 freeze인 상태로, projection layer와 LLM을 학습
두 개의 scenario를 가정하고 학습
- Multimodal Chatbot
  - conversation을 multi-turn으로, 나머지 type을 single-turn으로 학습
- Science QA
  - Science QA benchmark를 활용하여 task solving