[Paper Review] BLIP-2 Thesis Review

한의진 · September 23, 2024


Abstract

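Vision-language pre-training has become very expensive because large models are trained end-to-end.

BLIP-2 achieves state-of-the-art performance on a range of vision-language tasks while using far fewer trainable parameters than existing methods.

For example, it outperforms Flamingo80B by 8.7% on zero-shot VQAv2.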

Introduction

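Vision-language pre-training (VLP) has advanced rapidly over the past few years ⇒ but its high computational cost is a commonly cited drawback.

A vision-language model can instead be bootstrapped from off-the-shelf pre-trained unimodal models (an image encoder and a language model).

To prevent catastrophic forgetting, the pre-trained unimodal models are kept frozen during pre-training.

Existing methods such as Flamingo rely on an image-to-text generation loss, which is not sufficient to bridge the modality gap.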

Querying Transformer

The Q-Former is a lightweight transformer that employs a set of learnable query vectors to extract visual features from the frozen image encoder.

Pre-Training Stage

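Stage 1: vision-language representation learning (frozen image encoder). The Q-Former learns the visual representation most relevant to the text.
Stage 2: vision-to-language generative learning (frozen LLM).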

BLIP-2 can be prompted to perform zero-shot image-to-text generation that follows natural-language instructions.

(visual conversation, etc.)

BLIP-2 outperforms Flamingo while using 54x fewer trainable parameters.

Method

Architecture

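The Q-Former extracts a fixed number of output features from the image encoder,
independent of the input image resolution.

It consists of two parts that share the same self-attention layers: an image transformer (which interacts with the frozen image encoder) and a text transformer (which can act as both text encoder and text decoder).

The queries interact with each other, and with the text, through the self-attention layers; they interact with the frozen image features through cross-attention layers.

A different self-attention mask is applied depending on the pre-training task.

(Bi-directional, Multi-modal causal, Uni-modal)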

Self Attention Task

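32 queries, each of dimension 768, so the output is 32×768 (much smaller, and therefore cheaper for the LLM to consume, than the raw features of the frozen image encoder).

The Q-Former parameters are initialized from the pre-trained weights of BERT, not trained from scratch.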
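To make the "fixed number of output features" point concrete, the short sketch below compares how many tokens a ViT image encoder emits at two resolutions with the constant 32×768 Q-Former output. The 257-token figure at 224 px with patch size 14 matches the ViT-L/14 example given in the paper; the 364 px resolution and the helper name `vit_token_count` are assumptions chosen only for illustration.

```python
# Illustrative arithmetic only; patch size and resolutions are assumptions.

def vit_token_count(image_size: int, patch_size: int) -> int:
    """Tokens emitted by a ViT: one per image patch plus one [CLS] token."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + 1


NUM_QUERIES = 32   # learnable queries in the Q-Former
QUERY_DIM = 768    # hidden size of each query

for image_size in (224, 364):
    encoder_tokens = vit_token_count(image_size, patch_size=14)
    # The Q-Former output shape does not depend on image_size.
    qformer_output_shape = (NUM_QUERIES, QUERY_DIM)
    print(image_size, encoder_tokens, qformer_output_shape)
```

The encoder token count grows with resolution (257, then 677), while the number of vectors handed to the LLM stays at 32.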

Representation

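Image-Text Contrastive Learning (ITC):

Aligns the image and text representations by contrasting the similarity of a positive pair against the similarities of negative pairs.

To avoid information leakage, a uni-modal self-attention mask is used, so the queries and the text cannot see each other.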
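A minimal sketch of the contrastive objective may help here. As far as I understand the paper (this detail is not in the notes above), the image-text similarity is the highest similarity between any single query output and the text [CLS] embedding, and the other items in the batch serve as negatives. The function names, the temperature value of 0.07, and the toy vectors are all assumptions made for illustration; this is not the authors' implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def image_text_similarity(query_outputs, text_embedding):
    """Similarity of one image to one text: the best-matching query wins."""
    return max(dot(q, text_embedding) for q in query_outputs)


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry (numerically stabilised)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def itc_loss(batch_queries, batch_texts, temperature=0.07):
    """Symmetric contrastive loss with in-batch negatives.

    batch_queries[i] holds the query vectors of image i, batch_texts[j] is
    the [CLS] vector of text j, and pair (i, i) is the positive pair.
    """
    n = len(batch_texts)
    sim = [[image_text_similarity(batch_queries[i], batch_texts[j]) / temperature
            for j in range(n)] for i in range(n)]
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Toy batch: 2 images, 2 queries each, 2-dimensional embeddings.
queries = [[[1.0, 0.0], [0.0, 0.0]],
           [[0.0, 1.0], [0.0, 0.0]]]
texts = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = itc_loss(queries, texts)
shuffled_loss = itc_loss(queries, texts[::-1])  # texts paired with wrong images
print(aligned_loss < shuffled_loss)
```

Correctly paired images and texts give a lower loss than the shuffled pairing, which is the behaviour the objective is meant to reward.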

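Image-grounded Text Generation (ITG):

No direct interaction is allowed between the frozen image encoder and the text tokens.

The information needed to generate the text must therefore first be extracted by the queries and then passed to the text tokens through the self-attention layers.

To signal the decoding task, a [DEC] token is used in place of the [CLS] token.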

Image-Text Matching (ITM):

Uses a bi-directional self-attention mask.

All queries and text tokens can attend to each other.
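The three masking strategies described for ITC, ITG, and ITM can be sketched as boolean matrices over a sequence of query positions followed by text positions. The function name `build_mask` and the toy sizes are hypothetical; the rules encoded here follow the descriptions above (uni-modal for ITC, multi-modal causal for ITG, bi-directional for ITM).

```python
def build_mask(num_queries: int, num_text: int, objective: str):
    """Return mask[i][j] = True when position i may attend to position j.

    Positions 0..num_queries-1 are queries; the remaining ones are text tokens.
    """
    total = num_queries + num_text
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            i_is_query, j_is_query = i < num_queries, j < num_queries
            if objective == "itm":      # bi-directional: everything is visible
                allowed = True
            elif objective == "itc":    # uni-modal: stay within own modality
                allowed = i_is_query == j_is_query
            elif objective == "itg":    # multi-modal causal
                if i_is_query:
                    allowed = j_is_query            # queries never see text
                else:
                    allowed = j_is_query or j <= i  # queries + earlier text
            else:
                raise ValueError(f"unknown objective: {objective}")
            mask[i][j] = allowed
    return mask


# Print each mask for 2 queries and 3 text tokens (1 = may attend).
for name in ("itc", "itg", "itm"):
    print(name)
    for row in build_mask(2, 3, name):
        print(" ".join("1" if cell else "0" for cell in row))
```

Reading the ITG rows shows why text generation has to rely on the queries: text rows can see the query columns, but query rows never see the text columns.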

Generation

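A fully connected (FC) layer linearly projects the output query embeddings to the same dimension as the LLM's text embeddings.

The projected query embeddings are then prepended to the LLM's input text embeddings.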
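The project-then-prepend step can be sketched with toy sizes. All dimensions, the random weights, and the helper name `linear` below are assumptions for illustration; in the actual model the query outputs are 32×768 and the target width is whatever the chosen LLM uses.

```python
import random


def linear(vectors, weight, bias):
    """Apply y = W x + b to each vector; weight has shape (out_dim, in_dim)."""
    return [[sum(w * x for w, x in zip(row, vec)) + b
             for row, b in zip(weight, bias)] for vec in vectors]


random.seed(0)
NUM_QUERIES, QFORMER_DIM, LLM_DIM, NUM_TEXT = 4, 6, 8, 5  # toy sizes

query_outputs = [[random.gauss(0, 1) for _ in range(QFORMER_DIM)]
                 for _ in range(NUM_QUERIES)]
weight = [[random.gauss(0, 0.1) for _ in range(QFORMER_DIM)]
          for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM
text_embeddings = [[random.gauss(0, 1) for _ in range(LLM_DIM)]
                   for _ in range(NUM_TEXT)]

# Project queries to the LLM width, then place them before the text.
visual_prompt = linear(query_outputs, weight, bias)  # NUM_QUERIES x LLM_DIM
llm_input = visual_prompt + text_embeddings          # visual tokens come first

print(len(llm_input), len(llm_input[0]))
```

The LLM then sees a sequence of NUM_QUERIES + NUM_TEXT vectors of equal width, with the visual part acting as a prefix to the text.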

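Decoder-based LLM:

Trained with the language modeling loss; the frozen LLM generates the text conditioned on the visual representation.

Encoder-decoder-based LLM:

Trained with the prefix language modeling loss.

The text is split into two parts: the prefix text is concatenated with the visual representation as input to the LLM encoder, and the suffix text is the generation target for the LLM decoder.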
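The difference between the two setups is mostly about where the text goes. The sketch below uses plain token lists to show it; the function names, the placeholder visual tokens, and the split position are hypothetical, and restricting the loss to text positions in the decoder-only case is my assumption about a typical setup rather than something stated in these notes.

```python
def decoder_only_example(visual_tokens, text_tokens):
    """Decoder-based LLM: condition on visual tokens, predict the text."""
    inputs = visual_tokens + text_tokens
    # Assumed: loss is taken on text positions only, not on visual positions.
    loss_positions = [False] * len(visual_tokens) + [True] * len(text_tokens)
    return inputs, loss_positions


def encoder_decoder_example(visual_tokens, text_tokens, split_at):
    """Encoder-decoder LLM: prefix goes to the encoder, suffix is the target."""
    prefix, suffix = text_tokens[:split_at], text_tokens[split_at:]
    encoder_input = visual_tokens + prefix   # what the LLM encoder reads
    decoder_target = suffix                  # what the LLM decoder must generate
    return encoder_input, decoder_target


visual = ["<v0>", "<v1>"]                    # stand-ins for projected queries
text = "a cat sitting on a sofa".split()

inputs, loss_positions = decoder_only_example(visual, text)
encoder_input, decoder_target = encoder_decoder_example(visual, text, split_at=2)

print(encoder_input)
print(decoder_target)
```

In the encoder-decoder case the encoder reads the image plus the start of the sentence, and the decoder is trained to continue it, which is why the suffix is the generation target.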

Summary

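To summarize the idea of this paper briefly: to address the high computational cost of existing models, it proposes the Q-Former (Querying Transformer). The Q-Former first learns the visual representation most relevant to the text, and this representation is then trained generatively with a frozen LLM. In the representation stage, three objectives (ITC, ITG, ITM) are used, each with tokens and an attention mask suited to its learning goal. The resulting query outputs are linearly projected through a fully connected layer and prepended to the text embeddings.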

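For the encoder-decoder-based LLM, the paper states that the text is split into a prefix and a suffix for training: the prefix is fed to the LLM encoder together with the visual representation, and the suffix becomes the generation target for the LLM decoder. I had some difficulty understanding the principle or rationale behind this.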
