paper code
architecture
LLaVA bridges image and language representations with a single linear layer, whereas Flamingo uses gated cross-attention and BLIP-2 uses a Q-Former.
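A minimal sketch of the LLaVA-style connector: one learned linear projection mapping visual patch tokens into the LLM embedding space. The dimensions (1024-d CLIP ViT-L/14 features, 4096-d LLaMA-7B embeddings) and the NumPy formulation are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Assumed dims: vision encoder output 1024-d, LLM embedding 4096-d
VISION_DIM, LLM_DIM = 1024, 4096
rng = np.random.default_rng(0)

# A single learned projection W, b -- LLaVA v1's entire
# vision-language connector is one linear layer like this.
W = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

def connect(patch_features: np.ndarray) -> np.ndarray:
    """Map visual patch tokens (num_patches, VISION_DIM)
    into the LLM embedding space (num_patches, LLM_DIM)."""
    return patch_features @ W + b

patches = rng.standard_normal((256, VISION_DIM))  # e.g. 16x16 ViT patch grid
tokens = connect(patches)
print(tokens.shape)  # (256, 4096)
```

The projected tokens are then prepended (or interleaved) with the text embeddings and fed to the language model as ordinary sequence positions.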
Training
Benchmark: ScienceQA (multimodal reasoning dataset)
Example
The input sequence used for model training
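A hedged sketch of how a LLaVA-style training sequence might be assembled. The exact system prompt and separator strings here are assumptions for illustration, not the paper's verbatim template; the key ideas are the `<image>` placeholder (replaced at runtime by the projected patch embeddings) and the human/assistant turn structure.

```python
# Placeholder token whose position is filled with projected
# visual tokens before the sequence reaches the LLM.
IMAGE_TOKEN = "<image>"

def build_sequence(question: str, answer: str) -> str:
    """Assemble one instruction-tuning example (assumed format)."""
    system = ("A chat between a curious human and an "
              "artificial intelligence assistant.")
    return (f"{system} USER: {IMAGE_TOKEN}\n{question} "
            f"ASSISTANT: {answer}")

seq = build_sequence("What is shown in the image?",
                     "A dog playing in a park.")
print(seq)
```

During instruction tuning, the language-modeling loss is computed only on the assistant's answer tokens; the image tokens and the human turns are masked out.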