Multimodal LLM Study

alina · December 4, 2023

An overall summary of Multimodal LLMs


Contents

  • GIT
  • Evaluation

One-Line Paper Summaries

Instruction Tuning

  • DOLPHINS: MULTIMODAL LANGUAGE MODEL FOR DRIVING | arxiv 2312
    University of Wisconsin-Madison, NVIDIA, University of Michigan, Stanford University
    → Instruction-tunes the OpenFlamingo model on driving-specific data after a Grounded Chain-of-Thought stage.

  • LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arxiv 2311
    → At each time t, a text-guided context token and a content token carrying the detailed visual information are combined and passed through a linear projector to form the tokens for time t, which are then fed into the LLM (see the sketch after this list).

  • LLAVA-PLUS: LEARNING TO USE TOOLS FOR CREATING MULTIMODAL AGENTS | arxiv 2311 | git
    Work performed during an internship at Microsoft
    → Trained on multimodal instruction-following data for tool use.

  • mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arxiv 2311 | git

  • ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arxiv 2311 | git
    → SFT with the ShareGPT4V dataset, built by expanding 100K high-quality captions obtained from GPT4-V to 1.2M, works well.
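
Below is a minimal PyTorch sketch of the LLaMA-VID idea summarized above (two tokens per frame). The dimensions, the single attention step used for the text-guided context token, and the mean pooling used for the content token are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameTokenizer(nn.Module):
    """Produce 2 LLM tokens (context + content) for the frame at time t."""

    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)  # the linear projector

    def forward(self, frame_feats, text_query):
        # frame_feats: (num_patches, vis_dim) visual features of the frame at time t
        # text_query:  (1, vis_dim) embedding of the user instruction

        # Context token: text-guided aggregation of the frame's visual features.
        attn = F.softmax(text_query @ frame_feats.T / frame_feats.shape[-1] ** 0.5, dim=-1)
        context_tok = attn @ frame_feats                     # (1, vis_dim)

        # Content token: keeps the overall visual content of the frame.
        content_tok = frame_feats.mean(dim=0, keepdim=True)  # (1, vis_dim)

        # Both tokens go through the linear projector into the LLM embedding space.
        return self.proj(torch.cat([context_tok, content_tok], dim=0))  # (2, llm_dim)
```

The per-frame token pairs are concatenated across time and fed to the LLM together with the text tokens, so each frame costs only two tokens.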


Multimodal Chain-of-Thought

  • DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | arxiv 2310 | proj
    → Runs CoT to generate general multimodal rationales (a prompting sketch follows below).
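
A minimal prompting sketch of this duty-distinct idea follows: the LLM decomposes the question, keeps the reasoning-only parts for itself, and hands the visual parts to a vision model before producing the final rationale. The prompt wording and the `llm` / `vqa_model` callables are hypothetical placeholders, not the paper's actual prompts or interfaces.

```python
# Hypothetical llm(text) -> str and vqa_model(image, question) -> str callables.
DECOMPOSE_PROMPT = """Question: {question}
Break the question into sub-questions. Answer each sub-question if it can be
solved by reasoning alone; otherwise reply "Uncertain" because it requires
looking at the image."""

def ddcot_rationale(question, image, llm, vqa_model):
    # Step 1: the LLM decomposes the question and flags the sub-questions
    # that need visual information ("Uncertain").
    plan = llm(DECOMPOSE_PROMPT.format(question=question))

    # Step 2: a vision model answers only the flagged sub-questions.
    visual_answers = [
        vqa_model(image, line) for line in plan.splitlines() if "Uncertain" in line
    ]

    # Step 3: the LLM merges its own reasoning with the visual answers into a
    # final multimodal rationale and answer.
    return llm(
        f"Question: {question}\nReasoning so far: {plan}\n"
        f"Visual information: {visual_answers}\n"
        "Give the final rationale and answer."
    )
```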


LLM-Aided Visual Reasoning

  • LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing | arxiv 2311 | git | Microsoft
    → A research prototype for multimodal human-AI interaction, combining LLaVA for visual chat, SEEM for image segmentation, and GLIGEN for image generation/editing (see the toy sketch at the end of this section).

  • Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

  • Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arxiv 2309
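
As noted in the LLaVA-Interactive entry above, here is a toy sketch of how the three components could be wired behind a single interactive demo. All class names and call signatures are illustrative placeholders; the actual demo's UI and state management are omitted.

```python
class LLaVAInteractiveDemo:
    """Toy routing sketch: one shared image, three tools behind one interface.
    All names and call signatures are illustrative placeholders."""

    def __init__(self, llava, seem, gligen):
        self.llava = llava     # visual chat
        self.seem = seem       # interactive segmentation
        self.gligen = gligen   # grounded image generation / editing
        self.image = None      # image state shared across turns

    def set_image(self, image):
        self.image = image

    def chat(self, prompt):
        # Visual chat about the current image.
        return self.llava(self.image, prompt)

    def segment(self, points_or_text):
        # Segment regions of the current image from clicks or a referring phrase.
        return self.seem(self.image, points_or_text)

    def generate_or_edit(self, boxes, prompt):
        # Generate or edit the image, grounded on user-provided boxes; the
        # result becomes the new shared image for subsequent turns.
        self.image = self.gligen(self.image, boxes, prompt)
        return self.image
```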
