Overview Notes on Multimodal LLMs
Contents
https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2
Figure from the LLaMA-VID paper (benchmark radar chart): most MLLM papers seem to report this kind of polygon ("N-gon") radar-chart evaluation.
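For reference, this radar-style plot is easy to reproduce with matplotlib; the benchmark names and scores below are dummy placeholders, not numbers from LLaMA-VID or any other paper.

```python
import numpy as np
import matplotlib.pyplot as plt

# Dummy benchmark names/scores, only to illustrate the radar ("N-gon") plot
# style; these are NOT results from any paper.
benchmarks = ["MME", "MMBench", "SEED", "GQA", "VQAv2", "POPE"]
scores = [0.72, 0.65, 0.60, 0.58, 0.76, 0.83]

angles = np.linspace(0, 2 * np.pi, len(benchmarks), endpoint=False).tolist()
angles += angles[:1]              # repeat the first angle to close the polygon
scores_closed = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, scores_closed, linewidth=2)
ax.fill(angles, scores_closed, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(benchmarks)
ax.set_ylim(0, 1)
plt.show()
```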
DOLPHINS: MULTIMODAL LANGUAGE MODEL FOR DRIVING | arxiv 2312
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arxiv 2311
LLAVA-PLUS: LEARNING TO USE TOOLS FOR CREATING MULTIMODAL AGENTS | arxiv 2311 | git
→ Trained on multimodal instruction-following data (a generic sample format is sketched below).
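For orientation, a LLaVA-style instruction-following sample pairs an image with a multi-turn conversation. The record below is a generic sketch; the field names follow the common LLaVA convention but are an assumption here, not the exact LLaVA-Plus schema (which additionally covers tool-use turns).

```python
# Illustrative multimodal instruction-following record in the LLaVA style.
# Field names ("image", "conversations", "from", "value") are assumptions,
# not the exact schema used by LLaVA-Plus.
sample = {
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat objects are on the table?"},
        {"from": "gpt", "value": "A laptop, a coffee mug, and a notebook."},
    ],
}
```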
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arxiv 2311 | git
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arxiv 2311 | git
→ SFT on ShareGPT4V data, built by expanding 100K high-quality GPT-4V captions to 1.2M samples, works well (expansion idea sketched below).
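A minimal sketch of the expansion idea as I read it: caption a small seed set with GPT-4V, train a captioner on those seed captions, then use it to caption a much larger image pool. The function below is a hypothetical outline (names and signatures assumed), not the ShareGPT4V pipeline code.

```python
from typing import Callable, Iterable, List, Tuple

def build_caption_sft_data(
    seed_images: List[str],
    unlabeled_images: Iterable[str],
    query_gpt4v: Callable[[str], str],        # image path -> caption (seed stage)
    train_captioner: Callable[[List[str], List[str]], Callable[[str], str]],
) -> List[Tuple[str, str]]:
    """Sketch: caption ~100K seed images with GPT-4V, train a captioner on
    them, then caption a much larger pool to reach ~1.2M SFT samples."""
    seed_captions = [query_gpt4v(img) for img in seed_images]
    captioner = train_captioner(seed_images, seed_captions)
    expanded = [(img, captioner(img)) for img in unlabeled_images]
    return list(zip(seed_images, seed_captions)) + expanded
```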
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models | arxiv 2310 | proj
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing | arxiv 2311 | git | Microsoft
→ A research prototype for multimodal human-AI interaction, composed of LLaVA for visual chat, SEEM for image segmentation, and GLIGEN for image generation/editing (routing sketched below).
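Roughly, such a demo needs a dispatcher that sends each user action to the right backend model. The sketch below is an assumed router with toy handlers, not LLaVA-Interactive's actual implementation.

```python
from typing import Callable, Dict

def route_request(task: str, handlers: Dict[str, Callable[..., object]], **kwargs):
    """Dispatch a user action to the backend registered for `task`
    (e.g. LLaVA for chat, SEEM for segmentation, GLIGEN for editing)."""
    if task not in handlers:
        raise ValueError(f"unsupported task: {task}")
    return handlers[task](**kwargs)

# Toy handlers standing in for the real models:
handlers = {
    "chat": lambda image, prompt: f"[LLaVA answer about {image}: {prompt}]",
    "segment": lambda image, points: f"[SEEM mask for {image} at {points}]",
    "edit": lambda image, instruction: f"[GLIGEN edit of {image}: {instruction}]",
}
print(route_request("chat", handlers, image="dog.jpg", prompt="What breed is this?"))
```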
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning | arxiv 2309