LLaVA-NeXT

YEOM JINSEOP · October 4, 2024

Multi-modal LLMs

3/4

page
inference code

Comparison with LLaVA-1.5 (2024-01 Release)

  • higher input image resolution

    • up to 4x more pixels.
    • lets the model grasp more visual details.
    • supports 3 aspect ratios,
      with resolutions up to 672x672, 336x1344, and 1344x336.
  • improved visual instruction tuning data mixture

    • better visual reasoning and OCR capability
  • better visual conversation for more scenarios

  • efficient deployment and inference with the SGLang framework
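
The resolution numbers in the list above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes LLaVA-1.5 takes a single 336x336 input (the list does not state this), and `pick_resolution` is a hypothetical helper that matches on aspect ratio alone, which is not necessarily the selection rule the model itself uses.

```python
import math

# Assumption: LLaVA-1.5 works on a single 336x336 input (not stated in the list above).
BASE = 336

# The three maximum resolutions from the list above, as (width, height).
SUPPORTED = [(672, 672), (336, 1344), (1344, 336)]


def pixel_ratio(width, height, base=BASE):
    """Return how many times more pixels width x height has than base x base."""
    return (width * height) / (base * base)


def pick_resolution(img_w, img_h, candidates=SUPPORTED):
    """Hypothetical helper: pick the candidate whose aspect ratio is closest
    to the image's. Ratios are compared on a log scale so that 4:1 and 1:4
    are treated symmetrically."""
    target = math.log(img_w / img_h)
    return min(candidates, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


if __name__ == "__main__":
    for w, h in SUPPORTED:
        print(f"{w}x{h}: {pixel_ratio(w, h)}x the pixels of {BASE}x{BASE}")
    print(pick_resolution(3000, 1000))
```

Under that assumption, all three listed resolutions come out to the same pixel budget, four times the 336x336 baseline, which matches the "4x more pixels" bullet.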


(2024-05 Release)

  • stronger & larger language models

    • LLaMA3 (8B), Qwen-1.5 (72B, 110B)
  • collected and developed a new evaluation dataset, LLaVA-Bench (Wilder)

    • evaluates the improved multimodal capabilities needed to handle diverse applications in real-life scenarios.
  • motivation

    • The model released last January used Yi-34B, the best LLM at the time.
    • Recently, open-source LLMs with stronger language capabilities, such as LLaMA3 and the Qwen-1.5 series, have emerged in the community.
      At the same time, there is speculation that proprietary LMMs such as OpenAI GPT-V are backed by strong LLMs such as GPT-4.
    • This naturally raises the following question:
      as strong new language models narrow the gap between open-source and proprietary LLMs, does the performance gap between open-source and proprietary multimodal models also narrow when they are powered by these stronger LLMs?
