💡Introduction to Multi Modal

oceann · August 28, 2024

What is Multi Modal?

Modality means "mode" or "aspect": the form in which a phenomenon appears, or the way it is perceived.
Before AI, an interface that reduced what the user sees and how the user gives input to a single channel was called Uni Modality, and one that used several channels (mouse and keyboard, screen and voice) was called Multi Modality.
Since the arrival of AI, Multi Modal AI refers to systems that learn and reason from several kinds of data, such as visual and auditory, much as a person perceives a phenomenon.

Background
Most readers will have heard of High-Level and Low-Level languages. Even with a High-Level language, a person still has to learn how to talk to the computer, and what they write must then be translated into a form the computer understands. NLP emerged to make computers easier to use, that is, to have the computer understand human language instead. As a result, computers became able to handle basic propositions and inference.
But people do not acquire knowledge and communicate through natural language alone. They use many senses: sight, hearing, smell, touch, taste. Multi Modal emerged so that computers could understand the way humans actually communicate. For example, Vision tasks imitate human sight, signal and frequency analysis imitates hearing, and NLP imitates natural language.

DALL-E
According to OpenAI, text alone cannot tell a model what a piece of information actually looks like, which is why it took a Multi Modal approach in developing DALL-E.
Given a sentence typed by a person, DALL-E outputs a drawing or photo, which visualizes how the model understands the content of the text.

Example

Beyond DALL-E
GPT-4V: image analysis, description in text, adding visual pointers to a given image, video frame analysis, and more
LG's EXAONE: bidirectional, rendering text as an image and describing an image in text (GPT can now do this too)
Stable Diffusion: a text-based image generation and editing model

Naver's Smart Lens

Sources
Samsung SDS Insight Report
Photo: courtesy of Naver


Multi Modal Dataset - AI Hub

1. Multi-modal data for improving in-vehicle interfaces
Video data for recognizing the situation of vehicle occupants, intended for developing and advancing autonomous driving and infotainment AI services.

Data structure
Videos of gestures used to operate the in-vehicle interface
Frame images extracted from the videos
Audio extracted from the videos

A quick note on infotainment: it is a portmanteau of info and entertainment. It adds an element of fun to the process of gathering information and presenting it to the user.
Personally, I think most of what we strictly need has already been built, so enjoyable things will take the lead from here on, which is why infotainment interests me as well.

2. Multi-modal data for mental health diagnosis and prediction
A dataset built by collecting clinical medical data, sleep data, voice, and lifelog data from a patient group and a healthy control group, so that current AI training techniques can be applied to it.
A mental illness can be the mind or the head being unwell for a while, just as the body aches with a cold, yet social perception of it is still negative. It depends on the severity and type, but I had been thinking about an AI counselor that anyone could use easily for cases where a fast, early diagnosis and treatment are needed, so I am simply grateful that a dataset like this now exists!

Multi Modal in Practice

SKT AI Fellowship - Mental Health Diagnosis

This is how I first learned about Multi Modal. During the semester (first semester of 2024) I went looking for an extracurricular activity worth trying. The program also recruits master's and PhD students and it ran during the semester, so I could not apply, but looking into it is what led to this post.
Because the mental health diagnosis project targets real counseling, it combines speech signals with NLP rather than the Vision and NLP combination described earlier.

Reference links
Fellowship 6th cohort research topics
Developing a Multi-modal Emotion Recognition AI Model - Research Process (2)

A brief summary follows. For details, follow the links above!

Research goal
To present stress (depression) or emotional state (joy, sadness, anger, disgust, anxiety, etc.) as probabilities, computed from voice and language (biosignal) values

Example research process
Data: emotion recognition as a three-class problem (Negative, Neutral, Positive) using AI Hub's Multi Modal (video, audio, text) dataset
NLP: preprocessing with a Rule-Based Approach, followed by Labeling
Feature Extraction: audio data → FFT, then intonation analysis through frequency analysis
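
As a rough illustration of the Feature Extraction step, the sketch below applies an FFT to short frames of an audio signal and tracks the strongest frequency over time as a crude intonation contour. The function names, the frame sizes, and the synthetic rising tone are all assumptions made for this example, not the project's actual pipeline. Real speech usually needs a proper pitch tracker, because the strongest spectral peak is not always the fundamental.

```python
import numpy as np

def dominant_frequency(frame, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC component of one frame."""
    windowed = frame * np.hanning(len(frame))      # taper the edges to reduce leakage
    spectrum = np.abs(np.fft.rfft(windowed))       # magnitude spectrum of a real signal
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    spectrum[0] = 0.0                              # ignore the DC bin
    return float(freqs[np.argmax(spectrum)])

def intonation_contour(signal, sample_rate, frame_len=2048, hop=1024):
    """Slide a window over the signal and collect the dominant frequency per frame."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([dominant_frequency(signal[s:s + frame_len], sample_rate)
                     for s in starts])

# Hypothetical input: one second of a tone rising from 150 Hz to 250 Hz,
# standing in for a voice whose pitch goes up toward the end of an utterance.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
instantaneous_freq = 150.0 + 100.0 * t
voice = np.sin(2.0 * np.pi * np.cumsum(instantaneous_freq) / sample_rate)

contour = intonation_contour(voice, sample_rate)
rising = bool(contour[-1] > contour[0])            # simple "rising intonation" flag
print(contour.round(1), rising)
```

A contour like this could then be summarized (slope, range, variance) and fed to a classifier next to the text features, but that step is outside what the summary above describes.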

mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

Paper link
GitHub link
Review link
It has been cited a whopping 143 times...

citation
Ye, Qinghao, et al. "mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Existing ways to build an MLLM
1. Aligning features extracted from Vision data to a pre-trained LLM. This is simple, but it only reshapes the image feature vectors so they can be slotted into the LLM, so collaboration between the different modalities is limited.
2. Placing the Vision Model's output, a text embedding, on top of the pre-trained LLM's final linear layer. This is also simple, but because it uses only the features extracted from the image or the text embedding, it can miss characteristics unique to the image data.
3. Applying a Fine-Tuning technique such as Instruction Tuning. This may improve feature extraction, which matters for Multi Modal tasks, but it risks degrading text generation performance, so it is avoided.
4. Freezing the Vision Model, combining it with the LLM, and then fine-tuning. This limits high-level feature extraction on complex images (relationships between objects, for example).

Proposed model architecture

Vision Encoder: extracts features from the image data
Visual Abstractor: compresses the image feature vectors by removing unnecessary information such as background, noise, and near-duplicate patches
Text Embedding: combined with the text for the image produced by the Vision Encoder, that is, the label
Language Decoder: everything is fed together into an LLM such as GPT or LLaMA
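
To make the data flow between these four parts concrete, here is a minimal shape-level sketch in NumPy. Everything in it is a stand-in: the encoder is a single random patch projection, the abstractor is assumed to compress patches by letting a small set of query vectors attend over them, and the sizes (64x64 image, 8 queries, 32 dimensions, vocabulary of 100) are hypothetical. It only shows how many tokens each stage hands to the next, not how the real model is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def vision_encoder(image, patch=16, d_model=32):
    """Stand-in encoder: cut the image into patches and project each one linearly."""
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    w_proj = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
    return patches @ w_proj                        # (num_patches, d_model)

def visual_abstractor(patch_feats, num_queries=8):
    """Assumed compression step: a few query vectors attend over all patches."""
    d = patch_feats.shape[1]
    queries = rng.normal(size=(num_queries, d))    # would be learned in a real model
    attn = softmax(queries @ patch_feats.T / np.sqrt(d))
    return attn @ patch_feats                      # (num_queries, d_model)

def text_embedding(token_ids, vocab_size=100, d_model=32):
    """Stand-in embedding table lookup for the text tokens."""
    table = rng.normal(scale=0.02, size=(vocab_size, d_model))
    return table[token_ids]                        # (num_tokens, d_model)

image = rng.random((64, 64, 3))                    # hypothetical input image
patch_feats = vision_encoder(image)                # 16 patch features
visual_tokens = visual_abstractor(patch_feats)     # compressed to 8 visual tokens
text_tokens = text_embedding(np.array([5, 17, 42]))

# The Language Decoder receives visual and text tokens as one sequence,
# plus a mask recording which modality each position came from (0 = image, 1 = text).
decoder_input = np.concatenate([visual_tokens, text_tokens], axis=0)
modality_mask = np.array([0] * len(visual_tokens) + [1] * len(text_tokens))
print(patch_feats.shape, visual_tokens.shape, decoder_input.shape)
```

The modality mask is the piece that lets later layers treat image and text positions differently even though they share one sequence.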

Modality-Adaptive Module
Inside the module, Positional Encoding with Sinusoidal Encoding is applied and then Self-Attention is computed. In other words, it works like an LLM Decoder.
A linear operation followed by Layer Normalization is applied separately to the image data and the text data.
The separately computed results for the two modalities are summed to produce the Query, Key, and Value, and the Attention Score is computed.
The Softmax and FFNN steps are the same as in the Transformer.
Both modalities are projected into the same feature space, but because they are processed separately they do not interfere with each other and each keeps its own characteristics.
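
The steps above can be sketched in NumPy as follows. This follows the order of operations as summarized in this post (per-modality linear map, per-modality Layer Normalization, sum, then shared Query/Key/Value and causal attention), so treat it as an assumption-laden illustration; the paper's exact equations, and which projections are shared, may differ. Positional encoding and the FFNN are left out, and all parameter names and sizes are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def modality_adaptive_attention(hidden, modality, p):
    """hidden: (T, d) token states; modality: (T,) with 0 = image, 1 = text."""
    img_mask = (modality == 0).astype(hidden.dtype)[:, None]
    # Each modality gets its own linear map and its own normalization ...
    img_branch = layer_norm(hidden @ p["w_img"])
    txt_branch = layer_norm(hidden @ p["w_txt"])
    # ... and the masked branches are summed back into one sequence.
    merged = img_mask * img_branch + (1.0 - img_mask) * txt_branch
    # From here on it is ordinary decoder-style (causal) self-attention.
    q, k, v = merged @ p["w_q"], merged @ p["w_k"], merged @ p["w_v"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    weights = softmax(np.where(future, -np.inf, scores))
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 16
params = {name: rng.normal(scale=0.1, size=(d, d))
          for name in ("w_img", "w_txt", "w_q", "w_k", "w_v")}
modality = np.array([0, 0, 0, 0, 1, 1, 1])         # 4 image tokens, then 3 text tokens
hidden = rng.normal(size=(len(modality), d))
output, weights = modality_adaptive_attention(hidden, modality, params)
print(output.shape)
```

In this sketch, changing only the text-side weights leaves the outputs at the image positions untouched, which is one way to read the claim that the two modalities keep their own characteristics while still sharing one attention computation.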

Training method of the proposed model
Pre-Training
The pre-trained Language Decoder is frozen, and the Vision Encoder, Visual Abstractor, and Text Embedding parts are trained.
Because the LLM side is frozen, this stage can be seen as the vision side adapting to the Language Model.
A pre-trained Vision Encoder may be used for the vision part here.
Instruction Tuning
The Language Decoder is made trainable as well, and the whole model is fine-tuned with Instruction Tuning.
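
The two stages differ only in which parts receive gradient updates, which can be written down as a small schedule. The module names and the helper function below are hypothetical labels for the components listed in this post, not identifiers from the official code.

```python
# Components as named in this post (hypothetical identifiers).
MODULES = ("vision_encoder", "visual_abstractor", "text_embedding", "language_decoder")

def trainable_modules(stage):
    """Return {module_name: is_trainable} for the given training stage."""
    if stage == "pretraining":
        frozen = {"language_decoder"}      # the LLM stays fixed; the vision side adapts to it
    elif stage == "instruction_tuning":
        frozen = set()                     # everything is updated together
    else:
        raise ValueError(f"unknown stage: {stage}")
    return {name: name not in frozen for name in MODULES}

pretrain = trainable_modules("pretraining")
finetune = trainable_modules("instruction_tuning")
for name in MODULES:
    print(f"{name:18s} pretraining={pretrain[name]!s:5s} instruction_tuning={finetune[name]}")
```

In a framework such as PyTorch, a table like this would typically be applied by setting requires_grad on each submodule's parameters before building the optimizer for that stage.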

Results
The model achieved SOTA on a range of vision-language tasks, including description and question answering.
I haven't read the experimental setup or the analysis of the results yet :(

And more…
