🧠 Vision-Language-Action (VLA)

김민준 · July 10, 2025

1. VLA: Definition

A VLA model integrates three modalities (vision, language, and action) into a single embodied-intelligence system that executes physical actions directly from visual and language input (arXiv).


2. Main Architecture and Components

2.1 Vision Encoder

  • Converts images or video frames into high-dimensional representations
  • Uses VLM vision backbones based on ViT, CLIP, or ResNet

2.2 Language Encoder

2.3 High-Level Policy

  • Feeds the VLM the visual and language input together with a **demonstration tour video (long context)** and predicts the goal frame
  • Plays the central role in the MINT (Multimodal Instruction Navigation with demonstration Tours) task (Moonlight)
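One way to picture the goal-frame step: the tour frames are listed with their indices, the instruction is appended, and the VLM is asked for the index of the frame closest to the goal. The sketch below is only an illustration under assumptions. `build_goal_prompt`, `predict_goal_frame`, the prompt wording, and the `vlm` callable are all hypothetical names rather than the Mobility VLA API, and the `<image i>` placeholders stand in for real image attachments.

```python
import re
from typing import Callable


def build_goal_prompt(num_frames: int, instruction: str) -> str:
    """List tour frames with their indices, then append the user instruction.

    The wording is a hypothetical prompt, not the one used by any real system.
    """
    lines = ["This is the demonstration tour of the environment:"]
    for i in range(num_frames):
        # A real pipeline would attach the i-th image here instead of a placeholder.
        lines.append(f"Frame {i}: <image {i}>")
    lines.append(f"User instruction: {instruction}")
    lines.append("Answer with the index of the tour frame closest to the goal.")
    return "\n".join(lines)


def predict_goal_frame(vlm: Callable[[str], str], num_frames: int, instruction: str) -> int:
    """Ask the (hypothetical) VLM for a goal frame and validate its answer."""
    reply = vlm(build_goal_prompt(num_frames, instruction))
    match = re.search(r"\d+", reply)
    if match is None:
        raise ValueError(f"no frame index found in reply: {reply!r}")
    index = int(match.group())
    if not 0 <= index < num_frames:
        raise ValueError(f"frame index {index} is outside the tour")
    return index


def fake_vlm(prompt: str) -> str:
    """Stand-in for a long-context VLM so the sketch runs offline."""
    return "Frame 3"


goal = predict_goal_frame(fake_vlm, num_frames=5, instruction="Where should I return this?")
print(goal)  # 3
```

Validating the reply before handing it to the low-level policy keeps a malformed or out-of-range answer from turning into a navigation goal.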

2.4 Low-Level Policy

  • Uses a topological graph generated automatically with COLMAP
  • Plans the shortest path with Dijkstra's algorithm and executes waypoint (wp) actions of the form Δx, Δy, Δθ (Moonlight)
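The two bullets above can be sketched in a few lines: a shortest-path search over the topological graph, followed by converting the next vertex on the path into a (Δx, Δy, Δθ) waypoint in the robot's own frame. This is a minimal sketch under assumptions. The graph layout, the `Pose` tuple, and the function names `dijkstra` and `waypoint_action` are illustrative choices, and vertex poses are assumed to be already available (for example, from an offline reconstruction).

```python
import heapq
import math
from typing import Dict, List, Tuple

Pose = Tuple[float, float, float]  # (x, y, theta) in the world frame, theta in radians


def dijkstra(graph: Dict[int, Dict[int, float]], start: int, goal: int) -> List[int]:
    """Return the lowest-cost vertex sequence from start to goal."""
    dist: Dict[int, float] = {start: 0.0}
    prev: Dict[int, int] = {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        raise ValueError("goal is unreachable from start")
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def waypoint_action(current: Pose, target: Pose) -> Tuple[float, float, float]:
    """Express the target pose in the robot frame as (dx, dy, dtheta)."""
    cx, cy, ct = current
    tx, ty, tt = target
    wx, wy = tx - cx, ty - cy
    # Rotate the world-frame offset by -theta to get robot-frame coordinates.
    dx = math.cos(ct) * wx + math.sin(ct) * wy
    dy = -math.sin(ct) * wx + math.cos(ct) * wy
    # Wrap the heading difference into (-pi, pi].
    dtheta = math.atan2(math.sin(tt - ct), math.cos(tt - ct))
    return dx, dy, dtheta


# Hypothetical four-vertex graph; edge weights are traversal costs.
poses: Dict[int, Pose] = {
    0: (0.0, 0.0, 0.0),
    1: (1.0, 0.0, 0.0),
    2: (1.0, 1.0, math.pi / 2),
    3: (0.0, 1.0, math.pi / 2),
}
graph = {
    0: {1: 1.0, 3: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 2.5},
    3: {0: 1.0, 2: 2.5},
}

path = dijkstra(graph, start=0, goal=2)
print(path)  # [0, 1, 2]
print(waypoint_action(poses[path[0]], poses[path[1]]))  # (1.0, 0.0, 0.0)
```

In a running system the robot would re-localize to its nearest vertex at every step and recompute the next waypoint, so a single bad step does not derail the whole route.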

2.5 End-to-End Pipeline

[Multimodal input (text · image · speech) + demonstration video]
→ High-level policy (VLM processing) → goal-frame identification
→ Low-level policy (path generation over the topological graph)
→ Physical movement/manipulation

3. Representative Models and Recent Research Trends

🧭 Mobility VLA (Google DeepMind)

📘 RT-2 (Google DeepMind)

  • A representative realization of VLA: vision and language inputs map directly to robot actions (Wikipedia)

🛠️ OpenVLA (Stanford et al.)

  • Open-source VLA with 7B parameters, trained on 970k robot demonstrations
  • 16.5% higher absolute task success rate than RT-2-X; supports different robot embodiments (arXiv)

⚡ TinyVLA

  • Lightweight VLA model with fast inference and good data efficiency
  • Pairs the model with a diffusion-policy decoder, extending to real-world control (arXiv)

🧠 OTTER

  • Selects only the visual features that match the text instruction, so the VLM can stay frozen while instructions are executed directly
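To make "selecting visual features that match the text" concrete, here is a toy sketch in which each image patch feature is weighted by its similarity to the instruction embedding and the weighted features are pooled. This is an assumption-laden stand-in, not OTTER's actual implementation: cosine-similarity softmax pooling, the `temperature` value, and the names `cosine` and `select_text_aware_features` are all chosen here for illustration, and the vectors are made-up numbers rather than outputs of a real frozen encoder.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def select_text_aware_features(
    patches: List[Vector], text: Vector, temperature: float = 0.1
) -> Tuple[Vector, List[float]]:
    """Pool patch features with softmax weights given by text-patch similarity."""
    sims = [cosine(p, text) for p in patches]
    top = max(sims)
    # Subtract the max before exponentiating for numerical stability.
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * p[i] for w, p in zip(weights, patches)) for i in range(len(text))
    ]
    return pooled, weights


# Made-up 2-D features: the first patch aligns with the instruction embedding.
patches = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text = [1.0, 0.0]
pooled, weights = select_text_aware_features(patches, text)
print(weights)  # the first patch receives most of the weight
print(pooled)
```

Because only the pooling weights depend on the instruction, the encoders that produce `patches` and `text` need no gradient updates in this sketch, which mirrors the frozen-VLM idea in the bullet above.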

🤖 Helix (Figure AI)

  • Presented as the first VLA to support humanoid upper-body and finger control as well as multi-robot collaboration

4. Technical Challenges and Mitigation Strategies

  1. Multimodal alignment and long context

    • Fusing long demonstration videos with a VLM enables reasoning over complex natural-language instructions (더밀크, Moonlight)
  2. Hierarchical policy design

    • Splitting roles (high level: goal identification; low level: path control) improves both efficiency and accuracy
  3. Real-world deployment (Sim2Real)

    • Topological graphs built with COLMAP keep path planning robust to changes in the environment
  4. Control frequency and action representation
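
On the action-representation side of the last item, one widely used option in RT-2-style models is to discretize every continuous action dimension into a fixed number of bins so that an action can be emitted as a short sequence of tokens. The sketch below assumes 256 uniform bins and hypothetical per-dimension bounds; `discretize`, `undiscretize`, and the bound values are illustrative and not taken from any specific codebase.

```python
from typing import List, Sequence

N_BINS = 256  # assumed bin count per action dimension


def discretize(
    action: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
    n_bins: int = N_BINS,
) -> List[int]:
    """Map each continuous action dimension to an integer bin in [0, n_bins - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        a = min(max(a, lo), hi)  # clip to the assumed bounds
        frac = (a - lo) / (hi - lo)
        tokens.append(min(int(frac * n_bins), n_bins - 1))
    return tokens


def undiscretize(
    tokens: Sequence[int],
    low: Sequence[float],
    high: Sequence[float],
    n_bins: int = N_BINS,
) -> List[float]:
    """Map each bin index back to the center of its bin."""
    return [lo + (t + 0.5) / n_bins * (hi - lo) for t, lo, hi in zip(tokens, low, high)]


# Hypothetical bounds for a (dx, dy, dtheta) action.
low = [-1.0, -1.0, -1.0]
high = [1.0, 1.0, 1.0]

tokens = discretize([0.0, -1.0, 1.0], low, high)
print(tokens)  # [128, 0, 255]
print(undiscretize(tokens, low, high))
```

The round trip loses at most half a bin width per dimension, which is one reason the bin count and the action bounds interact with how fine-grained the controller can be.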


5. Applications

  • Robotic navigation
  • Service, household, and logistics robots
  • Everyday control of humanoids
  • Industrial automation and action commands in autonomous driving
  • HRI (Human–Robot Interaction), teleoperation, and education

6. Summary Comparison Table

| Model | Parameters / base | Key features |
| --- | --- | --- |
| Mobility VLA | Gemini-based | Long-context VLM + topological graph; specialized for indoor navigation |
| RT-2 | closed | Early landmark VLA; maps vision and language directly to actions |
| OpenVLA | 7B, open | Supports a variety of robots; strong zero-shot performance |
| TinyVLA | Lightweight architecture | Excellent inference speed and data efficiency |
| OTTER | Enhanced VLM | Stronger vision-language alignment; strong zero-shot performance |
| Helix | Full humanoid | Butler-level manipulation and collaboration support |

7. Wrap-Up

VLA is bringing the era of AI robots that see, understand, and act closer.
Mobility VLA in particular has opened a door for robot navigation research by combining long-context reasoning, hierarchical policies, and topological information in one integrated mechanism.

Next Steps

  • Hands-on with Mobility VLA preprocessing (COLMAP and topological graph generation)
  • Setting up ROS2 integration with RT-2 / OpenVLA
  • Running a lightweight, real-time action generation demo based on TinyVLA