AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

DD[Dev_Diary] · November 22, 2025

๐Ÿ“ AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration


One-line summary:
When quantizing the weights of a large language model (LLM) to 4 bits, protecting the salient weight channels based on the activation distribution achieves more than a 3× inference speedup with negligible accuracy loss.


1. Introduction & Background

Why this work: limitations of prior research

Large language models (LLMs) have driven breakthroughs in chatbots, virtual assistants, autonomous driving, and more, but their enormous size has been the biggest obstacle to on-device deployment. For example:

  • GPT-3 has 175B parameters, requiring 350GB of memory in FP16
  • Even the latest B200 GPU offers only 192GB of memory, to say nothing of edge devices
  • Existing quantization methods (e.g., GPTQ) overfit the calibration dataset, hurting generality

In particular, prior Post-Training Quantization (PTQ) approaches suffered from the following problems:

  1. GPTQ: performs error compensation using second-order information, but its reconstruction process overfits the calibration data, degrading out-of-distribution performance
  2. Round-to-Nearest (RTN): simple rounding collapses at low bit widths such as INT3/INT4

Research goal: the need for a new approach

The authors started from the following key insight:

"Not all weights in an LLM are equally important. Protecting only a small fraction (0.1–1%) of salient weights can greatly reduce quantization error."

However, keeping the important weights in mixed precision is inefficient to implement in hardware. AWQ therefore:

  • Identifies the important channels based on the activation distribution
  • Protects the salient weights via per-channel scaling while keeping every weight at the same bit width (hardware-friendly)
  • Works without backpropagation or reconstruction, giving strong generalization

2. Methodology (in detail)

Core idea: activation-aware quantization

The core principle of AWQ: "the importance of a weight is determined not by the weight's own magnitude, but by the magnitude of the activations flowing through its channel."

Step 1: Identifying salient weight channels

From the Table 1 experiments in the paper:

| Model | RTN (w3-g128) | Activation-based 1% FP16 | Weight-based 1% FP16 | Random 1% FP16 |
|---|---|---|---|---|
| OPT-6.7B | 23.54 PPL | 11.39 PPL | 22.37 PPL | 23.54 PPL |

  • Keeping only the 1% of channels selected by activation distribution in FP16 drops perplexity sharply from 23.54 to 11.39 (a large improvement)
  • In contrast, selecting by weight magnitude (L2-norm) or at random has almost no effect

Interpretation: channels with large activations process more important features, so their weights should be kept precise.

Step 2: Reducing quantization error via scaling

Since mixed precision is complex to implement in hardware, AWQ instead uses a mathematically equivalent transformation.

Quantization function:

Q(w) = Δ · Round(w/Δ), where Δ = max(|w|) / 2^(N-1)

(N: number of quantization bits, Δ: scale factor)
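A minimal NumPy sketch of this quantizer (per-tensor Δ; the paper's group-wise variants, e.g. g128, compute one Δ per group of 128 weights and are omitted here):

```python
import numpy as np

def quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest quantization: Q(w) = Δ · Round(w/Δ)."""
    delta = np.abs(w).max() / 2 ** (n_bits - 1)  # Δ = max|w| / 2^(N-1)
    return delta * np.round(w / delta)
```

Every output value is a multiple of Δ, and the rounding error per weight is bounded by Δ/2.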

If we scale a particular weight w up by s (s > 1) and scale the input activation x down by 1/s:

Q(w·s) · (x/s) = Δ' · Round(ws/Δ') · x · (1/s)

Key finding (Table 2 experiments):

| s | Change in Δ | Avg. error ratio | Wiki-2 PPL |
|---|---|---|---|
| 1.0 | 0% | 1.0 | 23.54 |
| 2.0 | 8.2% | 0.519 | 11.92 |
| 4.0 | 21.2% | 0.303 | 12.36 |

  • At s=2, the relative quantization error of the salient channels roughly halves
  • If s is too large (s=4), Δ for the non-salient channels grows and performance actually degrades

์ง๊ด€์  ์ดํ•ด:
์ค‘์š”ํ•œ ๊ฐ€์ค‘์น˜๋ฅผ ํฌ๊ฒŒ ๋งŒ๋“ค๋ฉด(s๋ฐฐ), ์–‘์žํ™” ์Šคํ…(ฮ”)์€ ๊ฑฐ์˜ ๋ณ€ํ•˜์ง€ ์•Š์ง€๋งŒ, ๋ฐ˜์˜ฌ๋ฆผ ์˜ค์ฐจ๋Š” ์ƒ๋Œ€์ ์œผ๋กœ ์ž‘์•„์ง‘๋‹ˆ๋‹ค. ๋งˆ์น˜ ์ž‘์€ ๋ฌผ์ฒด๋ฅผ ํ™•๋Œ€ํ•œ ํ›„ ๋””์ง€ํ„ธํ™”ํ•˜๋ฉด ๋””ํ…Œ์ผ์ด ๋” ์ž˜ ๋ณด์กด๋˜๋Š” ์›๋ฆฌ์™€ ์œ ์‚ฌ
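This intuition can be checked numerically. Below is a toy experiment (sizes and values are illustrative, not from the paper): a non-extreme "salient" weight is scaled by s=2 before 3-bit quantization and divided by s afterward. Because Δ is set by the unchanged maximum weight, the effective rounding error on that channel shrinks by roughly 1/s:

```python
import numpy as np

def quantize(w, n_bits=3):
    delta = np.abs(w).max() / 2 ** (n_bits - 1)
    return delta * np.round(w / delta)

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=4096)   # one weight row; max|w| ≈ 1
idx, s = 7, 2.0                         # pretend channel 7 is salient

base_errs, scaled_errs = [], []
for v in np.linspace(-0.3, 0.3, 601):   # typical (non-extreme) salient weight values
    w[idx] = v
    base_errs.append(abs(quantize(w)[idx] - v))
    ws = w.copy()
    ws[idx] *= s                        # scale the salient weight up ...
    scaled_errs.append(abs(quantize(ws)[idx] / s - v))  # ... undo it after quantizing

# Δ is determined by the (unchanged) max weight, so the average
# rounding error on the salient channel drops by roughly a factor of s.
print(np.mean(base_errs) / np.mean(scaled_errs))
```

This also shows why s cannot grow without bound: once s·w starts raising max|w| (or the group maximum), Δ itself grows and the non-salient channels pay the price, matching the s=4 row of Table 2.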

Step-by-step process

1. Measure the average activation magnitude per channel (sₓ) on a calibration set
2. Define the search space for the optimal scale as s = sₓ^α
3. Grid-search α ∈ [0, 1] in 20 steps to find the best α
   - Objective: minimize ||Q(W·diag(s))·(diag(s)⁻¹·X) - WX||
4. Transform the weights with the found scale, then quantize
5. At inference time, s⁻¹·X can be fused into the preceding layer's computation
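The search in steps 2–3 can be sketched as follows (a simplified illustration: per-tensor Δ instead of the paper's group-wise scales, and a naive normalization of sₓ that the paper does not necessarily use):

```python
import numpy as np

def quantize(w, n_bits=4):
    delta = np.abs(w).max() / 2 ** (n_bits - 1)
    return delta * np.round(w / delta)

def search_awq_scale(W, X, n_bits=4, n_grid=20):
    """Grid-search alpha in [0, 1] with s = s_x**alpha, minimizing
    ||Q(W·diag(s)) · (diag(s)^-1 · X) - W·X|| on calibration data.

    W: (out_features, in_features) weights
    X: (num_tokens, in_features) calibration activations
    """
    s_x = np.abs(X).mean(axis=0)     # per-channel mean activation magnitude
    s_x = s_x / s_x.mean()           # illustrative normalization (alpha=0 -> s=1)
    ref = W @ X.T                    # full-precision reference output
    best_err, best_alpha = np.inf, 0.0
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = s_x ** alpha
        out = quantize(W * s, n_bits) @ (X.T / s[:, None])  # fold s into W, 1/s into X
        err = float(np.linalg.norm(out - ref))
        if err < best_err:
            best_err, best_alpha = err, alpha
    return best_alpha, best_err
```

Since α=0 (i.e., s=1, plain RTN) is included in the grid, the search can never do worse than no scaling on the calibration set.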

์žฅ์ :

  • ์—ญ์ „ํŒŒ ๋ถˆํ•„์š” โ†’ ์—ฐ์‚ฐ ํšจ์œจ์ 
  • ๋ณด์ • ๋ฐ์ดํ„ฐ ์˜์กด๋„ ๋‚ฎ์Œ โ†’ ์ผ๋ฐ˜ํ™” ์šฐ์ˆ˜
  • ํ•˜๋“œ์›จ์–ด ์นœํ™”์  (๋‹จ์ผ ์ •๋ฐ€๋„ ์œ ์ง€)

3. Experiments & Results

LLaMA/Llama-2 performance comparison

Key results table from the paper:

| Model | FP16 | RTN (INT3) | GPTQ | GPTQ-R | AWQ |
|---|---|---|---|---|---|
| Llama-2 7B | 5.47 | 6.66 | 6.43 | 6.42 | 6.24 |
| Llama-2 70B | 3.32 | 3.98 | 3.88 | 3.86 | 3.74 |
| LLaMA 7B | 5.68 | 7.01 | 8.81 | 6.53 | 6.35 |

์‹œ๊ฐ์  ๋‚ด์šฉ (WikiText-2 Perplexity, ๋‚ฎ์„์ˆ˜๋ก ์ข‹์Œ):

  • X์ถ•: ๋ชจ๋ธ ํฌ๊ธฐ (7B ~ 70B)
  • Y์ถ•: Perplexity ์ˆ˜์น˜
  • AWQ(์ฃผํ™ฉ์„ )๊ฐ€ ๋ชจ๋“  ๋ชจ๋ธ์—์„œ RTN, GPTQ๋ณด๋‹ค ์ผ๊ด€๋˜๊ฒŒ ๋‚ฎ์€ PPL ๋‹ฌ์„ฑ

ํ•ด์„:
1. INT3 ์–‘์žํ™”์—์„œ AWQ๋Š” RTN ๋Œ€๋น„ 7B ๋ชจ๋ธ์—์„œ 6.66โ†’6.24 (6.3% ๊ฐœ์„ ), GPTQ๋ณด๋‹ค๋„ ์šฐ์ˆ˜
2. ํŠนํžˆ LLaMA 7B์—์„œ GPTQ๋Š” 8.81๋กœ ์‹คํŒจํ–ˆ์œผ๋‚˜, AWQ๋Š” 6.35๋กœ ์•ˆ์ •์ 
3. 70B ์ดˆ๋Œ€ํ˜• ๋ชจ๋ธ์—์„œ๋„ FP16 3.32 ๋Œ€๋น„ AWQ๋Š” 3.74๋กœ ์†์‹ค ์ตœ์†Œํ™”

GPT-4 evaluation of instruction-tuned models (Vicuna)

Figure summary:

  • GPT-4 compares the quantized model's responses against FP16 on 80 sample questions
  • Blue (Quantized Win): the quantized model gives the better answer
  • Gray (Tie): equal
  • Red (Quantized Lost): the quantized model gives the worse answer

| Model | RTN Wins | GPTQ Wins | AWQ Wins |
|---|---|---|---|
| Vicuna-7B | 52 | 71 | 75 |
| Vicuna-13B | 47 | 57 | 57 (tied for best) |

Interpretation:
AWQ records the most win cases on instruction-tuned models as well, demonstrating its strong generalization.

Multimodal model: OpenFlamingo-9B (COCO captioning)

| Few-shot | FP16 | RTN (INT4) | GPTQ | AWQ |
|---|---|---|---|---|
| 32-shot | 81.70 CIDEr | 77.13 (-4.57) | 74.98 (-6.72) | 80.53 (-1.17) |
| 0-shot | 63.73 | 60.24 | 59.72 | 62.57 |

Figure summary:
In the graph, AWQ (orange line) stays closest to FP16 across every few-shot setting (0/4/8/16/32-shot), ahead of RTN and GPTQ.

Interpretation:

  • The first successful low-bit quantization of a multimodal LLM
  • At 32-shot, AWQ loses only 1.17 CIDEr versus FP16, essentially lossless
  • GPTQ drops by 6.72, exposing its overfitting problem

Bottleneck analysis (RTX 4090 GPU)

Left graph: context vs. generation time

  • Context stage (200 tokens): 10ms
  • Generation stage (20 tokens): 310ms → the generation stage is 31× slower

Middle graph: roofline analysis

  • Y-axis: peak TFLOPS (up to 165)
  • X-axis: arithmetic intensity (compute-to-memory ratio)
  • FP16 generation: intensity = 1 → memory-bound
  • AWQ W4A16: intensity = 4 → a 4× improvement, enabling up to 4 TFLOPS

Right graph: memory-access breakdown

  • Weight access: 134MB (dominant)
  • Activation access: 1.7MB
  • Weights are accessed 79× more → compressing the weights is the key

Interpretation:
On-device LLM inference is limited by memory bandwidth. By compressing weights to 4 bits, AWQ cuts memory traffic by a theoretical 4×, which translates into a real speedup of over 3×.

Measured TinyChat system performance

RTX 4090 desktop GPU:

| Model | Huggingface FP16 | TinyChat FP16 | TinyChat AWQ (W4A16) |
|---|---|---|---|
| Llama-2-7B | 52 tok/s | 62 tok/s | 194 tok/s |
| Llama-2-13B | 49 tok/s | - | 158 tok/s |
| Falcon-7B | 124 tok/s | - | 194 tok/s |

Jetson Orin mobile GPU:

| Model | Huggingface FP16 | TinyChat AWQ |
|---|---|---|
| Llama-2-7B | 22 tok/s | 38 tok/s |
| Llama-2-13B | OOM | 21 tok/s |

Interpretation:
1. Desktop GPU: AWQ + TinyChat is 3.1–3.9× faster than Huggingface FP16
2. Mobile GPU: the 13B model runs out of memory (OOM) in FP16, but reaches 21 tok/s with AWQ
3. Even a laptop with 8GB of GPU memory (RTX 4070) can run Llama-2-13B at 33 tok/s

์‹œ๊ฐ์  ๋‚ด์šฉ (Figure 9 ๋ง‰๋Œ€ ๊ทธ๋ž˜ํ”„):

  • ํŒŒ๋ž€์ƒ‰(Huggingface FP16): ๋‚ฎ์€ ๋ง‰๋Œ€
  • ํšŒ์ƒ‰(TinyChat FP16): ์ค‘๊ฐ„ ๋ง‰๋Œ€
  • ๋นจ๊ฐ„์ƒ‰(TinyChat AWQ): ๊ฐ€์žฅ ๋†’์€ ๋ง‰๋Œ€ โ†’ ์‹œ๊ฐ์ ์œผ๋กœ๋„ ์••๋„์  ์šฐ์œ„

4. Conclusion & Insight

Three key contributions

  1. Establishing the activation-aware quantization principle
    A new paradigm that judges weight importance by the activation distribution, shown experimentally to generalize better than GPTQ's reconstruction-based approach

  2. Hardware-friendly design
    Achieves the effect of mixed precision with uniform-bit quantization plus channel scaling, which simplifies the CUDA kernel implementation and makes real deployment practical

  3. Broad applicability

    • Instruction-tuned models (Vicuna)
    • Multimodal models (OpenFlamingo, VILA, LLaVA)
    • Code/math-specialized models (CodeLlama, GSM8K)
      Strong results on all of them → the first general-purpose low-bit quantization solution

Limitations & future work

Limitations acknowledged by the authors:

  • Performance still degrades at extreme low bit widths such as INT2 (Table 9: RTN fails completely; combining AWQ with GPTQ is needed)
  • Dependence on calibration data is low, but not zero

5. References

Paper link:

💡 Closing remarks

AWQ could be called "the JPEG compression of LLM quantization." Just as JPEG discards the high-frequency components that matter less to the human eye, AWQ lowers the precision of the weights that matter less according to the activation distribution.

What impressed me most is how the work connects theory (the mathematical derivation) with practice (the TinyChat system). While many papers stop at "theoretically possible," AWQ measured a 3× speedup on real GPUs, proving it is a solution ready for immediate adoption.

As of 2024, with on-device AI on the rise, this work looks set to become a cornerstone technology of the mobile LLM revolution. 🚀
