[📖 Paper Review] LoRA: Low-Rank Adaptation of Large Language Models (2021)

Becky's Study Lab · December 27, 2023

PaperReview

๋ชฉ๋ก ๋ณด๊ธฐ
12/22

FinGPT ๋…ผ๋ฌธ์„ ์ฝ๋˜ ์ค‘, ๋งค์ผ๋งค์ผ ์Ÿ์•„์ ธ๋‚˜์˜ค๋Š” ๊ธˆ์œต ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  Fine-tuning์„ ํ•˜๋Š”๋ฐ ์žˆ์–ด์„œ LoRA๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด๊ณ  LoRA๋ฅผ ๊ฐ„๋‹จํ•˜๊ฒŒ๋งŒ ๋‚ด๊ฐ€ ์•Œ๊ณ  ์žˆ๋‹ค๋Š” ์ƒ๊ฐ์„ ํ–ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์ด๋ ‡๊ฒŒ LoRA ๋…ผ๋ฌธ์„ ์ฐพ์•„๋ณด๊ฒŒ๋˜์—ˆ๋‹ค. ์ฐธ๊ณ ๋กœ Microsoft์—์„œ ๋‚˜์˜จ ๋…ผ๋ฌธ์ธ๋ฐ ์ฝ๊ณ  ๋‚˜๋‹ˆ ์ด ๋…ผ๋ฌธ์ด ๊ฝค ๋Œ€๋‹จํ•œ ๋‚ด์šฉ์ด๋ผ๋Š” ์ ์„ ๋‹ค์‹œ ํ•œ ๋ฒˆ ๊นจ๋‹ฌ์•˜๋‹ค.

0. Abstract

  • LLM๋ชจ๋ธ์€ ์ผ๋ฐ˜ ๋„๋ฉ”์ธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ๋Œ€๊ทœ๋ชจ pre-training๊ณผ ํŠน์ • ์ž‘์—… ๋˜๋Š” ๋„๋ฉ”์ธ์— ๋Œ€ํ•œ fine-tuning์œผ๋กœ ์ด๋ค„์ง
  • ex) GPT-3 175B๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๊ฐ๊ฐ 175B ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ์žˆ๋Š” ๋ฏธ์„ธ ์กฐ์ • ๋ชจ๋ธ์˜ ๋…๋ฆฝ์ ์ธ ์ธ์Šคํ„ด์Šค๋ฅผ ๋ฐฐํฌํ•˜๋Š” ๋ฐ ๋น„์šฉ์ด ์—„์ฒญ๋‚˜๊ฒŒ ๋งŽ์ด ๋“ญ๋‹ˆ๋‹ค.
  • ์‚ฌ์ „ ํ•™์Šต ๋œ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜ ๋ฅผ ๋™๊ฒฐํ•˜๊ณ  ํ•™์Šต ๊ฐ€๋Šฅํ•œ ์ˆœ์œ„ ๋ถ„ํ•ด ํ–‰๋ ฌ์„ Transformer ์•„ํ‚คํ…์ฒ˜์˜ ๊ฐ ๊ณ„์ธต์— ์ฃผ์ž…ํ•˜์—ฌ ๋‹ค์šด์ŠคํŠธ๋ฆผ ์ž‘์—…์— ๋Œ€ํ•ด ํ•™์Šต ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ˆ˜๋ฅผ ํฌ๊ฒŒ ์ค„์ด๋Š” LoRA(Low-Rank A ์ ์‘)๋ฅผ ์ œ์•ˆ
  • LoRA๋Š” ํ›ˆ๋ จ ๊ฐ€๋Šฅํ•œ ๋งค๊ฐœ๋ณ€์ˆ˜ ์ˆ˜๋ฅผ 10,000๋ฐฐ, GPU ๋ฉ”๋ชจ๋ฆฌ ์š”๊ตฌ ์‚ฌํ•ญ์„ 3๋ฐฐ ์ค„์ผ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ๋˜ํ•œ ์„ฑ๋Šฅ ๋ฉด์—์„œ๋„ RoBERTa, DeBERTa, GPT-2 ๋ฐ GPT-3์˜ ๋ชจ๋ธ ํ’ˆ์งˆ์—์„œ ๋ฏธ์„ธ ์กฐ์ •๋ณด๋‹ค ๋™๋“ฑํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ฐœํœ˜ํ•จ.

1. Introduction

๐Ÿค” LLM์€ ๊ธฐ๋ณธ์ ์œผ๋กœ pre-trained model๋กœ๋ถ€ํ„ฐ ํŠน์ • task(e.g. summarization, question and answering, ...)์— adaptationํ•˜๊ธฐ ์œ„ํ•ด fine-tuning์„ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. Fine-tuning์„ ํ•˜๋ฉด์„œ LLM๋ชจ๋ธ์˜ weight parameters๋ฅผ ๋ชจ๋‘ ๋‹ค์‹œ ํ•™์Šตํ•˜๊ฒŒ ๋˜๋Š”๋ฐ ์ด๋•Œ ์—„์ฒญ๋‚œ ๋น„์šฉ์ด ๋ฐœ์ƒํ•œ๋‹ค.
(์˜ˆ๋ฅผ ๋“ค์–ด GPT-2(or 3), RoBERTa large๋ชจ๋ธ์˜ ๊ฒฝ์šฐ fine-tuning๋งŒ ๋ช‡ ๋‹ฌ์ด ๊ฑธ๋ฆฐ๋‹ค.)

โ— ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ๋Š” Low-Rank Adaptation(LoRA)๋ฅผ ์ œ์•ˆ

LoRA์˜ overview ์ด๋ฏธ์ง€์ด๋‹ค.
์ด ์ด๋ฏธ์ง€์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, ๊ธฐ์กด์˜ weights ๋Œ€์‹  ์ƒˆ๋กœ์šด ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ด์šฉํ•ด์„œ ๋™์ผํ•œ ์„ฑ๋Šฅ๊ณผ ๋” ์ ์€ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ํŠœ๋‹ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•๋ก ์„ ์ œ์‹œํ•˜๊ณ  ์žˆ๋‹ค.
LoRA๋Š” Low-Rank ๋ฐฉ๋ฒ•์„ ์ด์šฉํ•˜์—ฌ time, resource cost๋ฅผ ์ค„์ด๊ฒŒ ๋œ๋‹ค.

(🔻 Brief explanation)
As in Figure 1, during fine-tuning the pre-trained weights W are frozen; only the low-rank decomposed weights A and B are trained, and their product is added to W. Since the low-rank factors are naturally much smaller than the original W, time and resource costs go down.

Moreover, once you have the pre-trained model, adapting it to a particular task only requires storing A and B, and adapting to another task only requires swapping in a different A', B', so it is very efficient in terms of storage and task switching. On top of that, the adapted model incurs no additional inference latency.
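For a feel of the savings, here is a back-of-the-envelope calculation under assumed dimensions (a single attention weight matrix of GPT-3 175B, $d = k = 12288$, and LoRA rank $r = 4$):

$$d \cdot k = 12288^2 \approx 1.51 \times 10^{8}, \qquad r(d + k) = 4 \times 24576 = 98{,}304, \qquad \frac{d \cdot k}{r(d+k)} = 1536$$

so storing A and B for this matrix takes roughly 1/1,500th of the parameters of a full update ΔW.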

💡 Why does a low-rank method work?

📑 The papers "Measuring the Intrinsic Dimension of Objective Landscapes" and "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" show that
❗ over-parameterized models actually live on a low intrinsic dimension ❗
-> so the authors hypothesize that the weight change during model adaptation also has a low "intrinsic rank", which justifies the low-rank method.

LoRA freezes the existing pre-trained weights and trains only a few dense (fully connected) layers; concretely, it optimizes only the matrices obtained by decomposing each dense layer's weight update into low-rank factors.

✅ Terminologies and Conventions: $d_{model}$ is the Transformer hidden dimension; $W_q$, $W_k$, $W_v$, $W_o$ are the query/key/value/output projection matrices of self-attention; $W_0$ is a pre-trained weight matrix and $\Delta W$ its accumulated gradient update during adaptation; $r$ is the rank of a LoRA module.

2. Problem Statement

✅ LoRA is agnostic to the training objective, but the paper explains it with language modeling as the motivating use case.

  1. ๊ธฐ์กด์˜ LLM๋ชจ๋ธ(GPT)์˜ ํ™•๋ฅ ํ•จ์ˆ˜๋ฅผ Pฮฆ(yโˆฃx)P_ฮฆ(y|x)๋กœ ์ •์˜ํ•œ๋‹ค. (y์™€ x๋Š” context-target pair์Œ์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ํŽธํ•จ. Pฮฆ(yโˆฃx)P_ฮฆ(y|x) ๋Š” GPT๊ฐ™์€ multi-task learner ๋ชจ๋ธ์˜ ํ™•๋ฅ ํ•จ์ˆ˜์ด๋‹ค.)

  2. The fine-tuning process, in which the LLM's parameters $\Phi$ are optimized, can then be written as the equation below.
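As stated in the paper (its Eq. 1), where $\mathcal{Z} = \{(x_i, y_i)\}_{i=1,\dots,N}$ is the set of training context-target pairs:

$$\max_{\Phi} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log\!\left( P_{\Phi}(y_t \mid x, y_{<t}) \right)$$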

    - $\Phi_0$ : the pre-trained weights
    - A full fine-tuning model is initialized with the pre-trained weights $\Phi_0$ and updates them to $\Phi_0 + \Delta\Phi$ so as to maximize the log-likelihood above.
    - Solving the problem with a log-likelihood objective simply means finding the parameters $\Phi$ that maximize the probability of the correct target tokens.
    - Intuitively, after backpropagation the model ends up with $\Phi = \Phi_0 + \Delta\Phi$.

🤔 What happens with full fine-tuning?
: For *each* downstream task we would have to re-learn a $\Delta\Phi$ whose dimension $|\Delta\Phi|$ equals $|\Phi_0|$.
=> For an LLM like GPT, this costs an enormous amount!
=> To fix this, LoRA replaces the updated parameters with $\Delta\Phi = \Delta\Phi(\Theta)$, i.e., the accumulated gradient update $\Delta\Phi$ is encoded by a much smaller set of parameters $\Theta$, which is what actually gets trained.

  1. Substituting $\Delta\Phi = \Delta\Phi(\Theta)$, the objective is redefined as the equation below.
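As in the paper's Eq. 2, the same conditional language-modeling objective is now maximized over the small parameter set $\Theta$:

$$\max_{\Theta} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log\!\left( p_{\Phi_0 + \Delta\Phi(\Theta)}(y_t \mid x, y_{<t}) \right)$$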

  • In other words, the parameter-update problem that backpropagation solves in the original log-likelihood formulation is re-posed over the much smaller parameter set $\Theta$.
  • Since $|\Theta| \ll |\Phi_0|$, the task of finding the optimal $\Delta\Phi$ is replaced by optimizing over $\Theta$.

3. Aren't Existing Solutions Good Enough?

The problem we are trying to solve is by no means new.
Since the inception of transfer learning, dozens of works have sought to make model adaptation more parameter- and compute-efficient.

For language modeling, for example, there are two prominent strategies for efficient adaptation:
1) adding adapter layers
2) optimizing some form of the input layer activations (prompt/prefix tuning)

=> Both strategies have limitations, especially in large-scale, latency-sensitive production scenarios.

Adapter Layers Introduce Inference Latency

Adapter layers introduce inference latency: large neural networks rely on hardware parallelism to keep latency low, whereas adapter layers have to be processed sequentially.
Table 1 in the paper shows that even with a very small adapter bottleneck dimension, latency increases noticeably when adapters are used on a single GPU.

Directly Optimizing the Prompt is Hard

Directly optimizing the prompt is hard.
The representative approach here is prefix tuning.

🤔 Prefix tuning

: a method that adapts a language model by optimizing a continuous task-specific vector (the prefix)

  • ์–ธ์–ด๋ชจ๋ธ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ๊ณ ์ •ํ•œ ์ƒํƒœ(=frozen)
  • continuous vector/virtual tokens์„ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ์ ์—์„œ ์ž์—ฐ์–ด(discrete tokens)๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ ‘๊ทผ๋ฐฉ๋ฒ•๊ณผ ๊ตฌ๋ถ„๋จ
  • ํ•˜๋‚˜์˜ ์–ธ์–ด๋ชจ๋ธ๋กœ ์—ฌ๋Ÿฌ ๊ฐœ์˜ Task๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Œ (prefix๋ฅผ ํ•™์Šต)
  • ์—ฐ์†์ ์ธ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์œผ๋กœ ๊ตฌ์„ฑ๋œ ์„ค๋ช…(instruction)์„ ์ตœ์ ํ™”ํ•จ์œผ๋กœ์จ, ๋ชจ๋“  Transformer activation๊ณผ ์ดํ›„์— ๋“ฑ์žฅํ•˜๋Š” ์—ฐ์†์ ์ธ ํ† ํฐ๋“ค์— ์˜ํ–ฅ์„ ์คŒ
  • ๋ชจ๋“  ๋ ˆ์ด์–ด์˜ prefix๋ฅผ ์ตœ์ ํ™”

The LoRA authors observe that prefix tuning is difficult to optimize and that its performance changes non-monotonically with the number of trainable parameters, confirming similar observations in the original prefix-tuning paper.
❗ More fundamentally, reserving part of the sequence length for adaptation necessarily reduces the sequence length available for the downstream task, which they suspect makes prompt tuning perform worse than other methods. (See the sketch below.)

4. Our Method

4.1. Low-Rank-Parametrized Update Matrices

✅ LoRA constrains the weight update during adaptation to a low intrinsic rank.

  • Mathematically, for a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, the update is $W_0 + \Delta W = W_0 + BA$.
  • That is, $W_0$ is frozen and only the low-rank factors $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trained (with rank $r \ll \min(d, k)$).
  • $W_0$ and $\Delta W = BA$ are multiplied with the same input, and their output vectors are summed coordinate-wise. The forward pass can be written as follows.
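As in the paper's Eq. 3, for an input $x$ and output hidden vector $h$:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$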

  • $A$ is initialized with a random Gaussian and $B$ with zeros, so $\Delta W = BA$ is also zero at the start of training.
  • $\Delta W x$ is scaled by $\frac{\alpha}{r}$.
  • When optimizing with Adam, tuning $\alpha$ is roughly the same as tuning the learning rate, so the authors simply set $\alpha$ to the first $r$ they try. This scaling helps avoid re-tuning hyperparameters when $r$ is varied.
  • During training, $W_0$ receives no gradient updates; only $BA$ is learned.

์•„๋ž˜์˜ ์œ„์˜ ๋‚ด์šฉ์„ ์‹ค์ œ๋กœ ๊ตฌํ˜„ํ•œ LoRA ๊ณต์‹ github์† ๊ตฌํ˜„ ์ฝ”๋“œ์ด๋‹ค.

A Generalization of Full Fine-tuning.

  • A more general form of fine-tuning trains only a subset of the pre-trained parameters; LoRA can be seen in this light.
  • LoRA goes a step further: it does not require the accumulated gradient update to the weight matrices to have full rank during adaptation.
  • In other words, as the number of trainable parameters grows, training with LoRA roughly converges to training the original model, whereas adapter-based methods converge to an MLP and prefix-based methods converge to a model that cannot take long input sequences.

No Additional Inference Latency.

  • LoRA adds no extra inference latency.
  • To run inference with LoRA, you simply add the learned $BA$ to the pre-trained weight $W_0$ and use the merged matrix, so there is no inference latency penalty at all.
  • If another task has been trained on top of the same $W_0$ with factors $B'A'$, you can subtract $BA$ and add
    $B'A'$ to switch tasks, which makes the weights highly reusable.
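A tiny sketch of the merge and task-switch operations described above (my own illustration, assuming $W_0 \in \mathbb{R}^{d \times k}$, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$ and a scaling factor of $\alpha/r$):

```python
import torch

def merge_for_inference(W0, B, A, scaling):
    # W = W0 + (α/r)·BA: after merging, inference is a single matmul,
    # so there is no extra latency compared to the original model.
    return W0 + scaling * (B @ A)

def switch_task(W_merged, B_old, A_old, B_new, A_new, scaling):
    # Recover W0 by subtracting the old BA, then add the new task's B'A'.
    return W_merged - scaling * (B_old @ A_old) + scaling * (B_new @ A_new)

# Usage sketch with assumed toy dimensions d = k = 8, r = 2
d, k, r, scaling = 8, 8, 2, 1.0
W0 = torch.randn(d, k)
B, A = torch.zeros(d, r), torch.randn(r, k)   # B = 0 at init, so BA = 0
W = merge_for_inference(W0, B, A, scaling)
```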

4.2. Applying LoRA to Transformer

✅ To keep the number of trainable weights small, the paper does not apply LoRA to every layer and module.
❗ LoRA is applied only to the Transformer's attention weight matrices ($W_q$, $W_k$, $W_v$, $W_o$); the MLP modules are left frozen.

(In most of the actual experiments, LoRA is applied only to $W_q$ and $W_v$.)

With this setup, fine-tuning GPT-3 with its 175 billion parameters originally required about 1.2 TB of VRAM during training; with LoRA this drops to 350 GB.
Training also becomes roughly 25% faster.
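In practice this selective application is usually expressed as a configuration. A minimal sketch using the Hugging Face peft library (the checkpoint name and hyperparameters are illustrative assumptions, not from the paper; the module names q_proj/v_proj assume an OPT-style attention block):

```python
# Sketch: apply LoRA only to the query/value projections of a small causal LM.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

config = LoraConfig(
    r=4,                                  # rank r of the update matrices
    lora_alpha=4,                         # α; ΔWx is scaled by α/r
    target_modules=["q_proj", "v_proj"],  # LoRA on W_q and W_v only
    lora_dropout=0.0,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # only the A/B factors require gradients
```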

5. Empirical Experiments

5.1. Baselines

  • Fine-Tuning (FT) : the model is initialized with the pre-trained weights and biases, and all model parameters go through gradient updates.
  • Bias-only or BitFit : train only the bias vectors while freezing everything else.
  • Prefix-embedding tuning (PreEmbed) : insert special tokens among the input tokens (these special tokens have trainable word embeddings and are generally not in the model's vocabulary).
  • Prefix-layer tuning (PreLayer) : an extension of prefix-embedding tuning; instead of learning only the word embeddings (i.e., the activations after the embedding layer) for some special tokens, it learns the activations after every Transformer layer.
  • Adapter tuning : insert adapter layers between the self-attention module (and the MLP module) and the subsequent residual connection.
  • LoRA : add trainable pairs of rank-decomposition matrices in parallel to the existing weight matrices; applied only to $W_q$ and $W_v$.

5.4. GPT-2 medium/large

  • The table reports GPT-2 medium (M) and large (L) results on the E2E NLG Challenge with the adaptation methods defined above; for every metric, higher is better.
  • As the table shows, LoRA outperforms several baselines while using a comparable or smaller number of trainable parameters.

  • The table above also shows the performance of the various adaptation methods on GPT-3 175B.
  • It reports logical-form validation accuracy on WikiSQL, validation accuracy on MultiNLI-matched, and Rouge-1/2/L on SAMSum.
  • LoRA performs better than the prior approaches, including full fine-tuning (FT).

5.5. Scaling up to GPT-3 175B

  • LoRA์˜ ์ตœ์ข… ์ŠคํŠธ๋ ˆ์Šค ํ…Œ์ŠคํŠธ๋กœ 1,750์–ต ๊ฐœ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ๊ฐ–์ถ˜ GPT-3๊นŒ์ง€ ํ™•์žฅํ•˜์˜€๋‹ค.
  • ์•„๋ž˜ ๊ทธ๋ฆผ์„ ๋ณด๋ฉด, LoRA๋Š” ์„ธ ๊ฐ€์ง€ ๋ฐ์ดํ„ฐ ์„ธํŠธ ๋ชจ๋‘์—์„œ fine-tuning ๊ธฐ์ค€์„ ๊ณผ ์ผ์น˜ํ•˜๊ฑฐ๋‚˜ ์ดˆ๊ณผํ•œ๋‹ค.
  • LoRA๋ฅผ ์ด์šฉํ•˜์˜€์„ ๋•Œ ํ•ด๋‹น fields์—์„œ SOTA๋ฅผ ๋‹ฌ์„ฑํ•  ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, GPT-3์˜ ๊ฒฝ์šฐ 175B์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐ€์šด๋ฐ 0.01%์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฐœ์ˆ˜๋งŒ ์ด์šฉํ•  ์ •๋„๋กœ ํšจ์œจ์„ฑ์ด ์ข‹๋‹ค.
  • ๋ชจ๋“  ๋ฐฉ๋ฒ•์ด ๋” ๋งŽ์€ trainable parameters๋ฅผ ๊ฐ–๋Š”๋‹ค๊ณ  ํ•ด์„œ ์ด์ ์„ ์–ป๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋‹ค.
  • ์ ‘prefix-embedding tuning์— 256๊ฐœ ์ด์ƒ์˜ ํŠน์ˆ˜ ํ† ํฐ์„ ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜ prefix-layer tuning์— 32๊ฐœ ์ด์ƒ์˜ ํŠน์ˆ˜ ํ† ํฐ์„ ์‚ฌ์šฉํ•˜๋ฉด ์„ฑ๋Šฅ์ด ํฌ๊ฒŒ ์ €ํ•˜๋˜๋Š” ๊ฒƒ์„ ๊ด€์ฐฐํ–ˆ๋‹ค.

6. Related Works

  • Transformer Language Models
  • Prompt Engineering and Fine-Tuning
  • Parameter-Efficient Adaptation
  • Low-Rank Structures in Deep Learning

7. Understanding the Low-Rank Updates

7.1. Which Weight Matrices in Transformer Should We Apply LoRA to?

  • Applying LoRA to both $W_q$ and $W_v$ gives the best performance.

7.2. What is the Optimal Rank $r$ for LoRA?

  • When LoRA is applied to both $W_q$ and $W_v$, even $r = 1$ already gives quite good performance.
  • When LoRA is applied to $W_q$ only, the experiments show that a somewhat larger $r$ works better.

7.3. How Does the Adaptation Matrix $\Delta W$ Compare to $W$?

  1. $\Delta W$ has a stronger correlation with $W$ than a random matrix does: $\Delta W$ amplifies some features that are already present in $W$.
  2. Instead of repeating $W$'s top singular directions, $\Delta W$ only amplifies directions that are not emphasized in $W$.
  3. The amplification factor is rather large (about 21.5 ≈ 6.91/0.32 for $r = 4$).

8. Conclusion and Future Work

  • Fine-tuning enormous language models is prohibitively expensive, both in the hardware required and in the storage/switching cost of hosting independent instances for different tasks.
  • The paper proposes LoRA, an efficient adaptation strategy that neither introduces inference latency nor shortens the usable input sequence length, while retaining high model quality.
  • 💡 Importantly, because most model parameters are shared, it enables rapid task switching when deployed as a service.
  • Although the focus is on Transformer language models, the proposed principle is generally applicable to any neural network with dense layers.

🤔 Future Work
1) LoRA can be combined with other efficient adaptation methods, potentially providing an orthogonal improvement.
2) The mechanism behind fine-tuning or LoRA is still unclear: how are the features learned during pre-training transformed so that the model does well on downstream tasks?
3) The choice of which weight matrices to apply LoRA to mostly relies on heuristics; is there a more principled way?

🔖 References
- LoRA paper
- LoRA paper explainer 1
- LoRA paper explainer 2
- Prefix-tuning paper
