QLoRA (Quantized Low-Rank Adaptation)

김동준 · October 21, 2025

QLoRA (Quantized Low-Rank Adaptation): A Complete Guide

What Is QLoRA?

QLoRA (Quantized Low-Rank Adaptation) is a technique for fine-tuning large language models efficiently[1][2][3]. Where conventional full fine-tuning can require hundreds of GB of GPU memory, QLoRA combines 4-bit quantization with LoRA (Low-Rank Adaptation), so that models with billions of parameters can be fine-tuned on a single consumer GPU[4][5].

Its headline result is fine-tuning a 65B-parameter model on a single 48GB GPU while matching the performance of 16-bit full fine-tuning[6][7]. This puts large-model fine-tuning within reach of individual researchers[8].

LoRA Basics

Understanding QLoRA starts with understanding LoRA[9][10][11].

The Core Idea of LoRA

LoRA stands for Low-Rank Adaptation. It keeps the pretrained model's weights frozen and trains only small adapter matrices[12][13].

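The core principle is as follows:

Freeze the original weights: the pretrained weight matrix $W$ is kept fixed
Low-rank decomposition: the weight update $\Delta W$ is factored into two small matrices, $B$ and $A$
Formula: $W' = W + \Delta W = W + BA$
- $W$ : the original weight, of size $d \times d$
- $B$ : a matrix of size $d \times r$
- $A$ : a matrix of size $r \times d$
- $r$ : the rank (usually a small value such as 4, 8, or 16)

Because $r \ll d$, the number of parameters to train drops dramatically[10][14].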

Parameter savings:

Original matrix: 12,288 × 12,288 = 150,994,944 parameters
LoRA (r=8): 12,288 × 8 + 8 × 12,288 = 196,608 parameters
Reduction: 99.87% (about 768× fewer parameters)
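The arithmetic above can be checked with a few lines of Python. This is only a sketch for a single square $d \times d$ matrix; the helper name `lora_param_counts` is made up for illustration.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x d weight matrix."""
    full = d * d             # updating W directly
    lora = d * r + r * d     # B is d x r, A is r x d
    return full, lora


full, lora = lora_param_counts(d=12_288, r=8)
reduction = 1 - lora / full

print(f"full: {full:,}")                # full: 150,994,944
print(f"LoRA: {lora:,}")                # LoRA: 196,608
print(f"ratio: {full // lora}x")        # ratio: 768x
print(f"reduction: {reduction:.2%}")    # reduction: 99.87%
```

Doubling the rank doubles the adapter size, so the ratio for r=16 would be 384× under the same assumptions.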

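Advantages of LoRA

Memory efficiency: only 0.5-5% of the total parameters are trained[2][15]
Training speed: fewer parameters to update, so training is faster[16]
Less overfitting: training a small parameter set tends to be more stable[2]
Modularity: adapters can be swapped to serve multiple tasks[15]
No added inference latency: the adapter can be merged into the base model[16][13]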
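The last point follows directly from the formula $W' = W + BA$: applying the adapter as a separate path and folding it into the weight give the same output. The sketch below illustrates this with tiny random matrices in plain Python. The sizes and helper names are arbitrary, and it leaves out the `lora_alpha / r` scaling that PEFT applies to the update, to stay close to the formula as written.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d, r = 6, 2
W = rand_matrix(d, d)   # frozen pretrained weight
B = rand_matrix(d, r)   # adapter factor, d x r
A = rand_matrix(r, d)   # adapter factor, r x d
x = [random.uniform(-1, 1) for _ in range(d)]

# Training-time view: base path plus a separate low-rank path.
base = matvec(W, x)
low_rank = matvec(B, matvec(A, x))
y_adapter = [b + l for b, l in zip(base, low_rank)]

# Deployment view: fold BA into W once, then use a single matrix.
BA = matmul(B, A)
W_merged = [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]
y_merged = matvec(W_merged, x)

max_diff = max(abs(a - b) for a, b in zip(y_adapter, y_merged))
print(f"max difference: {max_diff:.2e}")
```

The difference is at the level of floating-point rounding, which is why a merged model needs no extra computation at inference time.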

The Three Core Techniques of QLoRA

QLoRA adds three techniques on top of LoRA[3][17][18]:

1. 4-bit NormalFloat (NF4) Quantization

NF4 is the central method in QLoRA[8][19][20].

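How It Works

NF4 builds on quantile quantization[20][21]:

Normal-distribution assumption: neural network weights are generally approximately normally distributed[18]
Quantile computation: quantiles are estimated from the cumulative distribution function of the weight distribution[20]
Equal allocation: each quantization bin is assigned the same number of values[22]
Normalization: the data is normalized to the range [-1, 1] before quantization[20][23]

NF4 is an information-theoretically optimal data type for normally distributed weights, representing them effectively with only 2^4 = 16 values[3][18][8].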
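The quantile idea can be sketched with the standard library alone. The code below takes the midpoint of each of 16 equal-probability bins of a standard normal and rescales them into [-1, 1]. It is a simplified illustration, not the exact NF4 table: the actual data type is constructed slightly differently (for instance, it includes an exact zero). The function name `quantile_levels` is hypothetical.

```python
from statistics import NormalDist


def quantile_levels(bits: int = 4) -> list[float]:
    """Equal-probability quantiles of N(0, 1), rescaled into [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist()
    # Midpoint (in probability) of each of the n equal-probability bins.
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


levels = quantile_levels(4)
gaps = [b - a for a, b in zip(levels, levels[1:])]

print(len(levels))
print([round(v, 3) for v in levels])
print(f"gap near zero: {min(gaps):.3f}, gap at the tails: {max(gaps):.3f}")
```

The levels are packed densely around zero, where most weights lie, and spread out toward the tails. That is the property that distinguishes a quantile-based type from evenly spaced 4-bit integers.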

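Operating Mechanism

Storage: model weights are stored in 4-bit NF4 format
Computation: during the forward and backward passes, weights are dequantized to BFloat16 for computation[20][21]
LoRA adapters: kept in 16-bit BrainFloat (BF16) to preserve precision[19][21]

Quantization constant: the absmax value that each block of weights is divided by during NF4 quantization. It is stored separately from the 4-bit weights, one per block, and is needed again for dequantization[20].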
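The store-then-dequantize cycle can be sketched as block-wise absmax quantization. The code below is a plain-Python illustration under stated assumptions: the block size of 64 comes from the text, the 16-level codebook is a simplified normal-quantile table rather than the exact NF4 values, and `quantize_block` / `dequantize_block` are hypothetical names, not library functions.

```python
import random
from statistics import NormalDist

BLOCK_SIZE = 64

# Simplified 16-level codebook in [-1, 1] (normal quantiles); not the exact NF4 table.
_q = [NormalDist().inv_cdf((i + 0.5) / 16) for i in range(16)]
_scale = max(abs(v) for v in _q)
CODEBOOK = [q / _scale for q in _q]


def quantize_block(block):
    """Return (4-bit indices, absmax constant) for one block of weights."""
    absmax = max(abs(w) for w in block) or 1.0   # avoid dividing by zero
    indices = [
        min(range(16), key=lambda i: abs(CODEBOOK[i] - w / absmax))
        for w in block
    ]
    return indices, absmax


def dequantize_block(indices, absmax):
    """Look up each code and rescale by the block's constant."""
    return [CODEBOOK[i] * absmax for i in indices]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4 * BLOCK_SIZE)]

blocks = [weights[i:i + BLOCK_SIZE] for i in range(0, len(weights), BLOCK_SIZE)]
packed = [quantize_block(b) for b in blocks]                             # what is stored
restored = [w for idx, c in packed for w in dequantize_block(idx, c)]    # used for compute

max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"blocks: {len(packed)}, constants stored: {len(packed)}")
print(f"max round-trip error: {max_err:.5f}")
```

Each block carries exactly one constant, which is the overhead that Double Quantization later reduces.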

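2. Double Quantization

Double Quantization quantizes the quantization constants themselves to save additional memory[8][21][24].

Memory Savings

Block-wise quantization produces one quantization constant per block[20]
Quantizing these constants again saves about 0.37 bits per parameter on average[21][24]
With a block size of 64: 32/64 = 0.5 bits drops to 8/64 + 32/(64×256) = 0.127 bits[8]
For a 65B model this saves about 3GB of memory[24]

Because it gives finer control over the memory footprint without degrading performance, it helps fit particular model sizes (33B/65B) onto particular GPUs (24GB/48GB)[21].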
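The figures above can be reproduced directly. The sketch below assumes, as in the formula, FP32 first-level constants with a block size of 64, 8-bit re-quantized constants, and one FP32 second-level constant per 256 first-level constants. The function name is hypothetical.

```python
def constant_overhead_bits(block_size=64, second_block_size=256, double_quant=True):
    """Bits per parameter spent on quantization constants."""
    if not double_quant:
        return 32 / block_size                              # one FP32 constant per block
    first_level = 8 / block_size                            # constants re-quantized to 8 bits
    second_level = 32 / (block_size * second_block_size)    # one FP32 constant per 256 constants
    return first_level + second_level


before = constant_overhead_bits(double_quant=False)
after = constant_overhead_bits(double_quant=True)
saved_bits = before - after
saved_gb_65b = 65e9 * saved_bits / 8 / 1e9   # bits -> bytes -> GB

print(f"before: {before:.3f} bits/param")                # before: 0.500 bits/param
print(f"after:  {after:.3f} bits/param")                 # after:  0.127 bits/param
print(f"saved:  {saved_bits:.3f} bits/param")            # saved:  0.373 bits/param
print(f"65B model: about {saved_gb_65b:.1f} GB saved")   # 65B model: about 3.0 GB saved
```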

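3. Paged Optimizers

Paged Optimizers prevent the OOM (Out Of Memory) errors that occur when GPU memory runs out[8][25][24].

How It Works

Uses NVIDIA unified memory[21][24]
When GPU memory runs short, optimizer state is automatically moved to CPU RAM[8][25]
It is paged back to the GPU when needed for the optimizer update step[8][25]
Handles the memory spikes that occur when processing long sequences[3][26]

This makes single-machine fine-tuning of large models feasible, which was traditionally difficult because of memory spikes during gradient checkpointing[21].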

QLoRA Performance and Memory Efficiency

Memory Requirements Compared

Model size	Full Fine-tuning	LoRA (16-bit)	QLoRA (4-bit)
7B	120 GB	16 GB	6 GB
13B	240 GB	32 GB	12 GB
30B	600 GB	64 GB	24 GB
65B	1,200 GB	160 GB	48 GB
70B	1,200 GB	160 GB	48 GB

[5][27][19]
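As a rough sanity check on the table, the memory taken by the weights alone is the parameter count times bits per parameter. The sketch below computes only that lower bound. The table's figures are higher because training also needs activations, gradients, adapter weights, and optimizer state. The 0.127 bits of constant overhead is the double-quantization figure quoted earlier, and the function name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for size_b in (7, 13, 30, 65, 70):
    n = size_b * 1e9
    fp16 = weight_memory_gb(n, 16)
    nf4 = weight_memory_gb(n, 4 + 0.127)   # 4-bit weights + double-quantized constants
    print(f"{size_b:>3}B  16-bit: {fp16:6.1f} GB   4-bit: {nf4:5.1f} GB")
```

For a 65B model the 16-bit weights alone come to 130 GB, while the 4-bit weights come to roughly 33.5 GB, which is what leaves headroom on a 48GB card.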

Fine-tuning Methods Compared

Aspect	Full Fine-tuning	LoRA	QLoRA
Trainable parameters	~100%	0.5-5%	0.5-5%
Memory efficiency	Low	High	Very high
Training speed	Slow	Fast	Medium
Accuracy	Baseline	Nearly identical	Nearly identical

[2][5][28]

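Performance Validation

The experiments in the QLoRA paper show the following[8][21]:

NF4 fully recovers 16-bit LoRA performance
FP4 trails 16-bit BrainFloat LoRA by only about 1 percentage point[8]
Double Quantization saves additional memory without degrading performance[21]
On the MMLU benchmark, 4-bit QLoRA matches 16-bit full fine-tuning[8]

Guanaco, the best-performing model family, was fine-tuned with QLoRA on the OASST1 dataset and reached 99.3% of ChatGPT's performance on the Vicuna benchmark[7][21].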

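Practical Examples

Example 1: Fine-tuning a 13B Model on a Consumer GPU

Setup: fine-tuning a 13B model on a single RTX 4090 (24GB VRAM)[19][29]

Conventional methods:

Full fine-tuning: needs 240GB → not possible
LoRA (16-bit): needs 32GB → not possible

With QLoRA:

Memory required: 12GB → possible
Training time: within 24 hours
Performance: nearly identical to 16-bit fine-tuning[19]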

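Example 2: Fine-tuning a 65B Model on a Single GPU

Hardware: one 48GB GPU (for example, an RTX A6000; the A100 ships in 40GB and 80GB variants)[6][7][29]

Results:

Successful fine-tuning of a 65B-parameter model
A job that traditionally needs hundreds of GB completed in 48GB
Memory usage: 75-80% lower[29]

Code setup (Hugging Face)[3][23]: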

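import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# NF4 quantization settings
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load the model in 4-bit
    bnb_4bit_quant_type="nf4",              # NF4 quantization type
    bnb_4bit_use_double_quant=True,         # enable double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
)

# Load the model. Llama 2 has no 65B checkpoint (65B was the original LLaMA),
# so the 70B Llama 2 model is used here as the closest size.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=nf4_config,
    device_map="auto",
)

# Prepare the quantized model for training, as PEFT recommends for k-bit models
model = prepare_model_for_kbit_training(model)

# LoRA settings
lora_config = LoraConfig(
    r=8,                                    # rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],    # target the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # report how few parameters are trainable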

Example 3: Cost-Efficient Cloud Fine-tuning

Scenario: fine-tuning a 30B model on RunPod[29]

Cost comparison:

Conventional method: needs 8x A100 (640GB), $20-30 per hour
QLoRA method: 1-2 RTX A6000 (48GB), $0.5-1.0 per hour
Cost reduction: 20-60×
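The 20-60× range is simply the ratio of the hourly prices at their two extremes, as the short sketch below shows. Note that this compares hourly rates only. Since QLoRA trains more slowly, the saving on a complete run would be smaller than the hourly ratio suggests. The function name is hypothetical.

```python
def cost_ratio(baseline_per_hour: float, qlora_per_hour: float) -> float:
    """How many times cheaper the QLoRA setup is per hour."""
    return baseline_per_hour / qlora_per_hour


low = cost_ratio(baseline_per_hour=20, qlora_per_hour=1.0)    # least favorable pairing
high = cost_ratio(baseline_per_hour=30, qlora_per_hour=0.5)   # most favorable pairing
print(f"hourly cost ratio: {low:.0f}x to {high:.0f}x")        # hourly cost ratio: 20x to 60x
```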

Practical tips[29]:

Spot instances can add a further discount of about 70%
Save checkpoints so an interrupted run can be resumed
Use Community Cloud for cheaper GPUs

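Considerations When Using QLoRA

When Should You Use QLoRA?

Choose QLoRA when[5][30]:

You need to fine-tune very large models (30B-65B+)
GPU memory is limited (24GB or less)
You need to minimize cost
You need to experiment with several models

Choose LoRA when[5][30]:

The model fits comfortably in GPU memory (7B-13B)
You prefer a simpler setup
You need maximum training speed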

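Pros and Cons

Pros[3][18][27]:

Maximum memory savings: 75-80% lower memory usage
Accessibility: large models can be fine-tuned on consumer GPUs
Performance retained: accuracy nearly identical to 16-bit fine-tuning
Cost efficiency: experiments can run on inexpensive hardware

Cons and caveats[19][30][31]:

Training is 50-200% slower than LoRA (dequantization overhead)[32]
Quantization can cause a slight accuracy loss (usually under 1%)[31]
Setup is more involved (quantization parameters need tuning)[30]
Quantization support may be limited on some GPUs[30]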

Practical Tips

Recommended Hyperparameters

LoRA parameters[33][34]:

r (rank): 4-16 (8 is usually a good starting point)[14]
lora_alpha: 16-32
target_modules: ["q_proj", "k_proj", "v_proj", "o_proj"] (attention layers)
lora_dropout: 0.05

Training parameters[34]:

Learning rate: 2e-4 (per the QLoRA paper)
Batch size: 1-4 per device
Gradient accumulation: 4-8 steps
Optimizer: paged_adamw_8bit

Additional Memory Optimization Techniques

Gradient Checkpointing: saves memory by recomputing intermediate activations[19][34]
Gradient Accumulation: gets the effect of a large batch from small batches[34]
Mixed Precision Training: uses BF16/FP16[23]
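The training parameters and memory techniques listed above map onto Hugging Face `TrainingArguments` fields. The sketch below picks one value from each recommended range. The output path is a placeholder, and the `transformers` import is kept inside a function so the snippet can be read and run without the library installed.

```python
# Values taken from the lists above; one point chosen inside each range.
training_kwargs = dict(
    output_dir="qlora-out",            # hypothetical output path
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="paged_adamw_8bit",          # paged optimizer backed by bitsandbytes
    gradient_checkpointing=True,       # recompute activations in the backward pass
    bf16=True,                         # needs a GPU with BF16 support; otherwise use fp16=True
)


def build_training_args():
    """Create TrainingArguments lazily so this snippet imports without transformers."""
    from transformers import TrainingArguments
    return TrainingArguments(**training_kwargs)


effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(f"effective batch size per device: {effective_batch}")   # effective batch size per device: 16
```

Gradient accumulation is what lets a per-device batch of 4 behave like a batch of 16 for the optimizer, at the cost of more forward and backward passes per update.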

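QLoRA vs LoRA: Final Comparison

Aspect	LoRA	QLoRA
Memory usage	Reduced (adapters only)	Greatly reduced (4-bit + adapters)
Speed	Fast	Slower (dequantization overhead)
Accuracy	Nearly identical to full fine-tuning	Nearly identical to full fine-tuning
Setup difficulty	Easy, widely supported	More involved, requires PEFT + bitsandbytes
Best use case	Mid-size models, typical GPUs	Large models, limited GPU memory
GPU requirement (7B)	16 GB	6 GB
GPU requirement (65B)	160 GB	48 GB

[5][30]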

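Conclusion

QLoRA has helped democratize fine-tuning of large language models[3][8]. Through three core techniques, 4-bit NF4 quantization, Double Quantization, and Paged Optimizers, it cuts memory usage dramatically while preserving performance[18][19].

With billion-parameter models now tunable on a single consumer GPU, researchers and small organizations can build their own customized models[2][29]. By combining LoRA's efficiency with the memory savings of quantization, QLoRA is becoming a standard tool for modern LLM fine-tuning[5][27].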

Sources
[1] Mastering QLoRa : A Deep Dive into 4-Bit Quantization and ... https://manalelaidouni.github.io/4Bit-Quantization-Models-QLoRa.html
[2] LoRA vs. QLoRA https://www.redhat.com/en/topics/ai/lora-vs-qlora
[3] What is QLoRA? | QLoRA – Weights & Biases https://wandb.ai/sauravmaheshkar/QLoRA/reports/What-is-QLoRA---Vmlldzo2MTI2OTc5
[4] QLoRA: Quantized Low-Rank Adapter https://wikidocs.net/252932
[5] LoRA vs. QLoRA: Efficient fine-tuning techniques for LLMs https://modal.com/blog/lora-qlora
[6] [2305.14314] QLoRA: Efficient Finetuning of Quantized LLMs https://arxiv.org/abs/2305.14314
[7] artidoro/qlora - Efficient Finetuning of Quantized LLMs https://github.com/artidoro/qlora
[8] QLoRA: Efficient Finetuning of Quantized LLMs 논문 리뷰 https://pred0771.tistory.com/244
[9] What is LoRA (Low-Rank Adaption)? https://www.ibm.com/think/topics/lora
[10] LoRA(Low-Rank Adaptation)를 파악해보자아앗!! - Day to_day https://day-to-day.tistory.com/69
[11] LoRA: Low-Rank Adaptation of Large Language Models https://arxiv.org/abs/2106.09685
[12] Low-rank adaptation (LoRA) fine tuning https://www.ibm.com/docs/en/watsonx/w-and-w/2.1.0?topic=tuning-lora-fine
[13] LoRA (Low-Rank Adaptation) - Hugging Face LLM Course https://huggingface.co/learn/llm-course/chapter11/4
[14] Fundamentals of LoRA and low‑rank fine-tuning https://nebius.com/blog/posts/fine-tuning/lora-low-rank-adaptation
[15] Fine-Tuning using LoRA and QLoRA https://www.geeksforgeeks.org/deep-learning/fine-tuning-using-lora-and-qlora/
[16] Efficient Fine-Tuning with LoRA for LLMs https://www.databricks.com/blog/efficient-fine-tuning-lora-guide-llms
[17] Quantized low-rank adaptation (QLoRA) fine tuning https://www.ibm.com/docs/en/watsonx/w-and-w/2.1.0?topic=tuning-qlora-fine
[18] What is QLoRA (Quantized Low-Rank Adapter)? https://www.geeksforgeeks.org/deep-learning/what-is-qlora-quantized-low-rank-adapter/
[19] AI 모델 경량화 시리즈 5편: QLoRA (Quantized LoRA) https://machineindeep.tistory.com/66
[20] [QLoRA] QLoRA: Efficient Finetuning of Quantized LLMs https://velog.io/@kaiba0514/QLoRA-QLoRA-Efficient-Finetuning-of-Quantized-LLMs
[21] [yongggg's] QLoRA: Efficient Finetuning of Quantized LLMs ... https://yongggg.tistory.com/46
[22] QLoRA란? https://velog.io/@nellcome/QLoRA%EB%9E%80
[23] 거대 언어 모델 튜닝을 위한 미니멀리스트 접근법: 2부 - QLoRA ... https://blog.kbanknow.com/82
[24] [nlp][논문리뷰]QLoRA: Efficient Finetuning of Quantized LLMs https://velog.io/@0like/nlp%EB%85%BC%EB%AC%B8%EB%A6%AC%EB%B7%B0QLoRA-Efficient-Finetuning-of-Quantized-LLMs-8%EC%A3%BC%EC%B0%A8
[25] QLoRA: Efficient Finetuning of Quantized LLMs https://onebyonebyone.tistory.com/207
[26] [2025-2] 박지원 - QLORA https://blog.outta.ai/320
[27] In-depth guide to fine-tuning LLMs with LoRA and QLoRA https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora
[28] LLM의 다양한 SFT 기법: Full Fine-Tuning, PEFT (LoRA, QLoRA) https://ariz1623.tistory.com/348
[29] How can I fine-tune large language models on a budget ... https://www.runpod.io/articles/guides/how-to-fine-tune-large-language-models-on-a-budget
[30] 🔎 LoRA and QLoRA: Efficient Fine-Tuning for Large ... https://watercrawl.dev/blog/LoRA-and-QLoRA
[31] PEFT vs. QLoRA: Faster Fine-Tuning Methods https://www.artech-digital.com/blog/peft-vs-qlora-faster-fine-tuning-methods
[32] NeMo QLoRA Guide https://docs.nvidia.com/nemo-framework/user-guide/24.12/sft_peft/qlora.html
[33] Fine-Tuning Llama 3 with LoRA: Step-by-Step Guide https://neptune.ai/blog/fine-tuning-llama-3-with-lora
[34] Fine-Tune Gemma using Hugging Face Transformers and ... https://ai.google.dev/gemma/docs/core/huggingface_text_finetune_qlora
[35] QA-LoRA: Quantization-Aware Low-Rank Adaptation of ... https://arxiv.org/abs/2309.14717
[36] LLM Optimization: LoRA and QLoRA https://towardsdatascience.com/llm-optimization-lora-and-qlora/
[37] QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large ... https://miniolife.tistory.com/32
[38] QLoRA를 활용한 LLM 파인튜닝 https://1119wj.tistory.com/25
[39] LoRA and QLoRA recommendations for LLMs https://cloud.google.com/vertex-ai/generative-ai/docs/model-garden/lora-qlora
[40] LoRA & QLoRA Fine-tuning Explained In-Depth https://www.youtube.com/watch?v=t1caDsMzWBk
[41] LoRA vs. QLoRA performance comparison #511 https://github.com/ml-explore/mlx-examples/issues/511
[42] [논문 리뷰] QDyLoRA: Quantized Dynamic Low-Rank ... https://www.themoonlight.io/ko/review/qdylora-quantized-dynamic-low-rank-adaptation-for-efficient-large-language-model-tuning
[43] Top 5 AI Fine-Tuning Tools 2025: LoRA vs QLoRA vs Full https://www.index.dev/blog/top-ai-fine-tuning-tools-lora-vs-qlora-vs-full
[44] [Paper Review] QLoRA: Efficient Finetuning of Quantized LLMs https://moomyung-lab.tistory.com/11
[45] 거대 언어 모델 튜닝을 위한 미니멀리스트 접근법: 1부 - PEFT ... https://blog.kbanknow.com/81
[46] QLoRA: Efficient Finetuning of Quantized LLMs - JHIN.LOG https://nogan.tistory.com/50
[47] 다양한 파인튜닝 기법 - Lapitel https://lapitel.tistory.com/6
[48] [논문리뷰] QLORA: Efficient Finetuning of Quantized LLMs https://velog.io/@kameleon43/%EB%85%BC%EB%AC%B8%EB%A6%AC%EB%B7%B0-QLORA-Efficient-Finetuning-of-Quantized-LLMs
[49] [QLoRA 리뷰] Qlora: Efficient finetuning of quantized llms https://sofar-sogood.tistory.com/entry/QLoRA-%EB%A6%AC%EB%B7%B0-Qlora-Efficient-finetuning-of-quantized-llms
[50] Generate and run fine-tuned models with LoRA adapters https://onnxruntime.ai/docs/genai/tutorials/finetune.html
[51] [논문리뷰] LoRA: Low-Rank Adaptation of Large Language ... https://kimjy99.github.io/%EB%85%BC%EB%AC%B8%EB%A6%AC%EB%B7%B0/lora/
[52] [2311.12023] LQ-LoRA: Low-rank Plus Quantized Matrix ... https://arxiv.org/abs/2311.12023
[53] LoRA: Low-Rank Adaptation of Large Language Models https://minair.tistory.com/75
[54] [논문 퀵 리뷰] LQ-LoRA: Low-rank Plus Quantized Matrix ... https://liner.com/ko/review/lqlora-lowrank-plus-quantized-matrix-decomposition-for-efficient-language-model
[55] How to fine-tune a model using LoRA (step by step) https://www.youtube.com/watch?v=8N9L-XK1eEU
[56] [논문 리뷰] LoRA-Mini : Adaptation Matrices Decomposition ... https://www.themoonlight.io/ko/review/lora-mini-adaptation-matrices-decomposition-and-selective-training
[57] [논문] LoRA: Low-Rank Adaptation of Large Language Models https://jeongwooyeol0106.tistory.com/106
[58] a simple vanilla example of how to fine tune Llama 2 using ... https://github.com/vllm-project/vllm/issues/997
[59] LoRA (Low-Rank Adaptation of Large Language Models) https://mari970.tistory.com/47
[60] LQ-LoRA: Low-rank plus Quantized Matrix Decomposition ... https://openreview.net/forum?id=xw29VvOMmU
[61] QLoRA—How to Fine-tune an LLM on a Single GPU (w https://www.youtube.com/watch?v=XpoKB3usmKc
[62] Finetuning Large language models using QLoRA https://www.kaggle.com/code/neerajmohan/finetuning-large-language-models-using-qlora
[63] Helpful VRAM requirement table for qlora, lora, and full ... https://www.reddit.com/r/LocalLLaMA/comments/18o5u0k/helpful_vram_requirement_table_for_qlora_lora_and/
[64] LoRA vs Full Fine-tuning: An Illusion of Equivalence https://arxiv.org/html/2410.21228v1
[65] QLoRA: Efficient Finetuning of Quantized LLMs () https://underflow101.tistory.com/73
[66] Memory-efficient Fine-tuning with with QLoRA https://heidloff.net/article/qlora/
