A Complete Guide to Quantization in PyTorch 2.7: How Are Models Stored and Executed?

Bean · July 17, 2025

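What Is Quantization?

Quantization converts a deep learning model's parameters and operations from float32 to low-precision integer types such as int8 or uint8. This shrinks the model (int8 weights take roughly a quarter of the space of float32 weights), speeds up computation, and makes efficient inference possible, particularly on mobile and edge devices.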
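To make the float-to-integer mapping concrete, here is a minimal sketch using `torch.quantize_per_tensor`. The `scale` and `zero_point` values are arbitrary choices for illustration; in a real workflow they come from observed value ranges.

```python
import torch

# Illustrative values; scale and zero_point are picked by hand here.
x = torch.tensor([-1.0, 0.0, 0.26, 1.0])
scale, zero_point = 0.1, 0

# Affine quantization: q = round(x / scale) + zero_point, stored as int8.
q = torch.quantize_per_tensor(x, scale=scale, zero_point=zero_point, dtype=torch.qint8)

print(q.int_repr())    # underlying int8 values: [-10, 0, 3, 10]
print(q.dequantize())  # approximately [-1.0, 0.0, 0.3, 1.0]
```

Note how 0.26 comes back as roughly 0.3: the rounding error introduced here is the accuracy cost that quantization trades for size and speed.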

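How PyTorch 2.7 Stores Quantized Models

1. Static Quantization

Weights are stored as quantized integers (torch.qint8 by default); activations are quantized at runtime (torch.quint8 by default)
The scale and zero_point values are stored alongside them
Quantized operators such as torch.ao.nn.quantized.Linear and torch.ao.nn.intrinsic.quantized.ConvReLU2d are used
Packed weights are held internally as a torch._C.ScriptObject; in .state_dict(), the weights appear as quantized tensors (torch.qint8 by default)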

Example code:

print(model.features[0].weight().dtype)  # torch.qint8 with the default qconfig

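2. Dynamic Quantization

Only the weights are quantized to int8; activations stay float32 between layers
Activations are quantized on the fly at execution time, just for the computation
In the saved model file, only the weights are stored as int8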

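Example code:

import torch

model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
# state_dict() holds packed (weight, bias) tuples, so read each layer's weight directly
for name, module in model.named_modules():
    if isinstance(module, torch.ao.nn.quantized.dynamic.Linear):
        print(name, module.weight().dtype)  # e.g. fc1 torch.qint8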
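The storage difference can be checked by serializing both versions of a model in memory. This is a sketch under assumptions: the two-layer network and the `state_dict_bytes` helper are made up for illustration, and a CPU quantization backend (such as fbgemm/x86 or qnnpack) must be available.

```python
import io

import torch
from torch.ao.quantization import quantize_dynamic


def state_dict_bytes(module: torch.nn.Module) -> int:
    """Hypothetical helper: serialized size of a module's state_dict in bytes."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# Toy float model, invented for this example.
float_model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# quantize_dynamic returns a converted copy by default (inplace=False).
dq_model = quantize_dynamic(float_model, {torch.nn.Linear}, dtype=torch.qint8)

float_size = state_dict_bytes(float_model)
dq_size = state_dict_bytes(dq_model)
print(f"float32: {float_size} bytes, dynamic int8: {dq_size} bytes")

# Inputs and outputs remain ordinary float tensors.
out = dq_model(torch.randn(4, 256))
print(out.dtype)  # torch.float32
```

The quantized file is smaller because each Linear weight is stored as int8 plus its quantization parameters, while the biases stay in float.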

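3. QAT (Quantization Aware Training)

Training runs in float, with fake-quantization modules simulating int8 rounding
Inference uses int8 operations after conversion
Typically the approach with the least accuracy loss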
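The QAT workflow can be sketched with the eager-mode API (`prepare_qat`, then training, then `convert`). The tiny model, random data, and three-step training loop are placeholders for illustration, and the qconfig is chosen from whatever quantized engine the local build reports.

```python
import torch
from torch.ao.quantization import (
    DeQuantStub,
    QuantStub,
    convert,
    get_default_qat_qconfig,
    prepare_qat,
)
import torch.ao.nn.quantized as nnq

# Use the engine this PyTorch build selected (e.g. "x86", "fbgemm", "qnnpack").
backend = torch.backends.quantized.engine

# Toy model: the stubs mark where tensors enter and leave the quantized region.
model = torch.nn.Sequential(QuantStub(), torch.nn.Linear(10, 5), DeQuantStub())
model.train()
model.qconfig = get_default_qat_qconfig(backend)
prepare_qat(model, inplace=True)  # inserts fake-quantization modules

# Placeholder training loop on random data: float math with simulated rounding.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for _ in range(3):
    inputs, targets = torch.randn(16, 10), torch.randn(16, 5)
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, convert to real int8 modules for inference.
model.eval()
quantized = convert(model)
with torch.no_grad():
    out = quantized(torch.randn(1, 10))

print(type(quantized[1]))           # quantized Linear module
print(quantized[1].weight().dtype)  # torch.qint8
```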

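Summary Table

Quantization method	Weight storage format	Activation handling
Static Quantization	`int8` (`torch.qint8`)	Quantized to `uint8` (`torch.quint8`) using calibrated scale/zero_point
Dynamic Quantization	`int8`	Kept as `float32`, quantized at runtime
QAT	`int8` (`torch.qint8`)	float with fake quantization during training, `int8` at inference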

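Is the Model Converted Back to Float at Inference?

❌ Answer: No!

A quantized PyTorch model does not restore its weights to float32; quantized layers perform integer arithmetic directly. This is a major advantage for both compute speed and memory use.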

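How Each Method Behaves at Inference

Quantization type	Weights at inference	Compute format	Notes
Static Quantization	`int8`/`uint8`	Mostly `int8`, accumulated in `int32`	Only the output is converted to float
Dynamic Quantization	`int8`	Activations quantized to `int8` at runtime, then integer matmul	Layer inputs and outputs stay float
QAT	`int8`	`int8` operations	Minimal accuracy loss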

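Example: Static Quantization inference code

import torch
from torch.ao.quantization import QuantStub, DeQuantStub, get_default_qconfig, prepare, convert

# QuantStub/DeQuantStub mark where float tensors enter and leave the int8 region
model = torch.nn.Sequential(QuantStub(), torch.nn.Linear(10, 5), DeQuantStub())
model.eval()

model.qconfig = get_default_qconfig("fbgemm")  # x86 backend
prepare(model, inplace=True)
with torch.no_grad():
    model(torch.randn(32, 10))  # calibration pass so observers can record ranges
convert(model, inplace=True)

with torch.no_grad():
    x = torch.randn(1, 10)  # float input; QuantStub quantizes it
    output = model(x)  # Linear runs on int8; DeQuantStub returns float32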
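One way to check that tensors really stay quantized between layers is to run a converted model stage by stage and inspect the dtypes. This is a minimal sketch, assuming a toy one-layer model and the quantized engine reported by the local PyTorch build.

```python
import torch
from torch.ao.quantization import DeQuantStub, QuantStub, convert, get_default_qconfig, prepare

# Use the engine this PyTorch build selected (e.g. "x86", "fbgemm", "qnnpack").
backend = torch.backends.quantized.engine

model = torch.nn.Sequential(QuantStub(), torch.nn.Linear(10, 5), DeQuantStub()).eval()
model.qconfig = get_default_qconfig(backend)
prepare(model, inplace=True)
with torch.no_grad():
    model(torch.randn(32, 10))  # calibration with random data, for illustration only
convert(model, inplace=True)

qlinear = model[1]
weight = qlinear.weight()
print(weight.dtype)                       # torch.qint8
print(weight.int_repr().dtype)            # torch.int8 (raw stored values)
print(qlinear.scale, qlinear.zero_point)  # output quantization parameters

# Run the three stages by hand to see where float ends and begins again.
with torch.no_grad():
    q_in = model[0](torch.randn(1, 10))  # float32 -> quint8
    q_mid = qlinear(q_in)                # stays quantized
    out = model[2](q_mid)                # quint8 -> float32

print(q_in.dtype, q_mid.is_quantized, out.dtype)
```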

Are There Cases Where float32 Operations Remain?

Yes! Not every operator in PyTorch can be quantized, so some layers keep running in float32.

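Quantized vs. Non-Quantized Operators

Operator type	Compute mode
`torch.ao.nn.quantized.Linear`, `Conv2d`	`int8`-based computation
Modules left unquantized, e.g. an `nn.LayerNorm` excluded with `qconfig = None`	Stay in `float32`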

for name, module in model.named_modules():
    print(name, type(module))

Example output:

features.0 <class 'torch.ao.nn.quantized.modules.conv.Conv2d'>
features.1 <class 'torch.nn.modules.activation.ReLU'>  # not swapped for a quantized module

Example Computation Flow

[int8 input]
↓
[Quantized Conv2d]
↓
[float32 LayerNorm] ← converted to float here
↓
[Quantized Linear]
↓
[float32 output]
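A mixed flow like this can be reproduced in eager mode by placing `DeQuantStub`/`QuantStub` pairs around the float layer and excluding it with `qconfig = None`. The `MixedNet` class and its layer sizes are hypothetical, chosen only to mirror the diagram.

```python
import torch
from torch import nn
from torch.ao.quantization import DeQuantStub, QuantStub, convert, get_default_qconfig, prepare
import torch.ao.nn.quantized as nnq


class MixedNet(nn.Module):
    """Hypothetical model: quantized Conv2d -> float LayerNorm -> quantized Linear."""

    def __init__(self):
        super().__init__()
        self.quant1 = QuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.dequant1 = DeQuantStub()
        self.norm = nn.LayerNorm([8, 6, 6])  # deliberately kept in float32
        self.quant2 = QuantStub()
        self.fc = nn.Linear(8 * 6 * 6, 4)
        self.dequant2 = DeQuantStub()

    def forward(self, x):
        x = self.conv(self.quant1(x))     # int8 region
        x = self.norm(self.dequant1(x))   # back to float for LayerNorm
        x = torch.flatten(x, 1)
        x = self.fc(self.quant2(x))       # int8 region again
        return self.dequant2(x)           # float32 output


model = MixedNet().eval()
model.qconfig = get_default_qconfig(torch.backends.quantized.engine)
model.norm.qconfig = None  # exclude this layer from quantization

prepare(model, inplace=True)
with torch.no_grad():
    model(torch.randn(4, 3, 8, 8))  # calibration on random data, illustration only
convert(model, inplace=True)

with torch.no_grad():
    out = model(torch.randn(1, 3, 8, 8))

for name, module in model.named_children():
    print(name, type(module).__name__)
```

Each dequantize/quantize pair costs extra work at runtime, which is why keeping float layers to a minimum matters for performance.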

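Optimization Tip: Fuse Operators

To get the most out of quantization, consider the following:

torch.ao.quantization.fuse_modules(model, [["conv", "relu"]], inplace=True)

Fuse sequences such as Conv + ReLU and Linear + ReLU so that each becomes a single quantizable operator
LayerNorm and Softmax cannot be fused this way; they have quantized modules of their own (torch.ao.nn.quantized.LayerNorm, torch.ao.nn.quantized.Softmax) but always run as separate operators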
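The effect of fusion on the module tree can be seen on a small example. The `ConvBlock` class is made up for illustration; the submodule names passed to `fuse_modules` must match the attribute names in your own model.

```python
import torch
from torch import nn
from torch.ao.quantization import fuse_modules
import torch.ao.nn.intrinsic as nni


class ConvBlock(nn.Module):
    """Hypothetical block with separate conv and relu submodules."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv(x))


model = ConvBlock().eval()

# Returns a fused copy by default (inplace=False); the original is untouched.
fused = fuse_modules(model, [["conv", "relu"]])

print(type(fused.conv).__name__)  # ConvReLU2d
print(type(fused.relu).__name__)  # Identity

# Fusion changes the module structure, not the float result.
x = torch.randn(1, 3, 8, 8)
same = torch.allclose(model(x), fused(x))
print(same)
```

After a later `prepare`/`convert`, the fused `ConvReLU2d` maps to a single quantized operator instead of a quantized conv followed by a separate ReLU.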

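Summary

A quantized model runs most of its inference in int8 and does not convert back to float.
Some operators stay in float32, which introduces int8 ↔ float conversions at their boundaries.
To keep computation efficient, stick to supported operators where possible and make active use of fuse_modules().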
