[Key Points] [22.08] Optimal Brain Quantizer

YEOM JINSEOP · August 5, 2024

2022 LLMs PTQ second-order weight Q

Quantization [Paper Key Points]


Key idea

Quantize the weight whose quantization affects the total loss the least, then update the remaining weights to compensate (greedily, row by row).
Repeat this until every weight in each row has been quantized.

Key equations

Total loss: $||\bold{W}_l\bold{X}_l - \hat{\bold{W}}_l\bold{X}_l||^2_2$
where, for a given layer $l$, $\bold{W}_l$ are the weights, $\bold{X}_l$ the layer inputs, and $\hat{\bold{W}}_l$ the quantized weights.

Hessian (of the loss with respect to one row of weights, identical for every row): $\bold{H} = 2\bold{X}\bold{X}^T$
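The reason the method can work row by row is that the layer-wise loss splits into one independent quadratic term per row, each with the same Hessian. The sketch below checks this numerically. The shapes, the random data, and the rounding-based quantizer are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))        # (rows, columns) of one layer
X = rng.normal(size=(6, 50))       # (columns, samples) layer inputs
W_hat = np.round(W * 4) / 4        # hypothetical quantizer: uniform grid, step 0.25

# Layer-wise loss ||W X - W_hat X||_2^2 (squared Frobenius norm).
total_loss = np.sum((W @ X - W_hat @ X) ** 2)

# The same loss, row by row: a row with error delta = w - w_hat contributes
# delta^T (X X^T) delta = 0.5 * delta^T H delta, with H = 2 X X^T.
H = 2 * X @ X.T
delta = W - W_hat
row_losses = 0.5 * np.einsum("rc,cd,rd->r", delta, H, delta)

print(total_loss, row_losses.sum())   # the two numbers agree up to rounding
```

Since `H` depends only on the inputs `X`, it is computed once and reused for every row.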
In each row, pick the weight to quantize (the one whose quantization affects the loss the least).
In each row, update the remaining weights that have not been quantized yet.
In each row, remove the $q$-th row and $q$-th column from $H^{-1}$ (Hessian update).
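For reference, the three steps can be written out as below. This is my reading of the formulation in the OBQ paper, not a quotation. Here $\text{quant}(\cdot)$ denotes rounding to the quantization grid, $\boldsymbol{\delta}_q$ is the update added to the remaining weights, and the subscript $-q$ means that row and column $q$ are dropped.

$$
w_q = \operatorname{argmin}_{w_q} \frac{(\text{quant}(w_q) - w_q)^2}{[\bold{H}^{-1}]_{qq}}
$$

$$
\boldsymbol{\delta}_q = -\frac{w_q - \text{quant}(w_q)}{[\bold{H}^{-1}]_{qq}} \cdot \bold{H}^{-1}_{:,q}
$$

$$
\bold{H}^{-1}_{-q} = \left(\bold{H}^{-1} - \frac{1}{[\bold{H}^{-1}]_{qq}} \bold{H}^{-1}_{:,q}\,\bold{H}^{-1}_{q,:}\right)_{-q}
$$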

Overall algorithm

Visualization of the code implementation

🟨 Suppose we quantize the layer below.
Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)

Compute the weight $W$ and the inverse Hessian $H^{-1}$ of the Conv2d layer above.
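A minimal sketch of what this step could look like for the layer above, written in NumPy so that it is self-contained. The calibration batch, the manual im2col, and the small damping term are my assumptions and are not taken from the original code.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 3, 7, 7))   # stands in for the Conv2d(3, 64, 7x7) weight
x = rng.normal(size=(4, 3, 16, 16))       # hypothetical calibration batch

# Flatten each filter into one row: W has 64 rows and 3*7*7 = 147 columns.
W = weight.reshape(64, -1)

# im2col: pad by 3, slide a 7x7 window, keep every 2nd position (stride 2).
xp = np.pad(x, ((0, 0), (0, 0), (3, 3), (3, 3)))
win = sliding_window_view(xp, (7, 7), axis=(2, 3))[:, :, ::2, ::2]  # (N, C, oh, ow, 7, 7)
n, c, oh, ow = win.shape[:4]
X = win.transpose(1, 4, 5, 0, 2, 3).reshape(c * 7 * 7, -1)          # (147, N*oh*ow)

# Hessian of the row-wise loss and its inverse.
H = 2 * X @ X.T
damp = 0.01 * np.mean(np.diag(H))         # assumed damping for numerical stability
H_inv = np.linalg.inv(H + damp * np.eye(H.shape[0]))

print(W.shape, X.shape, H_inv.shape)
```

With this layout, `W @ X` reproduces the convolution output, so each of the 64 filters becomes one row that can be quantized independently against the same 147 x 147 `H_inv`.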

Multiple rows are processed in parallel (parallel = 32).
- Quantize a weight and update the remaining weights to compensate for the quantization error
- Compute $H^{-1}$ and its diagonal (needed later for $[H^{-1}]_{qq}$)
Select as $w_q$ the weight whose quantization affects the loss the least.

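 # err: squared quantization error per weight; diag: diagonal of H^{-1}, per row
 scores = err / diag  # score = estimated impact on the loss of quantizing each weight
 j = torch.argmin(scores, 1)  # column index to quantize in each row, shape (32,)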

Update all remaining weights other than the one just quantized.

Update $H^{-1}$: remove the $q$-th row and $q$-th column from $H^{-1}$.
Repeat the steps above until every column of each row has been quantized.
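The $H^{-1}$ update is what keeps the loop cheap: the inverse of the reduced Hessian is obtained from the current $H^{-1}$ by one rank-1 correction instead of a fresh matrix inversion. The sketch below checks that claim numerically. The helper name `remove_from_inverse` and the random data are hypothetical.

```python
import numpy as np

def remove_from_inverse(H_inv, q):
    """Return the inverse of H with row/column q removed, using only H_inv."""
    col = H_inv[:, q:q + 1]
    row = H_inv[q:q + 1, :]
    updated = H_inv - (col @ row) / H_inv[q, q]   # zeroes out row q and column q
    keep = np.arange(H_inv.shape[0]) != q
    return updated[np.ix_(keep, keep)]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 40))
H = 2 * X @ X.T
H_inv = np.linalg.inv(H)

q = 2
keep = np.arange(5) != q
direct = np.linalg.inv(H[np.ix_(keep, keep)])     # invert the reduced Hessian from scratch
print(np.allclose(remove_from_inverse(H_inv, q), direct))
```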

Each parallel batch takes $O(\text{parallel rows}(32) \times \text{columns})$ operations.
In the code, columns that are zero in every row are excluded from the computation (a zero weight is unaffected by quantization).
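Putting the steps together, the whole procedure for a single row can be sketched as follows. This is a simplified NumPy version under my own assumptions, not the code walked through above: it handles one row at a time instead of 32 in parallel, does not skip all-zero columns, and uses a hypothetical uniform-grid `quantize` helper.

```python
import numpy as np

def quantize(w, step=0.25):
    """Hypothetical quantizer: round to a uniform grid with the given step."""
    return np.round(w / step) * step

def obq_row(w, H_inv, step=0.25):
    """Greedily quantize one row w, compensating on the remaining weights."""
    w = w.astype(float).copy()
    H_inv = H_inv.copy()
    remaining = np.ones(w.size, dtype=bool)       # columns not yet quantized
    for _ in range(w.size):
        diag = np.diag(H_inv)
        err = (quantize(w, step) - w) ** 2
        # Score every remaining column; finished columns get +inf.
        scores = np.where(remaining, err / np.where(remaining, diag, 1.0), np.inf)
        q = int(np.argmin(scores))                # least impact on the loss
        w_q = quantize(w[q], step)
        # Shift the other weights to compensate for the error at column q.
        w -= (w[q] - w_q) / H_inv[q, q] * H_inv[:, q]
        w[q] = w_q
        # Eliminate column q from H^{-1}; zeroing stands in for deleting it.
        H_inv -= np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
        H_inv[q, :] = 0.0
        H_inv[:, q] = 0.0
        remaining[q] = False
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 200))
w = rng.normal(size=8)
H_inv = np.linalg.inv(2 * X @ X.T)

w_obq = obq_row(w, H_inv)
w_rtn = quantize(w)                               # plain round-to-nearest baseline

def loss(w_hat):
    return np.sum(((w - w_hat) @ X) ** 2)

print(loss(w_obq), loss(w_rtn))
```

Zeroing row and column `q` instead of physically deleting them keeps the array shapes fixed, which is convenient when many rows are batched together. The greedy procedure is not guaranteed to beat round-to-nearest on every input, so the two printed losses are best read as a comparison on this particular example.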
