[Paper Read-Through] A Survey of Quantization Methods for Efficient Neural Network Inference

ODD · September 2, 2024

Basic concepts of Quantization

Uniform Quantization

Q(r) = Int(r/S) - Z

r: a real valued input
S: a real valued scaling factor
Z: an integer zero point

Dividing by S rescales r onto the integer grid, and Int (a rounding operation) then discretizes it.
Z shifts the center of the result.

Dequantization approximately recovers r: r_hat = S(Q(r) + Z)
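
As a minimal sketch of the two formulas above, the snippet below quantizes a few real values and maps them back. The clamp to a signed 8-bit range and the example values of S and Z are assumptions added for illustration; Python's built-in round (round-half-to-even) stands in for Int.

```python
def quantize(r, scale, zero_point, bits=8):
    """Q(r) = Int(r / S) - Z, clamped to the signed b-bit range (clamp is an assumption)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = round(r / scale) - zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """r_hat = S * (Q(r) + Z)."""
    return scale * (q + zero_point)


scale, zero_point = 0.1, 0  # illustrative values, not taken from the paper
values = [-1.23, 0.0, 0.04, 0.57, 3.14]
quantized = [quantize(v, scale, zero_point) for v in values]
recovered = [dequantize(q, scale, zero_point) for q in quantized]
```

Inside the clipping range, each recovered value differs from its input by at most half a step, S/2.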

Symmetric and Asymmetric Quantization

For uniform quantization, the choice of the scaling factor S is critical
S = (B - a) / (2^b - 1)

[a, B]: a bounded range of real values (clipping range)
b: the quantization bit width

The min/max of the signal is commonly used to set the clipping range
a = r_min, B = r_max

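If a = -B the quantization is symmetric; if a != -B it is asymmetric.
A common way to get a symmetric range is -a = B = max(|r_min|, |r_max|).

Symmetric quantization can use either the full INT range (e.g., [-128, 127] for INT8) or a restricted range (e.g., [-127, 127]); the full range is the more accurate of the two.
Symmetric quantization is widely used, especially for weights, because it fixes the zero point at Z = 0, which saves computation at inference time.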
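
The sketch below computes S and Z for both schemes from the same min/max, using S = (B - a) / (2^b - 1). The helper names and the way Z is chosen (so that r_min lands on the lowest signed integer under Q(r) = Int(r/S) - Z) are assumptions for illustration.

```python
def asymmetric_params(r_min, r_max, bits=8):
    """Clipping range [a, B] = [r_min, r_max]; S = (B - a) / (2^b - 1)."""
    scale = (r_max - r_min) / (2 ** bits - 1)
    qmin = -(2 ** (bits - 1))
    # Pick Z so that Int(r_min / S) - Z equals the lowest signed integer.
    zero_point = round(r_min / scale) - qmin
    return scale, zero_point


def symmetric_params(r_min, r_max, bits=8):
    """Clipping range with -a = B = max(|r_min|, |r_max|); the zero point is 0."""
    bound = max(abs(r_min), abs(r_max))
    scale = 2 * bound / (2 ** bits - 1)
    return scale, 0


asym_scale, asym_zero = asymmetric_params(-0.5, 2.0)
sym_scale, sym_zero = symmetric_params(-0.5, 2.0)
```

For a skewed range such as [-0.5, 2.0], the symmetric scale is larger than the asymmetric one, so part of the integer range goes unused. That is the price of keeping Z = 0.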

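Min/max is a popular way to set the scaling factor, but it is very sensitive to outliers. One proposed alternative uses percentiles instead of the min/max.
Another approach picks a and B to minimize the KL divergence between the real and quantized values (Kullback-Leibler divergence: the information lost when an ideal distribution is approximated by another one, here the quantized distribution)
Related post: A Beginner's Guide to Information Theory - KL Divergence Made Easy (original title: 초보를 위한 정보이론 안내서 - KL divergence 쉽게 보기)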
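
To see why outliers hurt min/max calibration, the sketch below compares the step size obtained from min/max with the one obtained from a percentile range. The 1st/99th percentile choice, the rounded-index percentile rule, and the synthetic data are all assumptions for illustration.

```python
def minmax_range(values):
    return min(values), max(values)


def percentile_range(values, lower_pct=1.0, upper_pct=99.0):
    """Clip to percentiles (simple rounded-index rule) instead of the raw min/max."""
    ordered = sorted(values)
    n = len(ordered)

    def pick(pct):
        return ordered[round(pct / 100 * (n - 1))]

    return pick(lower_pct), pick(upper_pct)


# 99 well-behaved values in [-1, 1] plus one large outlier.
data = [i / 49 - 1 for i in range(99)] + [50.0]
a_mm, b_mm = minmax_range(data)
a_pc, b_pc = percentile_range(data)
step_mm = (b_mm - a_mm) / 255  # 8-bit step size with min/max
step_pc = (b_pc - a_pc) / 255  # 8-bit step size with percentiles
```

The single outlier stretches the min/max range to 50, so almost all of the 8-bit grid is spent on values that never occur. The percentile range ignores it and gets a much finer step.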

Range Calibration Algorithms: Static vs Dynamic

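When should the clipping range be determined?

dynamic:
The range is computed at runtime for each activation map during inference. The exact range is known for every input, so accuracy is high, but the overhead is very large.

static:
The range is computed before inference and reused for every input.
The overhead is small, but accuracy is typically lower than with dynamic quantization.

For static quantization, the best range can be found on calibration inputs by minimizing MSE or entropy.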
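
A rough sketch of MSE-based static calibration: grid-search the symmetric clipping bound that minimizes the round-trip error on a calibration set. The grid search, the 4-bit restricted symmetric range, and the synthetic calibration data are assumptions for illustration, not the paper's procedure.

```python
def fake_quantize(r, bound, bits):
    """Symmetric uniform quantize-dequantize with clipping range [-bound, bound]."""
    levels = 2 ** (bits - 1) - 1
    scale = bound / levels
    q = max(-levels, min(levels, round(r / scale)))
    return q * scale


def mse(values, bound, bits):
    return sum((v - fake_quantize(v, bound, bits)) ** 2 for v in values) / len(values)


def calibrate_bound(values, bits=4, candidates=100):
    """Grid-search the clipping bound that minimizes MSE on calibration data."""
    max_abs = max(abs(v) for v in values)
    grid = [max_abs * (i + 1) / candidates for i in range(candidates)]
    return min(grid, key=lambda bound: mse(values, bound, bits))


# Hypothetical calibration set: values in [-1, 1] plus one outlier at 8.0.
calibration = [i / 100 - 1 for i in range(201)] + [8.0]
best = calibrate_bound(calibration)
```

The selected bound ends up below the raw maximum: clipping the outlier slightly costs less than the coarser grid that a wider range would impose on all the other values.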

Quantization Granularity

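At what granularity should quantization be performed, i.e., over which set of values is the scaling factor determined?
In a CNN, for example, different filters can have very different value ranges.
Several methods therefore determine the range within a specific unit.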

Layerwise Quantization

One clipping range is computed over all the filters in a convolution layer
drawback: the range can differ widely from filter to filter, so this is usually sub-optimal

Groupwise Quantization

Channels are grouped, and each group gets its own clipping range
useful when the parameter distribution within a single convolution/activation varies a lot
Q-BERT uses this approach for transformers
drawback: extra cost of handling the different scaling factors

Channelwise Quantization

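Uses a fixed clipping range for each convolutional filter, independent of the other channels
gives better quantization resolution and often higher accuracy
currently the standard choice for convolutional kernels, since the per-filter ranges add negligible overhead

Sub-channelwise Quantization

Takes groupwise quantization further: the clipping range is set for arbitrary groups of parameters inside a convolution or fully-connected layer
drawback: considerable overhead from the different scaling factors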
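
The effect of granularity can be sketched with two filters whose ranges differ by orders of magnitude. The filter values and the restricted symmetric scale max|w| / (2^(b-1) - 1) are assumptions for illustration.

```python
def symmetric_scale(values, bits=8):
    return max(abs(v) for v in values) / (2 ** (bits - 1) - 1)


def quant_error(values, scale):
    """Mean absolute round-trip error for a given scaling factor."""
    return sum(abs(v - round(v / scale) * scale) for v in values) / len(values)


# Two filters (output channels) with very different ranges.
filters = [
    [0.011, -0.024, 0.037, -0.042],
    [3.1, -7.9, 5.5, 12.7],
]

# Layerwise: one scale shared by every filter in the layer.
layer_scale = symmetric_scale([w for f in filters for w in f])
layerwise_errors = [quant_error(f, layer_scale) for f in filters]

# Channelwise: one scale per filter.
channel_scales = [symmetric_scale(f) for f in filters]
channelwise_errors = [quant_error(f, s) for f, s in zip(filters, channel_scales)]
```

With the shared layer scale, every weight of the small filter rounds to zero. Its own per-channel scale preserves it.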

Non-Uniform Quantization

The quantization levels need not be uniformly spaced.

Q(r) = X_i, if r in [Δ_i, Δ_{i+1})

X_i: the discrete quantization levels
Δ_i: the quantization steps (thresholds)

A set of intervals is fixed in advance, and r is mapped to the level assigned to the interval it falls in.

For a fixed bit-width this can give higher accuracy, because the levels can follow the distribution more closely and concentrate on the important value regions.
ex) many non-uniform quantization methods are designed for bell-shaped distributions, which often have long tails

A typical rule-based choice is a logarithmic distribution, where the quantization steps and levels grow exponentially
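
One simple instance of logarithmic quantization is rounding each value to the nearest power of two in the log domain. The function name and the exponent range below are assumptions for illustration.

```python
import math


def log2_quantize(r, min_exp=-7, max_exp=0):
    """Map r to sign(r) * 2^e, with e = round(log2|r|) clamped to [min_exp, max_exp]."""
    if r == 0:
        return 0.0
    exponent = round(math.log2(abs(r)))
    exponent = max(min_exp, min(max_exp, exponent))
    return math.copysign(2.0 ** exponent, r)


levels = [log2_quantize(v) for v in (0.9, 0.3, 0.1, -0.02, 0.004)]
# levels == [1.0, 0.25, 0.125, -0.015625, 0.0078125]
```

The levels are dense near zero and sparse toward the tails, which suits bell-shaped distributions.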

Binary-code-based quantization

r in R^n is quantized into m binary vectors by representing r~=Sigma_{i=1..m} a_i*b_i
a_i: the scaling factors (a_i in R)
b_i: in {-1, +1}^n
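
A rough sketch of how such a decomposition can be built: greedily binarize the residual, one term at a time. The greedy scheme is an assumption chosen for simplicity; the notes do not say how a_i and b_i are found.

```python
def binary_code_quantize(r, m):
    """Greedily approximate vector r as sum_i a_i * b_i with b_i in {-1, +1}^n."""
    residual = list(r)
    scales, codes = [], []
    for _ in range(m):
        b = [1 if x >= 0 else -1 for x in residual]
        # For b = sign(residual), the least-squares scale is the mean absolute value.
        a = sum(abs(x) for x in residual) / len(residual)
        scales.append(a)
        codes.append(b)
        residual = [x - a * s for x, s in zip(residual, b)]
    return scales, codes


def reconstruct(scales, codes):
    n = len(codes[0])
    return [sum(a * b[j] for a, b in zip(scales, codes)) for j in range(n)]


def sq_error(r, approx):
    return sum((x - y) ** 2 for x, y in zip(r, approx))


r = [0.9, -0.4, 0.1, -1.3]
errors = [sq_error(r, reconstruct(*binary_code_quantize(r, m))) for m in (1, 2, 3)]
```

Each additional binary vector reduces the squared error on this example, at the cost of one more scaling factor and n more bits.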

Optimization-based quantization

Recent work casts non-uniform quantization as an optimization problem => min_Q ||Q(r)-r||^2

Following this view, the quantizer itself can also be trained; such quantizers are called learnable quantizers (the quantization steps/levels are trained)

Clustering

Besides rule-based and optimization-based non-uniform quantization, clustering is also used (k-means, Hessian-weighted k-means)

However, non-uniform quantization schemes are hard to map onto general computation hardware => uniform quantization is used more often

Fine-tuning Methods

After quantization, the parameters often need to be adjusted

Quantization Aware Training (QAT)

QAT runs the forward & backward pass on the quantized model, and quantizes the parameters again after each gradient update
Performing the gradient update (i.e., the backward pass) in floating point is important, because accumulating gradients in quantized precision can lead to zero gradients or gradients with high error!

STE: how should the non-differentiable quantization operator be handled?

The rounding operation (i.e., Int(x)) is a piece-wise flat operator, so its gradient is zero almost everywhere
The Straight Through Estimator (STE) is therefore used to approximate the gradient
STE ignores the rounding operation and approximates it with the identity function
STE usually works well, but not in ultra low-precision settings such as binary quantization <- a theoretical justification for why it works: the coarse gradient approximation of STE correlates, in expectation, with the population gradient (for a proper choice of STE)
Alternatives to STE: stochastic neuron, combinatorial optimization, target propagation, or Gumbel-softmax
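
The role of STE can be sketched on a one-weight toy problem with a hand-written gradient. The grid size, learning rate, and squared-error loss are assumptions for illustration; real QAT would use an autograd framework.

```python
def fake_quant(w, scale=0.25):
    """Quantize-dequantize onto a uniform grid; the forward pass uses this value."""
    return round(w / scale) * scale


def train(use_ste, steps=50, lr=0.1):
    w, x, target = 0.1, 1.0, 1.0  # latent full-precision weight, input, label
    for _ in range(steps):
        w_q = fake_quant(w)
        grad_wq = 2 * (w_q * x - target) * x  # dLoss/dw_q for squared error
        # The true d(w_q)/dw is 0 almost everywhere; STE pretends it is 1.
        dwq_dw = 1.0 if use_ste else 0.0
        w -= lr * grad_wq * dwq_dw  # the update is applied in floating point
    return fake_quant(w)


with_ste = train(use_ste=True)
without_ste = train(use_ste=False)
```

With STE the latent weight keeps moving until its quantized value reaches the target. With the true zero gradient, the weight never leaves its starting point.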

Non-STE

Remove the need for differentiability
- Use a regularization operator to force the weights to be quantized, e.g., ProxQuant removes the rounding operation and uses W-shaped, non-smooth regularization functions instead
- Use pulse training to approximate the derivative at discontinuous points
- Replace the quantized weights with an affine combination of floating point and quantized parameters
- AdaRound: an adaptive rounding method used in place of round-to-nearest
=> most of these require a lot of tuning, so STE remains the most widely used

How about learning quantization parameters during QAT?

PACT: learns clipping ranges of activations under uniform quantization
QIT: also learns the quantization steps and levels
LSQ: learns the scaling factor for non-negative activations (e.g., ReLU)
LSQ+: generalizes LSQ to activations that produce negative values (e.g., swish and h-swish)

Post-Training Quantization (PTQ)

performs the quantization and the adjustments of weights without any fine-tuning
the overhead of PTQ is very low and often negligible
work in situations where data is limited or unlabeled
drawback: low accuracy (especially for low-precision quantization)

How to mitigate the accuracy degradation of PTQ?

idea: quantized weight values show an inherent bias in their mean and variance => bias correction methods
idea: equalizing the weight ranges between different layers or channels can reduce quantization errors
ACIQ: analytically computes the optimal clipping range and the channel-wise bitwidth; drawback: hard to efficiently deploy on HW
OMSE: removes channel-wise quantization on activations and optimizes the L2 distance between the quantized tensor and the corresponding floating point tensor
Outlier Channel Splitting (OCS): alleviates the adverse impact of outliers by duplicating and halving the channels containing outlier values
AdaRound: adaptive rounding method (restricts the changes of the quantized weights to be within +-1)
AdaQuant: generalizes AdaRound to allow the quantized weights to change as needed

Zero-shot Quantization (ZSQ)

Level1: No data and no finetuning (ZSQ + PTQ)

idea: equalizes the weight ranges and corrects bias errors <- based on the scale-equivariance property of (piece-wise) linear activation functions => sub-optimal for non-linear activations (e.g., BERT with GELU activation, MobileNetV3 with swish activation)

Level2: No data but requires finetuning (ZSQ + QAT)

idea: generate synthetic data similar to real data (using a GAN) and finetune with knowledge distillation from the full-precision counterpart <- fails to capture the internal statistics of the real data (the data is synthesized using only the final outputs of the model)
idea: uses the statistics stored in Batch Normalization
ZeroQ: uses synthetic data for sensitivity measurement => mixed-precision PTQ available

Stochastic Quantization

Quantization is mostly deterministic because of the Int function
That is, for a small weight update the rounding operation may always return the same value, so the update never changes the quantized weight
Stochastic quantization was proposed to address this by rounding probabilistically, which gives small updates a chance to take effect
Int(x) = floor(x) with probability ceil(x)-x or Int(x) = ceil(x) with probability x-floor(x).
Binary(x) = -1 with probability 1-sigmoid(x) or Binary(x) = 1 with probability sigmoid(x)
QuantNoise: quantizes a different random subset of weights during each forward & backward pass => lower-bit precision quantization without significant accuracy drop!!
- Due to the overhead of creating random numbers for every single weight update, stochastic quantization is not yet widely adopted in practice
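
A minimal sketch of stochastic rounding: round down with probability ceil(x) - x and up with probability x - floor(x), so the result equals x in expectation. The seed and the sample count are arbitrary choices for illustration.

```python
import math
import random


def stochastic_round(x, rng):
    """Round down with probability ceil(x) - x, up with probability x - floor(x)."""
    low = math.floor(x)
    return low + (1 if rng.random() < x - low else 0)


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [stochastic_round(0.3, rng) for _ in range(10_000)]
mean = sum(samples) / len(samples)
```

Deterministic rounding would map 0.3 to 0 every time. Here each sample is 0 or 1, and the average stays close to 0.3, so small updates are preserved on average.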

Advanced Concepts: Quantization Below 8 Bits

Simulated and Integer-only Quantization

Simulated: quantized as integer, but operation in floating point
Integer-only: both quantization & operation in integer

Simulated quantization may suffer less accuracy degradation, but integer-only quantization offers larger benefits in power consumption and computation speedup
idea: fuse the Batch Normalization layer into the preceding convolution layer <- limited to networks with ReLU activation => does not fit transformers, which use GELU, Softmax, and Layer Normalization
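
Batch Normalization folding can be sketched on a single output channel, with a scalar multiply standing in for the convolution. The function names and numeric values are assumptions for illustration.

```python
import math


def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def fold_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN statistics into the preceding layer's weight and bias."""
    factor = gamma / math.sqrt(var + eps)
    return weight * factor, (bias - mean) * factor + beta


# A scalar multiply stands in for one output channel of a convolution.
weight, bias = 0.8, 0.1
bn = dict(gamma=1.5, beta=-0.2, mean=0.3, var=0.04)
folded_w, folded_b = fold_bn(weight, bias, **bn)

x = 2.0
separate = batch_norm(weight * x + bias, **bn)  # conv followed by BN
fused = folded_w * x + folded_b                 # single fused layer
```

After folding, only one weight and one bias remain to be quantized, and the separate normalization step disappears at inference time.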

dyadic quantization:
- another class of integer-only quantization
- every scaling is expressed as a dyadic number x/2^n (x an integer) => all operations become integer addition/multiplication/bit shifting (no division)
- all additions must be performed at the same dyadic scale
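
A small sketch of dyadic rescaling: a real-valued scale is approximated offline as multiplier / 2^shift, and the runtime rescale then needs only an integer multiply, an add, and a bit shift. The 16-bit shift, the scale 0.0123, and the accumulator value are assumptions for illustration.

```python
def to_dyadic(real_scale, shift=16):
    """Offline step: approximate a real scale as multiplier / 2^shift."""
    return round(real_scale * (1 << shift)), shift


def dyadic_rescale(acc, multiplier, shift):
    """Runtime step: round(acc * multiplier / 2^shift) using only integer ops."""
    rounding = 1 << (shift - 1)  # adds 0.5 in fixed point before the shift
    return (acc * multiplier + rounding) >> shift


multiplier, shift = to_dyadic(0.0123)
acc = 25_000  # hypothetical INT32 accumulator from an integer matmul
requantized = dyadic_rescale(acc, multiplier, shift)
```

The integer result stays within one unit of the floating point product acc * 0.0123, without any division or floating point arithmetic at inference time.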

Mixed-Precision Quantization

Uniformly quantizing the whole model to low precision can degrade accuracy significantly
Mixed precision addresses this by giving each layer its own bit precision!
Choosing the bit setting is a search problem that can be tackled with RL or Neural Architecture Search (NAS) <- requires large computational resources and is sensitive to hyperparameters/initialization
Another line of work uses periodic function regularization so that the precision is determined automatically
HAWQ: a second-order operator (i.e., the Hessian <- second derivative f'') can measure the sensitivity of a layer to quantization!
- similar to the findings of Optimal Brain Damage
HAWQv2: extends this to mixed-precision activation quantization
HAWQv3: integer-only, hardware-aware quantization & finds the optimal bit precision
Related post: Deep dive into Optimization: Second-order method

Hardware Aware Quantization

Not every hardware platform gets the same speedup from a quantized model (hardware-dependent): on-chip memory, bandwidth, and cache hierarchy all matter
=> hardware-aware mixed-precision quantization that takes deployment into account, e.g., via RL, has been proposed

Distillation-Assisted Quantization

Model distillation can be used to boost quantization accuracy!
Model distillation: when training the student model, use the soft probabilities produced by the teacher rather than only the ground-truth class labels => the soft probabilities may carry more information about the input
loss: L = a*H(y, softmax(z_s)) + B*H(softmax(z_t, T), softmax(z_s, T))
(a, B: weighting coefficients, y: ground-truth label, H: cross-entropy, z_s/z_t: logits generated by the student/teacher, T: temperature)
Related post: [Paper Review] Distilling the Knowledge in a Neural Network
ideas: knowledge from intermediate layers/multiple teachers/the model itself
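
The loss above can be written out directly. The coefficient values (0.5, 0.5), the temperature 4.0, and the example logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(y, z_student, z_teacher, alpha=0.5, beta=0.5, temperature=4.0):
    hard = cross_entropy(y, softmax(z_student))
    soft = cross_entropy(softmax(z_teacher, temperature), softmax(z_student, temperature))
    return alpha * hard + beta * soft


y = [0.0, 1.0, 0.0]          # one-hot ground-truth label
teacher = [1.0, 4.0, 0.5]    # hypothetical teacher logits
loss_far = distillation_loss(y, [2.0, 0.0, 1.0], teacher)
loss_close = distillation_loss(y, [1.0, 3.8, 0.4], teacher)
```

A student whose logits track the teacher's gets a lower loss than one that disagrees with both the label and the teacher.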

Extreme Quantization

Binarization and Ternarization

Binarization: the quantized values are constrained to 1 bit => cuts the memory requirement by 32x! (the most extreme quantization)

binary (1-bit) and ternary (2-bit) operations can be computed efficiently with bit-wise arithmetic, e.g., XNOR followed by bit-counting
drawback: significant accuracy degradation
BinaryConnect:
- constrains the weights to +1 or -1
- keeps the weights as real values and binarizes them only during the forward and backward passes
- propagates gradients with STE
BNN: extends this idea by binarizing the activations as well as the weights
Binary Weight Network (BWN) / XNOR-Net: reach higher accuracy by applying a scaling factor to the weights, using +a, -a instead of +1, -1 <- a, B = argmin ||W-aB||^2 (W: weights, B: binarized weights)
Because many weights are close to zero, ternarization was proposed, which represents weights/activations with +1, 0, -1 (bit-wise operations still apply, as with binarization)
Ternary-Binary Network (TBN): shows that combining binary network weights with ternary activations can give an optimal tradeoff
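
The XNOR plus bit-counting trick can be sketched with plain Python integers as bit vectors. The packing convention (bit 1 encodes +1, bit 0 encodes -1) and the example vectors are assumptions for illustration.

```python
def pack(signs):
    """Pack a {-1, +1} vector into an int, with bit 1 encoding +1."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(word_a, word_b, n):
    """Dot product of two packed {-1, +1} vectors via XNOR and bit counting."""
    mask = (1 << n) - 1
    agreements = bin(~(word_a ^ word_b) & mask).count("1")
    # Each agreeing position contributes +1, each disagreeing one -1.
    return 2 * agreements - n


a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
result = binary_dot(pack(a), pack(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

Both computations give the same value, but the packed version replaces n multiply-accumulates with one XNOR and one bit count per machine word.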

Solutions for Accuracy Degradation

Extreme quantization causes severe accuracy degradation
Quantization Error Minimization
HORQ / ABC-Net: linear combination of multiple binary matrices i.e., W ~= a1B1 + ... + amBm

Related papers: Performance Guaranteed Network Acceleration via High-Order Residual Quantization (HORQ); Towards Accurate Binary Convolutional Neural Network (ABC-Net)

Improved Loss Function

Loss-aware binarization and ternarization (optimize the loss w.r.t. the binarized/ternarized weights)
Knowledge distillation is also used

Improved Training Method

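STE only propagates the gradients of weights and/or activations that lie in the range [-1,1]
BNN+: introduces a continuous approximation of the derivative of the sign function
idea: replacing the sign function with a smooth, differentiable function has also been proposed
Bi-Real Net: introduces identity shortcuts connecting activations to activations in consecutive blocks
Related post: Bi-Real Net review
DoReFa-Net: also quantizes the gradients, in addition to the weights and activations, to speed up training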

Vector Quantization

In other fields that predate ML, quantization aims at minimum error; in ML the goal is instead to find a reduced-precision representation with small loss
So the quantized weights/activations may differ a lot from the originals as long as that goal is met
idea: cluster the weights and use the centroids as the quantized values
- min_{c1...ck} Sigma_i ||w_i - c_j||^2 <- find the centroids that minimize the distance to the weights (c_j: the centroid that w_i is assigned to)
=> k-means clustering was shown to shrink the model by up to 8x with almost no accuracy degradation
Combining k-means based vector quantization with pruning and Huffman coding shrinks the model even further
Product Quantization: extends vector quantization by splitting the weight matrix into submatrices and applying vector quantization to each submatrix
Related post: Vector Quantization and Codebook Concepts, Summarized (original title: Vector Quantization과 Codebook 개념 정리)
Related post: [Paper Review] Autoregressive Image Generation using Residual Quantization (RQ-VAE-Transformer)
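
The clustering idea can be sketched with plain Lloyd's k-means on scalar weights. The linear centroid initialization, the iteration count, and the toy weights are assumptions for illustration; the function expects k >= 2.

```python
def kmeans_1d(weights, k, iterations=20):
    """Plain Lloyd's k-means on scalar weights; returns centroids and assignments."""
    lo, hi = min(weights), max(weights)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # linear init
    assignments = [0] * len(weights)
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda j: (w - centroids[j]) ** 2) for w in weights
        ]
        for j in range(k):
            members = [w for w, a in zip(weights, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = sum(members) / len(members)
    return centroids, assignments


weights = [-0.52, -0.48, -0.5, 0.01, -0.02, 0.0, 0.47, 0.51, 0.55]
codebook, indices = kmeans_1d(weights, k=3)
quantized = [codebook[i] for i in indices]
```

Only the codebook (k floating point centroids) and a small index per weight need to be stored, which is where the size reduction comes from.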

Quantization and Hardware Processors

Edge devices are constrained in compute, memory, and power, and some do not support floating point operations at all, so quantization is essential
ARM Cortex-M: some cores have no floating-point unit, so models must be quantized before deployment. The CMSIS-NN library helps apply fixed-point quantization to NN models
GAP-8: a RISC-V SoC for edge inference with a dedicated CNN accelerator; it supports only integer arithmetic
Google Edge TPU: a purpose-built ASIC chip. Unlike Google's Cloud TPU, it is designed for small, low-power devices and supports only 8-bit arithmetic
Quantization is not limited to these edge processors: on NVIDIA Turing GPUs (e.g., T4), the Tensor Cores are specialized for low-precision matrix multiplication

Future Directions for Research in Quantization

Quantization Software

Many software packages exist for deploying INT8 quantized models (Nvidia's TensorRT, TVM, ...)
However, software for lower bit-precision quantization is not widely available and in some cases does not exist

Hardware and NN Architecture Co-Design

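Even when QAT converges to a solution different from the original floating point one, the model can still reach good accuracy
This means there is freedom to change the model as well (according to previous work, changing the width of the NN architecture helps improve accuracy after quantization)
Building on this, future work could co-design the possible hardware configurations, the NN architecture, and the quantization scheme together, e.g., for FPGA deployment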

Coupled Compression Methods

Efficient NN architecture design, co-design of HW and NN architecture, pruning, knowledge distillation, etc. are not mutually exclusive, so they can be combined in the search for an optimal solution

Quantized Training

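Training with half-precision has worked very well, but pushing training further down to INT8 precision has been difficult
Several works have been proposed, but they generally require a lot of hyperparameter tuning
The core problem is that training at INT8 precision can become unstable and diverge
Addressing this challenge is an open direction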
