[24.ICLR] RETHINKING CHANNEL DIMENSIONS TO ISOLATE OUTLIERS FOR LOW-BIT WEIGHT QUANTIZATION OF LARGE LANGUAGE MODELS

YEOM JINSEOP · September 6, 2024

Quantization [Paper]

"To handle activation outliers, group the weights along the input-channel (IC) direction instead of the conventional output-channel (OC) direction."
"Ultimately, the paper proposes AdaDim (Adaptive Dimension), a method that decides for each layer whether to use per-IC or per-OC quantization."

When serving LLMs in small-batch inference settings (for example, on mobile devices), the large memory bottleneck becomes a problem.
Weight-only quantization at 4 bits or below remains a challenge because of large-magnitude activation outliers.
Observation
activation outliers affect the input dimension of the weight matrix,
so similarly grouping the weights in the IC direction can isolate outliers within a group.
Instead of the conventional per-output-channel grouping, the paper proposes forming quantization groups within each input channel (per-IC).

Large batch size
- Compute is the bottleneck.
- INT8 quantization is effective.
Small batch size
- Memory is the bottleneck.
- INT8 quantization is not effective.
Weight-only quantization
- A method for addressing the memory bottleneck.
- Activations are kept in high precision (e.g., FP16), while the weights are lowered to fewer bits (4 bits or below).
Because small-batch inputs are amply covered by the powerful compute capacity of modern GPUs,
this paper focuses on weight-only quantization, in order to accelerate memory I/O rather than compute.
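
To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how the weight footprint shrinks with bit width. The 7B parameter count and the helper name `weight_memory_gib` are illustrative assumptions, and the estimate ignores the per-group scales and zero-points that a real quantized format also stores.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Weight storage in GiB, ignoring scales, zero-points and other metadata."""
    return num_params * bits_per_weight / 8 / 2**30

n_params = 7_000_000_000  # hypothetical 7B-parameter model (assumption)

fp16 = weight_memory_gib(n_params, 16)
int4 = weight_memory_gib(n_params, 4)
int3 = weight_memory_gib(n_params, 3)

print(f"FP16: {fp16:.2f} GiB, 4-bit: {int4:.2f} GiB, 3-bit: {int3:.2f} GiB")
```

In the small-batch regime every generated token has to stream all of the weights from memory, so a footprint that is 4x smaller translates fairly directly into less memory traffic.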

Activation outliers are prevalent in modern LLMs, but they do not appear in every layer.
Weight sensitivity
- Measured with Fisher information, following the paper below.
- Memory Efficient Fine Tuning of Compressed Large Language Models via Sub 4 bit Integer Quantization, (Kim, 2023b)
- Approximated by the squared gradient, computed on a calibration set.
Where the largest activations appear
- Before the QKV attention projection
- Before the DOWN FFN projection
The hidden dimensions in which activation outliers occur correlate with sensitive rows in the weight matrix.
When no activation outliers are present, the weight matrix can contain a mixture of sensitive IC and OC channels, and this can also change across network depth.
- Therefore, a weight quantization scheme that can adapt to these differing sensitivities is needed.
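
The squared-gradient approximation of Fisher information mentioned above can be sketched on a toy linear layer. This is only an illustration under stated assumptions: the loss is a per-sample squared error against random stand-in targets (not an LLM's language-modeling loss), the layer is written as `Y = X @ W` with `W` of shape `(d_in, d_out)`, and the outlier index `3` is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 256, 8, 6

X = rng.normal(size=(n, d_in))   # stand-in calibration activations
X[:, 3] *= 20.0                  # hypothetical outlier hidden dimension
W = rng.normal(size=(d_in, d_out))
T = rng.normal(size=(n, d_out))  # stand-in targets (assumption)

# For L_s = 0.5 * ||x_s @ W - t_s||^2, the per-sample gradient is
# dL_s/dW[i, j] = X[s, i] * residual[s, j].
residual = X @ W - T

# Diagonal empirical Fisher: mean over samples of the squared gradient.
fisher = (X**2).T @ (residual**2) / n   # shape (d_in, d_out)

# Sensitivity aggregated per input channel (row) and per output channel (column).
row_sensitivity = fisher.sum(axis=1)
col_sensitivity = fisher.sum(axis=0)

print("most sensitive input channel:", int(np.argmax(row_sensitivity)))
```

In this toy setup the row that multiplies the scaled hidden dimension dominates the sensitivity map, which mirrors the row-wise pattern the notes describe for modules that sit behind activation outliers.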

Limitations of conventional per-channel quantization
- Most methods use per-OC (per-output-channel) quantization.
  When activation outliers occur, the amplification effect spreads to every quantization group.

Advantages of per-IC quantization
- Groups are formed within each IC (input channel).
- Creates a 1:1 mapping between hidden dimensions and quantization groups.
- Isolates the outlier effect inside a group.
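
The difference between the two grouping directions can be sketched with a small group-wise RTN (round-to-nearest) quantizer. This is a minimal sketch, not the paper's implementation: the function name `rtn_group_quantize`, the asymmetric min/max scheme, the layout `Y = X @ W` with `W` of shape `(d_in, d_out)`, and all sizes are assumptions made for illustration.

```python
import numpy as np

def rtn_group_quantize(W, bits=3, group_size=4, dim="oc"):
    """Quantize-then-dequantize W with asymmetric round-to-nearest (RTN).

    W has shape (d_in, d_out), so that Y = X @ W.
    dim="oc": each group lies inside one output channel (a column of W)
              and spans `group_size` consecutive input channels.
    dim="ic": each group lies inside one input channel (a row of W)
              and spans `group_size` consecutive output channels.
    """
    Wg = W.T if dim == "oc" else W   # rows of Wg are the channels that own the groups
    rows, cols = Wg.shape
    if cols % group_size != 0:
        raise ValueError("grouped axis must be divisible by group_size")
    G = Wg.reshape(rows, cols // group_size, group_size)
    lo = G.min(axis=-1, keepdims=True)
    hi = G.max(axis=-1, keepdims=True)
    levels = 2**bits - 1
    scale = np.maximum(hi - lo, 1e-12) / levels   # guard against constant groups
    q = np.clip(np.round((G - lo) / scale), 0, levels)
    W_hat = (q * scale + lo).reshape(rows, cols)
    return W_hat.T if dim == "oc" else W_hat

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
X = rng.normal(size=(32, 16))
X[:, 5] *= 30.0   # hypothetical outlier hidden dimension

for dim in ("oc", "ic"):
    W_hat = rtn_group_quantize(W, bits=3, group_size=4, dim=dim)
    err = np.linalg.norm(X @ W - X @ W_hat)
    print(dim, err)
```

With `dim="ic"`, all weights that multiply hidden dimension 5 share groups only with each other, whereas with `dim="oc"` they are scattered over one group per output column. Which direction gives the lower error on random weights like these is not guaranteed; the benefit described above concerns real LLM weights with sensitive rows.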

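Validating the effect of per-IC quantization
- Per-IC quantization is applied to the standard RTN (round-to-nearest) method.
- Per-IC quantization is used for the modules affected by activation outliers.
- Result: improved LLM perplexity and multi-task in-context learning ability.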

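Importance of adaptive quantization
- Applying per-IC indiscriminately to every layer degrades performance (from 44.54 to 44.38).
- Applying it selectively to the QKV and DOWN modules improves the MMLU score by 0.67% on average.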
Optimization objective
- The problem is formulated as a simple binary selection problem: choose whether the optimization parameter dim is the OC or the IC dimension.
- Measure: reconstruction error metric
  - $Q_{\text{dim}}$: per-OC (standard) or per-IC (proposed)
- To obtain $\bold{X}$, a small calibration set randomly sampled from the pretraining corpus (e.g., The Pile) is used.
- Because the search space of the dimension parameter contains only two options, the number of forward passes needed to determine the optimal dimension is very small.
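
For reference, the binary selection over the reconstruction error can be written roughly as follows. The squared L2 norm and the symbol $\bold{W}$ for the full-precision weights are assumptions made here for illustration; the notes only state that a reconstruction error metric over $Q_{\text{dim}}$ and $\bold{X}$ is used.

$$
\text{dim}^{*} = \underset{\text{dim} \in \{\text{OC},\, \text{IC}\}}{\arg\min} \; \left\lVert Q_{\text{dim}}(\bold{W})\,\bold{X} - \bold{W}\,\bold{X} \right\rVert_2^2
$$

Since $\text{dim}$ takes only two values, evaluating this amounts to quantizing the layer twice and comparing the two output errors on the calibration inputs.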
Augmenting RTN and GPTQ
- With RTN, the dimension is chosen as either per-IC or per-OC.
  - The full-precision weights are quantized independently per-IC and per-OC, and the dimension with the smaller reconstruction error is selected.
- Optionally, GPTQ is applied along the selected dimension.
  - GPTQ 1) computes the quantization error of the weight channels, and
    2) applies Hessian-based weight updates.
  - Per-IC RTN can be used in step 1).
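
The RTN-based dimension selection described above can be sketched as follows. This is a hedged illustration rather than the authors' code: `rtn` and `select_dim` are hypothetical helper names, the quantizer is a plain asymmetric min/max RTN, the layer is written as `Y = X @ W` with `W` of shape `(d_in, d_out)`, the random `X` stands in for calibration activations, and the optional GPTQ step is left out.

```python
import numpy as np

def rtn(W, bits, group_size, dim):
    """Asymmetric RTN quantize-dequantize; W is (d_in, d_out), Y = X @ W."""
    Wg = W.T if dim == "oc" else W   # rows of Wg own the groups
    r, c = Wg.shape
    G = Wg.reshape(r, c // group_size, group_size)
    lo = G.min(axis=-1, keepdims=True)
    hi = G.max(axis=-1, keepdims=True)
    levels = 2**bits - 1
    scale = np.maximum(hi - lo, 1e-12) / levels
    q = np.clip(np.round((G - lo) / scale), 0, levels)
    W_hat = (q * scale + lo).reshape(r, c)
    return W_hat.T if dim == "oc" else W_hat

def select_dim(W, X, bits=3, group_size=4):
    """Quantize W per-OC and per-IC independently and keep the dimension
    with the smaller layer-output reconstruction error."""
    ref = X @ W
    errors = {}
    for dim in ("oc", "ic"):
        out = X @ rtn(W, bits, group_size, dim)
        errors[dim] = float(np.linalg.norm(out - ref) ** 2)
    best = min(errors, key=errors.get)
    return best, errors

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
X = rng.normal(size=(64, 16))   # stand-in for calibration activations (assumption)

best, errors = select_dim(W, X)
print(best, errors)
```

Running `select_dim` once per linear layer gives the per-layer choice; a GPTQ pass could then reuse the selected grouping direction when it computes per-channel quantization errors.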