Deep Learning Quantization: A Summary

What is Quantization?

  • A model compression technique that represents model parameters with lower-bit types, speeding up computation and memory access
    • Typically converts 32-bit floating-point values to 8-bit integers (a worked sketch of this mapping follows at the end of this list)
      - The default data type in PyTorch and TensorFlow is fp32
  • Types of quantization techniques

    1. Post Training Quantization (PTQ): quantizing a model after it has been trained
    - Applies quantization after training. Accuracy drops little for models with many parameters, but the approach is not well suited to small models with few parameters.
    - Includes Dynamic Quantization and Static Quantization.
    2. Quantization Aware Training (QAT): simulating quantization during training
    - Performs quantization during training. Fake-quantization nodes are inserted to simulate how the model will behave once it is quantized.
  • Advantages
    • Smaller model size (more efficient storage)
    • Lower RAM memory bandwidth usage
    • Faster inference
    • Lower power consumption
  • Disadvantages
    • Model accuracy degrades.
  • Compression and performance by quantization method (chart based on tflite)
  • Example quantization code (PyTorch)
    import torch.quantization
    quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
  • References
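  • Worked sketch: fp32 to int8 affine mapping. A minimal illustration of the conversion described above; the tensor values and the unsigned 0..255 range are arbitrary choices for this sketch, and torch.quantize_per_tensor is shown only to confirm that PyTorch applies the same mapping.
    import torch

    # example fp32 tensor to be quantized
    x = torch.tensor([-1.0, 0.0, 0.5, 2.0])

    # affine (asymmetric) quantization to unsigned 8-bit:
    # scale spreads the observed float range over the 256 integer levels,
    # zero_point is the integer that represents float 0.0
    qmin, qmax = 0, 255
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(qmin - round(x.min().item() / scale.item()))

    # quantize: scale, round, shift by zero_point, clamp into [qmin, qmax]
    x_int = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.uint8)
    # dequantize: only an approximation of the original floats is recovered
    x_hat = (x_int.to(torch.float32) - zero_point) * scale

    # PyTorch's built-in per-tensor affine quantization uses the same mapping
    x_q = torch.quantize_per_tensor(x, scale=scale.item(), zero_point=zero_point, dtype=torch.quint8)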

Quantization Techniques

1. Dynamic Quantization

  • The simplest quantization technique
  • Only the model weights are quantized ahead of time
  • Activations are quantized dynamically at inference time
    • Activations are read from and written to memory in floating point.
    • At inference time, weights are converted from int8 back to float32 and computed with floating-point kernels where needed. Activations are always stored in floating point; for operators that support quantized kernels, activations are dynamically quantized to 8 bits before processing and dequantized again afterwards.
    • Weights are quantized after training.
    • Activations are quantized dynamically at inference time.
  • Faster loading of the model into memory (see the size-comparison sketch after the diagram below)
  • Limited improvement in computation speed (extra computation is still required inside the inference kernels)
  • Inference is supported only on CPU (depending on the framework and its version, it may also run on GPU)
    • With PyTorch 1.7.0, we can do dynamic quantization on x86-64 and aarch64 CPUs; NVIDIA GPUs are not yet supported for PyTorch dynamic quantization.
  • Well suited to models such as BERT, where loading the weights into memory dominates execution time
  • Example
    # create a model instance
    model_fp32 = M()
    # create a quantized model instance
    model_int8 = torch.quantization.quantize_dynamic(
        model_fp32,  # the original model
        {torch.nn.Linear},  # a set of layers to dynamically quantize
        dtype=torch.qint8)  # the target dtype for quantized weights
  • Diagram
# original model
# all tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                 /
linear_weight_fp32

# dynamically quantized model
# linear and LSTM weights are in int8
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                     /
   linear_weight_int8
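
  • Size-comparison sketch for dynamic quantization. A minimal illustration of the smaller-model-size / faster-loading points above; the toy Linear stack, layer sizes, and file names are arbitrary choices for this sketch, not from the original post.
    import os
    import torch
    import torch.quantization

    # a toy fp32 model; the real gains show up on large Linear-heavy models such as BERT
    model_fp32 = torch.nn.Sequential(
        torch.nn.Linear(1024, 1024),
        torch.nn.ReLU(),
        torch.nn.Linear(1024, 1024),
    )

    # dynamic quantization: int8 weights, activations quantized on the fly at inference
    model_int8 = torch.quantization.quantize_dynamic(
        model_fp32, {torch.nn.Linear}, dtype=torch.qint8)

    def serialized_size_mb(model, path):
        # save the state_dict and measure the file size on disk
        torch.save(model.state_dict(), path)
        size_mb = os.path.getsize(path) / 1e6
        os.remove(path)
        return size_mb

    print("fp32 model:", serialized_size_mb(model_fp32, "tmp_fp32.pt"), "MB")
    print("int8 model:", serialized_size_mb(model_int8, "tmp_int8.pt"), "MB")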

2. Static Quantization / Post-Training Static Quantization

  • Both the weights and the activations are quantized ahead of time

    • Activations are fused with their preceding layers
    • Activation ranges are determined during calibration
      • *calibration: literally "setting the scale"; think of it as a fine adjustment step
  • To quantize activations, static quantization fuses each activation with its preceding layer
    - Fusion = merging layers that each perform a separate function into a single layer.
    - Merging layers such as activation and convolution reduces the context-switching overhead caused by moving data between layers.
    - It also allows operations that were processed sequentially to be processed in parallel.

    - Common fusions include [Conv, Relu], [Conv, BatchNorm], [Conv, BatchNorm, Relu], and [Linear, Relu].

  • Calibration fine-tunes the quantization parameters to minimize accuracy loss

    • Calibration controls how the quantization ranges are set.
    • e.g. histogram-based calibration (see the observer sketch at the end of this section)
    • Calibration is done with a representative dataset.
  • Improved computation speed

  • tflite supports inference on CPU and GPU / PyTorch supports CPU only

  • PyTorch example

    import torch
    
    # define a floating point model where some layers could be statically quantized
    class M(torch.nn.Module):
        def __init__(self):
            super(M, self).__init__()
            # QuantStub converts tensors from floating point to quantized
            self.quant = torch.quantization.QuantStub()
            self.conv = torch.nn.Conv2d(1, 1, 1)
            self.relu = torch.nn.ReLU()
            # DeQuantStub converts tensors from quantized to floating point
            self.dequant = torch.quantization.DeQuantStub()
    
        def forward(self, x):
            # manually specify where tensors will be converted from floating
            # point to quantized in the quantized model
            x = self.quant(x)
            x = self.conv(x)
            x = self.relu(x)
            # manually specify where tensors will be converted from quantized
            # to floating point in the quantized model
            x = self.dequant(x)
            return x
    
    # create a model instance
    model_fp32 = M()
    
    # model must be set to eval mode for static quantization logic to work
    model_fp32.eval()
    
    # attach a global qconfig, which contains information about what kind
    # of observers to attach. Use 'fbgemm' for server inference and
    # 'qnnpack' for mobile inference. Other quantization configurations such
    # as selecting symmetric or asymmetric quantization and MinMax or L2Norm
    # calibration techniques can be specified here.
    model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')
    
    # Fuse the activations to preceding layers, where applicable.
    # This needs to be done manually depending on the model architecture.
    # Common fusions include `conv + relu` and `conv + batchnorm + relu`
    model_fp32_fused = torch.quantization.fuse_modules(model_fp32, [['conv', 'relu']])
    
    # Prepare the model for static quantization. This inserts observers in
    # the model that will observe activation tensors during calibration.
    model_fp32_prepared = torch.quantization.prepare(model_fp32_fused)
    
    # calibrate the prepared model to determine quantization parameters for activations
    # in a real world setting, the calibration would be done with a representative dataset
    input_fp32 = torch.randn(4, 1, 4, 4)
    model_fp32_prepared(input_fp32)
    
    # Convert the observed model to a quantized model. This does several things:
    # quantizes the weights, computes and stores the scale and bias value to be
    # used with each activation tensor, and replaces key operators with quantized
    # implementations.
    model_int8 = torch.quantization.convert(model_fp32_prepared)
    
    # run the model, relevant calculations will happen in int8
    res = model_int8(input_fp32)
  • Diagram

    # original model
    # all tensors and computations are in floating point
    previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                        /
        linear_weight_fp32
    
    # statically quantized model
    # weights and activations are in int8
    previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                        /
      linear_weight_int8
  • Well suited to CNN models, where activations have a large impact on inference

  • References
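  • Calibration observer sketch. A minimal illustration of the histogram-based calibration mentioned above; it reuses model_fp32 from the example and only shows where an explicit observer choice would replace get_default_qconfig('fbgemm') (which, in recent PyTorch versions, is itself histogram-based for activations). torch.quantization.MinMaxObserver could be swapped in for simple min/max range tracking.
    import torch
    from torch.quantization import QConfig, HistogramObserver, default_per_channel_weight_observer

    # explicit observer choice instead of get_default_qconfig('fbgemm'):
    # HistogramObserver derives scale/zero_point from a histogram of observed activations
    custom_qconfig = QConfig(
        activation=HistogramObserver.with_args(reduce_range=True),
        weight=default_per_channel_weight_observer,
    )
    model_fp32.qconfig = custom_qconfig
    # the rest of the flow (fuse_modules -> prepare -> calibrate -> convert) is unchanged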

3. Quantization Aware Training

  • The weights and activations are quantized during training

  • Fake-quantization modules/nodes are placed wherever quantization will occur.

  • Fake-quantization performs clamping and rounding (see the sketch at the end of this section)

    • *clamping = restricting values to a fixed numeric range
  • During quantization-aware training, the actual output range (min/max) of the activations is also tracked

  • Once quantization-aware training is finished, the floating-point model can be converted to a quantized integer model using the information stored in the fake-quantization modules.
  • Achieves higher accuracy than dynamic or static quantization

  • Training can run on CPU or GPU / inference is supported only on CPU

  • Used for CNN models where dynamic or static quantization does not reach sufficient accuracy

  • PyTorch example

    import torch
    
    # define a floating point model where some layers could benefit from QAT
    class M(torch.nn.Module):
        def __init__(self):
            super(M, self).__init__()
            # QuantStub converts tensors from floating point to quantized
            self.quant = torch.quantization.QuantStub()
            self.conv = torch.nn.Conv2d(1, 1, 1)
            self.bn = torch.nn.BatchNorm2d(1)
            self.relu = torch.nn.ReLU()
            # DeQuantStub converts tensors from quantized to floating point
            self.dequant = torch.quantization.DeQuantStub()
    
        def forward(self, x):
            x = self.quant(x)
            x = self.conv(x)
            x = self.bn(x)
            x = self.relu(x)
            x = self.dequant(x)
            return x
    
    # create a model instance
    model_fp32 = M()
    
    # model must be set to train mode for QAT logic to work
    model_fp32.train()
    
    # attach a global qconfig, which contains information about what kind
    # of observers to attach. Use 'fbgemm' for server inference and
    # 'qnnpack' for mobile inference. Other quantization configurations such
    # as selecting symmetric or asymmetric quantization and MinMax or L2Norm
    # calibration techniques can be specified here.
    model_fp32.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
    
    # fuse the activations to preceding layers, where applicable
    # this needs to be done manually depending on the model architecture
    model_fp32_fused = torch.quantization.fuse_modules(model_fp32,
        [['conv', 'bn', 'relu']])
    
    # Prepare the model for QAT. This inserts observers and fake_quants in
    # the model that will observe weight and activation tensors during calibration.
    model_fp32_prepared = torch.quantization.prepare_qat(model_fp32_fused)
    
    # run the training loop (not shown)
    training_loop(model_fp32_prepared)
    
    # Convert the observed model to a quantized model. This does several things:
    # quantizes the weights, computes and stores the scale and bias value to be
    # used with each activation tensor, fuses modules where appropriate,
    # and replaces key operators with quantized implementations.
    model_fp32_prepared.eval()
    model_int8 = torch.quantization.convert(model_fp32_prepared)
    
    # run the model, relevant calculations will happen in int8
    res = model_int8(input_fp32)
  • Diagram

    # original model
    # all tensors and computations are in floating point
    previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                          /
        linear_weight_fp32
    
    # model with fake_quants for modeling quantization numerics during training
    previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                               /
       linear_weight_fp32 -- fq
    
    # quantized model
    # weights and activations are in int8
    previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                         /
       linear_weight_int8
  • References
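  • Fake-quantization sketch. A minimal numeric illustration of the clamping and rounding mentioned above; the scale, zero_point, and int8 range below are arbitrary illustration values (in real QAT they come from the observers inserted by prepare_qat).
    import torch

    x = torch.randn(4)
    scale, zero_point = 0.1, 0     # illustration values; QAT observes/learns these
    qmin, qmax = -128, 127         # int8 range

    # manual fake quantization: scale, round, shift, clamp in the integer domain,
    # then map straight back to float so the forward pass "sees" the quantization error
    x_int = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    x_manual = (x_int - zero_point) * scale

    # PyTorch's built-in op performs the same simulated quantization
    x_fq = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, qmin, qmax)
    print(x_manual, x_fq)  # the two results should match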

Runtime Environments for Quantized Models

Today, PyTorch supports the following backends for running quantized operators efficiently:

x86 CPUs with AVX2 support or higher (without AVX2 some operations have inefficient implementations)

ARM CPUs (typically found in mobile/embedded devices)

At the moment PyTorch doesn’t provide quantized operator implementations on CUDA - this is the direction for future work. Move the model to CPU in order to test the quantized functionality.

However, quantization aware training occurs in full floating point and can run on either GPU or CPU. Quantization aware training is typically only used in CNN models when post training static or dynamic quantization doesn’t yield sufficient accuracy. This can occur with models that are highly optimized to achieve small size (such as Mobilenet).
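
A small sketch of how the backend ties into the examples above: the qconfig string ('fbgemm' for x86 servers, 'qnnpack' for ARM/mobile) should match the quantized engine selected at runtime. This assumes a PyTorch build with both backends compiled in.

    import torch

    # engines compiled into this PyTorch build, e.g. ['none', 'fbgemm', 'qnnpack']
    print(torch.backends.quantized.supported_engines)

    # select the engine that matches the qconfig used when preparing the model:
    # 'fbgemm' targets x86 CPUs with AVX2, 'qnnpack' targets ARM CPUs
    torch.backends.quantized.engine = 'fbgemm'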

Choosing a Quantization Approach

Frameworks with Quantization Support

  • tflite (TensorFlow Lite); a fuller conversion sketch follows this list
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
  • PyTorch API
    • Uses the torch.quantization.* library
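  • TFLite conversion sketch. A minimal end-to-end version of the one-liner above, using TFLite's default (dynamic-range) quantization; the tiny stand-in Keras model and the output file name are arbitrary choices for this sketch.
    import tensorflow as tf

    # stand-in for an already-trained Keras model
    keras_model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(5,))])

    # with Optimize.DEFAULT and no representative dataset, TFLite applies
    # dynamic-range quantization: weights become int8, activations stay float
    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open("model_dynamic_range.tflite", "wb") as f:
        f.write(tflite_model)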