[2016 ICLR] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

Hyungseop Lee·2023년 10월 28일

[Paper Review] Efficient and Scalable

목록 보기

1/30

Paper Information

Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
authors : Song Han, Huizi Mao, William J. Dally
subject : ICLR 2016

0. Abstract

Neural network는 computation과 memory 측면에서 집약적이기 때문에
HW resource가 제한되어 있는 embedded system에서 작동시키기에 어려움이 있다.
그래서 우리는 이러한 limitation을 해결하기 위해
deep compression을 소개할 것이다.
deep compression은 3단계 pipeline으로 되어 있다.
1. pruning
2. trained quantization
3. Huffman coding
Pruning :
오로지 important connection들만 학습함으로써 network를 pruning(가지치기)한다.
Quantization :
그 다음으로 weight sharing을 수행하기 위해 weight들을 quantization한다.
Huffman coding :
마지막으로, 우리는 Huffman coding을 적용시킨다.
이러한 3단계의 deep compression을 통해
1. model이 off-chip DRAM memory 보다 on-chip SRAM cache에 적합하도록 할 수 있다.
2. application size와 download bandwidth가 제약된 mobile appilcation에서 complex neural network를 사용할 수 있도록 할 수 있다.
우리는 3~4배 빨라지고 3~7배 energy efficient한 compressed network를
CPU, GPU and mobile GPU에 benchmark하였다.

1. Introduction

최근 Deep neural network는 computer vision task에서 state-of-the-art technique들이 진화되고 있다.
그런데 AlexNet Caffemodel은 weight size가 200MB가 넘고, VGG-16 Caffemodel은 500MB가 넘는다.

우선 첫번째로,
Baidu와 Facebook과 같이 mobile-first company들은 bianry file size에 대해 매우 민감하다.
예를 들어, App Store에서는 100MB 이상의 app은 Wi-Fi 연결할 때까지 download가 안되도록 제한한다.

비록 mobile에서 실행되는 deep neural network가 privacy 향상, network bandwidth 감소 및 real time processing과 같은 훌륭한 기능을 갖고 있지만,
large storage overhead가 deep neural network가 mobile app과 통합되는 것을 막고 있다.
The second issue is energy consumption.
큰 neural network가 작동하는 것은 weight와 dot product에 대한 많은 computation으로 인해 많은 memory bandwidth가 필요한데, 이는 considerable energy를 소비한다고 밝혀졌다.
Mobile device들은 battery constrained되어 있기 때문에, deep neural network와 같은 power가 많이 필요한 application을 배치하기 어렵다.

Energy consumption은 memory access에 지배적이다.
45nm CMOS 기술 하에서,
32bit floating point 덧셈은 0.9 $pJ$ ,
32bit SRAM cache access는 5 $pJ$ ,
반면에 32bit DRAM memory access는 640 $pJ$ 이 소요된다.
Large network는 on-chip storage에 적합하지 않으므로, 비용이 많이 드는 DRAM access가 필요하다.

1 billion connection neural network를 작동시킨다고 가정했을 때,
단지 DRAM access를 위해서 20fps에서 (20Hz)(1G)(640pJ) = 12.8 $W$ 가 필요하다.
이는 일반적인 mobile device의 power envelope를 훨씬 뛰어넘는 power이다.

우리의 목표는 mobile device에 적용할 수 있는 large network의 inference를 수행할 수 있도록 하는 필요한 storage와 energy를 줄이는 것이다.
이 목표를 달성하기 위해, 우리는 deep compression : a three-stage pipeline을 설명한다.
1. 오로지 중요한 informative connection을 유지하기 위해,
  중복되는 connection을 제거함으로써 network를 가지치기 하였다.
2. weight들은 multiple connection이 동일한 weight를 공유하도록 quantize되어있기 때문에,
  오로지 codebook과 index만 저장하면 된다.
3. 효율적으로 biased된 weight distribution에 대해 advantage를 가질 수 있는
  Huffman coding을 적용한다.
우리의 main insight는 pruning과 quantization이 서로 방해 없이 network를 compression할 수 있다는 것이다.
그러므로 high compression rate를 유도할 수 있다.
그것은 요구되어지는 storage를 매우 작게 만들어서 모든 weight들이 energy consuming이 심한 off-chip DRAM 대신에 on-chip에 cache되어질 수 있게 한다.

이러한 deep compression에 기초하여, compressed model에서 동작할 수 있는 the EIE HW accelerator(https://arxiv.org/abs/1602.01528 Han et al.)라는 것이 추후에 제안되었다.

2. Network Pruning

Network pruning은 CNN model을 compress할 수 있는 연구로 널리 알려져 있다.
이전 연구에서, network pruning은 network complexity와 over-fitting을 줄이는 효과적인 방법으로 입증되었다.
최근 accuracy의 손실이 전혀 없는 pruned state-of-the-art CNN model도 있다.
우리는 이러한 approach를 빌렸다.
Figure 1.의 가장 왼쪽에서 볼 수 있듯이,
normal network를 학습하여 connectivity를 학습시킴으로써 시작한다.
다음으로, 우리는 threshold보다 작은 weight들의 모든 connection을 제거하여, small-weight connection을 가지치기 한다.
(아래 예제에서는 weight값이 3 미만인 connection을 모두 제거)
(https://www.youtube.com/watch?v=zfqVBGpf3L4)
마지막으로 우리는 남아있는 sparse connection에 대한 최종적인 weight들을 학습하여 network를 retrain시킨다.
Pruning은 AlexNet의 parameter를 9배 줄이고, VGG-16 model의 parameter를 13배 줄였다.

우리는 pruning의 result인 sparse structre를
CSR(Compressed Sparse Row) or CSC(Compressed Sparse Column) format을 이용하여 저장했다.
CSR이나 CSC format은 $2a + n + 1$ number가 필요했다.
( $a$ : #non-zero elemnts, $n$ : #rows or #columns)
- CSR(Compressed Sparse Row) format :
  (https://www.youtube.com/watch?v=_OhkowiCwu8)
하지만 더욱 더 compression하기 위해,
우리는 absolute position(절대 위치) 대신에 index difference를 저장했고,
conv layer에 대해서 index difference를 8 bits로 encoding,
fc layer에 대해서는 index difference를 5 bits로 encoding했다.
이때, index difference를 저장하기 위해 3bits만 사용한다고 했을 때,
그 차이가 3bits(8) 보다 크다면, zero padding을 추가했다.
예를 들어,
원래의 weight가 16개이고, 0이 아닌 값이 3개가 있다고 가정하자.
이때, index 4에 위치한 weight와 index 15에 위치한 weight가 멀어서(11 > 8)
최대 difference가 8 이하가 될 수 있도록 하기 위해, 중간에 값이 0인 padding을 넣는 방식으로
weight값을 기록할 수 있다.

Network quantization and weight sharing은 또한
각각의 weight represent에 요구되어지는 bits수를 줄임으로써 pruned된 network를 압축할 수 있다.
우리는 multiple connection이 the same weight를 공유하도록 함으로써 effective weights 수를 제한했다.
Weight sharing Figure 3.에서 설명되어진다.

4개의 input neurons과 4개의 output neurons을 갖는 layer를 가정해보자.
왼쪽 위는 4x4 weight matrix이고, 왼쪽 아래는 4x4 gradient matrix이다.
- weight matrix를 4개의 bin으로 quantization시킨다. (4가지 색깔로 clustering됨)
  그러면, 같은 bin에 있는 모든 weight들은 same value를 공유한다.
  그러므로 각 weight들에 대해서, 우리는 오로지 shared weight tabble에 small index만 저장해도 된다.
- update 동안에는 모든 gradient가 color별로 grouped되어지고, group별로 값을 모두 더해준다.
  더한 값에 learning rate를 곱해준 다음 shared centroids로부터 subtracted되어진다.
compression rate를 계산하기 위해,
$k$ 개 cluster가 주어졌을 때, 우리는 index를 encode하기 위해 오로지 $log_2(k)$ bits만 필요하다
일반화하면, a network with $n$ connections and each connection is represented with $b$ bits에 대해서
오로지 $k$ 개의 shared weight만 갖도록 제한한다면 다음과 같은 compression rate가 된다.
Figure 3.로 예를 들면,
실제로는 4x4=16개의 weight였지만 오로지 4개의 shared weights가 있다.
(비슷한 weight들은 모두 같은 값으로 grouping되었기 때문)
처음에 우리는 16개 weight마다 16bits(4 bytes)가 필요 했었지만,
이제 우리는 4개의 effective wieghts를 저장할 필요가 있고, 각각 32 bits를 갖는다.
그리고 각각은 16개의 2-bit indices와 함께
다음의 compression rate를 제공한다.
$\frac{16 * 32}{(4 * 32 + 2 * 16)} = 3.2$

우리는 trained network의 각 layer에 shared weights를 identify하기 위해 k-means clustering을 사용했다.
그렇기 때문에 같은 cluster로 떨어진 모든 weight들은 모두 같은 weight값을 공유할 것이다.
Weight는 layer를 넘어서 공유하지 않는다.
우리는 WCSS(within-cluster sum of squares)를 최소화하기 위해서
$n$ 개의 original weights $W={w_1, w_2, ..., w_n}$ 를 $k$ 개의 cluster $C={c_1, c_2, ..., c_k}$ 로 partition했다.

3.2 Initialization of Shared Weights

Centroid initialization하는 것은 clustering에 영향을 미치기 때문에
network의 prediction accuracy에도 영향을 미치게 된다.
그래서 우리는 3가지 initialization method(Forgy(random), density-based, linear initialization)를 실험해봤다.
Figure 4에서 우리는 AlexNet의 conv3 layer의 original weight's distribution(CDF, PDF)을 plot해봤다.
weights들은 network purning 이후에 bimodal(2 가지 영역의) distribution을 만들었다.
graph 아래쪽에는 (blue, red, and yellow)로된 3가지 서로 다른
initialization methods들에 대한 effetive weights(centroids)가 plot되어 있다.
이 예제에서는 13개의 cluster가 있다.
1. Forgy(random) :
  Forgy(random) initialization은 data set으로부터 관찰한 $k$ 개를 선택하고,
  그것들을 이용해서 initial centroids로 이용하는 방법.
  bimodal distribution에서 2개의 peak가 있기 때문에,
  Forgy method는 그 2개의 peak 주변에 집중되는 경향을 보인다.
2. Density-based :
  Desity-based initialization은 y-axis의 weigth들의 CDF를 선형적으로 배치한 다음,
  CDF와 horizontal intersection(교차점)을 찾고,
  마지막으로 파란 점들에서 보여지듯이 centroid가 되는 x-axis의 vertical intersection을 찾는다.
  이 method는 2개의 peak 주위에 centroid가 더 조밀하게 만들어지지만, Forgy method보다는 산란하다.
3. Linear :
  Linear initialization은 original weights의 [min, max] 사이의 값을 선형적으로 배치한 것이다.
  이 method는 weight의 distribution에 invariant하고 위 2가지 method보다 훨씬 산란적이다.(scattered)
(Han et al., 2015)에 의하면 Larger weight가 smaller weight보다 더욱 중요한 역할을 하지만,
이러한 large weights들은 조금밖에 없다.
그러므로 Forgy initialization과 density-based initialization 둘 다 매우 적은 centroid값이
큰 절대값을 가지므로 이 이 몇가지 큰 weight를 잘 표현하지 못한다.
하지만 Linear initialization은 이러한 문제를 겪지 않는다.
Experiment section에서 clustering과 fine-tuning 후 different initialization method에 따른 accuracy를 비교할 것인데,
Linear initialization이 가장 잘 동작한다.

3.3 Feed-Forward and Back-Propagation

feed forward phase와 weight table을 loopking up하는 phase에서 한가지 간접적인 수단이 있다.
back propagation 과정에서 Loss는 동일한 bin에 담긴 weight들에 대하여 loss를 하나로 묶어 back propagation을 진행한다.
이를 구현하기 위해 indicator function(간접적인 방법)을 사용한다.

4. Huffman Coding

Huffman code는 lossless data compression(무손실 압축)에 일반적으로 사용되는 최적의 prefix code이다.
source symbol을 encode하기 위해 variable-length codewords(가변 길이)를 사용한다.
table은 각 symbol의 occurence probability로부터 유도되어진다.
가장 흔한 symbol은 가장 적은 bits로 표현되어진다.
Figure 5.는
quantized weights의 probability distribution과 AlexNet의 last FC layer의 sparse matrix index를 보여준다.
두 distribution은 baised되어 있다.
(uniform distribution이었다면, Huffman coding의 효과는 미미했을 것이다.)
quantized weight들의 대부분은 두 peak 주변에 분포되어 있으며,
sparse matrix index difference는 20을 넘지 않는다.
Experiment에서는 Huffman coding이
network storage의 이러한 non-uniformly distributed value를
20~30% 절약한다고 보여준다.

(배경지식)

https://www.youtube.com/watch?v=zfqVBGpf3L4&t=979s

Huffman coding :

5. Experiments

우리는 purned, quantized, Huffman encoded된 4가지 network(2개는 MNIST, 2개는 ImageNet을 위한)를 만들었다.
Table 1.에서 pruning before and after에 대한 parameter와 accuracy를 제공한다.
compression pipeline은 network storage를 아무런 accuracy loss 없이 35 ~ 49배를 절약할 수 있었다.
AlexNet의 total size는 240MB에서 6.9MB로 줄었고, 그것은 on-chip SRAM에 넣을 수 있을정도로 충분히 작다.

(여기는 모르겠음)

5.1 LeNet-300-100 and LeNet-5 on MNIST

LeNet-300-100은 2개의 각각 300, 100 neuron으로 2개의 fully connected layer를 갖는 FC network이다.
LeNet-300-100은 Mnist에 대해 1.6% error rate를 달성했다.
LeNet-5는 2개의 convolutional layer와 2개의 fully connected layer를 갖는 CNN이다.
LeNet-5는 Mnist에 대해 0.8% error rate를 달성했다.
Table 2, Table 3에서는 compression pipeline의 통계를 보여준다.
compression rate는 codebook과 sparse indexes의 overhead를 갖고 있다.
pruning과 quantization으로부터 대부분(x32)을 아낄 수 있지만, Huffman coding은 조금밖에 아끼지 못한다(x40).

5.2 AlexNet on ImageNet

Table 4에서 accuracy에 영향이 없도록 35배 줄여서 기존의 2.88%로 만들었다.

5.3 VGG-16 on ImageNet

Table 5에서 accuracy에 영향이 없도록 49배 줄여서 기존의 2.05%로 만들었다.

2개의 Fully-connected layer가 original size 대비 1.6% 미만으로 pruned되었다.
이러한 큰 reduction은 real time image processing에 매우 중요하다.
이러한 reduced layer는 on-chip SRAM에 적합할 것이고, 적절한 bandwidth requirements를 갖게 한다.

6. Discussinos

Pruning and Quantization Working Together

Figure 6에서 pruning과 quantization을 함께 쓰거나 따로따로 쓸 때의 comppression rate에 따른 accuracy를 보여준다.
- pruning only network는 original size의 8% 이하로 내려가기 시작하면서
  accuracy가 급격히 떨어지는 것을 볼 수 있다.
- Quantization only network도 original size의 8% 이하로 내려가기 시작하면서
  accuracy가 급격히 떨어지는 것을 볼 수 있다.
- SVD는 compression rate가 너무 낮다.
- pruning과 quantization을 합쳐서 사용했을 때,
  no loss of accuracy인데 original size를 3%로 압축할 수 있다.
Figure 7에서 3가지 plot(FC layers(left), CONV(middle), and all layers(right))을 그렸다.
각 plot에는
Dash line으로 only quantization에 대한 top-1 and top-5 accuracy를 plot했고,
Solid line으로 quantization and pruning에 대한 top-1 and top-5 accuracy를 plot했다.

6.2 Centroid Initialization

Figure 8에서 각 initialization method별 accuracy를 비교했다.
network는 2~8 bits(x-axis)로 quantize되었다.
- 3 bits quantization 이후부터
  Linear initialization 방법이 density initialization과 random initialization보다 성능이 좋았다.
- linear initialization method는 min value부터 max value까지 initial centroid가 x-axis에 동등하게 퍼져있다.
  large weight가 small weight보다 더 중요한 역할을 하기 때문에
  large weight를 유지하는 데에 도움이 된다.
  이는 network pruning Han et al. (2015)에서도 나온다.
  random initialization과 density-based initialization은 large centroid를 유지하지 않는다.
  이러한 initialization method들을 사용하면, large weight가 거의 없기 때문에 large weight들이 small centroid들로 cluster된다.
  반면에 linear initialization은 large weight가 large centroid를 형성할 수 있는 더 좋은 확률이 있다.

6.3 Speedup and Energy Efficiency

Deep Compression은 mobilde에서 동작하는 latency-focused application에 극도로 target화되어 있다.
그리고 그것은 real-time inference(보행자 detection, 자율주행차의 embedded processor 등)를 필요로 한다.
Batch가 모아질 때까지 기다리는 것은 latency가 증가한다.
따라서 우리는 batch size=1인 경우를 고려하여 performance와 energy efficiency를 benchmarking 했다.
우리는 다른 system 사이에서 power consumption을 비교하기 위해서,
consistent manner (NVIDIA)에서 power를 측정하는 것이 중요했다.
batching이 있을 때와 없을 때의 memory access ratio가 다르다.
만약 input activation을 matrix로 batch화했을 때,
computation은 matrix-matrix multiplication이고, 그것은 blocking함으로써 locality를 향상시킬 수 있습니다.
Matrix는 cache에 맞도록 blocking되어 효율적으로 재사용할 수 있다.
이 경우, the amount of memory access는 $O(n^2)$ 이고, computation은 $O(n^3)$ 이다.

real time processing에서 batching은 허용되지 않는다.
input activation이 single vector이고 computation이 matrix-vector multiplication이라고 했을 때,
the amount of memory access는 $O(n^2)$ 이고, computation은 $O(n^2)$ 이다.
memory access와 computation은 똑같은 크기이다.
That indicates MV is more memory-bounded than MM.
따라서 memory footprint를 줄이는 것이 non-batching case에 더욱 중요하다.
Figure 9는 다른 HW에서 pruning의 speedup을 살펴봤다.
6개의 columns은 각각 the computation time of CPU/GPU/TK1 on dense/pruned network를 보여준다.
batch size=1일 때, pruned network layer dense network의 평균 보다 3~4배 빠르다.
왜냐면 더 적은 memory footprint를 갖기 때문이다.

6.4 Ratio of Weights, Index and Codebook

Pruning은 weight matrix를 sparse하게 만들기 때문에,
나머지 space는 non-zero element에 대한 index를 저장할 필요가 있다.
그리고 Quantization은 codebook에 대한 storage를 필요로 한다.
이전에는 Figure 11에서 100%가 weight였다면, deep compression model에서는
weight가 거의 절반만 차지하고 나머지는 index(non-zero element)와 Codebook이 차지한다.
다른 compression method와 비교하여 실험해봤다.

Neural network는 일반적으로 over-parametrized하다.
그래서 deep learning model에 대한 많은 중복이 있다.
이는 computation과 memory usage 낭비를 초래한다.
그래서 redundancy를 줄이기 위한 많은 연구들이 제안되어 왔다.
Network pruning은 지금까지 network complexity와 overfitting을 줄이기 위해 사용되어 왔다.
최근 연구에서는 large scale network를 성공적으로 pruned하고,
the number of parameters를 10분의 1 크기로 감소시킬 수 있음을 보여주었고,
또한 압축 및 가속에 대한 활성화의 수를 감소시키려는 시도가 있었다.

8. Future Work

pruned network는 다양한 HW에서 benchmark되었지만,
weight sharing을 한 quantized network는 그렇지 않았다.
왜냐하면 cuSPARSE 또는 MKL SPBLAS library가 indirect matrix entry lookup을 지원하지 않았기 때문이다.
또한 CSC or CSR format에 대한 index도 지원하지 않았기 때문이다.
따라서 cache에 맞는 Deep Compression의 장점을 최대화되지 못했다.
SW solution은 이를 지원하는 customized GPU kernel을 제공하는 것이다.
HW solution은 sparse and quantized network structure에 specialized된 ASIC architecture를 구축하는 것이고, 그것은 customized quantization bit width를 지원해야 한다.
우리는 이러한 architecture가 off-chip DRAM access 대신에 on-chip SRAM access에 energy dominated될 수 있다고 예상한다.

9. Conclusion

We have presented Deep Compression that compressed neural networks without affecting accuracy.
Our method operates by pruning the unimportant connections,
quantizing the network using weight sharing,
and then applying Huffman coding.

사전지식

SRAM vs DRAM

RAM :
Random Access Memory.
RAM은 main memory이고, short-term access를 위한 data를 저장한다.
RAM은 volatile memory이다.
2가지 type이 있다.
1. SRAM (Static RAM) :
  SRAM은 6개의 transistor로 구성된다.
  DRAM처럼 capacitor에 전하를 저장하는 것이 아닌, latching회로 (flip-flop)을 사용하여 data를 저장한다.
  지속적인 power가 필요하다.
  비교적 용량 작고, 가격 높고, 속도 빠르다
  cache에 사용되는 On-chip memory임.
2. DRAM (Dynamic RAM) :
  transistor 1개와 capacitor 1개로 구성된다.
  capacitor에 전하가 저장되며, capacitor의 전압이 특정 전압을 달성하면 1bit, 아니면 0bit로 해석된다.
  transistor는 switch 역할을 하며, cell의 capacitor에 전류가 흘러 data를 저장 or 읽을지를 결정한다.
  capacitor는 시간이 지날수록 전하가 방전되어 특정 간격마다 refresh해줘야 한다.
  그렇지 않으면 capacitor에 가해지는 전압이 변해 저장된 data가 손실될 수 있다.
  비교적 용량 크고, 가격 싸고, 속도 느리다
  main memory에 사용되는 Off-chip memory임.

궁금한 점, 모르겠는 점

이러한 3단계의 deep compression을 통해 model이 off-chip DRAM memory 보다 on-chip SRAM cache에 적합하도록 할 수 있다.
➡️ 이것이 주는 좋은 영향이 정확히 무엇인지?
Weight bits와 Index bits가 의미하는 것은?
(내 생각)
weight bits 8 : 모든 weight의 값을 $2^8=256$ level로 만든다. (예를 들어 -0.127 ~ 0.128)
Index bits 4 : 그럼 Total 5.4는 어디서 나온 숫자인가?

비평, 내 생각

아래 빨간색 문장에서 Han et al., 2015를 읽어보진 않았지만
weight 값의 절대값이 큰 weight들이 더 중요한 역할을 한다고 한다.
그렇다면 Linear initialization 방법에서 weight 절대값이 큰 weight들의 비율을 적당히 늘린다면,
더 성능이 좋아지지 않을까? 하는 생각이다.
전체 model에서 weight가 중복되거나 재사용되는 경우가 많다고 했고,
large weight들이 해당 network에서 중요한 역할을 한다고 했다.
그렇다면, accuracy를 결정하는 실질적인 weight들은
전체 weight들 중에는 아주 적게 있을 것이라고 해석된다.
따라서 성능을 결정하는 실질적인 weight들을 pruning(threshold로 날리는 것)말고
다른 방법들을 이용해도 model을 compression할 수 있을 것이라고 생각한다.

그 방법을 연구하는 것이 하나의 주제가 될 수 있을 것인데,
우선 매우 간단히 생각한 아이디어는
학습을 진행하면서 일정 epoch 간격으로 network에서 중요한 역할만 하는 weight들을 모니터링 한다던지,
그 모니터링 범주에 있는 weight들을 udpate할 때, learning rate를 상대적으로 조금 더 크게 준다던지,
그냥 간단히 생각한 아이디어 외에도 많은 아이디어가 있을 수 있을 것 같다.

Seminar - Discussion

과연 "deep compression"의 연산이 빨랐을까?
사실 그렇지 않았다.
sparse matrix는 indirection이 많기 때문에 속도와 성능이 떨어질 수 밖에 없었다.
그런데 속도가 어떻게 더 빠르고, 성능이 더 좋아질 수 있었을까?
따라서 위 figure는 믿을만한 figure가 아니다.

저자는 특별한 환경(AlexNet, VGGNet은 Parameter가 매우 많음)에서 했기 때문에 그 상황에 맞도록 speed, energy efficiency한 것이다.
또한 저자는 SRAM에 모든 parameter들을 넣기 위해 특별한 환경(batch size=1)을 제작하였다.
그리고 압축방법으로 인해 값을 indirection하는 것이 많기 때문에 속도가 떨어질 수 밖에 없었을 것이다.

또한 저자는 training한 후 pruning한 후 retraining했다.
즉, baseline model들과 완전히 동일한 training 환경이 아니었고 retraining과 fine tuning이 추가적으로 많이 진행된 것이기 때문에 성능이 더 좋아졌던 것이다.

참고로 sparse matrix 연산은 compression을 하여 indirection이 많기 때문에 기존의 dense matrix 연산보다 속도가 느리다.
그래서 sparse matrix 연산을 빠르게 할 수 있는 라이브러리들도 NVIDIA에서 제공하고 있다.
(sparse matrix 연산 - cuSPARSE, dense matrix 연산 - cuBLAS)
최근에는 아예 HW적으로 AI chip으로 구현하여 속도의 손해를 안 볼 수 있도록 performance 떨어지는 것을 예방하는 연구, 개발도 있다.

Hyungseop Lee

Efficient Deep Learning

다음 포스트

[2016 ICLR] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

[Paper Review] Efficient and Scalable

Paper Information

0. Abstract

1. Introduction

2. Network Pruning

3.2 Initialization of Shared Weights

3.3 Feed-Forward and Back-Propagation

4. Huffman Coding

(배경지식)

5. Experiments

5.1 LeNet-300-100 and LeNet-5 on MNIST

5.2 AlexNet on ImageNet

5.3 VGG-16 on ImageNet

6. Discussinos

Pruning and Quantization Working Together

6.2 Centroid Initialization

6.3 Speedup and Energy Efficiency

6.4 Ratio of Weights, Index and Codebook

8. Future Work

9. Conclusion

사전지식

SRAM vs DRAM

궁금한 점, 모르겠는 점

비평, 내 생각

Seminar - Discussion

[2017 ICLR] Pruning Filters for Efficient Convnets

0개의 댓글

[2016 ICLR] Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

[Paper Review] Efficient and Scalable

Paper Information

0. Abstract

1. Introduction

2. Network Pruning

3. Trained Quantization and Weight Sharing

3.1 Weight Sharing

3.2 Initialization of Shared Weights

3.3 Feed-Forward and Back-Propagation

4. Huffman Coding

(배경지식)

5. Experiments

5.1 LeNet-300-100 and LeNet-5 on MNIST

5.2 AlexNet on ImageNet

5.3 VGG-16 on ImageNet

6. Discussinos

Pruning and Quantization Working Together

6.2 Centroid Initialization

6.3 Speedup and Energy Efficiency

6.4 Ratio of Weights, Index and Codebook

7. Related Work

8. Future Work

9. Conclusion

사전지식

SRAM vs DRAM

궁금한 점, 모르겠는 점

비평, 내 생각

Seminar - Discussion

[2017 ICLR] Pruning Filters for Efficient Convnets

0개의 댓글