# [Tech] Knowledge Distillation

안규원 · June 30, 2024


[Abstract]

Knowledge distillation is a technique that transfers the knowledge of a large Teacher model to a small Student model, improving efficiency while preserving performance.

In this experiment, I trained a Teacher model and a Student model on the MNIST dataset and compared their training time, inference time, and accuracy.

For the dataset, I used MNIST, which is commonly used with CNNs.
(60,000 images for training / 10,000 images for testing)

Starting with the results:

| | Teacher | Student |
|---|---|---|
| Training time | 94.4465 seconds | 72.3704 seconds |
| Inference time | 2.0006 seconds | 1.9377 seconds |
| Inference accuracy | 95.85% | 96.13% |

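The Student has one fewer fc layer and one fewer ReLU layer than the Teacher, and as a result training time dropped from about 94 seconds to about 72 seconds (roughly 23% less). Accuracy actually went up slightly (95.85% -> 96.13%); my interpretation is that both models are so small that there is no meaningful difference in performance. Because MNIST images are 28*28 pixels, the input fc layer takes 28*28 = 784 values, and because there are 10 classes to classify, the output fc layer produces 10 values.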

The Teacher model is:

- fc layer(28*28 -> 400)
- ReLU layer
- fc layer(400 -> 100)
- ReLU layer
- fc layer(100 -> 10)

The Student model is:

- fc layer(28*28 -> 100)
- ReLU layer
- fc layer(100 -> 10)
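To get a feel for how much smaller the Student is, the layer sizes listed above can be turned into a parameter count. This is a minimal sketch in plain Python, not the code used in the experiment; it assumes every fc layer has a bias term, and the names `TEACHER_SIZES`, `STUDENT_SIZES`, and `count_parameters` are made up for illustration.

```python
# Layer widths taken from the Teacher and Student descriptions above.
TEACHER_SIZES = [28 * 28, 400, 100, 10]
STUDENT_SIZES = [28 * 28, 100, 10]


def count_parameters(sizes):
    """Count weights and biases of a stack of fully connected layers.

    Assumption: each fc layer has a bias vector. ReLU layers add no parameters.
    """
    total = 0
    for n_in, n_out in zip(sizes, sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total


teacher_params = count_parameters(TEACHER_SIZES)
student_params = count_parameters(STUDENT_SIZES)

print(f"Teacher: {teacher_params:,} parameters")
print(f"Student: {student_params:,} parameters")
print(f"Teacher / Student: {teacher_params / student_params:.2f}x")
```

Under that bias assumption, the Teacher comes out at 355,110 parameters and the Student at 79,510, so the Student is roughly 4.5 times smaller.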

[Definition]

In machine learning, knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have higher knowledge capacity than small models, this capacity might not be fully utilized. It can be just as computationally expensive to evaluate a model even if it utilizes little of its knowledge capacity. Knowledge distillation transfers knowledge from a large model to a smaller model without loss of validity. As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device).

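A technique for transferring knowledge from a large neural network (the Teacher Model) to a smaller model (the Student Model) while preserving the Teacher's performance.

It is mainly used to improve model efficiency, for example when deploying a lightweight model or serving a real-time application, and is especially useful in resource-constrained environments.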

[Components]

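- Teacher Model
  - A model trained at large scale; its high performance comes with a high computational cost.
- Student Model
  - A model smaller than the Teacher Model that learns by receiving the Teacher Model's knowledge.
- Soft Targets
  - The probability distribution predicted by the Teacher Model, used when training the Student Model.
  - Soft Targets use the Teacher Model's output in place of plain labels.
  - In a task such as image classification, the network's final softmax layer outputs a probability for each class. The values for classes other than the predicted one are kept as well, in softened form, so that the full probability distribution is extracted.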
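The difference between a plain label and a soft target is easier to see with numbers. Below is a minimal sketch in plain Python; the logits are made up, and the `temperature` parameter is my assumption about how the "softening" is done (dividing the logits by a temperature before softmax is the usual approach in the distillation literature, but the notes above do not name it).

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical Teacher logits for one image over three classes (made-up numbers).
teacher_logits = [5.0, 2.0, 0.5]

# A plain (hard) label keeps only the predicted class.
predicted = teacher_logits.index(max(teacher_logits))
hard_label = [1.0 if i == predicted else 0.0 for i in range(len(teacher_logits))]

# Soft targets keep the relative scores of the other classes too.
soft_t1 = softmax(teacher_logits, temperature=1.0)
soft_t4 = softmax(teacher_logits, temperature=4.0)

print("hard label :", hard_label)
print("soft (T=1) :", [round(p, 3) for p in soft_t1])
print("soft (T=4) :", [round(p, 3) for p in soft_t4])
```

With the higher temperature, the top class gets less of the probability mass and the other classes get more, which is the extra signal the Student learns from.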

[Process]

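- Teacher Model training
  - Train the large model sufficiently to reach high performance.
- Soft Targets extraction
  - Use the Teacher Model to generate Soft Targets for the input data.
  - These are the Teacher Model's output probability distributions.
- KL Divergence (Kullback-Leibler Divergence)
  - An asymmetric measure of the difference between two probability distributions P and Q.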
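For discrete distributions, KL divergence is defined as D_KL(P || Q) = sum over i of P(i) * log(P(i) / Q(i)). The sketch below, in plain Python, computes it in both directions to show the asymmetry. The two distributions are made-up numbers standing in for a Teacher output and a Student output, and `kl_divergence` is an illustrative helper, not a function from any library.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)), measured in nats."""
    # Terms with P(i) == 0 contribute 0 by convention.
    # Q(i) must be > 0 wherever P(i) > 0, otherwise the divergence is infinite.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Made-up distributions over three classes, for illustration only.
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.5, 0.3, 0.2]

forward = kl_divergence(teacher_probs, student_probs)   # D_KL(Teacher || Student)
backward = kl_divergence(student_probs, teacher_probs)  # D_KL(Student || Teacher)
same = kl_divergence(teacher_probs, teacher_probs)      # identical distributions

print(f"D_KL(teacher || student) = {forward:.4f}")
print(f"D_KL(student || teacher) = {backward:.4f}")
print(f"D_KL(teacher || teacher) = {same:.4f}")
```

The two directions give different values, and the divergence is zero only when the two distributions match, which is why it works as a measure of how far the Student's output is from the Teacher's.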