[AMD DL Study] Week 12: Vitis AI tutorials (2)

Loloh_the_great · January 22, 2024

CNN FPGA Vitis-AI


Reference: https://github.com/Xilinx/Vitis-AI-Tutorials/tree/3.5/Tutorials/RESNET18/

Contents

Review
Vitis AI tutorials: CIFAR10 Dataset
Summary

1. Review

Last week we installed Vitis AI under WSL and got it running through Docker. This week we learn how to run ResNet18, a CNN (convolutional neural network), on an FPGA board using the CIFAR10 dataset.

2. Vitis AI tutorials : CIFAR10 Dataset

First we train on the CIFAR10 dataset and then quantize the model, all through Vitis AI.

CIFAR10: a dataset of 60,000 labeled 32x32 RGB images spread evenly over 10 classes (6,000 images per class).

2.1 Train ResNet18 CNN on CIFAR10

Everything runs in the vitis-ai-tensorflow2 conda environment set up earlier.

In this section we train ResNet18 inside the Vitis AI environment. Training is driven by the run_all.sh script in the /workspace/tutorials/RESNET18/files directory.

The commands for training are as follows.

cd /workspace/tutorials/RESNET18/files # your current directory
source run_all.sh run_clean_dos2unix
source run_all.sh cifar10_dataset
source run_all.sh run_cifar10_training

Running the commands above in the shell printed the messages below and ended in an error.

a. First command

b. Second command

c. Third command

===========================================================================
WARNING:
  'run_all.sh' MUST ALWAYS BE LAUNCHED BELOW THE 'files' FOLDER LEVEL
  (SAME LEVEL OF 'modelzoo' AND 'target' FOLDER)
  AS IT APPLIES RELATIVE PATH AND NOT ABSOLUTE PATHS
===========================================================================


----------------------------------------------------------------------------------
[DB INFO STEP3A] CIFAR10 TRAINING (way 1)
----------------------------------------------------------------------------------

2024-01-22 03:54:42.374293: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-22 03:54:44.925723: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "./code/train1_resnet18_cifar10.py", line 51, in <module>
/workspace/tutorials/RESNET18/files
    from classification_models.keras import Classifiers
ModuleNotFoundError: No module named 'classification_models'
mv: cannot stat './build/float/train1_best_chkpt.h5': No such file or directory
mv: cannot stat './build/float/train1_final.h5': No such file or directory

----------------------------------------------------------------------------------
[DB INFO STEP3B] CIFAR10 TRAINING (way 2)
----------------------------------------------------------------------------------

2024-01-22 03:54:50.999317: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-22 03:54:52.088385: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "./code/train2_resnet18_cifar10.py", line 63, in <module>
/workspace/tutorials/RESNET18/files
    from classification_models.keras import Classifiers
ModuleNotFoundError: No module named 'classification_models'

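I could not make sense of this at first. Just in case, I placed a train1_best_chkpt.h5 file at ./build/float/ as the mv message suggested, but the result was the same.

The mv failures look like a side effect: the script stops at the import, so no checkpoint is ever written. The real cause seems to be the error common to both runs: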

Traceback (most recent call last):
/workspace/tutorials/RESNET18/files
  File "./code/train1_resnet18_cifar10.py", line 51, in <module>
    from classification_models.keras import Classifiers
ModuleNotFoundError: No module named 'classification_models'
mv: cannot stat './build/float/train1_best_chkpt.h5': No such file or directory
mv: cannot stat './build/float/train1_final.h5': No such file or directory

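Commands such as pip install tensorflow did not help, which was quite baffling. For the time being I left the train1_best_chkpt.h5 file I had made in place. Searching online suggests the error means the image-classifiers package installed in the previous post is missing. Could that be it?

Solution

Reinstall the package with pip install image-classifiers inside the conda environment of the Docker container. The classification_models module is provided by the image-classifiers package on PyPI.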
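Before relaunching a long training run, it can help to confirm that the module is visible to the interpreter of the active conda environment. The snippet below is a minimal sketch of such a check; the helper name `is_importable` is my own, and the module names listed are simply the ones that appear in the traceback.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the current interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names taken from the traceback in this post.
    for name in ("tensorflow", "classification_models"):
        status = "found" if is_importable(name) else "MISSING"
        print(f"{name}: {status}")
```

If `classification_models` is reported as missing, pip most likely installed the package into a different environment than the one the training script runs in.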

Running

source run_all.sh run_cifar10_training

again started the training, as shown below.

[DB INFO] Training the Model...

Epoch 1/50
2024-01-22 19:59:01.578921: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_0' with dtype int32
         [[{{node Placeholder/_0}}]]

Epoch 1: val_accuracy improved from -inf to 0.10341, saving model to build/float/train1_best_chkpt.h5
195/195 - 992s - loss: 1.0440 - accuracy: 0.6433 - val_loss: 2.9902 - val_accuracy: 0.1034 - lr: 0.0100 - 992s/epoch - 5s/step
Epoch 2/50

Epoch 2: val_accuracy improved from 0.10341 to 0.13425, saving model to build/float/train1_best_chkpt.h5
195/195 - 931s - loss: 0.6689 - accuracy: 0.7664 - val_loss: 3.3780 - val_accuracy: 0.1343 - lr: 0.0098 - 931s/epoch - 5s/step
Epoch 3/50

The training result is saved to build/float/train1_best_chkpt.h5. I am working in a CPU-only environment, where one epoch takes roughly 930 to 990 seconds. At that rate 50 epochs take about 13 hours, which is far too long.
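The estimate is simple arithmetic on the per-epoch times printed in the log. As a rough sketch (the function name is hypothetical, and the two timings are the ones from epochs 1 and 2 above):

```python
def estimate_training_hours(seconds_per_epoch: float, epochs: int) -> float:
    """Convert a per-epoch duration into total training hours."""
    return seconds_per_epoch * epochs / 3600.0


# Epoch durations in seconds, read from the Keras log above.
observed = [992, 931]
mean_epoch = sum(observed) / len(observed)

print(f"mean epoch time: {mean_epoch:.1f} s")
print(f"estimated total: {estimate_training_hours(mean_epoch, 50):.1f} h")
```

Two epochs are a small sample, so this is only a ballpark figure, but it is enough to decide whether to wait for a CPU run or look for a GPU.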

After roughly 13 hours, the three files below had been created, as laid out in train1_resnet18_cifar10.py.

vitis-ai-user@docker-desktop:/workspace/tutorials/RESNET18/files/build/float$ ls
train1_resnet18_cifar10_best.h5  train1_resnet18_cifar10_final.h5  train2_resnet18_cifar10_float.h5

The log file train1_resnet18_cifar10.log in /workspace/tutorials/RESNET18/files/build/log shows how the training went. The end of the log is shown below.

Epoch 48: val_accuracy did not improve from 0.86205
195/195 - 929s - loss: 0.0409 - accuracy: 0.9861 - val_loss: 0.6335 - val_accuracy: 0.8551 - lr: 6.0000e-04 - 929s/epoch - 5s/step
Epoch 49/50

Epoch 49: val_accuracy did not improve from 0.86205
195/195 - 926s - loss: 0.0410 - accuracy: 0.9869 - val_loss: 0.6302 - val_accuracy: 0.8567 - lr: 4.0000e-04 - 926s/epoch - 5s/step
Epoch 50/50

Epoch 50: val_accuracy did not improve from 0.86205
195/195 - 928s - loss: 0.0384 - accuracy: 0.9874 - val_loss: 0.6283 - val_accuracy: 0.8579 - lr: 2.0000e-04 - 928s/epoch - 5s/step


Elapsed time for Keras training (s):  47923.679135

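As shown above, the final training accuracy is 98.7%, but the validation accuracy ends at 85.8% (best 86.2%), so the model overfits somewhat. The classification report below, computed over 5,000 images, gives an overall accuracy of 86%.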

              precision    recall  f1-score   support

    airplane       0.90      0.89      0.90       500
  automobile       0.88      0.95      0.92       500
        bird       0.83      0.83      0.83       500
         cat       0.76      0.72      0.74       500
        deer       0.84      0.84      0.84       500
         dog       0.85      0.73      0.78       500
        frog       0.83      0.94      0.88       500
       horse       0.88      0.88      0.88       500
        ship       0.92      0.91      0.92       500
       truck       0.90      0.91      0.91       500

    accuracy                           0.86      5000
   macro avg       0.86      0.86      0.86      5000
weighted avg       0.86      0.86      0.86      5000

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy', 'lr'])

[DB INFO] End of ResNet18 Training1 on CIFAR10...
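The macro averages in the report can be re-derived from the per-class rows, which is a quick sanity check when reading such a table. The sketch below copies the rounded values from the report; because every class has the same support (500), the mean recall also equals the overall accuracy.

```python
# Per-class precision and recall, copied from the classification report above.
precision = {
    "airplane": 0.90, "automobile": 0.88, "bird": 0.83, "cat": 0.76,
    "deer": 0.84, "dog": 0.85, "frog": 0.83, "horse": 0.88,
    "ship": 0.92, "truck": 0.90,
}
recall = {
    "airplane": 0.89, "automobile": 0.95, "bird": 0.83, "cat": 0.72,
    "deer": 0.84, "dog": 0.73, "frog": 0.94, "horse": 0.88,
    "ship": 0.91, "truck": 0.91,
}

# Macro average: the unweighted mean over classes.
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)

# The class with the lowest recall is the one the model misses most often.
weakest_class = min(recall, key=recall.get)

print(f"macro precision: {macro_precision:.2f}")
print(f"macro recall:    {macro_recall:.2f}")
print(f"weakest class:   {weakest_class}")
```

The weakest classes by recall are cat and dog, which is where most of the gap to the training accuracy comes from.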

I plotted how the accuracy rises and the loss falls as the epochs progress.

Structure of the trained ResNet18 CNN

With that message, training on the 10 classes was complete.

2.2 Vitis AI Flow on the CNN Model

The Vitis AI flow can also be run through run_all.sh. We will run the commands below.

cd /workspace/tutorials/RESNET18/files # your current directory
source run_all.sh quantize_resnet18_cifar10
source run_all.sh compile_resnet18_cifar10
source run_all.sh prepare_archives

a. Quantization

The Python script vai_q_resnet18_cifar10.py quantizes ResNet18. Once quantization finishes, the resulting .h5 files are saved under build/quantized/.

[SUMMARY INFO]:
- [Target Name]: DPUCZDX8G_ISA1_B4096
- [Total Layers]: 88
- [Layer Types]: InputLayer(1) BatchNormalization(19) ZeroPadding2D(18) Conv2D<linear>(21) Activation<relu>(18) MaxPooling2D(1) Add(8) GlobalAveragePooling2D(1) Dense<softmax>(1) 
- [Partition Results]: INPUT(1) DPU(86) DPU+CPU(1) 
========================================================================================================================
[NOTES INFO]:
- [87/87] Layer dense (Type:Dense<softmax>, Device:DPU+CPU):
    * Seperate layer activation `softmax`
    * `softmax` is not supported by target

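The summary shows how the model's 88 layers are mapped onto the target: 86 run on the DPU, and the final dense layer is split between DPU and CPU because its softmax activation is not supported by the target.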
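Since softmax is separated from the dense layer, it has to be computed on the CPU from the values the DPU returns. The sketch below illustrates the idea under assumptions of my own: that the DPU output arrives as int8 values with an associated scale factor, and the names `raw_int8` and `output_scale` as well as the numbers are purely hypothetical.

```python
import math


def dequantize(raw_int8, output_scale):
    """Turn int8 outputs into floating-point logits (assumed scheme)."""
    return [value * output_scale for value in raw_int8]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical DPU output for one image with 10 classes.
raw_int8 = [-12, 3, 40, -5, 0, 7, -20, 1, 15, -3]
output_scale = 0.25  # hypothetical scale factor

probs = softmax(dequantize(raw_int8, output_scale))
predicted = max(range(len(probs)), key=probs.__getitem__)
print("predicted class index:", predicted)
```

Softmax preserves the ordering of the logits, so the predicted class can also be read straight from the largest raw output; the CPU step only matters when actual probabilities are needed.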

b. Compilation

Next, the quantized CNN is compiled for the DPU architecture.

When compilation finishes, .xmodel files are created under /workspace/tutorials/RESNET18/files/target/cifar10. Each file contains the compiled binary code for a DPU. Use the file that matches your target board.

build_cifar10_test.sh   code                                v70_train2_resnet18_cifar10.xmodel     vck5000_train1_resnet18_cifar10.xmodel  vek280_train2_resnet18_cifar10.xmodel
cifar10_labels.dat      run_all_cifar10_target.sh           vck190_train1_resnet18_cifar10.xmodel  vck5000_train2_resnet18_cifar10.xmodel  zcu102_train1_resnet18_cifar10.xmodel
cifar10_performance.sh  v70_train1_resnet18_cifar10.xmodel  vck190_train2_resnet18_cifar10.xmodel  vek280_train1_resnet18_cifar10.xmodel   zcu102_train2_resnet18_cifar10.xmodel
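The file names in the listing follow a regular pattern of board name plus training variant. A small helper like the one below can pick the right file for a given board; the pattern is inferred only from the listing above, and the function name is hypothetical.

```python
# Board prefixes that appear in the directory listing above.
BOARDS = ("zcu102", "vck190", "vek280", "vck5000", "v70")


def xmodel_name(board: str, training: int) -> str:
    """Build the .xmodel file name for a board and training variant (1 or 2)."""
    if board not in BOARDS:
        raise ValueError(f"unknown board: {board}")
    if training not in (1, 2):
        raise ValueError("training must be 1 or 2")
    return f"{board}_train{training}_resnet18_cifar10.xmodel"


print(xmodel_name("vek280", 1))
```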

2.3 Run on the Target Board

This time we use the VEK280 board to run the model directly on the hardware and check its actual performance.

The commands needed to run on the board are in run_all_cifar10_target.sh. Pass the name of your target board as the argument.

# xxxyyy is the target board, e.g. zcu102, vck190, v70, vek280
run_all_cifar10_target.sh xxxyyy

a. multithreading

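The multithreaded application is written in C++. The code below is the image preprocessing step. Note that the DPU application handles images with OpenCV functions, so the channel order is BGR rather than RGB.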

/* image pre-process */
// 8-bit unsigned buffer, matching the Vec3b (uchar) access below
Mat image2 = cv::Mat(inHeight, inWidth, CV_8UC3);
resize(image, image2, Size(inWidth, inHeight), 0, 0, INTER_NEAREST); // Size takes (width, height)
for (int h = 0; h < inHeight; h++) {
  for (int w = 0; w < inWidth; w++) {
    for (int c = 0; c < 3; c++) {
      // normalize the pixel from [0, 255] to [-1, 1], then apply the DPU input scale
      float pixel = (image2.at<Vec3b>(h, w)[c] / 255.0f - 0.5f) * 2.0f;
      imageInputs[i * inSize + h * inWidth * 3 + w * 3 + c] = (int8_t)(pixel * input_scale); // if you use BGR
    //imageInputs[i * inSize + h * inWidth * 3 + w * 3 + 2 - c] = (int8_t)(pixel * input_scale); // if you use RGB
    }
  }
}

Note: the preprocessing used at training time and at inference time must match. Any mismatch (processing the data differently) can degrade accuracy.
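To compare the two sides, the same arithmetic can be written out in Python and checked against the training script. This is a sketch that mirrors the C++ loop rather than the tutorial's own code: the value 64.0 for `input_scale` is a made-up example, and the clamp to the int8 range is my addition.

```python
def quantize_pixel(value_u8: int, input_scale: float) -> int:
    """Map a pixel from [0, 255] to [-1, 1], scale it, and truncate to int8."""
    normalized = (value_u8 / 255.0 - 0.5) * 2.0
    quantized = int(normalized * input_scale)  # truncates toward zero, like a C cast
    return max(-128, min(127, quantized))      # clamp: my addition for safety


def preprocess_image(image_bgr, input_scale, to_rgb=False):
    """Flatten an H x W x 3 nested list into the layout the C++ loop writes."""
    flat = []
    for row in image_bgr:
        for pixel in row:
            channels = pixel[::-1] if to_rgb else pixel  # BGR -> RGB when requested
            flat.extend(quantize_pixel(c, input_scale) for c in channels)
    return flat


# One-pixel image in BGR order, with a hypothetical input scale of 64.
image = [[(0, 128, 255)]]
print(preprocess_image(image, 64.0))
print(preprocess_image(image, 64.0, to_rgb=True))
```

If the training script normalizes to a different range or uses a different channel order, this is the place where the mismatch would show up.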

b. Run-Time Execution

Now power on the board and connect it to the host PC.
Open a terminal with PuTTY or Tera Term, then set the board IP to 192.168.1.217 and the host PC IP to 192.168.1.140.

host

scp target_vek280.tar root@[board ip address]:~/

board

tar -xvf target_vek280.tar
cd target_vek280
bash -x ./cifar10/run_all_cifar10_target.sh vek280

The tar command extracted the 5,000 images.

That is the result of the commands above.

c. Performance

This is the result of running inference on the actual board with bash -x ./cifar10/run_all_cifar10_target.sh vek280.

As shown above, ResNet18 running on the board reached a top-5 accuracy of 99%.
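Top-5 accuracy counts a prediction as correct when the true label is among the five highest-scoring classes, so with only 10 classes it is a much looser metric than top-1. The sketch below shows how such a figure can be computed; the scores and labels are toy values, not the board's output.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda idx: row[idx], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy example with 3 classes and 3 samples.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]

print("top-1:", top_k_accuracy(scores, labels, k=1))
print("top-2:", top_k_accuracy(scores, labels, k=2))
```

This is why a 99% top-5 result on the board is consistent with the roughly 86% accuracy reported after training.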

3. Summary

In this post we took the CIFAR10 dataset, trained the ResNet18 CNN on it, and used Vitis AI to quantize and compile the network so that it runs on the DPU of an FPGA board. Next time we will run ResNet18 on the FPGA board with ImageNet, a more complex dataset than CIFAR10.
