Setting Up a Docker GPU Environment on AWS EC2

ImOk · April 24, 2022

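Resolving errors when running a Docker GPU environment on AWS EC2

This post records how I resolved the errors that came up while setting up a deep learning GPU environment with Docker on an AWS EC2 instance.

Instance type: g4dn.4xlarge

1. Check for the NVIDIA device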

$lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
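
The same check can be scripted so that a setup script stops early when no GPU is attached. The following is a minimal sketch: the function name count_nvidia_devices is hypothetical, and it only counts matching lines in lspci-style output read from standard input.

```shell
#!/bin/sh
# Hypothetical helper: count lines mentioning NVIDIA in lspci-style output on stdin.
count_nvidia_devices() {
    # grep -c prints the number of matching lines; it exits 1 when the count is 0,
    # so "|| true" keeps the function usable under "set -e".
    grep -ic 'nvidia' || true
}

# On a real instance this would be: lspci | count_nvidia_devices
# Here the line from this post is used as sample input.
sample='00:1e.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)'
printf '%s\n' "$sample" | count_nvidia_devices
```

A count of 0 would suggest that the instance type has no NVIDIA GPU or that the device is not visible to the OS.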

2. docker run --gpus ERROR

ubuntu: 18.04
cuda: 11.3.1
torch: 1.7.1

2.1. docker run fails

$docker run -it --name pytorch --gpus '"device=0"' --network airflownet -v $PWD/notebooks:/notebooks -p 8888:8888 cuda11.3:pytorch1.7.1

docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled

2.1. Fix: install nvidia-container-toolkit

# Install nvidia-container-toolkit
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Restart the docker service
$ sudo systemctl restart docker
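
The distribution variable in the install command is built from the ID and VERSION_ID fields of /etc/os-release, and it selects which repository list is downloaded. The sketch below shows what that expression evaluates to. The function name repo_distribution and the temporary file are assumptions for illustration; the temporary file stands in for /etc/os-release on Ubuntu 18.04.

```shell
#!/bin/sh
# Hypothetical helper: print "$ID$VERSION_ID" from an os-release style file.
repo_distribution() {
    # Source the file in a subshell so its variables do not leak into the caller.
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Demonstrate with a throwaway file that mimics the relevant fields on Ubuntu 18.04.
tmp=$(mktemp)
printf 'ID=ubuntu\nVERSION_ID="18.04"\n' > "$tmp"
repo_distribution "$tmp"
rm -f "$tmp"
```

With these two fields the result is ubuntu18.04, so the list is fetched from the ubuntu18.04 path under the repository URL used above.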

2.2. docker run fails again

$docker run -it --name pytorch --gpus '"device=0"' --network airflownet -v $PWD/notebooks:/notebooks -p 8888:8888 cuda11.3:pytorch1.7.1

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0000] error waiting for container: context canceled

2.2. Fix attempt 1: install nvidia-utils

Check whether the nvidia-smi command is available

$nvidia-smi

Command 'nvidia-smi' not found, but can be installed with:

sudo apt install nvidia-340
sudo apt install nvidia-utils-390

$sudo apt install nvidia-utils-390

nvidia-smi is now found, but it fails because the utilities package does not include the kernel driver:

$nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
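
One way to confirm that the kernel driver itself is missing, rather than the nvidia-smi binary, is to look for the file the NVIDIA kernel module exposes under /proc once it is loaded. This is a minimal sketch: the function name nvidia_driver_loaded is hypothetical, and the optional path argument exists only so the check can be tried on a machine without a GPU.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the NVIDIA kernel module has registered itself.
# Defaults to /proc/driver/nvidia/version; pass another path to test the logic.
nvidia_driver_loaded() {
    [ -r "${1:-/proc/driver/nvidia/version}" ]
}

if nvidia_driver_loaded; then
    echo "NVIDIA kernel driver is loaded"
else
    echo "NVIDIA kernel driver is not loaded"
fi
```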

2.2. Fix attempt 2: install nvidia-driver

$sudo apt install -y nvidia-driver-470
# A reboot is required
$sudo reboot now

2.3. Resolved

$nvidia-smi
Sun Apr 24 08:09:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    28W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
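
The errors in section 2 and the fixes that resolved them in this post can be summarized as a small lookup. This is only a sketch of the troubleshooting path followed here, not a general diagnostic; the function name suggest_fix is hypothetical, and the suggested driver version is the one used in this post.

```shell
#!/bin/sh
# Hypothetical helper: map an error message to the fix that worked in this post.
suggest_fix() {
    case "$1" in
        *'could not select device driver'*)
            echo 'install nvidia-container-toolkit and restart docker' ;;
        *'libnvidia-ml.so.1'*|*"couldn't communicate with the NVIDIA driver"*)
            echo 'install the NVIDIA driver (nvidia-driver-470 here) and reboot' ;;
        *)
            echo 'not covered by this post' ;;
    esac
}

# Example with the first error message from section 2.1.
suggest_fix 'docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].'
```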

3. Check the CUDA version

$nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
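
The CUDA version in the nvidia-smi header is the highest CUDA version the installed driver supports, while nvcc reports the installed toolkit version, so the toolkit version should not exceed the driver's. The sketch below compares the two values using the output lines shown in this post as sample input. All three function names are hypothetical, and the parsing assumes the output formats shown above.

```shell
#!/bin/sh
# Hypothetical helpers; both read tool output from stdin.
driver_cuda_version() {
    # Extracts "11.4" from a line such as "... CUDA Version: 11.4     |".
    sed -n 's/.*CUDA Version: \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
toolkit_cuda_version() {
    # Extracts "11.3" from "Cuda compilation tools, release 11.3, V11.3.109".
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p'
}
# Succeeds when version $1 is not newer than version $2 (major.minor only).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n | head -n 1)" = "$1" ]
}

# Sample lines copied from the outputs in this post.
smi_line='| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |'
nvcc_line='Cuda compilation tools, release 11.3, V11.3.109'
driver=$(printf '%s\n' "$smi_line" | driver_cuda_version)
toolkit=$(printf '%s\n' "$nvcc_line" | toolkit_cuda_version)

if version_le "$toolkit" "$driver"; then
    echo "toolkit $toolkit is within the driver limit $driver"
else
    echo "toolkit $toolkit is newer than the driver limit $driver"
fi
```

On a real machine the sample lines would be replaced by piping nvidia-smi and nvcc -V into the two extraction functions.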