This post records how I resolved the errors that came up while setting up a deep-learning GPU environment with Docker on AWS EC2.
Instance type: g4dn.4xlarge
$lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation Device 1eb8 (rev a1)
ubuntu: 18.04
cuda: 11.3.1
torch: 1.7.1
$docker run -it --name pytorch --gpus '"device=0"' --network airflownet -v $PWD/notebooks:/notebooks -p 8888:8888 cuda11.3:pytorch1.7.1
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
ERRO[0000] error waiting for container: context canceled
# Install nvidia-container-toolkit
$distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
# Restart the docker service
$sudo systemctl restart docker
$docker run -it --name pytorch --gpus '"device=0"' --network airflownet -v $PWD/notebooks:/notebooks -p 8888:8888 cuda11.3:pytorch1.7.1
docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0000] error waiting for container: context canceled
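The two failures so far point at different missing pieces. The first message (`could not select device driver`) appeared before nvidia-container-toolkit was installed, and the second names `libnvidia-ml.so.1`, the NVIDIA Management Library that ships with the host driver, which suggests the driver itself is absent on the host. A minimal sketch of a helper that maps the error text to the likely fix follows; `diagnose_gpu_error` is a hypothetical name, and the patterns cover only the two messages seen in this post.

```shell
# Hypothetical helper: map a `docker run --gpus` error message to a likely fix.
# Only the two messages from this post are recognized; anything else is "unknown".
diagnose_gpu_error() {
  case "$1" in
    *'could not select device driver'*)
      echo "install nvidia-container-toolkit, then restart docker" ;;
    *'libnvidia-ml.so.1'*)
      echo "install the NVIDIA driver on the host, then reboot" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example with the first error message from this post.
diagnose_gpu_error 'could not select device driver "" with capabilities: [[gpu]]'
```

The mapping is deliberately narrow: it is a lookup for this particular troubleshooting path, not a general diagnostic for every NVIDIA runtime error.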
Check whether the nvidia-smi command is available
$nvidia-smi
Command 'nvidia-smi' not found, but can be installed with:
sudo apt install nvidia-340
sudo apt install nvidia-utils-390
$sudo apt install nvidia-utils-390
$nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
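The nvidia-utils-390 package provides the `nvidia-smi` binary but not the kernel driver, which is why the command now exists yet cannot reach a driver. One way to confirm this, sketched below, is to check for `/proc/driver/nvidia/version`, which is present only while the NVIDIA kernel module is loaded. The function name `nvidia_driver_loaded` and its optional path argument are assumptions made for illustration.

```shell
# Hypothetical helper: succeed when the NVIDIA kernel module is loaded.
# A loaded driver exposes its version at /proc/driver/nvidia/version;
# the optional argument only exists so the check can be pointed elsewhere.
nvidia_driver_loaded() {
  [ -r "${1:-/proc/driver/nvidia/version}" ]
}

if nvidia_driver_loaded; then
  cat /proc/driver/nvidia/version
else
  echo "NVIDIA kernel module not loaded: install a driver package and reboot"
fi
```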
$sudo apt install -y nvidia-driver-470
# A reboot is required for the driver to load
$sudo reboot now
$nvidia-smi
Sun Apr 24 08:09:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    28W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
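The `CUDA Version: 11.4` in the nvidia-smi header is the highest CUDA version the installed 470 driver supports, while `nvcc -V` reports the installed toolkit (11.3). A toolkit at or below the driver's version is expected to work, so this pairing looks fine. The sketch below pulls both numbers out of the text and compares them; the function names are hypothetical, and the sample strings are copied from the output above.

```shell
# Hypothetical helpers for comparing the driver's CUDA limit with the toolkit.

driver_cuda_version() {
  # Read nvidia-smi output on stdin and print the "CUDA Version" value.
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

toolkit_cuda_version() {
  # Read `nvcc -V` output on stdin and print the "release X.Y" value.
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\),.*/\1/p' | head -n 1
}

version_le() {
  # Succeed when major.minor version $1 is less than or equal to $2.
  a_major=${1%%.*}; a_minor=${1#*.}
  b_major=${2%%.*}; b_minor=${2#*.}
  [ "$a_major" -lt "$b_major" ] ||
    { [ "$a_major" -eq "$b_major" ] && [ "$a_minor" -le "$b_minor" ]; }
}

driver=$(echo '| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |' | driver_cuda_version)
toolkit=$(echo 'Cuda compilation tools, release 11.3, V11.3.109' | toolkit_cuda_version)

if version_le "$toolkit" "$driver"; then
  echo "toolkit $toolkit is within driver limit $driver"
else
  echo "toolkit $toolkit is newer than driver limit $driver"
fi
```

On a live host the two `echo` lines could be replaced with `nvidia-smi | driver_cuda_version` and `nvcc -V | toolkit_cuda_version`.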