Configuration items and detailed settings for a machine-learning GPU infrastructure
- GPU type: NVIDIA A100
- OS: Ubuntu 20.04.3 LTS
- Components: Docker, NVIDIA Docker, NVIDIA CUDA
- conda
$ sudo apt-get remove docker docker-engine docker.io containerd runc
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package docker-engine
$ sudo apt-get update
0% [Working]
Hit:1 http://archive.ubuntu.com/ubuntu focal InRelease
Get:2 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal-backports InRelease [101 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:5 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages [1,302 kB]
Get:6 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 c-n-f Metadata [14.4 kB]
Get:7 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages [867 kB]
Get:8 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 c-n-f Metadata [19.4 kB]
Fetched 2,531 kB in 4s (627 kB/s)
Reading package lists... Done
$ sudo apt-get install \
ca-certificates \
curl \
gnupg \
lsb-release
Reading package lists... Done
Building dependency tree
Reading state information... Done
lsb-release is already the newest version (11.1.0ubuntu2).
lsb-release set to manually installed.
ca-certificates is already the newest version (20210119~20.04.2).
ca-certificates set to manually installed.
curl is already the newest version (7.68.0-1ubuntu2.7).
curl set to manually installed.
gnupg is already the newest version (2.2.19-3ubuntu2.1).
gnupg set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
Use the stable release channel.
$ echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
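For reference, the echo command above writes a single apt source line. The sketch below shows how that line is assembled when the architecture and release codename are passed in explicitly. The helper name docker_repo_line is hypothetical, and amd64 / focal are the values that appear in the apt output of this walkthrough.

```shell
# Hypothetical helper: build the apt source line that the echo command above
# writes to /etc/apt/sources.list.d/docker.list.
docker_repo_line() {
  arch="$1"      # e.g. the output of: dpkg --print-architecture
  codename="$2"  # e.g. the output of: lsb_release -cs
  keyring=/usr/share/keyrings/docker-archive-keyring.gpg
  printf 'deb [arch=%s signed-by=%s] https://download.docker.com/linux/ubuntu %s stable\n' \
    "$arch" "$keyring" "$codename"
}

# Values taken from this host's apt output (amd64, focal).
docker_repo_line amd64 focal
```

Comparing this line with the contents of /etc/apt/sources.list.d/docker.list is a quick way to confirm the repository was registered as intended.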
$ sudo apt-get update
Get:1 https://download.docker.com/linux/ubuntu focal InRelease [57.7 kB]
Get:2 https://download.docker.com/linux/ubuntu focal/stable amd64 Packages [12.3 kB]
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:6 http://archive.ubuntu.com/ubuntu focal-security InRelease
Fetched 70.0 kB in 1s (48.1 kB/s)
Reading package lists... Done
$ sudo apt-get install docker-ce docker-ce-cli containerd.io
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
docker-ce-rootless-extras docker-scan-plugin pigz slirp4netns
Suggested packages:
aufs-tools cgroupfs-mount | cgroup-lite
The following NEW packages will be installed:
containerd.io docker-ce docker-ce-cli docker-ce-rootless-extras docker-scan-plugin pigz slirp4netns
0 upgraded, 7 newly installed, 0 to remove and 2 not upgraded.
Need to get 95.3 MB of archives.
After this operation, 402 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 https://download.docker.com/linux/ubuntu focal/stable amd64 containerd.io amd64 1.4.11-1 [23.7 MB]
Get:2 http://archive.ubuntu.com/ubuntu focal/universe amd64 pigz amd64 2.4-1 [57.4 kB]
Get:3 https://download.docker.com/linux/ubuntu focal/stable amd64 docker-ce-cli amd64 5:20.10.10~3-0~ubuntu-focal [38.8 MB]
Get:4 http://archive.ubuntu.com/ubuntu focal/universe amd64 slirp4netns amd64 0.4.3-1 [74.3 kB]
Get:5 https://download.docker.com/linux/ubuntu focal/stable amd64 docker-ce amd64 5:20.10.10~3-0~ubuntu-focal [21.2 MB]
Get:6 https://download.docker.com/linux/ubuntu focal/stable amd64 docker-ce-rootless-extras amd64 5:20.10.10~3-0~ubuntu-focal [7,922 kB]
Get:7 https://download.docker.com/linux/ubuntu focal/stable amd64 docker-scan-plugin amd64 0.9.0~ubuntu-focal [3,518 kB]
Fetched 95.3 MB in 3s (37.1 MB/s)
Selecting previously unselected package pigz.
(Reading database ... 135114 files and directories currently installed.)
Preparing to unpack .../0-pigz_2.4-1_amd64.deb ...
Unpacking pigz (2.4-1) ...
Selecting previously unselected package containerd.io.
Preparing to unpack .../1-containerd.io_1.4.11-1_amd64.deb ...
Unpacking containerd.io (1.4.11-1) ...
Selecting previously unselected package docker-ce-cli.
Preparing to unpack .../2-docker-ce-cli_5%3a20.10.10~3-0~ubuntu-focal_amd64.deb ...
Unpacking docker-ce-cli (5:20.10.10~3-0~ubuntu-focal) ...
Selecting previously unselected package docker-ce.
Preparing to unpack .../3-docker-ce_5%3a20.10.10~3-0~ubuntu-focal_amd64.deb ...
Unpacking docker-ce (5:20.10.10~3-0~ubuntu-focal) ...
Selecting previously unselected package docker-ce-rootless-extras.
Preparing to unpack .../4-docker-ce-rootless-extras_5%3a20.10.10~3-0~ubuntu-focal_amd64.deb ...
Unpacking docker-ce-rootless-extras (5:20.10.10~3-0~ubuntu-focal) ...
Selecting previously unselected package docker-scan-plugin.
Preparing to unpack .../5-docker-scan-plugin_0.9.0~ubuntu-focal_amd64.deb ...
Unpacking docker-scan-plugin (0.9.0~ubuntu-focal) ...
Selecting previously unselected package slirp4netns.
Preparing to unpack .../6-slirp4netns_0.4.3-1_amd64.deb ...
Unpacking slirp4netns (0.4.3-1) ...
Setting up slirp4netns (0.4.3-1) ...
Setting up docker-scan-plugin (0.9.0~ubuntu-focal) ...
Setting up containerd.io (1.4.11-1) ...
Created symlink /etc/systemd/system/multi-user.target.wants/containerd.service → /lib/systemd/system/containerd.service.
Setting up docker-ce-cli (5:20.10.10~3-0~ubuntu-focal) ...
Setting up pigz (2.4-1) ...
Setting up docker-ce-rootless-extras (5:20.10.10~3-0~ubuntu-focal) ...
Setting up docker-ce (5:20.10.10~3-0~ubuntu-focal) ...
Created symlink /etc/systemd/system/multi-user.target.wants/docker.service → /lib/systemd/system/docker.service.
Created symlink /etc/systemd/system/sockets.target.wants/docker.socket → /lib/systemd/system/docker.socket.
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for systemd (245.4-4ubuntu3.13) ...
$ apt-cache madison docker-ce
docker-ce | 5:20.10.10~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:20.10.9~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:20.10.8~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:20.10.7~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:20.10.6~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:20.10.5~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:20.10.4~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:20.10.3~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:20.10.2~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:20.10.1~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:20.10.0~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:19.03.15~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:19.03.14~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:19.03.13~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:19.03.12~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:19.03.11~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:19.03.10~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
docker-ce | 5:19.03.9~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
$ sudo apt-get install docker-ce=<VERSION_STRING> docker-ce-cli=<VERSION_STRING> containerd.io
$ sudo apt-get install docker-ce=5:20.10.10~3-0~ubuntu-focal docker-ce-cli=5:20.10.10~3-0~ubuntu-focal containerd.io
Reading package lists... Done
Building dependency tree
Reading state information... Done
containerd.io is already the newest version (1.4.11-1).
docker-ce-cli is already the newest version (5:20.10.10~3-0~ubuntu-focal).
docker-ce is already the newest version (5:20.10.10~3-0~ubuntu-focal).
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
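The VERSION_STRING used for pinning is the second column of the apt-cache madison listing. A minimal sketch of extracting that column follows; the helper name madison_versions is hypothetical, and the sample line is copied from the listing above.

```shell
# Hypothetical helper: print only the version column of `apt-cache madison` output.
madison_versions() {
  # Fields are separated by '|'; trim the padding around the second field.
  awk -F'|' '{ gsub(/^[ \t]+|[ \t]+$/, "", $2); print $2 }'
}

# Sample line copied from the listing above.
printf '%s\n' 'docker-ce | 5:20.10.10~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages' \
  | madison_versions

# On a host with the Docker repository configured, and assuming the newest
# version is listed first (as in the listing above), one could write:
#   VERSION_STRING=$(apt-cache madison docker-ce | madison_versions | head -n 1)
```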
$ sudo usermod -aG docker [USER_NAME]
After adding the user to the docker group, log out and log back in before running docker commands without sudo.
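A minimal sketch for checking whether the group change is active in the current login session. The helper name session_in_group is hypothetical. It calls id -nG without a user name on purpose, since that form reports the groups of the running session rather than the entries in /etc/group.

```shell
# Hypothetical helper: succeed if the current session belongs to group $1.
session_in_group() {
  # `id -nG` with no user argument lists the groups of the running process.
  id -nG | tr ' ' '\n' | grep -qx "$1"
}

if session_in_group docker; then
  echo "docker group is active in this session"
else
  echo "docker group is not active yet; log out and log back in"
fi
```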
# systemctl list-unit-files | grep docker
docker.service enabled enabled
docker.socket enabled enabled
# systemctl enable docker.service
Check the current root directory
# docker info | grep Root
Docker Root Dir: /var/lib/docker
Change the root directory
$ sudo systemctl stop docker.socket
$ sudo systemctl stop docker
$ sudo vi /etc/docker/daemon.json
// Add the following content
{
"data-root": "[target_directory]"
}
$ sudo systemctl start docker
Check the root directory after the change
# docker info | grep Root
Docker Root Dir: /data/docker
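The daemon.json edit can also be done without an interactive editor. The sketch below only prints the file body; the helper name render_daemon_json is hypothetical, and /data/docker is the value reported by docker info above.

```shell
# Hypothetical helper: print a daemon.json body that sets data-root to $1.
render_daemon_json() {
  printf '{\n  "data-root": "%s"\n}\n' "$1"
}

# /data/docker is the directory shown by `docker info` above.
render_daemon_json /data/docker

# To apply it, one could pipe the output into the config file. This overwrites
# any existing daemon.json, so merge by hand if the file already has content:
#   render_daemon_json /data/docker | sudo tee /etc/docker/daemon.json > /dev/null
```

Note that changing data-root does not move existing images or containers; anything stored under the old directory stays there unless it is copied over separately.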
Install NVIDIA Docker so that the host's GPU resources can be used from inside isolated Docker containers.
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
OK
deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
#deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /
#deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH) /
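The distribution variable in the command above is built from two fields of /etc/os-release. A minimal sketch of that step, using a throwaway file instead of the real one so that it runs anywhere, is shown below. The helper name distribution_from is hypothetical.

```shell
# Hypothetical helper: compute the distribution string from an os-release style file.
distribution_from() {
  # Source the file in a subshell so its variables do not leak into the caller.
  ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Demo with a temporary file holding the two fields as Ubuntu 20.04 sets them.
tmp=$(mktemp)
printf 'ID=ubuntu\nVERSION_ID="20.04"\n' > "$tmp"
distribution_from "$tmp"
rm -f "$tmp"
```

On this host the value is ubuntu20.04, yet the list returned for it refers to ubuntu18.04 paths, as the output above shows. The packages from those paths installed without error in this walkthrough.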
$ sudo apt-get update
Get:1 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64 InRelease [1,484 B]
Get:2 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 InRelease [1,481 B]
Get:3 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64 InRelease [1,474 B]
Hit:4 https://download.docker.com/linux/ubuntu focal InRelease
Get:5 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64 Packages [12.9 kB]
Get:6 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 Packages [7,416 B]
Hit:7 http://archive.ubuntu.com/ubuntu focal InRelease
Get:8 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64 Packages [4,488 B]
Hit:9 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:10 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:11 http://archive.ubuntu.com/ubuntu focal-security InRelease
Fetched 29.3 kB in 1s (19.7 kB/s)
Reading package lists... Done
$ sudo apt-get install -y nvidia-docker2
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
libnvidia-container-tools libnvidia-container1 nvidia-container-runtime nvidia-container-toolkit
The following NEW packages will be installed:
libnvidia-container-tools libnvidia-container1 nvidia-container-runtime nvidia-container-toolkit nvidia-docker2
0 upgraded, 5 newly installed, 0 to remove and 2 not upgraded.
Need to get 1,589 kB of archives.
After this operation, 4,857 kB of additional disk space will be used.
Get:1 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64 libnvidia-container1 1.5.1-1 [69.1 kB]
Get:2 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64 libnvidia-container-tools 1.5.1-1 [21.3 kB]
Get:3 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 nvidia-container-toolkit 1.5.1-1 [716 kB]
Get:4 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64 nvidia-container-runtime 3.5.0-1 [777 kB]
Get:5 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64 nvidia-docker2 2.6.0-1 [5,960 B]
Fetched 1,589 kB in 1s (1,222 kB/s)
Selecting previously unselected package libnvidia-container1:amd64.
(Reading database ... 135365 files and directories currently installed.)
Preparing to unpack .../libnvidia-container1_1.5.1-1_amd64.deb ...
Unpacking libnvidia-container1:amd64 (1.5.1-1) ...
Selecting previously unselected package libnvidia-container-tools.
Preparing to unpack .../libnvidia-container-tools_1.5.1-1_amd64.deb ...
Unpacking libnvidia-container-tools (1.5.1-1) ...
Selecting previously unselected package nvidia-container-toolkit.
Preparing to unpack .../nvidia-container-toolkit_1.5.1-1_amd64.deb ...
Unpacking nvidia-container-toolkit (1.5.1-1) ...
Selecting previously unselected package nvidia-container-runtime.
Preparing to unpack .../nvidia-container-runtime_3.5.0-1_amd64.deb ...
Unpacking nvidia-container-runtime (3.5.0-1) ...
Selecting previously unselected package nvidia-docker2.
Preparing to unpack .../nvidia-docker2_2.6.0-1_all.deb ...
Unpacking nvidia-docker2 (2.6.0-1) ...
Setting up libnvidia-container1:amd64 (1.5.1-1) ...
Setting up libnvidia-container-tools (1.5.1-1) ...
Setting up nvidia-container-toolkit (1.5.1-1) ...
Setting up nvidia-container-runtime (3.5.0-1) ...
Setting up nvidia-docker2 (2.6.0-1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.2) ...
$ sudo systemctl restart docker
$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Unable to find image 'nvidia/cuda:11.0-base' locally
11.0-base: Pulling from nvidia/cuda
54ee1f796a1e: Pull complete
f7bfea53ad12: Pull complete
46d371e02073: Pull complete
b66c17bbf772: Pull complete
3642f1a6dfb3: Pull complete
e5ce55b8b4b9: Pull complete
155bc0332b0a: Pull complete
Digest: sha256:774ca3d612de15213102c2dbbba55df44dc5cf9870ca2be6c6e9c627fa63d67a
Status: Downloaded newer image for nvidia/cuda:11.0-base
Wed Nov 3 13:06:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74 Driver Version: 470.74 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:01:00.0 Off | 0 |
| N/A 27C P0 33W / 250W | 9MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... Off | 00000000:41:00.0 Off | 0 |
| N/A 29C P0 37W / 250W | 9MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... Off | 00000000:81:00.0 Off | 0 |
| N/A 28C P0 36W / 250W | 9MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
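For scripted checks, the table above is awkward to parse. One option is nvidia-smi's query mode, which prints one line per GPU, and then counting the lines. The sketch below shows only the counting step with placeholder input; the helper name count_gpus is hypothetical.

```shell
# Hypothetical helper: count non-empty lines on stdin (one line per GPU).
count_gpus() {
  awk 'NF { n++ } END { print n + 0 }'
}

# Placeholder input standing in for three GPU name lines.
printf 'gpu-0\ngpu-1\ngpu-2\n' | count_gpus

# Against the real container, assuming the same image as above:
#   sudo docker run --rm --gpus all nvidia/cuda:11.0-base \
#     nvidia-smi --query-gpu=name --format=csv,noheader | count_gpus
```

On the host in this walkthrough the expected count would be 3, matching the three A100 entries in the table.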
$ sudo rm -rf /usr/local/cuda*
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
--2021-11-03 14:56:39-- https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.39.144
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.39.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190 [application/octet-stream]
Saving to: ‘cuda-ubuntu2004.pin’
cuda-ubuntu2004.pin 100%[===========================================================================================>] 190 --.-KB/s in 0s
2021-11-03 14:56:39 (13.6 MB/s) - ‘cuda-ubuntu2004.pin’ saved [190/190]
$ sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ wget https://developer.download.nvidia.com/compute/cuda/11.5.0/local_installers/cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb
--2021-11-03 14:57:43-- https://developer.download.nvidia.com/compute/cuda/11.5.0/local_installers/cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.39.144
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.39.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2602827612 (2.4G) [application/x-deb]
Saving to: ‘cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb’
cuda-repo-ubuntu2004-11-5-local_11.5.0-495. 100%[===========================================================================================>] 2.42G 85.0MB/s in 30s
2021-11-03 14:58:12 (83.5 MB/s) - ‘cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb’ saved [2602827612/2602827612]
$ sudo dpkg -i cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb
Selecting previously unselected package cuda-repo-ubuntu2004-11-5-local.
(Reading database ... 135327 files and directories currently installed.)
Preparing to unpack cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb ...
Unpacking cuda-repo-ubuntu2004-11-5-local (11.5.0-495.29.05-1) ...
Setting up cuda-repo-ubuntu2004-11-5-local (11.5.0-495.29.05-1) ...
The public CUDA GPG key does not appear to be installed.
To install the key, run this command:
sudo apt-key add /var/cuda-repo-ubuntu2004-11-5-local/7fa2af80.pub
$ sudo apt-key add /var/cuda-repo-ubuntu2004-11-5-local/7fa2af80.pub
OK
$ sudo apt-get update
$ sudo apt-get -y install cuda
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:
The following packages have unmet dependencies:
cuda : Depends: cuda-11-5 (>= 11.5.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.
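To work around the dependency error above, purge the existing NVIDIA packages and then reinstall CUDA:
$ sudo apt clean
$ sudo apt update
$ sudo apt purge 'nvidia-*'
$ sudo apt autoremove
$ sudo apt install -y cuda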
After a reboot, or after releasing the NVIDIA devices, run nvidia-smi to verify the installation.
# sudo lsof /dev/nvidia* | awk '{if(NR>1) print $2}' | sudo xargs kill -9
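Since kill -9 is abrupt, it may be worth listing the affected PIDs before killing anything. The sketch below isolates the PID extraction from the one-liner above and removes duplicates. The helper name pids_from_lsof is hypothetical, and the sample rows are made-up placeholders, not output from this host.

```shell
# Hypothetical helper: extract unique PIDs (column 2) from lsof output,
# skipping the header row.
pids_from_lsof() {
  awk 'NR > 1 { print $2 }' | sort -un
}

# Placeholder rows for illustration only (not real output from this host).
printf '%s\n' \
  'COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME' \
  'python 4242 alice mem CHR 195,0 0 500 /dev/nvidia0' \
  'python 4242 alice mem CHR 195,255 0 501 /dev/nvidiactl' \
  'Xorg 1001 root 10u CHR 195,0 0 500 /dev/nvidia0' \
  | pids_from_lsof

# Review first, then kill only if the list looks right:
#   sudo lsof /dev/nvidia* | pids_from_lsof
```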