On-premise Infra Setup for Machine Learning

leesj · November 3, 2021

Configuration items and detailed settings for setting up GPU infrastructure for machine learning

  • GPU type: NVIDIA A100
  • OS: Ubuntu 20.04.3 LTS
  • Components: Docker, NVIDIA Docker, NVIDIA CUDA
  • conda

NVIDIA driver

Docker install

Uninstall old versions

$ sudo apt-get remove docker docker-engine docker.io containerd runc

Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package docker-engine

Install using the repository

Set up the repository

$ sudo apt-get update

0% [Working]
Hit:1 http://archive.ubuntu.com/ubuntu focal InRelease
Get:2 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal-backports InRelease [101 kB]
Get:4 http://archive.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:5 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages [1,302 kB]
Get:6 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 c-n-f Metadata [14.4 kB]
Get:7 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages [867 kB]
Get:8 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 c-n-f Metadata [19.4 kB]
Fetched 2,531 kB in 4s (627 kB/s)
Reading package lists... Done
$ sudo apt-get install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release

Reading package lists... Done
Building dependency tree
Reading state information... Done
lsb-release is already the newest version (11.1.0ubuntu2).
lsb-release set to manually installed.
ca-certificates is already the newest version (20210119~20.04.2).
ca-certificates set to manually installed.
curl is already the newest version (7.68.0-1ubuntu2.7).
curl set to manually installed.
gnupg is already the newest version (2.2.19-3ubuntu2.1).
gnupg set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.

Add Docker’s official GPG key

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

Set up the stable repository

$ echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
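
To make it easier to see what the command above writes to /etc/apt/sources.list.d/docker.list, here is a minimal sketch that builds the same line from explicit arguments. The function name `docker_repo_line` is made up for illustration; the values `amd64` and `focal` are what `dpkg --print-architecture` and `lsb_release -cs` return on the machine in this post.

```shell
# Hypothetical helper: print the apt source line for the Docker stable repository.
# $1: architecture (e.g. the output of `dpkg --print-architecture`)
# $2: Ubuntu release codename (e.g. the output of `lsb_release -cs`)
docker_repo_line() {
  echo "deb [arch=$1 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $2 stable"
}

# Values taken from the apt output in this post (amd64, focal).
docker_repo_line amd64 focal
```

If `apt-get update` later does not list download.docker.com, comparing the file contents against this line is a quick first check.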

Install Docker Engine

$ sudo apt-get update

Get:1 https://download.docker.com/linux/ubuntu focal InRelease [57.7 kB]
Get:2 https://download.docker.com/linux/ubuntu focal/stable amd64 Packages [12.3 kB]
Hit:3 http://archive.ubuntu.com/ubuntu focal InRelease
Hit:4 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:5 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:6 http://archive.ubuntu.com/ubuntu focal-security InRelease
Fetched 70.0 kB in 1s (48.1 kB/s)
Reading package lists... Done
$ sudo apt-get install docker-ce docker-ce-cli containerd.io

Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  docker-ce-rootless-extras docker-scan-plugin pigz slirp4netns
Suggested packages:
  aufs-tools cgroupfs-mount | cgroup-lite
The following NEW packages will be installed:
  containerd.io docker-ce docker-ce-cli docker-ce-rootless-extras docker-scan-plugin pigz slirp4netns
0 upgraded, 7 newly installed, 0 to remove and 2 not upgraded.
Need to get 95.3 MB of archives.
After this operation, 402 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y
Get:1 https://download.docker.com/linux/ubuntu focal/stable amd64 containerd.io amd64 1.4.11-1 [23.7 MB]
Get:2 http://archive.ubuntu.com/ubuntu focal/universe amd64 pigz amd64 2.4-1 [57.4 kB]
Get:3 https://download.docker.com/linux/ubuntu focal/stable amd64 docker-ce-cli amd64 5:20.10.10~3-0~ubuntu-focal [38.8 MB]
Get:4 http://archive.ubuntu.com/ubuntu focal/universe amd64 slirp4netns amd64 0.4.3-1 [74.3 kB]
Get:5 https://download.docker.com/linux/ubuntu focal/stable amd64 docker-ce amd64 5:20.10.10~3-0~ubuntu-focal [21.2 MB]
Get:6 https://download.docker.com/linux/ubuntu focal/stable amd64 docker-ce-rootless-extras amd64 5:20.10.10~3-0~ubuntu-focal [7,922 kB]
Get:7 https://download.docker.com/linux/ubuntu focal/stable amd64 docker-scan-plugin amd64 0.9.0~ubuntu-focal [3,518 kB]
Fetched 95.3 MB in 3s (37.1 MB/s)
Selecting previously unselected package pigz.
(Reading database ... 135114 files and directories currently installed.)
Preparing to unpack .../0-pigz_2.4-1_amd64.deb ...
Unpacking pigz (2.4-1) ...
Selecting previously unselected package containerd.io.
Preparing to unpack .../1-containerd.io_1.4.11-1_amd64.deb ...
Unpacking containerd.io (1.4.11-1) ...
Selecting previously unselected package docker-ce-cli.
Preparing to unpack .../2-docker-ce-cli_5%3a20.10.10~3-0~ubuntu-focal_amd64.deb ...
Unpacking docker-ce-cli (5:20.10.10~3-0~ubuntu-focal) ...
Selecting previously unselected package docker-ce.
Preparing to unpack .../3-docker-ce_5%3a20.10.10~3-0~ubuntu-focal_amd64.deb ...
Unpacking docker-ce (5:20.10.10~3-0~ubuntu-focal) ...
Selecting previously unselected package docker-ce-rootless-extras.
Preparing to unpack .../4-docker-ce-rootless-extras_5%3a20.10.10~3-0~ubuntu-focal_amd64.deb ...
Unpacking docker-ce-rootless-extras (5:20.10.10~3-0~ubuntu-focal) ...
Selecting previously unselected package docker-scan-plugin.
Preparing to unpack .../5-docker-scan-plugin_0.9.0~ubuntu-focal_amd64.deb ...
Unpacking docker-scan-plugin (0.9.0~ubuntu-focal) ...
Selecting previously unselected package slirp4netns.
Preparing to unpack .../6-slirp4netns_0.4.3-1_amd64.deb ...
Unpacking slirp4netns (0.4.3-1) ...
Setting up slirp4netns (0.4.3-1) ...
Setting up docker-scan-plugin (0.9.0~ubuntu-focal) ...
Setting up containerd.io (1.4.11-1) ...
Created symlink /etc/systemd/system/multi-user.target.wants/containerd.service → /lib/systemd/system/containerd.service.
Setting up docker-ce-cli (5:20.10.10~3-0~ubuntu-focal) ...
Setting up pigz (2.4-1) ...
Setting up docker-ce-rootless-extras (5:20.10.10~3-0~ubuntu-focal) ...
Setting up docker-ce (5:20.10.10~3-0~ubuntu-focal) ...
Created symlink /etc/systemd/system/multi-user.target.wants/docker.service → /lib/systemd/system/docker.service.
Created symlink /etc/systemd/system/sockets.target.wants/docker.socket → /lib/systemd/system/docker.socket.
Processing triggers for man-db (2.9.1-1) ...
Processing triggers for systemd (245.4-4ubuntu3.13) ...
  • To install a specific version of Docker, list the available versions as shown below and pick one
$ apt-cache madison docker-ce
 docker-ce | 5:20.10.10~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:20.10.9~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:20.10.8~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:20.10.7~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:20.10.6~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:20.10.5~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:20.10.4~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:20.10.3~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:20.10.2~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:20.10.1~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:20.10.0~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:19.03.15~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:19.03.14~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:19.03.13~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:19.03.12~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:19.03.11~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:19.03.10~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
 docker-ce | 5:19.03.9~3-0~ubuntu-focal | https://download.docker.com/linux/ubuntu focal/stable amd64 Packages
  • Install the latest version:
    5:20.10.10~3-0~ubuntu-focal
$ sudo apt-get install docker-ce=<VERSION_STRING> docker-ce-cli=<VERSION_STRING> containerd.io
$ sudo apt-get install docker-ce=5:20.10.10~3-0~ubuntu-focal docker-ce-cli=5:20.10.10~3-0~ubuntu-focal containerd.io
Reading package lists... Done
Building dependency tree
Reading state information... Done
containerd.io is already the newest version (1.4.11-1).
docker-ce-cli is already the newest version (5:20.10.10~3-0~ubuntu-focal).
docker-ce is already the newest version (5:20.10.10~3-0~ubuntu-focal).
0 upgraded, 0 newly installed, 0 to remove and 2 not upgraded.
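
Instead of copying the version string by hand, it can be pulled out of the `apt-cache madison` listing. The sketch below assumes the newest version is on the first row, as in the listing above; `latest_madison_version` is a made-up helper name, not part of apt.

```shell
# Hypothetical helper: read `apt-cache madison <package>` output on stdin and
# print the version column (second '|'-separated field) of the first row.
latest_madison_version() {
  awk -F'|' 'NR == 1 { gsub(/ /, "", $2); print $2 }'
}

# Usage (needs the Docker apt repository configured as above):
# VERSION_STRING=$(apt-cache madison docker-ce | latest_madison_version)
# sudo apt-get install docker-ce="$VERSION_STRING" docker-ce-cli="$VERSION_STRING" containerd.io
```

To pin an older release, swap `NR == 1` for a match on the version you want.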

Add a user to the docker group so they can run Docker commands

$ sudo usermod -aG docker [USER_NAME]

After adding the user, log out and back in; the docker command can then be used without sudo.
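
A quick way to confirm the group change took effect after logging back in is to look for `docker` in the output of `id -nG`. This is only a sketch; `has_group` is a made-up helper name.

```shell
# Hypothetical helper: read a space-separated group list on stdin (the format
# printed by `id -nG`) and succeed only if group $1 is in it as a whole word.
has_group() {
  tr ' ' '\n' | grep -qx "$1"
}

# Usage:
# id -nG "$USER" | has_group docker && echo "docker group active"
```

If the check fails in the current shell, the session most likely predates the `usermod` call.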

Start Docker automatically on system reboot

  • Check that the units are in the enabled state
# systemctl list-unit-files | grep docker
docker.service                             enabled         enabled
docker.socket                              enabled         enabled
  • If a unit is disabled, enable it as follows
# systemctl enable docker.service
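
The check above can be scripted so that only units that are not yet enabled are reported. A minimal sketch, assuming the two-column layout (unit name, state) of the `systemctl list-unit-files` output shown above; `docker_units_not_enabled` is a made-up name.

```shell
# Hypothetical helper: read `systemctl list-unit-files` output on stdin and
# print every docker.* unit whose state column is not "enabled".
docker_units_not_enabled() {
  awk '$1 ~ /^docker\./ && $2 != "enabled" { print $1 }'
}

# Usage (xargs -r is a GNU extension that skips the command when there is no input):
# systemctl list-unit-files | docker_units_not_enabled | xargs -r sudo systemctl enable
```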

Changing the Docker root directory

Check the current root directory

# docker info | grep Root
Docker Root Dir: /var/lib/docker

Change the root directory

$ sudo systemctl stop docker.socket
$ sudo systemctl stop docker

$ sudo vi /etc/docker/daemon.json

Add the following content:
{
  "data-root": "[target_directory]"
}

$ sudo systemctl start docker

Check the root directory after the change

# docker info | grep Root
Docker Root Dir: /data/docker
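
For repeatable setups, the daemon.json edit can be generated rather than typed in vi. The sketch below writes the file to a scratch path first so it can be reviewed; `write_data_root` is a made-up helper, and `/data/docker` is the target directory used in this post. Note that it overwrites the file, so it only suits a daemon.json with no other settings.

```shell
# Hypothetical helper: write a daemon.json containing only the data-root key.
# $1: target directory for Docker data, $2: path of the file to write
write_data_root() {
  printf '{\n  "data-root": "%s"\n}\n' "$1" > "$2"
}

# Write to a temporary file and show the result before copying it into place.
tmp=$(mktemp)
write_data_root /data/docker "$tmp"
cat "$tmp"

# After review (Docker stopped as above):
# sudo cp "$tmp" /etc/docker/daemon.json
```

Changing data-root does not move existing images and containers; anything under /var/lib/docker stays there unless it is copied over.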

Nvidia docker install

Install NVIDIA Docker so that the host's GPU resources can be used from inside isolated Docker containers.

Install Docker first (covered above)

Setting up NVIDIA Container Toolkit

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
   
   
   
OK
deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
#deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /
#deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH) /
$ sudo apt-get update

Get:1 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  InRelease [1,484 B]
Get:2 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  InRelease [1,481 B]
Get:3 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64  InRelease [1,474 B]
Hit:4 https://download.docker.com/linux/ubuntu focal InRelease
Get:5 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  Packages [12.9 kB]
Get:6 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  Packages [7,416 B]
Hit:7 http://archive.ubuntu.com/ubuntu focal InRelease
Get:8 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64  Packages [4,488 B]
Hit:9 http://archive.ubuntu.com/ubuntu focal-updates InRelease
Hit:10 http://archive.ubuntu.com/ubuntu focal-backports InRelease
Hit:11 http://archive.ubuntu.com/ubuntu focal-security InRelease
Fetched 29.3 kB in 1s (19.7 kB/s)
Reading package lists... Done
$ sudo apt-get install -y nvidia-docker2

Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  libnvidia-container-tools libnvidia-container1 nvidia-container-runtime nvidia-container-toolkit
The following NEW packages will be installed:
  libnvidia-container-tools libnvidia-container1 nvidia-container-runtime nvidia-container-toolkit nvidia-docker2
0 upgraded, 5 newly installed, 0 to remove and 2 not upgraded.
Need to get 1,589 kB of archives.
After this operation, 4,857 kB of additional disk space will be used.
Get:1 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  libnvidia-container1 1.5.1-1 [69.1 kB]
Get:2 https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/amd64  libnvidia-container-tools 1.5.1-1 [21.3 kB]
Get:3 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  nvidia-container-toolkit 1.5.1-1 [716 kB]
Get:4 https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/amd64  nvidia-container-runtime 3.5.0-1 [777 kB]
Get:5 https://nvidia.github.io/nvidia-docker/ubuntu18.04/amd64  nvidia-docker2 2.6.0-1 [5,960 B]
Fetched 1,589 kB in 1s (1,222 kB/s)
Selecting previously unselected package libnvidia-container1:amd64.
(Reading database ... 135365 files and directories currently installed.)
Preparing to unpack .../libnvidia-container1_1.5.1-1_amd64.deb ...
Unpacking libnvidia-container1:amd64 (1.5.1-1) ...
Selecting previously unselected package libnvidia-container-tools.
Preparing to unpack .../libnvidia-container-tools_1.5.1-1_amd64.deb ...
Unpacking libnvidia-container-tools (1.5.1-1) ...
Selecting previously unselected package nvidia-container-toolkit.
Preparing to unpack .../nvidia-container-toolkit_1.5.1-1_amd64.deb ...
Unpacking nvidia-container-toolkit (1.5.1-1) ...
Selecting previously unselected package nvidia-container-runtime.
Preparing to unpack .../nvidia-container-runtime_3.5.0-1_amd64.deb ...
Unpacking nvidia-container-runtime (3.5.0-1) ...
Selecting previously unselected package nvidia-docker2.
Preparing to unpack .../nvidia-docker2_2.6.0-1_all.deb ...
Unpacking nvidia-docker2 (2.6.0-1) ...
Setting up libnvidia-container1:amd64 (1.5.1-1) ...
Setting up libnvidia-container-tools (1.5.1-1) ...
Setting up nvidia-container-toolkit (1.5.1-1) ...
Setting up nvidia-container-runtime (3.5.0-1) ...
Setting up nvidia-docker2 (2.6.0-1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.2) ...
  • Restart the Docker daemon
$ sudo systemctl restart docker
  • Test: run a CUDA container to verify that the setup works
$ sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
  • Output like the following means the installation succeeded
Unable to find image 'nvidia/cuda:11.0-base' locally
11.0-base: Pulling from nvidia/cuda
54ee1f796a1e: Pull complete
f7bfea53ad12: Pull complete
46d371e02073: Pull complete
b66c17bbf772: Pull complete
3642f1a6dfb3: Pull complete
e5ce55b8b4b9: Pull complete
155bc0332b0a: Pull complete
Digest: sha256:774ca3d612de15213102c2dbbba55df44dc5cf9870ca2be6c6e9c627fa63d67a
Status: Downloaded newer image for nvidia/cuda:11.0-base
Wed Nov  3 13:06:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.74       Driver Version: 470.74       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:01:00.0 Off |                    0 |
| N/A   27C    P0    33W / 250W |      9MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:41:00.0 Off |                    0 |
| N/A   29C    P0    37W / 250W |      9MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  Off  | 00000000:81:00.0 Off |                    0 |
| N/A   28C    P0    36W / 250W |      9MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
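
Reading the table by eye works for one server; for a scripted check, `nvidia-smi --query-gpu=name --format=csv,noheader` prints one line per GPU, which is easy to count. A minimal sketch, with `expect_gpu_count` as a made-up helper name and 3 being the number of A100s in the output above:

```shell
# Hypothetical helper: succeed only if stdin has exactly $1 non-empty lines,
# i.e. one line per GPU in the csv,noheader output of nvidia-smi.
expect_gpu_count() {
  count=$(grep -c . || true)   # grep -c exits non-zero when it counts 0 lines
  [ "$count" -eq "$1" ]
}

# Usage:
# sudo docker run --rm --gpus all nvidia/cuda:11.0-base \
#   nvidia-smi --query-gpu=name --format=csv,noheader | expect_gpu_count 3 \
#   && echo "all GPUs visible in the container"
```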

CUDA install

Remove existing CUDA

$ sudo rm -rf /usr/local/cuda*

Download the CUDA packages

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin

--2021-11-03 14:56:39--  https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.39.144
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.39.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 190 [application/octet-stream]
Saving to: ‘cuda-ubuntu2004.pin’

cuda-ubuntu2004.pin                         100%[===========================================================================================>]     190  --.-KB/s    in 0s

2021-11-03 14:56:39 (13.6 MB/s) - ‘cuda-ubuntu2004.pin’ saved [190/190]
$ sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ wget https://developer.download.nvidia.com/compute/cuda/11.5.0/local_installers/cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb

--2021-11-03 14:57:43--  https://developer.download.nvidia.com/compute/cuda/11.5.0/local_installers/cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb
Resolving developer.download.nvidia.com (developer.download.nvidia.com)... 152.199.39.144
Connecting to developer.download.nvidia.com (developer.download.nvidia.com)|152.199.39.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2602827612 (2.4G) [application/x-deb]
Saving to: ‘cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb’

cuda-repo-ubuntu2004-11-5-local_11.5.0-495. 100%[===========================================================================================>]   2.42G  85.0MB/s    in 30s

2021-11-03 14:58:12 (83.5 MB/s) - ‘cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb’ saved [2602827612/2602827612]

$ sudo dpkg -i cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb
Selecting previously unselected package cuda-repo-ubuntu2004-11-5-local.
(Reading database ... 135327 files and directories currently installed.)
Preparing to unpack cuda-repo-ubuntu2004-11-5-local_11.5.0-495.29.05-1_amd64.deb ...
Unpacking cuda-repo-ubuntu2004-11-5-local (11.5.0-495.29.05-1) ...
Setting up cuda-repo-ubuntu2004-11-5-local (11.5.0-495.29.05-1) ...

The public CUDA GPG key does not appear to be installed.
To install the key, run this command:
sudo apt-key add /var/cuda-repo-ubuntu2004-11-5-local/7fa2af80.pub
$ sudo apt-key add /var/cuda-repo-ubuntu2004-11-5-local/7fa2af80.pub

OK
$ sudo apt-get update
$ sudo apt-get -y install cuda

Troubleshooting

$ sudo apt-get -y install cuda
Reading package lists... Done
Building dependency tree
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cuda : Depends: cuda-11-5 (>= 11.5.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Clear the apt cache, purge the existing NVIDIA packages, and reinstall:

$ sudo apt clean
$ sudo apt update
$ sudo apt purge 'nvidia-*'
$ sudo apt autoremove
$ sudo apt install -y cuda

Failed to initialize NVML: Driver/library version mismatch

# sudo lsof /dev/nvidia* | awk '{if(NR>1) print $2}' | sudo xargs kill -9
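
Before killing processes, it can help to confirm that the loaded kernel module really differs from the driver installed on disk. The sketch below is an assumption about how to diagnose this, not something from the original setup: it compares `/sys/module/nvidia/version` (the module currently loaded) with `modinfo -F version nvidia` (the module on disk). `report_nvidia_mismatch` is a made-up helper name.

```shell
# Hypothetical helper: compare two driver version strings and report the result.
# $1: version of the loaded kernel module, $2: version installed on disk
report_nvidia_mismatch() {
  if [ "$1" = "$2" ]; then
    echo "match: $1"
  else
    echo "mismatch: loaded=$1 installed=$2"
    return 1
  fi
}

# Usage (assumes the nvidia module is loaded and modinfo is available):
# report_nvidia_mismatch "$(cat /sys/module/nvidia/version)" "$(modinfo -F version nvidia)"
```

When the two differ, the old module has to be unloaded, which is what freeing /dev/nvidia* as above prepares for; a reboot achieves the same.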

GPU monitoring tools

Conda installation
