I previously built and tested a Kubernetes cluster with NVIDIA V100 GPUs; since I need to set one up again, this post summarizes what has changed and walks through a fresh test.
Previously I followed the installation guide at https://nvidia.github.io/nvidia-docker, but that page now points to https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html.
Let's install it by following the current NVIDIA Container Toolkit documentation.
Check that the GPU is visible to the system:
lshw -C display
Search the installable nvidia-driver packages with the apt search command and install the latest version.
apt search nvidia-driver
apt update
apt upgrade
apt install nvidia-driver-510 nvidia-dkms-510
# Reboot the system so that the GPU is recognized.
sudo reboot
nvidia-smi
4.1. Installing the NVIDIA Container Toolkit
Install the NVIDIA Container Toolkit as the root user.
$ curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
&& apt-get update
$ apt-get install -y nvidia-container-toolkit
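A quick, optional sanity check that the toolkit is installed (the --version flag of nvidia-ctk is assumed here; the package listing alone is enough):
dpkg -l | grep nvidia-container
nvidia-ctk --version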
On RPM-based distributions (e.g. CentOS/RHEL), create the repository definition instead:
cat <<'EOF' > /etc/yum.repos.d/nvidia-container-toolkit.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
[nvidia-container-toolkit-experimental]
name=nvidia-container-toolkit-experimental
baseurl=https://nvidia.github.io/libnvidia-container/experimental/rpm/$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=0
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
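With the repository file in place, install the toolkit using the distribution's package manager, roughly:
yum clean expire-cache
yum install -y nvidia-container-toolkit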
4.2. Adding nvidia-container-runtime to the containerd configuration
The nvidia-ctk command adds the required configuration conveniently.
nvidia-ctk runtime configure --runtime=containerd
# Optionally review the entry added to the containerd config
vi /etc/containerd/config.toml
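For reference, the runtime entry that nvidia-ctk adds typically looks roughly like the following; the exact fields can differ depending on the containerd and toolkit versions:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"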
systemctl restart containerd
4.3. Verifying GPU access from containerd
$ sudo ctr image pull docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04
$ sudo ctr run --rm -t \
--runc-binary=/usr/bin/nvidia-container-runtime \
--env NVIDIA_VISIBLE_DEVICES=all \
docker.io/nvidia/cuda:11.0.3-base-ubuntu20.04 \
cuda-11.0.3-base-ubuntu20.04 nvidia-smi
4.4. Installing the NVIDIA device plugin DaemonSet in Kubernetes
The Kubernetes NVIDIA device plugin exposes the number of GPUs on each node, keeps track of GPU health, and lets GPU-enabled containers run in the cluster.
Following https://github.com/NVIDIA/k8s-device-plugin, you can install it simply by deploying a DaemonSet, or install it via Helm for more customization.
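For reference, a Helm-based install would look roughly like the following, using the chart published in that repository (the release name nvdp is arbitrary); this post uses the plain DaemonSet manifest instead:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system --version 0.12.3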
First, label the GPU node(s) as follows.
kubectl label nodes [name of the node with a GPU] gpu=nvidia
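Confirm that the label was applied:
kubectl get nodes -L gpu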
Deploy the NVIDIA device plugin DaemonSet with the following manifest.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      # Added to the upstream manifest: schedule only on nodes labeled gpu=nvidia
      nodeSelector:
        gpu: nvidia
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.12.3
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
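Assuming the manifest above is saved as nvidia-device-plugin.yml (the file name is just an example), apply it and check that the plugin pod is running; once the plugin initializes successfully, the node should also advertise the nvidia.com/gpu resource:
kubectl apply -f nvidia-device-plugin.yml
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds
kubectl describe node [name of the node with a GPU] | grep -i "nvidia.com/gpu"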
Next, create a RuntimeClass (nvidia-runtime-class.yml) that references the nvidia handler configured in containerd:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  # The name the RuntimeClass will be referenced by.
  # RuntimeClass is a non-namespaced resource.
  name: "nvidia"
# The name of the corresponding CRI configuration
handler: "nvidia"
kubectl apply -f nvidia-runtime-class.yml
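Confirm that the RuntimeClass was created:
kubectl get runtimeclass nvidia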
4.5. Testing
Create a test pod manifest (test-gpu-pod.yml) that runs on the GPU node using the nvidia runtime:
apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  restartPolicy: Never
  runtimeClassName: "nvidia"   # use the nvidia runtime
  nodeSelector:
    gpu: nvidia                # deploy only to nodes with a GPU
  containers:
  - name: gpu
    image: "nvidia/cuda:11.4.1-base-ubuntu20.04"
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
kubectl apply -f test-gpu-pod.yml
kubectl exec -it gpu -- nvidia-smi
Wed Sep 28 03:03:22 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:03:00.0 Off | N/A |
| N/A 39C P8 N/A / N/A | 4MiB / 4096MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
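The test pod above gets GPU access purely through the nvidia RuntimeClass. If you also want the scheduler to track GPU capacity through the device plugin, the GPU can be requested explicitly via the nvidia.com/gpu resource; a minimal sketch (the pod name is arbitrary):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-limit-test         # example name
spec:
  restartPolicy: Never
  runtimeClassName: "nvidia"
  nodeSelector:
    gpu: nvidia
  containers:
  - name: cuda
    image: "nvidia/cuda:11.4.1-base-ubuntu20.04"
    command: [ "nvidia-smi" ]
    resources:
      limits:
        nvidia.com/gpu: 1      # resource advertised by the device plugin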