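I finally started studying Kubernetes after putting it off for a long time. I have been working through the book 24단계 실습으로 정복하는 쿠버네티스, and I have now moved past the part on using commands and reached the part on the basic Kubernetes troubleshooting process, which I expect to look up often. I am writing this blog post to note down the parts I need so I can come back to them later. 이정훈, I am enjoying the book. If you see this post, please don't come after me for copyright. I ordered a new copy from Aladin, not a used one.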
╭─ ~/.kube ···································································································· 1 х ubun/default01 ○
╰─ k apply -f https://raw.githubusercontent.com/wikibook/kubepractice/main/ch05/nginx-error-pod.yml
pod/nginx-19 created
╭─ ~/.kube ······································································································ ✔ ubun/default01 ○
╰─ k get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 15m 10.42.0.33 ubun <none> <none>
nginx-19 0/1 ErrImagePull 0 8s 10.42.0.34 ubun <none> <none>
Finding the cause with describe
╰─ k describe pod nginx-19
Name: nginx-19
Namespace: default01
Priority: 0
Service Account: default
Node: <node-name>/<node-ip>
Start Time: Wed, 18 Sep 2024 17:43:56 +0900
Labels: <none>
Annotations: <none>
Status: Pending
IP: <internal-ip>
IPs:
IP: <internal-ip>
Containers:
nginx-pod:
Container ID:
Image: nginx:1.19.19
Image ID:
Port: <none>
Host Port: <none>
State: Waiting
Reason: ImagePullBackOff
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dk5lk (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-dk5lk:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 103s default-scheduler Successfully assigned default01/nginx-19 to ubun
Normal BackOff 25s (x5 over 101s) kubelet Back-off pulling image "nginx:1.19.19"
Warning Failed 25s (x5 over 101s) kubelet Error: ImagePullBackOff
Normal Pulling 11s (x4 over 103s) kubelet Pulling image "nginx:1.19.19"
Warning Failed 10s (x4 over 101s) kubelet Failed to pull image "nginx:1.19.19": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/library/nginx:1.19.19": failed to resolve reference "docker.io/library/nginx:1.19.19": docker.io/library/nginx:1.19.19: not found
Warning Failed 10s (x4 over 101s) kubelet Error: ErrImagePull
The message Failed to pull image "nginx:1.19.19": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/library/nginx:1.19.19" identifies the cause: that image version does not exist.
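Since I will probably be reading this kind of message again, here is a minimal Python sketch that pulls the image reference and the error code out of the pull-failure message from the describe output above. The helper names (parse_pull_failure, split_image) are my own, and the regular expression is only an assumption based on this one message; other container runtimes may word it differently.

```python
import re

# Sample kubelet event message, copied from the describe output above.
MESSAGE = (
    'Failed to pull image "nginx:1.19.19": rpc error: code = NotFound '
    'desc = failed to pull and unpack image "docker.io/library/nginx:1.19.19": '
    'failed to resolve reference "docker.io/library/nginx:1.19.19": '
    'docker.io/library/nginx:1.19.19: not found'
)

# Assumed message shape; only verified against the sample above.
PULL_FAILURE = re.compile(
    r'Failed to pull image "(?P<image>[^"]+)": rpc error: code = (?P<code>\w+)'
)


def parse_pull_failure(message):
    """Return (image, error_code), or None if the message has another shape."""
    match = PULL_FAILURE.search(message)
    if match is None:
        return None
    return match.group("image"), match.group("code")


def split_image(image):
    """Split 'name:tag' into (name, tag). Hypothetical helper; ignores digests."""
    name, sep, tag = image.rpartition(":")
    # No colon, or the colon belongs to a registry port such as localhost:5000/nginx.
    if not sep or "/" in tag:
        return image, None
    return name, tag


if __name__ == "__main__":
    image, code = parse_pull_failure(MESSAGE)
    print(split_image(image), code)
```

A NotFound code together with an explicit tag points at a wrong or nonexistent tag rather than at a network or authentication problem, which matches what happened here.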
Checking the logs inside a pod with logs
╰─ k logs -f nginx
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2024/09/18 08:28:11 [notice] 1#1: using the "epoll" event method
2024/09/18 08:28:11 [notice] 1#1: nginx/1.27.1
2024/09/18 08:28:11 [notice] 1#1: built by gcc 12.2.0 (Debian 12.2.0-14)
2024/09/18 08:28:11 [notice] 1#1: OS: Linux 6.8.0-44-generic
2024/09/18 08:28:11 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2024/09/18 08:28:11 [notice] 1#1: start worker processes
2024/09/18 08:28:11 [notice] 1#1: start worker process 29
Checking events across the cluster with get events (without -A, only the current namespace is listed)
╰─ k get events
LAST SEEN TYPE REASON OBJECT MESSAGE
8m22s Normal Scheduled pod/nginx-19 Successfully assigned default01/nginx-19 to ubun
6m50s Normal Pulling pod/nginx-19 Pulling image "nginx:1.19.19"
6m49s Warning Failed pod/nginx-19 Failed to pull image "nginx:1.19.19": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/library/nginx:1.19.19": failed to resolve reference "docker.io/library/nginx:1.19.19": docker.io/library/nginx:1.19.19: not found
6m49s Warning Failed pod/nginx-19 Error: ErrImagePull
3m16s Normal BackOff pod/nginx-19 Back-off pulling image "nginx:1.19.19"
6m36s Warning Failed pod/nginx-19 Error: ImagePullBackOff
24m Normal Scheduled pod/nginx Successfully assigned default01/nginx to ubun
24m Normal Pulling pod/nginx Pulling image "nginx"
24m Normal Pulled pod/nginx Successfully pulled image "nginx" in 1.428s (1.428s including waiting). Image size: 67695038 bytes.
24m Normal Created pod/nginx Created container nginx
24m Normal Started pod/nginx Started container nginx
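The event list gets long quickly, so here is a rough Python sketch that keeps only the Warning rows for one object. It assumes the plain-text column layout shown above (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE) and uses a shortened copy of that output as sample data; Event, parse_events, and warnings_for are names I made up for the example.

```python
from collections import namedtuple

Event = namedtuple("Event", "last_seen type reason object message")

# Shortened copy of the `k get events` output above.
SAMPLE = """\
LAST SEEN TYPE REASON OBJECT MESSAGE
8m22s Normal Scheduled pod/nginx-19 Successfully assigned default01/nginx-19 to ubun
6m50s Normal Pulling pod/nginx-19 Pulling image "nginx:1.19.19"
6m49s Warning Failed pod/nginx-19 Error: ErrImagePull
6m36s Warning Failed pod/nginx-19 Error: ImagePullBackOff
24m Normal Started pod/nginx Started container nginx
"""


def parse_events(text):
    """Parse the plain-text table into Event tuples, skipping the header row."""
    events = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("LAST SEEN"):
            continue
        # The first four columns contain no spaces; the rest is the message.
        fields = line.split(None, 4)
        if len(fields) == 5:
            events.append(Event(*fields))
    return events


def warnings_for(events, obj):
    """Return only the Warning events that belong to the given object."""
    return [e for e in events if e.type == "Warning" and e.object == obj]


if __name__ == "__main__":
    for event in warnings_for(parse_events(SAMPLE), "pod/nginx-19"):
        print(event.reason, event.message)
```

For a quick look on the command line, k get events --field-selector type=Warning should give a similar filter without any script.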
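When a single node serves dozens of pods, the usage of one particular pod can leave the host node short of CPU, memory, or disk capacity as a whole. A typical case is a pod whose logs keep piling up until the node it is assigned to runs out of disk space.
Using the sample YAML file provided, I run a busybox deployment with 10 pods.
After I connected to one node and created a file that was large relative to the free disk space, new pods were created.
Checking the cluster events with k get events shows that ephemeral storage ran short.
LAST SEEN TYPE REASON OBJECT MESSAGE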
6m13s Normal Scheduled pod/busybox-68bdc48c69-4sdnf Successfully assigned default01/busybox-68bdc48c69-4sdnf to ubun
6m12s Normal Pulling pod/busybox-68bdc48c69-4sdnf Pulling image "busybox"
6m7s Normal Pulled pod/busybox-68bdc48c69-4sdnf Successfully pulled image "busybox" in 4.927s (4.927s including waiting). Image size: 1851657 bytes.
6m7s Normal Created pod/busybox-68bdc48c69-4sdnf Created container busybox
6m7s Normal Started pod/busybox-68bdc48c69-4sdnf Started container busybox
6m14s Normal Scheduled pod/busybox-68bdc48c69-bbqzp Successfully assigned default01/busybox-68bdc48c69-bbqzp to ubun
6m12s Normal Pulling pod/busybox-68bdc48c69-bbqzp Pulling image "busybox"
6m7s Normal Pulled pod/busybox-68bdc48c69-bbqzp Successfully pulled image "busybox" in 4.926s (4.926s including waiting). Image size: 1851657 bytes.
6m7s Normal Created pod/busybox-68bdc48c69-bbqzp Created container busybox
6m7s Normal Started pod/busybox-68bdc48c69-bbqzp Started container busybox
2m2s Warning Evicted pod/busybox-68bdc48c69-bbqzp The node was low on resource: ephemeral-storage. Threshold quantity: 204194409, available: 449796Ki. Container busybox was using 707088Ki, request is 0, has larger consumption of ephemeral-storage.
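The Evicted message packs several facts into one sentence, so here is a small Python sketch that picks out the resource, the container, and how much the container was using. The pattern is an assumption based only on the single message above (in particular that usage is reported with a Ki suffix), and parse_eviction is a name I chose for the example.

```python
import re

# Eviction message copied from the events output above.
MESSAGE = (
    "The node was low on resource: ephemeral-storage. "
    "Threshold quantity: 204194409, available: 449796Ki. "
    "Container busybox was using 707088Ki, request is 0, "
    "has larger consumption of ephemeral-storage."
)

# Assumed message shape; only verified against the sample above.
EVICTION = re.compile(
    r"low on resource: (?P<resource>[\w-]+)\."
    r".*?Container (?P<container>\S+) was using (?P<used_ki>\d+)Ki"
)


def parse_eviction(message):
    """Return a dict with resource, container, and usage, or None if no match."""
    match = EVICTION.search(message)
    if match is None:
        return None
    used_ki = int(match.group("used_ki"))
    return {
        "resource": match.group("resource"),
        "container": match.group("container"),
        "used_ki": used_ki,
        "used_mib": used_ki / 1024,  # 1 Mi = 1024 Ki
    }


if __name__ == "__main__":
    info = parse_eviction(MESSAGE)
    print(f"{info['container']} used {info['used_mib']:.1f} MiB of {info['resource']}")
```

The "request is 0" part is worth noticing too: the busybox container declared no ephemeral-storage request, so all of its usage counted as consumption beyond its request.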
The detailed messages for the node from k describe node confirm it as well.
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 200m (20%) 0 (0%)
memory 140Mi (14%) 170Mi (17%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
hugepages-32Mi 0 (0%) 0 (0%)
hugepages-64Ki 0 (0%) 0 (0%)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning EvictionThresholdMet 4m26s (x2 over 4m28s) kubelet Attempting to reclaim ephemeral-storage
Normal NodeHasDiskPressure 4m21s kubelet Node ubun status is now: NodeHasDiskPressure
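I feel like I am only getting a taste of it, but infrastructure is always hard. How much should a server engineer know? When I asked a similar question before, a senior colleague told me I should at least be able to judge whether a problem with our service is one we have to solve ourselves or one to raise with the infrastructure team. But a new kind of problem comes up every time, so I still can't tell. Anyway, that's the end of this part.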