The Basic Process of Kubernetes Troubleshooting

HY🥗 · September 18, 2024


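I finally started studying Kubernetes after putting it off again and again. I have been following the book 24단계 실습으로 정복하는 쿠버네티스 (roughly, "Mastering Kubernetes Through 24 Hands-On Steps"), and I have now moved past the part on using commands into the part on the basic process of Kubernetes troubleshooting, which I expect to look up often. I'm writing this blog post to note down the parts I need so I can come back to them later. Lee Jeong-hoon (이정훈), I'm enjoying the book. If you see this post, please don't come after me over copyright. I ordered a brand-new copy from Aladin (알라딘), not a used one.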

Given: a pod is created from a YAML file that references an image version that does not exist

╭─ ~/.kube ···································································································· 1 х  ubun/default01 ○
╰─ k apply -f https://raw.githubusercontent.com/wikibook/kubepractice/main/ch05/nginx-error-pod.yml
pod/nginx-19 created
╭─ ~/.kube ······································································································ ✔  ubun/default01 ○
╰─ k get pod -o wide
nginx      1/1     Running        0          15m   ubun   <none>           <none>
nginx-19   0/1     ErrImagePull   0          8s   ubun   <none>           <none>
  1. Use describe to find the cause
╰─ k describe pod nginx-19
Name:             nginx-19
Namespace:        default01
Priority:         0
Service Account:  default
Node:             <node-name>/<node-ip>
Start Time:       Wed, 18 Sep 2024 17:43:56 +0900
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:               <internal-ip>
  IP:  <internal-ip>
    Container ID:
    Image:          nginx:1.19.19
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Environment:    <none>
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dk5lk (ro)
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  103s                default-scheduler  Successfully assigned default01/nginx-19 to ubun
  Normal   BackOff    25s (x5 over 101s)  kubelet            Back-off pulling image "nginx:1.19.19"
  Warning  Failed     25s (x5 over 101s)  kubelet            Error: ImagePullBackOff
  Normal   Pulling    11s (x4 over 103s)  kubelet            Pulling image "nginx:1.19.19"
  Warning  Failed     10s (x4 over 101s)  kubelet            Failed to pull image "nginx:1.19.19": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/library/nginx:1.19.19": failed to resolve reference "docker.io/library/nginx:1.19.19": docker.io/library/nginx:1.19.19: not found
  Warning  Failed     10s (x4 over 101s)  kubelet            Error: ErrImagePull

The event message Failed to pull image "nginx:1.19.19": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/library/nginx:1.19.19" ... not found points to the cause: that image version does not exist.
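The describe output is long, and the only rows I really needed were the Warning events at the bottom. Below is a minimal sketch for pulling out just those rows. pod_warnings is a helper name I made up (it is not part of kubectl), and it assumes the output contains the usual Events: section header. I use kubectl here instead of my k alias because aliases are not expanded in scripts.

```shell
#!/usr/bin/env bash
# pod_warnings is a hypothetical helper, not a kubectl subcommand.
# It prints only the Warning rows from the Events section of
# `kubectl describe pod <name>` in the current namespace.
pod_warnings() {
  local pod="$1"
  kubectl describe pod "$pod" |
    awk '/^Events:/ { in_events = 1; next } in_events && $1 == "Warning"'
}
```

Calling pod_warnings nginx-19 should leave only the Failed rows, which is where the "not found" message lives.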

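  2. Use logs to check the logs inside the pod (shown here for the running nginx pod, since nginx-19 never started a container and has no application logs)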
╰─ k logs -f nginx
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2024/09/18 08:28:11 [notice] 1#1: using the "epoll" event method
2024/09/18 08:28:11 [notice] 1#1: nginx/1.27.1
2024/09/18 08:28:11 [notice] 1#1: built by gcc 12.2.0 (Debian 12.2.0-14)
2024/09/18 08:28:11 [notice] 1#1: OS: Linux 6.8.0-44-generic
2024/09/18 08:28:11 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2024/09/18 08:28:11 [notice] 1#1: start worker processes
2024/09/18 08:28:11 [notice] 1#1: start worker process 29
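Following a log with -f is fine for a quiet pod, but for a noisy one I would rather see only the last few lines. Here is a small sketch using the --tail and --previous flags of kubectl logs. The function names tail_logs and previous_logs are my own, and the default of 20 lines is an arbitrary assumption.

```shell
#!/usr/bin/env bash
# tail_logs and previous_logs are hypothetical helper names.

# Show only the last N log lines of a pod (N defaults to 20, an arbitrary choice).
tail_logs() {
  local pod="$1" lines="${2:-20}"
  kubectl logs "$pod" --tail="$lines"
}

# Show the last N log lines from the previous container instance,
# which helps when the container has restarted after a crash.
previous_logs() {
  local pod="$1" lines="${2:-20}"
  kubectl logs "$pod" --previous --tail="$lines"
}
```

For example, tail_logs nginx 5 would print the last five lines of the nginx pod's log.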
  3. Use get events to check events across the current namespace (add -A to see every namespace)
╰─ k get events
8m22s       Normal    Scheduled   pod/nginx-19   Successfully assigned default01/nginx-19 to ubun
6m50s       Normal    Pulling     pod/nginx-19   Pulling image "nginx:1.19.19"
6m49s       Warning   Failed      pod/nginx-19   Failed to pull image "nginx:1.19.19": rpc error: code = NotFound desc = failed to pull and unpack image "docker.io/library/nginx:1.19.19": failed to resolve reference "docker.io/library/nginx:1.19.19": docker.io/library/nginx:1.19.19: not found
6m49s       Warning   Failed      pod/nginx-19   Error: ErrImagePull
3m16s       Normal    BackOff     pod/nginx-19   Back-off pulling image "nginx:1.19.19"
6m36s       Warning   Failed      pod/nginx-19   Error: ImagePullBackOff
24m         Normal    Scheduled   pod/nginx      Successfully assigned default01/nginx to ubun
24m         Normal    Pulling     pod/nginx      Pulling image "nginx"
24m         Normal    Pulled      pod/nginx      Successfully pulled image "nginx" in 1.428s (1.428s including waiting). Image size: 67695038 bytes.
24m         Normal    Created     pod/nginx      Created container nginx
24m         Normal    Started     pod/nginx      Started container nginx
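The plain event list mixes Normal and Warning rows and is not sorted by time, so the failure is easy to miss. As a sketch, the events can be narrowed to warnings and sorted with a field selector. warning_events is a name I invented. It takes an optional namespace and falls back to all namespaces when none is given.

```shell
#!/usr/bin/env bash
# warning_events is a hypothetical helper name.
# With a namespace argument it lists Warning events in that namespace,
# otherwise it lists Warning events in all namespaces.
# Events are sorted by their last-seen timestamp.
warning_events() {
  if [ -n "$1" ]; then
    kubectl get events -n "$1" \
      --field-selector type=Warning --sort-by=.lastTimestamp
  else
    kubectl get events --all-namespaces \
      --field-selector type=Warning --sort-by=.lastTimestamp
  fi
}
```

In my setup that would be warning_events default01, since default01 is the namespace my context points at.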

Given: the host node's filesystem runs out of capacity

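When a single node serves dozens of pods, the usage of one particular pod can leave the whole host node short on CPU, memory, or disk. A typical case is a pod whose logs keep piling up until the node it is scheduled on runs out of disk space.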

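Using the sample YAML file that comes with the book, run a busybox deployment with 10 pods.
After logging in to a node and creating a file that is large relative to the free disk space, pods on that node are evicted and new pods are created to replace them.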

  1. Check the events with k get events. The shortage of ephemeral storage shows up here.
LAST SEEN   TYPE      REASON                OBJECT                          MESSAGE
6m13s       Normal    Scheduled             pod/busybox-68bdc48c69-4sdnf    Successfully assigned default01/busybox-68bdc48c69-4sdnf to ubun
6m12s       Normal    Pulling               pod/busybox-68bdc48c69-4sdnf    Pulling image "busybox"
6m7s        Normal    Pulled                pod/busybox-68bdc48c69-4sdnf    Successfully pulled image "busybox" in 4.927s (4.927s including waiting). Image size: 1851657 bytes.
6m7s        Normal    Created               pod/busybox-68bdc48c69-4sdnf    Created container busybox
6m7s        Normal    Started               pod/busybox-68bdc48c69-4sdnf    Started container busybox
6m14s       Normal    Scheduled             pod/busybox-68bdc48c69-bbqzp    Successfully assigned default01/busybox-68bdc48c69-bbqzp to ubun
6m12s       Normal    Pulling               pod/busybox-68bdc48c69-bbqzp    Pulling image "busybox"
6m7s        Normal    Pulled                pod/busybox-68bdc48c69-bbqzp    Successfully pulled image "busybox" in 4.926s (4.926s including waiting). Image size: 1851657 bytes.
6m7s        Normal    Created               pod/busybox-68bdc48c69-bbqzp    Created container busybox
6m7s        Normal    Started               pod/busybox-68bdc48c69-bbqzp    Started container busybox
2m2s        Warning   Evicted               pod/busybox-68bdc48c69-bbqzp    The node was low on resource: ephemeral-storage. Threshold quantity: 204194409, available: 449796Ki. Container busybox was using 707088Ki, request is 0, has larger consumption of ephemeral-storage.
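After an eviction like the one in the last event row, the evicted pods stay around in a failed state next to their replacements. A rough sketch for listing only those leftovers is below. evicted_pods is a hypothetical helper name, and it assumes the default kubectl get pods column layout, where STATUS is the third column.

```shell
#!/usr/bin/env bash
# evicted_pods is a hypothetical helper name.
# It asks for pods in the Failed phase, then keeps the header row and
# any row whose STATUS column (third column by default) reads "Evicted".
evicted_pods() {
  kubectl get pods --field-selector status.phase=Failed |
    awk 'NR == 1 || $3 == "Evicted"'
}
```

The awk filter matters because other failures, not only evictions, also end up in the Failed phase.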
  2. k describe node also shows it in the node's detailed messages
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                200m (20%)   0 (0%)
  memory             140Mi (14%)  170Mi (17%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  hugepages-32Mi     0 (0%)       0 (0%)
  hugepages-64Ki     0 (0%)       0 (0%)
  Type     Reason                Age                    From     Message
  ----     ------                ----                   ----     -------
  Warning  EvictionThresholdMet  4m26s (x2 over 4m28s)  kubelet  Attempting to reclaim ephemeral-storage
  Normal   NodeHasDiskPressure   4m21s                  kubelet  Node ubun status is now: NodeHasDiskPressure
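The NodeHasDiskPressure event corresponds to the DiskPressure condition on the node, so the same state can be read directly without scrolling through describe. This is a sketch under that assumption. node_disk_pressure is a name I made up, and ubun is just the node name from my own cluster.

```shell
#!/usr/bin/env bash
# node_disk_pressure is a hypothetical helper name.
# It prints the status of the node's DiskPressure condition
# (True, False, or Unknown) using a kubectl JSONPath filter.
node_disk_pressure() {
  local node="$1"
  kubectl get node "$node" \
    -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
}
```

Running node_disk_pressure ubun while the big file is still on disk should print True, and False again once the space is freed.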

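I feel like I am only getting a taste of it, and infrastructure is still hard every time. How much is a server engineer supposed to know? When I asked a similar question before, a senior colleague told me that when something goes wrong with our service, I should at least be able to judge whether it is a problem we have to solve ourselves or one to hand over to the infrastructure team. But a new kind of problem blows up every time, so I can never tell. Anyway, that's the end of this part.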

