ImagePullBackOff after an unexpected node restart

sanggyun bak · September 14, 2023

Kubernetes

Problem

The hypervm server went down, which shut down the virtual machines running on it (master1, worker1, worker2).
As a result, not all pods came back up, and some were stuck in ImagePullBackOff.
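
To see which pods are affected, the pod list can be filtered by status. A minimal sketch, assuming the default column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status, ...); the function name `pods_in_pull_backoff` is made up for illustration.

```shell
# Hypothetical helper: read `kubectl get pods -A --no-headers` output on stdin
# and print namespace/name for every pod whose STATUS column (4th field)
# ends in ImagePullBackOff or ErrImagePull (this also matches Init:... states).
pods_in_pull_backoff() {
    awk '$4 ~ /(ImagePullBackOff|ErrImagePull)$/ { print $1 "/" $2 }'
}

# Usage on a live cluster:
#   kubectl get pods -A --no-headers | pods_in_pull_backoff
```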

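Root cause

Checking the running containers with podman ps showed that the image registry was not running.

The registry is registered as a systemd service, and it appears the service did not restart automatically after the node rebooted, so the registry never came up.
The podman service was in the inactive state: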

systemctl status podman

● podman.service - Podman API Service
   Loaded: loaded (/usr/lib/systemd/system/podman.service; static; vendor preset: disabled)
   Active: inactive (dead) since Thu 2023-09-14 08:49:04 KST; 21s ago
     Docs: man:podman-system-service(1)
  Process: 19988 ExecStart=/usr/bin/podman $LOGGING system service (code=exited, status=0/SUCCESS)
 Main PID: 19988 (code=exited, status=0/SUCCESS)

Sep 14 08:48:59 master1 podman[19988]: time="2023-09-14T08:48:59+09:00" level=warning msg="Error initializing configured OC>
Sep 14 08:48:59 master1 podman[19988]: time="2023-09-14T08:48:59+09:00" level=warning msg="Error initializing configured OC>
Sep 14 08:48:59 master1 podman[19988]: time="2023-09-14T08:48:59+09:00" level=info msg="Found CNI network cni0 (type=calico>
Sep 14 08:48:59 master1 podman[19988]: time="2023-09-14T08:48:59+09:00" level=info msg="Found CNI network podman (type=brid>
Sep 14 08:48:59 master1 podman[19988]: time="2023-09-14T08:48:59+09:00" level=warning msg="Default CNI network name podman >
Sep 14 08:48:59 master1 podman[19988]: time="2023-09-14T08:48:59+09:00" level=info msg="Setting parallel job count to 13"
Sep 14 08:48:59 master1 podman[19988]: time="2023-09-14T08:48:59+09:00" level=info msg="using systemd socket activation to >
Sep 14 08:48:59 master1 podman[19988]: time="2023-09-14T08:48:59+09:00" level=info msg="using API endpoint: ''"
Sep 14 08:48:59 master1 podman[19988]: time="2023-09-14T08:48:59+09:00" level=info msg="API server listening on \"/run/podm>
Sep 14 08:48:59 master1 systemd[1]: Started Podman API Service.

The crun and kata container runtimes were not installed, so podman did not start.

After fixing the podman startup problem, running systemctl start tmaxcloud produced an error.

Checked with journalctl -xeu tmaxcloud:

-- Unit tmaxcloud.service has begun starting up.
Sep 14 08:32:10 master1 podman[1856]: time="2023-09-14T08:32:10+09:00" level=error msg="Error adding network: failed to allocate for range 0: 10.88.0.13 has been allocated to 9c29c2ce40459ee7436e9eb09b8486a1db26d98ac1db6c88fb5472128b3af917, duplicate allocation is not allowed"
Sep 14 08:32:10 master1 podman[1856]: time="2023-09-14T08:32:10+09:00" level=error msg="Error while adding pod to CNI network \"podman\": failed to allocate for range 0: 10.88.0.13 has been allocated to 9c29c2ce40459ee7436e9eb09b8486a1db26d98ac1db6c88fb5472128b3af917, duplicate allocation is not allowed"
Sep 14 08:32:10 master1 podman[1856]: Error: unable to start container "9c29c2ce40459ee7436e9eb09b8486a1db26d98ac1db6c88fb5472128b3af917": error configuring network namespace for container 9c29c2ce40459ee7436e9eb09b8486a1db26d98ac1db6c88fb5472128b3af917: failed to allocate for range 0: 10.88.0.13 has been allocated to 9c29c2ce40459ee7436e9eb09b8486a1db26d98ac1db6c88fb5472128b3af917, duplicate allocation is not allowed
Sep 14 08:32:10 master1 systemd[1]: tmaxcloud.service: Control process exited, code=exited status=125
Sep 14 08:32:10 master1 podman[1943]: 9c29c2ce40459ee7436e9eb09b8486a1db26d98ac1db6c88fb5472128b3af917
Sep 14 08:32:10 master1 systemd[1]: tmaxcloud.service: Failed with result 'exit-code'.
Sep 14 08:32:10 master1 systemd[1]: Failed to start Podman container-9c29c2ce40459ee7436e9eb09b8486a1db26d98ac1db6c88fb5472128b3af917.service.
-- Subject: Unit tmaxcloud.service has failed

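That IP (10.88.0.13) is the address the image registry container had before the node went down.
On a clean restart, the container's IP reservation file under /var/lib/cni/networks/podman/ is deleted and then recreated, but because the container was terminated abnormally the file was never removed.
As a result, starting the image registry fails with an error saying the IP is already allocated.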
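
The leftover reservation can be located before deleting anything. The sketch below assumes the host-local IPAM layout as I understand it: one file per IPv4 address in the network directory, with the owning container ID on the first line. The function name `stale_cni_reservations` is hypothetical, and it only compares against the IDs passed in, so review the output before removing files.

```shell
# Hypothetical helper: print reservation files whose owning container ID is
# not in the list of running container IDs.
# $1 = reservation directory, remaining args = IDs of running containers.
stale_cni_reservations() {
    dir=$1
    shift
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case "${f##*/}" in
            *[!0-9.]*) continue ;;  # skip non-IP files such as lock or last_reserved_ip.0
        esac
        # First line holds the container ID; strip a trailing carriage return if present.
        owner=$(head -n 1 "$f" | tr -d '\r')
        stale=yes
        for id in "$@"; do
            if [ "$id" = "$owner" ]; then
                stale=no
            fi
        done
        if [ "$stale" = yes ]; then
            echo "$f"
        fi
    done
    return 0
}

# Usage (full IDs are needed for the comparison):
#   stale_cni_reservations /var/lib/cni/networks/podman $(podman ps -q --no-trunc)
```

In this incident the registry container was not running, so its own ID would not appear in the running list and 10.88.0.13 would be reported.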

Resolution

Set the podman container runtime

# vi /etc/containers/containers.conf

[engine]
runtime="runc"

systemctl start podman
- Start podman
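
Before choosing the value for `runtime` in containers.conf, it can help to confirm which runtime binaries exist on the node. A rough sketch, assuming the binaries are on PATH (podman itself searches the paths listed under `[engine.runtimes]`); the helper name and the `kata-runtime` binary name are assumptions.

```shell
# Hypothetical helper: print the first candidate that resolves to a command on PATH.
# Returns non-zero if none of the candidates is found.
first_available_runtime() {
    for rt in "$@"; do
        if command -v "$rt" >/dev/null 2>&1; then
            echo "$rt"
            return 0
        fi
    done
    return 1
}

# Usage:
#   first_available_runtime runc crun kata-runtime
```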
Delete the image registry's IP file
```
rm /var/lib/cni/networks/podman/10.88.0.13
```
- After deleting the file, confirm that the image registry is running normally
```
systemctl status tmaxcloud
podman ps
```
Restart all pods showing the ImagePullBackOff error
- Restart all Deployment-managed pods in a given namespace
```
kubectl -n {NAMESPACE} rollout restart deploy
```
