Creating the pods from a resource definition file takes way longer than expected.
- In other words, something is off in one of kube-apiserver, kube-controller-manager, or kube-scheduler.
root@<kube-master>:~# sudo crictl ps
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID POD
ce17c216c9611 4d9d9de55f196 9 minutes ago Running kube-controller-manager 4025 205df5e6622f2 kube-controller-manager-k8s-master
3100343af98d5 309c26d006295 21 minutes ago Running kube-scheduler 3943 5c90d60b8bfad kube-scheduler-k8s-master
2c133c613c502 eeb80ea665767 2 days ago Running kube-apiserver 215 d62617dada1c6 kube-apiserver-k8s-master
7711053eb2af1 73deb9a3f7025 2 days ago Running etcd 26 7533381053725 etcd-k8s-master
4f21e9a480e21 ead0a4a53df89 5 days ago Running coredns 3 a38e3365acb5d coredns-5dd5756b68-928b2
be5c2ee18d477 ead0a4a53df89 5 days ago Running coredns 3 f2d88132e9736 coredns-5dd5756b68-4jc6k
57a0b0e6877a5 4e622d4011f55 5 days ago Running speaker 2 ba2339113c7da speaker-zl5wd
a067721b89152 f9c73fde068fd 5 days ago Running kube-flannel 3 777fd25e32860 kube-flannel-ds-zcdjh
6d08b78d7fe2b 123aa721f941b 5 days ago Running kube-proxy 3 8a36bdfc17dc9 kube-proxy-4g256
root@<kube-master>:~# sudo crictl logs 3100343af98d5
I0808 00:00:12.354066 1 serving.go:348] Generated self-signed cert in-memory
I0808 00:00:13.068860 1 server.go:154] "Starting Kubernetes Scheduler" version="v1.28.7"
I0808 00:00:13.068950 1 server.go:156] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
I0808 00:00:13.073182 1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0808 00:00:13.073200 1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0808 00:00:13.073205 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0808 00:00:13.073224 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0808 00:00:13.073248 1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::client-ca-file"
I0808 00:00:13.073346 1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0808 00:00:13.074034 1 secure_serving.go:213] Serving securely on 127.0.0.1:10259
I0808 00:00:13.074090 1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0808 00:00:13.173402 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0808 00:00:13.173493 1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0808 00:00:13.173581 1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0808 00:00:13.174565 1 leaderelection.go:250] attempting to acquire leader lease kube-system/kube-scheduler...
I0808 00:00:29.617546 1 leaderelection.go:260] successfully acquired lease kube-system/kube-scheduler
E0808 00:08:24.871709 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-scheduler: Get "https://<IP>:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0808 00:08:51.423650 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-scheduler: Get "https://<IP>:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0808 00:18:31.436091 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-scheduler: Get "https://<IP>:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0808 00:20:09.799633 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-scheduler: Get "https://<IP>:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
- Wow! The network request timed out while fetching the resource lock for leader election!
- So this could be a communication problem with the API server, or a network configuration issue!
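- As a quick sanity check on the API server itself (just a sketch; <IP>:6443 is the control-plane endpoint from the logs above, and the verbose /livez output includes an etcd check, which turns out to matter later):
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'
curl -k https://<IP>:6443/healthz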
...
...
17935, status: Complete"
E0808 <TIMESTAMP> 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: Get "https://<IP>:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
- Wow! The controller-manager is hitting the exact same leader-election error!
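- The kubelet on the master is what runs these static pods, so its service status is worth a look (the output below is presumably from the command here):
sudo systemctl status kubelet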
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Mon 2024-08-05 23:56:13 UTC; 2 days ago
Docs: https://kubernetes.io/docs/home/
Main PID: 1453132 (kubelet)
Tasks: 15 (limit: 38334)
Memory: 55.2M
CPU: 59min 49.062s
CGroup: /system.slice/kubelet.service
└─1453132 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime-endpoint=unix:///var/run/co>
Aug 08 00:10:43 k8s-master kubelet[1453132]: E0808 00:10:43.476067 1453132 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0>
Aug 08 00:10:54 k8s-master kubelet[1453132]: I0808 00:10:54.473747 1453132 scope.go:117] "RemoveContainer" containerID="82d4ee1640dbc9a59035f240093f4c24402545914ae6f416e27e4795ec0ea690"
Aug 08 00:10:54 k8s-master kubelet[1453132]: E0808 00:10:54.474691 1453132 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0>
Aug 08 00:11:08 k8s-master kubelet[1453132]: I0808 00:11:08.473763 1453132 scope.go:117] "RemoveContainer" containerID="82d4ee1640dbc9a59035f240093f4c24402545914ae6f416e27e4795ec0ea690"
Aug 08 00:11:08 k8s-master kubelet[1453132]: E0808 00:11:08.474199 1453132 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0>
Aug 08 00:11:22 k8s-master kubelet[1453132]: I0808 00:11:22.473723 1453132 scope.go:117] "RemoveContainer" containerID="82d4ee1640dbc9a59035f240093f4c24402545914ae6f416e27e4795ec0ea690"
Aug 08 00:11:22 k8s-master kubelet[1453132]: E0808 00:11:22.474200 1453132 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0>
Aug 08 00:11:34 k8s-master kubelet[1453132]: I0808 00:11:34.473047 1453132 scope.go:117] "RemoveContainer" containerID="82d4ee1640dbc9a59035f240093f4c24402545914ae6f416e27e4795ec0ea690"
Aug 08 00:11:34 k8s-master kubelet[1453132]: E0808 00:11:34.473496 1453132 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"kube-controller-manager\" with CrashLoopBackOff: \"back-off 5m0>
Aug 08 00:11:48 k8s-master kubelet[1453132]: I0808 00:11:48.473671 1453132 scope.go:117] "RemoveContainer" containerID="82d4ee1640dbc9a59035f240093f4c24402545914ae6f416e27e4795ec0ea690"
- ?????????????????????????????
- Wait, the controller-manager pod has been in CrashLoopBackOff this whole time?!
Well, that explains why it kept restarting and throwing errors.
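- Let's pull the controller-manager's own logs (presumably with crictl, the same way as the scheduler's above; -a also lists exited containers, which matters here since it keeps crashing):
sudo crictl ps -a --name kube-controller-manager -q
sudo crictl logs <container-id>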
I0808 00:32:51.292557 1 shared_informer.go:318] Caches are synced for bootstrap_signer
I0808 00:32:51.342206 1 shared_informer.go:318] Caches are synced for persistent volume
I0808 00:32:51.684572 1 shared_informer.go:318] Caches are synced for garbage collector
I0808 00:32:51.684623 1 garbagecollector.go:166] "All resource monitors have synced. Proceeding to collect garbage"
I0808 00:32:51.689698 1 shared_informer.go:318] Caches are synced for garbage collector
E0808 00:33:07.745821 1 leaderelection.go:332] error retrieving resource lock kube-system/kube-controller-manager: the server was unable to return a response in the time allotted, but may still be processing the request (get leases.coordination.k8s.io kube-controller-manager)
E0808 00:33:12.745256 1 leaderelection.go:369] Failed to update lock: Put "https://<IP>:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-controller-manager?timeout=5s": context deadline exceeded
I0808 00:33:12.745316 1 leaderelection.go:285] failed to renew lease kube-system/kube-controller-manager: timed out waiting for the condition
E0808 00:33:12.745447 1 controllermanager.go:302] "leaderelection lost"
root@<kube-master>:~#
root@<kube-master>:~# k get po -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-5dd5756b68-4jc6k 1/1 Running 3 (5d15h ago) 166d
coredns-5dd5756b68-928b2 1/1 Running 3 (5d15h ago) 166d
etcd-k8s-master 1/1 Running 26 166d
kube-apiserver-k8s-master 1/1 Running 215 (2d ago) 153d
kube-controller-manager-k8s-master 0/1 CrashLoopBackOff 4026 (22s ago) 166d
kube-proxy-4g256 1/1 Running 3 (5d15h ago) 166d
kube-proxy-fswcm 1/1 Running 6 (5d15h ago) 166d
kube-proxy-wvx9d 1/1 Running 4 (5d15h ago) 166d
kube-scheduler-k8s-master 1/1 Running 3945 (72s ago) 166d
metrics-server-5f8c4dcf65-phhlz 1/1 Running 3 (5d15h ago) 92d
Ha, there it is!
The reason kube-controller-manager keeps going into CrashLoopBackOff is
------------------------
that leader election keeps failing.
What we still haven't figured out is why leader election fails
and why those network requests keep timing out.
------------------------
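- The timeouts all point at the API server, so its logs come next (presumably via crictl again; 2c133c613c502 is the kube-apiserver container from the crictl ps output above):
sudo crictl logs 2c133c613c502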
I0808 00:40:08.854139 1 trace.go:236] Trace[1126528139]: "Get" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:70118753-a279-4795-b417-170bc8b02bb2,client:192.168.3.227,protocol:HTTP/2.0,resource:persistentvolumeclaims,scope:resource,url:/api/v1/namespaces/airflow/persistentvolumeclaims/airflow-dags,user-agent:kubelet/v1.28.2 (linux/amd64) kubernetes/89a4ea3,verb:GET (08-Aug-2024 00:40:06.713) (total time: 2140ms):
Trace[1126528139]: ---"About to write a response" 2140ms (00:40:08.854)
Trace[1126528139]: [2.140851316s] [2.140851316s] END
I0808 00:40:08.854460 1 trace.go:236] Trace[663836034]: "Get" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:c5c08a9e-bc70-4094-afe1-81d3940f363c,client:192.168.3.226,protocol:HTTP/2.0,resource:persistentvolumeclaims,scope:resource,url:/api/v1/namespaces/airflow/persistentvolumeclaims/airflow-logs,user-agent:kubelet/v1.28.2 (linux/amd64) kubernetes/89a4ea3,verb:GET (08-Aug-2024 00:40:06.668) (total time: 2185ms):
Trace[663836034]: ---"About to write a response" 2185ms (00:40:08.854)
Trace[663836034]: [2.185926705s] [2.185926705s] END
I0808 00:40:08.863238 1 trace.go:236] Trace[1606179250]: "Create" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:772026c0-7a52-4e7b-992d-4d4edc48fef6,client:192.168.3.226,protocol:HTTP/2.0,resource:serviceaccounts,scope:resource,url:/api/v1/namespaces/airflow/serviceaccounts/airflow-worker/token,user-agent:kubelet/v1.28.2 (linux/amd64) kubernetes/89a4ea3,verb:POST (08-Aug-2024 00:40:06.769) (total time: 2093ms):
Trace[1606179250]: ---"Write to database call succeeded" len:186 2093ms (00:40:08.862)
Trace[1606179250]: [2.093999119s] [2.093999119s] END
I0808 00:40:09.842708 1 trace.go:236] Trace[1353114645]: "Patch" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:e3dc5af4-3bf3-45d4-bcd5-049af8d5a42d,client:192.168.3.225,protocol:HTTP/2.0,resource:events,scope:resource,url:/api/v1/namespaces/kube-system/events/kube-apiserver-k8s-master.17e8faa4c17c8d8e,user-agent:kubelet/v1.28.2 (linux/amd64) kubernetes/89a4ea3,verb:PATCH (08-Aug-2024 00:40:08.858) (total time: 984ms):
Trace[1353114645]: ["GuaranteedUpdate etcd3" audit-id:e3dc5af4-3bf3-45d4-bcd5-049af8d5a42d,key:/events/kube-system/kube-apiserver-k8s-master.17e8faa4c17c8d8e,type:*core.Event,resource:events 984ms (00:40:08.858)
Trace[1353114645]: ---"Txn call completed" 962ms (00:40:09.842)]
Trace[1353114645]: ---"Object stored in database" 983ms (00:40:09.842)
Trace[1353114645]: [984.603692ms] [984.603692ms] END
I0808 00:40:09.842904 1 trace.go:236] Trace[899532860]: "Get" accept:application/vnd.kubernetes.protobuf,application/json,audit-id:07bff618-2875-4793-9768-d12f8a618d95,client:192.168.3.227,protocol:HTTP/2.0,resource:persistentvolumes,scope:resource,url:/api/v1/persistentvolumes/airflow-logs-pv,user-agent:kubelet/v1.28.2 (linux/amd64) kubernetes/89a4ea3,verb:GET (08-Aug-2024 00:40:08.882) (total time: 960ms):
Trace[899532860]: ---"About to write a response" 960ms (00:40:09.842)
Trace[899532860]: [960.4954ms] [960.4954ms] END
- Certain requests are taking far too long for kube-apiserver to process.
The context deadline exceeded and timed out waiting for the condition errors from the controller-manager are happening because kube-apiserver isn't getting responses back from etcd in time.
------------------------
In other words, both etcd and the api-server need to be checked.
------------------------
root@<kube-master>:~# etcdctl --endpoints=http://127.0.0.1:2379 cluster-health
cluster may be unhealthy: failed to list members
Error: client: etcd cluster is unavailable or misconfigured; error #0: EOF
error #0: EOF
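- The v2-style cluster-health subcommand gets nothing back (EOF); with the v3 API the equivalent checks would look roughly like this (a sketch, assuming the same plain-HTTP localhost endpoint this etcd serves on):
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint health
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint status --write-out=table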
root@k8s-master:~# systemctl status etcd
○ etcd.service - etcd - highly-available key value store
Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2024-08-05 23:58:36 UTC; 2 days ago
Docs: https://etcd.io/docs
man:etcd
Process: 1453067 ExecStart=/usr/bin/etcd $DAEMON_ARGS (code=killed, signal=TERM)
Main PID: 1453067 (code=killed, signal=TERM)
CPU: 690ms
Aug 05 23:56:10 k8s-master etcd[1453067]: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 8
Aug 05 23:56:10 k8s-master etcd[1453067]: published {Name:k8s-master ClientURLs:[http://localhost:2379]} to cluster cdf818194e3a8c32
Aug 05 23:56:10 k8s-master etcd[1453067]: ready to serve client requests
Aug 05 23:56:10 k8s-master systemd[1]: Started etcd - highly-available key value store.
Aug 05 23:56:10 k8s-master etcd[1453067]: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
Aug 05 23:58:36 k8s-master systemd[1]: Stopping etcd - highly-available key value store...
Aug 05 23:58:36 k8s-master etcd[1453067]: received terminated signal, shutting down...
Aug 05 23:58:36 k8s-master etcd[1453067]: skipped leadership transfer for single member cluster
Aug 05 23:58:36 k8s-master systemd[1]: etcd.service: Deactivated successfully.
Aug 05 23:58:36 k8s-master systemd[1]: Stopped etcd - highly-available key value store.
root@<kube-master>:~# systemctl start etcd
Job for etcd.service failed because the control process exited with error code.
See "systemctl status etcd.service" and "journalctl -xeu etcd.service" for details.
root@<kube-master>:~# systemctl status etcd.service
× etcd.service - etcd - highly-available key value store
Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Thu 2024-08-08 00:54:11 UTC; 14s ago
Docs: https://etcd.io/docs
man:etcd
Process: 2260046 ExecStart=/usr/bin/etcd $DAEMON_ARGS (code=exited, status=1/FAILURE)
Main PID: 2260046 (code=exited, status=1/FAILURE)
CPU: 16ms
Aug 08 00:54:11 k8s-master etcd[2260046]: Git SHA: Not provided (use ./build instead of go build)
Aug 08 00:54:11 k8s-master etcd[2260046]: Go Version: go1.18.1
Aug 08 00:54:11 k8s-master etcd[2260046]: Go OS/Arch: linux/amd64
Aug 08 00:54:11 k8s-master etcd[2260046]: setting maximum number of CPUs to 4, total number of available CPUs is 4
Aug 08 00:54:11 k8s-master etcd[2260046]: the server is already initialized as member before, starting as etcd member...
Aug 08 00:54:11 k8s-master etcd[2260046]: listening for peers on http://localhost:2380
Aug 08 00:54:11 k8s-master etcd[2260046]: listen tcp 127.0.0.1:2379: bind: address already in use
Aug 08 00:54:11 k8s-master systemd[1]: etcd.service: Main process exited, code=exited, status=1/FAILURE
Aug 08 00:54:11 k8s-master systemd[1]: etcd.service: Failed with result 'exit-code'.
Aug 08 00:54:11 k8s-master systemd[1]: Failed to start etcd - highly-available key value store.
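- bind: address already in use means something is already holding 127.0.0.1:2379, most likely a leftover etcd process. To see exactly what owns the port (a sketch; lsof shown as an alternative):
sudo ss -lptn 'sport = :2379'
sudo lsof -i :2379
With the stray process identified, stop the service, kill whatever is left, and start it again: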
sudo systemctl stop etcd
sudo kill -9 $(pgrep etcd)
sudo systemctl start etcd
sudo systemctl status etcd.service
● etcd.service - etcd - highly-available key value store
Loaded: loaded (/lib/systemd/system/etcd.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2024-08-08 01:00:14 UTC; 1min 39s ago
Docs: https://etcd.io/docs
man:etcd
Main PID: 2262281 (etcd)
Tasks: 9 (limit: 38334)
Memory: 7.2M
CPU: 663ms
CGroup: /system.slice/etcd.service
└─2262281 /usr/bin/etcd
Aug 08 01:00:11 k8s-master etcd[2262281]: enabled capabilities for version 3.3
Aug 08 01:00:13 k8s-master etcd[2262281]: 8e9e05c52164694d is starting a new election at term 8
Aug 08 01:00:13 k8s-master etcd[2262281]: 8e9e05c52164694d became candidate at term 9
Aug 08 01:00:13 k8s-master etcd[2262281]: 8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 9
Aug 08 01:00:13 k8s-master etcd[2262281]: 8e9e05c52164694d became leader at term 9
Aug 08 01:00:13 k8s-master etcd[2262281]: raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 9
Aug 08 01:00:14 k8s-master etcd[2262281]: ready to serve client requests
Aug 08 01:00:14 k8s-master etcd[2262281]: published {Name:k8s-master ClientURLs:[http://localhost:2379]} to cluster cdf818194e3a8c32
Aug 08 01:00:14 k8s-master systemd[1]: Started etcd - highly-available key value store.
Aug 08 01:00:14 k8s-master etcd[2262281]: serving insecure client requests on 127.0.0.1:2379, this is strongly discouraged!
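- With etcd back up, the last step is confirming the control plane actually recovered (a sketch of follow-up checks; kube-controller-manager should leave CrashLoopBackOff once leader election can reach a healthy apiserver/etcd again):
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint health
kubectl get pods -n kube-system
kubectl get events -n kube-system --sort-by=.lastTimestamp | tail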