Kubernetes - Troubleshooting a Prometheus OOMKilled Issue

From_A_To_Z · January 30, 2024

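1. Issue

The Prometheus pod in the K8S cluster was terminated with an OOMKilled error.
A mass creation of pods in K8S exhausted the available memory and Prometheus terminated abnormally. The error kept recurring afterward -> even after the pods were cleaned up and memory was freed, the same error occurred.
After each OOMKilled, the Prometheus pod is re-initialized and restarted, then hits OOMKilled again (CrashLoopBackOff).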
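
One way to confirm the symptom above is to inspect the pod's container statuses. The following is a minimal sketch, assuming the pod object was obtained as JSON (for example from `kubectl get pod <name> -o json`). The function name `oom_killed_containers`, the container name `prometheus-server`, and the sample values are hypothetical and not taken from the affected cluster.

```python
import json


def oom_killed_containers(pod):
    """Return names of containers whose last termination reason was OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is only present if the container has exited before.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status["name"])
    return names


# Hypothetical, trimmed pod object for illustration only.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "prometheus-server",
        "restartCount": 12,
        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```

A container that is waiting in CrashLoopBackOff while its last state reports OOMKilled matches the restart loop described in this section.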

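2. Cause

When Prometheus crashes (terminates abnormally), it performs a replay on restart: it re-reads the existing WAL (write-ahead log) to recover the original data.
During replay, the data in the WAL files is loaded into in-memory buffers, so the more WAL files there are, the more memory is required (maximum size of one WAL file: 128MB).
As a result, if the WAL data to recover is larger than the memory allocated to Prometheus, an OOM error occurs.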
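
To get a rough feel for how much WAL a restart would have to replay, the segment files on disk can be counted and summed. This is a minimal sketch, not a sizing formula: on-disk WAL size is only a rough indicator of the memory replay needs, since the data is decoded into in-memory structures. It assumes segment files sit directly under the WAL directory and have zero-padded numeric names; the helper name `wal_segment_stats` is hypothetical.

```python
import re
from pathlib import Path

# Assumption: WAL segment files have purely numeric, zero-padded names
# (e.g. 00003945); checkpoint directories and other entries are skipped.
SEGMENT_NAME = re.compile(r"\d{8,}")


def wal_segment_stats(wal_dir):
    """Return (segment_count, total_bytes) for segment files directly under wal_dir."""
    count, total = 0, 0
    for entry in Path(wal_dir).iterdir():
        if entry.is_file() and SEGMENT_NAME.fullmatch(entry.name):
            count += 1
            total += entry.stat().st_size
    return count, total


def format_mib(num_bytes):
    """Format a byte count as mebibytes with one decimal place."""
    return f"{num_bytes / (1024 * 1024):.1f}MiB"
```

For example, `wal_segment_stats("/data/wal")` could be run inside the Prometheus container, where `/data/wal` is an assumed path that depends on the configured storage directory. If the total approaches the container's memory limit, a crash followed by a replay is likely to run into the same OOM.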

3. Resolution

Increased the memory allocation of the Prometheus container from 8000Mi to 16000Mi → confirmed that it started normally.

        resources:
          limits:
            cpu: "2"
            memory: 16000Mi   # raised from 8000Mi
          requests:
            cpu: "2"
            memory: 16000Mi   # raised from 8000Mi

After the normal startup, reverting the memory size to the original value (8000Mi) and restarting also worked → presumably because the WAL data to recover was smaller than it was after the earlier abnormal termination.

ts=2023-05-18T01:00:36.619Z caller=head.go:493 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2023-05-18T01:00:36.670Z caller=head.go:536 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=51.105706ms
ts=2023-05-18T01:00:36.670Z caller=head.go:542 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2023-05-18T01:00:46.073Z caller=head.go:578 level=info component=tsdb msg="WAL checkpoint loaded"
ts=2023-05-18T01:00:48.380Z caller=head.go:613 level=info component=tsdb msg="WAL segment loaded" segment=3945 maxSegment=3948
ts=2023-05-18T01:00:54.070Z caller=head.go:613 level=info component=tsdb msg="WAL segment loaded" segment=3946 maxSegment=3948
ts=2023-05-18T01:00:54.175Z caller=head.go:613 level=info component=tsdb msg="WAL segment loaded" segment=3947 maxSegment=3948
ts=2023-05-18T01:00:54.176Z caller=head.go:613 level=info component=tsdb msg="WAL segment loaded" segment=3948 maxSegment=3948
ts=2023-05-18T01:00:54.176Z caller=head.go:619 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=9.402768009s wal_replay_duration=8.103580105s total_replay_duration=17.557514431s
ts=2023-05-18T01:00:54.733Z caller=main.go:991 level=warn fs_type=NFS_SUPER_MAGIC msg="This filesystem is not supported and may lead to data corruption and data loss. Please carefully read https://prometheus.io/docs/prometheus/latest/storage/ to learn more about supported filesystems."
ts=2023-05-18T01:00:54.733Z caller=main.go:996 level=info msg="TSDB started"
ts=2023-05-18T01:00:54.734Z caller=main.go:1177 level=info msg="Loading configuration file" filename=/etc/config/prometheus.yml
ts=2023-05-18T01:00:54.736Z caller=kubernetes.go:325 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-05-18T01:00:54.737Z caller=kubernetes.go:325 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-05-18T01:00:54.737Z caller=kubernetes.go:325 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-05-18T01:00:54.737Z caller=kubernetes.go:325 level=info component="discovery manager scrape" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-05-18T01:00:54.737Z caller=kubernetes.go:325 level=info component="discovery manager notify" discovery=kubernetes msg="Using pod service account via in-cluster config"
ts=2023-05-18T01:00:54.737Z caller=main.go:1214 level=info msg="Completed loading of configuration file" filename=/etc/config/prometheus.yml totalDuration=3.935451ms db_storage=1.792µs remote_storage=4.519µs web_handler=676ns query_engine=1.473µs scrape=277.815µs scrape_sd=1.094468ms notify=72.292µs notify_sd=174.763µs rules=30.14µs tracing=10.05µs
ts=2023-05-18T01:00:54.737Z caller=main.go:957 level=info msg="Server is ready to receive web requests."
ts=2023-05-18T01:00:54.738Z caller=manager.go:937 level=info component="rule manager" msg="Starting rule manager..."

→ On a restart that follows a normal startup, fewer WAL segments are loaded than after the abnormal termination, as the log above shows.
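
The segment count and replay duration can also be pulled out of the startup log automatically, which makes it easier to compare restarts. The following is a minimal sketch that matches the message format seen in the log above; the function name `summarize_wal_replay` is hypothetical, and the regular expressions assume this exact log format, which may differ between Prometheus versions.

```python
import re

# Patterns match the log lines shown above; other versions may log differently.
SEGMENT_RE = re.compile(r'msg="WAL segment loaded" segment=(\d+) maxSegment=(\d+)')
TOTAL_RE = re.compile(r"total_replay_duration=(\S+)")


def summarize_wal_replay(log_text):
    """Summarize WAL replay progress found in Prometheus startup log text."""
    segments = [int(m.group(1)) for m in SEGMENT_RE.finditer(log_text)]
    total = TOTAL_RE.search(log_text)
    return {
        "segments_loaded": len(segments),
        "first_segment": segments[0] if segments else None,
        "last_segment": segments[-1] if segments else None,
        "total_replay_duration": total.group(1) if total else None,
    }


# Excerpt copied from the startup log above.
sample = '''
ts=2023-05-18T01:00:48.380Z caller=head.go:613 level=info component=tsdb msg="WAL segment loaded" segment=3945 maxSegment=3948
ts=2023-05-18T01:00:54.070Z caller=head.go:613 level=info component=tsdb msg="WAL segment loaded" segment=3946 maxSegment=3948
ts=2023-05-18T01:00:54.175Z caller=head.go:613 level=info component=tsdb msg="WAL segment loaded" segment=3947 maxSegment=3948
ts=2023-05-18T01:00:54.176Z caller=head.go:613 level=info component=tsdb msg="WAL segment loaded" segment=3948 maxSegment=3948
ts=2023-05-18T01:00:54.176Z caller=head.go:619 level=info component=tsdb msg="WAL replay completed" checkpoint_replay_duration=9.402768009s wal_replay_duration=8.103580105s total_replay_duration=17.557514431s
'''

print(summarize_wal_replay(sample))
```

Running the same summary against the log of a crash-looping pod and the log of a healthy restart gives a direct comparison of how many segments each replay had to load.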

References

https://engineering.linecorp.com/ko/blog/prometheus-container-kubernetes-cluster
https://stackoverflow.com/questions/63541085/kubernetes-prometheus-crashloopbackoff-oomkilled-puzzle
https://blog.naver.com/PostView.nhn?blogId=alice_k106&logNo=221829384846
