MLOps Error Workaround

문주은 · January 16, 2024

Error name : OutOfMemory

1) Detail

Running pipeline.py fails with an OOMKilled error

This step is in Error state with this message: OOMKilled (exit code 137)

2) Reason

OutOfMemory: the container used more memory than it was allowed and was killed (exit code 137 = 128 + SIGKILL)

3) Solutions

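Solution1) Optimize the data size

Compress numpy arrays into an npz file (np.savez_compressed)

Downscale images with cv2.resize

cv2.resize(img, (256, 256))  # or cv2.resize(img, None, fx=0.5, fy=0.5); fx/fy are ignored when a target size is given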
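
To see why downscaling helps against OOMKilled, it can be useful to estimate how much memory a batch of decoded images occupies. The sketch below is plain arithmetic; the function names and the example batch of 1,000 RGB uint8 images are assumptions for illustration, not values taken from the pipeline.

```python
def image_batch_bytes(count, height, width, channels=3, itemsize=1):
    """Bytes needed to hold `count` decoded images in memory.

    itemsize is the size of one pixel value in bytes
    (1 for uint8, 4 for float32).
    """
    return count * height * width * channels * itemsize


def to_mib(num_bytes):
    """Convert bytes to MiB (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical batch: 1,000 RGB uint8 images.
    original = image_batch_bytes(1000, 1024, 1024)
    resized = image_batch_bytes(1000, 256, 256)
    print(f"1024x1024: {to_mib(original):.1f} MiB")
    print(f"256x256:   {to_mib(resized):.1f} MiB")
```

Converting the same batch to float32 (itemsize=4) multiplies both figures by four, which is worth checking before the arrays are loaded in a step with a small memory limit.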

Solution2) Adjust the pod CPU resources
resources:
  limits:
    cpu: "2"
  requests:
    cpu: "1"
What requests and limits mean

requests : the minimum CPU resources the container needs

limits : the maximum CPU resources the container may use

Solution3) Adjust the pod memory resources
resources:
  limits:
    memory: "1Gi"
  requests:
    memory: "1Gi"
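
Before picking a value for limits.memory, it may help to compare the limit against a rough estimate of what the step holds in memory. This is a minimal sketch that assumes only the common Ki/Mi/Gi/Ti and k/M/G/T suffixes; parse_memory and fits_in_limit are hypothetical helper names, not part of any Kubernetes client.

```python
# Binary (Ki, Mi, ...) and decimal (k, M, ...) suffixes used by Kubernetes
# memory quantities. Only this common subset is handled here.
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory(quantity):
    """Convert a quantity such as "1Gi" or "512M" to a number of bytes."""
    quantity = quantity.strip()
    # Check two-letter suffixes first so "Gi" is not mistaken for "G".
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # plain number of bytes


def fits_in_limit(needed_bytes, limit):
    """True if the estimated working set fits under the memory limit."""
    return needed_bytes <= parse_memory(limit)


if __name__ == "__main__":
    # Hypothetical working set of 3000 MiB against the 1Gi limit above.
    print(fits_in_limit(3000 * 1024**2, "1Gi"))
```

If the estimate does not fit, either the limit has to go up or the data has to shrink as in Solution1.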

Error name : minikube stops unexpectedly

1) Detail

Unable to connect to the server: dial tcp 192.168.49.2:8443: connect: no route to host

2) Reason

There are several possible causes

Out of memory
-> To check whether it is a resource problem (memory available inside the minikube node) : $ minikube ssh -- free -h
Temporary interruption
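
The message in Detail means the Kubernetes API server address cannot be reached at all. One way to confirm that from a script, before looking into memory, is a plain TCP connection attempt. A minimal sketch using only the standard library; api_server_reachable is a hypothetical helper name, and the address is the one from the error message.

```python
import socket


def api_server_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "no route to host", "connection refused" and timeouts.
        return False


if __name__ == "__main__":
    # Address taken from the error message above.
    print(api_server_reachable("192.168.49.2", 8443))
```

A False result only says the port is unreachable; it does not tell whether the cause is memory pressure or a stopped minikube node.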

3) Solutions

Solution1) If it is a temporary interruption, start minikube again
$ minikube start

Solution2) Temporary fix: register a system service that starts minikube

#!/bin/bash
# minikube-start.sh (the shebang must be the first line of the file)
minikube start

$ chmod +x minikube-start.sh

$ sudo nano /etc/systemd/system/minikube-start.service
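[Unit]
Description=Start Minikube
After=network.target
[Service]
# minikube start exits once the cluster is up, so run it as a one-shot job
Type=oneshot
RemainAfterExit=yes
# Assumption: the docker driver refuses to run as root, so set the user that owns the minikube profile
User={user_name}
ExecStart=/path/to/minikube-start.sh
[Install]
WantedBy=default.target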

$ sudo systemctl enable minikube-start.service
$ sudo systemctl start minikube-start.service

Error name : kubeflow pipeline cache issue

1) Detail

A kubeflow pipeline keeps returning the results of the previous code even though the code has been changed

2) Reason

The output produced by the previous version of the code was cached and is being reused

3) Solutions

Fix it inside pipeline.py by setting the maximum cache staleness of the task to zero

task = dsl.ContainerOp(...)
task.execution_options.caching_strategy.max_cache_staleness = "P0D"
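
The value "P0D" is an ISO 8601 duration meaning zero days, so any cached result counts as too old to reuse. The sketch below only illustrates that rule with the standard library; parse_duration and cache_is_usable are hypothetical names, and this is not how Kubeflow Pipelines implements its cache internally.

```python
import re
from datetime import datetime, timedelta

# Handles only the day/hour/minute/second parts of an ISO 8601 duration,
# e.g. "P0D", "P1D", "PT12H", "P1DT6H30M".
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def parse_duration(text):
    """Convert a simple ISO 8601 duration string to a timedelta."""
    match = _DURATION.match(text)
    if match is None or not any(match.groupdict().values()):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = {name: int(value) for name, value in match.groupdict().items() if value}
    return timedelta(**parts)


def cache_is_usable(cached_at, now, max_staleness):
    """True if a result cached at `cached_at` is still fresh enough at `now`."""
    return now - cached_at < parse_duration(max_staleness)


if __name__ == "__main__":
    now = datetime(2024, 1, 16, 12, 0, 0)
    one_hour_ago = now - timedelta(hours=1)
    print(cache_is_usable(one_hour_ago, now, "P0D"))  # zero staleness allowed
    print(cache_is_usable(one_hour_ago, now, "P1D"))  # up to one day allowed
```

With "P0D" the age of a cached result can never be smaller than the allowed staleness, so the step is executed again on every run.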

Error name : apt-get update step fails

1) Detail

 => ERROR [ 6/10] RUN apt-get update
...
#8 0.575 E: The repository 'http://deb.debian.org/debian bullseye-updates InRelease' is not signed. 
------
process "/bin/sh -c apt-get update" did not complete successfully: exit code: 100

2) Reason

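Cause 1 - Docker uses a layer cache to shorten image build time. If the package information changed after the cached apt-get update layer was created, the build keeps using the stale package lists stored in the cache and never sees the updated ones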

3) Solutions

Solution1) Disable the cache
$ docker image build --no-cache -t {image_name}:{tag} .

Solution2) Restart Docker via systemd

Restart the docker service
$ sudo systemctl restart docker

Solution3) Clear the build cache

Remove the unused build cache in bulk
$ docker builder prune

See https://stackoverflow.com/questions/62473932/atleast-one-invalid-signature-was-encountered for a similar error

Error name : error while configuring GPU for minikube

1) Detail

Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory

2) Reason

The error is raised by a program or library that depends on the NVIDIA GPU stack
Check whether the NVML library (libnvidia-ml.so.1) is installed on the system

3) Solutions

Solution1) Edit daemon.json

Edit /etc/docker/daemon.json while minikube is running

Things to check
$ ldconfig -p | grep nvidia-ml
        libnvidia-ml.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libnvidia-ml.so.1
        libnvidia-ml.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libnvidia-ml.so
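
The same check can be done from Python by asking the dynamic loader to open the library, which is what fails in the error message. A minimal sketch using ctypes from the standard library; can_load_library is a hypothetical helper name, and it has to run in the same environment (host or container) where the error appears.

```python
import ctypes


def can_load_library(name="libnvidia-ml.so.1"):
    """Return True if the dynamic loader can open the shared library."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Same failure the error message reports:
        # "cannot open shared object file".
        return False


if __name__ == "__main__":
    print("NVML loadable:", can_load_library())
```

If ldconfig lists the library on the host but this returns False inside the container, the library is probably not being mounted into the container.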

Error name : Still waiting on: "kubernetes"

1) Detail

coredns is stuck waiting for its kubernetes plugin to finish initializing

[INFO] plugin/ready: Still waiting on: "kubernetes"
[INFO] plugin/ready: Still waiting on: "kubernetes"

2) Reason

3) Solutions

Check the event log

$ kubectl get events -n kube-system
6s  Warning  Unhealthy  pod/coredns-78fcd69978-2sdzb   
Readiness probe failed: HTTP probe failed with statuscode: 503

Make the ports match (coredns & readinessProbe)
$ kubectl get pod coredns-78fcd69978-2sdzb -n kube-system -o yaml

name: coredns
ports:
- containerPort: 53   ## coredns uses port 53
  name: dns
  protocol: UDP
- containerPort: 53
  name: dns-tcp
  protocol: TCP
- containerPort: 9153
  name: metrics
  protocol: TCP 
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /ready
    port: 8181      ## change the readinessProbe port 8181 -> 53
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1

Error name : scheduler in kube-system unhealthy

1) Detail

Get "http://127.0.0.1:10251/healthz": dial tcp 127.0.0.1:10251: connect: connection refused

2) Reason

3) Solutions

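See https://sarc.io/index.php/cloud/2179-k8s-componentstatuses-unhealth
(Delete the --port=0 flag from the kube-scheduler startup arguments; it disables port 10251, which the health check calls)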

Template

Error name :

1) Detail

2) Reason

3) Solutions

Solution1) AAA
Solution2) BBB
