Troubleshooting: Katib Pods Stuck in the Running State

노하람 · December 6, 2021

Situation

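  • To give the conclusion first: the Katib error in the random hyperparameter example occurred because the sidecar-injection setting related to TFJob had not been added. The fix is described at the bottom of this post.
  1. In the Kubeflow dashboard, the pods stay in the Running state indefinitely.
  2. Checking the pods in Lens also shows Status as Running.
  3. kubectl get pods -A shows some of the pods in the Error state.
  4. kubectl get jobs -A shows that none of the Katib-related jobs has any COMPLETIONS.
  5. The experiments appear to be working normally.
  6. The trials appear to be working normally as well.
  7. The describe output of one of the pods in the Error state is shown below.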
(base) noharam@nohalam-ui-MacBookPro ~ % kubectl describe pod/random-qwllpfh9-mgqtd -n moey920
Name:         random-qwllpfh9-mgqtd
Namespace:    moey920
Priority:     0
Node:         ip-192-168-26-121.ap-northeast-2.compute.internal/192.168.26.121
Start Time:   Fri, 03 Dec 2021 18:04:59 +0900
Labels:       controller-uid=75ec3557-7bb3-47cf-b518-379f5cda74b2
              job-name=random-qwllpfh9
Annotations:  kubernetes.io/psp: eks.privileged
              sidecar.istio.io/status:
                {"version":"5f3ae3613c7945ef767cb9fd594596bc001ff3ab915f12e4379c0cb5648d2729","initContainers":["istio-init"],"containers":["istio-proxy"]...
Status:       Running
IP:           192.168.26.112
IPs:
  IP:           192.168.26.112
Controlled By:  Job/random-qwllpfh9
Init Containers:
  istio-init:
    Container ID:  containerd://f5431ad2e212cc46b98a0024281420c71c50d64ec2ea1a2f5092881597eed031
    Image:         docker.io/istio/proxy_init:1.1.6
    Image ID:      docker.io/istio/proxy_init@sha256:54d89fb2b3b0a2365f2d2b0a8862f1f8320a63ab6a09c637c60f13f6021c4609
    Port:          <none>
    Host Port:     <none>
    Args:
      -p
      15001
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x
      
      -b
      
      -d
      15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 03 Dec 2021 18:05:00 +0900
      Finished:     Fri, 03 Dec 2021 18:05:00 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:        10m
      memory:     10Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hfjl5 (ro)
Containers:
  training-container:
    Container ID:  containerd://59c4a5604a91435ec31cb6003abf13f52782136d73a2e418ea1e19bbc352063b
    Image:         docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727
    Image ID:      docker.io/kubeflowkatib/mxnet-mnist@sha256:9bbfc47d1fc369e79d0b4e83f26b3060941eb0d0792c758a4ce27b4bd90a6c48
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      python3 /opt/mxnet-mnist/mnist.py --batch-size=64 --lr=0.012918714475815489 --num-layers=4 --optimizer=adam 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 03 Dec 2021 18:05:30 +0900
      Finished:     Fri, 03 Dec 2021 18:05:32 +0900
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hfjl5 (ro)
  istio-proxy:
    Container ID:  containerd://d8bc2a269feaa77fc5ab0f3ecb67f062c65c3452e75971ef67137bafef1aa586
    Image:         docker.io/istio/proxyv2:1.1.6
    Image ID:      docker.io/istio/proxyv2@sha256:e7ee1ad38bd5b556ad0527ac691a9f647b66835960417b154c5d28b2ed9219cb
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --configPath
      /etc/istio/proxy
      --binaryPath
      /usr/local/bin/envoy
      --serviceCluster
      istio-proxy.moey920
      --drainDuration
      45s
      --parentShutdownDuration
      1m0s
      --discoveryAddress
      istio-pilot.istio-system:15010
      --zipkinAddress
      zipkin.istio-system:9411
      --connectTimeout
      10s
      --proxyAdminPort
      15000
      --concurrency
      2
      --controlPlaneAuthPolicy
      NONE
      --statusPort
      15020
      --applicationPorts
      
    State:          Running
      Started:      Fri, 03 Dec 2021 18:05:31 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  128Mi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15020/healthz/ready delay=1s timeout=1s period=2s #success=1 #failure=30
    Environment:
      POD_NAME:                      random-qwllpfh9-mgqtd (v1:metadata.name)
      POD_NAMESPACE:                 moey920 (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      ISTIO_META_POD_NAME:           random-qwllpfh9-mgqtd (v1:metadata.name)
      ISTIO_META_CONFIG_NAMESPACE:   moey920 (v1:metadata.namespace)
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_METAJSON_ANNOTATIONS:    {"kubernetes.io/psp":"eks.privileged"}
                                     
      ISTIO_METAJSON_LABELS:         {"controller-uid":"75ec3557-7bb3-47cf-b518-379f5cda74b2","job-name":"random-qwllpfh9"}
                                     
    Mounts:
      /etc/certs/ from istio-certs (ro)
      /etc/istio/proxy from istio-envoy (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hfjl5 (ro)
  metrics-logger-and-collector:
    Container ID:  containerd://c73729be7f64d7ddf01c6ede062e6aa505dbf621053eeb68c42d00cf69ee3543
    Image:         docker.io/kubeflowkatib/file-metrics-collector:v1beta1-a96ff59
    Image ID:      docker.io/kubeflowkatib/file-metrics-collector@sha256:f262616f5adea780dacaabfdd1c8338b7c9eb7bd16088ae2acdc1887a0020869
    Port:          <none>
    Host Port:     <none>
    Args:
      -t
      random-qwllpfh9
      -m
      Validation-accuracy;Train-accuracy
      -o-type
      maximize
      -s-db
      katib-db-manager.kubeflow:6789
      -path
      /var/log/katib/metrics.log
    State:          Running
      Started:      Fri, 03 Dec 2021 18:05:39 +0900
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hfjl5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-hfjl5:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio.default
    Optional:    true
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
  8. kubectl logs -f pod/random-qwllpfh9-mgqtd -n moey920 --all-containers shows that a particular warning keeps recurring.
[2021-12-06 03:12:47.347][81][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13, 
[2021-12-06 03:12:48.213][81][warning][misc] [external/envoy/source/common/protobuf/utility.cc:174] Using deprecated option 'envoy.api.v2.Listener.use_original_dst' from file lds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.

  9. The logs of another pod in the Error state show the same warning recurring, but also show that something had progressed. (Of the 3 pods in the Error state, the other 2 look like this.)
[2021-12-06 03:34:04.476][76][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13, 
[2021-12-06 03:34:05.611][76][warning][misc] [external/envoy/source/common/protobuf/utility.cc:174] Using deprecated option 'envoy.api.v2.Listener.use_original_dst' from file lds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
I1203 09:05:39.647637     102 main.go:136] 
I1203 09:05:39.647646     102 main.go:136] During handling of the above exception, another exception occurred:
I1203 09:05:39.647652     102 main.go:136] 
I1203 09:05:39.647660     102 main.go:136] Traceback (most recent call last):
I1203 09:05:39.647668     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 449, in send
I1203 09:05:39.647675     102 main.go:136]     timeout=timeout
+ for gid in '${PROXY_GID}'
+ iptables -t nat -A ISTIO_OUTPUT -m owner --gid-owner 1337 -j RETURN
+ iptables -t nat -A ISTIO_OUTPUT -d 127.0.0.1/32 -j RETURN
[2021-12-06 04:06:47.945][76][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13, 
[2021-12-06 04:06:49.014][76][warning][misc] [external/envoy/source/common/protobuf/utility.cc:174] Using deprecated option 'envoy.api.v2.Listener.use_original_dst' from file lds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2021-12-06 04:34:53.461][76][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13, 
I1203 09:05:39.647678     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
I1203 09:05:39.647774     102 main.go:136]     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
I1203 09:05:39.647790     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/urllib3/util/retry.py", line 573, in increment
I1203 09:05:39.647808     102 main.go:136]     raise MaxRetryError(_pool, url, error or ResponseError(cause))
I1203 09:05:39.647811     102 main.go:136] urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='data.mxnet.io', port=80): Max retries exceeded with url: /data/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7faa0d14ebe0>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
[2021-12-06 04:34:54.407][76][warning][misc] [external/envoy/source/common/protobuf/utility.cc:174] Using deprecated option 'envoy.api.v2.Listener.use_original_dst' from file lds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
I1203 09:05:39.647819     102 main.go:136] 
I1203 09:05:39.647833     102 main.go:136] During handling of the above exception, another exception occurred:
I1203 09:05:39.647850     102 main.go:136] 
I1203 09:05:39.647854     102 main.go:136] Traceback (most recent call last):
I1203 09:05:39.647861     102 main.go:136]   File "/opt/mxnet-mnist/mnist.py", line 86, in <module>
+ '[' -n '' ']'
+ '[' '*' == '*' ']'
+ iptables -t nat -A ISTIO_OUTPUT -j ISTIO_REDIRECT
+ set +o nounset
I1203 09:05:39.647870     102 main.go:136]     fit.fit(args, sym, get_mnist_iter)
+ '[' -n '' ']'
I1203 09:05:39.647891     102 main.go:136]   File "/opt/mxnet-mnist/common/fit.py", line 185, in fit
+ ip6tables -F INPUT
+ ip6tables -A INPUT -m state --state ESTABLISHED -j ACCEPT
+ ip6tables -A INPUT -i lo -d ::1 -j ACCEPT
+ ip6tables -A INPUT -j REJECT
+ dump
+ iptables-save
# Generated by iptables-save v1.6.0 on Fri Dec  3 09:04:59 2021
*mangle
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
I1203 09:05:39.647896     102 main.go:136]     (train, val) = data_loader(args, kv)
I1203 09:05:39.647903     102 main.go:136]   File "/opt/mxnet-mnist/mnist.py", line 44, in get_mnist_iter
I1203 09:05:39.647907     102 main.go:136]     mnist = mx.test_utils.get_mnist()
I1203 09:05:39.647914     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1907, in get_mnist
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
COMMIT
# Completed on Fri Dec  3 09:04:59 2021
# Generated by iptables-save v1.6.0 on Fri Dec  3 09:04:59 2021
*nat
:PREROUTING ACCEPT [0:0]
I1203 09:05:39.647919     102 main.go:136]     path+'train-labels-idx1-ubyte.gz', path+'train-images-idx3-ubyte.gz')
I1203 09:05:39.647943     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1894, in read_data
I1203 09:05:39.647958     102 main.go:136]     with gzip.open(mx.test_utils.download(label_url)) as flbl:
I1203 09:05:39.647962     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1812, in download
I1203 09:05:39.647970     102 main.go:136]     raise e
I1203 09:05:39.647974     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1802, in download
I1203 09:05:39.647982     102 main.go:136]     r = requests.get(url, stream=True)
I1203 09:05:39.647991     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 76, in get
I1203 09:05:39.648002     102 main.go:136]     return request('get', url, params=params, **kwargs)
I1203 09:05:39.648008     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 61, in request
I1203 09:05:39.648015     102 main.go:136]     return session.request(method=method, url=url, **kwargs)
I1203 09:05:39.648019     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 542, in request
I1203 09:05:39.648037     102 main.go:136]     resp = self.send(prep, **send_kwargs)
I1203 09:05:39.648046     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 655, in send
I1203 09:05:39.648060     102 main.go:136]     r = adapter.send(request, **kwargs)
I1203 09:05:39.648072     102 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 516, in send
I1203 09:05:39.648079     102 main.go:136]     raise ConnectionError(e, request=request)
I1203 09:05:39.648111     102 main.go:136] requests.exceptions.ConnectionError: HTTPConnectionPool(host='data.mxnet.io', port=80): Max retries exceeded with url: /data/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7faa0d14ebe0>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:ISTIO_IN_REDIRECT - [0:0]
:ISTIO_OUTPUT - [0:0]
:ISTIO_REDIRECT - [0:0]
-A OUTPUT -p tcp -j ISTIO_OUTPUT
-A ISTIO_IN_REDIRECT -p tcp -j REDIRECT --to-ports 15001
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -o lo -j ISTIO_REDIRECT
-A ISTIO_OUTPUT -m owner --uid-owner 1337 -j RETURN
-A ISTIO_OUTPUT -m owner --gid-owner 1337 -j RETURN
-A ISTIO_OUTPUT -d 127.0.0.1/32 -j RETURN
-A ISTIO_OUTPUT -j ISTIO_REDIRECT
-A ISTIO_REDIRECT -p tcp -j REDIRECT --to-ports 15001
COMMIT
# Completed on Fri Dec  3 09:04:59 2021
+ ip6tables-save
# Generated by ip6tables-save v1.6.0 on Fri Dec  3 09:04:59 2021
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED -j ACCEPT
-A INPUT -d ::1/128 -i lo -j ACCEPT
-A INPUT -j REJECT --reject-with icmp6-port-unreachable
COMMIT
# Completed on Fri Dec  3 09:04:59 2021
  10. The one remaining pod, which is in the Running state, is waiting for Suggestions to return values for 3 new Trials.
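
The mismatch between the dashboard (Running) and kubectl get pods (Error) is consistent with the describe output above: the pod-level Status is Running because istio-proxy and metrics-logger-and-collector are still running, while training-container has already terminated with exit code 1. A minimal sketch of that check, assuming the pod is available as the JSON object printed by kubectl get pod <name> -o json; the function names and the trimmed sample_pod dict are hypothetical and only mirror the container states shown in the describe output:

```python
from typing import Any, Dict, List, Tuple


def failed_containers(pod: Dict[str, Any]) -> List[Tuple[str, str, int]]:
    """Return (name, reason, exit_code) for containers that exited with a non-zero code."""
    failed = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            failed.append(
                (status["name"], terminated.get("reason", ""), terminated["exitCode"])
            )
    return failed


def is_stuck_running(pod: Dict[str, Any]) -> bool:
    """True when the pod phase is Running although at least one container has failed."""
    phase = pod.get("status", {}).get("phase")
    return phase == "Running" and bool(failed_containers(pod))


# Hypothetical sample, trimmed to the fields used above. The container names,
# reason and exit code are taken from the describe output in this post.
sample_pod = {
    "metadata": {"name": "random-qwllpfh9-mgqtd", "namespace": "moey920"},
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {
                "name": "training-container",
                "state": {"terminated": {"reason": "Error", "exitCode": 1}},
            },
            {"name": "istio-proxy", "state": {"running": {}}},
            {"name": "metrics-logger-and-collector", "state": {"running": {}}},
        ],
    },
}

print(failed_containers(sample_pod))
print(is_stuck_running(sample_pod))
```

A pod that matches this check will never complete on its own, because the sidecar containers keep it alive after the training container has failed.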

Hypotheses

  1. An istio-init error?
ENVOY_PORT=
ISTIO_INBOUND_INTERCEPTION_MODE=
ISTIO_INBOUND_TPROXY_MARK=
ISTIO_INBOUND_TPROXY_ROUTE_TABLE=
ISTIO_INBOUND_PORTS=
ISTIO_LOCAL_EXCLUDE_PORTS=
ISTIO_SERVICE_CIDR=
ISTIO_SERVICE_EXCLUDE_CIDR=
  2. Some other error?
I1203 09:05:39.628637      97 main.go:136] Traceback (most recent call last):
I1203 09:05:39.628660      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 449, in send
I1203 09:05:39.628669      97 main.go:136]     timeout=timeout
I1203 09:05:39.628680      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
I1203 09:05:39.628688      97 main.go:136]     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
I1203 09:05:39.628693      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/urllib3/util/retry.py", line 573, in increment
I1203 09:05:39.628700      97 main.go:136]     raise MaxRetryError(_pool, url, error or ResponseError(cause))
I1203 09:05:39.628734      97 main.go:136] urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='data.mxnet.io', port=80): Max retries exceeded with url: /data/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20e23d4d68>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
I1203 09:05:39.628744      97 main.go:136] 
I1203 09:05:39.628762      97 main.go:136] During handling of the above exception, another exception occurred:
I1203 09:05:39.628772      97 main.go:136] 
I1203 09:05:39.628862      97 main.go:136] Traceback (most recent call last):
I1203 09:05:39.628873      97 main.go:136]   File "/opt/mxnet-mnist/mnist.py", line 86, in <module>
I1203 09:05:39.628888      97 main.go:136]     fit.fit(args, sym, get_mnist_iter)
I1203 09:05:39.628900      97 main.go:136]   File "/opt/mxnet-mnist/common/fit.py", line 185, in fit
I1203 09:05:39.628917      97 main.go:136]     (train, val) = data_loader(args, kv)
I1203 09:05:39.628927      97 main.go:136]   File "/opt/mxnet-mnist/mnist.py", line 44, in get_mnist_iter
I1203 09:05:39.628955      97 main.go:136]     mnist = mx.test_utils.get_mnist()
I1203 09:05:39.628971      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1907, in get_mnist
I1203 09:05:39.628991      97 main.go:136]     path+'train-labels-idx1-ubyte.gz', path+'train-images-idx3-ubyte.gz')
I1203 09:05:39.628995      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1894, in read_data
I1203 09:05:39.629002      97 main.go:136]     with gzip.open(mx.test_utils.download(label_url)) as flbl:
I1203 09:05:39.629006      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1812, in download
I1203 09:05:39.629014      97 main.go:136]     raise e
I1203 09:05:39.629018      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1802, in download
I1203 09:05:39.629025      97 main.go:136]     r = requests.get(url, stream=True)
I1203 09:05:39.629029      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 76, in get
I1203 09:05:39.629038      97 main.go:136]     return request('get', url, params=params, **kwargs)
I1203 09:05:39.629042      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 61, in request
I1203 09:05:39.629050      97 main.go:136]     return session.request(method=method, url=url, **kwargs)
I1203 09:05:39.629053      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 542, in request
I1203 09:05:39.629061      97 main.go:136]     resp = self.send(prep, **send_kwargs)
I1203 09:05:39.629065      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 655, in send
I1203 09:05:39.629072      97 main.go:136]     r = adapter.send(request, **kwargs)
I1203 09:05:39.629076      97 main.go:136]   File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 516, in send
I1203 09:05:39.629090      97 main.go:136]     raise ConnectionError(e, request=request)
I1203 09:05:39.629094      97 main.go:136] requests.exceptions.ConnectionError: HTTPConnectionPool(host='data.mxnet.io', port=80): Max retries exceeded with url: /data/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20e23d4d68>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
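
Because --all-containers merges the output of every container, the Python traceback ends up interleaved with the istio-init shell trace and the Envoy warnings, as in the longer log above. A minimal sketch for pulling the traceback back out, assuming that the lines carrying it are the ones with the klog-style prefix (for example "I1203 09:05:39.647854     102 main.go:136] "); the function name and the regular expression are my own and not part of Katib:

```python
import re
from typing import Iterable, List

# klog-style prefix: severity letter, mmdd, time with microseconds,
# thread id, source file and line, then "] " and the message.
KLOG_PREFIX = re.compile(
    r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d{6}\s+\d+ [^\s:]+:\d+\] ?(.*)$"
)


def collector_lines(lines: Iterable[str]) -> List[str]:
    """Keep only klog-prefixed lines and strip the prefix, preserving indentation."""
    messages = []
    for line in lines:
        match = KLOG_PREFIX.match(line.rstrip("\n"))
        if match:
            messages.append(match.group(1))
    return messages


# A few lines copied from the interleaved log in this post.
sample_log = [
    "+ iptables -t nat -A ISTIO_OUTPUT -j ISTIO_REDIRECT",
    "I1203 09:05:39.647854     102 main.go:136] Traceback (most recent call last):",
    'I1203 09:05:39.647861     102 main.go:136]   File "/opt/mxnet-mnist/mnist.py", line 86, in <module>',
    ":OUTPUT ACCEPT [0:0]",
    "I1203 09:05:39.647870     102 main.go:136]     fit.fit(args, sym, get_mnist_iter)",
]

for message in collector_lines(sample_log):
    print(message)
```

Filtering this way makes it easier to see that both failing pods end with the same ConnectionError for data.mxnet.io.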

Solution

"gRPC config stream closed: 13" appears every 30 minutes on the Istio ingressgateway

With Istio 1.1.x, Envoy log messages like the following, which describe deprecated options and configuration, are expected.

  • Related message: Using deprecated option 'envoy.api.v2.Listener.use_original_dst' from file lds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.

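  • Fix: a fix for this is already available in Istio 1.2. The message should stop appearing after upgrading to 1.2 or later. (link)
    - However, AWS Kubeflow is installed with a specific Istio version as a dependency, so I cannot predict how upgrading only Istio would affect the application. I set this aside for now and first checked whether there were other errors.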

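When the data file cannot be downloaded from the URL

While reading the logs above, I found that the training code failed to download a file from the URL in the following error.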

urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='data.mxnet.io', port=80): Max retries exceeded with url: /data/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f14ccd56940>: Failed to establish a new connection: [Errno 101] Network is unreachable',))

So I edited the YAML to use the latest version of the Docker image pulled from docker.io.
I then deleted the existing Katib objects and recreated them.

  • In the end this did not solve the problem, but while searching on the topic I learned that the sidecar-injection setting was the answer.

How to delete and recreate Katib experiments

  • You can delete them from the Katib UI.

  • Running kubectl delete -f <filename> also deletes all related objects, including the experiments, trials, jobs, and pods.

Sidecar injection (TFJob-related error)

What is TFJob?
TFJob is a Kubernetes custom resource to run TensorFlow training jobs on Kubernetes. The Kubeflow implementation of TFJob is in training-operator.

Note: TFJob doesn't work in a user namespace by default because of Istio automatic sidecar injection. In order to get TFJob running, it needs annotation sidecar.istio.io/inject: "false" to disable it for TFJob pods.

A TFJob is a resource with a YAML representation like the one below (edit to use the container image and command for your own training code)

For the TFJob-related issue, you need to add the annotation so that sidecar.istio.io/inject is set to "false".
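
As a rough illustration of where that annotation goes, here is a minimal sketch that adds it to the pod template of an Experiment held as a Python dict. It assumes the Katib v1beta1 layout (spec.trialTemplate.trialSpec) and a trial that is a plain Kubernetes Job, which matches "Controlled By: Job/random-qwllpfh9" in the describe output; for a TFJob trial the pod templates sit under the replica specs instead. The function name and the trimmed sample_experiment are hypothetical, not the author's actual YAML:

```python
import copy
import json
from typing import Any, Dict

ANNOTATION_KEY = "sidecar.istio.io/inject"


def disable_sidecar_injection(experiment: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the Experiment whose trial pods opt out of Istio injection."""
    patched = copy.deepcopy(experiment)
    # Assumed path for a Job-based trial: the Job's pod template.
    pod_template = patched["spec"]["trialTemplate"]["trialSpec"]["spec"]["template"]
    annotations = pod_template.setdefault("metadata", {}).setdefault("annotations", {})
    # Annotation values are strings, so this is "false" and not the boolean False.
    annotations[ANNOTATION_KEY] = "false"
    return patched


# Hypothetical, heavily trimmed Experiment; only the fields touched above are kept.
sample_experiment = {
    "apiVersion": "kubeflow.org/v1beta1",
    "kind": "Experiment",
    "metadata": {"name": "random", "namespace": "moey920"},
    "spec": {
        "trialTemplate": {
            "trialSpec": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [
                                {
                                    "name": "training-container",
                                    "image": "docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727",
                                }
                            ],
                            "restartPolicy": "Never",
                        }
                    }
                },
            }
        }
    },
}

patched = disable_sidecar_injection(sample_experiment)
print(json.dumps(patched["spec"]["trialTemplate"]["trialSpec"]["spec"]["template"], indent=2))
```

In the YAML file itself, the same change amounts to adding sidecar.istio.io/inject: "false" under metadata.annotations of the trial's pod template, then deleting and recreating the experiment.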

Now the jobs run normally, and the results can be viewed as a graph!
