kubectl get pods -A
shows that some pods are in the Error state.
kubectl get jobs -A
shows that none of the Katib-related jobs have any COMPLETIONS.
(base) noharam@nohalam-ui-MacBookPro ~ % kubectl describe pod/random-qwllpfh9-mgqtd -n moey920
Name: random-qwllpfh9-mgqtd
Namespace: moey920
Priority: 0
Node: ip-192-168-26-121.ap-northeast-2.compute.internal/192.168.26.121
Start Time: Fri, 03 Dec 2021 18:04:59 +0900
Labels: controller-uid=75ec3557-7bb3-47cf-b518-379f5cda74b2
job-name=random-qwllpfh9
Annotations: kubernetes.io/psp: eks.privileged
sidecar.istio.io/status:
{"version":"5f3ae3613c7945ef767cb9fd594596bc001ff3ab915f12e4379c0cb5648d2729","initContainers":["istio-init"],"containers":["istio-proxy"]...
Status: Running
IP: 192.168.26.112
IPs:
IP: 192.168.26.112
Controlled By: Job/random-qwllpfh9
Init Containers:
istio-init:
Container ID: containerd://f5431ad2e212cc46b98a0024281420c71c50d64ec2ea1a2f5092881597eed031
Image: docker.io/istio/proxy_init:1.1.6
Image ID: docker.io/istio/proxy_init@sha256:54d89fb2b3b0a2365f2d2b0a8862f1f8320a63ab6a09c637c60f13f6021c4609
Port: <none>
Host Port: <none>
Args:
-p
15001
-u
1337
-m
REDIRECT
-i
*
-x
-b
-d
15020
State: Terminated
Reason: Completed
Exit Code: 0
Started: Fri, 03 Dec 2021 18:05:00 +0900
Finished: Fri, 03 Dec 2021 18:05:00 +0900
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 50Mi
Requests:
cpu: 10m
memory: 10Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hfjl5 (ro)
Containers:
training-container:
Container ID: containerd://59c4a5604a91435ec31cb6003abf13f52782136d73a2e418ea1e19bbc352063b
Image: docker.io/kubeflowkatib/mxnet-mnist:v1beta1-45c5727
Image ID: docker.io/kubeflowkatib/mxnet-mnist@sha256:9bbfc47d1fc369e79d0b4e83f26b3060941eb0d0792c758a4ce27b4bd90a6c48
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
python3 /opt/mxnet-mnist/mnist.py --batch-size=64 --lr=0.012918714475815489 --num-layers=4 --optimizer=adam 1>/var/log/katib/metrics.log 2>&1 && echo completed > /var/log/katib/$$$$.pid
State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 03 Dec 2021 18:05:30 +0900
Finished: Fri, 03 Dec 2021 18:05:32 +0900
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/log/katib from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hfjl5 (ro)
istio-proxy:
Container ID: containerd://d8bc2a269feaa77fc5ab0f3ecb67f062c65c3452e75971ef67137bafef1aa586
Image: docker.io/istio/proxyv2:1.1.6
Image ID: docker.io/istio/proxyv2@sha256:e7ee1ad38bd5b556ad0527ac691a9f647b66835960417b154c5d28b2ed9219cb
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--configPath
/etc/istio/proxy
--binaryPath
/usr/local/bin/envoy
--serviceCluster
istio-proxy.moey920
--drainDuration
45s
--parentShutdownDuration
1m0s
--discoveryAddress
istio-pilot.istio-system:15010
--zipkinAddress
zipkin.istio-system:9411
--connectTimeout
10s
--proxyAdminPort
15000
--concurrency
2
--controlPlaneAuthPolicy
NONE
--statusPort
15020
--applicationPorts
State: Running
Started: Fri, 03 Dec 2021 18:05:31 +0900
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 128Mi
Requests:
cpu: 10m
memory: 40Mi
Readiness: http-get http://:15020/healthz/ready delay=1s timeout=1s period=2s #success=1 #failure=30
Environment:
POD_NAME: random-qwllpfh9-mgqtd (v1:metadata.name)
POD_NAMESPACE: moey920 (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
ISTIO_META_POD_NAME: random-qwllpfh9-mgqtd (v1:metadata.name)
ISTIO_META_CONFIG_NAMESPACE: moey920 (v1:metadata.namespace)
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_METAJSON_ANNOTATIONS: {"kubernetes.io/psp":"eks.privileged"}
ISTIO_METAJSON_LABELS: {"controller-uid":"75ec3557-7bb3-47cf-b518-379f5cda74b2","job-name":"random-qwllpfh9"}
Mounts:
/etc/certs/ from istio-certs (ro)
/etc/istio/proxy from istio-envoy (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hfjl5 (ro)
metrics-logger-and-collector:
Container ID: containerd://c73729be7f64d7ddf01c6ede062e6aa505dbf621053eeb68c42d00cf69ee3543
Image: docker.io/kubeflowkatib/file-metrics-collector:v1beta1-a96ff59
Image ID: docker.io/kubeflowkatib/file-metrics-collector@sha256:f262616f5adea780dacaabfdd1c8338b7c9eb7bd16088ae2acdc1887a0020869
Port: <none>
Host Port: <none>
Args:
-t
random-qwllpfh9
-m
Validation-accuracy;Train-accuracy
-o-type
maximize
-s-db
katib-db-manager.kubeflow:6789
-path
/var/log/katib/metrics.log
State: Running
Started: Fri, 03 Dec 2021 18:05:39 +0900
Ready: True
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
Requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
Environment: <none>
Mounts:
/var/log/katib from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hfjl5 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
kube-api-access-hfjl5:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-certs:
Type: Secret (a volume populated by a Secret)
SecretName: istio.default
Optional: true
metrics-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events: <none>
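The describe output is long, and the part that matters is the per-container state: training-container terminated with Error while the two sidecars kept running. A minimal sketch for printing just that summary follows. The helper name `container_states` is hypothetical, and the fields are read from the pod's `status.containerStatuses` (init containers are reported separately and are not included here).

```shell
# Hypothetical helper: print one line per container with its name,
# readiness, and termination reason (empty while the container is running).
# $1 = pod name, $2 = namespace
container_states() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}ready={.ready}{"\t"}{.state.terminated.reason}{"\n"}{end}'
}
```

For the pod above this would be called as `container_states random-qwllpfh9-mgqtd moey920`.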
kubectl logs -f pod/random-qwllpfh9-mgqtd -n moey920 --all-containers
shows that a specific error keeps recurring.
[2021-12-06 03:12:47.347][81][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13,
[2021-12-06 03:12:48.213][81][warning][misc] [external/envoy/source/common/protobuf/utility.cc:174] Using deprecated option 'envoy.api.v2.Listener.use_original_dst' from file lds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2021-12-06 03:34:04.476][76][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13,
[2021-12-06 03:34:05.611][76][warning][misc] [external/envoy/source/common/protobuf/utility.cc:174] Using deprecated option 'envoy.api.v2.Listener.use_original_dst' from file lds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
I1203 09:05:39.647637 102 main.go:136]
I1203 09:05:39.647646 102 main.go:136] During handling of the above exception, another exception occurred:
I1203 09:05:39.647652 102 main.go:136]
I1203 09:05:39.647660 102 main.go:136] Traceback (most recent call last):
I1203 09:05:39.647668 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 449, in send
I1203 09:05:39.647675 102 main.go:136] timeout=timeout
+ for gid in '${PROXY_GID}'
+ iptables -t nat -A ISTIO_OUTPUT -m owner --gid-owner 1337 -j RETURN
+ iptables -t nat -A ISTIO_OUTPUT -d 127.0.0.1/32 -j RETURN
[2021-12-06 04:06:47.945][76][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13,
[2021-12-06 04:06:49.014][76][warning][misc] [external/envoy/source/common/protobuf/utility.cc:174] Using deprecated option 'envoy.api.v2.Listener.use_original_dst' from file lds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
[2021-12-06 04:34:53.461][76][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13,
I1203 09:05:39.647678 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
I1203 09:05:39.647774 102 main.go:136] method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
I1203 09:05:39.647790 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/urllib3/util/retry.py", line 573, in increment
I1203 09:05:39.647808 102 main.go:136] raise MaxRetryError(_pool, url, error or ResponseError(cause))
I1203 09:05:39.647811 102 main.go:136] urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='data.mxnet.io', port=80): Max retries exceeded with url: /data/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7faa0d14ebe0>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
[2021-12-06 04:34:54.407][76][warning][misc] [external/envoy/source/common/protobuf/utility.cc:174] Using deprecated option 'envoy.api.v2.Listener.use_original_dst' from file lds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
I1203 09:05:39.647819 102 main.go:136]
I1203 09:05:39.647833 102 main.go:136] During handling of the above exception, another exception occurred:
I1203 09:05:39.647850 102 main.go:136]
I1203 09:05:39.647854 102 main.go:136] Traceback (most recent call last):
I1203 09:05:39.647861 102 main.go:136] File "/opt/mxnet-mnist/mnist.py", line 86, in <module>
+ '[' -n '' ']'
+ '[' '*' == '*' ']'
+ iptables -t nat -A ISTIO_OUTPUT -j ISTIO_REDIRECT
+ set +o nounset
I1203 09:05:39.647870 102 main.go:136] fit.fit(args, sym, get_mnist_iter)
+ '[' -n '' ']'
I1203 09:05:39.647891 102 main.go:136] File "/opt/mxnet-mnist/common/fit.py", line 185, in fit
+ ip6tables -F INPUT
+ ip6tables -A INPUT -m state --state ESTABLISHED -j ACCEPT
+ ip6tables -A INPUT -i lo -d ::1 -j ACCEPT
+ ip6tables -A INPUT -j REJECT
+ dump
+ iptables-save
# Generated by iptables-save v1.6.0 on Fri Dec 3 09:04:59 2021
*mangle
:PREROUTING ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
I1203 09:05:39.647896 102 main.go:136] (train, val) = data_loader(args, kv)
I1203 09:05:39.647903 102 main.go:136] File "/opt/mxnet-mnist/mnist.py", line 44, in get_mnist_iter
I1203 09:05:39.647907 102 main.go:136] mnist = mx.test_utils.get_mnist()
I1203 09:05:39.647914 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1907, in get_mnist
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
COMMIT
# Completed on Fri Dec 3 09:04:59 2021
# Generated by iptables-save v1.6.0 on Fri Dec 3 09:04:59 2021
*nat
:PREROUTING ACCEPT [0:0]
I1203 09:05:39.647919 102 main.go:136] path+'train-labels-idx1-ubyte.gz', path+'train-images-idx3-ubyte.gz')
I1203 09:05:39.647943 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1894, in read_data
I1203 09:05:39.647958 102 main.go:136] with gzip.open(mx.test_utils.download(label_url)) as flbl:
I1203 09:05:39.647962 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1812, in download
I1203 09:05:39.647970 102 main.go:136] raise e
I1203 09:05:39.647974 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1802, in download
I1203 09:05:39.647982 102 main.go:136] r = requests.get(url, stream=True)
I1203 09:05:39.647991 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 76, in get
I1203 09:05:39.648002 102 main.go:136] return request('get', url, params=params, **kwargs)
I1203 09:05:39.648008 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 61, in request
I1203 09:05:39.648015 102 main.go:136] return session.request(method=method, url=url, **kwargs)
I1203 09:05:39.648019 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 542, in request
I1203 09:05:39.648037 102 main.go:136] resp = self.send(prep, **send_kwargs)
I1203 09:05:39.648046 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 655, in send
I1203 09:05:39.648060 102 main.go:136] r = adapter.send(request, **kwargs)
I1203 09:05:39.648072 102 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 516, in send
I1203 09:05:39.648079 102 main.go:136] raise ConnectionError(e, request=request)
I1203 09:05:39.648111 102 main.go:136] requests.exceptions.ConnectionError: HTTPConnectionPool(host='data.mxnet.io', port=80): Max retries exceeded with url: /data/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7faa0d14ebe0>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:ISTIO_IN_REDIRECT - [0:0]
:ISTIO_OUTPUT - [0:0]
:ISTIO_REDIRECT - [0:0]
-A OUTPUT -p tcp -j ISTIO_OUTPUT
-A ISTIO_IN_REDIRECT -p tcp -j REDIRECT --to-ports 15001
-A ISTIO_OUTPUT ! -d 127.0.0.1/32 -o lo -j ISTIO_REDIRECT
-A ISTIO_OUTPUT -m owner --uid-owner 1337 -j RETURN
-A ISTIO_OUTPUT -m owner --gid-owner 1337 -j RETURN
-A ISTIO_OUTPUT -d 127.0.0.1/32 -j RETURN
-A ISTIO_OUTPUT -j ISTIO_REDIRECT
-A ISTIO_REDIRECT -p tcp -j REDIRECT --to-ports 15001
COMMIT
# Completed on Fri Dec 3 09:04:59 2021
+ ip6tables-save
# Generated by ip6tables-save v1.6.0 on Fri Dec 3 09:04:59 2021
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state ESTABLISHED -j ACCEPT
-A INPUT -d ::1/128 -i lo -j ACCEPT
-A INPUT -j REJECT --reject-with icmp6-port-unreachable
COMMIT
# Completed on Fri Dec 3 09:04:59 2021
ENVOY_PORT=
ISTIO_INBOUND_INTERCEPTION_MODE=
ISTIO_INBOUND_TPROXY_MARK=
ISTIO_INBOUND_TPROXY_ROUTE_TABLE=
ISTIO_INBOUND_PORTS=
ISTIO_LOCAL_EXCLUDE_PORTS=
ISTIO_SERVICE_CIDR=
ISTIO_SERVICE_EXCLUDE_CIDR=
I1203 09:05:39.628637 97 main.go:136] Traceback (most recent call last):
I1203 09:05:39.628660 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 449, in send
I1203 09:05:39.628669 97 main.go:136] timeout=timeout
I1203 09:05:39.628680 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/urllib3/connectionpool.py", line 756, in urlopen
I1203 09:05:39.628688 97 main.go:136] method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
I1203 09:05:39.628693 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/urllib3/util/retry.py", line 573, in increment
I1203 09:05:39.628700 97 main.go:136] raise MaxRetryError(_pool, url, error or ResponseError(cause))
I1203 09:05:39.628734 97 main.go:136] urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='data.mxnet.io', port=80): Max retries exceeded with url: /data/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20e23d4d68>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
I1203 09:05:39.628744 97 main.go:136]
I1203 09:05:39.628762 97 main.go:136] During handling of the above exception, another exception occurred:
I1203 09:05:39.628772 97 main.go:136]
I1203 09:05:39.628862 97 main.go:136] Traceback (most recent call last):
I1203 09:05:39.628873 97 main.go:136] File "/opt/mxnet-mnist/mnist.py", line 86, in <module>
I1203 09:05:39.628888 97 main.go:136] fit.fit(args, sym, get_mnist_iter)
I1203 09:05:39.628900 97 main.go:136] File "/opt/mxnet-mnist/common/fit.py", line 185, in fit
I1203 09:05:39.628917 97 main.go:136] (train, val) = data_loader(args, kv)
I1203 09:05:39.628927 97 main.go:136] File "/opt/mxnet-mnist/mnist.py", line 44, in get_mnist_iter
I1203 09:05:39.628955 97 main.go:136] mnist = mx.test_utils.get_mnist()
I1203 09:05:39.628971 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1907, in get_mnist
I1203 09:05:39.628991 97 main.go:136] path+'train-labels-idx1-ubyte.gz', path+'train-images-idx3-ubyte.gz')
I1203 09:05:39.628995 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1894, in read_data
I1203 09:05:39.629002 97 main.go:136] with gzip.open(mx.test_utils.download(label_url)) as flbl:
I1203 09:05:39.629006 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1812, in download
I1203 09:05:39.629014 97 main.go:136] raise e
I1203 09:05:39.629018 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/mxnet/test_utils.py", line 1802, in download
I1203 09:05:39.629025 97 main.go:136] r = requests.get(url, stream=True)
I1203 09:05:39.629029 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 76, in get
I1203 09:05:39.629038 97 main.go:136] return request('get', url, params=params, **kwargs)
I1203 09:05:39.629042 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/api.py", line 61, in request
I1203 09:05:39.629050 97 main.go:136] return session.request(method=method, url=url, **kwargs)
I1203 09:05:39.629053 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 542, in request
I1203 09:05:39.629061 97 main.go:136] resp = self.send(prep, **send_kwargs)
I1203 09:05:39.629065 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/sessions.py", line 655, in send
I1203 09:05:39.629072 97 main.go:136] r = adapter.send(request, **kwargs)
I1203 09:05:39.629076 97 main.go:136] File "/usr/local/lib/python3.5/dist-packages/requests/adapters.py", line 516, in send
I1203 09:05:39.629090 97 main.go:136] raise ConnectionError(e, request=request)
I1203 09:05:39.629094 97 main.go:136] requests.exceptions.ConnectionError: HTTPConnectionPool(host='data.mxnet.io', port=80): Max retries exceeded with url: /data/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f20e23d4d68>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
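Because `--all-containers` interleaves the Istio init script, the Envoy warnings, and the Python traceback, the actual failure is easy to miss. According to the Args shown by `kubectl describe`, training-container redirects its output to /var/log/katib/metrics.log, and in this dump every traceback line carries a `main.go:136]` prefix, which suggests the lines are relayed by metrics-logger-and-collector. The sketch below works under that assumption: it keeps only those lines and prints the final exception of each traceback. Both function names are hypothetical.

```shell
# Keep only lines carrying the "main.go:<line>]" prefix and strip that prefix.
# Assumption: these are the lines relayed from the training container's log file.
collector_lines() {
  grep -E 'main\.go:[0-9]+\]' | sed -E 's/^.*main\.go:[0-9]+\] ?//'
}

# From the stripped lines, keep only "<ExceptionClass>: message" lines,
# i.e. the last line of each Python traceback.
final_errors() {
  collector_lines | grep -E '^[A-Za-z0-9_.]+(Error|Exception): '
}

# Usage (reads the log stream from stdin):
#   kubectl logs pod/random-qwllpfh9-mgqtd -n moey920 --all-containers | final_errors
# Adding "-c metrics-logger-and-collector" instead of "--all-containers"
# limits the stream to that one container.
```

Applied to the dump above, this reduces the output to the MaxRetryError and ConnectionError lines for data.mxnet.io.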
[warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:86] gRPC config stream closed: 13
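Related error: Using deprecated option 'envoy.api.v2.Listener.use_original_dst' from file lds.proto. This configuration will be removed from Envoy soon. Please see https://www.envoyproxy.io/docs/envoy/latest/intro/deprecated for details.
Fix: A fix for this is already available in Istio 1.2, so the issue should be resolved by upgrading to Istio 1.2 or later. (link)
- However, the AWS Kubeflow distribution installs a specific Istio version as a dependency (the pod above runs istio/proxyv2:1.1.6), so I cannot predict how upgrading Istio alone would affect the applications. I will set this aside for now and first check whether there are other errors.
While reading the logs above, I found that the training code fails to download a file from the following URL.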
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='data.mxnet.io', port=80): Max retries exceeded with url: /data/mnist/train-labels-idx1-ubyte.gz (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f14ccd56940>: Failed to establish a new connection: [Errno 101] Network is unreachable',))
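Errno 101 (Network is unreachable) typically means the connection attempt had no usable route out, rather than the remote host refusing it. To tell "the host is down" apart from "this environment has no network path", a plain TCP probe can help. The sketch below assumes bash and the coreutils `timeout` command are available, and `can_reach` is a hypothetical helper name. It is only meaningful when run from a shell that shares the pod's network path; run from a laptop, it only shows whether the host itself is reachable.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port opens
# within the given number of seconds (default 5).
# $1 = host, $2 = port, $3 = timeout in seconds (optional)
can_reach() {
  host=$1
  port=$2
  secs=${3:-5}
  # bash's /dev/tcp pseudo-device opens a TCP connection on redirect.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Usage:
#   can_reach data.mxnet.io 80 && echo reachable || echo unreachable
```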
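So I edited the YAML to point to a newer version of the Docker image pulled from docker.io.
I then deleted the existing Katib objects and recreated them.
You can delete them from the Katib UI as well.
kubectl delete -f <filename>
also deletes all related objects, including experiments, trials, jobs, and pods.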
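The delete-then-recreate cycle can be wrapped in one step. This is a minimal sketch, assuming the manifest is applied with `kubectl apply`; the helper name `recreate_from_manifest` is hypothetical.

```shell
# Hypothetical helper: delete everything defined in a manifest, then create it again.
# $1 = path to the manifest file
recreate_from_manifest() {
  manifest=$1
  # --ignore-not-found keeps the delete step from failing when nothing exists yet.
  kubectl delete -f "$manifest" --ignore-not-found &&
    kubectl apply -f "$manifest"
}

# Usage:
#   recreate_from_manifest <filename>
```

Chaining with `&&` means the manifest is only re-applied once the delete has returned successfully.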
What is TFJob?
TFJob is a Kubernetes custom resource to run TensorFlow training jobs on Kubernetes. The Kubeflow implementation of TFJob is in training-operator.
Note: TFJob doesn't work in a user namespace by default because of Istio automatic sidecar injection. In order to get TFJob running, it needs annotation sidecar.istio.io/inject: "false" to disable it for TFJob pods.
A TFJob is a resource with a YAML representation like the one below (edit to use the container image and command for your own training code)
For TFJob, the manifest must include the annotation that sets sidecar.istio.io/inject to "false".
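A minimal sketch of where that annotation goes, assuming a pod template of the kind used by a batch Job. For a TFJob, the same metadata.annotations block would sit under each replica's template, for example spec.tfReplicaSpecs.Worker.template. The fragment and both helper names are illustrative and are not the author's actual file.

```shell
# Illustrative pod-template fragment: the annotation lives on the pod
# template's metadata, not on the top-level object's metadata.
# $1 = path to write the fragment to
write_example_fragment() {
  cat > "$1" <<'EOF'
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
EOF
}

# Hypothetical pre-apply check: succeed only if the manifest file
# contains the annotation with the quoted string value "false".
# $1 = path to the manifest file
has_sidecar_injection_disabled() {
  grep -Eq '^[[:space:]]*sidecar\.istio\.io/inject:[[:space:]]*"false"' "$1"
}

# Usage:
#   has_sidecar_injection_disabled <filename> || echo "annotation missing"
```

The value is quoted because annotation values must be strings; an unquoted false would be parsed as a YAML boolean.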
The job now runs normally, and the results can be viewed as a graph!