Reference
TFJob is a Kubernetes custom resource for running TensorFlow training jobs on k8s. Its implementation lives in training-operator.
training-operator provides Kubernetes custom resources that make it easy to run TensorFlow, PyTorch, MXNet, XGBoost, MPI and similar jobs on k8s, either distributed or non-distributed.
TFJob in particular was reportedly created for distributed TensorFlow training jobs.
TFJob and Istio sidecar injection
In a namespace with Istio automatic sidecar injection enabled, add the annotation
sidecar.istio.io/inject: "false"
to the PodTemplateSpec to disable sidecar injection for the TFJob pods (the example below shows where it goes).
TFJob example
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob
  namespace: your-user-namespace
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/your-project/your-image
            command:
              - python
              - -m
              - trainer.task
              - --batch_size=32
              - --training_steps=1000
            env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: "/etc/secrets/user-gcp-sa.json"
            volumeMounts:
            - name: sa
              mountPath: "/etc/secrets"
              readOnly: true
          volumes:
          - name: sa
            secret:
              secretName: user-gcp-sa
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - name: tensorflow
            image: gcr.io/your-project/your-image
            command:
              - python
              - -m
              - trainer.task
              - --batch_size=32
              - --training_steps=1000
            env:
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: "/etc/secrets/user-gcp-sa.json"
            volumeMounts:
            - name: sa
              mountPath: "/etc/secrets"
              readOnly: true
          volumes:
          - name: sa
            secret:
              secretName: user-gcp-sa
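The manifest above repeats the same annotation under every replica type, which is easy to forget when a new role is added. As a minimal sketch, assuming the manifest has already been loaded into a plain Python dict (for example with a YAML parser), the helper below adds the annotation to each replica's pod template. The function name disable_istio_injection is hypothetical and not part of any Kubeflow tooling.

```python
import copy

ISTIO_INJECT_ANNOTATION = "sidecar.istio.io/inject"


def disable_istio_injection(tfjob: dict) -> dict:
    """Return a copy of a TFJob manifest with sidecar injection disabled on every replica."""
    patched = copy.deepcopy(tfjob)  # leave the caller's manifest untouched
    replica_specs = patched.get("spec", {}).get("tfReplicaSpecs", {})
    for replica in replica_specs.values():
        # Each replica type (PS, Worker, ...) carries its own PodTemplateSpec.
        template = replica.setdefault("template", {})
        metadata = template.get("metadata") or {}
        annotations = metadata.get("annotations") or {}
        # The value must be the string "false", not a boolean.
        annotations[ISTIO_INJECT_ANNOTATION] = "false"
        metadata["annotations"] = annotations
        template["metadata"] = metadata
    return patched


if __name__ == "__main__":
    job = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "spec": {"tfReplicaSpecs": {"PS": {"template": {"spec": {}}},
                                    "Worker": {"template": {"spec": {}}}}},
    }
    for role, replica in disable_istio_injection(job)["spec"]["tfReplicaSpecs"].items():
        print(role, replica["template"]["metadata"]["annotations"])
```

Working on a deep copy keeps the original manifest reusable, and existing annotations on a template are preserved rather than overwritten.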
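Roles in a distributed TensorFlow job
TensorFlow defines the following roles for distributed training; all of them are optional.
Chief: orchestrates training and performs tasks such as checkpointing the model.
PS: parameter servers, which provide a distributed data store for the model parameters.
Worker: does the actual training work. In some setups worker 0 also acts as the chief.
Evaluator: computes evaluation metrics while the model is trained.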
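tfReplicaSpecs
The spec.tfReplicaSpecs field of a TFJob maps each role to a replica spec made up of three fields: replicas, template, and restartPolicy.
The exit codes below matter for restartPolicy, since the ExitCode policy decides whether to restart from the exit code of the tensorflow container:
0: the process completed normally and exited, so it is not restarted.
1: general errors
2: misuse of shell builtins
126: command invoked cannot execute
127: command not found
128: invalid argument to exit
139: container terminated by SIGSEGV (invalid memory reference)

If Kubeflow was installed in full, the TensorFlow operator is already installed. To install it manually, refer to the linked instructions.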
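Returning to the exit codes listed for restartPolicy above, the following is a rough sketch of how they could be grouped when deciding whether a container is worth restarting. Two things here are assumptions taken from the Kubeflow TFJob documentation rather than from these notes: that the non-zero codes listed above are treated as permanent errors, and that 130, 137, 138 and 143 are treated as retryable. Actual operator behavior may differ by version, and classify_exit_code is a hypothetical name.

```python
# Codes listed in these notes; treating them as permanent errors is an
# assumption based on the Kubeflow TFJob documentation.
PERMANENT_ERROR_CODES = {1, 2, 126, 127, 128, 139}

# Assumption (not in these notes): codes the Kubeflow TFJob documentation
# describes as retryable (SIGINT, SIGKILL, user-reserved SIGUSR1, SIGTERM).
RETRYABLE_ERROR_CODES = {130, 137, 138, 143}


def classify_exit_code(code: int) -> str:
    """Classify a container exit code for a restartPolicy: ExitCode style decision."""
    if code == 0:
        return "succeeded"   # finished normally, no restart
    if code in PERMANENT_ERROR_CODES:
        return "permanent"   # restarting would not help
    if code in RETRYABLE_ERROR_CODES:
        return "retryable"   # likely killed from outside, worth a restart
    return "undefined"       # no documented behavior for this code


if __name__ == "__main__":
    for code in (0, 1, 127, 137, 139, 200):
        print(code, classify_exit_code(code))
```

Keeping an explicit "undefined" bucket avoids silently treating unknown codes as either restartable or fatal.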
Verifying the installation
### check the CRD
$ kubectl get crd | grep -i tfjob
tfjobs.kubeflow.org 2023-02-13T11:36:21Z
### check the operator
$ kubectl get pods -n kubeflow | grep -i training-operator
training-operator-6f69d4b6cd-lxldc 1/1 Running 0 167m
Distributed MNIST example
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-mnist-with-summaries:latest
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
Modify and deploy
$ kubectl apply -f - <<EOF
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
  namespace: kubeflow-jbpark8
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-mnist-with-summaries:latest
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
EOF
Verify
$ kubectl -n kubeflow-jbpark8 get tfjob tfjob-simple -o yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"tfjob-simple","namespace":"kubeflow-jbpark8"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}},"spec":{"containers":[{"command":["python","/var/tf_mnist/mnist_with_summaries.py"],"image":"kubeflow/tf-mnist-with-summaries:latest","name":"tensorflow"}]}}}}}}
  creationTimestamp: "2023-04-07T01:31:21Z"
  generation: 1
  name: tfjob-simple
  namespace: kubeflow-jbpark8
  resourceVersion: "32185395"
  uid: 1cf87978-2d34-4314-bedd-21f71b09782f
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python
            - /var/tf_mnist/mnist_with_summaries.py
            image: kubeflow/tf-mnist-with-summaries:latest
            name: tensorflow
status:
  completionTime: "2023-04-07T01:32:29Z"
  conditions:
  - lastTransitionTime: "2023-04-07T01:31:22Z"
    lastUpdateTime: "2023-04-07T01:31:22Z"
    message: TFJob tfjob-simple is created.
    reason: TFJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2023-04-07T01:31:27Z"
    lastUpdateTime: "2023-04-07T01:31:27Z"
    message: TFJob kubeflow-jbpark8/tfjob-simple is running.
    reason: TFJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2023-04-07T01:32:29Z"
    lastUpdateTime: "2023-04-07T01:32:29Z"
    message: TFJob kubeflow-jbpark8/tfjob-simple successfully completed.
    reason: TFJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Worker:
      succeeded: 2
  startTime: "2023-04-07T01:31:27Z"
$ kubectl get tfjob
NAME STATE AGE
tfjob-simple Succeeded 4m47s
$ kubectl get po -l training.kubeflow.org/job-name=tfjob-simple -w
NAME READY STATUS RESTARTS AGE
tfjob-simple-worker-0 1/1 Running 0 38s
tfjob-simple-worker-1 1/1 Running 0 35s
tfjob-simple-worker-2 1/1 Running 0 33s
tfjob-simple-worker-0 0/1 Completed 0 73s
tfjob-simple-worker-0 0/1 Completed 0 75s
tfjob-simple-worker-1 0/1 Completed 0 73s
tfjob-simple-worker-2 1/1 Terminating 0 72s
tfjob-simple-worker-1 0/1 Terminating 0 74s
tfjob-simple-worker-2 0/1 Terminating 0 73s
tfjob-simple-worker-2 0/1 Terminating 0 73s
tfjob-simple-worker-2 0/1 Terminating 0 73s
tfjob-simple-worker-1 0/1 Terminating 0 75s
tfjob-simple-worker-1 0/1 Terminating 0 75s
tfjob-simple-worker-1 0/1 Terminating 0 75s
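The pods are terminated once the job finishes because of the CleanPodPolicy setting; the default value, Running, was in effect here.
Values accepted by CleanPodPolicy:
Running: once the job finishes, delete the pods that are still running (for example parameter servers); pods that already completed, such as the chief, are kept so their logs stay available.
All: delete every pod once the job finishes.
None: delete no pods, even after the job finishes.
Delete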
$ kubectl -n kubeflow-jbpark8 delete tfjob tfjob-simple
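The three CleanPodPolicy values described above can be summarized as a small selection rule over the pods that exist when the job finishes. This is only an illustrative sketch, not the operator's code: pods_to_delete is a hypothetical function, pod phases are passed in as plain strings, and the reading of Running as "delete only pods that are still running" is an assumption based on the Kubeflow documentation.

```python
from typing import Dict, List


def pods_to_delete(policy: str, pod_phases: Dict[str, str]) -> List[str]:
    """Pick the pods a finished job would clean up under a given CleanPodPolicy.

    pod_phases maps pod name to its phase, e.g. "Running" or "Succeeded".
    """
    if policy == "None":
        return []  # keep everything, including logs of every pod
    if policy == "All":
        return sorted(pod_phases)  # remove every pod of the job
    if policy == "Running":
        # Assumption: only pods still running are removed; completed pods stay.
        return sorted(name for name, phase in pod_phases.items() if phase == "Running")
    raise ValueError(f"unknown CleanPodPolicy: {policy!r}")


if __name__ == "__main__":
    phases = {
        "tfjob-simple-worker-0": "Succeeded",
        "tfjob-simple-worker-1": "Running",
        "tfjob-simple-ps-0": "Running",
    }
    for policy in ("Running", "All", "None"):
        print(policy, pods_to_delete(policy, phases))
```

Raising on an unknown value mirrors the fact that only these three values are listed as valid.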
Values that can be changed in a TFJob YAML file
Prerequisite
To use GPUs, the cluster must recognize the nvidia.com/gpu resource type.
Example
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                cpu: "1"
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=gpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                nvidia.com/gpu: 1
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
(Not actually run, because no GPU was available.)
Checking the job status
$ kubectl -n kubeflow get -o yaml tfjobs tfjob-simple
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  creationTimestamp: "2021-09-06T11:48:09Z"
  generation: 1
  name: tfjob-simple
  namespace: kubeflow
  resourceVersion: "5764004"
  selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/tfjobs/tfjob-simple
  uid: 3a67a9a9-cb89-4c1f-a189-f49f0b581e29
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - command:
            - python
            - /var/tf_mnist/mnist_with_summaries.py
            image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
            name: tensorflow
status:
  completionTime: "2021-09-06T11:49:30Z"
  conditions:
  - lastTransitionTime: "2021-09-06T11:48:09Z"
    lastUpdateTime: "2021-09-06T11:48:09Z"
    message: TFJob tfjob-simple is created.
    reason: TFJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2021-09-06T11:48:12Z"
    lastUpdateTime: "2021-09-06T11:48:12Z"
    message: TFJob kubeflow/tfjob-simple is running.
    reason: TFJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2021-09-06T11:49:30Z"
    lastUpdateTime: "2021-09-06T11:49:30Z"
    message: TFJob kubeflow/tfjob-simple successfully completed.
    reason: TFJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Worker:
      succeeded: 2
  startTime: "2021-09-06T11:48:10Z"
Condition
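A TFJob has a TFJobStatus, which contains an array of TFJobConditions that the TFJob has or has not passed. Each element of that array can have up to six fields: type, status, reason, message, lastUpdateTime, and lastTransitionTime (all six appear in the status output above).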
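To make the conditions array easier to read, here is a minimal sketch that derives a single state from it by taking the most recently transitioned condition whose status is "True". This is one plausible way to interpret the array, not necessarily how the operator or kubectl computes the STATE column; current_state is a hypothetical helper, and the input is assumed to be the status.conditions list already parsed into Python dicts.

```python
from typing import List, Optional


def current_state(conditions: List[dict]) -> Optional[str]:
    """Return the type of the latest condition whose status is "True", or None."""
    active = [c for c in conditions if c.get("status") == "True"]
    if not active:
        return None
    # Timestamps in this output share one UTC format ("2023-04-07T01:32:29Z"),
    # so comparing them as plain strings orders them chronologically.
    latest = max(active, key=lambda c: c.get("lastTransitionTime", ""))
    return latest["type"]


if __name__ == "__main__":
    conditions = [
        {"type": "Created", "status": "True", "lastTransitionTime": "2023-04-07T01:31:22Z"},
        {"type": "Running", "status": "False", "lastTransitionTime": "2023-04-07T01:31:27Z"},
        {"type": "Succeeded", "status": "True", "lastTransitionTime": "2023-04-07T01:32:29Z"},
    ]
    print(current_state(conditions))
```

Filtering on status first matters: in the output above the Running condition is still present after completion, but with status "False".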
Criteria for job success or failure