[Kubeflow Docs] What is TFJob?

jb · April 7, 2023

Reference

[kubeflow docs] TensorFlow Training (TFJob)

What is TFJob?

TFJob is a custom resource for running TensorFlow training jobs on k8s. It is implemented by the training-operator.

The training-operator is a k8s operator that makes it easy to run TensorFlow, PyTorch, MXNet, XGBoost, MPI, and similar jobs on k8s, distributed or non-distributed, through custom resources.

TFJob in particular is said to have been created for distributed TensorFlow training jobs.

TFJob and Istio sidecar injection

By default, TFJob does not work in a user namespace because of Istio's automatic sidecar injection. To run a TFJob there, add the annotation sidecar.istio.io/inject: "false" to the PodTemplateSpec so that sidecar injection is disabled for the TFJob pods. (See the example below.)

TFJob example

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  generateName: tfjob
  namespace: your-user-namespace
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
              env:
                - name: GOOGLE_APPLICATION_CREDENTIALS
                  value: "/etc/secrets/user-gcp-sa.json"
              volumeMounts:
                - name: sa
                  mountPath: "/etc/secrets"
                  readOnly: true
          volumes:
            - name: sa
              secret:
                secretName: user-gcp-sa
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image
              command:
                - python
                - -m
                - trainer.task
                - --batch_size=32
                - --training_steps=1000
              env:
                - name: GOOGLE_APPLICATION_CREDENTIALS
                  value: "/etc/secrets/user-gcp-sa.json"
              volumeMounts:
                - name: sa
                  mountPath: "/etc/secrets"
                  readOnly: true
          volumes:
            - name: sa
              secret:
                secretName: user-gcp-sa

A TFJob is expressed as YAML.
Typically you take the YAML above and change the container image and the command that runs the training code.
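The annotation has to be present on every replica type (PS, Worker, and so on), which is easy to miss when editing YAML by hand. The following is a minimal Python sketch, assuming the manifest has already been loaded into a dict; the helper name disable_istio_injection and the sample manifest are illustrative, not part of Kubeflow.

```python
import copy

ISTIO_INJECT_KEY = "sidecar.istio.io/inject"


def disable_istio_injection(tfjob: dict) -> dict:
    """Return a copy of the manifest with sidecar injection disabled on every replica type."""
    patched = copy.deepcopy(tfjob)
    for replica_spec in patched["spec"]["tfReplicaSpecs"].values():
        metadata = replica_spec["template"].setdefault("metadata", {})
        annotations = metadata.setdefault("annotations", {})
        # Annotation values are strings, so this must be "false", not a boolean.
        annotations[ISTIO_INJECT_KEY] = "false"
    return patched


# Illustrative manifest: a single Worker replica type without any annotations.
tfjob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "tfjob-simple", "namespace": "your-user-namespace"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": "kubeflow/tf-mnist-with-summaries:latest",
                                "command": ["python", "/var/tf_mnist/mnist_with_summaries.py"],
                            }
                        ]
                    }
                },
            }
        }
    },
}

patched = disable_istio_injection(tfjob)
print(patched["spec"]["tfReplicaSpecs"]["Worker"]["template"]["metadata"])
```

The original dict is left untouched, so the same base manifest can be reused for namespaces where Istio injection is not an issue.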

Roles in a distributed TensorFlow job

TensorFlow uses the following roles for distributed training, and all of them are optional.

Chief: orchestrates training and performs tasks such as checkpointing the model.
PS: parameter servers, which provide a distributed data store for the model parameters.
Worker: does the actual model training. In some cases worker-0 also acts as the chief.
Evaluator: computes evaluation metrics while the model is being trained.

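tfReplicaSpecs

In a TFJob, the spec.tfReplicaSpecs field maps each of the roles above to a replica spec made up of the following three fields.

replicas : the number of replicas of this type to spawn for the TFJob.
template : a PodTemplateSpec describing the pod to create for each replica.
restartPolicy : defines whether a pod is restarted when it exits.
- Always : the pod is always restarted when it exits. This suits parameter servers, which have no logic to exit on purpose.
- OnFailure : the pod is restarted only if it exits because of a failure (a non-zero exit code).
- ExitCode : restarts depend on the exit code of the tensorflow container.
  - 0: the process completed successfully, so it is not restarted.
  - Permanent errors, for which the container is not restarted:
    - 1: general errors
    - 2: misuse of shell builtins
    - 126: command invoked cannot execute
    - 127: command not found
    - 128: invalid argument to exit
    - 139: container terminated by SIGSEGV (invalid memory reference)
- Never : terminated pods are never restarted. This is rarely used, because pods are likely to be terminated for many different reasons and the job could then not recover.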
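The four policies can be summarised as a small decision function. This is only a sketch of the rules described above, not the operator's actual code. The set of retryable exit codes (130, 137, 138, 143) is not in the list above; it is an assumption based on my reading of the Kubeflow TFJob documentation, so treat it as illustrative.

```python
# Exit codes from the list above: permanent errors, never restarted under ExitCode.
PERMANENT_EXIT_CODES = frozenset({1, 2, 126, 127, 128, 139})

# Assumption (not in the list above): codes treated as retryable,
# i.e. SIGINT (130), SIGKILL (137), SIGUSR1 (138), SIGTERM (143).
RETRYABLE_EXIT_CODES = frozenset({130, 137, 138, 143})


def should_restart(restart_policy: str, exit_code: int) -> bool:
    """Decide whether a replica pod that exited with exit_code should be restarted."""
    if restart_policy == "Always":
        return True
    if restart_policy == "Never":
        return False
    if restart_policy == "OnFailure":
        # Any non-zero exit code counts as a failure.
        return exit_code != 0
    if restart_policy == "ExitCode":
        if exit_code == 0 or exit_code in PERMANENT_EXIT_CODES:
            return False
        # Codes in neither set are treated as not retryable in this sketch.
        return exit_code in RETRYABLE_EXIT_CODES
    raise ValueError(f"unknown restartPolicy: {restart_policy}")


for policy, code in [("Always", 0), ("OnFailure", 1), ("ExitCode", 1), ("ExitCode", 137), ("Never", 1)]:
    print(policy, code, should_restart(policy, code))
```

With ExitCode, a training script can signal "please retry me" by exiting with a retryable code and "give up" by exiting with a permanent one, which OnFailure cannot express.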

Installing and verifying the TensorFlow Operator

If you did a full Kubeflow installation, the TensorFlow operator is already installed. To install it manually, refer to this link.

Verifying the installation

### Check the CRD
$ kubectl get crd | grep -i tfjob
tfjobs.kubeflow.org                                            2023-02-13T11:36:21Z

### Check the operator
$ kubectl get pods -n kubeflow | grep -i training-operator
training-operator-6f69d4b6cd-lxldc                       1/1     Running   0              167m

MNIST example

distributed MNIST example (link)

apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-mnist-with-summaries:latest
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"

Modify and deploy

$ kubectl apply -f - <<EOF
apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: tfjob-simple
  namespace: kubeflow-jbpark8
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: kubeflow/tf-mnist-with-summaries:latest
              command:
                - "python"
                - "/var/tf_mnist/mnist_with_summaries.py"
EOF

The namespace is changed to that of the actual user used for testing.

Verify

$ kubectl -n kubeflow-jbpark8 get tfjob tfjob-simple -o yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"tfjob-simple","namespace":"kubeflow-jbpark8"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}},"spec":{"containers":[{"command":["python","/var/tf_mnist/mnist_with_summaries.py"],"image":"kubeflow/tf-mnist-with-summaries:latest","name":"tensorflow"}]}}}}}}
  creationTimestamp: "2023-04-07T01:31:21Z"
  generation: 1
  name: tfjob-simple
  namespace: kubeflow-jbpark8
  resourceVersion: "32185395"
  uid: 1cf87978-2d34-4314-bedd-21f71b09782f
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - command:
            - python
            - /var/tf_mnist/mnist_with_summaries.py
            image: kubeflow/tf-mnist-with-summaries:latest
            name: tensorflow
status:
  completionTime: "2023-04-07T01:32:29Z"
  conditions:
  - lastTransitionTime: "2023-04-07T01:31:22Z"
    lastUpdateTime: "2023-04-07T01:31:22Z"
    message: TFJob tfjob-simple is created.
    reason: TFJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2023-04-07T01:31:27Z"
    lastUpdateTime: "2023-04-07T01:31:27Z"
    message: TFJob kubeflow-jbpark8/tfjob-simple is running.
    reason: TFJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2023-04-07T01:32:29Z"
    lastUpdateTime: "2023-04-07T01:32:29Z"
    message: TFJob kubeflow-jbpark8/tfjob-simple successfully completed.
    reason: TFJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Worker:
      succeeded: 2
  startTime: "2023-04-07T01:31:27Z"

$ kubectl get tfjob
NAME           STATE       AGE
tfjob-simple   Succeeded   4m47s

$ kubectl get po -l training.kubeflow.org/job-name=tfjob-simple -w
NAME                    READY   STATUS    RESTARTS   AGE
tfjob-simple-worker-0   1/1     Running   0          38s
tfjob-simple-worker-1   1/1     Running   0          35s
tfjob-simple-worker-2   1/1     Running   0          33s
tfjob-simple-worker-0   0/1     Completed   0          73s
tfjob-simple-worker-0   0/1     Completed   0          75s
tfjob-simple-worker-1   0/1     Completed   0          73s
tfjob-simple-worker-2   1/1     Terminating   0          72s
tfjob-simple-worker-1   0/1     Terminating   0          74s
tfjob-simple-worker-2   0/1     Terminating   0          73s
tfjob-simple-worker-2   0/1     Terminating   0          73s
tfjob-simple-worker-2   0/1     Terminating   0          73s
tfjob-simple-worker-1   0/1     Terminating   0          75s
tfjob-simple-worker-1   0/1     Terminating   0          75s
tfjob-simple-worker-1   0/1     Terminating   0          75s

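When training finishes and the pods reach Completed, the workers are all terminated and only the chief stays in the Completed state. This comes from the CleanPodPolicy setting; the default value, Running, is in effect here.
Values available for CleanPodPolicy
- Running : when the job finishes, delete only the pods that are still running (for example parameter servers); pods that already completed, such as the chief, are kept so their logs remain available.
- All : delete all pods, including completed ones, when the job finishes.
- None : do not delete any pods after the job finishes.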
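To keep every pod around for log inspection, the policy can be set explicitly in the manifest. The sketch below assumes the field lives at spec.runPolicy.cleanPodPolicy, as I understand the kubeflow.org/v1 API; verify the path against the CRD installed in your cluster. The helper name with_clean_pod_policy is made up for this example.

```python
import json

VALID_CLEAN_POD_POLICIES = ("Running", "All", "None")


def with_clean_pod_policy(tfjob: dict, policy: str) -> dict:
    """Return a copy of the manifest with spec.runPolicy.cleanPodPolicy set."""
    if policy not in VALID_CLEAN_POD_POLICIES:
        raise ValueError(f"cleanPodPolicy must be one of {VALID_CLEAN_POD_POLICIES}")
    spec = dict(tfjob.get("spec", {}))
    run_policy = dict(spec.get("runPolicy", {}))
    # Assumed field path: spec.runPolicy.cleanPodPolicy
    run_policy["cleanPodPolicy"] = policy
    spec["runPolicy"] = run_policy
    return {**tfjob, "spec": spec}


job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "tfjob-simple", "namespace": "kubeflow"},
    "spec": {
        "tfReplicaSpecs": {
            "Worker": {
                "replicas": 2,
                "restartPolicy": "OnFailure",
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "tensorflow",
                                "image": "kubeflow/tf-mnist-with-summaries:latest",
                                "command": ["python", "/var/tf_mnist/mnist_with_summaries.py"],
                            }
                        ]
                    }
                },
            }
        }
    },
}

kept = with_clean_pod_policy(job, "None")
# kubectl apply -f accepts JSON as well as YAML, so this output can be piped to it.
print(json.dumps(kept, indent=2))
```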

Delete

$ kubectl -n kubeflow-jbpark8 delete tfjob tfjob-simple

Customizing a TFJob

Values you can change in the TFJob YAML file

image name
number of replicas
resource requests and limits
environment variables
attaching a PV
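As a rough illustration of where each of these values sits in the manifest, here is a Python sketch that patches a TFJob dict. The helper customize_replica, the image registry.example.com/mnist:v2, the PVC name mnist-data, and the volume name data are all hypothetical; the sketch also assumes the first container is the training container and that the PV is attached through an existing PersistentVolumeClaim.

```python
import copy


def customize_replica(tfjob, replica_type, *, image=None, replicas=None,
                      resources=None, env=None, pvc_name=None, mount_path=None):
    """Return a copy of the manifest with the given replica type customized."""
    patched = copy.deepcopy(tfjob)
    replica = patched["spec"]["tfReplicaSpecs"][replica_type]
    pod_spec = replica["template"]["spec"]
    container = pod_spec["containers"][0]  # assumed to be the training container

    if replicas is not None:
        replica["replicas"] = replicas
    if image is not None:
        container["image"] = image
    if resources is not None:
        container["resources"] = resources  # {"requests": {...}, "limits": {...}}
    if env:
        container.setdefault("env", []).extend(
            {"name": name, "value": value} for name, value in env.items()
        )
    if pvc_name and mount_path:
        # A PV is attached through a PVC: declare the volume, then mount it.
        pod_spec.setdefault("volumes", []).append(
            {"name": "data", "persistentVolumeClaim": {"claimName": pvc_name}}
        )
        container.setdefault("volumeMounts", []).append(
            {"name": "data", "mountPath": mount_path}
        )
    return patched


base = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "tfjob-simple"},
    "spec": {"tfReplicaSpecs": {"Worker": {
        "replicas": 2,
        "restartPolicy": "OnFailure",
        "template": {"spec": {"containers": [{
            "name": "tensorflow",
            "image": "kubeflow/tf-mnist-with-summaries:latest",
            "command": ["python", "/var/tf_mnist/mnist_with_summaries.py"],
        }]}},
    }}},
}

custom = customize_replica(
    base, "Worker",
    image="registry.example.com/mnist:v2",  # hypothetical image
    replicas=3,
    resources={"requests": {"cpu": "500m", "memory": "1Gi"},
               "limits": {"cpu": "1", "memory": "2Gi"}},
    env={"BATCH_SIZE": "32"},
    pvc_name="mnist-data",  # hypothetical PVC
    mount_path="/data",
)
print(custom["spec"]["tfReplicaSpecs"]["Worker"]["template"]["spec"])
```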

Using GPUs

Prerequisites

First of all, the nodes must have GPUs.
The k8s cluster must recognize the nvidia.com/gpu resource type.
GPU drivers must be installed on the cluster.
For EKS, refer to the document "Amazon EKS optimized accelerated Amazon Linux AMIs".
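One way to check the second prerequisite is to look at each node's allocatable resources. The sketch below works on the parsed output of kubectl get nodes -o json; the node names and counts in the sample are invented, and the function name gpu_nodes is just for illustration.

```python
def gpu_nodes(node_list: dict, resource: str = "nvidia.com/gpu") -> dict:
    """Map node name to allocatable GPU count for nodes that advertise the resource."""
    result = {}
    for node in node_list.get("items", []):
        allocatable = node.get("status", {}).get("allocatable", {})
        # Quantities are reported as strings, e.g. "1".
        count = int(allocatable.get(resource, "0"))
        if count > 0:
            result[node["metadata"]["name"]] = count
    return result


# Invented sample shaped like the output of `kubectl get nodes -o json`.
sample = {
    "items": [
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "4", "memory": "16Gi"}}},
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "8", "memory": "64Gi", "nvidia.com/gpu": "1"}}},
    ]
}

print(gpu_nodes(sample))
```

If the result is empty for a real cluster, a worker requesting nvidia.com/gpu in its limits would be expected to stay Pending, since no node can satisfy the request.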

Example

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
            - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=cpu
                - --data_format=NHWC
              image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources:
                limits:
                  cpu: "1"
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
            - args:
                - python
                - tf_cnn_benchmarks.py
                - --batch_size=32
                - --model=resnet50
                - --variable_update=parameter_server
                - --flush_stdout=true
                - --num_gpus=1
                - --local_parameter_device=cpu
                - --device=gpu
                - --data_format=NHWC
              image: gcr.io/kubeflow/tf-benchmarks-gpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              ports:
                - containerPort: 2222
                  name: tfjob-port
              resources:
                limits:
                  nvidia.com/gpu: 1
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure

(I could not actually try this because I have no GPU.)

See the TensorFlow documentation on using GPUs.

Monitoring a job

Checking the job status

$ kubectl -n kubeflow get -o yaml tfjobs tfjob-simple

apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  creationTimestamp: "2021-09-06T11:48:09Z"
  generation: 1
  name: tfjob-simple
  namespace: kubeflow
  resourceVersion: "5764004"
  selfLink: /apis/kubeflow.org/v1/namespaces/kubeflow/tfjobs/tfjob-simple
  uid: 3a67a9a9-cb89-4c1f-a189-f49f0b581e29
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - command:
                - python
                - /var/tf_mnist/mnist_with_summaries.py
              image: gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0
              name: tensorflow
status:
  completionTime: "2021-09-06T11:49:30Z"
  conditions:
    - lastTransitionTime: "2021-09-06T11:48:09Z"
      lastUpdateTime: "2021-09-06T11:48:09Z"
      message: TFJob tfjob-simple is created.
      reason: TFJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2021-09-06T11:48:12Z"
      lastUpdateTime: "2021-09-06T11:48:12Z"
      message: TFJob kubeflow/tfjob-simple is running.
      reason: TFJobRunning
      status: "False"
      type: Running
    - lastTransitionTime: "2021-09-06T11:49:30Z"
      lastUpdateTime: "2021-09-06T11:49:30Z"
      message: TFJob kubeflow/tfjob-simple successfully completed.
      reason: TFJobSucceeded
      status: "True"
      type: Succeeded
  replicaStatuses:
    Worker:
      succeeded: 2
  startTime: "2021-09-06T11:48:10Z"

Condition

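A TFJob has a TFJobStatus, which holds an array of TFJobConditions that the TFJob has or has not passed through. Each element of the TFJobCondition array has six possible fields.

lastTransitionTime : the last time the condition transitioned from one status to another.
lastUpdateTime : the last time this condition was updated.
message : human-readable details about the transition.
reason :
- TFJobCreated : the TFJob has been accepted by the system, but its pods/services have not been started yet.
- TFJobRunning : the related pods/services have been scheduled and started successfully, and the job is running.
- TFJobRestarting : the related pods/services ran into a problem and are being restarted.
- TFJobSucceeded : the job completed successfully.
- TFJobFailed : the job failed.
status : True, False, or Unknown
type : follows the reason, without the TFJob prefix (for example, Created for TFJobCreated).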
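When reading the status programmatically, a simple approach is to take the most recently transitioned condition whose status is "True". The sketch below applies that rule to the conditions shown in the output above; it is my own heuristic, not necessarily how kubectl computes its STATE column.

```python
def current_state(status: dict) -> str:
    """Return the type of the latest condition whose status is "True"."""
    active = [c for c in status.get("conditions", []) if c.get("status") == "True"]
    if not active:
        return "Unknown"
    # Timestamps are RFC 3339 strings in UTC, so string order matches time order.
    latest = max(active, key=lambda c: c["lastTransitionTime"])
    return latest["type"]


# Conditions copied from the job status output above.
status = {
    "conditions": [
        {"lastTransitionTime": "2021-09-06T11:48:09Z", "reason": "TFJobCreated",
         "status": "True", "type": "Created"},
        {"lastTransitionTime": "2021-09-06T11:48:12Z", "reason": "TFJobRunning",
         "status": "False", "type": "Running"},
        {"lastTransitionTime": "2021-09-06T11:49:30Z", "reason": "TFJobSucceeded",
         "status": "True", "type": "Succeeded"},
    ]
}

print(current_state(status))
```

Note how Running has status "False" once the job has finished, which is why filtering on status matters more than the position in the array.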

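Criteria for job success or failure

If the job has a chief, the outcome is determined by the status of the chief. If there is no chief, it is determined by the status of the workers.
Whichever one decides, the job succeeds if its exit code is 0.
If the exit code is not 0, the restartPolicy of the replica is applied.
If the restartPolicy allows restarts, the process is restarted. If it does not, the job is marked as failed.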
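These rules can be written down as a small function. This is a sketch of the description above rather than the operator's implementation. It assumes that worker-0 is the deciding worker when there is no chief, and that the ExitCode policy retries only on codes 130, 137, 138, and 143; both are assumptions on my part. The replica names chief-0 and worker-0 are illustrative keys.

```python
def job_outcome(exit_codes: dict, restart_policy: str,
                retryable_codes=frozenset({130, 137, 138, 143})) -> str:
    """Return "Succeeded", "Restarting", or "Failed" for a finished deciding replica.

    exit_codes maps replica names such as "chief-0" or "worker-0" to exit codes.
    """
    # The chief decides if present; otherwise fall back to worker-0 (assumption).
    deciding = "chief-0" if "chief-0" in exit_codes else "worker-0"
    code = exit_codes[deciding]

    if code == 0:
        return "Succeeded"
    if restart_policy in ("Always", "OnFailure"):
        return "Restarting"
    if restart_policy == "ExitCode" and code in retryable_codes:
        return "Restarting"
    # Never, or ExitCode with a non-retryable code: permanent failure.
    return "Failed"


print(job_outcome({"chief-0": 0, "worker-0": 1}, "Never"))
print(job_outcome({"worker-0": 1}, "OnFailure"))
print(job_outcome({"worker-0": 1}, "Never"))
```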
