Kubernetes Automated Placement Pattern

Jeongmin Yeo (Ethan) · March 27, 2021

The Automated Placement Pattern

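Automated Placement is a core function of the Kubernetes scheduler: it assigns new Pods to nodes that satisfy the containers' resource requests and honor the scheduling policies.

This pattern describes the principles of the Kubernetes scheduling algorithm and how placement decisions can be influenced from the outside.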

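Problem

A reasonably sized microservices-based system consists of tens or even hundreds of isolated processes.

Pods provide a good abstraction for packaging and deployment, but they do not solve the problem of placing these processes on suitable nodes.

As the number of microservices grows, assigning and placing Pods on nodes becomes both harder and more important.

The resources available in a cluster vary over time, as the cluster is scaled out or in and as already placed containers consume resources. Container placement affects not only availability and performance but also capacity.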

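Solution

In Kubernetes, Pods are assigned to nodes by the scheduler.

The scheduler is highly configurable and still evolving. This section covers the main scheduling control mechanisms and the features that influence Pod placement.

At a high level, the main job of the Kubernetes scheduler is to retrieve each newly created Pod definition from the API server and assign the Pod to a node.

Whether it is the initial placement of an application, a scale-up, or moving an application from an unhealthy node to a healthy one, the scheduler finds a suitable node in every case.

It makes the placement by taking into account runtime dependencies, resource requirements, and guiding policies for high availability.

For the scheduler to do its job correctly and to allow declarative placement, it needs nodes with sufficient available capacity and containers with declared resource profiles and guiding policies. Let's look at each of these.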

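Available Node Resources

First, to run new Pods, a Kubernetes cluster needs nodes with enough resources.

Every node has capacity available for running Pods, and the scheduler places a Pod on a node only after checking that the sum of the resources requested by the Pods on it stays within the node's allocatable capacity.

In Kubernetes, the capacity that can be allocated to Pods follows this formula:

Allocatable [capacity for application Pods] = Node Capacity [available capacity on a node] - Kube-Reserved [Kubernetes daemons such as the kubelet and the container runtime] - System-Reserved [OS system daemons such as sshd and udev]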
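The formula can be worked through with concrete numbers. The following is a minimal Python sketch, not how the kubelet or the scheduler is actually implemented: the `Resources` type, the units (millicores and MiB), and all of the figures are assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Resources:
    """CPU in millicores and memory in MiB (units chosen for this sketch)."""
    cpu_millicores: int
    memory_mib: int

    def __sub__(self, other: "Resources") -> "Resources":
        return Resources(self.cpu_millicores - other.cpu_millicores,
                         self.memory_mib - other.memory_mib)


def allocatable(capacity: Resources, kube_reserved: Resources,
                system_reserved: Resources) -> Resources:
    """Allocatable = Node Capacity - Kube-Reserved - System-Reserved."""
    return capacity - kube_reserved - system_reserved


def fits(new_pod: Resources, placed_pods: Iterable[Resources],
         node_allocatable: Resources) -> bool:
    """True if the requests of all Pods, including the new one, stay within allocatable."""
    placed = list(placed_pods)
    cpu = new_pod.cpu_millicores + sum(p.cpu_millicores for p in placed)
    memory = new_pod.memory_mib + sum(p.memory_mib for p in placed)
    return (cpu <= node_allocatable.cpu_millicores
            and memory <= node_allocatable.memory_mib)


# Hypothetical node with 4 CPUs and 16 GiB of memory; the reservations are made up.
node_allocatable = allocatable(
    capacity=Resources(4000, 16384),
    kube_reserved=Resources(100, 512),
    system_reserved=Resources(100, 256),
)
print(node_allocatable)
print(fits(Resources(500, 1024), [Resources(3000, 8192)], node_allocatable))
```

The point of the sketch is that the fit check compares requests against the allocatable amount, not against the raw node capacity.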

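If resources are not reserved for the OS and system daemons, Pods can end up competing with them for those resources. This can lead to resource starvation on the node, so the reservations must be reflected in the node capacity calculation.

A temporary workaround for resource usage that Kubernetes does not track is to run a placeholder Pod that does nothing.

The placeholder only declares CPU and memory requests that match the resource usage of the containers not managed by Kubernetes. Such a Pod helps the scheduler build a more accurate resource model of the node.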

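Container Resource Demands

Another requirement for efficient Pod placement is that containers define their runtime dependencies and resource demands.

Containers declare a resource profile with requests and limits, along with dependencies such as storage, so that the scheduler can take them into account and place Pods where they can run without affecting one another.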

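Placement Policies

The last piece is having the right filtering and priority policies for the specific needs of your applications.

For most use cases, scheduling with the default priority policies is sufficient.

As the following example shows, the default scheduler policies can also be overridden with a different set when the scheduler is started.

Scheduler policies and custom schedulers can only be defined by an administrator as part of the cluster configuration.

Regular users can only refer to predefined schedulers.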

{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsHostPorts"},
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"},
    {"name": "NoVolumeZoneConflict"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 2},
    {"name": "BalancedResourceAllocation", "weight": 1},
    {"name": "ServiceSpreadingPriority", "weight": 2},
    {"name": "EqualPriority", "weight": 1}
  ]
}

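A scheduling policy has two parts: predicates and priorities.

Predicates are rules for filtering out unqualified nodes. For example, PodFitsHostPorts ensures that a Pod is placed only on nodes where the host ports it requests are still free.

Priorities are rules for ranking the nodes that remain. For example, LeastRequestedPriority gives a higher rank to nodes with more free resources.

A wide range of priorities and predicates is available, as listed below.

Priorities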

SelectorSpreadPriority: Spreads Pods across hosts, considering Pods that belong to the same Service, StatefulSet or ReplicaSet.
InterPodAffinityPriority: Implements preferred inter-pod affinity and anti-affinity.
LeastRequestedPriority: Favors nodes with fewer requested resources. In other words, the more Pods that are placed on a Node, and the more resources those Pods use, the lower the ranking this policy will give.
MostRequestedPriority: Favors nodes with most requested resources. This policy will fit the scheduled Pods onto the smallest number of Nodes needed to run your overall set of workloads.
RequestedToCapacityRatioPriority: Creates a requestedToCapacity based ResourceAllocationPriority using default resource scoring function shape.
BalancedResourceAllocation: Favors nodes with balanced resource usage.
NodePreferAvoidPodsPriority: Prioritizes nodes according to the node annotation scheduler.alpha.kubernetes.io/preferAvoidPods. You can use this to hint that two different Pods shouldn't run on the same Node.
NodeAffinityPriority: Prioritizes nodes according to node affinity scheduling preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution. You can read more about this in Assigning Pods to Nodes.
TaintTolerationPriority: Prepares the priority list for all the nodes, based on the number of intolerable taints on the node. This policy adjusts a node's rank taking that list into account.
ImageLocalityPriority: Favors nodes that already have the container images for that Pod cached locally.
ServiceSpreadingPriority: For a given Service, this policy aims to make sure that the Pods for the Service run on different nodes. It favours scheduling onto nodes that don't have Pods for the service already assigned there. The overall outcome is that the Service becomes more resilient to a single Node failure.
EqualPriority: Gives an equal weight of one to all nodes.
EvenPodsSpreadPriority: Implements preferred pod topology spread constraints.

Predicates

PodFitsHostPorts: Checks if a Node has free ports (the network protocol kind) for the Pod ports the Pod is requesting.
PodFitsHost: Checks if a Pod specifies a specific Node by its hostname.
PodFitsResources: Checks if the Node has free resources (e.g., CPU and memory) to meet the requirements of the Pod.
MatchNodeSelector: Checks if a Pod's Node Selector matches the Node's label(s).
NoVolumeZoneConflict: Evaluates if the Volumes that a Pod requests are available on the Node, given the failure zone restrictions for that storage.
NoDiskConflict: Evaluates if a Pod can fit on a Node due to the volumes it requests, and those that are already mounted.
MaxCSIVolumeCount: Decides how many CSI volumes should be attached, and whether that's over a configured limit.
CheckNodeMemoryPressure: If a Node is reporting memory pressure, and there's no configured exception, the Pod won't be scheduled there.
CheckNodePIDPressure: If a Node is reporting that process IDs are scarce, and there's no configured exception, the Pod won't be scheduled there.
CheckNodeDiskPressure: If a Node is reporting storage pressure (a filesystem that is full or nearly full), and there's no configured exception, the Pod won't be scheduled there.
CheckNodeCondition: Nodes can report that they have a completely full filesystem, that networking isn't available or that kubelet is otherwise not ready to run Pods. If such a condition is set for a Node, and there's no configured exception, the Pod won't be scheduled there.
PodToleratesNodeTaints: Checks if a Pod's tolerations can tolerate the Node's taints.
CheckVolumeBinding: Evaluates if a Pod can fit due to the volumes it requests. This applies for both bound and unbound PVCs.

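The Scheduling Process

Pods are assigned to specific nodes according to the placement policies.

To understand this in more detail, let's look at how these elements fit together and the main steps a Pod goes through when it is scheduled.

As soon as a Pod that is not yet assigned to a node is created, the scheduler picks it up together with all available nodes. It removes the nodes that do not satisfy the filtering policies (predicates), ranks the remaining nodes with the priority policies, and then places the Pod on the best-ranked node.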
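The filter-then-rank flow can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the real scheduler: the `Node` and `Pod` types, the CPU-only resource model, and the 0 to 10 scoring scale are all made up for illustration, and the functions only borrow the names of the predicates and priorities discussed above.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    allocatable_cpu: int                       # millicores
    requested_cpu: int                         # millicores requested by placed Pods
    used_host_ports: FrozenSet[int] = frozenset()


@dataclass(frozen=True)
class Pod:
    name: str
    cpu_request: int                           # millicores
    host_ports: FrozenSet[int] = frozenset()


Predicate = Callable[[Pod, Node], bool]
Priority = Callable[[Pod, Node], float]


def pod_fits_resources(pod: Pod, node: Node) -> bool:
    return node.requested_cpu + pod.cpu_request <= node.allocatable_cpu


def pod_fits_host_ports(pod: Pod, node: Node) -> bool:
    return pod.host_ports.isdisjoint(node.used_host_ports)


def least_requested(pod: Pod, node: Node) -> float:
    """More free CPU after placement gives a higher score (0 to 10 in this sketch)."""
    free = node.allocatable_cpu - node.requested_cpu - pod.cpu_request
    return 10 * free / node.allocatable_cpu


def schedule(pod: Pod, nodes: Sequence[Node], predicates: Sequence[Predicate],
             priorities: Sequence[Tuple[Priority, int]]) -> Optional[Node]:
    # Step 1: filter out nodes that fail any predicate.
    feasible = [n for n in nodes if all(p(pod, n) for p in predicates)]
    if not feasible:
        return None  # the Pod stays unscheduled (Pending)
    # Step 2: rank the remaining nodes by the weighted sum of priority scores.
    return max(feasible, key=lambda n: sum(w * score(pod, n) for score, w in priorities))


# Hypothetical cluster state.
nodes = [
    Node("a", 4000, 3500),
    Node("b", 4000, 1000, frozenset({8080})),
    Node("c", 4000, 2000),
]
predicates = [pod_fits_resources, pod_fits_host_ports]
priorities = [(least_requested, 2)]

chosen = schedule(Pod("web", 1000, frozenset({8080})), nodes, predicates, priorities)
print(chosen.name if chosen else "unschedulable")
```

In this example node a is filtered out for lack of CPU and node b because host port 8080 is already taken, so only node c is left to be ranked.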

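In some cases you may want to force a Pod onto a specific node or group of nodes.

This can be done with a node selector.

The node selector is a field of the Pod, set through .spec.nodeSelector. It holds key-value pairs that must all be present as labels on a node for that node to be eligible.

For example, suppose this Pod must run on nodes that have SSD storage or GPU acceleration hardware.

In that case, the Pod definition can declare a nodeSelector as follows.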

apiVersion: v1
kind: Pod
metadata:
  name: random-generator
spec:
  containers:
  - image: k8spatterns/random-generator:0.0.1
    name: random-generator
  nodeSelector:
    disktype: ssd
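The matching rule behind nodeSelector is simple enough to express directly. Below is a minimal Python sketch, assuming node labels are available as plain dictionaries; the node names and labels in `cluster` are hypothetical.

```python
from typing import Dict, List


def matches_node_selector(node_selector: Dict[str, str],
                          node_labels: Dict[str, str]) -> bool:
    """Every key-value pair in the selector must appear among the node's labels."""
    return all(node_labels.get(key) == value
               for key, value in node_selector.items())


def eligible_nodes(node_selector: Dict[str, str],
                   nodes: Dict[str, Dict[str, str]]) -> List[str]:
    return [name for name, labels in nodes.items()
            if matches_node_selector(node_selector, labels)]


# Hypothetical cluster: node names and labels are made up for illustration.
cluster = {
    "node-1": {"disktype": "ssd", "gpu": "true"},
    "node-2": {"disktype": "hdd"},
    "node-3": {},
}
print(eligible_nodes({"disktype": "ssd"}, cluster))
```

Extra labels on a node do not matter; only the pairs named in the selector are checked, and an empty selector matches every node.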

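Node Affinity

Kubernetes supports many flexible ways to configure the scheduling process.

One of them, node affinity, generalizes the node selector approach described above by letting you specify rules as either required or preferred.

Required rules must be satisfied for a Pod to be scheduled on a node, whereas preferred rules only express a preference: nodes that match them get a higher weight when the scheduler ranks the candidates.

Node affinity also makes the constraints more expressive through operators such as In, NotIn, Exists, DoesNotExist, Gt, and Lt.

The following is an example of node affinity.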

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # 1
        nodeSelectorTerms:
        - matchExpressions: # 2
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution: # 3
      - weight: 1
        preference:
          matchExpressions: # 4
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: k8s.gcr.io/pause:2.0

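1. A required rule: it must be satisfied when the Pod is scheduled, but changes to the node's labels after scheduling (during execution) are ignored.
2. The node must have a label with the key kubernetes.io/e2e-az-name and the value e2e-az1 or e2e-az2.
3. A preferred rule: it is taken into account when the Pod is scheduled, and changes to the node's labels during execution are likewise ignored.
4. Nodes that have a label with the key another-node-label-key and the value another-node-label-value are preferred.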
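How the operators and the required/preferred split combine can be sketched as follows. This is an illustrative Python model, not the scheduler's code: the rules are passed as plain dictionaries shaped like the YAML above, and the helper names are hypothetical. It assumes that multiple nodeSelectorTerms are ORed while the matchExpressions inside one term are ANDed.

```python
from typing import Any, Dict, List, Optional

Labels = Dict[str, str]


def _as_int(text: Optional[str]) -> Optional[int]:
    try:
        return int(text)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return None


def matches_expression(labels: Labels, expr: Dict[str, Any]) -> bool:
    key, op = expr["key"], expr["operator"]
    values = list(expr.get("values", []))
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    if op in ("Gt", "Lt"):
        # Both the label value and the single rule value are compared as integers.
        have = _as_int(labels.get(key))
        want = _as_int(values[0]) if values else None
        if have is None or want is None:
            return False
        return have > want if op == "Gt" else have < want
    raise ValueError(f"unknown operator: {op}")


def matches_required(labels: Labels, terms: List[Dict[str, Any]]) -> bool:
    """Terms are ORed; the expressions inside one term are ANDed."""
    return any(all(matches_expression(labels, e) for e in term["matchExpressions"])
               for term in terms)


def preferred_score(labels: Labels, preferences: List[Dict[str, Any]]) -> int:
    """Sum the weights of all preferences whose expressions match the node."""
    return sum(p["weight"] for p in preferences
               if all(matches_expression(labels, e)
                      for e in p["preference"]["matchExpressions"]))


required = [{"matchExpressions": [
    {"key": "kubernetes.io/e2e-az-name", "operator": "In",
     "values": ["e2e-az1", "e2e-az2"]}]}]
preferred = [{"weight": 1, "preference": {"matchExpressions": [
    {"key": "another-node-label-key", "operator": "In",
     "values": ["another-node-label-value"]}]}}]

node_labels = {"kubernetes.io/e2e-az-name": "e2e-az1",
               "another-node-label-key": "another-node-label-value"}
print(matches_required(node_labels, required), preferred_score(node_labels, preferred))
```

A node that fails the required terms is excluded outright, while the preferred score only influences the ranking among the nodes that pass.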

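Pod Affinity and Pod Anti-Affinity

This mechanism addresses dependencies between Pods, which node affinity cannot express, and lets you spread Pods apart or place them together.

With it, a node is chosen based on the Pods that are already running there.

Pod affinity is not limited to nodes; it can be expressed at multiple topology levels.

As the following example shows, using a label that matches the topologyKey lets you apply the rules to domains such as node, rack, cloud zone, or cloud region.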

apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # 1
      - labelSelector: # 2
          matchExpressions:
          - key: security
            operator: In
            values:
            - S1
        topologyKey: topology.kubernetes.io/zone # 3
    podAntiAffinity: # 4
      preferredDuringSchedulingIgnoredDuringExecution: 
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          topologyKey: topology.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: k8s.gcr.io/pause:2.0

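1. podAffinity defines the Pods this Pod must be placed together with. The required prefix marks it as a mandatory rule.
2. The label selector: the Pod is placed together with Pods whose labels match this key and value.
3. The topology domain: the Pod must land on a node that has the same topology.kubernetes.io/zone label value as a node running a Pod matched by 2.
4. A podAntiAffinity rule, which describes the Pods this Pod should preferably not be placed together with.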
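The role of topologyKey is easier to see in code. The following Python sketch is a simplified model that ignores namespaces and other details of the real scheduler; the `Node` type, the helper functions, and the three-node cluster are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Set

ZONE = "topology.kubernetes.io/zone"


@dataclass
class Node:
    name: str
    labels: Dict[str, str]
    pods: List[Dict[str, str]] = field(default_factory=list)  # labels of each running Pod


def domains_with_match(nodes: Sequence[Node], topology_key: str,
                       label_key: str, label_values: Sequence[str]) -> Set[str]:
    """Topology domains (values of topology_key) that already host a matching Pod."""
    return {
        node.labels[topology_key]
        for node in nodes
        if topology_key in node.labels
        and any(pod.get(label_key) in label_values for pod in node.pods)
    }


def satisfies_affinity(candidate: Node, nodes: Sequence[Node], topology_key: str,
                       label_key: str, label_values: Sequence[str]) -> bool:
    """Required affinity: the candidate must share a domain with a matching Pod."""
    domain = candidate.labels.get(topology_key)
    return domain is not None and domain in domains_with_match(
        nodes, topology_key, label_key, label_values)


def anti_affinity_penalty(candidate: Node, nodes: Sequence[Node], topology_key: str,
                          label_key: str, label_values: Sequence[str], weight: int) -> int:
    """Preferred anti-affinity: the weight counts against the candidate instead of excluding it."""
    domain = candidate.labels.get(topology_key)
    matched = domains_with_match(nodes, topology_key, label_key, label_values)
    return weight if domain in matched else 0


# Hypothetical cluster with two zones.
n1 = Node("n1", {ZONE: "zone-a"}, [{"security": "S1"}])
n2 = Node("n2", {ZONE: "zone-a"})
n3 = Node("n3", {ZONE: "zone-b"}, [{"security": "S2"}])
cluster = [n1, n2, n3]

print(satisfies_affinity(n2, cluster, ZONE, "security", ["S1"]))
print(anti_affinity_penalty(n3, cluster, ZONE, "security", ["S2"], 100))
```

Note that n2 satisfies the affinity rule even though it runs no Pods itself: it shares a zone with n1, which hosts a Pod labeled security=S1. With kubernetes.io/hostname as the topologyKey the same rule would only accept n1.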

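As with node affinity, both Pod affinity and Pod anti-affinity come in requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution variants.

The IgnoredDuringExecution suffix means that the runtime is not taken into account: once a Pod is scheduled, later label changes do not affect its placement. Rules that also react to such runtime changes are something to consider in the future.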

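Summary

Besides these mechanisms, there are taints and tolerations. In contrast to node affinity, where the Pod selects nodes, they let a node control which Pods may be scheduled on it.

A taint is an attribute of a node: a Pod can be placed on the node only if it has a toleration for that taint, and is kept off the node otherwise.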
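The matching between taints and tolerations can be sketched like this. It is a minimal Python model under stated assumptions rather than the scheduler's implementation: the `Taint` and `Toleration` types mirror the fields of the Kubernetes objects, only the NoSchedule and NoExecute effects are treated as hard constraints, and the example taint is hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str            # "NoSchedule", "PreferNoSchedule" or "NoExecute"


@dataclass(frozen=True)
class Toleration:
    key: str = ""          # empty key with operator "Exists" matches every key
    operator: str = "Equal"
    value: str = ""
    effect: str = ""       # empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    return toleration.operator == "Equal" and toleration.value == taint.value


def can_schedule(node_taints: Sequence[Taint],
                 pod_tolerations: Sequence[Toleration]) -> bool:
    """Every hard taint on the node needs at least one matching toleration."""
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations) for taint in hard)


# Hypothetical node reserved for GPU workloads.
gpu_taint = Taint("dedicated", "gpu", "NoSchedule")

print(can_schedule([gpu_taint], []))
print(can_schedule([gpu_taint], [Toleration("dedicated", "Equal", "gpu", "NoSchedule")]))
```

A toleration only allows a Pod onto a tainted node; it does not attract the Pod there. To pin Pods to such nodes, a toleration is typically combined with a node selector or node affinity.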

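Also take care not to over-constrain the scheduler on the basis of an application's high-availability and performance requirements: too many constraints can leave Pods unschedulable while plenty of resources sit unused.

Scheduling is also a one-time decision. Once a Pod is placed, its placement does not change unless a problem forces the Pod to be recreated.

As a result, the resources of some nodes may end up being used inefficiently. The Kubernetes descheduler can help here by defragmenting the leftover resources on the nodes to improve utilization.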
