Dragonfly for Production Use

진웅 · October 28, 2025

CNCF.DRAGONFLY


What is Dragonfly?

  • Dragonfly is a CNCF Incubating project that distributes container images and files over a P2P (peer-to-peer) network.

Why is Dragonfly needed?

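  • During operations, we receive reports that image pulls take too long.
  • In an air-gapped environment, when the delay is caused by a bottleneck at the single target private registry, nodes can fetch images that already exist inside the cluster over P2P at high speed, which resolves the issue.
  • However, the first pull of a new image sees no benefit, because no cached copy exists inside the cluster yet.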
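
To make the bottleneck argument concrete, here is a rough back-of-the-envelope sketch of how much traffic the origin registry serves during one rollout. It assumes an idealized model: without P2P every node pulls the full image from the registry, and with P2P only the seed peers go back to source. The 2 GiB image size and the `registry_egress_gib` helper are hypothetical; the node and seed peer counts are the ones used later in this article.

```shell
#!/usr/bin/env bash
# Rough estimate of the traffic served by the origin registry for one rollout.
# Idealized assumption: each host that pulls from the registry downloads the
# whole image exactly once.

# registry_egress_gib <hosts_pulling_from_registry> <image_size_gib>
registry_egress_gib() {
  local hosts=$1 image_gib=$2
  echo $(( hosts * image_gib ))
}

nodes=100       # cluster size used in this article
seed_peers=5    # seed peer replicas used in this article
image_gib=2     # hypothetical image size

echo "without P2P: $(registry_egress_gib "$nodes" "$image_gib") GiB"
echo "with P2P (worst case, every seed peer fetches once): $(registry_egress_gib "$seed_peers" "$image_gib") GiB"
```

In practice the numbers will differ, since clients may still fall back to the registry when back-to-source is allowed, but the order of magnitude shows why a single registry stops being the bottleneck.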

ServiceFlow

Deployment sequence

P2P path: the image already exists inside the cluster (cache hit)

The image does not exist inside the cluster

Dragonfly components and roles

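Scheduler

Role:
The "brain" of the P2P network. When a client (dfdaemon) requests an image, the Scheduler decides which peers (dfdaemon instances on other nodes) each piece (chunk) of data should be fetched from, and assigns the best path.

Operational considerations:
It determines the performance and stability of the whole P2P network, so run at least two Pods for high availability (HA).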


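Seed Peer

Role:
The "first supplier" of the P2P network. When a new image that nobody in the cluster holds is requested, the Seed Peer is the first to download it from the origin registry (nexus.com) and "seeds" the P2P network with it.

Operational considerations:
It talks to the origin registry frequently, so place it where the network is stable. It is typically run as 1 to 2 Pods.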


dfdaemon (Dragonfly Daemon)

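Role:
A "local proxy" deployed on every worker node. containerd is configured so that, when it pulls an image, it sends the request through the dfdaemon on its own node (usually 127.0.0.1:65001) instead of going to the origin registry directly.
dfdaemon receives a list of peers from the Scheduler, downloads pieces from several nodes in parallel, and hands the completed image to containerd.

Operational considerations:
It must run on every node, so it is deployed as a DaemonSet. Setting hostNetwork: true is recommended so that it uses the node's network directly.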
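
As an illustration of what "containerd pulls through the local dfdaemon" means on a node, the sketch below prints a containerd `hosts.toml` for one registry. It is an assumption-laden example rather than the exact file dfinit generates: the `render_hosts_toml` helper is hypothetical, the proxy address is the 127.0.0.1:65001 mentioned above, and the `X-Dragonfly-Registry` header follows the pattern shown in the Dragonfly containerd guide.

```shell
#!/usr/bin/env bash
# render_hosts_toml <registry_url> <dfdaemon_proxy_url>
# Prints a containerd hosts.toml that tries the local dfdaemon proxy first;
# containerd falls back to "server" when the mirror host cannot serve a request.
render_hosts_toml() {
  local registry=$1 proxy=$2
  cat <<EOF
server = "${registry}"

[host."${proxy}"]
  capabilities = ["pull", "resolve"]

[host."${proxy}".header]
  X-Dragonfly-Registry = "${registry}"
EOF
}

# Print the file content for the registry used in this article.
render_hosts_toml "https://nexus.com" "http://127.0.0.1:65001"
```

On a real node the output would be saved as /etc/containerd/certs.d/nexus.com/hosts.toml, assuming containerd's registry `config_path` points at /etc/containerd/certs.d.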


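Manager

Role:
The "operations and monitoring hub" of the Dragonfly cluster. Its web UI lets you monitor and manage P2P transfer status, traffic statistics, peer state, and more in real time.

Operational considerations:
Installing it is strongly recommended for ease of operations. Like the Scheduler, run at least two Pods for high availability.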


🎯 Overall architecture

[Kubernetes Cluster - 100 Nodes]
├── Dragonfly
│   ├── Manager (3 replicas) → External MySQL
│   ├── Scheduler (3 replicas) → External Redis
│   ├── Seed Peer (5 replicas)
│   └── Client (100 DaemonSet)
├── External Services
│   ├── MySQL (StatefulSet)
│   └── Redis (StatefulSet)
└── Monitoring
    ├── Prometheus
    └── Grafana

Service deployment

# 1. Create the namespace
kubectl create namespace dragonfly-infra

# 2. Deploy MySQL
kubectl apply -f mysql-deployment.yaml

# 3. Deploy Redis
kubectl apply -f redis-deployment.yaml

# 4. Watch the rollout
kubectl get pods -n dragonfly-infra -w

# Expected output:
# NAME              READY   STATUS    RESTARTS   AGE
# mysql-0           1/1     Running   0          2m
# redis-0           1/1     Running   0          2m
# mysql-exporter-*  1/1     Running   0          2m
# redis-exporter-*  1/1     Running   0          2m

# 5. Test the MySQL connection (password quoted so the shell never interprets "!")
kubectl exec -it mysql-0 -n dragonfly-infra -- mysql -udragonfly -p'DragonflyPassword123!' dragonfly -e "SELECT 1;"

# 6. Test the Redis connection
kubectl exec -it redis-0 -n dragonfly-infra -- redis-cli -a 'RedisPassword123!' ping
# Output: PONG

Dragonfly Helm Values

dragonfly-values.yaml

# ==========================================
# Dragonfly Helm Chart v1.4.15
# App Version: 2.3.3
# Kubernetes: v1.33.3
# Production, air-gapped environment (all monitoring enabled)
# ==========================================

# ==========================================
# Global settings
# ==========================================
global:
  # Air-gapped private registry
  imageRegistry: "nexus.com"
  imagePullSecrets: []

# ==========================================
# Manager (central management)
# ==========================================
manager:
  enable: true
  replicas: 3
  
  image:
    repository: nexus.com/dragonflyoss/manager
    tag: v2.3.3
    pullPolicy: IfNotPresent
  
  # Resources
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  
  # Service
  service:
    type: ClusterIP
    ports:
      - name: http
        port: 8080
        targetPort: 8080
      - name: grpc
        port: 65003
        targetPort: 65003
  
  # Enable metrics (monitoring)
  metrics:
    enable: true
    port: 8000
    path: /metrics
    serviceMonitor:
      enable: true
      interval: 30s
      scrapeTimeout: 10s
      labels:
        release: prometheus
  
  # High availability
  podDisruptionBudget:
    minAvailable: 2
  
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - dragonfly-manager
            topologyKey: kubernetes.io/hostname

# ==========================================
# Scheduler
# ==========================================
scheduler:
  enable: true
  replicas: 3
  
  image:
    repository: nexus.com/dragonflyoss/scheduler
    tag: v2.3.3
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  
  service:
    type: ClusterIP
    ports:
      - name: http
        port: 8002
        targetPort: 8002
  
  # Enable metrics
  metrics:
    enable: true
    port: 8000
    path: /metrics
    serviceMonitor:
      enable: true
      interval: 30s
      scrapeTimeout: 10s
      labels:
        release: prometheus
  
  # Scheduler settings
  config:
    scheduler:
      algorithm: default
      backSourceCount: 3
      filterParentLimit: 40
    manager:
      schedulerClusterID: 1
  
  podDisruptionBudget:
    minAvailable: 2
  
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - dragonfly-scheduler
            topologyKey: kubernetes.io/hostname

# ==========================================
# Seed Peer
# ==========================================
seedClient:
  enable: true
  replicas: 5
  
  image:
    repository: nexus.com/dragonflyoss/client
    tag: v0.1.118
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 2000m
      memory: 4Gi
    limits:
      cpu: 4000m
      memory: 8Gi
  
  # Persistence is required
  persistence:
    enable: true
    size: 200Gi
    storageClass: "local-path"
    accessModes:
      - ReadWriteOnce
  
  service:
    type: ClusterIP
  
  # Enable metrics
  metrics:
    enable: true
    port: 8000
    path: /metrics
    serviceMonitor:
      enable: true
      interval: 30s
      scrapeTimeout: 10s
      labels:
        release: prometheus
  
  config:
    seedPeer:
      enable: true
      type: "super"
      clusterID: 1
    
    proxy:
      registryMirror:
        addr: https://nexus.com
      disableBackToSource: false
      security:
        insecure: false
        cacert: "/etc/containerd/certs.d/nexus.com/ca.crt"
        cert: "/etc/containerd/certs.d/nexus.com/client.crt"
        key: "/etc/containerd/certs.d/nexus.com/client.key"
    
    download:
      concurrentPieceCount: 16
      pieceDownloadTimeout: 60s
      rateLimit: 0
    
    upload:
      rateLimit: 0
      maxConcurrency: 200
    
    storage:
      dir: /var/lib/dragonfly
      taskExpireTime: 24h
      diskGCThreshold: 85
      diskGCInterval: 30s
      writeBufferSize: 16777216
      readBufferSize: 16777216
  
  volumeMounts:
    - name: containerd-certs
      mountPath: /etc/containerd/certs.d
      readOnly: true
  
  volumes:
    - name: containerd-certs
      hostPath:
        path: /etc/containerd/certs.d
        type: Directory
  
  podDisruptionBudget:
    minAvailable: 3
  
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - dragonfly-seed-client
          topologyKey: kubernetes.io/hostname

# ==========================================
# Client (DaemonSet)
# ==========================================
client:
  enable: true
  
  image:
    repository: nexus.com/dragonflyoss/client
    tag: v0.1.118
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi
  
  persistence:
    enable: true
    size: 30Gi
    storageClass: "local-path"
    accessModes:
      - ReadWriteOnce
  
  # Enable metrics
  metrics:
    enable: true
    port: 8000
    path: /metrics
    serviceMonitor:
      enable: true
      interval: 30s
      scrapeTimeout: 10s
      labels:
        release: prometheus
  
  config:
    proxy:
      registryMirror:
        addr: https://nexus.com
      listenAddress: "0.0.0.0:65001"
      disableBackToSource: false
      security:
        insecure: false
        cacert: "/etc/containerd/certs.d/nexus.com/ca.crt"
        cert: "/etc/containerd/certs.d/nexus.com/client.crt"
        key: "/etc/containerd/certs.d/nexus.com/client.key"
      
      # Support for multiple registries
      proxies:
        - regx: "nexus.com/*"
          useHTTPS: true
          direct: true
        - regx: "docker.io/*"
          useHTTPS: true
          direct: true
        - regx: "gcr.io/*"
          useHTTPS: true
          direct: true
        - regx: "ghcr.io/*"
          useHTTPS: true
          direct: true
        - regx: "k8s.gcr.io/*"
          useHTTPS: true
          direct: true
        - regx: "quay.io/*"
          useHTTPS: true
          direct: true
        - regx: "registry.k8s.io/*"
          useHTTPS: true
          direct: true
    
    download:
      concurrentPieceCount: 10
      pieceDownloadTimeout: 30s
      downloadTimeout: 10m
      downloadRetryCount: 3
      downloadRetryBackoff: 1s
    
    storage:
      dir: /var/lib/dragonfly
      taskExpireTime: 6h
      diskGCThreshold: 90
      diskGCInterval: 15s
      writeBufferSize: 8388608
      readBufferSize: 8388608
  
  volumeMounts:
    - name: containerd-certs
      mountPath: /etc/containerd/certs.d
      readOnly: true
  
  volumes:
    - name: containerd-certs
      hostPath:
        path: /etc/containerd/certs.d
        type: Directory
  
  enableHost: true

# ==========================================
# dfinit (automatic containerd configuration)
# ==========================================
dfinit:
  enable: true
  restartContainerRuntime: true
  
  image:
    repository: nexus.com/dragonflyoss/dfinit
    tag: v0.1.118
    pullPolicy: IfNotPresent
  
  config:
    containerRuntime:
      containerd:
        configPath: /etc/containerd/config.toml
        registries:
          - hostNamespace: nexus.com
            serverAddr: https://nexus.com
            capabilities: ['pull', 'resolve']
          - hostNamespace: docker.io
            serverAddr: https://registry-1.docker.io
            capabilities: ['pull', 'resolve']
          - hostNamespace: gcr.io
            serverAddr: https://gcr.io
            capabilities: ['pull', 'resolve']
          - hostNamespace: ghcr.io
            serverAddr: https://ghcr.io
            capabilities: ['pull', 'resolve']
          - hostNamespace: k8s.gcr.io
            serverAddr: https://k8s.gcr.io
            capabilities: ['pull', 'resolve']
          - hostNamespace: quay.io
            serverAddr: https://quay.io
            capabilities: ['pull', 'resolve']
          - hostNamespace: registry.k8s.io
            serverAddr: https://registry.k8s.io
            capabilities: ['pull', 'resolve']

# ==========================================
# External MySQL connection
# ==========================================
mysql:
  enable: false

externalMysql:
  migrate: true
  host: mysql.dragonfly-infra.svc.cluster.local
  port: 3306
  username: dragonfly
  password: "DragonflyPassword123!"
  database: dragonfly
  maxOpenConns: 200
  maxIdleConns: 50
  connMaxLifetime: 3600

# ==========================================
# External Redis connection
# ==========================================
redis:
  enable: false

externalRedis:
  addrs:
    - redis.dragonfly-infra.svc.cluster.local:6379
  password: "RedisPassword123!"
  db: 0
  brokerDB: 1
  backendDB: 2

# ==========================================
# Security
# ==========================================
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

📋 Part 3: Variable-based install script

install-with-variables.sh

#!/bin/bash

# ==========================================
# Dragonfly install script (variable-based)
# ==========================================

set -e

# ==========================================
# Configuration variables (edit only this section!)
# ==========================================

# Basic settings
NAMESPACE="dragonfly-system"
INFRA_NAMESPACE="dragonfly-infra"
RELEASE_NAME="dragonfly"
CHART_VERSION="1.4.15"
CHART_PATH="./dragonfly-1.4.15.tgz"

# Image settings
IMAGE_REGISTRY="nexus.com"
MANAGER_IMAGE="${IMAGE_REGISTRY}/dragonflyoss/manager"
SCHEDULER_IMAGE="${IMAGE_REGISTRY}/dragonflyoss/scheduler"
CLIENT_IMAGE="${IMAGE_REGISTRY}/dragonflyoss/client"
DFINIT_IMAGE="${IMAGE_REGISTRY}/dragonflyoss/dfinit"
IMAGE_TAG_MANAGER="v2.3.3"
IMAGE_TAG_SCHEDULER="v2.3.3"
IMAGE_TAG_CLIENT="v0.1.118"
IMAGE_TAG_DFINIT="v0.1.118"

# Storage settings
STORAGE_CLASS="local-path"
SEED_PEER_STORAGE_SIZE="200Gi"
CLIENT_STORAGE_SIZE="30Gi"

# Replica counts
MANAGER_REPLICAS=3
SCHEDULER_REPLICAS=3
SEED_PEER_REPLICAS=5

# Manager resources
MANAGER_CPU_REQUEST="1000m"
MANAGER_CPU_LIMIT="2000m"
MANAGER_MEM_REQUEST="2Gi"
MANAGER_MEM_LIMIT="4Gi"

# Scheduler resources
SCHEDULER_CPU_REQUEST="1000m"
SCHEDULER_CPU_LIMIT="2000m"
SCHEDULER_MEM_REQUEST="2Gi"
SCHEDULER_MEM_LIMIT="4Gi"

# Seed Peer resources
SEED_CPU_REQUEST="2000m"
SEED_CPU_LIMIT="4000m"
SEED_MEM_REQUEST="4Gi"
SEED_MEM_LIMIT="8Gi"

# Client resources
CLIENT_CPU_REQUEST="1000m"
CLIENT_CPU_LIMIT="2000m"
CLIENT_MEM_REQUEST="1Gi"
CLIENT_MEM_LIMIT="2Gi"

# MySQL settings
MYSQL_HOST="mysql.${INFRA_NAMESPACE}.svc.cluster.local"
MYSQL_PORT="3306"
MYSQL_USERNAME="dragonfly"
MYSQL_PASSWORD="DragonflyPassword123!"
MYSQL_DATABASE="dragonfly"

# Redis settings
REDIS_HOST="redis.${INFRA_NAMESPACE}.svc.cluster.local"
REDIS_PORT="6379"
REDIS_PASSWORD="RedisPassword123!"
REDIS_DB="0"
REDIS_BROKER_DB="1"
REDIS_BACKEND_DB="2"

# Registry settings
REGISTRY_ADDR="https://nexus.com"
REGISTRY_CERT_PATH="/etc/containerd/certs.d/nexus.com"

# containerd settings
CONTAINERD_CONFIG_PATH="/etc/containerd/config.toml"
CONTAINERD_CERTS_PATH="/etc/containerd/certs.d"

# Monitoring
ENABLE_METRICS="true"
METRICS_PORT="8000"
METRICS_PATH="/metrics"
SERVICE_MONITOR_ENABLED="true"
SERVICE_MONITOR_INTERVAL="30s"
PROMETHEUS_LABEL="prometheus"

# dfinit settings
DFINIT_ENABLED="true"
DFINIT_RESTART_CONTAINERD="true"

# ==========================================
# Function definitions
# ==========================================

log() {
    echo "[$(date +'%Y-%m-%d %H:%M:%S')] $1"
}

check_prerequisites() {
    log "Checking prerequisites..."
    
    # Check that kubectl is installed
    if ! command -v kubectl &> /dev/null; then
        log "ERROR: kubectl not found"
        exit 1
    fi
    
    # Check that helm is installed
    if ! command -v helm &> /dev/null; then
        log "ERROR: helm not found"
        exit 1
    fi
    
    # Check that the StorageClass exists (fatal, so it is logged as an error)
    if ! kubectl get storageclass "${STORAGE_CLASS}" &> /dev/null; then
        log "ERROR: StorageClass '${STORAGE_CLASS}' not found"
        log "Please create StorageClass or update STORAGE_CLASS variable"
        exit 1
    fi
    
    log "Prerequisites check passed"
}

create_namespace() {
    log "Creating namespaces..."
    kubectl create namespace "${INFRA_NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -
    kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -
    log "Namespaces created"
}

deploy_mysql() {
    log "Deploying MySQL..."
    
    cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: mysql-config
  namespace: ${INFRA_NAMESPACE}
data:
  my.cnf: |
    [mysqld]
    default-storage-engine=INNODB
    character-set-server=utf8mb4
    collation-server=utf8mb4_unicode_ci
    max_connections=500
    max_allowed_packet=256M
    innodb_buffer_pool_size=2G
    innodb_log_file_size=512M
    innodb_flush_log_at_trx_commit=2
    innodb_flush_method=O_DIRECT
---
apiVersion: v1
kind: Secret
metadata:
  name: mysql-secret
  namespace: ${INFRA_NAMESPACE}
type: Opaque
stringData:
  MYSQL_ROOT_PASSWORD: "RootPassword123!"
  MYSQL_PASSWORD: "${MYSQL_PASSWORD}"
---
apiVersion: v1
kind: Service
metadata:
  name: mysql
  namespace: ${INFRA_NAMESPACE}
spec:
  type: ClusterIP
  ports:
    - port: ${MYSQL_PORT}
      targetPort: ${MYSQL_PORT}
  selector:
    app: mysql
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
  namespace: ${INFRA_NAMESPACE}
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ${STORAGE_CLASS}
  resources:
    requests:
      storage: 50Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mysql
  namespace: ${INFRA_NAMESPACE}
spec:
  serviceName: mysql
  replicas: 1
  selector:
    matchLabels:
      app: mysql
  template:
    metadata:
      labels:
        app: mysql
    spec:
      containers:
        - name: mysql
          image: ${IMAGE_REGISTRY}/mysql:8.0
          ports:
            - containerPort: ${MYSQL_PORT}
          env:
            - name: MYSQL_ROOT_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-secret
                  key: MYSQL_ROOT_PASSWORD
            - name: MYSQL_DATABASE
              value: ${MYSQL_DATABASE}
            - name: MYSQL_USER
              value: ${MYSQL_USERNAME}
            - name: MYSQL_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: mysql-secret
                  key: MYSQL_PASSWORD
          volumeMounts:
            - name: mysql-data
              mountPath: /var/lib/mysql
            - name: mysql-config
              mountPath: /etc/mysql/conf.d
          resources:
            requests:
              cpu: 1000m
              memory: 2Gi
            limits:
              cpu: 2000m
              memory: 4Gi
      volumes:
        - name: mysql-data
          persistentVolumeClaim:
            claimName: mysql-pvc
        - name: mysql-config
          configMap:
            name: mysql-config
EOF
    
    log "Waiting for MySQL to be ready..."
    kubectl wait --for=condition=Ready pod -l app=mysql -n "${INFRA_NAMESPACE}" --timeout=300s
    log "MySQL deployed successfully"
}

deploy_redis() {
    log "Deploying Redis..."
    
    cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
  namespace: ${INFRA_NAMESPACE}
data:
  redis.conf: |
    bind 0.0.0.0
    protected-mode no
    port ${REDIS_PORT}
    requirepass "${REDIS_PASSWORD}"
    maxmemory 2gb
    maxmemory-policy allkeys-lru
    appendonly yes
---
apiVersion: v1
kind: Service
metadata:
  name: redis
  namespace: ${INFRA_NAMESPACE}
spec:
  type: ClusterIP
  ports:
    - port: ${REDIS_PORT}
      targetPort: ${REDIS_PORT}
  selector:
    app: redis
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: redis-pvc
  namespace: ${INFRA_NAMESPACE}
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ${STORAGE_CLASS}
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
  namespace: ${INFRA_NAMESPACE}
spec:
  serviceName: redis
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: ${IMAGE_REGISTRY}/redis:7.2
          ports:
            - containerPort: ${REDIS_PORT}
          command:
            - redis-server
            - /usr/local/etc/redis/redis.conf
          volumeMounts:
            - name: redis-data
              mountPath: /data
            - name: redis-config
              mountPath: /usr/local/etc/redis
          resources:
            requests:
              cpu: 500m
              memory: 2Gi
            limits:
              cpu: 1000m
              memory: 4Gi
      volumes:
        - name: redis-data
          persistentVolumeClaim:
            claimName: redis-pvc
        - name: redis-config
          configMap:
            name: redis-config
EOF
    
    log "Waiting for Redis to be ready..."
    kubectl wait --for=condition=Ready pod -l app=redis -n "${INFRA_NAMESPACE}" --timeout=300s
    log "Redis deployed successfully"
}

install_dragonfly() {
    log "Installing Dragonfly..."
    
    helm upgrade --install "${RELEASE_NAME}" "${CHART_PATH}" \
        --namespace "${NAMESPACE}" \
        --set global.imageRegistry="${IMAGE_REGISTRY}" \
        \
        --set manager.enable=true \
        --set manager.replicas="${MANAGER_REPLICAS}" \
        --set manager.image.repository="${MANAGER_IMAGE}" \
        --set manager.image.tag="${IMAGE_TAG_MANAGER}" \
        --set manager.resources.requests.cpu="${MANAGER_CPU_REQUEST}" \
        --set manager.resources.requests.memory="${MANAGER_MEM_REQUEST}" \
        --set manager.resources.limits.cpu="${MANAGER_CPU_LIMIT}" \
        --set manager.resources.limits.memory="${MANAGER_MEM_LIMIT}" \
        --set manager.metrics.enable="${ENABLE_METRICS}" \
        --set manager.metrics.port="${METRICS_PORT}" \
        --set manager.metrics.serviceMonitor.enable="${SERVICE_MONITOR_ENABLED}" \
        --set manager.metrics.serviceMonitor.interval="${SERVICE_MONITOR_INTERVAL}" \
        --set manager.metrics.serviceMonitor.labels.release="${PROMETHEUS_LABEL}" \
        \
        --set scheduler.enable=true \
        --set scheduler.replicas="${SCHEDULER_REPLICAS}" \
        --set scheduler.image.repository="${SCHEDULER_IMAGE}" \
        --set scheduler.image.tag="${IMAGE_TAG_SCHEDULER}" \
        --set scheduler.resources.requests.cpu="${SCHEDULER_CPU_REQUEST}" \
        --set scheduler.resources.requests.memory="${SCHEDULER_MEM_REQUEST}" \
        --set scheduler.resources.limits.cpu="${SCHEDULER_CPU_LIMIT}" \
        --set scheduler.resources.limits.memory="${SCHEDULER_MEM_LIMIT}" \
        --set scheduler.metrics.enable="${ENABLE_METRICS}" \
        --set scheduler.metrics.serviceMonitor.enable="${SERVICE_MONITOR_ENABLED}" \
        --set scheduler.metrics.serviceMonitor.labels.release="${PROMETHEUS_LABEL}" \
        \
        --set seedClient.enable=true \
        --set seedClient.replicas="${SEED_PEER_REPLICAS}" \
        --set seedClient.image.repository="${CLIENT_IMAGE}" \
        --set seedClient.image.tag="${IMAGE_TAG_CLIENT}" \
        --set seedClient.persistence.enable=true \
        --set seedClient.persistence.size="${SEED_PEER_STORAGE_SIZE}" \
        --set seedClient.persistence.storageClass="${STORAGE_CLASS}" \
        --set seedClient.resources.requests.cpu="${SEED_CPU_REQUEST}" \
        --set seedClient.resources.requests.memory="${SEED_MEM_REQUEST}" \
        --set seedClient.resources.limits.cpu="${SEED_CPU_LIMIT}" \
        --set seedClient.resources.limits.memory="${SEED_MEM_LIMIT}" \
        --set seedClient.metrics.enable="${ENABLE_METRICS}" \
        --set seedClient.metrics.serviceMonitor.enable="${SERVICE_MONITOR_ENABLED}" \
        --set seedClient.metrics.serviceMonitor.labels.release="${PROMETHEUS_LABEL}" \
        \
        --set client.enable=true \
        --set client.image.repository="${CLIENT_IMAGE}" \
        --set client.image.tag="${IMAGE_TAG_CLIENT}" \
        --set client.persistence.enable=true \
        --set client.persistence.size="${CLIENT_STORAGE_SIZE}" \
        --set client.persistence.storageClass="${STORAGE_CLASS}" \
        --set client.resources.requests.cpu="${CLIENT_CPU_REQUEST}" \
        --set client.resources.requests.memory="${CLIENT_MEM_REQUEST}" \
        --set client.resources.limits.cpu="${CLIENT_CPU_LIMIT}" \
        --set client.resources.limits.memory="${CLIENT_MEM_LIMIT}" \
        --set client.metrics.enable="${ENABLE_METRICS}" \
        --set client.metrics.serviceMonitor.enable="${SERVICE_MONITOR_ENABLED}" \
        --set client.metrics.serviceMonitor.labels.release="${PROMETHEUS_LABEL}" \
        --set client.enableHost=true \
        \
        --set dfinit.enable="${DFINIT_ENABLED}" \
        --set dfinit.restartContainerRuntime="${DFINIT_RESTART_CONTAINERD}" \
        --set dfinit.image.repository="${DFINIT_IMAGE}" \
        --set dfinit.image.tag="${IMAGE_TAG_DFINIT}" \
        \
        --set mysql.enable=false \
        --set externalMysql.migrate=true \
        --set externalMysql.host="${MYSQL_HOST}" \
        --set externalMysql.port="${MYSQL_PORT}" \
        --set externalMysql.username="${MYSQL_USERNAME}" \
        --set externalMysql.password="${MYSQL_PASSWORD}" \
        --set externalMysql.database="${MYSQL_DATABASE}" \
        \
        --set redis.enable=false \
        --set "externalRedis.addrs[0]=${REDIS_HOST}:${REDIS_PORT}" \
        --set externalRedis.password="${REDIS_PASSWORD}" \
        --set externalRedis.db="${REDIS_DB}" \
        --set externalRedis.brokerDB="${REDIS_BROKER_DB}" \
        --set externalRedis.backendDB="${REDIS_BACKEND_DB}" \
        \
        --wait \
        --timeout 15m
    
    log "Dragonfly installed successfully"
}

verify_installation() {
    log "Verifying installation..."
    
    log "MySQL status:"
    kubectl get pods -n "${INFRA_NAMESPACE}" -l app=mysql
    
    log "Redis status:"
    kubectl get pods -n "${INFRA_NAMESPACE}" -l app=redis
    
    log "Dragonfly status:"
    kubectl get pods -n "${NAMESPACE}"
    
    log "Verification complete"
}

# ==========================================
# Main
# ==========================================

main() {
    log "Starting Dragonfly deployment..."
    log "Configuration:"
    log "  Namespace: ${NAMESPACE}"
    log "  Infra Namespace: ${INFRA_NAMESPACE}"
    log "  Image Registry: ${IMAGE_REGISTRY}"
    log "  Storage Class: ${STORAGE_CLASS}"
    log "  Manager Replicas: ${MANAGER_REPLICAS}"
    log "  Scheduler Replicas: ${SCHEDULER_REPLICAS}"
    log "  Seed Peer Replicas: ${SEED_PEER_REPLICAS}"
    
    check_prerequisites
    create_namespace
    deploy_mysql
    deploy_redis
    install_dragonfly
    verify_installation
    
    log "========================================="
    log "Dragonfly deployment completed!"
    log "========================================="
    log "Next steps:"
    log "1. Check pods: kubectl get pods -n ${NAMESPACE}"
    log "2. Check services: kubectl get svc -n ${NAMESPACE}"
    log "3. Check metrics: kubectl port-forward svc/dragonfly-manager -n ${NAMESPACE} 8000:8000"
    log "4. Test image pull: kubectl run test --image=${IMAGE_REGISTRY}/nginx:latest"
}

# Run the script
main "$@"

Usage

# 1. Make the script executable
chmod +x install-with-variables.sh

# 2. Edit the variables
# Edit the variable section at the top of install-with-variables.sh

# 3. Run the install
./install-with-variables.sh

# 4. Alternatively, run the install and keep a copy of the log (use instead of step 3, not after it)
./install-with-variables.sh 2>&1 | tee install.log

📊 Part 4: Monitoring dashboard setup

Prometheus AlertRules

# dragonfly-alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dragonfly-alerts
  namespace: dragonfly-system
  labels:
    release: prometheus
spec:
  groups:
    - name: dragonfly
      interval: 30s
      rules:
        # Manager Down
        - alert: DragonflyManagerDown
          expr: up{job="dragonfly-manager"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Dragonfly Manager down"
            description: "Manager {{ $labels.pod }} is down for >5min"
        
        # Scheduler Down
        - alert: DragonflySchedulerDown
          expr: up{job="dragonfly-scheduler"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Dragonfly Scheduler down"
        
        # Seed Peer Low Count
        - alert: LowSeedPeerCount
          # sum() still yields 0 when every target is down, so the alert fires in that case too
          expr: sum(up{job="dragonfly-seed-client"}) < 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Low Seed Peer count (<3)"
        
        # Low Cache Hit Rate
        - alert: LowCacheHitRate
          expr: |
            (sum(rate(dragonfly_client_cache_hit_total[5m])) 
            / 
            (sum(rate(dragonfly_client_cache_hit_total[5m])) + sum(rate(dragonfly_client_cache_miss_total[5m])))) 
            < 0.5
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Cache hit rate <50%"
        
        # High Task Failure Rate
        - alert: HighTaskFailureRate
          expr: |
            (sum(rate(dragonfly_scheduler_tasks_total{state="failed"}[5m])) 
            / 
            sum(rate(dragonfly_scheduler_tasks_total[5m]))) 
            > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Task failure rate >10%"
        
        # High Disk Usage
        - alert: HighDiskUsage
          expr: |
            dragonfly_client_disk_usage_bytes / dragonfly_client_disk_capacity_bytes > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Disk usage >90%"
        
        # Low P2P Efficiency
        - alert: LowP2PEfficiency
          expr: |
            (sum(rate(dragonfly_client_download_piece_total{source_type="peer"}[5m])) 
            / 
            sum(rate(dragonfly_client_download_piece_total[5m]))) 
            < 0.7
          for: 30m
          labels:
            severity: info
          annotations:
            summary: "P2P efficiency <70%"
        
        # MySQL Connection Errors
        - alert: MySQLConnectionError
          # aborted_connects is a cumulative counter, so alert on its recent growth
          expr: increase(mysql_global_status_aborted_connects[5m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "MySQL connection errors"
        
        # Redis High Memory
        - alert: RedisHighMemory
          expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Redis memory usage >90%"

Grafana Dashboard JSON

# Grafana Dashboard Import

# 1. Connect to Grafana
kubectl port-forward svc/prometheus-grafana -n monitoring 3000:80

# 2. Open in a browser: http://localhost:3000
# ID: admin
# PW: (look it up in the prometheus-grafana secret)

# 3. Import the dashboard
# - Left menu > Dashboards > Import
# - Dashboard ID: build your own, or use the panel layout below as a reference

Main panel layout:

Row 1: Overview
- Total Nodes
- Active Tasks
- Cache Hit Rate
- P2P Download Ratio

Row 2: Manager & Scheduler
- Manager Request Rate
- Manager Error Rate
- Scheduler Task Rate
- Scheduler Duration

Row 3: Seed Peer
- Seed Peer Count
- Seed Peer Disk Usage
- Upload Traffic
- Cache Size

Row 4: Client
- Client Count
- Download Speed
- Cache Hit Rate
- Disk Usage

Row 5: Infrastructure
- MySQL Connections
- MySQL Query Rate
- Redis Memory
- Redis Commands/sec

✅ Post-install checks

# 1. Check all Pods
kubectl get pods -n dragonfly-infra
kubectl get pods -n dragonfly-system

# 2. Check Services
kubectl get svc -n dragonfly-infra
kubectl get svc -n dragonfly-system

# 3. Check ServiceMonitors
kubectl get servicemonitor -n dragonfly-system

# 4. Test MySQL
kubectl exec -it mysql-0 -n dragonfly-infra -- mysql -udragonfly -p'DragonflyPassword123!' -e "SHOW DATABASES;"

# 5. Test Redis
kubectl exec -it redis-0 -n dragonfly-infra -- redis-cli -a 'RedisPassword123!' ping

# 6. Check metrics (port-forward blocks, so run curl from a second terminal)
kubectl port-forward svc/dragonfly-manager -n dragonfly-system 8000:8000
curl http://localhost:8000/metrics

# 7. Check Prometheus targets
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
# http://localhost:9090/targets

# 8. Functional test
kubectl run test-nginx --image=nexus.com/nginx:latest
kubectl logs -f -n dragonfly-system -l app=dragonfly-client | grep download

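Summary

Files provided:
1. ✅ dragonfly-values.yaml - complete Helm values
2. ✅ install-with-variables.sh - variable-based install script
3. ✅ mysql-deployment.yaml - MySQL deployment
4. ✅ redis-deployment.yaml - Redis deployment
5. ✅ dragonfly-alerts.yaml - Prometheus alerts

Key features:

  • ✅ Every setting exposed as a variable
  • ✅ External MySQL/Redis deployed automatically
  • ✅ Monitoring fully enabled
  • ✅ Designed with production safety in mind
  • ✅ Full air-gapped support

Staged test: validate on 5 nodes before rolling out to the whole cluster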

Overall rollout strategy

Phase 1: Build the infrastructure (MySQL/Redis)
  └─ Built once, then reused

Phase 2: Test deployment (5 nodes)
  ├─ Manager (1 replica)
  ├─ Scheduler (1 replica)
  ├─ Seed Peer (2 replicas)
  └─ Client (5 nodes only)

Phase 3: Verification and tuning
  └─ Monitor for 1 to 2 weeks

Phase 4: Full expansion (100 nodes)
  ├─ Manager (3 replicas)
  ├─ Scheduler (3 replicas)
  ├─ Seed Peer (5 replicas)
  └─ Client (100 nodes)

📋 Phase 1: Infrastructure setup (one time only)

1-1. Deploy MySQL/Redis

# Deploy the infrastructure once;
# both the test and production deployments use it

# 1. Create the namespace
kubectl create namespace dragonfly-infra

# 2. Deploy MySQL
kubectl apply -f mysql-deployment.yaml

# 3. Deploy Redis
kubectl apply -f redis-deployment.yaml

# 4. Verify
kubectl get pods -n dragonfly-infra -w

# Expected output:
# NAME              READY   STATUS    RESTARTS   AGE
# mysql-0           1/1     Running   0          2m
# redis-0           1/1     Running   0          2m

📋 Phase 2: Test deployment (5 nodes)

2-1. Select and label the test nodes

# Pick 5 test nodes
TEST_NODES=(
  "worker-node-1"
  "worker-node-2"
  "worker-node-3"
  "worker-node-4"
  "worker-node-5"
)

# Add the label (--overwrite makes the loop safe to re-run)
for node in "${TEST_NODES[@]}"; do
  kubectl label node "$node" dragonfly-phase=test --overwrite
done

# Verify
kubectl get nodes -l dragonfly-phase=test
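
Because the Client DaemonSet and dfinit both select on this label, it may be worth confirming that every intended node actually carries it before installing. The following is a minimal sketch; `missing_labels` is a hypothetical helper name, and it only compares two lists of node names.

```shell
# Hypothetical helper: prints each expected node (arguments) that is absent
# from the list of labeled nodes read on stdin (one node name per line).
missing_labels() {
  labeled=$(cat)
  for node in "$@"; do
    printf '%s\n' "$labeled" | grep -qxF -- "$node" || echo "$node"
  done
}

# Usage against a live cluster (`-o name` prints "node/<name>"):
# kubectl get nodes -l dragonfly-phase=test -o name \
#   | sed 's|^node/||' \
#   | missing_labels worker-node-1 worker-node-2 worker-node-3 worker-node-4 worker-node-5
```

No output means all listed nodes carry the `dragonfly-phase=test` label.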

2-2. Values file for the test deployment

# dragonfly-test-values.yaml

# ==========================================
# Phase 2: test deployment (5 nodes)
# ==========================================

global:
  imageRegistry: "nexus.com"

# ==========================================
# Manager (small scale)
# ==========================================
manager:
  enable: true
  replicas: 1  # 1 for the test
  
  image:
    repository: nexus.com/dragonflyoss/manager
    tag: v2.3.3
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 500m    # half of production for the test
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi
  
  service:
    type: ClusterIP
  
  metrics:
    enable: true
    port: 8000
    serviceMonitor:
      enable: true
      interval: 30s
      labels:
        release: prometheus

# ==========================================
# Scheduler (small scale)
# ==========================================
scheduler:
  enable: true
  replicas: 1  # 1 for the test
  
  image:
    repository: nexus.com/dragonflyoss/scheduler
    tag: v2.3.3
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1000m
      memory: 2Gi
  
  service:
    type: ClusterIP
  
  metrics:
    enable: true
    port: 8000
    serviceMonitor:
      enable: true
      interval: 30s
      labels:
        release: prometheus
  
  config:
    scheduler:
      algorithm: default
      backSourceCount: 3
    manager:
      schedulerClusterID: 1

# ==========================================
# Seed Peer (small scale)
# ==========================================
seedClient:
  enable: true
  replicas: 2  # 2 for the test
  
  image:
    repository: nexus.com/dragonflyoss/client
    tag: v0.1.118
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  
  persistence:
    enable: true
    size: 50Gi  # kept small for the test
    storageClass: "local-path"
    accessModes:
      - ReadWriteOnce
  
  metrics:
    enable: true
    port: 8000
    serviceMonitor:
      enable: true
      interval: 30s
      labels:
        release: prometheus
  
  config:
    seedPeer:
      enable: true
      type: "super"
      clusterID: 1
    
    proxy:
      registryMirror:
        addr: https://nexus.com
      disableBackToSource: false
      security:
        insecure: false
        cacert: "/etc/containerd/certs.d/nexus.com/ca.crt"
        cert: "/etc/containerd/certs.d/nexus.com/client.crt"
        key: "/etc/containerd/certs.d/nexus.com/client.key"
    
    download:
      concurrentPieceCount: 10
      pieceDownloadTimeout: 60s
    
    upload:
      rateLimit: 0
      maxConcurrency: 100
    
    storage:
      dir: /var/lib/dragonfly
      taskExpireTime: 12h  # for the test period
      diskGCThreshold: 85
  
  volumeMounts:
    - name: containerd-certs
      mountPath: /etc/containerd/certs.d
      readOnly: true
  
  volumes:
    - name: containerd-certs
      hostPath:
        path: /etc/containerd/certs.d
        type: Directory
  
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - dragonfly-seed-client
          topologyKey: kubernetes.io/hostname

# ==========================================
# Client (test nodes only!) 🔥
# ==========================================
client:
  enable: true
  
  # Important: select the test nodes only!
  nodeSelector:
    dragonfly-phase: test  # 5 nodes only!
  
  image:
    repository: nexus.com/dragonflyoss/client
    tag: v0.1.118
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 1000m
      memory: 1Gi
  
  persistence:
    enable: true
    size: 20Gi  # kept small for the test
    storageClass: "local-path"
    accessModes:
      - ReadWriteOnce
  
  metrics:
    enable: true
    port: 8000
    serviceMonitor:
      enable: true
      interval: 30s
      labels:
        release: prometheus
  
  config:
    proxy:
      registryMirror:
        addr: https://nexus.com
      listenAddress: "0.0.0.0:65001"
      disableBackToSource: false
      security:
        insecure: false
        cacert: "/etc/containerd/certs.d/nexus.com/ca.crt"
        cert: "/etc/containerd/certs.d/nexus.com/client.crt"
        key: "/etc/containerd/certs.d/nexus.com/client.key"
      
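      proxies:
        # NOTE: verify `direct` against your chart version. In the legacy Go dfdaemon
        # proxy rules, `direct: true` sends matching requests straight to the origin
        # instead of through P2P.
        - regx: "nexus.com/*"
          useHTTPS: true
          direct: true
        - regx: "docker.io/*"
          useHTTPS: true
          direct: true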
    
    download:
      concurrentPieceCount: 10
      pieceDownloadTimeout: 30s
      downloadTimeout: 10m
    
    storage:
      dir: /var/lib/dragonfly
      taskExpireTime: 6h
      diskGCThreshold: 90
  
  volumeMounts:
    - name: containerd-certs
      mountPath: /etc/containerd/certs.d
      readOnly: true
  
  volumes:
    - name: containerd-certs
      hostPath:
        path: /etc/containerd/certs.d
        type: Directory
  
  enableHost: true

# ==========================================
# dfinit (test nodes only!)
# ==========================================
dfinit:
  enable: true
  restartContainerRuntime: true  # restarts the container runtime on each selected node
  
  # Important: test nodes only!
  nodeSelector:
    dragonfly-phase: test
  
  image:
    repository: nexus.com/dragonflyoss/dfinit
    tag: v0.1.118
    pullPolicy: IfNotPresent
  
  config:
    containerRuntime:
      containerd:
        configPath: /etc/containerd/config.toml
        registries:
          - hostNamespace: nexus.com
            serverAddr: https://nexus.com
            capabilities: ['pull', 'resolve']
          - hostNamespace: docker.io
            serverAddr: https://registry-1.docker.io
            capabilities: ['pull', 'resolve']

# ==========================================
# External MySQL (uses the instance built in Phase 1)
# ==========================================
mysql:
  enable: false

externalMysql:
  migrate: true
  host: mysql.dragonfly-infra.svc.cluster.local
  port: 3306
  username: dragonfly
  password: "DragonflyPassword123!"
  database: dragonfly
  maxOpenConns: 50   # kept small for the test
  maxIdleConns: 10

# ==========================================
# External Redis (uses the instance built in Phase 1)
# ==========================================
redis:
  enable: false

externalRedis:
  addrs:
    - redis.dragonfly-infra.svc.cluster.local:6379
  password: "RedisPassword123!"
  db: 0
  brokerDB: 1
  backendDB: 2

2-3. Test install script

#!/bin/bash
# install-test.sh

set -e

echo "================================================"
echo "Phase 2: Dragonfly test deployment (5 nodes)"
echo "================================================"

# Variables
NAMESPACE="dragonfly-system"
RELEASE_NAME="dragonfly"
CHART_PATH="./dragonfly-1.4.15.tgz"
VALUES_FILE="./dragonfly-test-values.yaml"

# Create the namespace (idempotent)
echo "[1/5] Creating namespace..."
kubectl create namespace ${NAMESPACE} --dry-run=client -o yaml | kubectl apply -f -

# Install with Helm
echo "[2/5] Installing Dragonfly..."
helm upgrade --install ${RELEASE_NAME} ${CHART_PATH} \
  --namespace ${NAMESPACE} \
  --values ${VALUES_FILE} \
  --wait \
  --timeout 10m

# Check status
echo "[3/5] Checking pod status..."
kubectl get pods -n ${NAMESPACE} -o wide

# Verify that exactly 5 Client pods were deployed
echo "[4/5] Verifying Client DaemonSet (should be 5 pods)..."
CLIENT_COUNT=$(kubectl get pods -n ${NAMESPACE} -l component=client --no-headers | wc -l)
echo "Client pods: ${CLIENT_COUNT}"

if [ "${CLIENT_COUNT}" -ne 5 ]; then
  echo "WARNING: Expected 5 Client pods, but found ${CLIENT_COUNT}"
fi

# Check dfinit
echo "[5/5] Checking dfinit job..."
kubectl get job -n ${NAMESPACE} -l component=dfinit

echo "================================================"
echo "Phase 2 test deployment complete!"
echo "================================================"
echo ""
echo "Next steps:"
echo "1. Functional test: ./test-functionality.sh"
echo "2. Performance measurement: ./test-performance.sh"
echo "3. Monitor for 1-2 weeks"
echo "4. If no problems are found, proceed to the Phase 4 full rollout"
2-4. Run the test deployment

# Make the script executable
chmod +x install-test.sh

# Install
./install-test.sh

# Verify
kubectl get pods -n dragonfly-system -o wide

# Expected output:
# NAME                          READY   STATUS    NODE
# dragonfly-manager-0           1/1     Running   worker-node-1
# dragonfly-scheduler-0         1/1     Running   worker-node-2
# dragonfly-seed-client-0       1/1     Running   worker-node-3
# dragonfly-seed-client-1       1/1     Running   worker-node-4
# dragonfly-client-xxxxx        1/1     Running   worker-node-1  # only 5
# dragonfly-client-xxxxx        1/1     Running   worker-node-2
# dragonfly-client-xxxxx        1/1     Running   worker-node-3
# dragonfly-client-xxxxx        1/1     Running   worker-node-4
# dragonfly-client-xxxxx        1/1     Running   worker-node-5

📊 Phase 3: Verification and tuning

3-1. Functional test script

#!/bin/bash
# test-functionality.sh

set -e

NAMESPACE="dragonfly-system"

echo "================================================"
echo "Dragonfly functional test"
echo "================================================"

# 1. Basic behavior
echo "[Test 1/5] Basic image pull test..."
kubectl run test-nginx-1 --image=nexus.com/nginx:latest \
  --overrides='{"spec":{"nodeSelector":{"dragonfly-phase":"test"}}}'

sleep 10
kubectl wait --for=condition=Ready pod/test-nginx-1 --timeout=120s
echo "✅ Test 1 passed"

# 2. Check P2P behavior
# (the check runs inside `if`; under `set -e` a separate `$?` test would never
#  reach the else branch because a non-matching grep aborts the script)
echo "[Test 2/5] Checking P2P behavior..."
if kubectl logs -n ${NAMESPACE} -l component=client --tail=50 | grep -i "download from peer"; then
  echo "✅ Test 2 passed (P2P transfer observed)"
else
  echo "⚠️ Test 2: could not confirm P2P (this may be the first download)"
fi

# 3. Cache hit test
# Note: if test-nginx-2 lands on the node that ran test-nginx-1, containerd
# already holds the image locally and Dragonfly is barely exercised.
echo "[Test 3/5] Cache hit test..."
kubectl delete pod test-nginx-1
sleep 5
kubectl run test-nginx-2 --image=nexus.com/nginx:latest \
  --overrides='{"spec":{"nodeSelector":{"dragonfly-phase":"test"}}}'

sleep 10
if kubectl logs -n ${NAMESPACE} -l component=client --tail=50 | grep -E "(cache hit|download from peer)"; then
  echo "✅ Test 3 passed (cache hit observed)"
else
  echo "⚠️ Test 3: cache behavior needs checking"
fi

# 4. Fallback test
# Seed Peer pods are named dragonfly-seed-client-N (StatefulSet naming),
# so the StatefulSet is scaled here.
echo "[Test 4/5] Fallback test (Seed Peer stopped)..."
kubectl scale statefulset dragonfly-seed-client --replicas=0 -n ${NAMESPACE}
sleep 10

kubectl run test-nginx-fallback --image=nexus.com/busybox:latest \
  --overrides='{"spec":{"nodeSelector":{"dragonfly-phase":"test"}}}' \
  -- sleep 3600

sleep 20
if kubectl wait --for=condition=Ready pod/test-nginx-fallback --timeout=120s; then
  echo "✅ Test 4 passed (fallback works)"
else
  echo "❌ Test 4 failed (fallback problem)"
fi

# Restore the Seed Peer
kubectl scale statefulset dragonfly-seed-client --replicas=2 -n ${NAMESPACE}
sleep 20

# 5. Multiple-registry test
echo "[Test 5/5] Multiple-registry test..."
kubectl run test-docker-io --image=nexus.com/library/alpine:latest \
  --overrides='{"spec":{"nodeSelector":{"dragonfly-phase":"test"}}}'
sleep 10

# Cleanup
echo "[Cleanup] Removing test Pods..."
kubectl delete pod test-nginx-2 test-nginx-fallback test-docker-io --ignore-not-found=true

echo "================================================"
echo "Functional test complete!"
echo "================================================"
3-2. Performance measurement script

#!/bin/bash
# test-performance.sh

set -e

NAMESPACE="dragonfly-system"
IMAGE="nexus.com/test-app:large"  # large image (1GB+)
TEST_NODES=5

echo "================================================"
echo "Dragonfly performance measurement (5 nodes)"
echo "================================================"

# Reset the cache
# (run through `sh -c` so the `*` glob is expanded inside the container)
echo "[Prep] Resetting cache..."
for pod in $(kubectl get pods -n ${NAMESPACE} -l component=client -o name); do
  kubectl exec -n ${NAMESPACE} ${pod} -- sh -c 'rm -rf /var/lib/dragonfly/storage/tasks/*' 2>/dev/null || true
done

# Test 1: Cold Start
echo ""
echo "[Test 1/2] Cold Start (first download)"
START=$(date +%s)

# `kubectl run NAME` labels each Pod run=NAME, so `-l run=perf-test-cold` matches
# nothing; a shared label is set explicitly and used for wait/delete.
for i in $(seq 1 ${TEST_NODES}); do
  kubectl run perf-test-cold-${i} --image=${IMAGE} --labels=perf-test=cold \
    --overrides='{"spec":{"nodeSelector":{"dragonfly-phase":"test"}}}' &
done
wait

kubectl wait --for=condition=Ready pod -l perf-test=cold --timeout=600s
END=$(date +%s)
COLD_TIME=$((END - START))

echo "Cold Start Time: ${COLD_TIME}s"

# Delete the Pods
kubectl delete pod -l perf-test=cold

sleep 10

# Test 2: Cache Hit
# Note: nodes that already pulled the image serve it from containerd's local
# image store, so this run mostly measures that cache. To measure P2P transfer,
# remove the image from the nodes first or target nodes that never pulled it.
echo ""
echo "[Test 2/2] Cache Hit (using the cache)"
START=$(date +%s)

for i in $(seq 1 ${TEST_NODES}); do
  kubectl run perf-test-cache-${i} --image=${IMAGE} --labels=perf-test=cache \
    --overrides='{"spec":{"nodeSelector":{"dragonfly-phase":"test"}}}' &
done
wait

kubectl wait --for=condition=Ready pod -l perf-test=cache --timeout=600s
END=$(date +%s)
CACHE_TIME=$((END - START))

echo "Cache Hit Time: ${CACHE_TIME}s"

# Clean up
kubectl delete pod -l perf-test=cache

echo ""
echo "================================================"
echo "Performance results"
echo "================================================"
echo "Cold Start:  ${COLD_TIME}s"
echo "Cache Hit:   ${CACHE_TIME}s"
# Guard against division by zero when the cached run finishes in under 1s
if [ "${CACHE_TIME}" -gt 0 ]; then
  echo "Speedup:     $((COLD_TIME / CACHE_TIME))x"
else
  echo "Speedup:     n/a"
fi
echo "================================================"
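
The speedup line in the script above uses shell integer arithmetic, which truncates (120s vs 45s reports 2x) and fails if the cached run takes 0 seconds. A minimal sketch of a fractional version follows; the function name `speedup` is hypothetical and it expects whole-second inputs like those produced by `date +%s`.

```shell
# Hypothetical helper: prints cold/cached as a one-decimal ratio,
# or "n/a" when the cached time is zero.
speedup() {
  cold=$1
  cached=$2
  if [ "$cached" -le 0 ]; then
    echo "n/a"
    return 0
  fi
  awk -v c="$cold" -v h="$cached" 'BEGIN { printf "%.1fx\n", c / h }'
}

# Usage inside the measurement script:
# echo "Speedup:     $(speedup "${COLD_TIME}" "${CACHE_TIME}")"
```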

3-3. Daily monitoring script

#!/bin/bash
# daily-check.sh

NAMESPACE="dragonfly-system"
INFRA_NS="dragonfly-infra"

echo "========== Dragonfly daily check $(date) =========="

# 1. Pod status
echo ""
echo "=== Pod Status ==="
kubectl get pods -n ${NAMESPACE} -o wide

# 2. Resource usage
echo ""
echo "=== Resource Usage ==="
kubectl top pods -n ${NAMESPACE} 2>/dev/null || echo "Metrics server not available"

# 3. Cache hit rate
echo ""
echo "=== Cache Metrics ==="
for pod in $(kubectl get pods -n ${NAMESPACE} -l component=client -o name | head -1); do
  kubectl exec -n ${NAMESPACE} ${pod} -- curl -s http://localhost:8000/metrics 2>/dev/null | \
    grep -E "dragonfly_client_cache_(hit|miss)_total" || echo "Metrics not available"
done

# 4. Disk usage
echo ""
echo "=== Disk Usage ==="
kubectl exec -n ${NAMESPACE} dragonfly-seed-client-0 -- df -h /var/lib/dragonfly 2>/dev/null || echo "N/A"

# 5. Recent errors
# (`tail` always exits 0, so "No errors" is chosen from the captured text instead of `||`)
echo ""
echo "=== Recent Errors ==="
ERRORS=$(kubectl logs -n ${NAMESPACE} --tail=50 -l component=client 2>/dev/null | grep -i error | tail -10)
echo "${ERRORS:-No errors}"

# 6. MySQL/Redis status
echo ""
echo "=== Infrastructure Status ==="
kubectl get pods -n ${INFRA_NS}

echo ""
echo "================================================"

# Run daily via cron
chmod +x daily-check.sh
# crontab -e
# 0 9 * * * /path/to/daily-check.sh >> /var/log/dragonfly-daily.log 2>&1
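
The daily check prints the raw hit and miss counters. As a rough sketch, a percentage could be derived from them with the filter below. It assumes the counter names used in the script above (`dragonfly_client_cache_hit_total` and `dragonfly_client_cache_miss_total`) and the Prometheus text format, where the sample value is the last field on each line; the function name `cache_hit_rate` is hypothetical.

```shell
# Hypothetical helper: reads Prometheus text-format metrics on stdin and prints
# hit / (hit + miss) as a percentage, or "n/a" if neither counter has samples.
cache_hit_rate() {
  awk '
    /^#/ { next }                                            # skip HELP/TYPE lines
    $1 ~ /^dragonfly_client_cache_hit_total/  { hit  += $NF }
    $1 ~ /^dragonfly_client_cache_miss_total/ { miss += $NF }
    END {
      total = hit + miss
      if (total == 0) { print "n/a"; exit }
      printf "%.1f\n", 100 * hit / total
    }'
}

# Usage against a live Client pod:
# kubectl exec -n dragonfly-system "$pod" -- curl -s http://localhost:8000/metrics | cache_hit_rate
```

Samples with different label sets are summed, so the result is an aggregate for that one Client pod.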

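🚀 Phase 4: Full rollout (100 nodes)

4-1. After verification is complete

# Checklist for the 1-2 week test results
# ✅ Pod stability (no restarts)
# ✅ Cache hit rate >50%
# ✅ P2P behavior confirmed
# ✅ Fallback works correctly
# ✅ Performance improvement confirmed
# ✅ No error logs
# ✅ Resource usage normal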
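
The three measurable items in this checklist (restarts, cache hit rate, error logs) could be enforced by a small gate before Phase 4 is started. This is only a sketch: `phase4_gate` is a hypothetical name, the inputs are numbers you collect yourself, and the thresholds simply mirror the checklist.

```shell
# Hypothetical gate: phase4_gate <total_restarts> <cache_hit_rate_percent> <error_log_lines>
# Prints one line per failed criterion and returns non-zero if any failed.
phase4_gate() {
  restarts=$1
  hit_rate=$2
  errors=$3
  failed=0
  [ "$restarts" -eq 0 ] || { echo "FAIL: pod restarts = $restarts"; failed=1; }
  awk -v r="$hit_rate" 'BEGIN { exit !(r > 50) }' \
    || { echo "FAIL: cache hit rate ${hit_rate}% is not above 50%"; failed=1; }
  [ "$errors" -eq 0 ] || { echo "FAIL: $errors error log lines"; failed=1; }
  [ "$failed" -eq 0 ] && echo "PASS: ready for Phase 4"
  return "$failed"
}

# Usage:
# phase4_gate 0 62.5 0 && ./upgrade-to-production.sh
```

The remaining checklist items (P2P, fallback, performance, resource usage) still need a human judgment.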

4-2. Production values file

# dragonfly-production-values.yaml

# ==========================================
# Phase 4: full rollout (100 nodes)
# ==========================================

global:
  imageRegistry: "nexus.com"

# ==========================================
# Manager (production)
# ==========================================
manager:
  enable: true
  replicas: 3  # test: 1 → production: 3
  
  image:
    repository: nexus.com/dragonflyoss/manager
    tag: v2.3.3
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 1000m   # test: 500m → production: 1000m
      memory: 2Gi  # test: 1Gi → production: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  
  service:
    type: ClusterIP
  
  metrics:
    enable: true
    port: 8000
    serviceMonitor:
      enable: true
      interval: 30s
      labels:
        release: prometheus
  
  podDisruptionBudget:
    minAvailable: 2
  
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - dragonfly-manager
            topologyKey: kubernetes.io/hostname

# ==========================================
# Scheduler (production)
# ==========================================
scheduler:
  enable: true
  replicas: 3  # test: 1 → production: 3
  
  image:
    repository: nexus.com/dragonflyoss/scheduler
    tag: v2.3.3
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 1000m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
  
  service:
    type: ClusterIP
  
  metrics:
    enable: true
    port: 8000
    serviceMonitor:
      enable: true
      interval: 30s
      labels:
        release: prometheus
  
  config:
    scheduler:
      algorithm: default
      backSourceCount: 3
      filterParentLimit: 40
    manager:
      schedulerClusterID: 1
  
  podDisruptionBudget:
    minAvailable: 2
  
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                    - dragonfly-scheduler
            topologyKey: kubernetes.io/hostname

# ==========================================
# Seed Peer (production)
# ==========================================
seedClient:
  enable: true
  replicas: 5  # test: 2 → production: 5
  
  image:
    repository: nexus.com/dragonflyoss/client
    tag: v0.1.118
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 2000m   # test: 1000m → production: 2000m
      memory: 4Gi  # test: 2Gi → production: 4Gi
    limits:
      cpu: 4000m
      memory: 8Gi
  
  persistence:
    enable: true
    size: 200Gi  # test: 50Gi → production: 200Gi
    storageClass: "local-path"
    accessModes:
      - ReadWriteOnce
  
  metrics:
    enable: true
    port: 8000
    serviceMonitor:
      enable: true
      interval: 30s
      labels:
        release: prometheus
  
  config:
    seedPeer:
      enable: true
      type: "super"
      clusterID: 1
    
    proxy:
      registryMirror:
        addr: https://nexus.com
      disableBackToSource: false
      security:
        insecure: false
        cacert: "/etc/containerd/certs.d/nexus.com/ca.crt"
        cert: "/etc/containerd/certs.d/nexus.com/client.crt"
        key: "/etc/containerd/certs.d/nexus.com/client.key"
    
    download:
      concurrentPieceCount: 16  # test: 10 → production: 16
      pieceDownloadTimeout: 60s
      rateLimit: 0
    
    upload:
      rateLimit: 0
      maxConcurrency: 200  # test: 100 → production: 200
    
    storage:
      dir: /var/lib/dragonfly
      taskExpireTime: 24h  # test: 12h → production: 24h
      diskGCThreshold: 85
      diskGCInterval: 30s
      writeBufferSize: 16777216
      readBufferSize: 16777216
  
  volumeMounts:
    - name: containerd-certs
      mountPath: /etc/containerd/certs.d
      readOnly: true
  
  volumes:
    - name: containerd-certs
      hostPath:
        path: /etc/containerd/certs.d
        type: Directory
  
  podDisruptionBudget:
    minAvailable: 3
  
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - dragonfly-seed-client
          topologyKey: kubernetes.io/hostname

# ==========================================
# Client (all nodes!) 🔥
# ==========================================
client:
  enable: true
  
  # Important: nodeSelector removed → deployed to every node!
  # nodeSelector:
  #   dragonfly-phase: test  # delete this line!
  
  image:
    repository: nexus.com/dragonflyoss/client
    tag: v0.1.118
    pullPolicy: IfNotPresent
  
  resources:
    requests:
      cpu: 1000m   # test: 500m → production: 1000m
      memory: 1Gi  # test: 512Mi → production: 1Gi
    limits:
      cpu: 2000m
      memory: 2Gi
  
  persistence:
    enable: true
    size: 30Gi  # test: 20Gi → production: 30Gi
    storageClass: "local-path"
    accessModes:
      - ReadWriteOnce
  
  metrics:
    enable: true
    port: 8000
    serviceMonitor:
      enable: true
      interval: 30s
      labels:
        release: prometheus
  
  config:
    proxy:
      registryMirror:
        addr: https://nexus.com
      listenAddress: "0.0.0.0:65001"
      disableBackToSource: false
      security:
        insecure: false
        cacert: "/etc/containerd/certs.d/nexus.com/ca.crt"
        cert: "/etc/containerd/certs.d/nexus.com/client.crt"
        key: "/etc/containerd/certs.d/nexus.com/client.key"
      
      proxies:
        - regx: "nexus.com/*"
          useHTTPS: true
          direct: true
        - regx: "docker.io/*"
          useHTTPS: true
          direct: true
        - regx: "gcr.io/*"
          useHTTPS: true
          direct: true
        - regx: "ghcr.io/*"
          useHTTPS: true
          direct: true
        - regx: "k8s.gcr.io/*"
          useHTTPS: true
          direct: true
        - regx: "quay.io/*"
          useHTTPS: true
          direct: true
        - regx: "registry.k8s.io/*"
          useHTTPS: true
          direct: true
    
    download:
      concurrentPieceCount: 10
      pieceDownloadTimeout: 30s
      downloadTimeout: 10m
      downloadRetryCount: 3
    
    storage:
      dir: /var/lib/dragonfly
      taskExpireTime: 6h
      diskGCThreshold: 90
      diskGCInterval: 15s
      writeBufferSize: 8388608
      readBufferSize: 8388608
  
  volumeMounts:
    - name: containerd-certs
      mountPath: /etc/containerd/certs.d
      readOnly: true
  
  volumes:
    - name: containerd-certs
      hostPath:
        path: /etc/containerd/certs.d
        type: Directory
  
  enableHost: true

# ==========================================
# dfinit (all nodes!)
# ==========================================
dfinit:
  enable: true
  restartContainerRuntime: true
  
  # nodeSelector removed → applied to every node!
  # nodeSelector:
  #   dragonfly-phase: test  # delete this line!
  
  image:
    repository: nexus.com/dragonflyoss/dfinit
    tag: v0.1.118
    pullPolicy: IfNotPresent
  
  config:
    containerRuntime:
      containerd:
        configPath: /etc/containerd/config.toml
        registries:
          - hostNamespace: nexus.com
            serverAddr: https://nexus.com
            capabilities: ['pull', 'resolve']
          - hostNamespace: docker.io
            serverAddr: https://registry-1.docker.io
            capabilities: ['pull', 'resolve']
          - hostNamespace: gcr.io
            serverAddr: https://gcr.io
            capabilities: ['pull', 'resolve']
          - hostNamespace: ghcr.io
            serverAddr: https://ghcr.io
            capabilities: ['pull', 'resolve']
          - hostNamespace: k8s.gcr.io
            serverAddr: https://k8s.gcr.io
            capabilities: ['pull', 'resolve']
          - hostNamespace: quay.io
            serverAddr: https://quay.io
            capabilities: ['pull', 'resolve']
          - hostNamespace: registry.k8s.io
            serverAddr: https://registry.k8s.io
            capabilities: ['pull', 'resolve']

# ==========================================
# External MySQL (same as the test deployment)
# ==========================================
mysql:
  enable: false

externalMysql:
  migrate: true
  host: mysql.dragonfly-infra.svc.cluster.local
  port: 3306
  username: dragonfly
  password: "DragonflyPassword123!"
  database: dragonfly
  maxOpenConns: 200  # test: 50 → production: 200
  maxIdleConns: 50   # test: 10 → production: 50
  connMaxLifetime: 3600

# ==========================================
# External Redis (same as the test deployment)
# ==========================================
redis:
  enable: false

externalRedis:
  addrs:
    - redis.dragonfly-infra.svc.cluster.local:6379
  password: "RedisPassword123!"
  db: 0
  brokerDB: 1
  backendDB: 2

# ==========================================
# Security
# ==========================================
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

4-3. Full rollout script

#!/bin/bash
# upgrade-to-production.sh

set -e

echo "================================================"
echo "Phase 4: Full rollout (100 nodes)"
echo "================================================"

NAMESPACE="dragonfly-system"
RELEASE_NAME="dragonfly"
CHART_PATH="./dragonfly-1.4.15.tgz"
VALUES_FILE="./dragonfly-production-values.yaml"

# Confirmation prompt
echo ""
echo "⚠️  Warning: this expands the deployment to all nodes (100)."
echo ""
echo "Current test state:"
# Select by label; `grep client` would also count the seed-client pods
CURRENT_CLIENTS=$(kubectl get pods -n ${NAMESPACE} -l component=client --no-headers | wc -l)
echo "${CURRENT_CLIENTS} Client pods"
echo ""
read -p "Continue? (yes/no): " CONFIRM

if [ "$CONFIRM" != "yes" ]; then
  echo "Cancelled."
  exit 1
fi

# Backup
echo ""
echo "[1/5] Backing up the current values..."
BACKUP_FILE="dragonfly-test-backup-$(date +%Y%m%d).yaml"
helm get values ${RELEASE_NAME} -n ${NAMESPACE} > "${BACKUP_FILE}"
echo "✅ Backup complete: ${BACKUP_FILE}"

# Check node labels
# Note: removing the label before the upgrade evicts the test Client pods until
# the new DaemonSet rolls out, and a later rollback to the test values needs the
# label again. Keeping it is harmless.
echo ""
echo "[2/5] Cleaning up node labels..."
echo "Nodes with the test label:"
kubectl get nodes -l dragonfly-phase=test

read -p "Remove the test label? (yes/no): " REMOVE_LABEL

if [ "$REMOVE_LABEL" = "yes" ]; then
  kubectl label nodes -l dragonfly-phase=test dragonfly-phase-
  echo "✅ Labels removed"
else
  echo "⚠️ Labels kept (harmless, because the nodeSelector is removed)"
fi

# Helm Upgrade
echo ""
echo "[3/5] Running Helm upgrade..."
echo "Manager: 1→3, Scheduler: 1→3, Seed Peer: 2→5, Client: 5→100"
echo ""

helm upgrade ${RELEASE_NAME} ${CHART_PATH} \
  --namespace ${NAMESPACE} \
  --values ${VALUES_FILE} \
  --wait \
  --timeout 20m

echo "✅ Upgrade complete"

# Verify
echo ""
echo "[4/5] Checking deployment status..."
sleep 10

echo "Manager:"
kubectl get pods -n ${NAMESPACE} -l app=dragonfly-manager

echo ""
echo "Scheduler:"
kubectl get pods -n ${NAMESPACE} -l app=dragonfly-scheduler

echo ""
echo "Seed Peer:"
kubectl get pods -n ${NAMESPACE} -l app=dragonfly-seed-client

echo ""
echo "Client (DaemonSet):"
CLIENT_COUNT=$(kubectl get pods -n ${NAMESPACE} -l component=client --no-headers | wc -l)
echo "Total Client pods: ${CLIENT_COUNT}"

if [ "${CLIENT_COUNT}" -lt 90 ]; then
  echo "⚠️ Warning: fewer Client pods than expected (${CLIENT_COUNT} < 90, about 100 expected)"
  echo "The Client may not have been deployed to some nodes."
fi

# Check dfinit
echo ""
echo "[5/5] Checking dfinit status..."
kubectl get job -n ${NAMESPACE} -l component=dfinit

echo ""
echo "================================================"
echo "Phase 4 full rollout complete!"
echo "================================================"
echo ""
echo "Next steps:"
echo "1. Monitor the state of all nodes"
echo "2. Measure performance"
echo "3. If issues occur, roll back: ./rollback-to-test.sh"