Fine-tuning + Continual Learning: Concurrent Training Analysis Results

seongyun · July 20, 2025

AI Continual Learning Fine Tuning

Hancom Project

Feasibility Analysis Results: Running Fine-tuning + Continual Learning Concurrently

Current GPU Memory Status

Total VRAM: 23,028 MiB (22.5 GiB)
In use: 10,546 MiB (10.3 GiB)
Available: 12,482 MiB (12.2 GiB)
Utilization: 45.8%

Continual Learning Memory Requirements

Base components:

DeepSeek-Coder 6.7B (4-bit quantized): ~3,500 MiB
LoRA adapters: ~50 MiB
Optimizer state: ~200 MiB
Gradient memory: ~100 MiB
Activation memory: ~300 MiB

Additional Continual Learning requirements:

EWC (Elastic Weight Consolidation): ~200 MiB
MER (Meta-Experience Replay): ~200 MiB
Fisher Information Matrix: ~100 MiB

Estimated total memory: ~4,650 MiB (4.5 GiB)
Headroom left for batch processing: ~7,832 MiB (7.6 GiB), i.e. 12,482 MiB available minus the 4,650 MiB estimate
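The budget above is plain addition and subtraction, so it is easy to re-check whenever one of the estimates changes. The sketch below only re-adds the numbers listed in this post; the dictionary keys and the `mib_to_gib` helper are names made up for illustration.

```python
# Component estimates in MiB, copied from the lists above.
BASE = {
    "model_4bit": 3500,
    "lora_adapters": 50,
    "optimizer_state": 200,
    "gradients": 100,
    "activations": 300,
}
CONTINUAL = {"ewc": 200, "mer": 200, "fisher": 100}

FREE_MIB = 12_482  # VRAM currently available, from the status section


def mib_to_gib(mib: float) -> float:
    """Convert MiB to GiB (1 GiB = 1024 MiB)."""
    return mib / 1024


total_mib = sum(BASE.values()) + sum(CONTINUAL.values())
headroom_mib = FREE_MIB - total_mib

print(f"estimated total: {total_mib} MiB ({mib_to_gib(total_mib):.1f} GiB)")
print(f"headroom:        {headroom_mib} MiB ({mib_to_gib(headroom_mib):.1f} GiB)")
```

Keeping the estimates in one place like this makes it simple to see how much headroom remains if, say, the activation estimate turns out to be too low at `max_length: 2048`.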

Recommended Settings (config.yaml)

# Memory-efficient settings
batch_size: 8
gradient_accumulation_steps: 4  # effective batch size: 8 x 4 = 32
max_length: 2048
use_amp: true
use_gradient_checkpointing: true
target_gpu_util: 0.8
checkpoint_freq: 20

# Training settings
max_epochs: 3
learning_rate: 2e-4
weight_decay: 0.01

# Continual Learning settings
ewc:
  lambda: 5000.0
  normalize: true

mer:
  buffer_size: 5120
  beta: 0.1
  gamma: 0.1
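The `ewc` block sets the penalty strength (`lambda`) and a `normalize` flag. As a rough sketch of what these two settings usually control, the snippet below computes the standard EWC penalty, (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2, on plain Python lists. How cc_train.py actually normalizes the Fisher values is not shown in this post, so dividing by the largest entry is an assumption, and `normalize_fisher` and `ewc_penalty` are hypothetical names.

```python
from typing import Sequence


def normalize_fisher(fisher: Sequence[float]) -> list[float]:
    """Scale Fisher values into [0, 1] by dividing by the largest entry.

    Assumption: this is one common choice, not necessarily what cc_train.py does.
    """
    peak = max(fisher)
    return [f / peak for f in fisher] if peak > 0 else list(fisher)


def ewc_penalty(
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher: Sequence[float],
    lam: float = 5000.0,      # ewc.lambda from config.yaml
    normalize: bool = True,   # ewc.normalize from config.yaml
) -> float:
    """Return (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    weights = normalize_fisher(fisher) if normalize else list(fisher)
    return 0.5 * lam * sum(
        w * (p - a) ** 2 for w, p, a in zip(weights, params, anchor_params)
    )


# Toy example: only the second parameter has moved away from its anchor.
penalty = ewc_penalty(
    params=[1.0, 2.0],
    anchor_params=[1.0, 1.9],
    fisher=[0.5, 2.0],
)
print(f"EWC penalty: {penalty:.2f}")
```

In training, this term would be added to the task loss, so parameters with large Fisher values are held close to the values learned on the earlier dataset.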

Estimated Training Time

train1.jsonl (30,000 samples): ~1.5 hours (3 epochs)
train2.jsonl (27,000 samples): ~1.4 hours (3 epochs)
Estimated total time: ~2.9 hours
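These estimates can be sanity-checked against the recommended settings. The sketch below assumes the effective batch size of 32 from config.yaml and derives the number of optimizer steps and the throughput each estimate implies; `optimizer_steps` and `implied_throughput` are illustrative helper names, not part of cc_train.py.

```python
import math

EFFECTIVE_BATCH = 8 * 4  # batch_size * gradient_accumulation_steps
EPOCHS = 3               # max_epochs


def optimizer_steps(num_samples: int, epochs: int = EPOCHS,
                    effective_batch: int = EFFECTIVE_BATCH) -> int:
    """Optimizer updates needed to see every sample `epochs` times."""
    return math.ceil(num_samples / effective_batch) * epochs


def implied_throughput(num_samples: int, hours: float,
                       epochs: int = EPOCHS) -> float:
    """Samples per second implied by a wall-clock estimate."""
    return num_samples * epochs / (hours * 3600)


for name, samples, hours in [("train1.jsonl", 30_000, 1.5),
                             ("train2.jsonl", 27_000, 1.4)]:
    steps = optimizer_steps(samples)
    rate = implied_throughput(samples, hours)
    print(f"{name}: {steps} optimizer steps, ~{rate:.1f} samples/s")
```

Both files imply roughly 16 to 17 samples per second, so the two estimates are at least consistent with each other. If the measured throughput in the first few hundred steps is far below that, the time estimate should be revised.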

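Key Optimization Features in cc_train.py

DynamicBatchSizeManager: automatically reduces the batch size when an OOM error occurs
SpotInterruptionHandler: detects AWS spot instance interruptions and saves a checkpoint
Mixed-precision training: uses torch.amp.autocast to save memory
Gradient accumulation: splits each batch into smaller micro-batches when memory is tight
4-bit quantization: uses BitsAndBytesConfig to shrink the model size substantially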

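Cautions and Monitoring

Memory monitoring: check VRAM usage periodically with nvidia-smi
Checkpoints: confirm that a checkpoint is saved automatically every 20 steps
Batch size adjustment: check the logs for automatic adjustments by DynamicBatchSizeManager
Spot instances: setting up AWS interruption notices is recommended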
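For the memory-monitoring item, a small script can be easier to log than the full nvidia-smi screen. The sketch below queries nvidia-smi in CSV mode and computes utilization; `parse_memory_line` and `read_gpu_memory` are hypothetical helper names, and the parser is exercised offline with the numbers from this post so it runs without a GPU.

```python
import subprocess

# Ask nvidia-smi for total and used memory in MiB, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line: str) -> tuple[int, int, float]:
    """Parse 'total, used' (MiB) into (total, used, utilization in percent)."""
    total, used = (int(field.strip()) for field in line.split(","))
    return total, used, 100.0 * used / total


def read_gpu_memory(gpu_index: int = 0) -> tuple[int, int, float]:
    """Run nvidia-smi and parse the line for one GPU (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory_line(out.strip().splitlines()[gpu_index])


# Offline check using the figures from the GPU memory status section.
total, used, util = parse_memory_line("23028, 10546")
print(f"{used}/{total} MiB in use ({util:.1f}%)")
```

Calling `read_gpu_memory()` from a logging callback every few steps gives a record that can be compared with `target_gpu_util: 0.8`.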

VRAM Limits and OOM Prevention: Practical Options & Libraries

| No. | Method / library | Key function | Usage example (CUDA environment) |
|---|---|---|---|
| 1 | `torch.cuda.set_per_process_memory_fraction` (PyTorch ≥ 1.9) | Caps the memory one process may allocate on a given GPU, leaving the rest of VRAM reserved | `import torch`<br>`torch.cuda.set_per_process_memory_fraction(0.7, 0)  # use at most 70% of GPU 0` |
| 2 | `PYTORCH_CUDA_ALLOC_CONF` environment variable | Tunes the caching allocator and reduces fragmentation, which lowers the chance of OOM | `export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:128,garbage_collection_threshold:0.6"`<br>`python train.py` |
| 3 | `CUDAPluggableAllocator` (PyTorch 2.1+) | Swaps in an external or user-defined memory pool, e.g. a UVM or live-swapping allocator | `from torch.cuda.memory import CUDAPluggableAllocator, change_current_allocator`<br>`alloc = CUDAPluggableAllocator("./myalloc.so", "my_malloc", "my_free")`<br>`change_current_allocator(alloc)` |
| 4 | Accelerate `dispatch_model` / `max_memory` | Sets a per-GPU VRAM ceiling (in MiB) when loading a model or running inference | `from accelerate import init_empty_weights, load_checkpoint_and_dispatch`<br>`max_mem = {0: "19000MiB"}`<br>`model = load_checkpoint_and_dispatch(model, checkpoint_path, device_map="auto", max_memory=max_mem)  # model is built under init_empty_weights(); checkpoint_path is a placeholder` |
| 5 | DeepSpeed ZeRO-Offload | Offloads optimizer state and gradients to the CPU, cutting VRAM use substantially | `deepspeed --zero_stage 3 --offload_param --offload_optimizer ...` (these flags appear to be training-script arguments; in DeepSpeed itself the same settings normally live in the JSON config under `zero_optimization`) |
| 6 | xFormers / Flash-Attention | Memory-efficient attention kernels; 40–60% lower VRAM for sequence models | `from transformers import AutoModel`<br>`model = AutoModel.from_pretrained(..., attn_implementation="flash_attention_2")` |
| 7 | Gradient checkpointing (HF Transformers `gradient_checkpointing_enable`) | Recomputes intermediate activations instead of storing them; VRAM ≈ ½ | `model.gradient_checkpointing_enable()` |
| 8 | GPUtil (+ pynvml) monitoring callback | Monitors VRAM in real time; at a threshold, calls `torch.cuda.empty_cache()` and shrinks the batch | `import GPUtil, torch`<br>`if GPUtil.getGPUs()[0].memoryUtil > 0.85: torch.cuda.empty_cache()` |
