A Bumpy Ride with an NPU

Minhan Cho · October 20, 2025

npu, work

This document will be updated regularly until my NPU access expires on 2025-12-31!

What is an NPU?

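An NPU (Neural Processing Unit) is an accelerator dedicated to neural-network computation. Compared with a GPU, it is optimized for performance per watt on the basic deep learning operations.
Goal: GPUs are general-purpose and used for everything from model training to graphics, but they are not optimized for running large models at low precision (e.g. int8, int4). NPUs focus on improving exactly that.
- As a result, NPUs are friendly to, and specialized for, quantization and low-precision settings (e.g. int8, int4, fp8)
- GPU: the driver schedules work kernel by kernel / NPU: the compiler plans compute, memory, and dataflow ahead of time (prefill/decode split, KV cache placement, etc.). $\rightarrow$ A compiler is required, which is a key difference from the GPU!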
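To make "low-precision friendly" concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is a generic textbook scheme, not necessarily what the RBLN compiler does, and the function names `quantize_int8` and `dequantize_int8` are made up for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any positive scale works.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floats from the int8 codes."""
    return [c * scale for c in codes]


weights = [0.0, 1.0, -0.25, 0.6]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
# Each restored value differs from the original by at most about scale / 2.
```

The single `scale` is the "scale tuning" knob: a poorly chosen scale either clips large values or wastes the integer range on values that never occur.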
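Differences between using a GPU and using an NPU
- Compilation: a GPU can run inference right after the model is loaded (no separate compile step), whereas an NPU goes through two stages: compile $\rightarrow$ load into the runtime
- Static shapes: the NPU fixes values or upper bounds such as max_seq_len, batch size, and KV partition length at compile time and optimizes for them. The GPU is tolerant of dynamic lengths.
- KV cache strategy: on the NPU, the compiler plans on-chip placement, offloading, and partitioning (e.g. rbln_kvcache_partition_len). On the GPU, the cache is a runtime tensor and comparatively flexible.
- Quantization fit: the NPU is optimized on the premise of INT8/FP8 $\rightarrow$ quantizing ahead of time and tuning scales is key to performance. On the GPU, FP16/BF16 is often good enough.
- Performance profile: thanks to static graphs and on-chip reuse, the NPU does well on per-token latency (especially during decoding) and power efficiency. The GPU is strong on large-batch throughput.
- Failure modes: the NPU has strong notions of runtime and context (see the error cases below!) $\rightarrow$ duplicate runtime creation and leftover contexts (holding driver memory) are common issues. On the GPU it is mostly OOM or kernel failures.
- Operational tooling: nvidia-smi for GPUs; for NPUs, a vendor tool (e.g. rbln-stat) to inspect devices, contexts, and memory. Cleanup and restart procedures matter (which is why I end up reading the official docs a lot..)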
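Because shapes are fixed at compile time, it can help to sanity-check the numbers before starting a long compile. The sketch below assumes the KV cache is split into `max_seq_len / partition_len` partitions and that `max_seq_len` should be a multiple of the partition length. Both are assumptions, although the first agrees with the `num_blocks: 5` line that the compile log in this post reports for `rbln_max_seq_len=40_960` and `rbln_kvcache_partition_len=8192` at batch size 1. `kv_partitions` and `fits_static_shape` are hypothetical helper names.

```python
def kv_partitions(max_seq_len: int, partition_len: int) -> int:
    """Number of KV cache partitions for one sequence (assumed formula)."""
    if max_seq_len % partition_len != 0:
        # Assumption: the compiled length must divide evenly into partitions.
        raise ValueError("max_seq_len should be a multiple of partition_len")
    return max_seq_len // partition_len


def fits_static_shape(prompt_tokens: int, max_new_tokens: int, max_seq_len: int) -> bool:
    """True if prompt plus generated tokens stay within the compiled upper bound."""
    return prompt_tokens + max_new_tokens <= max_seq_len


print(kv_partitions(40_960, 8192))
print(fits_static_shape(prompt_tokens=40_000, max_new_tokens=960, max_seq_len=40_960))
```

On a GPU a request that is too long usually just costs more memory; with a statically compiled model it has to be rejected or truncated up front, which is what the second helper checks.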

Environment

Cloud: Elice Cloud
NPU: Rebellions ATOM+ (6 vCPU, 72 GB RAM)

NPU Howto

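Source: Rebellions quick start

Just follow the quick start above.

Caveats

You need to install the Rebellions compiler, and for that you have to sign up for the Rebellions portal. I think I waited about two days after submitting the application.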

pip3 install -i https://pypi.rbln.ai/simple/ rebel-compiler

For some reason it seems to clash with a uv environment, so for the first time in a while I activated a virtual environment with venv.
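When environments get tangled like this, a quick check from inside Python shows which interpreter is active and whether the package actually landed in it. This is a small sketch using only the standard library; the helper names are made up, and `rebel-compiler` is the distribution name from the pip command above.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


print("python executable:", sys.executable)
print("inside a venv:", in_virtualenv())
print("rebel-compiler:", installed_version("rebel-compiler") or "not installed")
```

If the executable path points outside the activated environment, the install went to a different interpreter than the one running the script.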

Error log

1. NPU runtime init error (resolved)

Code: source

import gc

from transformers import AutoTokenizer
from optimum.rbln import RBLNAutoModelForCausalLM

# Compile and export the model
model_id = "Qwen/Qwen3-1.7B"
model_save_dir = "rbln-Qwen3-1.7B"

# 1) Compile
model = RBLNAutoModelForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,  # export a PyTorch model to RBLN model with optimum
    rbln_batch_size=1,
    rbln_max_seq_len=40_960,  # default "max_position_embeddings"
    rbln_attn_impl="flash_attn",
    rbln_kvcache_partition_len=8192,  # Length of KV cache partitions for flash attention
)

# Save the compiled model to disk
model.save_pretrained(model_save_dir)

del model
gc.collect()

# 2) Load the compiled model & create the runtime
model = RBLNAutoModelForCausalLM.from_pretrained(
    model_id=model_save_dir,
    export=False,  # already compiled, so no export needed
    rbln_batch_size=1,
    rbln_max_seq_len=40_960,
    rbln_attn_impl="flash_attn",
    rbln_kvcache_partition_len=8192,
)

# 3) Inference follows.. -> omitted

Error log

elicer@377f0f8a74dd:~/llm_test$ source venv_llm/bin/activate
(venv_llm) elicer@377f0f8a74dd:~/llm_test$ python sllm_test.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:46<00:00, 23.21s/it]
2025-10-19 05:09:32,375 INFO [optimum.rbln] [KVCache] Compiling with num_blocks: 5
2025-10-19 05:09:36,118 INFO [rebel-compiler] Export done. Elapsed time: 0:00:03
2025-10-19 05:09:56,532 INFO [rebel-compiler] Exported model conversion done. Elapsed time: 0:00:13, Memory change: 2.27 MB
2025-10-19 05:09:56,648 INFO [rebel-compiler] RBLN SDK compiler version: 0.9.1
2025-10-19 05:09:56,649 INFO [rebel-compiler] -- Target NPU: RBLN-CA22
2025-10-19 05:09:56,649 INFO [rebel-compiler] -- Tensor parallel size: 1
2025-10-19 05:09:56,657 INFO [rebel-compiler] +------------------------------------------------+
2025-10-19 05:09:56,657 INFO [rebel-compiler] |Compile(#0), mod_name=bcacb2, input_info_index=0|
2025-10-19 05:09:56,657 INFO [rebel-compiler] +------------------------------------------------+
Computation graph generation ████████████████████████████████████████ 100% 00:16
Computation graph optimization  ████████████████████████████████████████ 100% 00:18
2025-10-19 05:10:45,456 INFO [rebel-compiler] Export done. Elapsed time: 0:00:02
2025-10-19 05:11:05,681 INFO [rebel-compiler] Exported model conversion done. Elapsed time: 0:00:13, Memory change: 0.00 MB
2025-10-19 05:11:05,799 INFO [rebel-compiler] RBLN SDK compiler version: 0.9.1
2025-10-19 05:11:05,800 INFO [rebel-compiler] -- Target NPU: RBLN-CA22
2025-10-19 05:11:05,800 INFO [rebel-compiler] -- Tensor parallel size: 1
2025-10-19 05:11:05,801 INFO [rebel-compiler] +------------------------------------------------+
2025-10-19 05:11:05,801 INFO [rebel-compiler] |Compile(#1), mod_name=bcacb2, input_info_index=1|
2025-10-19 05:11:05,801 INFO [rebel-compiler] +------------------------------------------------+
Computation graph generation ████████████████████████████████████████ 100% 00:14
Computation graph optimization  ████████████████████████████████████████ 100% 00:02
2025-10-19 05:11:27,709 INFO [rebel-compiler] Serializing compiled model to /tmp/tmpdw4ulk26/prefill.rbln ...
2025-10-19 05:11:31,758 INFO [rebel-compiler] Compiled model serialized. Elasped time: 0:00:04
2025-10-19 05:11:31,759 INFO [rebel-compiler] Serializing compiled model to /tmp/tmpdw4ulk26/decoder_batch_1.rbln ...
2025-10-19 05:11:32,445 INFO [rebel-compiler] Compiled model serialized. Elasped time: 0:00:00
2025-10-19 05:11:42,593 INFO [rebel-compiler] Load model completed. Elasped time: 0:00:03
2025-10-19 05:11:43,526 INFO [rebel-compiler] Load model completed. Elasped time: 0:00:00
2025-10-19 05:11:43,546 ERROR [rebel-compiler] 
RBLNRuntimeError: 
Failed to create RBLN runtime: INIT_ALREADY_CREATED: A runtime has already been created for that compiled model (Context failed to be created, compile_id=0). Try creating a runtime on a different NPU(s), or use an existing runtime.

If you only need to compile the model without loading it to NPU, you can use:
  from_pretrained(..., rbln_create_runtimes=False) or
  from_pretrained(..., rbln_config={..., 'create_runtimes': False})

To check your NPU status, run the 'rbln-stat' command in your terminal.
Make sure your NPU is properly installed and operational.

Analysis
- Downloading the model (omitted here) and compiling it both work fine (see the [rebel-compiler] lines in the log)
- The problem occurs afterwards, when .from_pretrained() loads the compiled model
Fix

import gc

from transformers import AutoTokenizer
from optimum.rbln import RBLNAutoModelForCausalLM

# Compile and export the model
model_id = "Qwen/Qwen3-1.7B"
model_save_dir = "rbln-Qwen3-1.7B"

# 1) Compile only (no runtime creation)
# rbln_create_runtimes=False or rbln_config={'create_runtimes': False} -> without this, a runtime is created here and collides with the one below, raising the error!
model = RBLNAutoModelForCausalLM.from_pretrained(
    model_id=model_id,
    export=True,  # export a PyTorch model to RBLN model with optimum
    rbln_batch_size=1,
    rbln_max_seq_len=40_960,  # default "max_position_embeddings"
    rbln_attn_impl="flash_attn",
    rbln_kvcache_partition_len=8192,  # Length of KV cache partitions for flash attention
    rbln_create_runtimes=False,  # <- added this!!
)

# Save the compiled model to disk
model.save_pretrained(model_save_dir)

del model
gc.collect()

# 2) Load the compiled model & create the runtime
model = RBLNAutoModelForCausalLM.from_pretrained(
    model_id=model_save_dir,
    export=False,  # already compiled, so no export needed
    rbln_batch_size=1,
    rbln_max_seq_len=40_960,
    rbln_attn_impl="flash_attn",
    rbln_kvcache_partition_len=8192,
)

# 3) Inference follows.. -> omitted

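Add rbln_create_runtimes=False or rbln_config={'create_runtimes': False} when compiling.
If you don't pass the rbln_create_runtimes argument when compiling, the model is loaded onto the NPU runtime as part of compilation, so it collides with the runtime created later for inference. Source: the model API page of the official Rebellions documentation, version 0.7.1.
The fix was right there in the log, but I assumed it was just a warning and ignored it... oops!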
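A related way to avoid the clash is to keep compilation and loading in separate code paths, and to skip compilation entirely when the compiled directory already exists. The sketch below shows only the control flow: `load_or_compile`, `compile_fn`, and `load_fn` are hypothetical names, with the two callables standing in for the compile-only and load `from_pretrained` calls from the fix, so nothing here is part of the optimum-rbln API.

```python
from pathlib import Path
from typing import Any, Callable


def load_or_compile(
    save_dir: str,
    compile_fn: Callable[[str], None],
    load_fn: Callable[[str], Any],
) -> Any:
    """Compile into save_dir only if it is missing, then load from it.

    compile_fn is expected to compile WITHOUT creating a runtime and to save
    the result under save_dir; load_fn creates the one and only runtime.
    """
    if not Path(save_dir).is_dir():
        compile_fn(save_dir)
    return load_fn(save_dir)
```

With this split, rerunning the script reuses the saved artifacts instead of spending minutes recompiling, and only one code path ever creates a runtime.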

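2. Context remains on the NPU even after the NPU process is killed (resolved)

Code: same as above
Error log
Analysis
- Related to the error above, where a runtime gets attached during compilation: when the process on the NPU dies, its context (CTX) should be released too, but it is still there (see Memalloc)
- Even after fixing problem 1 (inference runs normally), the context still remains $\rightarrow$ should I just copy and paste the rbln model zoo example code?
- It gets cleared after some time (a day or two?), or at least seems to, but in a cloud environment I can't detach and reattach the NPU on the spot, so it is a pain
Fix
- ~~If it's cloud rather than local, there's no fix. I emailed the Elice Cloud contact.~~
- Encapsulating the code leaves no context behind!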
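One reading of "encapsulating" is to keep the model inside a function, so that no module-level variable holds on to it once inference is done. The sketch below is an assumption about what that could look like, not a confirmed mechanism for releasing NPU contexts: `run_scoped`, `load_fn`, and `infer_fn` are hypothetical names, and `load_fn` stands in for the `from_pretrained` call that creates the runtime.

```python
import gc
from typing import Any, Callable


def run_scoped(load_fn: Callable[[], Any], infer_fn: Callable[..., Any], *args, **kwargs) -> Any:
    """Load a model, run inference, and drop the model before returning.

    The model only ever lives in this function's local scope, so it becomes
    unreachable as soon as the call finishes, even if infer_fn raises.
    """
    model = load_fn()
    try:
        return infer_fn(model, *args, **kwargs)
    finally:
        # Drop the last reference and force a collection pass so that any
        # finalizers attached to the model run now rather than at exit.
        del model
        gc.collect()
```

Calling this from under an `if __name__ == "__main__":` guard keeps the whole lifetime of the model inside one call, in contrast to the scripts above where `model` is a module-level global.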
