windows driver npu: building the runtime dll

wangki · May 19, 2026

C Rust TPU coral cpp dll driver gasket google kmdf libedgetpu npu python runtime wdf windows

windows_driver_npu

9/15

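Until now, development was done as an MVP inside a console application. The next step is to package it as a DLL so it can be used from multiple languages. Having built DLLs before, I was able to put it together without much trouble.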

Features

The main job of the runtime DLL is to wrap communication with npu_driver; in other words, it abstracts the IOCTL commands behind a small API. Three functions seem to be needed:

init (memory allocation)
infer (descriptor submission)
free (memory release)

Since the goal for now is just to confirm that it works, the DLL targets the SSD model specifically rather than being general-purpose.

Implementation

The code is mostly copied from the existing console application. The header declares the following:

// Initialize a runtime instance for the face SSD model.
//   model_path_utf8   : UTF-8 path to ssd_mobilenet_v2_face_*.tflite
//   anchors_bin_path_utf8 : UTF-8 path to anchors.bin (float32 [num_anchors][4])
//   out_handle        : on success, receives a non-NULL handle
NPU_API npu_status_t npu_runtime_init_face_ssd(
    const char*   model_path_utf8,
    const char*   anchors_bin_path_utf8,
    npu_handle_t* out_handle);

// Query model input dimensions (w, h, c). Bytes = w * h * c.
NPU_API npu_status_t npu_runtime_get_input_size(
    npu_handle_t h,
    uint32_t* out_w,
    uint32_t* out_h,
    uint32_t* out_c);

// Run one inference.
//   image_rgb       : RGB uint8 packed, must be exactly w*h*c bytes
//   image_byte_len  : length of image_rgb (sanity check)
//   out_dets        : caller-allocated array of `max` entries (may be NULL if max==0)
//   max             : capacity of out_dets
//   out_count       : actually written (always <= max)
NPU_API npu_status_t npu_runtime_infer_image(
    npu_handle_t     h,
    const uint8_t*   image_rgb,
    uint32_t         image_byte_len,
    npu_detection_t* out_dets,
    uint32_t         max,
    uint32_t*        out_count);

// Release all resources held by the handle (chip buffers, device handle, COM).
// Safe to call with NULL.
NPU_API void npu_runtime_free(npu_handle_t h);
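
The header comments say the image buffer must be exactly w*h*c bytes and that image_byte_len is a sanity check. A caller-side check along those lines could look like the sketch below. The helper names `input_byte_len` and `image_len_matches` are hypothetical and are not part of the DLL; the sketch only assumes the three dimensions come from npu_runtime_get_input_size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of the DLL): computes w * h * c.
   Returns false if a dimension is zero, out_bytes is NULL,
   or the product does not fit in uint32_t. */
bool input_byte_len(uint32_t w, uint32_t h, uint32_t c, uint32_t *out_bytes)
{
    if (w == 0 || h == 0 || c == 0 || out_bytes == NULL)
        return false;

    uint64_t total = (uint64_t)w * h;   /* 64-bit math avoids silent overflow */
    if (total > UINT32_MAX)
        return false;

    total *= c;
    if (total > UINT32_MAX)
        return false;

    *out_bytes = (uint32_t)total;
    return true;
}

/* Hypothetical helper: true only when the caller's buffer length is
   exactly the size implied by the model input dimensions. */
bool image_len_matches(uint32_t w, uint32_t h, uint32_t c,
                       uint32_t image_byte_len)
{
    uint32_t expected = 0;
    return input_byte_len(w, h, c, &expected) && expected == image_byte_len;
}
```

For a 320x320 RGB input (example values only, not taken from the model), the expected length is 307200 bytes, and any other length would be rejected before calling npu_runtime_infer_image.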

The definitions are fairly involved, so refer to the GitHub repository linked at the end of this post.

Actual usage

To test real-time inference with a camera, I chose Python: OpenCV is easy to use there, and ctypes makes it easy to call into the DLL. With help from an agent, I could also see results quickly.

It finds faces accurately. However, a problem showed up: at a certain point, inference failed and the driver returned an error.

Troubleshooting

[DIE] infer failed at frame=128 infer_count=127
[DIE] uptime since loop start: 0.98s (last interval)
[DIE] error: infer_image failed (status=6): IOCTL_INFER_NEW failed, GetLastError=483
Traceback (most recent call last):
  File "C:\work\real\npu_runtime_camera.py", line 198, in <module>
    sys.exit(main())
  File "C:\work\real\npu_runtime_camera.py", line 140, in main
    dets = npu.infer(rgb, max_dets=50)
  File "C:\work\real\npu_runtime_demo.py", line 115, in infer
    raise RuntimeError(f"infer_image failed (status={st}): {self.last_error()}")

The error log above was produced. On investigation, the crash occurred at frame 128 every time.

Every frame runs one infer, which advances the tail. I found that the CSR the tail is written to is designed to accept only values from 0 to 255. The relevant line in the code:

apex_write_register(bar2, APEX_REG_INSTR_QUEUE_TAIL, pDc->DescRingTail);

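Each submission posts two descriptors, the param cache descriptor and the infer descriptor, so the tail grows by 2 per frame. At that rate the 128th frame pushes the tail to 256 (128 × 2), one past the largest value the CSR accepts, which matches the failure at frame 128. The value of pDc->DescRingTail therefore has to be reduced modulo 256 before it is written: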

apex_write_register(bar2, APEX_REG_INSTR_QUEUE_TAIL, pDc->DescRingTail % 256);

With this change, the problem no longer occurred!
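
The arithmetic behind the fix can be checked in isolation with a small simulation. The sketch below is standalone C, not driver code: `TAIL_REG_LIMIT`, `DESCS_PER_FRAME`, `first_rejected_frame`, and `wrapped_tail` are hypothetical names, and the only facts taken from this post are that the CSR accepts 0 to 255 and that each frame adds 2 to the tail.

```c
#include <stdint.h>

/* From the post: the tail CSR only accepts values 0..255. */
#define TAIL_REG_LIMIT 256u
/* From the post: one param cache + one infer descriptor per frame. */
#define DESCS_PER_FRAME 2u

/* Simulates the unwrapped counter. Returns the 1-based frame whose tail
   value no longer fits the CSR, or 0 if all `frames` submissions fit. */
uint32_t first_rejected_frame(uint32_t frames)
{
    uint32_t tail = 0;
    for (uint32_t frame = 1; frame <= frames; ++frame) {
        tail += DESCS_PER_FRAME;      /* counter keeps growing */
        if (tail >= TAIL_REG_LIMIT)
            return frame;             /* value is out of the CSR range */
    }
    return 0;
}

/* The fix: keep the running counter as is, but write only its
   value modulo 256 to the register. */
uint32_t wrapped_tail(uint32_t desc_ring_tail)
{
    return desc_ring_tail % TAIL_REG_LIMIT;
}
```

Here first_rejected_frame(1000) returns 128, in line with the frame=128 in the log, while wrapped_tail(256) returns 0, so the 128th submission lands back at the start of the ring.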

Conclusion

The npu driver works well!! I'm very happy.

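There is still a lot to fix and improve. The anchor and quant params values are currently hard-coded, which needs to change. The logging code added to check byte-identical output also needs to be removed, and the param cache descriptor is currently submitted on every infer, so it should be changed to be submitted only once at the start.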
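
One way the last item could be approached is a per-handle flag that records whether the param cache descriptor has already been submitted. This is only a sketch of the idea under that assumption: `runtime_state_t` and `descriptors_for_next_infer` are hypothetical and do not appear in the repository, and whether the chip keeps the param cache across infers still has to be verified on hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-handle state; the real runtime's fields may differ. */
typedef struct {
    bool     param_cache_submitted; /* set once the first infer has run */
    uint32_t desc_ring_tail;        /* running count of submitted descriptors */
} runtime_state_t;

/* Returns how many descriptors this infer call would submit:
   2 on the first call (param cache + infer), 1 on every later call.
   Also advances the running tail counter by that amount. */
uint32_t descriptors_for_next_infer(runtime_state_t *st)
{
    uint32_t n = 1;                        /* the infer descriptor itself */
    if (!st->param_cache_submitted) {
        n += 1;                            /* one-time param cache descriptor */
        st->param_cache_submitted = true;
    }
    st->desc_ring_tail += n;
    return n;
}
```

With this scheme the tail would advance by 1 per frame after the first, so the modulo applied to the register write stays necessary; the wrap just happens later.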

https://github.com/wangki-kyu/npu_driver/tree/main/npu_runtime
