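Setting Up an Inference Server with Triton

UNGGI LEE · June 27, 2024

People who do machine learning research are usually comfortable with modeling and running experiments, but often struggle when it comes to serving a model for inference or setting up a server.

For decoder-based LLMs there are plenty of inference acceleration tools such as vLLM. But not every field relies only on LLMs; encoder-based models and non-Transformer models are still widely used.

In those cases, Nvidia Triton is an inference tool that is easy and convenient to use. There is quite a lot to learn about Triton itself, but this post focuses on how to configure and use a Triton server.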

1. Model conversion

The first step is to convert the model into a format Triton understands. Triton supports a range of formats, including PyTorch, TensorRT, and ONNX.

Here we convert a Hugging Face PyTorch model to ONNX as an example.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "MyModel/MyModel"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Build an example input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Export to ONNX
torch.onnx.export(
    model,
    (inputs['input_ids'], inputs['attention_mask']),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "logits": {0: "batch_size"}  # logits are (batch, num_labels); only the batch axis is dynamic
    }
)

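Converting a model to ONNX is straightforward. The important part is declaring the dimensions of the input and output tensors correctly, in particular which axes are dynamic. As long as you get that right, exporting the model to ONNX is not difficult.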
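As a small illustration of the dimension bookkeeping, the sketch below builds a dynamic_axes mapping from tensor names. It assumes a sequence classifier whose inputs are shaped (batch, sequence) and whose logits are shaped (batch, num_labels), so the output has only a dynamic batch axis. The helper name build_dynamic_axes is hypothetical and not part of PyTorch.

```python
def build_dynamic_axes(sequence_inputs, batch_only_outputs):
    """Return a dynamic_axes mapping suitable for torch.onnx.export.

    sequence_inputs: names of tensors shaped (batch, sequence)
    batch_only_outputs: names of tensors whose only dynamic axis is the batch
    """
    axes = {}
    for name in sequence_inputs:
        axes[name] = {0: "batch_size", 1: "sequence_length"}
    for name in batch_only_outputs:
        axes[name] = {0: "batch_size"}
    return axes


# Hypothetical helper applied to the tensor names used in this post.
dynamic_axes = build_dynamic_axes(
    sequence_inputs=["input_ids", "attention_mask"],
    batch_only_outputs=["logits"],
)
print(dynamic_axes)
```

The resulting dictionary can be passed as the dynamic_axes argument of torch.onnx.export.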

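2. Setting up the Triton folder structure

Triton requires a specific folder layout. Under the model repository folder (model_repository here), create one folder per model (my_model here) and put that model's config.pbtxt inside it. Next to config.pbtxt, create numbered folders such as 1, 2, 3 and place the model.onnx file converted above inside one of them.
Each numbered folder holds one version of the same model, and once the Triton server is deployed each loaded version can be reached through its own endpoint. Which versions are loaded depends on the model's version policy; by default Triton serves only the latest one.

model_repository/
└── my_model/
    ├── config.pbtxt
    └── 1/
        └── model.onnx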
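The layout can also be created and checked from a script. The sketch below is a minimal example, assuming config.pbtxt lives inside the model folder next to the numbered version folders. The function names are hypothetical, and the files written here are empty placeholders in a temporary directory, not a real model.

```python
from pathlib import Path
import tempfile


def create_model_repository(root, model_name, version=1):
    """Create <root>/<model_name>/<version>/ and return the model directory."""
    model_dir = Path(root) / model_name
    (model_dir / str(version)).mkdir(parents=True, exist_ok=True)
    return model_dir


def missing_files(model_dir, version=1, model_file="model.onnx"):
    """List the expected files (relative to model_dir) that do not exist yet."""
    expected = [model_dir / "config.pbtxt", model_dir / str(version) / model_file]
    return [p.relative_to(model_dir).as_posix() for p in expected if not p.exists()]


with tempfile.TemporaryDirectory() as tmp:
    model_dir = create_model_repository(Path(tmp) / "model_repository", "my_model")
    layout_before = missing_files(model_dir)
    # Placeholders only: a real setup copies the exported model and real config.
    (model_dir / "config.pbtxt").write_text('name: "my_model"\n')
    (model_dir / "1" / "model.onnx").write_bytes(b"")
    layout_after = missing_files(model_dir)

print(layout_before)
print(layout_after)
```

Running a check like this before building the image can catch a misplaced config.pbtxt early, since Triton only reports it when the server starts.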

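3. Writing config.pbtxt

config.pbtxt holds the model configuration and is the core of a Triton setup. This example runs the model on ONNX Runtime, which is selected by setting platform to onnxruntime_onnx as shown below.

max_batch_size sets the largest batch the model accepts in one request. Because it is an upper bound, requests with a smaller batch are handled as well. Requests with a larger batch cannot be processed by the server.

The main things to get right are the input and output dimensions. When max_batch_size is greater than 0, Triton treats the first axis as the batch axis, so dims lists only the remaining axes. instance_group defines whether the model runs on CPU or GPU.

The count field in instance_group sets how many instances of the model Triton creates, that is, how many copies can execute requests in parallel (for KIND_GPU, the count applies per GPU). A suitable value depends on the server's hardware.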

name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 1
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
instance_group [
  {
    kind: KIND_CPU # use KIND_GPU for GPU
    count: 1
  }
]
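To make the relationship between max_batch_size and dims concrete, here is a rough sketch of the shape rule. It assumes the simplified rule that, when max_batch_size is greater than 0, the full tensor shape is the batch axis followed by dims, and that -1 matches any size. The function names are hypothetical and this is not Triton's own validation code.

```python
def matches_dims(shape, dims):
    """True if a concrete shape fits the config dims, where -1 means any size."""
    return len(shape) == len(dims) and all(
        d == -1 or d == s for s, d in zip(shape, dims)
    )


def request_shape_ok(shape, max_batch_size, dims):
    """Check a full tensor shape against max_batch_size and per-item dims.

    When max_batch_size > 0, the first axis is the batch axis and is not
    listed in dims; when it is 0, dims describes the full shape.
    """
    if max_batch_size == 0:
        return matches_dims(shape, dims)
    if not shape:
        return False
    batch, item_shape = shape[0], shape[1:]
    return 1 <= batch <= max_batch_size and matches_dims(item_shape, dims)


print(request_shape_ok([1, 7], max_batch_size=1, dims=[-1]))  # batch of 1, 7 tokens
print(request_shape_ok([2, 7], max_batch_size=1, dims=[-1]))  # batch exceeds the limit
print(request_shape_ok([1, 1], max_batch_size=1, dims=[1]))   # logits for one item
```

With max_batch_size: 1, an input_ids tensor of shape [1, 7] fits dims [ -1 ], while a batch of two does not.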

4. Writing the Dockerfile

To containerize the Triton server you need a Dockerfile. You can install Triton directly on the host, but Nvidia also recommends running it in Docker. A Dockerfile like the following is enough.

FROM nvcr.io/nvidia/tritonserver:23.02-py3
RUN mkdir -p /models
COPY model_repository /models
EXPOSE 8000 8001 8002
CMD ["tritonserver", "--model-repository=/models", "--log-verbose=1"]

To run this setup with docker-compose, create a docker-compose.yml like the one below.

This is an example for a CPU environment.

version: '3.9'
services:
  triton:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    shm_size: '1gb'
    ulimits:
      memlock: -1
      stack: 67108864

For a GPU environment, write it like this.

version: '3.9'

services:
  triton:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
      - "8001:8001"
      - "8002:8002"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      NVIDIA_DRIVER_CAPABILITIES: compute,utility
    shm_size: '1gb'
    ulimits:
      memlock: -1
      stack: 67108864

As you probably know, you start it like this.

# Run in the background
docker compose up --build -d

# To follow the logs
docker compose logs -f
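Once the container is up, it can help to wait until the server reports ready before sending requests. The sketch below polls the /v2/health/ready endpoint of Triton's HTTP API. The host localhost:8000 is an assumption based on the port mapping above, and the function names, retry count, and delay are arbitrary illustrative choices.

```python
import time
import urllib.request


def http_ready(url="http://localhost:8000/v2/health/ready", timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(probe=http_ready, attempts=30, delay=1.0):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling wait_until_ready() after docker compose up returns True as soon as the server answers, or False after roughly 30 seconds with these defaults. The probe argument is injectable so the retry logic can be exercised without a running server.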

Wrapping up

With this setup the Triton inference server is ready. Port 8000 serves the HTTP API, port 8001 serves gRPC, and port 8002 exposes Prometheus-format metrics that monitoring tools (Grafana, etc.) can be connected to.
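For a quick look at the metrics without a monitoring stack, the text served at /metrics on port 8002 can be parsed by hand. The sketch below assumes plain "series value" lines without timestamps. The sample text is illustrative and was not captured from a running server, and parse_metrics is a hypothetical helper.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative sample only; real output comes from http://<host>:8002/metrics.
sample = """\
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="my_model",version="1"} 12
"""
metrics = parse_metrics(sample)
print(metrics)
```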

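If you just want a convenient way to run inference, use port 8000.

The endpoint looks roughly like the one below. my_model is the model name, so use the folder name you set up in the model repository; the number after versions is the numbered version folder.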

http://000.000.005.000:8000/v2/models/my_model/versions/1/infer
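A request to this endpoint can be sketched with the standard library alone. The example below builds a JSON body in the v2 inference protocol format for a single sequence, using the tensor names from config.pbtxt. The token ids are made-up placeholders (real ids come from the tokenizer), and build_infer_request and infer are hypothetical helper names.

```python
import json
import urllib.request


def build_infer_request(input_ids, attention_mask):
    """Build a v2 inference request body for one sequence (batch size 1)."""
    if len(input_ids) != len(attention_mask):
        raise ValueError("input_ids and attention_mask must have the same length")
    shape = [1, len(input_ids)]
    return {
        "inputs": [
            {"name": "input_ids", "shape": shape,
             "datatype": "INT64", "data": list(input_ids)},
            {"name": "attention_mask", "shape": shape,
             "datatype": "INT64", "data": list(attention_mask)},
        ],
        "outputs": [{"name": "logits"}],
    }


def infer(url, payload, timeout=10.0):
    """POST the payload to the infer endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder token ids, not real tokenizer output.
payload = build_infer_request([101, 7592, 102], [1, 1, 1])
print(json.dumps(payload, indent=2))
```

With a running server, infer(url, payload) with the endpoint URL above should return a JSON object whose outputs list contains the logits tensor.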

UNGGI LEE

NLP Researcher & ML Engineer @i-Scream Edu
