

This post summarizes several features available in the NVIDIA Triton Inference Server.
The pipeline connecting each stage is defined in ensemble_scheduling > step. It is expressed through key-value pairs in each composing model's input_map / output_map:
- key: the input/output name of the composing model itself
- value: the input/output name (or intermediate tensor name) written in the ensemble model's config.pbtxt
- model_name must be set for every step

# config.pbtxt
name: "ensemble_python_resnet50"
platform: "ensemble"
max_batch_size: 256
input [
{
name: "INPUT"
data_type: TYPE_UINT8
dims: [ -1 ]
}
]
output [
{
name: "OUTPUT"
data_type: TYPE_FP32
dims: [ 1000 ]
}
]
ensemble_scheduling {
step [
{
model_name: "preprocess"
model_version: -1
input_map {
key: "INPUT_0"
value: "INPUT"
}
output_map {
key: "OUTPUT_0"
value: "preprocessed_image"
}
},
{
model_name: "resnet50_trt"
model_version: -1
input_map {
key: "input"
value: "preprocessed_image"
}
output_map {
key: "output"
value: "OUTPUT"
}
}
]
}
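The ensemble can then be called like any other model. Below is a minimal client sketch using the Python HTTP client, assuming the preprocess step expects raw encoded image bytes, the server listens on localhost:8000, and sample.jpg is a placeholder image path.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Raw encoded image bytes as a UINT8 tensor; a leading batch dimension is added
# because the ensemble has max_batch_size > 0.
raw = np.fromfile("sample.jpg", dtype=np.uint8)  # placeholder image file
image = np.expand_dims(raw, axis=0)              # shape: [1, num_bytes]

infer_input = httpclient.InferInput("INPUT", list(image.shape), "UINT8")
infer_input.set_data_from_numpy(image)

response = client.infer(
    model_name="ensemble_python_resnet50",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("OUTPUT")],
)
print(response.as_numpy("OUTPUT").shape)  # expected: (1, 1000)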
The models composing the ensemble may also have dynamic batching enabled. Since ensemble models are just routing the data between composing models, Triton can take requests into an ensemble model without modifying the ensemble's configuration to exploit the dynamic batching of the composing models. (reference)
dynamic_batching {
  preferred_batch_size: [ 4 ]
  max_queue_delay_microseconds: 3000000  # note: the unit is microseconds (3000000 = 3 seconds)
}
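Dynamic batching only pays off when several requests are in flight at the same time. Below is a minimal sketch that sends concurrent requests with the Python HTTP client's async API so Triton gets a chance to group them; the model name ("resnet50_trt"), tensor names, and input shape are assumptions borrowed from the ensemble example above.

import numpy as np
import tritonclient.http as httpclient

# concurrency > 1 lets the client keep several requests outstanding simultaneously
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

def make_inputs():
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape
    inp = httpclient.InferInput("input", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    return [inp]

# With 8 requests in flight, Triton can merge them into batches up to preferred_batch_size,
# waiting at most max_queue_delay_microseconds for a batch to fill.
pending = [client.async_infer("resnet50_trt", make_inputs()) for _ in range(8)]
results = [p.get_result().as_numpy("output") for p in pending]
print(len(results), "responses received")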

The instance_group setting controls how many instances of a model Triton runs and on which GPUs, allowing multiple requests to be processed concurrently.

# Example 1: two instances of the model on every available GPU
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

# Example 2: one instance on GPU 0, two instances on each of GPUs 1 and 2
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1, 2 ]
  }
]

Some models finish part of their initialization only when they receive the first inference request (or the first few requests). For such models, the response time of the first request can be noticeably long.
→ To address this, Triton supports "warming up" a model: after the model is loaded, it automatically runs synthetic requests so that initialization fully completes. This ensures the model is fully initialized before real users send their first request.
Through the ModelWarmup setting, you can define the inference requests used to warm up each model instance. The model instance becomes available for serving only after these inference requests have completed.
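Since a model is reported ready only once its warmup requests have finished, a client can simply poll readiness before sending real traffic. A minimal sketch with the Python HTTP client follows (the model name is assumed); an example ModelWarmup configuration comes right after it.

import time
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# is_model_ready() returns True only after loading and warmup have completed
while not client.is_model_ready("resnet50_trt"):  # assumed model name
    time.sleep(1)
print("model is warmed up and ready to serve")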
model_warmup [
  {
    name: "sample text"
    batch_size: 1
    inputs {
      key: "query"
      value: {
        data_type: TYPE_STRING
        dims: [ 1 ]
        zero_data: true  # for TYPE_STRING inputs, zero_data sends an empty string ("") as the query
      }
    }
  }
]

The perf_analyzer tool measures a model's throughput and latency. With the --percentile option, results are reported at the configured confidence level and that latency percentile is used for stabilization instead of the average. Example output with --percentile=95:

*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Using synchronous calls for inference
Stabilizing using p95 latency
Request concurrency: 1
Client:
Request count: 26376
Throughput: 5275.2 infer/sec
p50 latency: 108 usec
p90 latency: 421 usec
p95 latency: 465 usec
p99 latency: 500 usec
Avg HTTP time: 180 usec (send/recv 17 usec + response wait 163 usec)
Server:
Inference count: 32842
Execution count: 32842
Successful request count: 32842
Avg request latency: 90 usec (overhead 11 usec + queue 6 usec + compute input 6 usec + compute infer 63 usec + compute output 4 usec)
Inferences/Second vs. Client p95 Batch Latency
Concurrency: 1, throughput: 5275.2 infer/sec, latency 465 usec

Example output without --percentile (stabilizing on average latency):

*** Measurement Settings ***
Batch size: 1
Using "time_windows" mode for stabilization
Measurement window: 5000 msec
Using synchronous calls for inference
Stabilizing using average latency
Request concurrency: 1
Client:
Request count: 19739
Throughput: 3947.8 infer/sec
Avg latency: 252 usec (standard deviation 155 usec)
p50 latency: 199 usec
p90 latency: 467 usec
p95 latency: 491 usec
p99 latency: 518 usec
Avg HTTP time: 253 usec (send/recv 24 usec + response wait 229 usec)
Server:
Inference count: 23346
Execution count: 23346
Successful request count: 23346
Avg request latency: 130 usec (overhead 14 usec + queue 9 usec + compute input 9 usec + compute infer 92 usec + compute output 6 usec)
Failed to obtain stable measurement within 10 measurement windows for concurrency 1. Please try to increase the --measurement-interval.
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 3947.8 infer/sec, latency 252 usec

If the reported p90 latency is 404 usec, it means that 90% of requests received a response within 404 usec.

# feed random input data into the model
perf_analyzer -m <model name> -u <url> --concurrency-range 2 --shape QUERY:1 --measurement-interval=10000 --measurement-mode=time_windows
The reported latency breaks down into:
- server queue
- server compute input / infer / output
- client send/recv
- client request/response

By default perf_analyzer measures your model's latency and throughput using the lowest possible load on the model. To do this perf_analyzer sends one inference request to Triton and waits for the response. When that response is received, the perf_analyzer immediately sends another request, and then repeats this process during the measurement windows. The number of outstanding inference requests is referred to as the request concurrency, and so by default perf_analyzer uses a request concurrency of 1. (source)
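To measure under higher load, the concurrency can be swept over a range and the results written to a CSV file, roughly like this (the model name and the 1:8:2 range are placeholders):

perf_analyzer -m <model name> -u localhost:8000 --concurrency-range 1:8:2 -f perf_results.csv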
Request concurrency is controlled with the --concurrency-range option, as in the sweep above. With -f, a CSV output file is generated; loading that data into this spreadsheet visualizes the results. Passing a file name to the --input-data option runs the measurement with the data contained in that file.

# using the --input-data parameter
perf_analyzer -m <model name> -u localhost:8000 --concurrency-range 4 --input-data realinputdata.json
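Such an input-data file lists one JSON object per request under a top-level "data" array. The sketch below is only illustrative: the input name QUERY, its shape, and the sample strings are assumptions, and the exact schema is described in the perf_analyzer input-data documentation.

{
  "data": [
    { "QUERY": { "content": ["sample query text"], "shape": [1] } },
    { "QUERY": { "content": ["another sample query"], "shape": [1] } }
  ]
}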
Triton provides Prometheus metrics for monitoring GPU and request-level statistics. Example output of curl localhost:8002/metrics:

# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{model="ensemble",version="1"} 981.000000
nv_inference_request_success{model="opt",version="1"} 981.000000
nv_inference_request_success{model="preprocess",version="1"} 1000.000000
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{model="ensemble",version="1"} 0.000000
nv_inference_request_failure{model="opt",version="1"} 0.000000
nv_inference_request_failure{model="preprocess",version="1"} 0.000000
# HELP nv_inference_count Number of inferences performed (does not include cached requests)
# TYPE nv_inference_count counter
nv_inference_count{model="ensemble",version="1"} 981.000000
nv_inference_count{model="opt",version="1"} 981.000000
nv_inference_count{model="preprocess",version="1"} 1000.000000
# HELP nv_inference_exec_count Number of model executions performed (does not include cached requests)
# TYPE nv_inference_exec_count counter
nv_inference_exec_count{model="ensemble",version="1"} 981.000000
nv_inference_exec_count{model="opt",version="1"} 325.000000
nv_inference_exec_count{model="preprocess",version="1"} 314.000000
# HELP nv_inference_request_duration_us Cumulative inference request duration in microseconds (includes cached requests)
# TYPE nv_inference_request_duration_us counter
nv_inference_request_duration_us{model="ensemble",version="1"} 1709628307.000000
nv_inference_request_duration_us{model="opt",version="1"} 975163426.000000
nv_inference_request_duration_us{model="preprocess",version="1"} 788530845.000000
# HELP nv_inference_queue_duration_us Cumulative inference queuing duration in microseconds (includes cached requests)
# TYPE nv_inference_queue_duration_us counter
nv_inference_queue_duration_us{model="ensemble",version="1"} 50.000000
nv_inference_queue_duration_us{model="opt",version="1"} 492059181.000000
nv_inference_queue_duration_us{model="preprocess",version="1"} 578120857.000000
# HELP nv_inference_compute_input_duration_us Cumulative compute input duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_input_duration_us counter
nv_inference_compute_input_duration_us{model="ensemble",version="1"} 40594.000000
nv_inference_compute_input_duration_us{model="opt",version="1"} 25458.000000
nv_inference_compute_input_duration_us{model="preprocess",version="1"} 15287.000000
# HELP nv_inference_compute_infer_duration_us Cumulative compute inference duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_infer_duration_us counter
nv_inference_compute_infer_duration_us{model="ensemble",version="1"} 693276368.000000
nv_inference_compute_infer_duration_us{model="opt",version="1"} 483013914.000000
nv_inference_compute_infer_duration_us{model="preprocess",version="1"} 210313689.000000
# HELP nv_inference_compute_output_duration_us Cumulative inference compute output duration in microseconds (does not include cached requests)
# TYPE nv_inference_compute_output_duration_us counter
nv_inference_compute_output_duration_us{model="ensemble",version="1"} 138354.000000
nv_inference_compute_output_duration_us{model="opt",version="1"} 61916.000000
nv_inference_compute_output_duration_us{model="preprocess",version="1"} 77698.000000
# HELP nv_cache_num_entries Number of responses stored in response cache
# TYPE nv_cache_num_entries gauge
# HELP nv_cache_num_lookups Number of cache lookups in response cache
# TYPE nv_cache_num_lookups gauge
# HELP nv_cache_num_hits Number of cache hits in response cache
# TYPE nv_cache_num_hits gauge
# HELP nv_cache_num_misses Number of cache misses in response cache
# TYPE nv_cache_num_misses gauge
# HELP nv_cache_num_evictions Number of cache evictions in response cache
# TYPE nv_cache_num_evictions gauge
# HELP nv_cache_lookup_duration Total cache lookup duration (hit and miss), in microseconds
# TYPE nv_cache_lookup_duration gauge
# HELP nv_cache_util Cache utilization [0.0 - 1.0]
# TYPE nv_cache_util gauge
# HELP nv_cache_num_hits_per_model Number of cache hits per model
# TYPE nv_cache_num_hits_per_model counter
nv_cache_num_hits_per_model{model="ensemble",version="1"} 0.000000
nv_cache_num_hits_per_model{model="opt",version="1"} 0.000000
nv_cache_num_hits_per_model{model="preprocess",version="1"} 0.000000
# HELP nv_cache_hit_lookup_duration_per_model Total cache hit lookup duration per model, in microseconds
# TYPE nv_cache_hit_lookup_duration_per_model counter
nv_cache_hit_lookup_duration_per_model{model="ensemble",version="1"} 0.000000
nv_cache_hit_lookup_duration_per_model{model="opt",version="1"} 0.000000
nv_cache_hit_lookup_duration_per_model{model="preprocess",version="1"} 0.000000
# HELP nv_gpu_utilization GPU utilization rate [0.0 - 1.0)
# TYPE nv_gpu_utilization gauge
nv_gpu_utilization{gpu_uuid="GPU-55d4-21c2-83f9-2c620551ea46"} 0.870000
# HELP nv_gpu_memory_total_bytes GPU total memory, in bytes
# TYPE nv_gpu_memory_total_bytes gauge
nv_gpu_memory_total_bytes{gpu_uuid="GPU-55d4-21c2-83f9-2c620551ea46"} 8589934592.000000
# HELP nv_gpu_memory_used_bytes GPU used memory, in bytes
# TYPE nv_gpu_memory_used_bytes gauge
nv_gpu_memory_used_bytes{gpu_uuid="GPU-55d4-21c2-83f9-2c620551ea46"} 4989124608.000000
# HELP nv_gpu_power_usage GPU power usage in watts
# TYPE nv_gpu_power_usage gauge
nv_gpu_power_usage{gpu_uuid="GPU-55d4-21c2-83f9-2c620551ea46"} 87.707000
# HELP nv_gpu_power_limit GPU power management limit in watts
# TYPE nv_gpu_power_limit gauge
nv_gpu_power_limit{gpu_uuid="GPU-55d4-21c2-83f9-2c620551ea46"} 240.000000
# HELP nv_energy_consumption GPU energy consumption in joules since the Triton Server started
# TYPE nv_energy_consumption counter
nv_energy_consumption{gpu_uuid="GPU-55d4-21c2-83f9-2c620551ea46"} 4846806.151000

| Description | Option / command |
|---|---|
| Enable/disable metrics | tritonserver --allow-metrics=true/false |
| Disable GPU metrics | tritonserver --allow-gpu-metrics=false |
| Disable CPU metrics | tritonserver --allow-cpu-metrics=false |
| Set the metrics port | --metrics-port |
| Set a specific address (HTTP server) | --http-address |
| Set the metrics polling interval | --metrics-interval-ms |
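For example, metrics can be enabled together with a custom polling interval when launching the server; the model repository path below is a placeholder:

tritonserver --model-repository=/models --allow-metrics=true --allow-gpu-metrics=true --metrics-port=8002 --metrics-interval-ms=1000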