[TIL] What is vLLM?

Jaeyoung Ko · March 19, 2025

vLLM

vLLM is an open-source library for large language model (LLM) inference and serving. It makes deploying, running inference on, and serving LLMs easier and faster.

Official documentation: https://docs.vllm.ai/en/latest/index.html

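Key features

1. Efficient key-value (KV) memory management with PagedAttention

The main strength of vLLM is PagedAttention, which manages the attention key-value (KV) cache in fixed-size blocks, reducing wasted memory and increasing throughput. vLLM also applies continuous batching to pending requests, so it can sustain high throughput while keeping latency low, which makes it well suited to real-time applications.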
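
To make the paging idea concrete, here is a minimal toy sketch of a block table that maps each sequence to fixed-size KV blocks allocated on demand. It is not vLLM's implementation: the names `BlockAllocator`, `BLOCK_SIZE`, and `append_token` are hypothetical, and a real engine manages GPU tensors rather than Python lists.

```python
BLOCK_SIZE = 4  # tokens per KV block (hypothetical; kept small for the demo)


class BlockAllocator:
    """Toy allocator: hands out fixed-size physical blocks from a free pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical block ids
        self.seq_lens: dict[str, int] = {}            # seq id -> tokens stored

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Reserve a KV slot for one new token; return (physical block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # last block is full, or no block yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[length // BLOCK_SIZE], length % BLOCK_SIZE

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=8)
for _ in range(5):
    allocator.append_token("request-a")  # 5 tokens -> needs 2 blocks
for _ in range(2):
    allocator.append_token("request-b")  # 2 tokens -> needs 1 block
print(allocator.block_tables)
```

Because a block is taken from the pool only when a sequence actually fills the previous one, a short request holds few blocks instead of a slot sized for the maximum sequence length, and freed blocks can be reused by any other request.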

2. Quantization

GPTQ, AWQ, INT4, INT8, and FP8

vLLM supports the quantization methods listed above, which reduce a model's memory footprint and help optimize performance when running on GPUs.
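
As a rough back-of-the-envelope sketch of why lower bit widths matter, the snippet below estimates weight memory at several precisions. The 7-billion parameter count is an illustrative assumption, `weight_memory_gib` is a hypothetical helper, and the estimate covers weights only (not the KV cache, activations, or quantization metadata such as scales).

```python
def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB, ignoring scales and other overhead."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 7_000_000_000  # illustrative assumption, not a measured value

for name, bits in [("FP16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    print(f"{name:>8}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Halving the bits per weight halves the weight footprint, which is what allows a larger model (or a larger KV cache) to fit on the same GPU. In vLLM, a quantized checkpoint is typically selected through the `quantization` argument of `LLM`; check the official documentation for the values your version supports.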

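3. Scalability and flexibility

vLLM can load models from model hubs such as Hugging Face, and it supports tensor parallelism, so a model can be sharded across multiple GPUs. This lets it deliver high performance for large-scale data processing and inference workloads.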
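
The idea behind tensor parallelism can be sketched without any GPU: split a weight matrix column-wise, let each shard compute its slice of the output, then concatenate the slices. This is a simplified pure-Python illustration, not how vLLM implements it; `matmul` and `split_columns` are hypothetical helpers, and each list entry in `shards` stands in for one GPU.

```python
def matmul(x, w):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split weight matrix w column-wise into num_shards equal shards."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]


x = [[1, 2], [3, 4]]              # activations: 2 tokens x 2 features
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # weights: 2 features x 4 outputs

shards = split_columns(w, num_shards=2)            # one shard per simulated GPU
partials = [matmul(x, shard) for shard in shards]  # each shard computes its slice
# Gather step: concatenate the per-shard output columns for every row.
gathered = [sum((p[r] for p in partials), []) for r in range(len(x))]

print(gathered == matmul(x, w))
```

Each shard holds only a fraction of the weights, so a model too large for one device can still be served. In vLLM, the number of GPUs to shard across is set with the `tensor_parallel_size` argument of `LLM`.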

Examples

The official site provides a set of key examples.

The code below is the snippet from the "chat with tools" example.


# SPDX-License-Identifier: Apache-2.0

# ruff: noqa
import json
import random
import string

from vllm import LLM
from vllm.sampling_params import SamplingParams

# This script is an offline demo for function calling
#
# If you want to run a server/client setup, please follow this code:
#
# - Server:
#
# ```bash
# vllm serve mistralai/Mistral-7B-Instruct-v0.3 --tokenizer-mode mistral --load-format mistral --config-format mistral
# ```
#
# - Client:
#
# ```bash
# curl --location 'http://<your-node-url>:8000/v1/chat/completions' \
# --header 'Content-Type: application/json' \
# --header 'Authorization: Bearer token' \
# --data '{
#     "model": "mistralai/Mistral-7B-Instruct-v0.3",
#     "messages": [
#       {
#         "role": "user",
#         "content": [
#             {"type" : "text", "text": "Describe this image in detail please."},
#             {"type": "image_url", "image_url": {"url": "https://s3.amazonaws.com/cms.ipressroom.com/338/files/201808/5b894ee1a138352221103195_A680%7Ejogging-edit/A680%7Ejogging-edit_hero.jpg"}},
#             {"type" : "text", "text": "and this one as well. Answer in French."},
#             {"type": "image_url", "image_url": {"url": "https://www.wolframcloud.com/obj/resourcesystem/images/a0e/a0ee3983-46c6-4c92-b85d-059044639928/6af8cfb971db031b.png"}}
#         ]
#       }
#     ]
#   }'
# ```
#
# Usage:
#     python demo.py simple
#     python demo.py advanced

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
# or switch to "mistralai/Mistral-Nemo-Instruct-2407"
# or "mistralai/Mistral-Large-Instruct-2407"
# or any other mistral model with function calling ability

sampling_params = SamplingParams(max_tokens=8192, temperature=0.0)
llm = LLM(model=model_name,
          tokenizer_mode="mistral",
          config_format="mistral",
          load_format="mistral")


def generate_random_id(length=9):
    characters = string.ascii_letters + string.digits
    random_id = ''.join(random.choice(characters) for _ in range(length))
    return random_id


# simulate an API that can be called
def get_current_weather(city: str, state: str, unit: str):
    return (f"The weather in {city}, {state} is 85 degrees {unit}. It is "
            "partly cloudy, with highs in the 90's.")


tool_funtions = {"get_current_weather": get_current_weather}

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type":
                    "string",
                    "description":
                    "The city to find the weather for, e.g. 'San Francisco'"
                },
                "state": {
                    "type":
                    "string",
                    "description":
                    "the two-letter abbreviation for the state that the city is"
                    " in, e.g. 'CA' which would mean 'California'"
                },
                "unit": {
                    "type": "string",
                    "description": "The unit to fetch the temperature in",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["city", "state", "unit"]
        }
    }
}]

messages = [{
    "role":
    "user",
    "content":
    "Can you tell me what the temperature will be in Dallas, in fahrenheit?"
}]

outputs = llm.chat(messages, sampling_params=sampling_params, tools=tools)
output = outputs[0].outputs[0].text.strip()

# append the assistant message
messages.append({
    "role": "assistant",
    "content": output,
})

# let's now actually parse and execute the model's output simulating an API call by using the
# above defined function
tool_calls = json.loads(output)
tool_answers = [
    tool_funtions[call['name']](**call['arguments']) for call in tool_calls
]

# append the answer as a tool message and let the LLM give you an answer
messages.append({
    "role": "tool",
    "content": "\n\n".join(tool_answers),
    "tool_call_id": generate_random_id(),
})

outputs = llm.chat(messages, sampling_params, tools=tools)

print(outputs[0].outputs[0].text.strip())
# yields
#   'The weather in Dallas, TX is 85 degrees fahrenheit. '
#   'It is partly cloudy, with highs in the 90's.'
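
The demo above passes the model output straight to `json.loads` and looks each tool up by name, so malformed JSON or an unknown tool name raises an exception. Below is a minimal defensive sketch of that dispatch step, runnable without vLLM. The helper `dispatch_tool_calls` is hypothetical, and `raw_output` is a hand-written string standing in for what the model might return.

```python
import json


def get_current_weather(city: str, state: str, unit: str) -> str:
    """Stand-in for the demo's simulated weather API."""
    return f"The weather in {city}, {state} is 85 degrees {unit}."


TOOL_FUNCTIONS = {"get_current_weather": get_current_weather}


def dispatch_tool_calls(raw_output: str, tool_functions: dict) -> list[str]:
    """Parse a JSON list of tool calls and run each one, reporting errors as text."""
    try:
        calls = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"error: model output is not valid JSON ({exc.msg})"]
    if not isinstance(calls, list):
        return ["error: expected a JSON list of tool calls"]

    answers = []
    for call in calls:
        name = call.get("name") if isinstance(call, dict) else None
        func = tool_functions.get(name) if isinstance(name, str) else None
        if func is None:
            answers.append(f"error: unknown tool {name!r}")
            continue
        try:
            answers.append(func(**call.get("arguments", {})))
        except TypeError as exc:  # missing, unexpected, or non-dict arguments
            answers.append(f"error: bad arguments for {name}: {exc}")
    return answers


# Hand-written string standing in for the model's tool-call output.
raw_output = (
    '[{"name": "get_current_weather", '
    '"arguments": {"city": "Dallas", "state": "TX", "unit": "fahrenheit"}}]'
)
print(dispatch_tool_calls(raw_output, TOOL_FUNCTIONS))
```

Returning errors as text rather than raising lets the caller append them to the conversation as a tool message, so the model has a chance to correct its call on the next turn.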
