Deep dive into vLLM

Brad.Min · April 18, 2024

Let's test serving with vLLM, following the documentation below.
https://docs.vllm.ai/en/latest/getting_started/installation.html

1. Environment setup

First, start a Docker container with the command below. Think of the container as the environment in which we will do the serving.

$ docker run --gpus all -it --rm --ipc=host nvcr.io/nvidia/pytorch:23.10-py3

Next, install the vllm package inside the container.

$ pip install vllm
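
Before moving on, it can help to confirm that the package actually landed in the container's Python environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this post and is not part of vLLM.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if it is missing.

    Hypothetical helper for this post; it only reads package metadata and
    does not import the package itself.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("vllm")
print(f"vllm version: {version}" if version else "vllm is not installed")
```

If this reports that vllm is not installed, rerun the pip command above in the same shell before trying the inference script.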

2. Inference test file

Let's create the simple Python script from the documentation. The file below is an example that loads the small facebook/opt-125m model and feeds it a few prompts.

from vllm import LLM, SamplingParams

# Prompts to complete; each prompt produces its own output entry.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Sampling settings: temperature 0.8 with nucleus (top-p) sampling at 0.95.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load the model.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in a single call.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

[Screenshot: successful run]
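
If you would rather keep the results than only print them, the same two attributes used in the loop above (`output.prompt` and `output.outputs[0].text`) can be gathered into plain dictionaries. This is a rough sketch: `collect_results` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins with made-up text so the snippet runs without a GPU. They are not real model output.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List


def collect_results(outputs: Iterable[Any]) -> List[Dict[str, str]]:
    """Turn generation outputs into a list of {prompt, generated_text} dicts.

    Hypothetical helper: it assumes each item exposes `.prompt` and
    `.outputs[0].text`, the attributes read in the script above.
    """
    rows = []
    for output in outputs:
        rows.append(
            {
                "prompt": output.prompt,
                "generated_text": output.outputs[0].text,
            }
        )
    return rows


# Stand-in objects mimicking the attribute layout; the text is invented.
fake_outputs = [
    SimpleNamespace(
        prompt="The capital of France is",
        outputs=[SimpleNamespace(text=" Paris.")],
    )
]

rows = collect_results(fake_outputs)
print(rows)
```

In the real script you would pass the `outputs` returned by `llm.generate(...)` instead of `fake_outputs`.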
