Test script (`math_test.py`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen-VL-SFT-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()

prompt = "Solve this math problem: What is the derivative of x^2?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model.generate(input_ids)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```
Running math_test.py directly fails with:

```
IndexError: index out of range in self
../aten/src/ATen/native/cuda/Indexing.cu:1255
```

| Problem | Root cause | Resolution |
|---|---|---|
| CUDA indexing error at inference time | Token index mismatch and tensor size issues when running inference through the transformers library | Run inference with the optimized vLLM library instead of transformers |
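This error typically surfaces inside the embedding lookup when the tokenizer emits a token id greater than or equal to the model's embedding table size, for example when SFT added special tokens without resizing the embeddings. A minimal diagnostic sketch, assuming the same `model`, `tokenizer`, and `prompt` objects as in math_test.py above:

```python
# Compare the tokenizer's vocabulary against the model's embedding table.
embedding_size = model.get_input_embeddings().num_embeddings
print(f"tokenizer vocab: {len(tokenizer)}, embedding rows: {embedding_size}")

ids = tokenizer(prompt, return_tensors="pt").input_ids
max_id = ids.max().item()
if max_id >= embedding_size:
    # Any id >= embedding_size triggers "index out of range in self" on lookup.
    print(f"token id {max_id} exceeds the embedding table")
    model.resize_token_embeddings(len(tokenizer))  # one possible remedy
```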
Install vLLM and launch the API server:

```bash
pip install vllm

python -m vllm.entrypoints.api_server \
    --model Qwen-VL-SFT-finetuned \
    --host 0.0.0.0 --port 8000
```
Send a test request:

```bash
curl -X POST http://localhost:8000/generate \
    -H "Content-Type: application/json" \
    -d '{"prompt": "Solve this math problem: What is the derivative of x^2?"}'
```

Response:

```json
{
  "text": "The derivative of x^2 is 2x."
}
```
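The same request from Python, as a minimal sketch using the `requests` package (the endpoint and payload are taken from the curl call above; the exact response schema can vary across vLLM versions, so the `"text"` field is an assumption carried over from the response shown):

```python
import requests

# Same request as the curl example above, against the vLLM /generate endpoint.
resp = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "Solve this math problem: What is the derivative of x^2?"},
)
resp.raise_for_status()
print(resp.json()["text"])
```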
The final deployment structure looks like this:

```
Frontend (React, port 3000)
  ↳ Backend API (FastAPI, port 8001)
      ↳ vLLM Server (port 8000)
```
Backend API (`main.py`):

```python
from fastapi import FastAPI
from pydantic import BaseModel
import requests

app = FastAPI()

class PromptRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate_text(request: PromptRequest):
    # Forward the prompt to the vLLM server and relay its answer.
    vllm_url = "http://localhost:8000/generate"
    payload = {"prompt": request.prompt}
    response = requests.post(vllm_url, json=payload)
    response_json = response.json()
    return {"result": response_json["text"]}
```

Run the backend:

```bash
uvicorn main:app --host 0.0.0.0 --port 8001
```
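The relay above has no error handling: if the vLLM server is down or slow, the route raises an unhandled exception and the client sees a bare 500. A minimal hardening sketch (the timeout value and 502 status are assumptions, not part of the original setup):

```python
from fastapi import HTTPException

@app.post("/generate")
def generate_text(request: PromptRequest):
    try:
        # Fail fast instead of hanging if the vLLM server is unreachable.
        response = requests.post(
            "http://localhost:8000/generate",
            json={"prompt": request.prompt},
            timeout=60,
        )
        response.raise_for_status()
    except requests.RequestException as exc:
        raise HTTPException(status_code=502, detail=f"vLLM server error: {exc}")
    return {"result": response.json()["text"]}
```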
Frontend (`App.jsx`):

```jsx
import React, { useState } from 'react';
import axios from 'axios';

function App() {
  const [prompt, setPrompt] = useState('');
  const [result, setResult] = useState('');

  const handleSubmit = async () => {
    // Call the FastAPI backend, which relays the prompt to vLLM.
    const response = await axios.post('http://localhost:8001/generate', { prompt });
    setResult(response.data.result);
  };

  return (
    <div>
      <textarea value={prompt} onChange={e => setPrompt(e.target.value)} />
      <button onClick={handleSubmit}>Submit</button>
      <div>{result}</div>
    </div>
  );
}

export default App;
```

Run the frontend:

```bash
npm run dev
```
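One caveat: the React dev server (port 3000) calls the FastAPI server (port 8001) cross-origin, so browsers will block the request unless the backend enables CORS. A minimal sketch using FastAPI's built-in CORSMiddleware, to be added to main.py (the allowed origin is an assumption based on the port layout above):

```python
from fastapi.middleware.cors import CORSMiddleware

# Allow the React dev server origin to call this API from the browser.
# "http://localhost:3000" is assumed from the deployment layout above.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_methods=["POST"],
    allow_headers=["Content-Type"],
)
```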
The end-to-end request flow:

```mermaid
graph TD
    A[React frontend :3000] --> B[FastAPI server :8001]
    B --> C[vLLM server :8000]
    C --> B --> A
```
With this setup in place, the post-SFT model has a working inference environment and a stable, successfully completed deployment.