Cleaned-up model.generate() run script
import torch
import time
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"

print("Loading model (int8)\n")
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
print("Model loaded\n")

# Input data
text = "Hello, please introduce yourself briefly."
print(f"Prompt: {text}")

# Tokenize -> move to VRAM
inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Inference
print("Starting generation\n")
start_time = time.time()
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    pad_token_id=tokenizer.eos_token_id,
)
end_time = time.time()
duration = end_time - start_time

input_len = inputs.input_ids.shape[1]
total_len = outputs.shape[1]
generated_tokens = total_len - input_len
tps = generated_tokens / duration

print(f"Elapsed time: {duration:.2f} s")
print(f"Generated tokens: {generated_tokens}")
print(f"Speed (TPS): {tps:.2f} tokens/sec\n")

decoded_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"AI answer:\n{decoded_text}")
Calling model.generate() repeatedly and measuring time
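The repeated measurement can be sketched as a small timing harness. The helper name measure_tps is illustrative, not from the original script; with transformers, generate_fn would wrap model.generate(**inputs, ...) and subtract the prompt length, exactly as the single-run script above does:

```python
import time

def measure_tps(generate_fn, n_runs=10):
    """Call generate_fn repeatedly and return per-run tokens/sec.

    generate_fn must return the number of newly generated tokens
    (total output length minus prompt length).
    """
    results = []
    for i in range(1, n_runs + 1):
        start = time.perf_counter()
        new_tokens = generate_fn()
        duration = time.perf_counter() - start
        tps = new_tokens / duration
        results.append(tps)
        print(f"Run {i} TPS: {tps:.2f} tokens/sec")
    return results
```

Using time.perf_counter() instead of time.time() gives a monotonic, higher-resolution clock, which matters for short runs.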

When I checked, though, GPU utilization was not reaching its maximum. Yet nvidia-smi showed performance state P2, which means the GPU clocks were already at full speed, so the GPU was spending roughly 60% of its time waiting for data to arrive. This is the signature of being memory-bound (a memory-bandwidth bottleneck).
This can also be addressed structurally: LLM generation decodes tokens one at a time, so the arithmetic intensity of each step is very low. The following techniques tackle this; I should study them later:
- Speculative Decoding
- KV Cache optimization (PagedAttention, vLLM)
- Flash Attention
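A rough roofline estimate shows why single-stream decoding is memory-bound. The hardware figures below (~930 GB/s bandwidth, ~35 TFLOP/s, an RTX 3090-class card) are illustrative assumptions, not measurements from this run:

```python
# Rough roofline estimate for single-batch LLM decoding.
# All hardware numbers below are illustrative assumptions.
params = 8e9                 # Llama-3-8B parameter count
bytes_per_param = 1          # int8-quantized weights
flops_per_token = 2 * params             # ~2 FLOPs per parameter per token
bytes_per_token = params * bytes_per_param  # every weight read once per token

# Arithmetic intensity of decoding: FLOPs per byte moved from VRAM
intensity = flops_per_token / bytes_per_token  # = 2 FLOPs/byte

# Assumed GPU: ~930 GB/s memory bandwidth, ~35 TFLOP/s compute
bandwidth = 930e9
peak_flops = 35e12
ridge_point = peak_flops / bandwidth  # intensity needed to be compute-bound

print(f"Arithmetic intensity: {intensity:.1f} FLOPs/byte")
print(f"Ridge point:          {ridge_point:.1f} FLOPs/byte")

# Bandwidth-limited upper bound on decoding speed:
max_tps = bandwidth / bytes_per_token
print(f"Bandwidth-bound TPS ceiling: {max_tps:.0f} tokens/sec")
```

Decoding sits far below the ridge point, so the GPU stalls on memory, which matches the low utilization in nvidia-smi. The measured 7-8 TPS is also well under the bandwidth ceiling, presumably due to bitsandbytes dequantization overhead and other kernel costs.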
Inference time

I wanted to observe a warm-up effect, but the GPU was already active right after model load (P2).
So I waited 60 seconds before running.

The GPU dropped to P8, but in the end it made little difference.
Inference speed
Run 1 TPS: 7.63 tokens/sec
Run 2 TPS: 6.94 tokens/sec (lowest)
Run 3 TPS: 8.20 tokens/sec
Run 4 TPS: 8.76 tokens/sec (highest)
Run 5 TPS: 7.61 tokens/sec
Run 6 TPS: 7.31 tokens/sec
Run 7 TPS: 8.31 tokens/sec
Run 8 TPS: 7.09 tokens/sec
Run 9 TPS: 7.69 tokens/sec
Run 10 TPS: 8.58 tokens/sec
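For reference, the spread across the ten runs can be summarized with the standard statistics module (values copied from the list above):

```python
import statistics

# Per-run TPS values measured above
tps_runs = [7.63, 6.94, 8.20, 8.76, 7.61, 7.31, 8.31, 7.09, 7.69, 8.58]

mean_tps = statistics.mean(tps_runs)     # ~7.81
stdev_tps = statistics.stdev(tps_runs)   # sample standard deviation

print(f"Mean TPS: {mean_tps:.2f}")
print(f"Stdev:    {stdev_tps:.2f}")
print(f"Range:    {min(tps_runs)} - {max(tps_runs)}")
```

The run-to-run variation stays within roughly ±10% of the mean, consistent with the warm-up making little difference.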
Full log
(pruning) C:\Users\KHH\source\lab\EfficientML\LightweightChallenge>python 005_model_inference.py
Loading model (int8)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 4/4 [00:16<00:00, 4.05s/it]
Model loaded
Sleeping 60 s (check Perf state with nvidia-smi)
... 60 s left
... 50 s left
... 40 s left
... 30 s left
... 20 s left
... 10 s left
Generation run 1 started
C:\Users\KHH.conda\envs\pruning\lib\site-packages\transformers\models\llama\modeling_llama.py:602: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:555.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
Elapsed time: 6.56 s
Generated tokens: 50
Run 1 TPS: 7.63 tokens/sec
AI answer:Hello, please introduce yourself briefly. What are you working on at the moment?
My name is Paulina, I am 27 years old and I am originally from Poland. I am currently working as a freelance graphic designer in Berlin. I am working on a project called, The Art
Generation run 2 started
Elapsed time: 7.21 s
Generated tokens: 50
Run 2 TPS: 6.94 tokens/sec
AI answer:Hello, please introduce yourself briefly. What is your role at the University of Hohenheim?
I am a professor of Agricultural Economics and Management at the University of Hohenheim. I am also the director of the Centre for Development Research (ZEF) at the University of Bonn
Generation run 3 started
Elapsed time: 6.10 s
Generated tokens: 50
Run 3 TPS: 8.20 tokens/sec
AI answer:Hello, please introduce yourself briefly. I am a professional musician and composer. I have been playing in bands for many years and have also been composing for many years. I have a very wide range of experience, having played in many different bands, and have a very diverse musical background.
Generation run 4 started
Elapsed time: 5.71 s
Generated tokens: 50
Run 4 TPS: 8.76 tokens/sec
AI answer:Hello, please introduce yourself briefly. What are you doing at the moment?
I’m a 24 year old photographer based in London. I work for a variety of publications and clients, but my main focus is on my personal projects. I’m currently working on a project called ‘The
Generation run 5 started
Elapsed time: 6.57 s
Generated tokens: 50
Run 5 TPS: 7.61 tokens/sec
AI answer:Hello, please introduce yourself briefly. What is your name, what do you do and where are you from?
My name is Alex and I am 24 years old. I am from Germany and I am a photographer.
How did you get into photography? What was the trigger?
I
Generation run 6 started
Elapsed time: 6.84 s
Generated tokens: 50
Run 6 TPS: 7.31 tokens/sec
AI answer:Hello, please introduce yourself briefly. My name is Thomas Biermann. I live in the state of Baden-Württemberg, in the district of Heidenheim, in the city of Herbrechtingen, where I was born and raised. I am a
Generation run 7 started
Elapsed time: 6.02 s
Generated tokens: 50
Run 7 TPS: 8.31 tokens/sec
AI answer:Hello, please introduce yourself briefly. What is your name, where do you come from and what do you do?
I’m Yannick, 26 years old, born and raised in the south of Germany. I’m a freelance photographer and I’m currently living in Berlin.
What
Generation run 8 started
Elapsed time: 7.05 s
Generated tokens: 50
Run 8 TPS: 7.09 tokens/sec
AI answer:Hello, please introduce yourself briefly. Who are you, what do you do and what is your background?
I am a 24 year old woman who lives in Berlin, Germany. I studied social sciences and political science and now work as a social worker in a refugee shelter. I am
Generation run 9 started
Elapsed time: 6.50 s
Generated tokens: 50
Run 9 TPS: 7.69 tokens/sec
AI answer:Hello, please introduce yourself briefly. Who are you, what do you do, what do you like to do?
I am a 26-year-old software developer. I live in Berlin and work at a tech company in the city. I like to play video games, read, write
Generation run 10 started
Elapsed time: 5.83 s
Generated tokens: 50
Run 10 TPS: 8.58 tokens/sec
AI answer:Hello, please introduce yourself briefly. Who are you?
My name is David H. W. van der Wolk, I am 43 years old and live in the Netherlands. I am a full-time photographer. I have been working as a photographer since 2002, and since
Freeing memory
Cleanup done! VRAM cleared