Configuring gpt-oss reasoning (llama-cpp-python)

KIMHYUNSU · August 23, 2025

gpt-oss Reasoning Configuration Guide

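Everything here was done with the Llama object from llama-cpp-python.

The gpt-oss model page on Hugging Face says the reasoning level is configurable.

Three levels are available: low, medium (default), and high.

However, the Llama class in llama-cpp-python had no reasoning-related parameter, and my own testing did not reveal how to set it. After some searching I found a workaround, editing the chat template directly, which I share below.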

Reference: https://github.com/ggml-org/llama.cpp/discussions/15396

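Setting it up in llama-cpp-python

With the Jinja2ChatFormatter class from the llama-cpp-python library, you can define the prompt template the model will use and inject it yourself.

Step 1: Define a Jinja2 template that contains `Reasoning: low` (or medium, or high)

First, create a template string with Reasoning: low hardcoded into the system prompt.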

# Minimal Harmony template with 'Reasoning: low' fixed in the system prompt
HARMONY_LOW_TEMPLATE = r"""
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: {{ strftime_now("%Y-%m-%d") }}
Reasoning: low
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
{%- for m in messages -%}
  {%- if m.role == "user" -%}
<|start|>user<|message|>{{ m.content }}<|end|>
  {%- elif m.role == "assistant" and m.channel == "final" -%}
<|start|>assistant<|channel|>final<|message|>{{ m.content }}<|end|>
  {%- endif -%}
{%- endfor -%}
<|start|>assistant
"""

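Reference: https://cookbook.openai.com/articles/openai-harmony

Key point: adding the Reasoning: low line explicitly to the system prompt pins the model to its low-effort mode, in which it answers almost directly.
Jinja2 syntax: {{ ... }} and {%- ... -%} iterate over the conversation history (messages) and build the complete prompt dynamically.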
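Since only the Reasoning line differs between the three levels, the template string can also be generated rather than copied three times. The following is a minimal sketch under that assumption; `make_harmony_template`, `HARMONY_TEMPLATE_BASE`, and the `__LEVEL__` placeholder are hypothetical names introduced here for illustration, not part of llama-cpp-python.

```python
# Hypothetical helper (not part of llama-cpp-python): fills the reasoning
# level into the same minimal Harmony template used in step 1.
VALID_LEVELS = ("low", "medium", "high")

# "__LEVEL__" is a plain-text placeholder chosen so that it cannot clash
# with the Jinja2 delimiters ({{ }} and {% %}) already in the template.
HARMONY_TEMPLATE_BASE = r"""
<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: {{ strftime_now("%Y-%m-%d") }}
Reasoning: __LEVEL__
# Valid channels: analysis, commentary, final. Channel must be included for every message.<|end|>
{%- for m in messages -%}
  {%- if m.role == "user" -%}
<|start|>user<|message|>{{ m.content }}<|end|>
  {%- elif m.role == "assistant" and m.channel == "final" -%}
<|start|>assistant<|channel|>final<|message|>{{ m.content }}<|end|>
  {%- endif -%}
{%- endfor -%}
<|start|>assistant
"""


def make_harmony_template(level: str = "medium") -> str:
    """Return the template string with 'Reasoning: <level>' filled in."""
    level = level.strip().lower()
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    return HARMONY_TEMPLATE_BASE.replace("__LEVEL__", level)


HARMONY_HIGH_TEMPLATE = make_harmony_template("high")
```

The returned string can be passed as the template argument of Jinja2ChatFormatter in exactly the same way as HARMONY_LOW_TEMPLATE. Rejecting unknown levels up front catches typos such as "mideum" before they end up silently in the system prompt.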

Step 2: Prepare the template for injection with `Jinja2ChatFormatter`

Next, turn the template defined above into an object that llama-cpp-python understands.

from llama_cpp.llama_chat_format import Jinja2ChatFormatter

formatter = Jinja2ChatFormatter(
    template=HARMONY_LOW_TEMPLATE,
    eos_token="<|return|>",      # end-of-sequence token
    bos_token="<|startoftext|>", # beginning-of-sequence token
)

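This formatter object now acts as the "handler" that converts chat messages into a prompt according to the rules we defined.

Step 3: Load the model with `chat_handler`

Finally, pass the formatter created above as the chat_handler argument when constructing the Llama object.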

from llama_cpp import Llama

llm = Llama(
    model_path="path/to/your/gpt-oss-20B.gguf",
    n_gpu_layers=-1,        # -1 offloads all layers to the GPU
    flash_attn=True,        # faster inference (when supported)
    n_ctx=4096,
    chat_handler=formatter.to_chat_handler(),
)

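From now on, every call to llm.create_chat_completion() makes the library build the prompt internally from HARMONY_LOW_TEMPLATE.

The n_gpu_layers value controls how many layers are offloaded to the GPU; the remaining layers stay on the CPU.
- I read about this at https://news.hada.io/topic?id=22490 and then tested it myself.
- I will write that up in a later post.

Incidentally, I first assumed that setting the reasoning effort inside messages would be enough, but that did not work.

Finally, call create_chat_completion and check the response.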

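# Same question as in the response comparison
messages = [{"role": "user", "content": "자기소개 해줘"}]

stream = llm.create_chat_completion(
    messages=messages,
    temperature=0.7,
    max_tokens=1024,
    stream=True,  # enable streaming
    stop=["<|return|>", "<|call|>"],
)

# Print each streamed piece of text as it arrives
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content") or "", end="", flush=True)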
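With this template the generated text still carries the Harmony channel markers (<|channel|>analysis<|message|>... followed by <|channel|>final<|message|>...), so the reasoning and the answer arrive as one string. The sketch below is one possible way to separate them with a regular expression; `split_channels` is a hypothetical helper written for this post, and the sample string is an abbreviated copy of the low-effort output. It assumes the plain `<|channel|>NAME<|message|>` form and does not handle tool-call headers.

```python
import re

# Matches "<|channel|>NAME<|message|>BODY". BODY runs until the next Harmony
# control token that closes or starts a message, or until the end of the text.
_CHANNEL_RE = re.compile(
    r"<\|channel\|>(?P<channel>\w+)<\|message\|>(?P<body>.*?)"
    r"(?=<\|end\|>|<\|start\|>|<\|return\|>|<\|call\|>|\Z)",
    re.DOTALL,
)


def split_channels(raw: str) -> dict[str, str]:
    """Return {channel name: text}. Repeated channels are concatenated."""
    parts: dict[str, str] = {}
    for match in _CHANNEL_RE.finditer(raw):
        name = match.group("channel")
        parts[name] = parts.get(name, "") + match.group("body")
    return parts


# Abbreviated sample in the same shape as the raw model output
raw = (
    "<|channel|>analysis<|message|>We need to respond in Korean, self-introduction."
    "<|start|>assistant<|channel|>final<|message|>안녕하세요! 저는 ChatGPT라고 해요."
)
parts = split_channels(raw)
print("analysis:", parts["analysis"])
print("final:   ", parts["final"])
```

For a streamed response, the pieces would first have to be joined into one string before calling the helper, since a marker can be split across chunks.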

Response comparison

Reasoning: medium

You: 자기소개 해줘
Assistant: <|channel|>analysis<|message|>User says in Korean: "자기소개 해줘" meaning "introduce yourself". So we should reply in Korean, introducing ChatGPT. Probably short.<|start|>assistant<|channel|>final<|message|>안녕하세요! 저는 ChatGPT, OpenAI에서 개발한 대규모 언어 모델입니다. 여러분이 궁금한 것, 도움이 필요한 것, 혹은 그냥 수다를 떨고 싶을 때 언제든지 도와드릴 준비가 되어 있어요. 텍스트를 기반으로 대화를 나누며, 다양한 주제에 대해 이야기하고 정보를 제공할 수 있습니다. 편하게 물어보세요!
llama_perf_context_print:        load time =     785.43 ms
llama_perf_context_print: prompt eval time =     785.29 ms /    73 tokens (   10.76 ms per token,    92.96 tokens per second)
llama_perf_context_print:        eval time =    3734.35 ms /   129 runs   (   28.95 ms per token,    34.54 tokens per second)
llama_perf_context_print:       total time =    4658.32 ms /   202 tokens
llama_perf_context_print:    graphs reused =        128

Reasoning: high

You: 자기소개 해줘
Assistant: <|channel|>analysis<|message|>The user says in Korean: "자기소개 해줘" which means "Introduce yourself" or "Tell me about yourself". The user wants a self-introduction. So I should respond in Korean, presumably, with a self-introduction. I need to respond in Korean. I should mention that I am ChatGPT, a large language model trained by OpenAI, I can assist with a variety of tasks, answer questions, help with writing, etc. I should also mention that my knowledge cutoff is 2024-06, and current date is 2025-08-20. I should keep it friendly and concise but informative. I might also ask if there's anything specific the user wants to know. But the user just asked for introduction. So I should give introduction. Maybe ask if user wants to ask something else. But mostly just introduction.

I should respond in Korean. The user likely expects Korean. So I'll respond in Korean, with a brief introduction. I can say:

안녕하세요! 저는 ChatGPT, OpenAI가 개발한 대규모 언어 모델입니다. 2024년 6월까지의 지식을 바탕으로 여러분의 질문에 답하고, 글쓰기나 번역, 학습 등 다양한 분야에서 도와드릴 수 있어요. 무엇이든 물어보세요! Also mention I can write in Korean. Also mention I'm always learning and happy to help.

I think that's enough.<|start|>assistant<|channel|>final<|message|>안녕하세요! 저는 ChatGPT, OpenAI가 개발한 대규모 언어 모델입니다. 2025년 8월 20일 기준으로 2024년 6월까지의 지식을 바탕으로 여러분의 질문에 답하고, 글쓰기, 번역, 학습, 창작 등 다양한 분야에서 도와드릴 수 있어요. 언제든 궁금한 것이나 필요한 것이 있으면 편하게 물어보세요!
llama_perf_context_print:        load time =     718.79 ms
llama_perf_context_print: prompt eval time =     718.66 ms /    73 tokens (    9.84 ms per token,   101.58 tokens per second)
llama_perf_context_print:        eval time =   11574.00 ms /   394 runs   (   29.38 ms per token,    34.04 tokens per second)
llama_perf_context_print:       total time =   12764.20 ms /   467 tokens
llama_perf_context_print:    graphs reused =        392

Reasoning: low

You: 자기소개 해줘
Assistant: <|channel|>analysis<|message|>We need to respond in Korean, self-introduction.<|start|>assistant<|channel|>final<|message|>안녕하세요! 저는 ChatGPT라고 해요. OpenAI에서 개발한 인공지능 언어 모델이에요. 여러분이 궁금한 점이 있거나 도움이 필요할 때 언제든지 물어보세요! 저는 다양한 주제에 대해 이야기하고, 글을 쓰고, 문제를 해결하는 데 도움을 드릴 수 있어요. 혹시 지금 바로 궁금한 점이 있나요? 함께 이야기해봐요!
llama_perf_context_print:        load time =     470.98 ms
llama_perf_context_print: prompt eval time =     470.84 ms /    73 tokens (    6.45 ms per token,   155.04 tokens per second)
llama_perf_context_print:        eval time =    2280.70 ms /   110 runs   (   20.73 ms per token,    48.23 tokens per second)
llama_perf_context_print:       total time =    2851.88 ms /   183 tokens
llama_perf_context_print:    graphs reused =        109
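
To put the three runs side by side, the eval time lines from the logs above can be parsed and compared. This is only a rough sketch: the three lines are copied verbatim from the logs, `parse_eval` is a hypothetical helper, and a single prompt per level is far too small a sample for a benchmark. The "runs" figure counts every generated token, so it covers the analysis channel as well as the final answer.

```python
import re

# The "eval time" lines copied from the three logs above
EVAL_LINES = {
    "low": "llama_perf_context_print:        eval time =    2280.70 ms /   110 runs   (   20.73 ms per token,    48.23 tokens per second)",
    "medium": "llama_perf_context_print:        eval time =    3734.35 ms /   129 runs   (   28.95 ms per token,    34.54 tokens per second)",
    "high": "llama_perf_context_print:        eval time =   11574.00 ms /   394 runs   (   29.38 ms per token,    34.04 tokens per second)",
}

# Requires "runs", so the "prompt eval time" line (which reports "tokens")
# would not match this pattern.
_EVAL_RE = re.compile(r"eval time =\s*([\d.]+) ms /\s*(\d+) runs")


def parse_eval(line: str) -> tuple[float, int]:
    """Return (generation time in ms, number of generated tokens)."""
    match = _EVAL_RE.search(line)
    if match is None:
        raise ValueError(f"not an eval time line: {line!r}")
    return float(match.group(1)), int(match.group(2))


stats = {level: parse_eval(line) for level, line in EVAL_LINES.items()}
_, medium_runs = stats["medium"]

for level in ("low", "medium", "high"):
    ms, runs = stats[level]
    ratio = runs / medium_runs
    print(f"{level:<6} {runs:>4} tokens  {ms / 1000:6.2f} s  x{ratio:.2f} vs medium")
```

In these logs the high run generated 394 tokens against 129 for medium, roughly three times as many, while the low run generated 110. Most of that difference comes from the length of the analysis channel rather than from the final answer.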