StepRoute: Step-Wise Reflective Routing with Contrastive Reward Shaping

하임 · April 27, 2026


https://openreview.net/pdf?id=X4jyt8DWHN


🔥 StepRoute: Step-wise Reflective Routing Summary

“A method for performing cost-aware LLM routing at the step level”


1. 📌 Problem Setting

The goal in LLM-based reasoning is:

Minimize cost while maintaining accuracy


Model Pool Definition

$$\mathcal{M} = \{M_1, M_2, \dots, M_K\}$$

  • $M_1$: Small Language Model (SLM)
  • $M_2, \dots, M_K$: Large Language Models (LLMs)

Input

  • Problem $q$

2. 🧠 Inference Pipeline

StepRoute consists of three stages.


(1) Step Decomposition

The SLM decomposes the input into a sequence of steps:

$$(s_t)_{t=1}^{T} = M_1(I_{\text{dec}}(q))$$


(2) SLM Chain Generation

$$y_{\text{chain}} = M_1(I_{\text{chain}}(q, (s_t)_{t=1}^{T}))$$

  • Per-step chunk $c_t$
  • SLM answer $\hat{a}_{\text{SLM}}$

(3) Step-wise Routing

Routing is performed at each step:

  • keep the SLM, or
  • call an LLM

3. 🎯 Routing Decision Space

The action at each step is a pair:

$$p = (g, m)$$

  • $g \in \{0, 1\}$
    • $g = 0$: keep the SLM
    • $g = 1$: call an LLM
  • $m \in \{1, \dots, K\}$: index of the selected model $M_m$

Action Space

$$\mathcal{P} = \{(0, 1)\} \cup \{(1, m) : m \in \{2, \dots, K\}\}$$
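The action space above can be enumerated directly. The following is a minimal sketch; the function name `build_action_space` is an illustrative choice, not something taken from the paper.

```python
def build_action_space(num_models: int) -> list[tuple[int, int]]:
    """Enumerate P = {(0, 1)} U {(1, m) : m in 2..K} for a pool of K models."""
    if num_models < 2:
        raise ValueError("the pool needs the SLM plus at least one LLM")
    stay = [(0, 1)]  # g = 0: keep the SLM (M_1)
    escalate = [(1, m) for m in range(2, num_models + 1)]  # g = 1: call LLM M_m
    return stay + escalate


actions = build_action_space(4)
print(actions)  # [(0, 1), (1, 2), (1, 3), (1, 4)]
```

With $K$ models the space has exactly $K$ actions: one "stay" action and $K - 1$ escalation targets.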


4. 🧩 Routing Context

Router input at each step:

$$X_t = (q, s_t, H_t, c_t)$$

  • $H_t$: summary of the previous steps
  • $c_t$: current reasoning chunk
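As a rough illustration, the routing context can be held in a small record and flattened to text before encoding. The field names and the bracketed header format in `serialize` are hypothetical; the post does not say how the context is serialized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingContext:
    """X_t = (q, s_t, H_t, c_t) for one reasoning step."""
    question: str  # q
    step: str      # s_t: the step produced by decomposition
    history: str   # H_t: summary of the previous steps
    chunk: str     # c_t: current reasoning chunk from the SLM


def serialize(ctx: RoutingContext) -> str:
    """Hypothetical serializer: one labeled line per field."""
    return (
        f"[QUESTION] {ctx.question}\n"
        f"[STEP] {ctx.step}\n"
        f"[HISTORY] {ctx.history}\n"
        f"[CHUNK] {ctx.chunk}"
    )


ctx = RoutingContext(
    question="What is 12 * 13?",
    step="Multiply 12 by 13",
    history="Identified the operands 12 and 13",
    chunk="12 * 13 = 12 * 10 + 12 * 3 = 156",
)
text = serialize(ctx)
print(text)
```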

5. 🧠 Router Parameterization

Context Embedding

$$h_t = \text{Pool}(\text{Enc}_{\text{SLM}}(\text{ser}(X_t))), \quad q_t = \text{norm}(W h_t)$$
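The embedding step can be sketched in plain Python. This sketch assumes `Pool` is mean pooling over token states and `norm` is L2 normalization; neither choice is confirmed by the post, and the token states and matrix `W` below are toy values standing in for the SLM encoder output.

```python
import math


def mean_pool(token_states: list[list[float]]) -> list[float]:
    """Assumed Pool: average the encoder states over the token axis."""
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(state[i] for state in token_states) / n for i in range(dim)]


def project_and_normalize(W: list[list[float]], h: list[float],
                          eps: float = 1e-12) -> list[float]:
    """q_t = norm(W h_t), with norm assumed to be L2 normalization."""
    z = [sum(w * x for w, x in zip(row, h)) for row in W]
    length = math.sqrt(sum(v * v for v in z))
    return [v / (length + eps) for v in z]


# Toy encoder output: 2 tokens, hidden size 2.
token_states = [[1.0, 2.0], [3.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # projects hidden size 2 to 3

h_t = mean_pool(token_states)
q_t = project_and_normalize(W, h_t)
```

Normalizing $q_t$ keeps the inner products used for scoring on a bounded scale, so the learned biases stay comparable across contexts.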


Pair Scoring

$$s(p \mid X_t) = \langle q_t, e^{(g)}_g \rangle + b^{(g)}_g + \langle q_t, e^{(m)}_m \rangle + b^{(m)}_m + b^{(p)}_p$$
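A minimal sketch of the pair score, assuming the gate term, the model term, and the pair bias listed above are simply added together. All embeddings and biases below are toy values chosen for illustration.

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def pair_score(q_t, p, gate_emb, gate_bias, model_emb, model_bias, pair_bias):
    """Score of action p = (g, m): gate term + model term + pair bias."""
    g, m = p
    gate_term = dot(q_t, gate_emb[g]) + gate_bias[g]
    model_term = dot(q_t, model_emb[m]) + model_bias[m]
    return gate_term + model_term + pair_bias[p]


# Toy parameters (hypothetical values).
q_t = [0.6, 0.8]
gate_emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}
gate_bias = {0: 0.0, 1: 0.0}
model_emb = {1: [0.5, 0.0], 2: [0.0, 0.5], 3: [0.5, 0.5]}
model_bias = {1: 0.0, 2: 0.0, 3: 0.0}
pair_bias = {(0, 1): 0.1, (1, 2): 0.0, (1, 3): -0.2}

scores = {
    p: pair_score(q_t, p, gate_emb, gate_bias, model_emb, model_bias, pair_bias)
    for p in pair_bias
}
```

Factoring the score into gate and model parts lets the "escalate or not" signal be shared across all LLM choices, while the pair bias absorbs anything specific to one action.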

Policy

$$\pi_\theta(p \mid X_t) = \text{softmax}\left([s(p \mid X_t)]_{p \in \mathcal{P}}\right)$$
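The policy is a softmax over the scores of all actions in the action space. A small sketch, with toy scores and the usual max-subtraction for numerical stability:

```python
import math


def softmax_policy(scores: dict) -> dict:
    """Turn per-action scores into a probability distribution over actions."""
    max_s = max(scores.values())  # subtract the max to avoid overflow in exp
    exps = {p: math.exp(s - max_s) for p, s in scores.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}


toy_scores = {(0, 1): 1.0, (1, 2): 1.2, (1, 3): 1.3}  # illustrative values
policy = softmax_policy(toy_scores)
greedy_action = max(policy, key=policy.get)
```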


6. 🎯 Supervised Learning

Soft Target

$$\tilde{y}_t(p) = (1 - \rho_s)\, e(p_t^*)(p) + \rho_s\, u_t(p)$$
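The soft target reads as label smoothing around the best action $p_t^*$. The sketch below assumes the $\rho_s u_t(p)$ term is added to the one-hot term and that $u_t$ defaults to a uniform distribution over the action space; the post does not define $u_t$, so a custom prior can be passed instead.

```python
def soft_target(actions, best_action, rho_s, prior=None):
    """(1 - rho_s) * one_hot(best_action) + rho_s * prior, per action."""
    if prior is None:
        # Assumption: u_t is uniform over the action space.
        prior = {p: 1.0 / len(actions) for p in actions}
    return {
        p: (1.0 - rho_s) * (1.0 if p == best_action else 0.0) + rho_s * prior[p]
        for p in actions
    }


actions = [(0, 1), (1, 2), (1, 3), (1, 4)]
target = soft_target(actions, best_action=(1, 2), rho_s=0.1)
```

With four actions and $\rho_s = 0.1$, the best action receives 0.925 and each other action 0.025.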

Loss

$$\mathcal{L}_{\text{SFT}} = \text{CE}(s_t, \tilde{y}_t)$$
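A sketch of the supervised loss, assuming that $s_t$ inside the cross-entropy denotes the vector of action scores (logits) rather than the decomposed step, and that CE is the standard cross-entropy between the softmax of those scores and the soft target.

```python
import math


def soft_cross_entropy(scores: dict, target: dict) -> float:
    """-sum_p target[p] * log softmax(scores)[p], via log-sum-exp."""
    max_s = max(scores.values())
    log_z = max_s + math.log(sum(math.exp(s - max_s) for s in scores.values()))
    return -sum(target[p] * (scores[p] - log_z) for p in scores)


target = {(0, 1): 0.9, (1, 2): 0.05, (1, 3): 0.05}
flat_scores = {(0, 1): 0.0, (1, 2): 0.0, (1, 3): 0.0}
peaked_scores = {(0, 1): 3.0, (1, 2): 0.0, (1, 3): 0.0}

loss_flat = soft_cross_entropy(flat_scores, target)
loss_peaked = soft_cross_entropy(peaked_scores, target)
```

When all scores are equal the loss equals $\log |\mathcal{P}|$; moving score mass toward the target action lowers it.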


7. 🚀 Reinforcement Learning (GRPO)

The core idea of StepRoute is to treat routing as a policy optimization problem.


Sampling

$$\{p_t^{(j)}\}_{j=1}^{G} \sim \pi_\theta(\cdot \mid X_t)$$
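Group sampling draws $G$ actions from the current policy for the same context. A minimal sketch with a toy policy; the helper name `sample_group` and the fixed seed are illustrative choices.

```python
import random


def sample_group(policy: dict, group_size: int, rng: random.Random) -> list:
    """Draw G actions i.i.d. from the policy distribution."""
    actions = list(policy)
    weights = [policy[p] for p in actions]
    return rng.choices(actions, weights=weights, k=group_size)


toy_policy = {(0, 1): 0.7, (1, 2): 0.2, (1, 3): 0.1}
group = sample_group(toy_policy, group_size=8, rng=random.Random(0))
```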


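Reward Composition

(1) Base Reward

The base reward $R^{(j)}_{\text{base}}$ combines a gate indicator with three weighted indicator terms (the sign attached to each weighted term is given in the paper):

  • $I_{\text{gate}}$
  • $\gamma_1 I_{\text{model}}$
  • $\gamma_2 I_{\text{miss}}$
  • $\gamma_3 I_{\text{stay}}$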

(2) Contrastive Reward

$$\Delta_t = \text{sim}(p_t^*) - \max_{p \neq p_t^*} \text{sim}(p)$$

$$R^{(j)}_{\text{ctr}} = \begin{cases} \max(0, \Delta_t) & p = p_t^* \\ -\max(0, \Delta_t) & \text{otherwise} \end{cases}$$


Total Reward

$$R_t^{(j)} = R_{\text{base}}^{(j)} + \lambda_{\text{ctr}} R_{\text{ctr}}^{(j)}$$
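The contrastive and total rewards can be sketched as follows. The per-action similarity values, the base reward of 1.0, and $\lambda_{\text{ctr}} = 0.5$ are toy numbers chosen for illustration; the base reward is passed in as a plain number because its exact form is not reproduced here.

```python
def contrastive_reward(sim: dict, best_action, sampled_action) -> float:
    """+margin if the sampled action is the best one, -margin otherwise."""
    best_other = max(v for p, v in sim.items() if p != best_action)
    margin = max(0.0, sim[best_action] - best_other)  # max(0, Delta_t)
    return margin if sampled_action == best_action else -margin


def total_reward(base: float, contrastive: float, lambda_ctr: float) -> float:
    """R_t = R_base + lambda_ctr * R_ctr."""
    return base + lambda_ctr * contrastive


sim = {(0, 1): 0.2, (1, 2): 0.9, (1, 3): 0.5}  # toy similarity scores
best = (1, 2)

r_hit = contrastive_reward(sim, best, sampled_action=(1, 2))
r_miss = contrastive_reward(sim, best, sampled_action=(0, 1))
r_total = total_reward(base=1.0, contrastive=r_hit, lambda_ctr=0.5)
```

The margin is clipped at zero, so when the best action is not clearly separated from the runner-up the contrastive term vanishes and only the base reward drives learning.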


Advantage (LOO)

$$A_b^{(j)} = \frac{R_b^{(j)} - \bar{R}_{b,\setminus j}}{\sqrt{\text{Var}(R_b) + \epsilon}}$$
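A sketch of the leave-one-out advantage for one group of rewards. It assumes $\bar{R}_{b,\setminus j}$ is the mean of the other $G - 1$ rewards in the group and that $\text{Var}(R_b)$ is the population variance over the whole group; the post does not state which variance estimator is used.

```python
import math


def loo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """(R_j - mean of the other rewards) / sqrt(Var(R) + eps) for each j."""
    g = len(rewards)
    if g < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    mean = total / g
    var = sum((r - mean) ** 2 for r in rewards) / g  # assumed: population variance
    scale = math.sqrt(var + eps)
    return [(r - (total - r) / (g - 1)) / scale for r in rewards]


advantages = loo_advantages([1.0, 0.0, 0.0, 1.0])
flat = loo_advantages([0.5, 0.5, 0.5, 0.5])
```

Excluding sample $j$ from its own baseline keeps the baseline independent of that sample's reward, which avoids shrinking its advantage toward zero in small groups.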


Objective

$$\mathcal{L}_{\text{RL}} = -\mathbb{E}[A \cdot \log \pi_\theta] + \lambda_{KL}\, KL + \lambda_{CE}\, CE + \lambda_H\, H$$


8. ⚙️ Execution Rule

  • If $g = 0$ at every step → use the SLM result
  • At the first step where $g = 1$:
    $\hat{a} = M_m(I_{\text{llm}}(q))$

👉 Early termination
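The execution rule can be sketched as a loop that stops at the first escalation. `route_step` and `call_llm` are hypothetical callables standing in for the trained router and the LLM call on the original question.

```python
from typing import Callable, Optional, Tuple


def run_with_early_termination(
    num_steps: int,
    route_step: Callable[[int], Tuple[int, int]],
    slm_answer: str,
    call_llm: Callable[[int], str],
) -> Tuple[str, Optional[int]]:
    """Return (answer, step index of the escalation or None)."""
    for t in range(num_steps):
        g, m = route_step(t)
        if g == 1:
            # First escalation: hand the question to LLM M_m and stop routing.
            return call_llm(m), t
    return slm_answer, None  # g = 0 at every step: keep the SLM answer


visited = []
decisions = [(0, 1), (0, 1), (1, 3), (1, 2)]  # toy router outputs


def toy_router(t: int) -> Tuple[int, int]:
    visited.append(t)
    return decisions[t]


answer, stop_step = run_with_early_termination(
    num_steps=4,
    route_step=toy_router,
    slm_answer="SLM answer",
    call_llm=lambda m: f"answer from M_{m}",
)
```

In this toy run the router is never queried for the last step, since the escalation at step index 2 ends the loop.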


9. 📊 Cost Model

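$$\text{Cost} = \sum_m c_m^{\text{out}} \cdot T_m^{\text{out}}$$

👉 Cost is computed from output tokens: for each model $m$, the per-token price $c_m^{\text{out}}$ times its output token count $T_m^{\text{out}}$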
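A worked instance of the cost formula. The prices and token counts are made-up numbers used only to show the arithmetic.

```python
def output_token_cost(price_per_token: dict, output_tokens: dict) -> float:
    """Sum over models of (price per output token) * (output tokens used)."""
    return sum(price_per_token[m] * n for m, n in output_tokens.items())


# Hypothetical prices: the LLM (model 2) is 10x the SLM (model 1) per token.
price_per_token = {1: 1.0, 2: 10.0}
routed_usage = {1: 200, 2: 50}   # SLM chain plus one LLM call
llm_only_usage = {2: 120}        # everything answered by the LLM

routed_cost = output_token_cost(price_per_token, routed_usage)
llm_only_cost = output_token_cost(price_per_token, llm_only_usage)
```

Here the routed run costs 200 + 500 = 700 units against 1200 for the LLM-only run, which illustrates how escalating only when needed can lower cost even though the SLM tokens are paid for as well.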


10. 📈 Key Results

Headline performance

  • Large performance gain over the SLM
  • Lower cost than the LLM

Main results comparing StepRoute with the baseline (SLM-only) and LLM-only


OOD Performance

  • OpenBookQA: 0.448 → 0.922
  • RACE-middle: 0.387 → 0.7987

👉 Strong generalization

Comparison with other routing methods


11. 🔍 Key Interpretation

Why StepRoute is effective:

  1. Use of intermediate context
  2. Step-level granularity
  3. Early stopping
  4. Cost-aware RL optimization

12. 📌 Summary

StepRoute is a method that learns a routing policy at the step level
to optimize the accuracy-cost trade-off.


13. 💭 Insight

Existing approaches:

  • Query-level → too little information
  • Token-level → excessive cost

👉 StepRoute:

A balance point between information and efficiency


14. 🚀 Extension Ideas

  • entropy-based routing
  • KV-cache-aware routing
  • confidence-based fallback
  • projection-based routing (ProRouter)

📎 Reference

  • StepRoute: Step-Wise Reflective Routing with Contrastive Reward Shaping
