StepRoute: Step-Wise Reflective Routing with Contrastive Reward Shaping

하임 · April 27, 2026

https://openreview.net/pdf?id=X4jyt8DWHN

🔥 Summary of StepRoute: Step-wise Reflective Routing

“A method for performing cost-aware LLM routing at the step level”

1. 📌 Problem Setting

The goal in LLM-based reasoning is:

Minimize cost while maintaining accuracy

Model Pool Definition

$\mathcal{M} = \{M_1, M_2, \dots, M_K\}$

( $M_1$ ): Small Language Model (SLM)
( $M_2, \dots, M_K$ ): Large Language Models (LLMs)

Input

Question ( $q$ )

2. 🧠 Inference Pipeline

StepRoute consists of three stages.

(1) Step Decomposition

The SLM decomposes the input into multiple steps:

$(s_t)_{t=1}^T = M_1(I_{\text{dec}}(q))$

(2) SLM Chain Generation

$y_{\text{chain}} = M_1(I_{\text{chain}}(q, (s_t)_{t=1}^T))$

Per-step chunk ( $c_t$ )
SLM answer ( $\hat{a}_{\text{SLM}}$ )

(3) Step-wise Routing

Routing is performed at each step:

Keep the SLM
or call an LLM

3. 🎯 Routing Decision Space

The action at each step:
$p = (g, m)$

( $g \in \{0,1\}$ )
- ( $g = 0$ ): keep the SLM
- ( $g = 1$ ): call an LLM
( $m \in \{1, \dots, K\}$ )

Action Space

$\mathcal{P} = \{(0,1)\} \cup \{(1,m) : m \in \{2, \dots, K\}\}$
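
As a quick illustration, this action space can be enumerated directly. The following is a minimal sketch; `build_action_space` and the `Action` alias are hypothetical names introduced here, not identifiers from the paper.

```python
from typing import List, Tuple

Action = Tuple[int, int]  # (g, m): gate bit and 1-based model index


def build_action_space(num_models: int) -> List[Action]:
    """Enumerate P = {(0, 1)} U {(1, m) : m in 2..K} for K = num_models."""
    if num_models < 2:
        raise ValueError("need at least one SLM (M_1) and one LLM (M_2)")
    stay = [(0, 1)]  # g = 0 always pairs with the SLM, M_1
    escalate = [(1, m) for m in range(2, num_models + 1)]
    return stay + escalate


actions = build_action_space(4)
print(actions)  # [(0, 1), (1, 2), (1, 3), (1, 4)]
```

The space has exactly $K$ actions: one stay action plus $K - 1$ escalation targets.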

4. 🧩 Routing Context

The router input at each step:
$X_t = (q, s_t, H_t, c_t)$

( $H_t$ ): summary of the previous steps
( $c_t$ ): current reasoning chunk

5. 🧠 Router Parameterization

Context Embedding

$h_t = \text{Pool}(\text{Enc}_{\text{SLM}}(\text{ser}(X_t))), \quad q_t = \text{norm}(W h_t)$
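
The embedding step can be sketched in plain Python. This is only an illustration under stated assumptions: $\text{Pool}$ is taken to be mean pooling and $\text{norm}$ to be L2 normalization (the summary does not say which are used), and the encoder output is replaced by a toy list of token vectors. The helper names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool(token_states: Sequence[Sequence[float]]) -> Vector:
    """Average token vectors into one context vector h_t (assumed pooling)."""
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[i] for tok in token_states) / n for i in range(dim)]


def project(weight: Sequence[Sequence[float]], h: Sequence[float]) -> Vector:
    """Compute W h for a row-major matrix W."""
    return [sum(w * x for w, x in zip(row, h)) for row in weight]


def l2_normalize(v: Sequence[float], eps: float = 1e-12) -> Vector:
    """Scale v to unit length (assumed meaning of norm)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / max(length, eps) for x in v]


# Toy stand-in for Enc_SLM(ser(X_t)): two token vectors of width 2.
token_states = [[2.0, 4.0], [4.0, 4.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for illustration only

h_t = mean_pool(token_states)        # [3.0, 4.0]
q_t = l2_normalize(project(W, h_t))  # unit-length query vector
print(h_t, q_t)
```

With a unit-length $q_t$, the inner products used for scoring depend on direction only, not on the magnitude of the encoder states.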

Pair Scoring

$s(p \mid X_t) = \langle q_t, e^{(g)}_g \rangle + b^{(g)}_g + \langle q_t, e^{(m)}_m \rangle + b^{(m)}_m + b^{(p)}_p$

Policy

$\pi_\theta(p \mid X_t) = \text{softmax}([s(p \mid X_t)]_{p \in \mathcal{P}})$
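
A small sketch of scoring and the softmax policy follows. It assumes the pair score is the sum of a gate term, a model term, and a pair bias, which is how the terms listed under Pair Scoring are read here. All parameter values are made up, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Action = Tuple[int, int]  # (g, m)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def pair_score(q_t, action, gate_emb, gate_bias, model_emb, model_bias, pair_bias) -> float:
    """Gate term + model term + pair bias (assumed additive form)."""
    g, m = action
    gate_term = dot(q_t, gate_emb[g]) + gate_bias[g]
    model_term = dot(q_t, model_emb[m]) + model_bias[m]
    return gate_term + model_term + pair_bias[action]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the action scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy parameters for K = 2 (one SLM, one LLM); every number is illustrative.
actions: List[Action] = [(0, 1), (1, 2)]
q_t = [1.0, 0.0]
gate_emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}
gate_bias = {0: 0.0, 1: 0.0}
model_emb = {1: [0.0, 0.0], 2: [0.0, 0.0]}
model_bias = {1: 0.0, 2: 0.0}
pair_bias = {(0, 1): 0.0, (1, 2): 0.0}

scores = [
    pair_score(q_t, p, gate_emb, gate_bias, model_emb, model_bias, pair_bias)
    for p in actions
]
policy = softmax(scores)
print(dict(zip(actions, policy)))
```

In this toy setup the query aligns with the stay gate embedding, so the policy puts most of its mass on $(0, 1)$.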

6. 🎯 Supervised Learning

Soft Target

$\tilde{y}_t(p) = (1 - \rho_s) e(p_t^*)(p) + \rho_s u_t(p)$

Loss

$\mathcal{L}_{\text{SFT}} = \text{CE}(s_t, \tilde{y}_t)$
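
The supervised objective can be sketched as follows, reading the soft target as a label-smoothed mixture of a one-hot vector on $p_t^*$ and a distribution $u_t$. The choice of a uniform $u_t$ below is an assumption made for illustration, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Action = Tuple[int, int]


def soft_target(actions: List[Action], expert: Action, rho: float,
                u: Dict[Action, float]) -> Dict[Action, float]:
    """(1 - rho) * one_hot(expert) + rho * u, evaluated per action."""
    return {
        p: (1.0 - rho) * (1.0 if p == expert else 0.0) + rho * u[p]
        for p in actions
    }


def soft_cross_entropy(scores: Dict[Action, float],
                       target: Dict[Action, float]) -> float:
    """Cross-entropy between the soft target and softmax(scores)."""
    top = max(scores.values())
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return -sum(target[p] * (scores[p] - log_z) for p in scores)


actions = [(0, 1), (1, 2), (1, 3)]
uniform = {p: 1.0 / len(actions) for p in actions}  # assumed choice of u_t
target = soft_target(actions, expert=(1, 2), rho=0.1, u=uniform)
scores = {p: 0.0 for p in actions}  # untrained router: all logits equal
loss = soft_cross_entropy(scores, target)
print(target, loss)
```

With equal logits the loss equals $\log 3$ regardless of the target, which is a handy sanity check for an implementation.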

7. 🚀 Reinforcement Learning (GRPO)

The core idea of StepRoute is to treat routing as a policy optimization problem.

Sampling

$\{p_t^{(j)}\}_{j=1}^G \sim \pi_\theta(\cdot \mid X_t)$

Reward Composition

(1) Base Reward

$R^{(j)}_{\text{base}} = I_{\text{gate}}$, combined with the weighted terms:

$\gamma_1 I_{\text{model}}$
$\gamma_2 I_{\text{miss}}$
$\gamma_3 I_{\text{stay}}$

(2) Contrastive Reward

$\Delta_t = \text{sim}(p_t^*) - \max_{p \neq p_t^*} \text{sim}(p)$

$R^{(j)}_{\text{ctr}} = \begin{cases} \max(0, \Delta_t) & p_t^{(j)} = p_t^* \\ -\max(0, \Delta_t) & \text{otherwise} \end{cases}$

Total Reward

$R_t^{(j)} = R_{\text{base}}^{(j)} + \lambda_{\text{ctr}} R_{\text{ctr}}^{(j)}$
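
The contrastive term and the total reward can be sketched as below. The similarity scores `sim` are taken as a given input, since this summary does not define how they are computed, and the base reward is passed in as a number. The function names and the toy values are hypothetical.

```python
from typing import Dict, Tuple

Action = Tuple[int, int]


def contrastive_reward(sampled: Action, expert: Action,
                       sim: Dict[Action, float]) -> float:
    """Signed margin: +max(0, delta) for the expert action, else -max(0, delta)."""
    best_other = max(v for p, v in sim.items() if p != expert)
    delta = sim[expert] - best_other
    margin = max(0.0, delta)
    return margin if sampled == expert else -margin


def total_reward(base: float, ctr: float, lambda_ctr: float) -> float:
    """R_t = R_base + lambda_ctr * R_ctr."""
    return base + lambda_ctr * ctr


# Toy similarity scores per action (illustrative values only).
sim = {(0, 1): 0.2, (1, 2): 0.9, (1, 3): 0.5}
expert = (1, 2)

r_hit = contrastive_reward((1, 2), expert, sim)   # about +0.4
r_miss = contrastive_reward((1, 3), expert, sim)  # about -0.4
print(r_hit, r_miss, total_reward(1.0, r_hit, lambda_ctr=0.5))
```

When the expert action does not have the highest similarity, $\Delta_t \le 0$ and the contrastive term vanishes for every sample, so only the base reward drives learning in that case.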

Advantage (LOO)

$A_b^{(j)} = \frac{R_b^{(j)} - \bar{R}_{b,\setminus j}} {\sqrt{\text{Var}(R_b) + \epsilon}}$
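
A leave-one-out advantage of this form can be computed in a few lines. The sketch below assumes $\text{Var}(R_b)$ is the population variance over the $G$ sampled rewards for one batch item; the function name and the default `eps` are hypothetical.

```python
import math
from typing import List, Sequence


def loo_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """(R_j - mean of the other rewards) / sqrt(Var(R) + eps) for each sample j."""
    g = len(rewards)
    if g < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    mean = total / g
    var = sum((r - mean) ** 2 for r in rewards) / g  # population variance (assumed)
    scale = math.sqrt(var + eps)
    # The baseline for sample j excludes R_j itself.
    return [(r - (total - r) / (g - 1)) / scale for r in rewards]


advantages = loo_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Excluding the sample's own reward from its baseline keeps the baseline independent of the action being scored, which is the usual motivation for a leave-one-out estimate.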

Objective

$\mathcal{L}_{\text{RL}} = -\mathbb{E}[A \cdot \log \pi_\theta] + \lambda_{KL} \text{KL} + \lambda_{CE} \text{CE} + \lambda_H H$

8. ⚙️ Execution Rule

If ( $g = 0$ ) at every step → use the SLM result
At the first step where ( $g = 1$ ):
$\hat{a} = M_m(I_{\text{llm}}(q))$

👉 Early termination
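
The execution rule can be sketched as a loop that stops at the first escalation. Here `route` and `call_llm` are hypothetical callables standing in for the trained router and the LLM API, and the chosen LLM receives only the question $q$, following $I_{\text{llm}}(q)$ above.

```python
from typing import Callable, List, Sequence, Tuple

Action = Tuple[int, int]  # (g, m)


def run_with_early_termination(
    question: str,
    chunks: Sequence[str],
    slm_answer: str,
    route: Callable[[int, str], Action],
    call_llm: Callable[[int, str], str],
) -> Tuple[str, int]:
    """Return (answer, model index); stop at the first step with g = 1."""
    for t, chunk in enumerate(chunks):
        g, m = route(t, chunk)
        if g == 1:
            # Early termination: later steps are never routed.
            return call_llm(m, question), m
    return slm_answer, 1  # every step stayed on the SLM


visited: List[int] = []


def toy_route(t: int, chunk: str) -> Action:
    visited.append(t)
    return (1, 3) if t == 1 else (0, 1)  # escalate to M_3 at the second step


def toy_llm(m: int, question: str) -> str:
    return f"M{m} answer to: {question}"


answer, model = run_with_early_termination(
    "q", ["c1", "c2", "c3"], "SLM answer", toy_route, toy_llm
)
print(answer, model, visited)  # the third step (index 2) is never routed
```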

9. 📊 Cost Model

$\text{Cost} = \sum_m c_m^{\text{out}} \cdot T_m^{\text{out}}$

👉 Cost is computed from output tokens
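
As a worked example of the cost formula, the sketch below sums per-model output-token counts weighted by a per-token price. The prices and token counts are invented for illustration and do not come from the paper.

```python
from typing import Dict


def total_cost(out_tokens: Dict[int, int], price_per_token: Dict[int, float]) -> float:
    """Sum over models m of c_m^out * T_m^out."""
    return sum(price_per_token[m] * n for m, n in out_tokens.items())


# Hypothetical prices per output token, keyed by model index m.
price_per_token = {1: 0.5, 2: 4.0}
# Hypothetical output-token counts for one query: the SLM chain plus one LLM call.
out_tokens = {1: 200, 2: 50}

print(total_cost(out_tokens, price_per_token))  # 300.0
```

Because the SLM chain is always generated before routing, its output tokens are paid for even when a later step escalates to an LLM.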

10. 📈 Key Results

Headline Performance

Large accuracy improvement over the SLM-only baseline
Lower cost than LLM-only inference

Main results comparing StepRoute with the Baseline (SLM-only) and LLM-only

OOD Performance

OpenBookQA: 0.448 → 0.922
RACE-middle: 0.387 → 0.7987

👉 Strong generalization

Comparison with other routing methods

11. 🔍 Key Interpretation

Why StepRoute is effective:

Intermediate context 활용
Step-level granularity
Early stopping
Cost-aware RL optimization

12. 📌 Summary

StepRoute learns a routing policy at the step level
to optimize the accuracy-cost trade-off.

13. 💭 Insight

Existing methods:

Query-level → too little information
Token-level → excessive cost

👉 StepRoute:

An optimal point between information and efficiency

14. 🚀 Extension Ideas

entropy-based routing
KV-cache-aware routing
confidence-based fallback
projection-based routing (ProRouter)

📎 Reference

StepRoute: Step-Wise Reflective Routing with Contrastive Reward Shaping
