Direct Preference Optimization: Your Language Model is Secretly a Reward Model

임재석 · January 4, 2024

1. Introduction

  • Unsupervised language models → trained on data generated by humans.
  • They also absorb common human mistakes and misconceptions (we want the model to be biased toward high-quality answers).
  • It is important to select the model's desired responses and behavior from its broad knowledge → typically done with RL.
  • Existing methods → steer the model using curated sets of human preferences
    (pretraining → preference learning)
  • The most straightforward strategy for preference learning is SFT on high-quality human demonstrations, but the most successful class of methods is RLHF.

RLHF

  • Fit a reward model to human preference data
  • After that, use RL to optimize a language model policy against the learned reward
  • more complex than SFT (training multiple LMs, sampling from the LM policy in the training loop)

DPO (Direct Preference Optimization)

  • Without explicit reward modeling / RL
  • implicitly optimizes the same objective as existing RLHF (reward maximization with a KL-divergence constraint)
  • simple to implement, straightforward to train
  • dynamic, per-example importance weights → prevent model degeneration
  • relies on a theoretical preference model (Bradley-Terry / Plackett-Luce)
  • defines the preference loss as a function of the policy

2. Related Work

  • Zero-Shot, Few-Shot
  • instruction tuning
  • fine-tune on human preference datasets under the BT / PL model (RLHF, using PPO or REINFORCE)

Outside the context of LMs

  • Learning policies from preferences has been studied in both bandit and reinforcement learning settings.
  • Contextual Dueling Bandits (CDB): replace the notion of an optimal policy with the von Neumann winner (a policy whose expected win rate against any other policy is at least 50%)
  • Preference-based RL (learning from binary preferences generated by an unknown scoring function)

3. Preliminaries

Review RLHF Pipelines

SFT Phase

  • Fine-tune a pretrained LM on a high-quality dataset for the downstream task
  • Obtain $\pi^{\text{SFT}}$

Reward Modeling Phase

  • Use the SFT model $\pi^{\text{SFT}}$ and a prompt $x$ to get a pair of answers $y_1$ and $y_2$

    • $(y_1, y_2) \sim \pi^{\text{SFT}}(y \ | \ x)$
  • Humans label which answer is preferred

    • $y_w \succ y_l$ (winner and loser)
  • These preferences are assumed to be generated by some latent reward model $r^*(x, y)$

    • In practice this reward is not accessible
    • use the Bradley-Terry or Plackett-Luce model
  • The BT model's human preference distribution $p^*$:

    • $$\begin{aligned} p^*(y_1 \succ y_2 \ | \ x) &= \frac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))} \\ &= \frac{1}{1 + \exp\left( r^*(x, y_2) - r^*(x, y_1) \right)} \\ &= \sigma\left( r^*(x, y_1) - r^*(x, y_2) \right) \end{aligned}$$
  • Assuming a static dataset $\mathcal{D} = \{ x^{(i)}, y_w^{(i)}, y_l^{(i)} \}_{i=1}^N$ sampled from $p^*$, parametrize a reward model $r_\phi(x, y)$ and estimate its parameters by maximum likelihood

    • Framed as binary classification, the negative log-likelihood is
      $$\mathcal{L}_\text{R}(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]$$ where $\sigma$ is the logistic function (see the sketch after this list)
    • For a language model, $r_\phi(x, y)$ is often initialized from $\pi^{\text{SFT}}(y \ | \ x)$ with a linear layer added on top of the final transformer layer → produces a single scalar prediction
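Below is a minimal sketch of this reward-modeling loss in PyTorch, assuming the scalar rewards for the preferred and dispreferred completions have already been computed by a reward model; the tensor names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_w: torch.Tensor, reward_l: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood: -log sigma(r_phi(x, y_w) - r_phi(x, y_l)).

    reward_w, reward_l: scalar rewards for the preferred / dispreferred completions, shape (batch,).
    """
    return -F.logsigmoid(reward_w - reward_l).mean()

# toy usage: random rewards stand in for the outputs of a real reward model
reward_w = torch.randn(4, requires_grad=True)
reward_l = torch.randn(4, requires_grad=True)
reward_model_loss(reward_w, reward_l).backward()
```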

RL Fine-Tuning Phase

  • Use the learned reward function to provide feedback to the language model

    • the optimization problem is formulated as
      $$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \ | \ x)} \left[ r_\phi(x, y) \right] - \beta \, \mathbb{D}_{\text{KL}} \left[ \pi_\theta(y \ | \ x) \ \| \ \pi_{\text{ref}}(y \ | \ x) \right]$$
    • $\beta$ is a parameter controlling the deviation from the base reference policy $\pi_{\text{ref}}$ (namely, $\pi^{\text{SFT}}$)
    • In practice, $\pi_\theta$ is also initialized to $\pi^{\text{SFT}}$
    • Since language generation is discrete, this objective is not differentiable; in practice the reward
      $$r(x, y) = r_\phi(x, y) - \beta \left( \log \pi_\theta(y \ | \ x) - \log \pi_{\text{ref}}(y \ | \ x) \right)$$
      is constructed and maximized with PPO (a sketch follows this list)
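A sketch of the KL-shaped reward above, assuming per-sequence log-probabilities (summed over completion tokens) are already available for the policy and the reference model; PPO itself is omitted and the names are illustrative.

```python
import torch

def kl_shaped_reward(r_phi: torch.Tensor,
                     logp_policy: torch.Tensor,
                     logp_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """r(x, y) = r_phi(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x)), all shape (batch,)."""
    return r_phi - beta * (logp_policy - logp_ref)

# toy usage with placeholder values
print(kl_shaped_reward(torch.tensor([1.2, -0.3]),
                       torch.tensor([-35.0, -40.0]),
                       torch.tensor([-34.0, -42.0])))
```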

Bradley-Terry Model (for backup)

Rank entities by pairwise comparisons

$$P(y_1 \succ y_2) = \frac{p_1}{p_1 + p_2}$$

where $p_i$ is a real-valued score for item $i$.
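As a quick numeric check with arbitrarily chosen scores:

$$P(y_1 \succ y_2) = \frac{p_1}{p_1 + p_2} = \frac{3}{3 + 1} = 0.75 \qquad (p_1 = 3,\ p_2 = 1)$$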

Plackett-Luce Model (for backup)

Rank entities by their worth

$$S = \{ i_1, i_2, \dots, i_J \}$$
$$P(i_1 \ | \ S) = \frac{\alpha_{i_1}}{\sum_{i \in S} \alpha_i}$$
$$P(i_1 \succ i_2 \succ \dots \succ i_J) = \prod_{j=1}^{J} \frac{\alpha_{i_j}}{\sum_{k=j}^{J} \alpha_{i_k}}$$
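A small sketch of this ranking probability with arbitrarily chosen worths; with two items it reduces to the Bradley-Terry probability.

```python
def plackett_luce_prob(worths):
    """P(i_1 > i_2 > ... > i_J) for items listed in ranked order;
    worths[j] is the worth alpha of the j-th ranked item."""
    prob = 1.0
    for j in range(len(worths)):
        # the j-th ranked item is chosen among all items not yet ranked
        prob *= worths[j] / sum(worths[j:])
    return prob

assert abs(plackett_luce_prob([3.0, 1.0]) - 0.75) < 1e-9   # Bradley-Terry special case
print(plackett_luce_prob([2.0, 1.0, 1.0]))                 # 2/4 * 1/2 * 1/1 = 0.25
```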

4. Direct Preference Optimization

  • Applying RL to large-scale models is challenging
  • DPO bypasses the reward modeling step and directly optimizes an LM on preference data

Deriving DPO Objective

Start from the same RL objective as prior work:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \ | \ x)} \left[ r_\phi(x, y) \right] - \beta \, \mathbb{D}_{\text{KL}} \left[ \pi_\theta(y \ | \ x) \ \| \ \pi_{\text{ref}}(y \ | \ x) \right]$$

the optimal solution is

$$\begin{aligned} \pi_r(y \ | \ x) &= \frac{1}{Z(x)} \pi_{\text{ref}}(y \ | \ x) \exp\left( \frac{1}{\beta} r(x, y) \right) \\ Z(x) &= \sum_y \pi_{\text{ref}}(y \ | \ x) \exp\left( \frac{1}{\beta} r(x, y) \right) \end{aligned}$$

Even if we use the MLE estimate $r_\phi$ of the ground-truth reward function $r^*$, the partition function $Z(x)$ is still difficult to estimate.
Taking the logarithm and rearranging, the reward can be expressed in terms of the policy:

$$\begin{aligned} \log \pi_r(y \ | \ x) &= \log \left( \frac{1}{Z(x)} \pi_{\text{ref}}(y \ | \ x) \exp\left( \frac{1}{\beta} r(x, y) \right) \right) \\ &= -\log Z(x) + \log \pi_{\text{ref}}(y \ | \ x) + \frac{1}{\beta} r(x, y) \end{aligned}$$
$$\therefore \; r(x, y) = \beta \log \frac{\pi_r(y \ | \ x)}{\pi_{\text{ref}}(y \ | \ x)} + \beta \log Z(x)$$
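A toy numeric check of the derivation above: compute the closed-form optimal policy for an arbitrary reference distribution and reward over four hypothetical completions, then invert it to recover the reward. All numbers are made up for illustration.

```python
import numpy as np

beta = 0.5
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])    # reference policy over 4 candidate completions
reward = np.array([1.0, 0.0, 2.0, -1.0])   # arbitrary reward r(x, y) for each completion

unnorm = pi_ref * np.exp(reward / beta)    # pi_ref(y|x) * exp(r(x,y) / beta)
Z = unnorm.sum()                           # partition function Z(x)
pi_star = unnorm / Z                       # optimal policy pi_r(y|x)

# inverting: beta * log(pi_star / pi_ref) + beta * log Z recovers the reward exactly
recovered = beta * np.log(pi_star / pi_ref) + beta * np.log(Z)
assert np.allclose(recovered, reward)
```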

For the Bradley-Terry case, plugging the $r(x, y)$ above into the preference model (the $\beta \log Z(x)$ term cancels in the difference), the optimal RLHF policy $\pi^*$ satisfies

$$p^*(y_1 \succ y_2 \ | \ x) = \frac{1}{1 + \exp\left( \beta \log \frac{\pi^*(y_2 \ | \ x)}{\pi_{\text{ref}}(y_2 \ | \ x)} - \beta \log \frac{\pi^*(y_1 \ | \ x)}{\pi_{\text{ref}}(y_1 \ | \ x)} \right)}$$

Then the DPO loss becomes

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \ | \ x)}{\pi_{\text{ref}}(y_w \ | \ x)} - \beta \log \frac{\pi_\theta(y_l \ | \ x)}{\pi_{\text{ref}}(y_l \ | \ x)} \right) \right]$$

Finally, the loss no longer depends on an explicit reward function $r$, so DPO bypasses the explicit reward modeling step. The theoretical properties of the Bradley-Terry model still hold.
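A minimal sketch of this loss in PyTorch, assuming sequence-level log-probabilities (summed over completion tokens) have already been computed for the policy and the frozen reference model; the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigma( beta*log(pi/pi_ref)(y_w) - beta*log(pi/pi_ref)(y_l) ), inputs of shape (batch,)."""
    r_w = beta * (policy_logp_w - ref_logp_w)   # implicit reward of the preferred completion
    r_l = beta * (policy_logp_l - ref_logp_l)   # implicit reward of the dispreferred completion
    return -F.logsigmoid(r_w - r_l).mean()
```

In practice the sequence log-probabilities would be obtained by summing the per-token log-probabilities of each completion under the policy and under the frozen reference model.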

What does the DPO update do?

The gradient of the DPO loss with respect to $\theta$ is

$$\begin{aligned} \nabla_\theta \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\beta \, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \bigg[ \underbrace{\sigma\big( \hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w) \big)}_{\text{higher weight when the reward estimate is wrong}} \Big[ \underbrace{\nabla_\theta \log \pi_\theta(y_w \ | \ x)}_{\text{increase likelihood of } y_w} - \underbrace{\nabla_\theta \log \pi_\theta(y_l \ | \ x)}_{\text{decrease likelihood of } y_l} \Big] \bigg] \end{aligned}$$

where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \ | \ x)}{\pi_{\text{ref}}(y \ | \ x)}$ is the reward implicitly defined by the two language models (policy and reference).

This gradient increases the likelihood of $y_w$ and decreases the likelihood of $y_l$. Importantly, each example is weighted by how strongly the implicit reward $\hat{r}_\theta$ incorrectly orders the completions (i.e., rates $y_l$ above $y_w$), scaled by $\beta$.
The paper's experiments suggest this weighting is important (a small sketch of the weight follows).
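The per-example weight in this gradient can be read off from the same implicit rewards; a small sketch with placeholder log-probabilities (not real model outputs):

```python
import torch

beta = 0.1
# placeholder sequence log-probabilities, shape (batch,)
policy_logp_w = torch.tensor([-30.0, -50.0]); ref_logp_w = torch.tensor([-31.0, -52.0])
policy_logp_l = torch.tensor([-28.0, -60.0]); ref_logp_l = torch.tensor([-30.0, -55.0])

r_hat_w = beta * (policy_logp_w - ref_logp_w)   # implicit reward of y_w
r_hat_l = beta * (policy_logp_l - ref_logp_l)   # implicit reward of y_l
weight = torch.sigmoid(r_hat_l - r_hat_w)       # > 0.5 when the implicit reward wrongly prefers y_l
print(weight)   # tensor([0.5250, 0.3318]) -- the first example gets the larger update
```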

DPO outline

The general DPO pipeline is:

  • Sample completions $y_1$, $y_2$ from the reference model $\pi_{\text{ref}}$

  • Label with human preferences to construct an offline dataset $\mathcal{D} = \{ x^{(i)}, y_w^{(i)}, y_l^{(i)} \}$

  • Optimize the language model $\pi_\theta$ to minimize $\mathcal{L}_{\text{DPO}}$ given $\pi_{\text{ref}}$, $\mathcal{D}$, and $\beta$ (one can also reuse publicly available preference datasets)

  • If $\pi^{\text{SFT}}$ is available, set $\pi_{\text{ref}} = \pi^{\text{SFT}}$; otherwise, obtain $\pi_{\text{ref}}$ by maximizing the likelihood of the preferred completions $(x, y_w)$ (a training sketch follows this list)
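Putting the pipeline together, here is a toy end-to-end DPO run under strong simplifying assumptions: the "policy" is just a learnable categorical distribution over four canned responses to a single prompt, the reference policy is uniform, and there is a single preference pair; nothing here models real tokenized text.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
policy_logits = torch.zeros(4, requires_grad=True)   # learnable policy over 4 responses
ref_logp = torch.log(torch.full((4,), 0.25))         # frozen uniform reference policy
beta = 0.1
opt = torch.optim.Adam([policy_logits], lr=0.1)

y_w, y_l = 2, 0   # offline preference: response 2 was preferred over response 0

for step in range(200):
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    r_w = beta * (policy_logp[y_w] - ref_logp[y_w])   # implicit reward of the winner
    r_l = beta * (policy_logp[y_l] - ref_logp[y_l])   # implicit reward of the loser
    loss = -F.logsigmoid(r_w - r_l)                   # L_DPO for this single pair
    opt.zero_grad()
    loss.backward()
    opt.step()

print(F.softmax(policy_logits, dim=-1))   # probability mass shifts toward response 2, away from 0
```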

5. Theoretical Analysis of DPO

5.1 Your Language Model Is Secretly a Reward Model

The first lemma is the well-known under-specification issue of the Plackett-Luce family: reward functions in the same equivalence class (differing only by a function of the prompt $x$) induce the same preference distribution, so additional identifiability constraints are needed. The end goal is to show that DPO's reparameterization can recover an arbitrary reward function from its equivalence class.

Proof for Lemma 1

Proof for Lemma 2

Theorem 1 also specifies exactly which reward function within each equivalence class the DPO reparameterization selects.
The reward function satisfies

$$\sum_y \underbrace{\pi_{\text{ref}}(y \ | \ x) \exp\left( \frac{1}{\beta} r(x, y) \right)}_{=\,\pi(y \ | \ x)} = 1$$

i.e. $\pi(y \ | \ x)$ is a valid probability distribution.
Also, we can impose certain constraints on the under-constrained Plackett-Luce family such that the class of representable reward models is preserved.
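A quick numeric check of this normalization property: for any policy $\pi$ over the same support, the reparameterized reward $r(x, y) = \beta \log \frac{\pi(y \ | \ x)}{\pi_{\text{ref}}(y \ | \ x)}$ makes the sum above equal one (toy distributions chosen arbitrarily).

```python
import numpy as np

beta = 0.5
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])
pi = np.array([0.1, 0.2, 0.3, 0.4])       # any valid policy over the same 4 completions

r = beta * np.log(pi / pi_ref)            # DPO-style reparameterized reward
assert np.isclose((pi_ref * np.exp(r / beta)).sum(), 1.0)   # pi_ref * exp(r/beta) is exactly pi
```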

5.2 Instability of Actor-Critic Algorithms

In the RL fine-tuning step of RLHF, the optimization objective is

$$\max_{\pi_\theta} \mathbb{E}_{\pi_\theta(y \ | \ x)} \bigg[ \underbrace{r_\phi(x, y) - \beta \log \sum_y \pi_{\text{ref}}(y \ | \ x) \exp\left( \frac{1}{\beta} r_\phi(x, y) \right)}_{f(r_\phi, \pi_{\text{ref}}, \beta)} - \underbrace{\beta \log \frac{\pi_\theta(y \ | \ x)}{\pi_{\text{ref}}(y \ | \ x)}}_{\text{KL}} \bigg]$$

This is the same reward that DPO optimizes implicitly. The normalization term in $f$ can be interpreted as the soft value function of the reference policy $\pi_{\text{ref}}$; without it, the policy gradient can have high variance, making learning unstable.
Prior work normalizes the reward using a baseline of human completions, whereas DPO's reparameterization yields a reward that requires no baseline.

6. Experiments

  • well-controlled text generation: how efficiently can each method maximize reward while minimizing KL-divergence from the reference policy

    • IMDb sentiment dataset
    • preference pairs: generated with a pretrained sentiment classifier
    • SFT model: GPT-2-large
  • summarization and dialogue: DPO performance with larger models on more difficult RLHF tasks

    • Summarization : Reddit TL;DR dataset
    • Single-Turn dialogue : Anthropic Helpful and Harmless dataset
  • Evaluation via win rate against a baseline policy, with GPT-4 as a proxy for human judgment

    • GPT-4 (S): "Which summary is better?"
    • GPT-4 (C): "Which summary is more concise?"

7. Discussion

  • Rather than the standard RL approach, DPO identifies a mapping between LM policies and reward functions
    → trains the LM to satisfy human preferences directly with a simple cross-entropy loss, without RL
  • With little to no hyperparameter tuning, DPO performs comparably to or better than existing RLHF methods
  • High-quality automated judgment is needed (GPT-4's win-rate judgments are affected by the prompt)
  • This study applied DPO to models with up to 6B parameters → how well does it scale to larger models?
