Unsupervised Language Models → trained on data generated by humans.
The training data also contains common human mistakes and misconceptions, which the model learns (whereas we want the model to be biased toward high-quality answers)
It is therefore important to select the model's desired responses and behaviors from its broad knowledge → typically done with RL
Existing methods → steer LMs using curated sets of human preferences
(pretraining → preference learning)
But the most straightforward strategy is SFT on high-quality demonstrations of the desired behavior
RLHF
Fit a reward model to a dataset of human preferences
After that, use RL to optimize a language model policy
more complex than SFT (requires training multiple LMs and sampling from the LM policy during training)
DPO (Direct Preference Optimization)
Without explicit reward modeling / RL
implicitly optimizes the same objective as existing RLHF (reward maximization with KL-divergence constraint)
simple to implement, straightforward to train
dynamic, per-example importance weight → prevents model degeneration
relies on a theoretical preference model (Bradley-Terry / Plackett-Luce)
defines the preference loss directly as a function of the policy
2. Related Work
Zero-Shot, Few-Shot
instruction tuning
fine-tune on human preference datasets under BT, PL model (RLHF, using PPO, REINFORCE)
Outside the context of LM
Learning Policies from preferences has been studied in both bandit and reinforcement learning settings.
Contextual bandit learning: contextual dueling bandits (CDB), where the learning target is the von Neumann winner (a policy whose expected win rate against any other policy is at least 50%)
Preference-based RL (learning from binary preferences generated by an unknown scoring function)
3. Preliminary
Review RLHF Pipelines
SFT Phase
Fine-tune the LM on a high-quality dataset for the downstream task of interest (a minimal sketch of this objective follows below)
Obtain πSFT
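As a minimal sketch of this step (standard next-token cross-entropy fine-tuning; the model name and training text below are placeholders, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder base model; the paper uses task-specific SFT checkpoints
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

batch = tokenizer(["<prompt> <high-quality completion>"], return_tensors="pt")
# Causal-LM cross-entropy: passing labels=input_ids makes the model compute
# the next-token negative log-likelihood internally (labels shifted by one).
out = model(input_ids=batch.input_ids, labels=batch.input_ids)
out.loss.backward()  # one SFT gradient step would follow (optimizer.step())
```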
Reward Modeling Phase
Use SFT Model πSFT and prompt x to get pairs of answers y1 and y2
$(y_1, y_2) \sim \pi_{\text{SFT}}(y \mid x)$
A human labels which answer is preferred
$y_w \succ y_l$ (winner and loser)
These preferences are assumed to be generated by some latent reward model $r^*(x, y)$; under the Bradley-Terry model, $p^*(y_1 \succ y_2 \mid x) = \dfrac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))}$
Assuming a static dataset $\mathcal{D} = \{x^{(i)}, y_w^{(i)}, y_l^{(i)}\}_{i=1}^N$ sampled from $p^*$, parametrize a reward model $r_\phi(x, y)$ and estimate its parameters by maximum likelihood
Framing this as binary classification, the negative log-likelihood is $\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$, where $\sigma$ is the logistic (sigmoid) function
For language models, $r_\phi(x, y)$ is often initialized from $\pi_{\text{SFT}}(y \mid x)$ with a linear layer added on top of the final transformer layer → produces a single scalar reward prediction
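A minimal PyTorch sketch of this pairwise reward-modeling loss (assuming the scalar head has already produced rewards for the chosen and rejected completions; the function name and toy tensors are placeholders, not the paper's code):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood for a batch of preference pairs.

    reward_chosen / reward_rejected: shape (batch,), the scalar predictions
    r_phi(x, y_w) and r_phi(x, y_l) from the reward head.
    """
    # -log sigma(r_w - r_l), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy usage with random scores standing in for reward-head outputs
loss = reward_model_loss(torch.randn(8, requires_grad=True), torch.randn(8, requires_grad=True))
loss.backward()
```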
RL Fine-Tuning Phase
Use learned reward function to provide feedback to the language model
the optimization problem is formulated as $\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\big]$
β is a parameter controlling the deviation from the base reference policy πref (namely, πSFT)
In practice, πθ is also initialized to πSFT
As language generation is discrete, this objective is not differentiable; the standard approach is to construct the reward $r(x, y) = r_\phi(x, y) - \beta\big(\log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)\big)$ and maximize it with PPO
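A minimal sketch of this KL-penalized reward (assuming the reward-model score and the per-sequence log-probabilities under $\pi_\theta$ and $\pi_{\text{ref}}$ are already available; all names are placeholders):

```python
import torch

def kl_shaped_reward(r_phi: torch.Tensor,
                     logp_policy: torch.Tensor,
                     logp_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Per-sequence reward maximized by PPO in the RLHF fine-tuning phase.

    r_phi:       reward-model scores r_phi(x, y), shape (batch,)
    logp_policy: log pi_theta(y|x) summed over completion tokens, shape (batch,)
    logp_ref:    log pi_ref(y|x) summed over completion tokens, shape (batch,)
    """
    # r(x, y) = r_phi(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x))
    return r_phi - beta * (logp_policy - logp_ref)
```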
4. Direct Preference Optimization
The KL-constrained objective has the closed-form optimum $\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\big(\frac{1}{\beta} r(x, y)\big)$, but even with the MLE estimate $r_\phi$ of the ground-truth reward $r^*$, the partition function $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\big(\frac{1}{\beta} r(x, y)\big)$ is difficult to estimate.
Taking the logarithm and rearranging, the reward is expressed in terms of its optimal policy: $r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$
Plugging this into the Bradley-Terry model, $\beta \log Z(x)$ cancels, and the loss becomes a function of the policy alone: $\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big]$
Since the loss never references an explicit reward model, the explicit reward modeling step is bypassed, while the theoretical properties of the Bradley-Terry model still hold.
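A minimal PyTorch sketch of this loss (assuming the per-sequence log-probabilities of the chosen and rejected completions under $\pi_\theta$ and $\pi_{\text{ref}}$ have already been computed; the function and argument names are placeholders):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument is a (batch,) tensor of log-probabilities of a completion
    given its prompt, summed over completion tokens.
    """
    # implicit rewards: r_hat(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log sigma(r_hat_w - r_hat_l), averaged over the batch
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Only the four log-probabilities are needed, so no reward model and no sampling from $\pi_\theta$ is required during training.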
What does the DPO update do?
The gradient of the Loss function of DPO with respect to θ is
$\nabla_\theta \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\underbrace{\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)}_{\text{higher weight when reward estimate is wrong}} \big[\underbrace{\nabla_\theta \log \pi(y_w \mid x)}_{\text{increase likelihood of } y_w} - \underbrace{\nabla_\theta \log \pi(y_l \mid x)}_{\text{decrease likelihood of } y_l}\big]\Big]$
where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the reward implicitly defined by the language model $\pi_\theta$ and the reference model $\pi_{\text{ref}}$.
The gradient increases the likelihood of $y_w$ and decreases the likelihood of $y_l$; examples are weighted by how much higher the implicit reward $\hat{r}_\theta$ rates the dispreferred completion $y_l$ than $y_w$, scaled by β, i.e., by how incorrectly the implicit reward model orders the pair (sketched below).
The experiments suggest this weighting is important: a naive unweighted version can cause the model to degenerate.
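A small sketch of this per-example weight, computed from the same implicit rewards as the loss sketch above (arguments are placeholder log-probability tensors):

```python
import torch

def dpo_example_weights(policy_logp_chosen, policy_logp_rejected,
                        ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Gradient weight sigma(r_hat(x, y_l) - r_hat(x, y_w)) per preference pair.

    Close to 1 when the implicit reward wrongly prefers y_l over y_w,
    close to 0 when the pair is already ordered correctly with a large margin.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    return torch.sigmoid(rejected_reward - chosen_reward)
```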
DPO outline
The general DPO pipeline is:
Sample completions y1, y2 from reference model πref
Label them with human preferences to construct an offline preference dataset $\mathcal{D} = \{x^{(i)}, y_w^{(i)}, y_l^{(i)}\}_{i=1}^N$
Optimize the language model $\pi_\theta$ to minimize $\mathcal{L}_{\text{DPO}}$ given $\pi_{\text{ref}}$, $\mathcal{D}$, and β (one can also reuse publicly available preference datasets instead of sampling new completions); see the sketch after this list
If $\pi_{\text{SFT}}$ is available, set $\pi_{\text{ref}} = \pi_{\text{SFT}}$; otherwise, obtain $\pi_{\text{ref}}$ by maximizing the likelihood of the preferred completions $(x, y_w)$
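A usage sketch of step 3, computing per-completion log-probabilities with Hugging Face transformers and plugging them into the DPO loss (model names, prompts, and completions are placeholders; in practice $\pi_{\text{ref}}$ is a frozen copy of the SFT model and data comes in batches):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt: str, completion: str) -> torch.Tensor:
    """Sum of log-probabilities of the completion tokens given the prompt.

    Assumes tokenizing `prompt` and `prompt + completion` yields a consistent
    prompt prefix (true for typical BPE tokenizers on plain text).
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    logits = model(input_ids=full_ids).logits                  # (1, T, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)       # position t predicts token t+1
    token_logprobs = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, prompt_ids.shape[1] - 1 :].sum()  # completion tokens only

# placeholder models: pi_theta is trained, pi_ref is a frozen copy
policy = AutoModelForCausalLM.from_pretrained("gpt2")
reference = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
for p in reference.parameters():
    p.requires_grad_(False)

beta = 0.1
x, yw, yl = "Prompt: summarize the post.", " Preferred summary.", " Dispreferred summary."
pi_w = completion_logprob(policy, tokenizer, x, yw)
pi_l = completion_logprob(policy, tokenizer, x, yl)
ref_w = completion_logprob(reference, tokenizer, x, yw)
ref_l = completion_logprob(reference, tokenizer, x, yl)

# L_DPO = -log sigma(beta * [(log pi_w - log ref_w) - (log pi_l - log ref_l)])
loss = -F.logsigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))
loss.backward()
```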
5. Theoretical Analysis of DPO
5.1 Your Language Model Is Secretly a Reward Model
Define two reward functions as equivalent iff they differ only by a prompt-dependent term: $r'(x, y) = r(x, y) + f(x)$. Lemma 1 (rewards in the same class induce the same preference distribution) is the well-known under-specification issue with the Plackett-Luce family, which normally requires additional identifiability constraints. Lemma 2 states that rewards in the same class also yield the same optimal policy under the KL-constrained RL problem, so for the final objective it suffices to recover an arbitrary reward function from the optimal class.
Proof for Lemma 1
Proof for Lemma 2
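A brief proof sketch for both lemmas, following the paper's argument: rewards in the same equivalence class differ only by a prompt-dependent term $f(x)$, which cancels in both places.
$$p_{r+f}(y_1 \succ y_2 \mid x) = \sigma\big((r(x, y_1) + f(x)) - (r(x, y_2) + f(x))\big) = \sigma\big(r(x, y_1) - r(x, y_2)\big) = p_r(y_1 \succ y_2 \mid x)$$
$$\pi_{r+f}(y \mid x) = \frac{\pi_{\text{ref}}(y \mid x) \exp\big(\frac{1}{\beta}(r(x, y) + f(x))\big)}{\sum_{y'} \pi_{\text{ref}}(y' \mid x) \exp\big(\frac{1}{\beta}(r(x, y') + f(x))\big)} = \pi_r(y \mid x), \quad \text{since } \exp\big(\tfrac{1}{\beta} f(x)\big) \text{ cancels}$$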
Theorem 1 also specifies exactly which reward function within each equivalence class the DPO reparameterization selects.
The selected reward function satisfies
$\sum_y \pi_{\text{ref}}(y \mid x) \exp\big(\tfrac{1}{\beta} r(x, y)\big) = 1$
i.e., $\pi(y \mid x) = \pi_{\text{ref}}(y \mid x) \exp\big(\tfrac{1}{\beta} r(x, y)\big)$ is a valid distribution.
Also, we can impose certain constraints on the under-constrained Plackett-Luce family such that we preserve the class of reward models.
5.2 Instability of Actor-Critic Algorithms
In the RL fine-tuning step of RLHF, the optimization objective can be rewritten (using the reparameterized reward) as $\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[ r_\phi(x, y) - \beta \log \sum_{y'} \pi_{\text{ref}}(y' \mid x) \exp\big(\tfrac{1}{\beta} r_\phi(x, y')\big) - \beta \log \tfrac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \big]$
The first two terms form a reward from the same equivalence class as DPO's implicit reward. Without the normalization term (which can be interpreted as the soft value function of the reference policy $\pi_{\text{ref}}$), learning can be unstable because the policy gradient has high variance.
Prior work normalized rewards using a baseline computed from human completions, whereas the DPO reparameterization yields a reward that requires no baseline.
6. Experiments
well-controlled text generation: how efficiently does each method maximize reward while minimizing KL-divergence from the reference policy?
IMDb
preference pairs: labeled with a pre-trained sentiment classifier (serving as the ground-truth reward)
SFT : GPT-2-large
summarization and dialogue: evaluate DPO on larger models and more difficult RLHF tasks
Summarization : Reddit TL;DR dataset
Single-Turn dialogue : Anthropic Helpful and Harmless dataset
Evaluation: win rate against a baseline policy, with GPT-4 as a proxy for human judgment
GPT-4 (S): "Which summary is better?"
GPT-4 (C): "Which summary is more concise?"
7. Discussion
Rather than the standard RL approach, DPO identifies a mapping between LM policies and reward functions → trains the LM to satisfy human preferences directly with a simple cross-entropy loss, without RL
With little to no hyperparameter tuning, DPO performs as well as or better than RLHF
Limitation: evaluation relies on high-quality automated judgments (GPT-4 win rates are affected by the prompt)
The study applied DPO to models up to 6B parameters → scaling to larger models is an open question