Unsupervised Language Models → trained on data generated by humans.
The training data also contains common human mistakes and misconceptions, which the model learns (whereas we want the model to be biased toward high-quality answers)
It is therefore important to select the model's desired responses and behaviors from its broad knowledge → typically done with RL
Existing methods → steer LMs using curated sets of human preferences
(pretraining → preference learning)
But the most straightforward strategy is SFT on high-quality demonstrations of the desired behavior
RLHF
Fit a reward model to a dataset of human preferences
After that, use RL to optimize a language model policy
more complex than SFT (requires training multiple LMs and sampling from the LM policy during training)
DPO (Direct Preference Optimization)
Without explicit reward modeling / RL
implicitly optimizes the same objective as existing RLHF (reward maximization with KL-divergence constraint)
simple to implement, straightforward to train
dynamic, per-example importance weight → prevents model degeneration
relies on a theoretical preference model (Bradley-Terry / Plackett-Luce)
defines the preference loss directly as a function of the policy
2. Related Work
Zero-Shot, Few-Shot
instruction tuning
fine-tune on human preference datasets under BT, PL model (RLHF, using PPO, REINFORCE)
Outside the context of LM
Learning Policies from preferences has been studied in both bandit and reinforcement learning settings.
Contextual bandit learning: contextual dueling bandits (CDB), where the learning target is the von Neumann winner (a policy whose expected win rate against any other policy is at least 50%)
Preference-based RL (learning from binary preferences generated by an unknown scoring function)
3. Preliminary
Review RLHF Pipelines
SFT Phase
Fine-tune the LM on a high-quality dataset for the downstream task of interest (a minimal sketch of this objective follows below)
Obtain πSFT
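As a minimal sketch of this step (standard next-token cross-entropy fine-tuning; the model name and training text below are placeholders, not the paper's setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder base model; the paper uses task-specific SFT checkpoints
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

batch = tokenizer(["<prompt> <high-quality completion>"], return_tensors="pt")
# Causal-LM cross-entropy: passing labels=input_ids makes the model compute
# the next-token negative log-likelihood internally (labels shifted by one).
out = model(input_ids=batch.input_ids, labels=batch.input_ids)
out.loss.backward()  # one SFT gradient step would follow (optimizer.step())
```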
Reward Modeling Phase
Use SFT Model πSFT and prompt x to get pairs of answers y1 and y2
$(y_1, y_2) \sim \pi_{\text{SFT}}(y \mid x)$
A human labels which answer is preferred
$y_w \succ y_l$ (winner and loser)
These preferences are assumed to be generated by some latent reward model $r^*(x, y)$; under the Bradley-Terry model, $p^*(y_1 \succ y_2 \mid x) = \dfrac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))}$
Assuming a static dataset $\mathcal{D} = \{x^{(i)}, y_w^{(i)}, y_l^{(i)}\}_{i=1}^N$ sampled from $p^*$, parametrize a reward model $r_\phi(x, y)$ and estimate its parameters by maximum likelihood
Framing this as binary classification, the negative log-likelihood is $\mathcal{L}_R(r_\phi, \mathcal{D}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$, where $\sigma$ is the logistic (sigmoid) function
For language models, $r_\phi(x, y)$ is often initialized from $\pi_{\text{SFT}}(y \mid x)$ with a linear layer added on top of the final transformer layer → produces a single scalar reward prediction
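A minimal PyTorch sketch of this pairwise reward-modeling loss (assuming the scalar head has already produced rewards for the chosen and rejected completions; the function name and toy tensors are placeholders, not the paper's code):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry negative log-likelihood for a batch of preference pairs.

    reward_chosen / reward_rejected: shape (batch,), the scalar predictions
    r_phi(x, y_w) and r_phi(x, y_l) from the reward head.
    """
    # -log sigma(r_w - r_l), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# toy usage with random scores standing in for reward-head outputs
loss = reward_model_loss(torch.randn(8, requires_grad=True), torch.randn(8, requires_grad=True))
loss.backward()
```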
RL Fine-Tuning Phase
Use learned reward function to provide feedback to the language model
the optimization problem is formulated as $\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x)\big]$
β is a parameter controlling the deviation from the base reference policy πref (namely, πSFT)
In practice, πθ is also initialized to πSFT
As language generation is discrete, this objective is not differentiable; the standard approach is to construct the reward $r(x, y) = r_\phi(x, y) - \beta\big(\log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)\big)$ and maximize it with PPO
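A minimal sketch of this KL-penalized reward (assuming the reward-model score and the per-sequence log-probabilities under $\pi_\theta$ and $\pi_{\text{ref}}$ are already available; all names are placeholders):

```python
import torch

def kl_shaped_reward(r_phi: torch.Tensor,
                     logp_policy: torch.Tensor,
                     logp_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Per-sequence reward maximized by PPO in the RLHF fine-tuning phase.

    r_phi:       reward-model scores r_phi(x, y), shape (batch,)
    logp_policy: log pi_theta(y|x) summed over completion tokens, shape (batch,)
    logp_ref:    log pi_ref(y|x) summed over completion tokens, shape (batch,)
    """
    # r(x, y) = r_phi(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x))
    return r_phi - beta * (logp_policy - logp_ref)
```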
4. Direct Preference Optimization
The KL-constrained objective has the closed-form optimum $\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\big(\frac{1}{\beta} r(x, y)\big)$, but even with the MLE estimate $r_\phi$ of the ground-truth reward $r^*$, the partition function $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\big(\frac{1}{\beta} r(x, y)\big)$ is difficult to estimate.
Taking the logarithm and rearranging, the reward is expressed in terms of its optimal policy: $r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$
Plugging this into the Bradley-Terry model, $\beta \log Z(x)$ cancels, and the loss becomes a function of the policy alone: $\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\log \sigma\Big(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big]$
Since the loss never references an explicit reward model, the explicit reward modeling step is bypassed, while the theoretical properties of the Bradley-Terry model still hold.
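A minimal PyTorch sketch of this loss (assuming the per-sequence log-probabilities of the chosen and rejected completions under $\pi_\theta$ and $\pi_{\text{ref}}$ have already been computed; the function and argument names are placeholders):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each argument is a (batch,) tensor of log-probabilities of a completion
    given its prompt, summed over completion tokens.
    """
    # implicit rewards: r_hat(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # -log sigma(r_hat_w - r_hat_l), averaged over the batch
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Only the four log-probabilities are needed, so no reward model and no sampling from $\pi_\theta$ is required during training.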
What does the DPO update do?
The gradient of the Loss function of DPO with respect to θ is
$\nabla_\theta \mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[\underbrace{\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)}_{\text{higher weight when reward estimate is wrong}} \big[\underbrace{\nabla_\theta \log \pi(y_w \mid x)}_{\text{increase likelihood of } y_w} - \underbrace{\nabla_\theta \log \pi(y_l \mid x)}_{\text{decrease likelihood of } y_l}\big]\Big]$
where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$ is the reward implicitly defined by the language model $\pi_\theta$ and the reference model $\pi_{\text{ref}}$.
The gradient increases the likelihood of $y_w$ and decreases the likelihood of $y_l$; examples are weighted by how much higher the implicit reward $\hat{r}_\theta$ rates the dispreferred completion $y_l$ than $y_w$, scaled by β, i.e., by how incorrectly the implicit reward model orders the pair (sketched below).
The experiments suggest this weighting is important: a naive unweighted version can cause the model to degenerate.
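A small sketch of this per-example weight, computed from the same implicit rewards as the loss sketch above (arguments are placeholder log-probability tensors):

```python
import torch

def dpo_example_weights(policy_logp_chosen, policy_logp_rejected,
                        ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Gradient weight sigma(r_hat(x, y_l) - r_hat(x, y_w)) per preference pair.

    Close to 1 when the implicit reward wrongly prefers y_l over y_w,
    close to 0 when the pair is already ordered correctly with a large margin.
    """
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    return torch.sigmoid(rejected_reward - chosen_reward)
```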
DPO outline
The general DPO pipeline is:
Sample completions y1, y2 from reference model πref
Label them with human preferences to construct an offline preference dataset $\mathcal{D} = \{x^{(i)}, y_w^{(i)}, y_l^{(i)}\}_{i=1}^N$
Optimize the language model $\pi_\theta$ to minimize $\mathcal{L}_{\text{DPO}}$ given $\pi_{\text{ref}}$, $\mathcal{D}$, and β (one can also reuse publicly available preference datasets instead of sampling new completions); see the sketch after this list
If $\pi_{\text{SFT}}$ is available, set $\pi_{\text{ref}} = \pi_{\text{SFT}}$; otherwise, obtain $\pi_{\text{ref}}$ by maximizing the likelihood of the preferred completions $(x, y_w)$
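A usage sketch of step 3, computing per-completion log-probabilities with Hugging Face transformers and plugging them into the DPO loss (model names, prompts, and completions are placeholders; in practice $\pi_{\text{ref}}$ is a frozen copy of the SFT model and data comes in batches):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def completion_logprob(model, tokenizer, prompt: str, completion: str) -> torch.Tensor:
    """Sum of log-probabilities of the completion tokens given the prompt.

    Assumes tokenizing `prompt` and `prompt + completion` yields a consistent
    prompt prefix (true for typical BPE tokenizers on plain text).
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    logits = model(input_ids=full_ids).logits                  # (1, T, vocab)
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)       # position t predicts token t+1
    token_logprobs = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_logprobs[:, prompt_ids.shape[1] - 1 :].sum()  # completion tokens only

# placeholder models: pi_theta is trained, pi_ref is a frozen copy
policy = AutoModelForCausalLM.from_pretrained("gpt2")
reference = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
for p in reference.parameters():
    p.requires_grad_(False)

beta = 0.1
x, yw, yl = "Prompt: summarize the post.", " Preferred summary.", " Dispreferred summary."
pi_w = completion_logprob(policy, tokenizer, x, yw)
pi_l = completion_logprob(policy, tokenizer, x, yl)
ref_w = completion_logprob(reference, tokenizer, x, yw)
ref_l = completion_logprob(reference, tokenizer, x, yl)

# L_DPO = -log sigma(beta * [(log pi_w - log ref_w) - (log pi_l - log ref_l)])
loss = -F.logsigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))
loss.backward()
```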
5. Theoretical Analysis of DPO
5.1 Your Language Model Is Secretly a Reward Model
Define two reward functions as equivalent iff they differ only by a prompt-dependent term: $r'(x, y) = r(x, y) + f(x)$. Lemma 1 (rewards in the same class induce the same preference distribution) is the well-known under-specification issue with the Plackett-Luce family, which normally requires additional identifiability constraints. Lemma 2 states that rewards in the same class also yield the same optimal policy under the KL-constrained RL problem, so for the final objective it suffices to recover an arbitrary reward function from the optimal class.
Proof for Lemma 1
Proof for Lemma 2
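A brief proof sketch for both lemmas, following the paper's argument: rewards in the same equivalence class differ only by a prompt-dependent term $f(x)$, which cancels in both places.
$$p_{r+f}(y_1 \succ y_2 \mid x) = \sigma\big((r(x, y_1) + f(x)) - (r(x, y_2) + f(x))\big) = \sigma\big(r(x, y_1) - r(x, y_2)\big) = p_r(y_1 \succ y_2 \mid x)$$
$$\pi_{r+f}(y \mid x) = \frac{\pi_{\text{ref}}(y \mid x) \exp\big(\frac{1}{\beta}(r(x, y) + f(x))\big)}{\sum_{y'} \pi_{\text{ref}}(y' \mid x) \exp\big(\frac{1}{\beta}(r(x, y') + f(x))\big)} = \pi_r(y \mid x), \quad \text{since } \exp\big(\tfrac{1}{\beta} f(x)\big) \text{ cancels}$$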
Theorem 1 also specifies exactly which reward function within each equivalence class the DPO reparameterization selects.
The selected reward function satisfies
$\sum_y \pi_{\text{ref}}(y \mid x) \exp\big(\tfrac{1}{\beta} r(x, y)\big) = 1$
i.e., $\pi(y \mid x) = \pi_{\text{ref}}(y \mid x) \exp\big(\tfrac{1}{\beta} r(x, y)\big)$ is a valid distribution.
Also, we can impose certain constraints on the under-constrained Plackett-Luce family such that we preserve the class of reward models.
5.2 Instability of Actor-Critic Algorithms
In the RL fine-tuning step of RLHF, the optimization objective can be rewritten (using the reparameterized reward) as $\max_{\pi_\theta} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[ r_\phi(x, y) - \beta \log \sum_{y'} \pi_{\text{ref}}(y' \mid x) \exp\big(\tfrac{1}{\beta} r_\phi(x, y')\big) - \beta \log \tfrac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \big]$
The first two terms form a reward from the same equivalence class as DPO's implicit reward. Without the normalization term (which can be interpreted as the soft value function of the reference policy $\pi_{\text{ref}}$), learning can be unstable because the policy gradient has high variance.
Prior work normalized rewards using a baseline computed from human completions, whereas the DPO reparameterization yields a reward that requires no baseline.
6. Experiments
well-controlled text generation: how efficiently does each method maximize reward while minimizing KL-divergence from the reference policy?
IMDb
preference pairs: labeled with a pre-trained sentiment classifier (serving as the ground-truth reward)
SFT : GPT-2-large
summarization and dialogue: evaluate DPO on larger models and more difficult RLHF tasks
Summarization : Reddit TL;DR dataset
Single-Turn dialogue : Anthropic Helpful and Harmless dataset
Evaluation: win rate against a baseline policy, with GPT-4 as a proxy for human judgment
GPT-4 (S): "Which summary is better?"
GPT-4 (C): "Which summary is more concise?"
7. Discussion
Rather than the standard RL approach, DPO identifies a mapping between LM policies and reward functions → trains the LM to satisfy human preferences directly with a simple cross-entropy loss, without RL
With little to no hyperparameter tuning, DPO performs as well as or better than RLHF
Limitation: evaluation relies on high-quality automated judgments (GPT-4 win rates are affected by the prompt)
The study applied DPO to models up to 6B parameters → scaling to larger models is an open question