LLM Alignment: DPO vs IPO
Mar 27, 2025
Large language model (LLM) alignment is a crucial step in increasing LLMs' usability and safety. This topic has received a lot of attention, resulting in very interesting developments over the last 5 years. I'm particularly interested in a line of development including 3 works:
Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022; Christiano et al., 2017; Bai et al., 2022; Stiennon et al., 2020) from OpenAI: learn a score model, then use RL to perform alignment.
Direct preference optimization (DPO) (Rafailov et al., 2023) from Stanford: learn a score model; alignment is obtained "automatically".
IPO (or ΨPO) (Azar et al., 2023) from Google DeepMind: no need to learn a score model, at all :))
These form a very nice build-up, as each method addresses the previous work's issues and reduces the complexity of the learning process. (Figure: an overview comparing the three methods.)
In the first 2 works, both RLHF's and DPO's ideas are elegant and simple to grasp. This is, however, not the case for IPO. When I first read the IPO paper, the whole thing screamed "ad-hoc" and unnecessarily complicated ;). Their derivation is a bit confusing, their empirical loss looks counter-intuitive, and they don't even have any LLM alignment experiments. But the fact that the IPO method works well empirically (from my own experience as well as from other papers) bothered me quite a lot. After spending some effort examining this method more carefully, IPO's idea turns out to be quite nice and very clever. If you don't get it from skimming through the paper (like I did), I hope this blog post can convince you to have another look at this method.
1. LLM alignment
LLMs obtained after a pre-training step over vast unlabeled datasets possess an amazing ability to produce natural-looking text completions. However, due to the nature of unsupervised training, the resulting models exhibit many undesired behaviors in their generations, including being unhelpful, biased, sexist, or racist, and hallucinating.
LLM alignment aims to address this issue by steering the model's behavior toward the desired characteristics using the following formulation:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[ r^*(x, y) \big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big]. \tag{1}$$

Here, the so-called score function $r^*$ is assumed to produce a scalar value indicating how strongly the response $y$ is aligned with the desired characteristic given the prompt $x$. We wish to find a policy $\pi_\theta$ (a language model) that retains most of the good text-generation capability of $\pi_{\mathrm{ref}}$ while producing responses that maximize the score function. The balance between the 2 objectives is controlled by $\beta$. A too-small $\beta$ might lead to an overly optimized $\pi_\theta$ that loses the generally good text generation of $\pi_{\mathrm{ref}}$, while a too-large $\beta$ prevents $\pi_\theta$ from adjusting toward better alignment. The reference policy $\pi_{\mathrm{ref}}$ can be seen as a good initialization.
An obvious barrier in solving (1) is the unknown $r^*$. It is very non-trivial to hand-craft the score function. Instead, a popular approach is to learn $r^*$ using pairwise preference datasets. A sample of such a dataset consists of a tuple $(x, y_1, y_2)$, where $x$ is a prompt and $y_1, y_2$ are 2 possible responses (continuations of the prompt $x$). These three elements are sent to a human annotator, who assigns a label $c$ to indicate which response is more preferred with respect to the alignment objective. For instance, $c = 1$ implies that $y_1$ is preferred over $y_2$, denoted as $y_1 \succ y_2$.
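To make the data format concrete, here is a minimal sketch of what a single record of such a pairwise preference dataset could look like; the field names and values are purely illustrative, not a fixed standard.

```python
# One hypothetical record of a pairwise preference dataset: a prompt x, two candidate
# responses y_1 and y_2, and the annotator's label c. Field names are illustrative only.
preference_sample = {
    "prompt": "Explain in one sentence why the sky is blue.",
    "response_1": "Air molecules scatter short (blue) wavelengths of sunlight more strongly.",
    "response_2": "Because blue is the color of the ocean reflecting upward.",
    "label": 1,  # c = 1: the annotator prefers response_1 over response_2 (y_1 > y_2)
}

# Many training pipelines then re-order each pair into (chosen, rejected) form.
chosen = preference_sample[f"response_{preference_sample['label']}"]
rejected = preference_sample["response_2" if preference_sample["label"] == 1 else "response_1"]
```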
Different methods propose to solve (1) in different ways, but they all share the same principle: using the preference data to "infer" the unknown function $r^*$. As the only available supervision signal is the preference dataset, it is necessary to assume a certain specification relating the unknown score function to the collected pairwise preference data:

$$\text{pairwise preference data} \;\;\longleftrightarrow\;\; r^*. \tag{2}$$
In the following, we will go through different realizations of this relation, leading to 3 popular techniques: RLHF, DPO, and IPO.
2. RLHF and DPO
2.1 RLHF
The structure of the optimization problem (1) is very much a typical reinforcement learning (RL) problem, except that the score function $r^*$ is unknown. Naturally, if one can learn $r^*$, then an off-the-shelf RL method can be deployed to solve (1). This is the very idea proposed by RLHF.
To learn $r^*$, RLHF assumes the Bradley-Terry (BT) model (Bradley & Terry, 1952) for the preference data generation, which hypothesizes that

$$p^*(y_1 \succ y_2 \mid x) \;=\; \sigma\big( r^*(x, y_1) - r^*(x, y_2) \big) \;=\; \frac{\exp\big(r^*(x, y_1)\big)}{\exp\big(r^*(x, y_1)\big) + \exp\big(r^*(x, y_2)\big)},$$

where $\sigma(\cdot)$ is the sigmoid function. Since the score function gives a higher value to a better-aligned response, that response is more likely to be preferred over the other.
We can see that this assumption specifies the relation mentioned in (2). If we further assume that the true $r^*$ belongs to a certain class $\mathcal{H}$ of deep neural networks (such as an LLM-based network), then the true score function can be recovered using MLE:

$$\hat{r} \;=\; \arg\min_{r_\phi \in \mathcal{H}} \; \mathcal{L}_R(r_\phi), \tag{3}$$

where

$$\mathcal{L}_R(r_\phi) \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big],$$

with $y_w$ and $y_l$ denoting the preferred and less preferred responses, respectively.
In practice, $r_\phi$ can be parameterized using another large language model. After obtaining the optimal solution $\hat{r}$ to (3), it is plugged into (1) in place of $r^*$, and an off-the-shelf RL technique, such as PPO (Schulman et al., 2017), is invoked to perform the alignment fine-tuning step as a pure RL problem.
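To make the reward-modeling step concrete, here is a minimal PyTorch sketch of the BT negative log-likelihood used in (3). This is my own illustration, assuming the reward model's scalar scores for the preferred and less preferred responses have already been computed; the random tensors below only stand in for those scores.

```python
# A minimal sketch (not the original RLHF implementation) of the Bradley-Terry
# reward-model loss: -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over a batch.
import torch
import torch.nn.functional as F

def bt_reward_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the BT model for a batch of preference pairs."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage: random scores standing in for reward-model outputs on 8 preference pairs.
score_chosen = torch.randn(8, requires_grad=True)    # r_phi(x, y_w)
score_rejected = torch.randn(8, requires_grad=True)  # r_phi(x, y_l)
loss = bt_reward_loss(score_chosen, score_rejected)
loss.backward()
```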
This approach, while straightforward, suffers from several technical difficulties:
A 2-stage training pipeline is complex and prone to error accumulation.
The use of RL requires intensive and careful hyperparameter tuning.
These challenges are the main motivations for the development of DPO.
2.2 DPO
DPO improves upon RLHF by eliminating the RL step. In particular, DPO's authors realize that, under the same preference model (the BT model), the relation between the preference label and the score function only depends on the relative difference in scores, not the absolute score values. This enables them to use a clever trick to re-parameterize the score function via an optimal policy, effectively eliminating the need to deploy RL.
Notice that maximizing the objective in (1) is equivalent, up to terms independent of $\pi_\theta$, to the following minimization:

$$\min_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\left[ \mathbb{D}_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \;\Big\|\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(\cdot \mid x)\, \exp\!\big( r(x, \cdot)/\beta \big) \right) \right],$$

where $Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x) \exp\big( r(x, y)/\beta \big)$ is an intractable normalizing factor. As the KL-divergence reaches its minimum value of zero when its two arguments are equal, this expression suggests the optimal solution

$$\pi^*(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y)/\beta \big),$$
which equivalently implies

$$r(x, y) \;=\; \beta \log \frac{\pi^*(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \;+\; \beta \log Z(x).$$
This identity establishes a relation between an arbitrary score function $r$ and a corresponding optimal policy $\pi^*$ with respect to that score function. It is not very useful by itself due to the intractable factor $Z(x)$. However, the relative score difference, which is all that matters, is independent of $Z(x)$:

$$r(x, y_1) - r(x, y_2) \;=\; \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)} \;-\; \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}. \tag{4}$$
To be more concise, define

$$h_{\pi}(y_1, y_2; x) \;:=\; \beta \log \frac{\pi(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)} \;-\; \beta \log \frac{\pi(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)};$$

then equation (4) can be shortened as

$$r(x, y_1) - r(x, y_2) \;=\; h_{\pi^*}(y_1, y_2; x). \tag{5}$$
This condition plays a key role in ensuring that a policy $\pi$ is an optimal solution to the original alignment formulation (1). In particular, it is shown in DPO that any policy satisfying condition (5) is an optimal solution to the alignment formulation in (1), up to some trivial ambiguity.
With this insight, DPO proposed to use $h_{\pi_\theta}(y_1, y_2; x)$ to parameterize the relative score difference between 2 responses $y_1, y_2$. As before, they employ the BT model and use MLE to derive the loss function:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\Big[ \log \sigma\big( h_{\pi_\theta}(y_w, y_l; x) \big) \Big] \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]. \tag{6}$$
With this parameterization, an optimal solution to (6) simultaneously gives us an optimal policy for (1).
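For concreteness, here is a minimal PyTorch sketch of the loss in (6). It is my own illustration, not the reference DPO implementation, and it assumes the summed log-probabilities of each response under the trained policy and the frozen reference model are already available; the random tensors below only stand in for those quantities.

```python
# A minimal sketch of the DPO loss in (6):
# -log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a batch of preference pairs, given summed response log-probs."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen      # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = policy_logp_rejected - ref_logp_rejected  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage: random log-probabilities standing in for model outputs on 4 pairs.
b = 4
loss = dpo_loss(torch.randn(b, requires_grad=True), torch.randn(b, requires_grad=True),
                torch.randn(b), torch.randn(b))
loss.backward()
```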
2.3 The Common Theme
In terms of modeling, both RLHF and DPO rely on the same specification:

$$p^*(y_1 \succ y_2 \mid x) \;=\; \sigma\big( r^*(x, y_1) - r^*(x, y_2) \big), \qquad r^* \in \mathcal{H}.$$

There are 2 factors in this specification:
The BT model is used to relate the score function and the preference data.
The score function is assumed to belong to a certain known hypothesis class $\mathcal{H}$: either an arbitrary neural network, as in RLHF, or a structured network of the form $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$, as in DPO.
The combination of the two enables learning $r^*$ using preference data.
Drawbacks.
The preference data modeling relies on the BT model. The BT model is intuitive and has been successfully deployed in many domains, such as economics; however, it is still restrictive, and who knows whether the real data-generation process truly follows it.
The particular functional form of the BT model, i.e., the sigmoid function, makes it computationally difficult to model deterministic cases. Specifically, to have $p(y_1 \succ y_2 \mid x) = 1$, the quantity $r(x, y_1) - r(x, y_2)$ in RLHF, or $\beta \log \frac{\pi_\theta(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)} - \beta \log \frac{\pi_\theta(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)}$ in DPO, needs to reach $+\infty$. This behavior causes a particularly detrimental effect: it worsens the reward-hacking issue in the second phase of RLHF, or leads to overfitting in DPO, where the learned policy drifts arbitrarily far away from $\pi_{\mathrm{ref}}$ regardless of $\beta$, as the small numeric sketch below illustrates.
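Here is a small numeric check (my own illustration, not taken from either paper) of how large the score or log-ratio margin must grow before the BT probability gets close to 1; reaching exactly 1 requires an infinite margin, which is what drives the drift away from $\pi_{\mathrm{ref}}$.

```python
# How the BT preference probability sigma(margin) behaves as the margin grows:
# it saturates toward 1 but never reaches it for any finite margin.
import math

def bt_prob(margin: float) -> float:
    """BT preference probability sigma(r(x, y_1) - r(x, y_2)) for a given margin."""
    return 1.0 / (1.0 + math.exp(-margin))

for margin in [1.0, 5.0, 10.0, 20.0]:
    print(f"margin={margin:5.1f} -> p(y1 > y2) = {bt_prob(margin):.10f}")
# margin=20 already gives p ~ 0.999999998, yet p = 1 is never attained exactly,
# so the optimizer keeps pushing the margin (and hence the policy) ever further.
```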
And these drawbacks lead to the development of IPO.
3. IPO
Will be posted soon in a new post :))))))
References
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., & others. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
- Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., & others. (2022). Constitutional AI: Harmlessness from AI feedback. ArXiv Preprint ArXiv:2212.08073.
- Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. F. (2020). Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33, 3008–3021.
- Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 53728–53741.
- Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., & Munos, R. (2023). A general theoretical paradigm to understand learning from human preferences.
- Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. ArXiv Preprint ArXiv:1707.06347.