Tri Nguyen

LLM Alignment: DPO vs IPO

Mar 27, 2025

Large language model (LLM) alignment is a crucial step in improving LLMs' usability and safety. This topic has received a lot of attention, resulting in very interesting developments over the last 5 years. I'm particularly interested in a line of development including 3 works: RLHF (Ouyang et al., 2022), DPO (Rafailov et al., 2023), and IPO (Azar et al., 2023).

These form a nice build-up, as each work addresses the previous one's issues and reduces the complexity of the learning process. The rest of this post gives an overview comparing the three methods.

In the first 2 works, both RLHF's and DPO's ideas are elegant and simple to grasp. This is, however, not the case for IPO. When I first read the IPO paper, the whole thing screamed "ad-hoc" and unnecessarily complicated ;). The derivation is a bit confusing, the empirical loss looks counter-intuitive, and the paper doesn't even include any LLM alignment experiments. But the fact that IPO works well empirically (from my own experience as well as from other papers) bothered me quite a lot. After spending some effort examining this method more carefully, IPO's idea turns out to be quite nice and very clever. If you didn't get it from skimming through the paper (like I did), I hope this blog post can convince you to have another look at this method.

1. LLM alignment

LLMs obtained after a pre-training step over vast unlabeled datasets possess an amazing ability to produce natural-looking text completions. However, due to the nature of unsupervised training, the resulting models exhibit many undesired behaviors (via their generations), including being unhelpful, biased, sexist, or racist, and hallucinating.

LLM alignment aims to address this issue by steering the model's behavior toward the desired characteristics using the following formulation:

\begin{equation}
\mathop{\mathrm{\text{maximize}}}_{\pi} \quad \mathop{\mathbb{E}}_{\bm{x} \sim \mathcal{D}} \left[ \mathop{\mathbb{E}}_{\bm{y} \sim \pi} \left[ s(\bm{x}, \bm{y}) - \beta D_{\sf kl}(\pi \mid\mid \pi_{\rm ref})\right] \right] \tag{1}
\end{equation}

Here, the so-called score function $s(\bm{x}, \bm{y})$ is assumed to produce a scalar value indicating how strongly the response $\bm{y}$ is aligned with the desired characteristic given the prompt $\bm{x}$. We wish to find a policy (a language model) $\pi$ that retains most of the good text-generation capability of $\pi_{\rm ref}$ while producing responses $\bm{y}$ that maximize the score function. The balance between the two objectives is controlled by $\beta$. Too small a $\beta$ might lead to an overly optimized $\pi$ that loses the generally good text generation of $\pi_{\rm ref}$, while too large a $\beta$ prevents $\pi$ from adjusting toward better alignment. The reference policy $\pi_{\rm ref}$ can be seen as a good initialization.
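To make this concrete, here is a minimal PyTorch-style sketch (tensor names are my own) of how the bracketed quantity in (1) is typically estimated from a batch of sampled responses, using the per-sample log-ratio as a one-sample estimate of the KL term:

```python
import torch

def kl_regularized_objective(scores, logp_pi, logp_ref, beta=0.1):
    """Monte-Carlo estimate of the bracket in (1) over a batch of sampled responses.

    scores:   (B,) tensor with s(x, y) for each sampled response y
    logp_pi:  (B,) sequence log-probability of y under the current policy pi
    logp_ref: (B,) sequence log-probability of y under the reference policy pi_ref
    """
    # (log pi - log pi_ref) is a single-sample estimate of D_kl(pi || pi_ref)
    per_sample = scores - beta * (logp_pi - logp_ref)
    return per_sample.mean()
```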

An obvious barrier to solving (1) is the unknown $s(\bm{x}, \bm{y})$: it is very non-trivial to hand-craft such a score function. Instead, a popular approach is to learn $s(\bm{x}, \bm{y})$ from a pairwise preference dataset. A sample of such a dataset is a tuple $(\bm{x}, \bm{y}_1, \bm{y}_2, c)$ where $\bm{x}$ is a prompt and $\bm{y}_1, \bm{y}_2$ are 2 possible responses (continuations of the prompt $\bm{x}$). These three elements are sent to a human annotator, who assigns a label $c \in \{1, 2\}$ indicating which response is preferred with respect to the alignment objective. For instance, $c=1$ implies that $\bm{y}_1$ is preferred over $\bm{y}_2$, denoted as $\bm{y}_1 \succ \bm{y}_2$.
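For illustration only (the content is made up), a single record of such a dataset might look like:

```python
# One (hypothetical) pairwise preference record.
preference_sample = {
    "prompt": "Explain why the sky is blue.",
    "response_1": "Air molecules scatter short (blue) wavelengths of sunlight more strongly than long ones.",
    "response_2": "Because it reflects the ocean.",
    "choice": 1,  # c = 1: response_1 is preferred, i.e. y1 ≻ y2
}
```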

Different methods propose to solve (1) in different ways, but they all share the same principle: using the preference data to "infer" the unknown score function $s(\bm{x}, \bm{y})$. As the only available supervision signal is the preference dataset, it is necessary to assume a certain specification relating the unknown score function $s(\bm{x}, \bm{y})$ to the collected pairwise preference data.

In the following, we will go through different realizations of this relation, leading to 3 different popular techniques: RLHF, DPO, and IPO.

2. RLHF and DPO

2.1 RLHF

The optimization problem (1) is structured very much like a typical reinforcement learning (RL) problem, except that the score function $s(\bm{x}, \bm{y})$ is unknown. Naturally, if one can learn $s(\bm{x}, \bm{y})$, then an off-the-shelf RL method can be deployed to solve (1). This is exactly the idea proposed by RLHF.

To learn $s(\bm{x}, \bm{y})$, RLHF assumes the Bradley-Terry (BT) model (Bradley & Terry, 1952) for the preference data generation, which hypothesizes that

\textsf{Pr}\left(\bm{y}_1 \succ \bm{y}_2 \mid \bm{x}\right) \triangleq \textsf{Pr}\left(c=1 \mid \bm{x}, \bm{y}_{1}, \bm{y}_{2}\right) = \dfrac{\exp(s(\bm{x}, \bm{y}_{1}))}{\exp(s(\bm{x}, \bm{y}_{1})) + \exp(s(\bm{x}, \bm{y}_{2}))} = \sigma(s(\bm{x}, \bm{y}_{1}) - s(\bm{x}, \bm{y}_{2}))

Since the score function gives a higher value to a better-aligned response, that response is more likely to be preferred over the other.
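As a quick sanity check, the BT probability is just a sigmoid of the score gap; a small sketch:

```python
import math

def bt_preference_prob(score_1: float, score_2: float) -> float:
    """Bradley-Terry probability that response 1 is preferred over response 2,
    i.e. sigma(s(x, y1) - s(x, y2))."""
    return 1.0 / (1.0 + math.exp(-(score_1 - score_2)))

# A score gap of 2 translates to roughly an 88% preference probability.
print(bt_preference_prob(1.5, -0.5))  # ≈ 0.88
```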

We can see that this assumption provides exactly the kind of specification mentioned above, relating the score function to the preference data. If we further assume the true $s(\bm{x}, \bm{y})$ belongs to a certain class of deep neural networks (such as an LLM-based network), then the true score function can be recovered using MLE:

\begin{aligned} \tag{3}
& \mathop{\mathrm{\text{minimize}}}_{\boldsymbol \theta} \quad -\mathop{\mathbb{E}}_{(\bm{x}, \bm{y}_1, \bm{y}_2, c) \sim \mathcal{D}} \left[\mathcal{L}_{\rm logistic}\Big(\sigma(s_{\boldsymbol \theta}(\bm{x}, \bm{y}_1) - s_{\boldsymbol \theta}(\bm{x}, \bm{y}_2)), c\Big)\right]
\end{aligned}

where

\mathcal{L}_{\rm logistic}(p, c) = \mathbb{I}[c=1] \log p + \mathbb{I}[c=2] \log \left( 1-p \right).

In practice, $s_{\boldsymbol \theta}(\bm{x}, \bm{y})$ can be parameterized using another large language model. After obtaining the optimal solution $\boldsymbol \theta^{\star}$ to (3), the learned score $s_{\boldsymbol \theta^{\star}}$ is plugged into (1), and an off-the-shelf RL technique, such as PPO (Schulman et al., 2017), is invoked to perform the alignment fine-tuning step as a pure RL problem.
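A minimal PyTorch-style sketch of the MLE objective (3), assuming the per-response scores have already been produced by the reward network (function and tensor names are mine):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_1, score_2, choice):
    """Negative log-likelihood of (3) under the BT model.

    score_1, score_2: (B,) tensors with s_theta(x, y1) and s_theta(x, y2)
    choice:           (B,) labels, 1 if y1 is preferred and 2 otherwise
    """
    sign = 3.0 - 2.0 * choice.float()      # maps c=1 -> +1, c=2 -> -1
    margin = sign * (score_1 - score_2)    # score gap oriented toward the winner
    # -log sigma(margin) equals -L_logistic(sigma(s1 - s2), c)
    return -F.logsigmoid(margin).mean()
```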

This approach, while being straightforward, suffers from several technical difficulties:

  - It requires a two-stage pipeline: first fitting a separate reward model, then running RL against it, with the extra memory and compute that implies.
  - The RL step itself (e.g., PPO) is complex, unstable, and sensitive to hyperparameters.
  - Training requires repeatedly sampling responses from the current policy, which is expensive for LLMs.

These challenges are the main motivations for the development of DPO.

2.2 DPO

DPO improves upon RLHF by eliminating the RL step. In particular, DPO's authors realize that under the same preference model (the BT model), the relation between the preference label and the score function depends only on the relative difference in score, not on the absolute score values. This enables a clever trick that re-parameterizes the score function via an optimal policy, effectively eliminating the need to deploy RL.

Notice that the objective in (1) can be expressed as:

\begin{aligned}
\mathop{\mathbb{E}}_{\bm{x}} \left[ \mathop{\mathbb{E}}_{\bm{y} \sim \pi(\cdot \mid \bm{x})} \left[ s(\bm{x}, \bm{y}) - \beta D_{\sf kl}(\pi \mid\mid \pi_{\rm ref})\right] \right] &= -\beta \mathop{\mathbb{E}}_{\bm{x}} \left[ D_{\sf kl}\Big(\pi \mid\mid \dfrac{1}{Z(\bm{x})} \pi_{\rm ref}(\bm{y} \mid \bm{x})\exp\left(\beta^{-1} s(\bm{x}, \bm{y})\right)\Big) \right] + \text{const},
\end{aligned}
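To see why, pull the factor $-\beta$ out of the inner bracket and complete the normalizing constant $Z(\bm{x})$; the leftover $\beta \log Z(\bm{x})$ term does not depend on $\pi$ and is absorbed into the constant after taking the expectation over $\bm{x}$:

\begin{aligned}
\mathop{\mathbb{E}}_{\bm{y} \sim \pi} \left[ s(\bm{x}, \bm{y}) \right] - \beta D_{\sf kl}(\pi \mid\mid \pi_{\rm ref})
&= -\beta \mathop{\mathbb{E}}_{\bm{y} \sim \pi} \left[ \log \dfrac{\pi(\bm{y} \mid \bm{x})}{\pi_{\rm ref}(\bm{y} \mid \bm{x})} - \beta^{-1} s(\bm{x}, \bm{y}) \right] \\
&= -\beta \mathop{\mathbb{E}}_{\bm{y} \sim \pi} \left[ \log \dfrac{\pi(\bm{y} \mid \bm{x})}{\frac{1}{Z(\bm{x})}\pi_{\rm ref}(\bm{y} \mid \bm{x}) \exp\left(\beta^{-1} s(\bm{x}, \bm{y})\right)} \right] + \beta \log Z(\bm{x}).
\end{aligned}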

where $Z(\bm{x})$ is an intractable normalizing factor. Since the KL divergence is non-negative and attains its minimum value of $0$ only when its two arguments coincide, maximizing this expression yields the optimal solution $\pi^{\star}$:

\pi^{\star}(\bm{y} \mid \bm{x}) = \dfrac{1}{Z(\bm{x})} \pi_{\rm ref}(\bm{y} \mid \bm{x}) \exp \left( \beta^{-1} s(\bm{x}, \bm{y}) \right),

which equivalently implies

s(\bm{x}, \bm{y}) = \beta \log \dfrac{\pi^{\star}(\bm{y} \mid \bm{x})}{\pi_{\rm ref}(\bm{y} \mid \bm{x})} + \beta \log Z(\bm{x}).

This identity establishes a relation between an arbitrary score function $s(\bm{x}, \bm{y})$ and a corresponding optimal policy $\pi^{\star}(\bm{y} \mid \bm{x})$ with respect to that score function. It is not very useful by itself due to the intractable factor $Z(\bm{x})$. However, the relative score difference, which is all that matters, is independent of $Z(\bm{x})$:

\tag{4} s(\bm{x}, \bm{y}_1) - s(\bm{x}, \bm{y}_2) = \beta \log \dfrac{\pi^{\star}(\bm{y}_1 \mid \bm{x})}{\pi_{\rm ref}(\bm{y}_1 \mid \bm{x})} - \beta \log \dfrac{\pi^{\star}(\bm{y}_2 \mid \bm{x})}{\pi_{\rm ref}(\bm{y}_2 \mid \bm{x})}.

To be more concise, define

h_{\boldsymbol \theta}(\bm{x}, \bm{y}_1, \bm{y}_2) = \beta \log \dfrac{\pi_{\boldsymbol \theta}(\bm{y}_1 \mid \bm{x})}{\pi_{\rm ref}(\bm{y}_1 \mid \bm{x})} - \beta \log \dfrac{\pi_{\boldsymbol \theta}(\bm{y}_2 \mid \bm{x})}{\pi_{\rm ref}(\bm{y}_2 \mid \bm{x})},

then equation (4) can be shortened as

\tag{5} s(\bm{x}, \bm{y}_1) - s(\bm{x}, \bm{y}_2) = h_{\boldsymbol \theta}(\bm{x}, \bm{y}_1, \bm{y}_2).

This condition plays a key role in ensuring that the policy $\pi_{\boldsymbol \theta}$ is the optimal solution to the original alignment formulation (1). In particular, it is shown in DPO that any policy $\pi_{\boldsymbol \theta}$ satisfying condition (5) is an optimal solution to the alignment formulation in (1), up to some trivial ambiguity.
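In code, with sequence-level log-probabilities (summed over the response tokens) already computed under $\pi_{\boldsymbol \theta}$ and the frozen $\pi_{\rm ref}$, $h_{\boldsymbol \theta}$ is just a scaled difference of log-ratios. A sketch with assumed tensor names:

```python
import torch

def h_theta(logp_pi_y1, logp_ref_y1, logp_pi_y2, logp_ref_y2, beta=0.1):
    """Implicit score difference h_theta(x, y1, y2) from (5).

    Each argument is a (B,) tensor of sequence log-probabilities, i.e. the sum
    of token log-probs of the response given the prompt.
    """
    logratio_y1 = logp_pi_y1 - logp_ref_y1   # log pi_theta(y1|x) / pi_ref(y1|x)
    logratio_y2 = logp_pi_y2 - logp_ref_y2   # log pi_theta(y2|x) / pi_ref(y2|x)
    return beta * (logratio_y1 - logratio_y2)
```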

With this insight, DPO proposes to use $h_{\boldsymbol \theta}(\bm{x}, \bm{y}_1, \bm{y}_2)$ to parameterize the relative score difference between the 2 responses $\bm{y}_1, \bm{y}_2$. As before, the BT model is employed and MLE is used to derive the loss function:

\begin{aligned} \tag{6}
& \mathop{\mathrm{\text{minimize}}}_{\boldsymbol \theta} \quad -\mathop{\mathbb{E}}_{(\bm{x}, \bm{y}_1, \bm{y}_2, c) \sim \mathcal{D}} \left[\mathcal{L}_{\rm logistic}\Big(\sigma(h_{\boldsymbol \theta}(\bm{x}, \bm{y}_1, \bm{y}_2)), c\Big)\right].
\end{aligned}

With this parameterization, an optimal solution $\boldsymbol \theta^{\star}$ to (6) gives us an optimal policy $\pi_{\boldsymbol \theta^{\star}}$ to (1) simultaneously.
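Putting the pieces together, a minimal sketch of the DPO objective (6) (hypothetical names again; practical implementations usually order each pair as chosen/rejected rather than carrying a label $c$):

```python
import torch
import torch.nn.functional as F

def dpo_loss(h, choice):
    """Negative log-likelihood of (6): -L_logistic(sigma(h_theta), c).

    h:      (B,) values of h_theta(x, y1, y2), e.g. from h_theta(...) above
    choice: (B,) preference labels, 1 if y1 is preferred and 2 otherwise
    """
    sign = 3.0 - 2.0 * choice.float()   # maps c=1 -> +1, c=2 -> -1
    return -F.logsigmoid(sign * h).mean()
```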

2.3 The Common Theme

In terms of modeling, both RLHF and DPO rely on the same specification: the BT model, which ties the probability of the observed preference label to the difference of the two scores, $\textsf{Pr}(\bm{y}_1 \succ \bm{y}_2 \mid \bm{x}) = \sigma(s(\bm{x}, \bm{y}_1) - s(\bm{x}, \bm{y}_2))$.

There are 2 factors in this specification: (i) the BT model itself, which relates the unknown score function to the observable preference labels through the score difference; and (ii) a parameterization assumption, namely that the true score function, either directly as a reward network (RLHF) or implicitly through the policy log-ratio (DPO), lies in the chosen model class.

The combination of the two enables learning $s(\bm{x}, \bm{y})$ using preference data.

Drawback. The BT model ties the preference probability to the score difference through a sigmoid. When the observed preferences are (nearly) deterministic, which is common when each pair receives only a few annotations, the MLE in (3) and (6) pushes the fitted score difference toward infinity. For DPO, this means the implicit log-ratio difference $h_{\boldsymbol \theta}(\bm{x}, \bm{y}_1, \bm{y}_2)$ keeps growing regardless of $\beta$, so the KL regularization toward $\pi_{\rm ref}$ loses its effect and the policy can over-fit the preference data. Moreover, the BT assumption itself may simply not hold for the underlying preference process.

And these drawbacks lead to the development of IPO.

3. IPO

Will be posted soon in a new post :))))))

Reference

  1. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., & others. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
  2. Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
  3. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., & others. (2022). Constitutional AI: Harmlessness from AI feedback. ArXiv Preprint ArXiv:2212.08073.
  4. Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. F. (2020). Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33, 3008–3021.
  5. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 53728–53741.
  6. Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., & Munos, R. (2023). A general theoretical paradigm to understand learning from human preferences.
  7. Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345.
  8. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. ArXiv Preprint ArXiv:1707.06347.
