Table of Contents:

Overview

Reinforcement learning from human feedback (RLHF)

Align models (and the language modeling objective) with users' values so that outputs are truthful, non-toxic, and helpful to the user.

Predicting the next token on a webpage from the internet != the objective “follow the user’s instructions helpfully and safely”.

The pipeline often starts with supervised fine-tuning (SFT), e.g. as done in InstructGPT, followed by collecting a dataset of rankings of model outputs and then RLHF. The goal is to decrease the amount of feedback required by several orders of magnitude (Christiano, 2017) so that deep RL systems can practically be trained w/ human feedback. We can then solve tasks where we can only recognize the desired behavior, but not necessarily demonstrate it (as would be required for inverse RL). Some of the first RLHF papers focused on Atari games or MuJoCo robotic tasks, rather than text generation (Christiano, 2017).

The reward model (RM) predicts which model output our labelers would prefer. This RM is then used as a reward function, and the supervised fine-tuned baseline is finetuned to maximize this reward using the PPO algorithm.

No Free Lunch

InstructGPT found that there is an alignment tax: the alignment procedure comes at the cost of lower performance on certain tasks that we may care about. This may be alleviated by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx).

Reward Model

InstructGPT starts from an SFT model and removes the final unembedding layer. This model accepts a prompt and response and outputs a scalar reward.

Given K=4 to K=9 responses to rank, get \(\binom{K}{2}\) comparisons for each prompt shown to a labeler. Train on all \(\binom{K}{2}\) comparisons from each prompt as a single batch element.

Loss function:

\[\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \mathbb{E}_{(x,y_w,y_l)\sim \mathcal{D}}\Big[ \log\big(\sigma(r_\theta(x,y_w) - r_\theta(x,y_l))\big) \Big]\]
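As an illustration, here is a minimal PyTorch sketch of this pairwise loss for a single prompt with K ranked responses; the function name, tensor shapes, and toy values are assumptions for illustration, not the InstructGPT implementation.

import itertools
import torch
import torch.nn.functional as F

def pairwise_rm_loss(rewards: torch.Tensor, ranking: list[int]) -> torch.Tensor:
    """Pairwise reward-model loss for one prompt.

    rewards: shape (K,), scalar reward r_theta(x, y_k) for each of the K responses.
    ranking: indices of the K responses ordered from best to worst by the labeler.
    Returns the mean of -log sigma(r_w - r_l) over all K-choose-2 comparisons.
    """
    losses = []
    for i, j in itertools.combinations(range(len(ranking)), 2):
        r_w = rewards[ranking[i]]  # higher-ranked (preferred) response
        r_l = rewards[ranking[j]]  # lower-ranked response
        losses.append(-F.logsigmoid(r_w - r_l))
    return torch.stack(losses).mean()

# Toy usage: K = 4 responses, labeler ranks response 2 best, then 0, 3, 1.
rewards = torch.tensor([0.8, -0.3, 1.5, 0.1], requires_grad=True)
loss = pairwise_rm_loss(rewards, ranking=[2, 0, 3, 1])
loss.backward()
print(loss.item())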

Bradley-Terry Model

Given a “winner” (\(y_w\)) and a “loser” (\(y_l\)), we choose to model the preference probability as:

\[p(y_w > y_l) = \frac{ e^{r^*(x,y_w) } }{ e^{r^*(x,y_w) } + e^{r^*(x,y_l) } }\]

We use exponentials to ensure that we can obtain non-negative probabilities.
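A quick numerical check of this formula (reward values chosen arbitrarily for illustration):

import math

def bt_prob(r_w: float, r_l: float) -> float:
    # Bradley-Terry preference probability p(y_w > y_l) given scalar rewards.
    return math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))

# With r_w = 1.2 and r_l = 0.2 this is sigma(1.0) ~= 0.731; swapping the rewards gives ~0.269.
print(bt_prob(1.2, 0.2), bt_prob(0.2, 1.2))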

See (Bradley and Terry, 1952) for estimating score functions from pairwise preferences.

Train reward models to predict log-odds (Stiennon, 2020).

PPO

Schulman et al., 2017. Used in InstructGPT.

Adds a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model.
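A minimal sketch of such a KL-shaped per-token reward (names, shapes, and the coefficient are illustrative assumptions, not the InstructGPT code):

import torch

def shaped_rewards(rm_score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   sft_logprobs: torch.Tensor,
                   kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token rewards: -kl_coef * (log pi - log pi_SFT) at every token,
    with the scalar reward-model score added on the final token.

    policy_logprobs, sft_logprobs: shape (T,), log-probs of the sampled tokens.
    rm_score: scalar tensor, reward-model score for the full response.
    """
    per_token = -kl_coef * (policy_logprobs - sft_logprobs)
    per_token[-1] = per_token[-1] + rm_score
    return per_token

# Toy usage with a 4-token response.
rewards = shaped_rewards(torch.tensor(0.7),
                         policy_logprobs=torch.tensor([-1.0, -0.5, -2.0, -0.8]),
                         sft_logprobs=torch.tensor([-1.1, -0.6, -1.5, -0.9]))
print(rewards)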

AAC

Direct Preference Optimization (DPO)

DPO (Rafailov, 2023) was a runner-up for Best Paper at NeurIPS 2023. There are now many variants of it, such as Identity Preference Optimization (IPO) (Azar, 2023), MTPO, and Online DPO. Let’s start with the definition of a sigmoid:

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Let \(A = r^*(x,y_w)\) and \(B = r^*(x,y_l)\). Then:

\[\begin{aligned} \frac{ e^A }{ e^A + e^B } &=\frac{ e^A (e^{-A}) }{ (e^A + e^B) (e^{-A}) } =\frac{1}{1 + e^Be^{-A}} =\frac{1}{1 + e^{B-A}} =\frac{1}{1 + e^{-(A-B)}} =\sigma(A-B) \end{aligned}\]
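A two-line numerical sanity check of this identity, with arbitrary values for \(A\) and \(B\):

import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

A, B = 1.7, 0.4  # arbitrary reward values
lhs = math.exp(A) / (math.exp(A) + math.exp(B))
rhs = sigmoid(A - B)
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)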

Note that we will model the log-probability, instead of probability, because…

Treating this as a binary classification problem, we end up with Equation 2 of the DPO paper as follows:

\[L = -\mathbb{E}_{x,y_w,y_l \sim \mathcal{D}}\Big[ \log \sigma \Big(r_{\phi}(x,y_w) - r_{\phi}(x,y_l)\Big)\Big]\]

The key step in DPO is that the KL-constrained reward maximization objective used in RLHF has a closed-form optimal policy (see the Derivation section below):

\begin{equation} \pi_r(y\mid x) = \frac{1}{Z(x)}\pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \end{equation}

where \(Z(x) =\sum_{y}\pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right)\) is the partition function.

However, we can rearrange the equation above to express the reward function in terms of its corresponding optimal policy \(\pi_r\), the reference policy \(\pi_\text{ref}\), and the unknown partition function \(Z(\cdot)\):

\[\begin{aligned} \pi_r(y\mid x) &= \frac{1}{Z(x)}\pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \\ Z(x) \pi_r(y\mid x) &= Z(x) \frac{1}{Z(x)}\pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \\ Z(x) \pi_r(y\mid x) &= \pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \\ \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } &= \frac{ \pi_\text{ref}(y\mid x) }{ \pi_\text{ref}(y\mid x) }\exp\left(\frac{1}{\beta}r(x, y)\right), \\ \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } &= \exp\left(\frac{1}{\beta}r(x, y)\right), \\ \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= \log \exp\left(\frac{1}{\beta}r(x, y)\right), \\ \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= \frac{1}{\beta}r(x, y) \\ \beta \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= \beta \frac{1}{\beta}r(x, y) \\ \beta \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= r(x, y) \\ \beta \log \Bigg( Z(x) \Bigg) + \beta \log \Bigg( \frac{ \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= r(x, y) \\ \end{aligned}\]

We obtain:

\begin{equation} r(x,y) =\beta \log \frac{\pi_r(y\mid x)}{\pi_\text{ref}(y\mid x)} + \beta \log Z(x). \end{equation}

Now that we have the probability of human preference data in terms of the optimal policy rather than the reward model, we can formulate a maximum likelihood objective for a parametrized policy \(\pi_\theta\). Analogous to the reward modeling approach (i.e., the Bradley-Terry loss above), our policy objective becomes:

\[\begin{aligned} \mathcal{L}_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma \left(\beta \log \frac{\pi_{\theta}(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)} - \beta \log \frac{\pi_{\theta}(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)}\right)\right]. \\ \mathcal{L}_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma \Bigg( \beta \left(\log \frac{\pi_{\theta}(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)} - \log \frac{\pi_{\theta}(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)}\right) \Bigg) \right]. \\ \end{aligned}\]

Since the log of a quotient is a difference of logs, we can expand each ratio as follows:

\[\begin{aligned} \mathcal{L}_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \bigg( \log \left( \pi_{\theta}(y_w\mid x)\cdot\frac{1}{\pi_\text{ref}(y_w\mid x)} \right) - \log \left( \pi_{\theta}(y_l\mid x)\cdot\frac{1}{\pi_\text{ref}(y_l\mid x)} \right) \bigg) \Bigg) \Bigg] \\ &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \bigg( \log \pi_{\theta}(y_w\mid x) - \log \pi_\text{ref}(y_w\mid x) - \log \pi_{\theta}(y_l\mid x) + \log \pi_\text{ref}(y_l\mid x) \bigg) \Bigg) \Bigg] \\ &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \Big( \big( \log \pi_{\theta}(y_w\mid x) - \log \pi_{\theta}(y_l\mid x) \big) - \big( \log \pi_\text{ref}(y_w\mid x) - \log \pi_\text{ref}(y_l\mid x) \big) \Big) \Bigg) \Bigg]. \end{aligned}\]

This can be implemented simply as (from the DPO codebase):

import torch.nn.functional as F

# policy_*_logps / reference_*_logps are the summed log-probabilities of the chosen
# and rejected responses under the policy being trained and the frozen reference model.
pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = reference_chosen_logps - reference_rejected_logps

if reference_free:
    ref_logratios = 0

logits = pi_logratios - ref_logratios  # also known as h_{\pi_\theta}^{y_w,y_l}

if ipo:
    losses = (logits - 1/(2 * beta)) ** 2  # Eq. 17 of https://arxiv.org/pdf/2310.12036v2.pdf
else:
    # Eq. 3 https://ericmitchell.ai/cdpo.pdf; label_smoothing=0 gives original DPO (Eq. 7 of https://arxiv.org/pdf/2305.18290.pdf)
    losses = -F.logsigmoid(beta * logits)
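The *_logps above are per-sequence (summed) log-probabilities. Below is a minimal, hypothetical sketch of how they could be computed from model logits; the real DPO helper differs in details such as label shifting and masking conventions.

import torch

def sequence_logps(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of `labels` under `logits`.

    logits: (batch, seq_len, vocab) model outputs already aligned to predict `labels`.
    labels: (batch, seq_len) target token ids.
    mask:   (batch, seq_len) 1 for response tokens, 0 for prompt/padding tokens.
    """
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, 2, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1)

# Toy usage: batch of 2 sequences, length 3, vocab of 5.
logits = torch.randn(2, 3, 5)
labels = torch.randint(0, 5, (2, 3))
mask = torch.ones(2, 3)
print(sequence_logps(logits, labels, mask))  # shape (2,)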

Derivation

Deriving the Optimum of the KL-Constrained Reward Maximization Objective

We will now derive the closed-form optimal policy used above. As in the standard RLHF objective, we optimize the following objective:

\[\max_{\pi} \mathbb{E}_{x\sim \mathcal{D}, y\sim \pi}\bigl[r(x, y)\bigr] - \beta\mathbb{D}_{\textrm{KL}}\bigl[\pi(y|x)||\pi_\text{ref}(y|x)\bigr]\]

under any reward function \(r(x,y)\), reference model \(\pi_\text{ref}\) and a general non-parametric policy class. We now have:

\[\begin{aligned} \max_{\pi} \mathbb{E}_{x\sim \mathcal{D}, y\sim \pi}&\bigl[r(x, y)\bigr] - \beta\mathbb{D}_{\textrm{KL}}\bigl[\pi(y|x)\mid\mid\pi_\text{ref}(y|x)\bigr] \\ &=\max_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi(y|x)}\left[r(x, y) - \beta\log\frac{\pi(y|x)}{\pi_\text{ref}(y|x)}\right] \\ &=\min_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi_\text{ref}(y|x)} - \frac{1}{\beta}r(x, y)\right] & \text{Multiply by -1} \\ &=\min_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right)} - \log Z(x)\right] \end{aligned}\]

where the partition function is:

\begin{equation} Z(x) = \sum_{y}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right). \end{equation}

Note that the partition function is a function of only \(x\) and the reference policy \(\pi_\text{ref}\), but does not depend on the policy \(\pi\). We can now define

\[\pi^*(y|x) = \frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right),\]

which is a valid probability distribution as \(\pi^*(y\mid x) \geq 0\) for all \(y\) and \(\sum_{y}\pi^*(y\mid x)=1\). Since \(Z(x)\) is not a function of \(y\), we can then re-organize the final objective above as:

\[\begin{aligned} \min_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\left[\mathbb{E}_{y\sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi^*(y|x)}\right] - \log Z(x)\right]=\\ \min_{\pi}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{D}_{\text{KL}}(\pi(y|x)\mid\mid\pi^*(y|x)) - \log Z(x)\right] \end{aligned}\]

Now, since \(Z(x)\) does not depend on \(\pi\), the minimum is achieved by the policy that minimizes the first KL term. Gibbs’ inequality tells us that the KL-divergence is minimized at 0 if and only if the two distributions are identical. Hence we have the optimal solution:

\begin{equation} \pi(y|x)= \pi^*(y|x) = \frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right) \end{equation}

for all \(x\in\mathcal{D}\). This completes the derivation.

Deriving the DPO Objective Under the Bradley-Terry Model

It is straightforward to derive the DPO objective under the Bradley-Terry preference model, as we have

\[p^*(y_1\succ y_2|x)=\frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}\]

Above, we showed that we can express the (unavailable) ground-truth reward through its corresponding optimal policy:

\[r^*(x,y) =\beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + \beta \log Z(x)\]

Substituting this expression for the reward into the Bradley-Terry model above, we obtain:

\[\begin{aligned} p^*(y_1\succ y_2|x)&=\frac{\exp\left(\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)} + \beta \log Z(x)\right)}{\exp\left(\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)} + \beta \log Z(x)\right) + \exp\left(\beta \log \frac{\pi^*(y_2|x)}{\pi_\text{ref}(y_2|x)} + \beta \log Z(x)\right)}\\ &= \frac{1}{1+\exp\left(\beta \log \frac{\pi^*(y_2|x)}{\pi_\text{ref}(y_2|x)}-\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)}\right)} \\&= \sigma\left(\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)} - \beta \log \frac{\pi^*(y_2|x)}{\pi_\text{ref}(y_2|x)}\right). \end{aligned}\]

The last line is the per-instance preference probability that appears inside the DPO loss above.

ELO

RLAIF

RL from AI Feedback (Bai et al., 2022b).

Best-of-N Sampling Strategies

Sampling strategies that select candidate outputs based on their reward value are popular in language model alignment efforts (Gao et al., 2022).

Best-of-N, or rejection sampling (Touvron et al., 2023), is typically implemented by taking the top-scored generation within a pool of N candidates, or by sampling generations with probability proportional to their reward value (Liu et al., 2023).
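A minimal sketch of Best-of-N selection, assuming hypothetical generate and reward callables:

from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 16) -> str:
    # Sample n candidate responses and return the highest-reward one.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, y) for y in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]

Sampling proportionally to reward would instead draw from, e.g., a softmax over the scores rather than taking the argmax.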

Reinforced Self-Training (ReST) (Gulcehre et al., 2023) includes two loops: in the inner loop (Improve), we improve the policy on a fixed dataset, and in the outer loop (Grow), we grow the dataset by sampling from the latest policy; a sketch of both loops follows the list below.

  • Grow: Sampling many output sequences from the current policy. The new dataset of sequences is then scored with a reward function.
  • Improve: The datapoints with a reward above a threshold score are used to update the policy. Finetune the current best policy, typically with either a supervised learning loss or an offline RL loss (V-MPO, offline actor-critic) on the filtered data.
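Here is a compact, hypothetical sketch of the two loops with a supervised Improve step; the callables, threshold schedule, and counts are illustrative assumptions, not the paper's exact procedure.

from typing import Callable, List, Tuple

def rest(policy_sample: Callable[[str], str],
         finetune: Callable[[List[Tuple[str, str]]], None],
         reward: Callable[[str, str], float],
         prompts: List[str],
         n_grow: int = 3,
         n_improve: int = 2,
         samples_per_prompt: int = 4,
         threshold: float = 0.5) -> None:
    """Hypothetical sketch of ReST's Grow/Improve loops with a supervised Improve step."""
    for _ in range(n_grow):  # Grow: sample from the current policy and score with the reward
        dataset = []
        for x in prompts:
            for _ in range(samples_per_prompt):
                y = policy_sample(x)
                dataset.append((x, y, reward(x, y)))
        for step in range(n_improve):  # Improve: filter by reward threshold and fine-tune
            cutoff = threshold + 0.1 * step  # illustrative: raise the threshold each Improve pass
            filtered = [(x, y) for (x, y, r) in dataset if r >= cutoff]
            finetune(filtered)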

West-of-N

A recent paper suggests an approach named “West-of-N” to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. The name comes from Best-of-N + Worst-of-N = “West”-of-N.

Given a query \(x\), West-of-N self-training generates preference pairs by sampling \(N=64\) responses \(\{y_1,\dots,y_N\}\) to a given query, and taking the best and worst according to a base preference model, to produce \((x, y_+, y_-)\). These pseudo-preference pairs are added back into the reward model training mixture. We thus generate a pseudo-preference dataset.

Interestingly, West-of-N has an effect comparable to or greater than adding a similar amount of human preference data (West-of-N self-training results in up to a ∼2.3% increase in reward model accuracy).

West-of-N proposes to maximize the probability of correctly labeling a pair of on-policy responses to a given query. To further improve the quality of generated preference pairs, these can be filtered based on the confidence of their preference label. The authors measure preference label confidence through the prediction \(P_\theta(y_+ \succ y_- \mid x)\), and only retain West-of-N pairs above a certain quantile.
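A hypothetical sketch of building one West-of-N pseudo-preference pair; score stands in for ranking candidates with the base preference model, and pref_prob for its pairwise probability \(P_\theta(y_+ \succ y_- \mid x)\) used in confidence filtering.

from typing import Callable, List, Optional, Tuple

def west_of_n_pair(x: str,
                   sample: Callable[[str], str],
                   score: Callable[[str, str], float],
                   pref_prob: Callable[[str, str, str], float],
                   n: int = 64,
                   min_confidence: float = 0.0) -> Optional[Tuple[str, str, str]]:
    """Build one (x, y_plus, y_minus) pseudo-preference pair, or None if filtered out."""
    ys: List[str] = [sample(x) for _ in range(n)]
    y_plus = max(ys, key=lambda y: score(x, y))    # Best-of-N
    y_minus = min(ys, key=lambda y: score(x, y))   # Worst-of-N
    confidence = pref_prob(x, y_plus, y_minus)     # P_theta(y_plus > y_minus | x)
    return (x, y_plus, y_minus) if confidence >= min_confidence else None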

Direct Preference Knowledge Distillation

(Li, 2024)

RLEF

Reinforcement learning from (code) execution feedback (RLEF) (Gehring, 2024) uses grounding in inference-time feedback, in the domain of code synthesis from natural language descriptions. Here, feedback is naturally provided by executing the generated code, in the form of error messages and unit test results. Starting from Llama 3.1 models, the authors structure the task of code synthesis as a multi-turn conversation in which an LLM is repeatedly prompted to generate a code solution to a natural language problem description, and they employ Proximal Policy Optimization (PPO) to fine-tune the LLM. They also include a KL penalty in the reward signal, acting both as an entropy bonus and as regularization towards the distribution of the LLM they start from. The authors set the turn limit to allow for 3 LLM attempts at solving each problem.
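A hypothetical sketch of one such multi-turn rollout; generate and run_tests are stand-ins for the LLM call and the code execution harness, and the exact conversation format is an assumption.

from typing import Callable, List, Tuple

def rlef_rollout(problem: str,
                 generate: Callable[[List[str]], str],
                 run_tests: Callable[[str], Tuple[bool, str]],
                 max_turns: int = 3) -> Tuple[List[str], bool]:
    """Multi-turn code-synthesis rollout grounded in execution feedback.

    generate(conversation) produces a code attempt given the conversation so far;
    run_tests(code) returns (passed, feedback), where feedback contains error
    messages / unit-test output that is appended to the conversation.
    """
    conversation: List[str] = [problem]
    for _ in range(max_turns):
        code = generate(conversation)
        conversation.append(code)
        passed, feedback = run_tests(code)
        if passed:
            return conversation, True
        conversation.append(feedback)  # ground the next attempt in execution feedback
    return conversation, False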

Datasets for RLHF

  • Reddit TL;DR summarization dataset (Stiennon et al., 2020). The Reddit TL;DR dataset consists of 129k examples of Reddit posts along with human-written summaries, and 64k pairs of model-generated summaries rated by human labelers.

  • The Anthropic Helpful and Harmless question-answering dialogue dataset (Bai et al., 2022a). The Anthropic HH dataset consists of 170k pairs of model-generated responses to a given conversation context, also rated by human labelers for their helpfulness and harmlessness, with the two rating dimensions representing roughly 70% and 30% of the dataset, respectively.

  • The UltraFeedback conversational dataset (Cui et al., 2023). The UltraFeedback dataset contains 64k prompts, each with four associated responses generated by different models and rated by GPT-4.

GRPO

DeepSeekMath (Shao, 2024)

From PPO to GRPO

Proximal Policy Optimization (PPO) (Schulman, 2017) is an actor-critic RL algorithm that is widely used in the RL fine-tuning stage of LLMs (Ouyang, 2022). In particular, it optimizes LLMs by maximizing the following surrogate objective:

\[\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{[q \sim P(Q), o \sim \pi_{\theta_{old}}(O|q)]} \frac{1}{|o|} \sum_{t=1}^{|o|} \min \left[ \frac{\pi_\theta(o_{t} | q, o_{<t})}{\pi_{\theta_{old}}(o_{t} | q, o_{<t})} A_{t}, \text{clip} \left( \frac{\pi_\theta(o_{t} | q, o_{<t})}{\pi_{\theta_{old}}(o_{t} | q, o_{<t})}, 1 - \epsilon, 1 + \epsilon \right) A_{t} \right] ,\]

where \(\pi_{\theta}\) and \(\pi_{\theta_{old}}\) are the current and old policy models, and \(q, o\) are questions and outputs sampled from the question dataset and the old policy \(\pi_{\theta_{old}}\), respectively. \(\epsilon\) is a clipping-related hyper-parameter introduced in PPO for stabilizing training. \(A_t\) is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman, 2015), based on the rewards \(\{r_{\ge t}\}\) and a learned value function \(V_{\psi}\). Thus, in PPO, a value function needs to be trained alongside the policy model. To mitigate over-optimization of the reward model, the standard approach is to add a per-token KL penalty from a reference model to the reward at each token (Ouyang, 2022), i.e.,

\[r_{t} = r_\phi(q, o_{\le t}) - \beta \log\frac{\pi_{\theta}(o_{t}|q, o_{<t})}{\pi_{ref}(o_{t}|q, o_{<t})},\]

where \(r_\phi\) is the reward model, \(\pi_{ref}\) is the reference model, which is usually the initial SFT model, and \(\beta\) is the coefficient of the KL penalty.
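A minimal sketch of the token-level clipped surrogate above (negated so that minimizing it maximizes \(\mathcal{J}_{PPO}\)); the KL-shaped reward itself mirrors the per-token penalty sketched earlier, and all names and toy values here are illustrative.

import torch

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     eps: float = 0.2) -> torch.Tensor:
    """Token-level clipped PPO surrogate.

    logp_new, logp_old: log pi_theta(o_t|q, o_<t) and log pi_theta_old(...), shape (T,).
    advantages: A_t per token, shape (T,).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage over a 4-token output.
loss = ppo_clipped_loss(torch.tensor([-1.0, -0.4, -2.1, -0.9], requires_grad=True),
                        torch.tensor([-1.1, -0.5, -2.0, -1.0]),
                        torch.tensor([0.3, -0.2, 0.8, 0.1]))
print(loss)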

As the value function employed in PPO is typically another model of comparable size to the policy model, it brings a substantial memory and computational burden. Additionally, during RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. However, in the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token.

To address this, Group Relative Policy Optimization (GRPO) obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question \(q\), GRPO samples a group of outputs \(\{o_1, o_2, \cdots, o_G\}\) from the old policy \(\pi_{\theta_{old}}\) and then optimizes the policy model by maximizing the following objective:

\[\begin{aligned} \mathcal{J}_{GRPO}(\theta) &= \mathbb{E}_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)]} \\ & \frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_\theta(o_{i,t} | q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t} | q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,<t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right]\right\} , \end{aligned}\]

where \(\epsilon\) and \(\beta\) are hyper-parameters, and \(\hat{A}_{i,t}\) is the advantage calculated based on relative rewards of the outputs inside each group only, which will be detailed in the following subsections.

The group-relative way in which GRPO calculates advantages aligns well with the comparative nature of reward models, which are typically trained on datasets of comparisons between outputs for the same question. Also note that, instead of adding the KL penalty to the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of \(\hat{A}_{i,t}\). And unlike the KL penalty term used in the PPO reward equation, the KL divergence is estimated with the following unbiased estimator (Schulman, 2020):

\[\mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}- \log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})} - 1,\]

which is guaranteed to be positive.
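A small sketch of this estimator evaluated at sampled tokens, given per-token log-probs under the current policy and the reference model (illustrative values):

import torch

def kl_k3_estimate(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    # Per-token k3 estimator of KL(pi_theta || pi_ref): ratio - log(ratio) - 1,
    # where ratio = pi_ref / pi_theta evaluated at the sampled token.
    log_ratio = logp_ref - logp_theta
    return torch.exp(log_ratio) - log_ratio - 1.0

# Each term is >= 0 since x - log(x) - 1 >= 0 for x > 0.
print(kl_k3_estimate(torch.tensor([-1.2, -0.3]), torch.tensor([-1.0, -0.5])))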

Outcome Supervision RL with GRPO

Formally, for each question \(q\), a group of outputs \(\{o_1, o_2, \cdots, o_G\}\) are sampled from the old policy model \(\pi_{\theta_{old}}\). A reward model is then used to score the outputs, yielding \(G\) rewards \(\mathbf{r}=\{r_1, r_2, \cdots, r_G\}\) correspondingly. Subsequently, these rewards are normalized by subtracting the group average and dividing by the group standard deviation. Outcome supervision provides the normalized reward at the end of each output \(o_i\) and sets the advantages \(\hat{A}_{i, t}\) of all tokens in the output as the normalized reward, i.e., \(\hat{A}_{i, t} = \widetilde{r}_i = \frac{r_i- \mathrm{mean}(\mathbf{r})}{\mathrm{std}(\mathbf{r})}\), and then optimizes the policy by maximizing the objective defined in the GRPO objective above.
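A minimal sketch of the group-relative advantage computation for one question; the epsilon guard on the standard deviation and the toy shapes are assumptions for illustration.

import torch

def group_relative_advantages(rewards: torch.Tensor, lengths: list[int]) -> list[torch.Tensor]:
    """Outcome-supervised GRPO advantages for one question.

    rewards: shape (G,), one scalar reward per sampled output.
    lengths: number of tokens in each output; every token of output i gets the same
             normalized reward as its advantage.
    """
    normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # eps guards against std() == 0
    return [normalized[i].expand(lengths[i]) for i in range(len(lengths))]

# Toy usage: G = 4 outputs with different lengths.
advs = group_relative_advantages(torch.tensor([1.0, 0.0, 0.5, 2.0]), lengths=[3, 5, 4, 2])
print([a.shape for a in advs])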

\[\begin{array}{l} \textbf{Algorithm: } \text{Iterative Group Relative Policy Optimization} \\ \hline \textbf{Input: } \text{initial policy model } \pi_{\theta_{\text{init}}}; \text{ reward models } r_\phi; \text{ task prompts } \mathcal{D}; \\ \phantom{\textbf{Input: }} \text{hyperparameters } \epsilon, \beta, \mu \\ \textbf{Output: } \pi_\theta \\ \begin{aligned} & \text{1: policy model } \pi_\theta \leftarrow \pi_{\theta_{\text{init}}} \\ & \text{2: } \textbf{for } \text{iteration} = 1, \dots, I \textbf{ do} \\ & \text{3: } \quad \text{reference model } \pi_{ref} \leftarrow \pi_{\theta} \\ & \text{4: } \quad \textbf{for } \text{step} = 1, \dots, M \textbf{ do} \\ & \text{5: } \quad \quad \text{Sample a batch } \mathcal{D}_b \text{ from } \mathcal{D} \\ & \text{6: } \quad \quad \text{Update the old policy model } \pi_{\theta_{old}} \leftarrow \pi_{\theta} \\ & \text{7: } \quad \quad \text{Sample } G \text{ outputs } \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}} (\cdot \mid q) \text{ for each question } q \in \mathcal{D}_b \\ & \text{8: } \quad \quad \text{Compute rewards } \{r_i\}_{i=1}^{G} \text{ for each sampled output } o_i \text{ by running } r_{\phi} \\ & \text{9: } \quad \quad \text{Compute } \hat{A}_{i,t} \text{ for the } t\text{-th token of } o_i \text{ through group relative advantage estimation.} \\ & \text{10: } \quad \quad \textbf{for } \text{GRPO iteration} = 1, \dots, \mu \textbf{ do} \\ & \text{11: } \quad \quad \quad \text{Update the policy model } \pi_{\theta} \text{ by maximizing the GRPO objective} \\ & \text{12: } \quad \quad \textbf{end for} \\ & \text{13: } \quad \textbf{end for} \\ & \text{14: } \quad \text{Update } r_\phi \text{ through continuous training using a replay mechanism.} \\ & \text{15: } \textbf{end for} \end{aligned} \end{array}\]

GRPO Implementations:

References

  1. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe. Training language models to follow instructions with human feedback. 2022. [PDF].

  2. Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei. Deep reinforcement learning from human preferences. 2017. [PDF].

  3. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. Learning to summarize from human feedback. 2020. [PDF].

  4. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Proximal Policy Optimization Algorithms. 2017. [PDF]

  5. Ralph Bradley, Milton Terry. Rank analysis of incomplete block designs. 1952. Biometrika.

  6. Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos. A General Theoretical Paradigm to Understand Learning from Human Preferences. 2023. [PDF].

  7. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2023. [PDF].

  8. Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn. West-of-N: Synthetic Preferences for Self-Improving Reward Models. 2024. [PDF].

  9. Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. arXiv:2307.09288.

  10. Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu. Statistical Rejection Sampling Improves Preference Optimization. 2023. [PDF].

  11. Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas. Reinforced Self-Training (ReST) for Language Modeling. 2023.

  12. Leo Gao, John Schulman, Jacob Hilton. Scaling Laws for Reward Model Overoptimization. 2022. [PDF].

  13. Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, Furu Wei. Direct Preference Knowledge Distillation for Large Language Models. 2024. [PDF].

  14. Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, Gabriel Synnaeve. RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning. 2024. [PDF].

  15. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024. [PDF].

  16. John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. 2015. [PDF]

  17. John Schulman. Approximating KL Divergence. http://joschu.net/blog/kl-approx.html. 2020.