Table of Contents:

Overview
Reward Model Training
PPO
Direct Preference Optimization (DPO)
DPO: Deriving the Optimum of the KL-Constrained Reward Maximization Objective
DPO: Solving for Reward, Using the Optimal Policy
DPO: Revealing the DPO Loss Function
DPO: Deriving the DPO Objective Under the Bradley-Terry Model
GRPO
Post-Training in the Post-RLHF Era

Overview

Reinforcement learning from human feedback (RLHF) aligns a language model with users’ values, so that it is truthful, non-toxic, and helpful to the user, through the use of a trained reward model. The language modeling objective (predict the next token on a webpage from the internet) differs from the objective “follow the user’s instructions helpfully and safely,” so additional alignment is necessary.

RLHF pipelines often start with supervised fine-tuning (SFT), e.g. as done in InstructGPT, and then collect a dataset of comparisons or rankings of model outputs that feeds RLHF. The goal in (Christiano, 2017) is to decrease the amount of feedback required by several orders of magnitude, so that training deep RL systems with human feedback becomes practical. We can then solve tasks where we can only recognize the desired behavior, but not necessarily demonstrate it (as would be required for inverse RL). Some of the first RLHF papers were focused on Atari games or MuJoCo robotic tasks, rather than text generation (Christiano, 2017).

The reward model (RM) predicts which model output our labelers would prefer. We then use this RM as a reward function and fine-tune the supervised learning baseline to maximize this reward using the PPO algorithm.

No Free Lunch InstructGPT found that there is an alignment tax: the alignment procedure comes at the cost of lower performance on certain tasks that we may care about. However, RLHF is fundamentally a low alignment tax method, which is good news.

Reward Model Training

Reward models (RM) are trained with a simple binary cross-entropy loss.

Bradley-Terry Model Given a “winner” (w) and “loser” (l), we choose to model the probability that the winner is preferred as:

\[p(y_w > y_l) = \frac{ e^{r^*(x,y_w) } }{ e^{r^*(x,y_w) } + e^{r^*(x,y_l) } }\]

We use exponentials to ensure that we can obtain non-negative probabilities. The reward model is trained on the thing the labels actually provide: relative preference. The absolute scale and offset of the reward are not identified by pairwise comparisons alone. Human preferences are thus modeled as depending on relative utility, not absolute utility.
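The point about scale and offset can be checked numerically. The sketch below is a minimal illustration, not taken from any of the cited papers: the function name `bt_preference_prob` and the reward values are hypothetical. It evaluates the Bradley-Terry probability and shows that adding the same constant to both rewards leaves it unchanged.

```python
import math

def bt_preference_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the winner is preferred over the loser."""
    # Subtract the max before exponentiating for numerical stability.
    m = max(r_w, r_l)
    e_w, e_l = math.exp(r_w - m), math.exp(r_l - m)
    return e_w / (e_w + e_l)

# Hypothetical reward values, chosen only for illustration.
p = bt_preference_prob(2.0, 0.5)
p_shifted = bt_preference_prob(2.0 + 100.0, 0.5 + 100.0)  # same offset on both
print(p, p_shifted)
```

Both calls print the same probability, because only the difference between the two rewards enters the model.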

The model goes back to (Bradley and Terry, 1952), who used it for estimating score functions from pairwise preferences.

Reward models are trained to predict log-odds (Stiennon, 2020).

In (Stiennon et al., 2020), the authors collect pairwise human comparisons between two summaries for the same Reddit post. The reward model training loss function is:

\[loss(\theta) = -\mathbb{E}\Bigg[ \log\bigg(\sigma\Big(r_\theta(x,y_w) - r_\theta(x,y_l)\Big) \bigg) \Bigg]\]

which is actually just binary cross-entropy, under Bradley-Terry. The model’s raw preference strength is the difference: \(z=r_w - r_l\), but \(z\) can be any real number: −10,0,4.7, etc. To use it as a probability that the human prefers \(y_w\) over \(y_l\), we pass it through a sigmoid, moving the result between 0 and 1.

InstructGPT follows the same idea, but instead of showing labelers only two completions, they show K = 4 to 9 completions and ask labelers to rank them. Those rankings are converted into all pairwise winner/loser comparisons for that prompt.

InstructGPT starts from an SFT model and removes the final unembedding layer. This model accepts a prompt and response and outputs a scalar reward.

Given K=4 to K=9 responses to rank, get \(K \choose 2\) comparisons for each prompt shown to a labeler. Train on all \(K \choose 2\) comparisons from each prompt as a single batch element.
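As a small sketch of how a ranking expands into pairwise comparisons (the helper name `ranking_to_pairs` is hypothetical, and ties are assumed not to occur), `itertools.combinations` yields every winner/loser pair from a best-to-worst list:

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (winner, loser) pairs."""
    # combinations preserves input order, so the first element of each pair
    # is always ranked higher than the second.
    return list(combinations(ranked_responses, 2))

pairs = ranking_to_pairs(["a", "b", "c", "d"])  # K = 4 responses, best first
print(len(pairs))  # 6 pairs, i.e. C(4, 2)
print(pairs[:3])
```

For K = 9 the same call produces 36 pairs, all of which would be placed in the same batch element.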

Equivalence to softmax cross-entropy You can also view this loss as ordinary 2-class softmax cross-entropy. Suppose the two “class logits” are the two rewards: \([r_w, r_l]\)

The probability assigned to the preferred answer is:

\[\frac{e^{r_w}}{e^{r_w}+e^{r_l}}\]

The cross-entropy loss (\(\mathrm{BCE}=-[y \log(p) + (1-y)\log(1-p)]\)) for choosing the first item is:

\[-\log \frac{e^{r_w}}{e^{r_w}+e^{r_l}}\]

Divide numerator and denominator by \((e^{r_w})\):

\[-\log \frac{1}{1+e^{r_l-r_w}} = -\log \sigma(r_w-r_l)\]
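The identity above can be verified numerically. The following is a minimal sketch with hypothetical function names and arbitrary reward values; it compares the 2-class softmax cross-entropy against \(-\log \sigma(r_w - r_l)\).

```python
import math

def softmax_ce_first_class(r_w: float, r_l: float) -> float:
    """2-class softmax cross-entropy with the winner as the true class."""
    m = max(r_w, r_l)
    log_z = m + math.log(math.exp(r_w - m) + math.exp(r_l - m))
    return -(r_w - log_z)

def neg_log_sigmoid(z: float) -> float:
    """-log(sigmoid(z)), computed stably as softplus(-z)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Arbitrary (r_w, r_l) pairs, including one where the loser scores higher.
reward_pairs = [(2.0, 0.5), (-1.0, 3.0), (0.0, 0.0), (10.0, -10.0)]
gaps = [abs(softmax_ce_first_class(rw, rl) - neg_log_sigmoid(rw - rl))
        for rw, rl in reward_pairs]
print(max(gaps))
```

The largest gap is at the level of floating-point rounding, which is what the algebra predicts.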

PPO

From PPO to GRPO Proximal Policy Optimization (PPO) (Schulman, 2017) is an actor-critic RL algorithm that is widely used in the RL fine-tuning stage of LLMs like InstructGPT (Ouyang, 2022). In particular, it optimizes LLMs by maximizing the following surrogate objective:

\[\mathcal{J}_{PPO}(\theta) = \mathbb{E}{[q \sim P(Q), o \sim \pi_{\theta_{old}}(O|q)]} \frac{1}{|o|} \sum_{t=1}^{|o|} \min \left[ \frac{\pi_\theta(o_{t} | q, o_{<t})}{\pi_{\theta_{old}}(o_{t} | q, o_{<t})} A_{t}, \text{clip} \left( \frac{\pi_\theta(o_{t} | q, o_{<t})}{\pi_{\theta_{old}}(o_{t} | q, o_{<t})}, 1 - \epsilon, 1 + \epsilon \right) A_{t} \right] ,\]

where \(\pi_{\theta}\) and \(\pi_{\theta_{old}}\) are the current and old policy models, and \(q, o\) are questions and outputs sampled from the question dataset and the old policy \(\pi_{\theta_{old}}\), respectively. \(\epsilon\) is a clipping-related hyper-parameter introduced in PPO for stabilizing training. \(A_t\) is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman, 2015), based on the rewards \(\{r_{\ge t}\}\) and a learned value function \(V_{\psi}\). Thus, in PPO, a value function needs to be trained alongside the policy model. To mitigate over-optimization of the reward model, the standard approach is to add a per-token KL penalty from a reference model (e.g. the SFT model) to the reward at each token (Ouyang, 2022), i.e.,

\[r_{t} = r_\phi(q, o_{\le t}) - \beta \log\frac{\pi_{\theta}(o_{t}|q, o_{<t})}{\pi_{ref}(o_{t}|q, o_{<t})},\]

where \(r_\phi\) is the reward model, \(\pi_{ref}\) is the reference model, which is usually the initial SFT model, and \(\beta\) is the coefficient of the KL penalty.
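To make the two formulas above concrete, here is a minimal pure-Python sketch for a single sampled output. The function names, the default values `eps=0.2` and `beta=0.1`, and the toy numbers are assumptions for illustration; the advantages are taken as given rather than computed with GAE.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean over tokens of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_theta / pi_theta_old for this token
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

def kl_shaped_rewards(rm_rewards, logp_policy, logp_ref, beta=0.1):
    """Per-token reward: reward-model term minus beta * log(pi_theta / pi_ref)."""
    return [r - beta * (lp - lr)
            for r, lp, lr in zip(rm_rewards, logp_policy, logp_ref)]

# Token 1: ratio 1.5 with positive advantage, so the gain is capped at 1.2.
# Token 2: ratio 0.5 with negative advantage, so the pessimistic -0.8 is used.
obj = clipped_surrogate([math.log(1.5), math.log(0.5)], [0.0, 0.0], [1.0, -1.0])

# Assumed setup: the reward model scores only the final token.
shaped = kl_shaped_rewards([0.0, 1.0], [-1.0, -2.0], [-1.5, -2.0])
print(obj, shaped)
```

In both clipped cases the objective takes the smaller of the two terms, so the policy gets no extra credit for moving the ratio further outside the clipping range.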

Direct Preference Optimization (DPO)

DPO (Rafailov, 2023) was a runner-up for Best Paper at NeurIPS 2023. There are now many variants of it, such as Identity Preference Optimization (IPO) (Azar, 2023), MTPO, and Online DPO.

The key insight of DPO is that, for the KL-regularized RLHF objective, the optimal policy has a closed-form relationship to the reward function. This lets us rearrange the optimal-policy equation to express the reward function in terms of its corresponding optimal policy \(\pi_r\), the reference policy \(\pi_\text{ref}\), and the unknown partition function.

DPO: Deriving the Optimum of the KL-Constrained Reward Maximization Objective

Several works derived this objective previously, such as (Peng et al, 2019). We provide the derivation below.

Our goal is to optimize the following standard KL-constrained objective:

\[\max_{\pi} \mathbb{E}_{x\sim \mathcal{D}, y\sim \pi}\bigl[r(x, y)\bigr] - \beta\mathbb{D}_{\textrm{KL}}\bigl[\pi(y|x)||\pi_\text{ref}(y|x)\bigr]\]

under any reward function \(r(x,y)\), reference model \(\pi_\text{ref}\) and a general non-parametric policy class. We now have:

\[\begin{aligned} \max_{\pi} \mathbb{E}_{x\sim \mathcal{D}, y\sim \pi}&\bigl[r(x, y)\bigr] - \beta\mathbb{D}_{\textrm{KL}}\bigl[\pi(y|x)\mid\mid\pi_\text{ref}(y|x)\bigr] \\ &=\max_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi(y|x)}\left[r(x, y) - \beta\log\frac{\pi(y|x)}{\pi_\text{ref}(y|x)}\right] \\ &=\min_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi_\text{ref}(y|x)} - \frac{1}{\beta}r(x, y)\right] & \text{Multiply by } -\tfrac{1}{\beta} \\ &=\min_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right)} - \log Z(x)\right] \end{aligned}\]

where we have partition function:

\begin{equation} Z(x) = \sum_{y}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right). \end{equation}

Note that the partition function is a function of only \(x\) and the reference policy \(\pi_\text{ref}\), but does not depend on the policy \(\pi\). We can now define

\[\pi^*(y|x) = \frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right),\]

which is a valid probability distribution as \(\pi^*(y\mid x) \geq 0\) for all \(y\) and \(\sum_{y}\pi^*(y\mid x)=1\). Since \(Z(x)\) is not a function of \(y\), we can then re-organize the earlier final objective above as:

\[\begin{aligned} \min_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\left[\mathbb{E}_{y\sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi^*(y|x)}\right] - \log Z(x)\right]=\\ \min_{\pi}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{D}_{\text{KL}}(\pi(y|x)\mid\mid\pi^*(y|x)) - \log Z(x)\right] \end{aligned}\]

Now, since \(Z(x)\) does not depend on \(\pi\), the minimum is achieved by the policy that minimizes the first KL term. Gibbs’ inequality tells us that the KL-divergence is minimized at 0 if and only if the two distributions are identical. Hence we have the optimal solution:

\begin{equation} \pi(y|x)= \pi^*(y|x) = \frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right) \end{equation}

for all \(x\in\mathcal{D}\).
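The closed form can be sanity-checked on a toy problem with a single prompt and three possible responses. Everything in this sketch (the function names, the reference probabilities, the rewards, and \(\beta = 0.5\)) is a hypothetical choice for illustration. It builds \(\pi^*\), evaluates the KL-regularized objective, and compares it against two other candidate policies.

```python
import math

def optimal_policy(pi_ref, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z over a finite response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)  # partition function Z(x) for this single prompt
    return [w / z for w in weights], z

def objective(pi, pi_ref, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref) for a single prompt."""
    exp_r = sum(p * r for p, r in zip(pi, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)
    return exp_r - beta * kl

pi_ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.0]
beta = 0.5

pi_star, z = optimal_policy(pi_ref, rewards, beta)
best = objective(pi_star, pi_ref, rewards, beta)
at_ref = objective(pi_ref, pi_ref, rewards, beta)            # no KL cost, modest reward
greedy = objective([0.0, 1.0, 0.0], pi_ref, rewards, beta)   # max reward, high KL cost
print(best, at_ref, greedy, beta * math.log(z))
```

The optimum attains the value \(\beta \log Z(x)\), which follows from the derivation: the KL term vanishes at \(\pi = \pi^*\) and only \(-\log Z(x)\) remains before undoing the multiplication by \(-\frac{1}{\beta}\).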

DPO: Solving for Reward, Using the Optimal Policy

Above, we derived the optimal policy for the KL-regularized RLHF objective. Because the optimal policy has a closed-form relationship to the reward function, we can rearrange the equation above to express the reward function in terms of its corresponding optimal policy \(\pi_r\), the reference policy \(\pi_\text{ref}\), and the unknown partition function \(Z(\cdot)\):

\[\begin{aligned} \pi_r(y\mid x) &= \frac{1}{Z(x)}\pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \\ Z(x) \pi_r(y\mid x) &= Z(x) \frac{1}{Z(x)}\pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \\ Z(x) \pi_r(y\mid x) &= \pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \\ \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } &= \frac{ \pi_\text{ref}(y\mid x) }{ \pi_\text{ref}(y\mid x) }\exp\left(\frac{1}{\beta}r(x, y)\right), \\ \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } &= \exp\left(\frac{1}{\beta}r(x, y)\right), \\ \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= \log \exp\left(\frac{1}{\beta}r(x, y)\right), \\ \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= \frac{1}{\beta}r(x, y) \\ \beta \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= \beta \frac{1}{\beta}r(x, y) \\ \beta \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= r(x, y) \\ \beta \log \Bigg( Z(x) \Bigg) + \beta \log \Bigg( \frac{ \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= r(x, y) \\ \end{aligned}\]

We obtain:

\begin{equation} r(x,y) =\beta \log \frac{\pi_r(y\mid x)}{\pi_\text{ref}(y\mid x)} + \beta \log Z(x). \end{equation}
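As a quick numeric check of this rearrangement, the sketch below (toy values, all hypothetical) constructs \(\pi_r\) from a known reward and then recovers that reward from the log-ratio. It also shows that reward differences between two responses do not need \(Z(x)\) at all.

```python
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]     # reference policy over three responses
rewards = [1.0, 2.0, 0.0]    # the "true" reward we try to recover

# Forward direction: build the optimal policy for this reward.
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_r = [w / z for w in weights]

# Inverse direction: r = beta * log(pi_r / pi_ref) + beta * log Z.
recovered = [beta * math.log(p / q) + beta * math.log(z)
             for p, q in zip(pi_r, pi_ref)]

# Without Z we only know the reward up to a per-prompt constant,
# but differences between responses are already exact.
implicit = [beta * math.log(p / q) for p, q in zip(pi_r, pi_ref)]
print(recovered, implicit[1] - implicit[0])
```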

DPO: Revealing the DPO Loss Function

Next, we’ll plug this reward function definition into the standard reward model training objective. Recall the definition of a sigmoid: \(\sigma(x) = \frac{1}{1 + e^{-x}}\). Let \(A = r^*(x,y_w)\) and \(B = r^*(x,y_l)\). The Bradley-Terry probability can then be written as a sigmoid of the reward difference:

\[\begin{aligned} \frac{ e^A }{ e^A + e^B } &=\frac{ e^A (e^{-A}) }{ (e^A + e^B) (e^{-A}) } =\frac{1}{1 + e^Be^{-A}} =\frac{1}{1 + e^{B-A}} =\frac{1}{1 + e^{-(A-B)}} =\sigma(A-B) \end{aligned}\]

Note that we will model the log-probability, instead of probability, because…

Treating this as a binary classification problem (binary cross entropy), we end up with Equation 2 of the DPO paper as follows:

\[L = -\mathbb{E}_{x,y_w,y_l \sim \mathcal{D}}\Big[ \log \sigma \Big(r_{\phi}(x,y_w) - r_{\phi}(x,y_l)\Big)\Big]\]

Now that we have the probability of human preference data in terms of the optimal policy rather than the reward model, we can formulate a maximum likelihood objective for a parametrized policy \(\pi_\theta\). Our policy objective becomes:

\[\begin{aligned} \mathcal{L}_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma \left(\beta \log \frac{\pi_{\theta}(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)} - \beta \log \frac{\pi_{\theta}(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)}\right)\right] \\ \mathcal{L}_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma \Bigg( \beta \left(\log \frac{\pi_{\theta}(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)} - \log \frac{\pi_{\theta}(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)}\right) \Bigg) \right] \\ \end{aligned}\]

Since the log of a product is a sum of logs, we can write each ratio as a product and expand it as follows:

\[\begin{aligned} \mathcal{L}_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \bigg( \log \left( \pi_{\theta}(y_w\mid x)\cdot\frac{1}{\pi_\text{ref}(y_w\mid x)} \right) - \log \left( \pi_{\theta}(y_l\mid x)\cdot\frac{1}{\pi_\text{ref}(y_l\mid x)} \right) \bigg) \Bigg) \Bigg] \\ &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \bigg( \log \pi_{\theta}(y_w\mid x) + \log\Big(\frac{1}{\pi_\text{ref}(y_w\mid x)}\Big) - \log \pi_{\theta}(y_l\mid x) - \log\Big(\frac{1}{\pi_\text{ref}(y_l\mid x)}\Big) \bigg) \Bigg) \Bigg] \\ &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \bigg( \log \pi_{\theta}(y_w\mid x) + \log\big(\pi_\text{ref}(y_w\mid x)^{-1}\big) - \log \pi_{\theta}(y_l\mid x) - \log\big(\pi_\text{ref}(y_l\mid x)^{-1}\big) \bigg) \Bigg) \Bigg] \\ &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \bigg( \log \pi_{\theta}(y_w\mid x) - \log \pi_\text{ref}(y_w\mid x) - \log \pi_{\theta}(y_l\mid x) + \log \pi_\text{ref}(y_l\mid x) \bigg) \Bigg) \Bigg] \\ &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \bigg( \Big( \log \pi_{\theta}(y_w\mid x) - \log \pi_{\theta}(y_l\mid x) \Big) - \Big( \log \pi_\text{ref}(y_w\mid x) - \log \pi_\text{ref}(y_l\mid x) \Big) \bigg) \Bigg) \Bigg] \\ \end{aligned}\]

This can be implemented simply as (from the DPO codebase):

import torch.nn.functional as F

# Inputs are log-probabilities of the full chosen and rejected responses
# under the policy and under the reference model.
pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = reference_chosen_logps - reference_rejected_logps

if reference_free:
    ref_logratios = 0

logits = pi_logratios - ref_logratios  # also known as h_{\pi_\theta}^{y_w,y_l}

if ipo:
    losses = (logits - 1/(2 * beta)) ** 2  # Eq. 17 of https://arxiv.org/pdf/2310.12036v2.pdf
else:
    # Original DPO loss (Eq. 7 of https://arxiv.org/pdf/2305.18290.pdf).
    # The label-smoothed variant (Eq. 3 of https://ericmitchell.ai/cdpo.pdf) reduces to this when label_smoothing=0.
    losses = -F.logsigmoid(beta * logits)
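For readers who want to run the loss without PyTorch, the following is a minimal pure-Python sketch of the same computation for a single preference pair. The function names, the default `beta=0.1`, and the log-probability values are assumptions for illustration, and the IPO and reference-free branches are left out.

```python
import math

def log_sigmoid(z: float) -> float:
    """Stable log(sigmoid(z)) = -softplus(-z)."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             reference_chosen_logp, reference_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = reference_chosen_logp - reference_rejected_logp
    return -log_sigmoid(beta * (pi_logratio - ref_logratio))

# Policy identical to the reference: the logit is 0 and the loss is log 2.
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy raises the chosen response and lowers the rejected one.
loss_improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(loss_at_init, loss_improved)
```

The loss starts at \(\log 2\) when the policy equals the reference and decreases as the policy widens the chosen/rejected margin beyond the reference's margin.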

DPO: Deriving the DPO Objective Under the Bradley-Terry Model

It is straightforward to derive the DPO objective under the Bradley-Terry preference model as we have

\[p^*(y_1\succ y_2|x)=\frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}\]

Previously, we showed that we can express the (unavailable) ground-truth reward through its corresponding optimal policy:

\[r^*(x,y) =\beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + \beta \log Z(x)\]

Substituting this expression for the reward into the Bradley-Terry model above, the \(\beta \log Z(x)\) terms cancel and we obtain:

\[\begin{aligned} p^*(y_1\succ y_2|x)&=\frac{\exp\left(\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)} + \beta \log Z(x)\right)}{\exp\left(\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)} + \beta \log Z(x)\right) + \exp\left(\beta \log \frac{\pi^*(y_2|x)}{\pi_\text{ref}(y_2|x)} + \beta \log Z(x)\right)}\\ &= \frac{1}{1+\exp\left(\beta \log \frac{\pi^*(y_2|x)}{\pi_\text{ref}(y_2|x)}-\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)}\right)} \\&= \sigma\left(\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)} - \beta \log \frac{\pi^*(y_2|x)}{\pi_\text{ref}(y_2|x)}\right). \end{aligned}\]

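The last line has the same form as the per-instance term in the DPO loss above: replacing \(\pi^*\) with the parametrized policy \(\pi_\theta\) and taking the negative log-likelihood over the preference data gives \(\mathcal{L}_\text{DPO}\).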

GRPO

As the value function employed in PPO is typically another model of comparable size to the policy model, it brings a substantial memory and computational burden. Additionally, during RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. In the LLM context, however, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token, per DeepSeekMath (Shao et al, 2024).

To address this, Group Relative Policy Optimization (GRPO) obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question \(q\), GRPO samples a group of outputs \(\{o_1, o_2, \cdots, o_G\}\) from the old policy \(\pi_{\theta_{old}}\) and then optimizes the policy model by maximizing the following objective:

\[\begin{aligned} \mathcal{J}_{GRPO}(\theta) &= \mathbb{E}{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)]} \\ & \frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_\theta(o_{i,t} | q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t} | q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,<t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right]\right\} , \end{aligned}\]

where \(\epsilon\) and \(\beta\) are hyper-parameters, and \(\hat{A}_{i,t}\) is the advantage calculated based on relative rewards of the outputs inside each group only, which will be detailed in the following subsections.

The group-relative way in which GRPO calculates advantages aligns well with the comparative nature of reward models, as reward models are typically trained on datasets of comparisons between outputs on the same question. Also note that, instead of adding a KL penalty to the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, which avoids complicating the calculation of \(\hat{A}_{i,t}\). Unlike the KL penalty term used in the PPO reward equation, the KL divergence is estimated with the following unbiased estimator (Schulman, 2020):

\[\mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}- \log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})} - 1,\]

which is guaranteed to be non-negative (it is zero exactly when the two probabilities are equal).
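Both properties of this estimator can be checked on a small discrete distribution. The sketch below uses hypothetical probabilities and a hypothetical function name `k3`. Because the distribution is finite, the expectation under \(\pi_\theta\) is computed exactly rather than by sampling.

```python
import math

def k3(logp_theta: float, logp_ref: float) -> float:
    """Per-sample estimate: ratio - log(ratio) - 1, with ratio = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0

# Hypothetical next-token distributions over a 3-token vocabulary.
pi_theta = [0.7, 0.2, 0.1]
pi_ref = [0.5, 0.3, 0.2]

per_token = [k3(math.log(p), math.log(q)) for p, q in zip(pi_theta, pi_ref)]

# Exact KL(pi_theta || pi_ref) versus the estimator's expectation under pi_theta.
exact_kl = sum(p * math.log(p / q) for p, q in zip(pi_theta, pi_ref))
expected_k3 = sum(p * v for p, v in zip(pi_theta, per_token))
print(per_token, exact_kl, expected_k3)
```

Every per-token value is non-negative, unlike the plain log-ratio, while the expectation still matches the exact KL divergence.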

Outcome Supervision RL with GRPO Formally, for each question \(q\), a group of outputs \(\{o_1, o_2, \cdots, o_G\}\) are sampled from the old policy model \(\pi_{\theta_{old}}\). A reward model is then used to score the outputs, yielding \(G\) rewards \(\mathbf{r}=\{r_1, r_2, \cdots, r_G\}\) correspondingly. Subsequently, these rewards are normalized by subtracting the group average and dividing by the group standard deviation. Outcome supervision provides the normalized reward at the end of each output \(o_i\) and sets the advantages \(\hat{A}_{i, t}\) of all tokens in the output as the normalized reward, i.e., \(\hat{A}_{i, t} = \widetilde{r}_i = \frac{r_i- \mathrm{mean}(\mathbf{r})}{\mathrm{std}(\mathbf{r})}\), and then optimizes the policy by maximizing the objective defined in the GRPO objective above.
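The normalization step can be sketched in a few lines. This is an illustration under assumptions, not the DeepSeekMath implementation: the function name is hypothetical, the population standard deviation is used, and a small `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r_i - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for G = 4 outputs sampled for the same question.
rewards = [1.0, 0.0, 0.0, 1.0]
normalized = group_advantages(rewards)

# Outcome supervision: every token of output i shares the same advantage.
output_lengths = [3, 2, 4, 1]
token_advantages = [[a] * n for a, n in zip(normalized, output_lengths)]
print(normalized)
```

Outputs scoring above the group mean receive positive advantages on all of their tokens, and outputs below the mean receive negative ones.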

\[\begin{array}{l} \textbf{Algorithm: } \text{Iterative Group Relative Policy Optimization} \\ \hline \textbf{Input: } \text{initial policy model } \pi_{\theta_{\text{init}}}; \text{ reward models } r_\phi; \text{ task prompts } \mathcal{D}; \\ \phantom{\textbf{Input: }} \text{hyperparameters } \epsilon, \beta, \mu \\ \textbf{Output: } \pi_\theta \\ \begin{aligned} & \text{1: policy model } \pi_\theta \leftarrow \pi_{\theta_{\text{init}}} \\ & \text{2: } \textbf{for } \text{iteration} = 1, \dots, I \textbf{ do} \\ & \text{3: } \quad \text{reference model } \pi_{ref} \leftarrow \pi_{\theta} \\ & \text{4: } \quad \textbf{for } \text{step} = 1, \dots, M \textbf{ do} \\ & \text{5: } \quad \quad \text{Sample a batch } \mathcal{D}_b \text{ from } \mathcal{D} \\ & \text{6: } \quad \quad \text{Update the old policy model } \pi_{\theta_{old}} \leftarrow \pi_{\theta} \\ & \text{7: } \quad \quad \text{Sample } G \text{ outputs } \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}} (\cdot \mid q) \text{ for each question } q \in \mathcal{D}_b \\ & \text{8: } \quad \quad \text{Compute rewards } \{r_i\}_{i=1}^{G} \text{ for each sampled output } o_i \text{ by running } r_{\phi} \\ & \text{9: } \quad \quad \text{Compute } \hat{A}_{i,t} \text{ for the } t\text{-th token of } o_i \text{ through group relative advantage estimation.} \\ & \text{10: } \quad \quad \textbf{for } \text{GRPO iteration} = 1, \dots, \mu \textbf{ do} \\ & \text{11: } \quad \quad \quad \text{Update the policy model } \pi_{\theta} \text{ by maximizing the GRPO objective} \\ & \text{12: } \quad \quad \textbf{end for} \\ & \text{13: } \quad \textbf{end for} \\ & \text{14: } \quad \text{Update } r_\phi \text{ through continuous training using a replay mechanism.} \\ & \text{15: } \textbf{end for} \end{aligned} \end{array}\]

GRPO Implementations:

Post-Training in the Post-RLHF Era

Following the work of (Stiennon et al, 2020) and (Ouyang et al, 2022), LLM post-training started from a relatively clean three-stage RLHF recipe and has since broadly evolved into much more:

\[\text{pretrained LM} \rightarrow \text{SFT} \rightarrow \text{reward model} \rightarrow \text{RL optimization, usually PPO}\]

(Stiennon et al., 2020) applies this recipe to summarization quality on Reddit TL;DR data; (Ouyang et al., 2022) applies it to general instruction following across many user-request types.

RL from verifiable rewards (RLVR)…

RLAIF RL from AI Feedback (Bai et al., 2022b)…

RLEF Reinforcement learning from (code) execution feedback (RLEF) (Gehring, 2024) grounds an LLM in inference-time feedback, in the domain of code synthesis from natural language descriptions. Here, feedback is naturally provided as the result of executing the generated code, in the form of error messages and unit test results. Starting from Llama 3.1 models, the authors structure the task of code synthesis as a multi-turn conversation in which an LLM is repeatedly prompted to generate a code solution to a natural language problem description, and employ PPO to fine-tune the LLM. They also include a KL penalty in their reward signal, acting both as an entropy bonus and as regularization towards the distribution of the LLMs they start from. The authors set the turn limit to allow for 3 LLM attempts at solving each problem.

Best-of-N Sampling Strategies

Sampling strategies that select candidate outputs based on their reward value are popular in language model alignment efforts (Gao et al., 2022).

Best-of-N, or rejection sampling (Touvron et al., 2023), is typically implemented by taking the top-scored generation within a pool of N candidates, or by sampling generations with a probability proportional to their reward value (Liu et al, 2023).
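The top-scored variant of Best-of-N reduces to an argmax over a candidate pool. In the sketch below, `best_of_n` and `toy_score` are hypothetical names, and the length-based score merely stands in for a reward model. The reward-proportional sampling variant is not shown.

```python
def best_of_n(prompt, candidates, score):
    """Return the candidate with the highest reward for this prompt."""
    return max(candidates, key=lambda y: score(prompt, y))

def toy_score(prompt, response):
    # Stand-in for a reward model: longer responses score higher.
    return len(response)

pool = ["ok", "a longer answer", "mid one"]  # N = 3 candidates, assumed pre-sampled
best = best_of_n("some prompt", pool, toy_score)
print(best)
```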

Reinforced Self-Training (ReST) (Gulcehre et al, 2023) includes two loops: in the inner loop (Improve), we improve the policy on a fixed dataset and in the outer loop (Grow), we grow the dataset by sampling from the latest policy.

Grow: Sample many output sequences from the current policy. The new dataset of sequences is then scored with a reward function.
Improve: The datapoints with a reward above a threshold score are used to update the policy. The current best policy is fine-tuned on the filtered data, typically with either the supervised learning loss or an offline RL loss (V-MPO, offline actor-critic).
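The two loops can be illustrated with a deliberately tiny stand-in for a language model. In this sketch, which is a toy under stated assumptions and not the ReST implementation, the "policy" is a unit-variance Gaussian described only by its mean, the reward is the negative distance to a target value, and "fine-tuning" is replaced by refitting the mean to the filtered samples. All names and numbers are hypothetical, and only one Improve step is run per Grow step.

```python
import random
import statistics

def rest(mean, reward_fn, threshold, grow_steps=5, samples_per_step=200, seed=0):
    """Toy Grow/Improve loop over a Gaussian 'policy' parameterized by its mean."""
    rng = random.Random(seed)
    for _ in range(grow_steps):
        # Grow: sample from the current policy and score every sample.
        dataset = [rng.gauss(mean, 1.0) for _ in range(samples_per_step)]
        scored = [(y, reward_fn(y)) for y in dataset]
        # Improve: keep samples whose reward clears the threshold, then refit.
        kept = [y for y, r in scored if r > threshold]
        if kept:
            mean = statistics.fmean(kept)
    return mean

target = 3.0
final_mean = rest(mean=0.0, reward_fn=lambda y: -abs(y - target), threshold=-2.0)
print(final_mean)
```

Starting from a mean of 0, each round keeps only samples within distance 2 of the target, so the refit mean drifts toward the target over successive Grow steps.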

West-of-N

“West-of-N” aims to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. The name comes from Best-of-N + Worst-of-N = “West”-of-N.

Given a query \(x\), West-of-N self-training generates preference pairs by sampling \(N=64\) responses \(\{y_1,\dots,y_N\}\) to a given query, and taking the best and worst according to a base preference model, to produce \((x, y_+, y_-)\). These pseudo-preference pairs are added back into the reward model training mixture. We thus generate a pseudo-preference dataset.
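Pair construction amounts to taking the argmax and argmin of the base preference model's scores. In this minimal sketch, `west_of_n_pair` and `toy_score` are hypothetical names, the score function stands in for the base preference model, and the pool is kept to four responses instead of 64 for readability.

```python
def west_of_n_pair(query, responses, score):
    """Return (query, best, worst) according to a base preference model score."""
    ranked = sorted(responses, key=lambda y: score(query, y))
    return query, ranked[-1], ranked[0]

def toy_score(query, response):
    # Stand-in for the base preference model: counts words shared with the query.
    return len(set(query.split()) & set(response.split()))

query = "how do plants make food"
responses = [
    "plants make food through photosynthesis",
    "I like turtles",
    "plants need water",
    "food is made by plants using light",
]
x, y_plus, y_minus = west_of_n_pair(query, responses, toy_score)
print(y_plus, "|", y_minus)
```

The resulting \((x, y_+, y_-)\) triple has the same shape as a human-labeled preference pair, which is why it can be mixed directly into the reward model training data.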

Interestingly, West-of-N has an effect comparable to or greater than adding a similar amount of human preference data (West-of-N self-training results in up to a ∼2.3% increase in reward model accuracy).

West-of-N proposes to maximize the probability of correctly labeling a pair of on-policy responses to a given query. To further improve the quality of generated preference pairs, these can be filtered based on the confidence of their preference label. The authors measure preference label confidence through the prediction \(P_\theta(y_+ \succ y_- \mid x)\), and only retain West-of-N pairs above a certain quantile.

Direct Preference Knowledge Distillation (Li, 2024)

Datasets for RLHF

The Reddit TL;DR summarization dataset (Völske, 2017), used in (Stiennon et al., 2020). The Reddit TL;DR dataset consists of 129k examples of Reddit posts along with human-written summaries, and 64k pairs of model-generated summaries rated by human labelers.
The Anthropic Helpful and Harmless question-answering dialogue dataset (Bai et al., 2022a). The AnthropicHH dataset consists of 170k pairs of model-generated responses to a given conversation context, also rated by human labelers for their helpfulness and harmlessness, with each rating dimension representing roughly 70% and 30% of the dataset respectively.
The UltraFeedback conversational dataset (Cui et al., 2023). The UltraFeedback dataset consists of 64k prompts, each with four associated responses generated by different models and rated by GPT-4.

References

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe. Training language models to follow instructions with human feedback. 2022. [PDF].
Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei. Deep reinforcement learning from human preferences. 2017. [PDF].
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. Learning to summarize from human feedback. 2020. [PDF].
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Proximal Policy Optimization Algorithms. 2017. [PDF]
Ralph Bradley, Milton Terry. Rank analysis of incomplete block designs. 1952. Biometrika.
Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos. A General Theoretical Paradigm to Understand Learning from Human Preferences. [PDF].
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2023. [PDF].
Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn. West-of-N: Synthetic Preferences for Self-Improving Reward Models. 2024. [PDF].
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu. Statistical Rejection Sampling Improves Preference Optimization. [PDF]
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas. Reinforced Self-Training (ReST) for Language Modeling. [PDF]
Leo Gao, John Schulman, Jacob Hilton. Scaling Laws for Reward Model Overoptimization. 2022. [PDF].
Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, Furu Wei. Direct Preference Knowledge Distillation for Large Language Models. 2024. [PDF].
Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, Gabriel Synnaeve. RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning. 2024. [PDF].
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024. [PDF].
John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. 2015. [PDF]
John Schulman. Approximating KL Divergence. http://joschu.net/blog/kl-approx.html. 2020.
Michael Völske, Martin Potthast, Shahbaz Syed, Benno Stein. TL;DR: Mining Reddit to Learn Automatic Summarization. ACL Proceedings of the Workshop on New Frontiers in Summarization, 2017. [PDF].
Xue Bin Peng, Aviral Kumar, Grace Zhang, Sergey Levine. Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning. 2019. [PDF].