Table of Contents:

Overview

Reinforcement learning from human feedback (RLHF)

Align models (and the language modeling objective) with users' values so that outputs are truthful, non-toxic, and helpful to the user.

Predicting the next token on a webpage from the internet != the objective “follow the user’s instructions helpfully and safely”.

The pipeline often starts with supervised fine-tuning (SFT), e.g. as done in InstructGPT, followed by collecting a dataset of rankings of model outputs and then RLHF. The goal is to decrease the amount of feedback required by several orders of magnitude (Christiano, 2017) so that deep RL systems can practically be trained w/ human feedback. We can then solve tasks where we can only recognize the desired behavior, but not necessarily demonstrate it (as would be required for inverse RL). Some of the first RLHF papers focused on Atari games or MuJoCo robotic tasks, rather than text generation (Christiano, 2017).

The reward model (RM) predicts which model output our labelers would prefer. This RM is then used as a reward function, and the supervised fine-tuned baseline is finetuned to maximize this reward using the PPO algorithm.

No Free Lunch

InstructGPT found that there is an alignment tax: the alignment procedure comes at the cost of lower performance on certain tasks that we may care about. This may be alleviated by mixing PPO updates with updates that increase the log likelihood of the pretraining distribution (PPO-ptx).

Reward Model

InstructGPT starts from an SFT model and removes the final unembedding layer. This model accepts a prompt and response and outputs a scalar reward.

Given K=4 to K=9 responses to rank, get \(\binom{K}{2}\) comparisons for each prompt shown to a labeler. Train on all \(\binom{K}{2}\) comparisons from each prompt as a single batch element.

Loss function:

\[\text{loss}(\theta) = -\frac{1}{\binom{K}{2}} \mathbb{E}_{(x,y_w,y_l)\sim \mathcal{D}}\Big[ \log\big(\sigma(r_\theta(x,y_w) - r_\theta(x,y_l))\big) \Big]\]
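As an illustration, here is a minimal PyTorch sketch of this pairwise loss for a single prompt with K ranked responses; the function name, tensor shapes, and toy values are assumptions for illustration, not the InstructGPT implementation.

import itertools
import torch
import torch.nn.functional as F

def pairwise_rm_loss(rewards: torch.Tensor, ranking: list[int]) -> torch.Tensor:
    """Pairwise reward-model loss for one prompt.

    rewards: shape (K,), scalar reward r_theta(x, y_k) for each of the K responses.
    ranking: indices of the K responses ordered from best to worst by the labeler.
    Returns the mean of -log sigma(r_w - r_l) over all K-choose-2 comparisons.
    """
    losses = []
    for i, j in itertools.combinations(range(len(ranking)), 2):
        r_w = rewards[ranking[i]]  # higher-ranked (preferred) response
        r_l = rewards[ranking[j]]  # lower-ranked response
        losses.append(-F.logsigmoid(r_w - r_l))
    return torch.stack(losses).mean()

# Toy usage: K = 4 responses, labeler ranks response 2 best, then 0, 3, 1.
rewards = torch.tensor([0.8, -0.3, 1.5, 0.1], requires_grad=True)
loss = pairwise_rm_loss(rewards, ranking=[2, 0, 3, 1])
loss.backward()
print(loss.item())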

Bradley-Terry Model

Given a “winner” (\(y_w\)) and a “loser” (\(y_l\)), we choose to model the preference probability as:

\[p(y_w > y_l) = \frac{ e^{r^*(x,y_w) } }{ e^{r^*(x,y_w) } + e^{r^*(x,y_l) } }\]

We use exponentials to ensure that we can obtain non-negative probabilities.
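A quick numerical check of this formula (reward values chosen arbitrarily for illustration):

import math

def bt_prob(r_w: float, r_l: float) -> float:
    # Bradley-Terry preference probability p(y_w > y_l) given scalar rewards.
    return math.exp(r_w) / (math.exp(r_w) + math.exp(r_l))

# With r_w = 1.2 and r_l = 0.2 this is sigma(1.0) ~= 0.731; swapping the rewards gives ~0.269.
print(bt_prob(1.2, 0.2), bt_prob(0.2, 1.2))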

See (Bradley and Terry, 1952) for estimating score functions from pairwise preferences.

Train reward models to predict log-odds (Stiennon, 2020).

PPO

Schulman et al., 2017. Used in InstructGPT.

Adds a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model.
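A minimal sketch of such a KL-shaped per-token reward (names, shapes, and the coefficient are illustrative assumptions, not the InstructGPT code):

import torch

def shaped_rewards(rm_score: torch.Tensor,
                   policy_logprobs: torch.Tensor,
                   sft_logprobs: torch.Tensor,
                   kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token rewards: -kl_coef * (log pi - log pi_SFT) at every token,
    with the scalar reward-model score added on the final token.

    policy_logprobs, sft_logprobs: shape (T,), log-probs of the sampled tokens.
    rm_score: scalar tensor, reward-model score for the full response.
    """
    per_token = -kl_coef * (policy_logprobs - sft_logprobs)
    per_token[-1] = per_token[-1] + rm_score
    return per_token

# Toy usage with a 4-token response.
rewards = shaped_rewards(torch.tensor(0.7),
                         policy_logprobs=torch.tensor([-1.0, -0.5, -2.0, -0.8]),
                         sft_logprobs=torch.tensor([-1.1, -0.6, -1.5, -0.9]))
print(rewards)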

AAC

Direct Preference Optimization (DPO)

DPO (Rafailov, 2023) was a runner-up for Best Paper at NeurIPS 2023. There are now many variants of it, such as Identity Preference Optimization (IPO) (Azar, 2023), MTPO, and Online DPO. Let’s start with the definition of a sigmoid:

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Let \(A = r^*(x,y_w)\) and \(B = r^*(x,y_l)\). Then:

\[\begin{aligned} \frac{ e^A }{ e^A + e^B } &=\frac{ e^A (e^{-A}) }{ (e^A + e^B) (e^{-A}) } =\frac{1}{1 + e^Be^{-A}} =\frac{1}{1 + e^{B-A}} =\frac{1}{1 + e^{-(A-B)}} =\sigma(A-B) \end{aligned}\]
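A two-line numerical sanity check of this identity, with arbitrary values for \(A\) and \(B\):

import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

A, B = 1.7, 0.4  # arbitrary reward values
lhs = math.exp(A) / (math.exp(A) + math.exp(B))
rhs = sigmoid(A - B)
assert abs(lhs - rhs) < 1e-12
print(lhs, rhs)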

Note that we will model the log-probability, instead of probability, because…

Treating this as a binary classification problem, we end up with Equation 2 of the DPO paper as follows:

\[L = -\mathbb{E}_{x,y_w,y_l \sim \mathcal{D}}\Big[ \log \sigma \Big(r_{\phi}(x,y_w) - r_{\phi}(x,y_l)\Big)\Big]\]

The key step in DPO is that the KL-constrained reward maximization objective used in RLHF has a closed-form optimal policy (see the Derivation section below):

\begin{equation} \pi_r(y\mid x) = \frac{1}{Z(x)}\pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \end{equation}

where \(Z(x) =\sum_{y}\pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right)\) is the partition function.

However, we can rearrange the equation above to express the reward function in terms of its corresponding optimal policy \(\pi_r\), the reference policy \(\pi_\text{ref}\), and the unknown partition function \(Z(\cdot)\):

\[\begin{aligned} \pi_r(y\mid x) &= \frac{1}{Z(x)}\pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \\ Z(x) \pi_r(y\mid x) &= Z(x) \frac{1}{Z(x)}\pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \\ Z(x) \pi_r(y\mid x) &= \pi_\text{ref}(y\mid x)\exp\left(\frac{1}{\beta}r(x, y)\right), \\ \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } &= \frac{ \pi_\text{ref}(y\mid x) }{ \pi_\text{ref}(y\mid x) }\exp\left(\frac{1}{\beta}r(x, y)\right), \\ \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } &= \exp\left(\frac{1}{\beta}r(x, y)\right), \\ \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= \log \exp\left(\frac{1}{\beta}r(x, y)\right), \\ \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= \frac{1}{\beta}r(x, y) \\ \beta \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= \beta \frac{1}{\beta}r(x, y) \\ \beta \log \Bigg( \frac{ Z(x) \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= r(x, y) \\ \beta \log \Bigg( Z(x) \Bigg) + \beta \log \Bigg( \frac{ \pi_r(y\mid x) }{ \pi_\text{ref}(y\mid x) } \Bigg) &= r(x, y) \\ \end{aligned}\]

We obtain:

\begin{equation} r(x,y) =\beta \log \frac{\pi_r(y\mid x)}{\pi_\text{ref}(y\mid x)} + \beta \log Z(x). \end{equation}

Now that we have the probability of human preference data in terms of the optimal policy rather than the reward model, we can formulate a maximum likelihood objective for a parametrized policy \(\pi_\theta\). Analogous to the reward modeling approach (i.e., the Bradley-Terry loss above), our policy objective becomes:

\[\begin{aligned} \mathcal{L}_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma \left(\beta \log \frac{\pi_{\theta}(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)} - \beta \log \frac{\pi_{\theta}(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)}\right)\right]. \\ \mathcal{L}_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) = -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}}\left[\log \sigma \Bigg( \beta \left(\log \frac{\pi_{\theta}(y_w\mid x)}{\pi_\text{ref}(y_w\mid x)} - \log \frac{\pi_{\theta}(y_l\mid x)}{\pi_\text{ref}(y_l\mid x)}\right) \Bigg) \right]. \\ \end{aligned}\]

Since the log of a quotient is a difference of logs, we can expand each ratio as follows:

\[\begin{aligned} \mathcal{L}_\text{DPO}(\pi_{\theta}; \pi_\text{ref}) &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \bigg( \log \left( \pi_{\theta}(y_w\mid x)\cdot\frac{1}{\pi_\text{ref}(y_w\mid x)} \right) - \log \left( \pi_{\theta}(y_l\mid x)\cdot\frac{1}{\pi_\text{ref}(y_l\mid x)} \right) \bigg) \Bigg) \Bigg] \\ &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \bigg( \log \pi_{\theta}(y_w\mid x) - \log \pi_\text{ref}(y_w\mid x) - \log \pi_{\theta}(y_l\mid x) + \log \pi_\text{ref}(y_l\mid x) \bigg) \Bigg) \Bigg] \\ &= -\mathbb{E}_{(x, y_w, y_l)\sim \mathcal{D}} \Bigg[ \log \sigma \Bigg( \beta \Big( \big( \log \pi_{\theta}(y_w\mid x) - \log \pi_{\theta}(y_l\mid x) \big) - \big( \log \pi_\text{ref}(y_w\mid x) - \log \pi_\text{ref}(y_l\mid x) \big) \Big) \Bigg) \Bigg]. \end{aligned}\]

This can be implemented simply as (from the DPO codebase):

import torch.nn.functional as F

# policy_*_logps / reference_*_logps are the summed log-probabilities of the chosen
# and rejected responses under the policy being trained and the frozen reference model.
pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = reference_chosen_logps - reference_rejected_logps

if reference_free:
    ref_logratios = 0

logits = pi_logratios - ref_logratios  # also known as h_{\pi_\theta}^{y_w,y_l}

if ipo:
    losses = (logits - 1/(2 * beta)) ** 2  # Eq. 17 of https://arxiv.org/pdf/2310.12036v2.pdf
else:
    # Eq. 3 https://ericmitchell.ai/cdpo.pdf; label_smoothing=0 gives original DPO (Eq. 7 of https://arxiv.org/pdf/2305.18290.pdf)
    losses = -F.logsigmoid(beta * logits)
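The *_logps above are per-sequence (summed) log-probabilities. Below is a minimal, hypothetical sketch of how they could be computed from model logits; the real DPO helper differs in details such as label shifting and masking conventions.

import torch

def sequence_logps(logits: torch.Tensor, labels: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities of `labels` under `logits`.

    logits: (batch, seq_len, vocab) model outputs already aligned to predict `labels`.
    labels: (batch, seq_len) target token ids.
    mask:   (batch, seq_len) 1 for response tokens, 0 for prompt/padding tokens.
    """
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, 2, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * mask).sum(-1)

# Toy usage: batch of 2 sequences, length 3, vocab of 5.
logits = torch.randn(2, 3, 5)
labels = torch.randint(0, 5, (2, 3))
mask = torch.ones(2, 3)
print(sequence_logps(logits, labels, mask))  # shape (2,)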

Derivation

Deriving the Optimum of the KL-Constrained Reward Maximization Objective

We will now derive the closed-form optimal policy used above. As in the standard RLHF objective, we optimize the following objective:

\[\max_{\pi} \mathbb{E}_{x\sim \mathcal{D}, y\sim \pi}\bigl[r(x, y)\bigr] - \beta\mathbb{D}_{\textrm{KL}}\bigl[\pi(y|x)||\pi_\text{ref}(y|x)\bigr]\]

under any reward function \(r(x,y)\), reference model \(\pi_\text{ref}\) and a general non-parametric policy class. We now have:

\[\begin{aligned} \max_{\pi} \mathbb{E}_{x\sim \mathcal{D}, y\sim \pi}&\bigl[r(x, y)\bigr] - \beta\mathbb{D}_{\textrm{KL}}\bigl[\pi(y|x)\mid\mid\pi_\text{ref}(y|x)\bigr] \\ &=\max_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi(y|x)}\left[r(x, y) - \beta\log\frac{\pi(y|x)}{\pi_\text{ref}(y|x)}\right] \\ &=\min_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi_\text{ref}(y|x)} - \frac{1}{\beta}r(x, y)\right] & \text{Multiply by -1} \\ &=\min_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\mathbb{E}_{y\sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right)} - \log Z(x)\right] \end{aligned}\]

where the partition function is:

\begin{equation} Z(x) = \sum_{y}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right). \end{equation}

Note that the partition function is a function of only \(x\) and the reference policy \(\pi_\text{ref}\), but does not depend on the policy \(\pi\). We can now define

\[\pi^*(y|x) = \frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right),\]

which is a valid probability distribution as \(\pi^*(y\mid x) \geq 0\) for all \(y\) and \(\sum_{y}\pi^*(y\mid x)=1\). Since \(Z(x)\) is not a function of \(y\), we can then re-organize the final objective above as:

\[\begin{aligned} \min_{\pi} \mathbb{E}_{x\sim \mathcal{D}}\left[\mathbb{E}_{y\sim \pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi^*(y|x)}\right] - \log Z(x)\right]=\\ \min_{\pi}\mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{D}_{\text{KL}}(\pi(y|x)\mid\mid\pi^*(y|x)) - \log Z(x)\right] \end{aligned}\]

Now, since \(Z(x)\) does not depend on \(\pi\), the minimum is achieved by the policy that minimizes the first KL term. Gibbs’ inequality tells us that the KL-divergence is minimized at 0 if and only if the two distributions are identical. Hence we have the optimal solution:

\begin{equation} \pi(y|x)= \pi^*(y|x) = \frac{1}{Z(x)}\pi_\text{ref}(y|x)\exp\left(\frac{1}{\beta}r(x, y)\right) \end{equation}

for all \(x\in\mathcal{D}\). This completes the derivation.

Deriving the DPO Objective Under the Bradley-Terry Model

It is straightforward to derive the DPO objective under the Bradley-Terry preference model, as we have

\[p^*(y_1\succ y_2|x)=\frac{\exp\left(r^*(x, y_1)\right)}{\exp\left(r^*(x, y_1)\right) + \exp\left(r^*(x, y_2)\right)}\]

Above, we showed that we can express the (unavailable) ground-truth reward through its corresponding optimal policy:

\[r^*(x,y) =\beta \log \frac{\pi^*(y|x)}{\pi_\text{ref}(y|x)} + \beta \log Z(x)\]

Substituting this expression for the reward into the Bradley-Terry model above, we obtain:

\[\begin{aligned} p^*(y_1\succ y_2|x)&=\frac{\exp\left(\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)} + \beta \log Z(x)\right)}{\exp\left(\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)} + \beta \log Z(x)\right) + \exp\left(\beta \log \frac{\pi^*(y_2|x)}{\pi_\text{ref}(y_2|x)} + \beta \log Z(x)\right)}\\ &= \frac{1}{1+\exp\left(\beta \log \frac{\pi^*(y_2|x)}{\pi_\text{ref}(y_2|x)}-\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)}\right)} \\&= \sigma\left(\beta \log \frac{\pi^*(y_1|x)}{\pi_\text{ref}(y_1|x)} - \beta \log \frac{\pi^*(y_2|x)}{\pi_\text{ref}(y_2|x)}\right). \end{aligned}\]

The last line is the per-instance preference probability that appears inside the DPO loss above.

ELO

RLAIF

RL from AI Feedback (Bai et al., 2022b).

Best-of-N Sampling Strategies

Sampling strategies that select candidate outputs based on their reward value are popular in language model alignment efforts (Gao et al., 2022).

Best-of-N, or rejection sampling (Touvron et al., 2023), is typically implemented by taking the top-scored generation within a pool of N candidates, or by sampling generations with probability proportional to their reward value (Liu et al., 2023).
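A minimal sketch of Best-of-N selection, assuming hypothetical generate and reward callables:

from typing import Callable, List

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 16) -> str:
    # Sample n candidate responses and return the highest-reward one.
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, y) for y in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]

Sampling proportionally to reward would instead draw from, e.g., a softmax over the scores rather than taking the argmax.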

Reinforced Self-Training (ReST) (Gulcehre et al., 2023) includes two loops: in the inner loop (Improve), we improve the policy on a fixed dataset, and in the outer loop (Grow), we grow the dataset by sampling from the latest policy; a sketch of both loops follows the list below.

  • Grow: Sampling many output sequences from the current policy. The new dataset of sequences is then scored with a reward function.
  • Improve: The datapoints with a reward above a threshold score are used to update the policy. Finetune the current best policy, typically with either a supervised learning loss or an offline RL loss (V-MPO, offline actor-critic) on the filtered data.
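Here is a compact, hypothetical sketch of the two loops with a supervised Improve step; the callables, threshold schedule, and counts are illustrative assumptions, not the paper's exact procedure.

from typing import Callable, List, Tuple

def rest(policy_sample: Callable[[str], str],
         finetune: Callable[[List[Tuple[str, str]]], None],
         reward: Callable[[str, str], float],
         prompts: List[str],
         n_grow: int = 3,
         n_improve: int = 2,
         samples_per_prompt: int = 4,
         threshold: float = 0.5) -> None:
    """Hypothetical sketch of ReST's Grow/Improve loops with a supervised Improve step."""
    for _ in range(n_grow):  # Grow: sample from the current policy and score with the reward
        dataset = []
        for x in prompts:
            for _ in range(samples_per_prompt):
                y = policy_sample(x)
                dataset.append((x, y, reward(x, y)))
        for step in range(n_improve):  # Improve: filter by reward threshold and fine-tune
            cutoff = threshold + 0.1 * step  # illustrative: raise the threshold each Improve pass
            filtered = [(x, y) for (x, y, r) in dataset if r >= cutoff]
            finetune(filtered)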

West-of-N

A recent paper suggests an approach named “West-of-N” to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. The name comes from Best-of-N + Worst-of-N = “West”-of-N.

Given a query \(x\), West-of-N self-training generates preference pairs by sampling \(N=64\) responses \(\{y_1,\dots,y_N\}\) to a given query, and taking the best and worst according to a base preference model, to produce \((x, y_+, y_-)\). These pseudo-preference pairs are added back into the reward model training mixture. We thus generate a pseudo-preference dataset.

Interestingly, West-of-N has an effect comparable to or greater than adding a similar amount of human preference data (West-of-N self-training results in up to a ∼2.3% increase in reward model accuracy).

West-of-N proposes to maximize the probability of correctly labeling a pair of on-policy responses to a given query. To further improve the quality of generated preference pairs, these can be filtered based on the confidence of their preference label. The authors measure preference label confidence through the prediction \(P_\theta(y_+ \succ y_- \mid x)\), and only retain West-of-N pairs above a certain quantile.
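A hypothetical sketch of building one West-of-N pseudo-preference pair; score stands in for ranking candidates with the base preference model, and pref_prob for its pairwise probability \(P_\theta(y_+ \succ y_- \mid x)\) used in confidence filtering.

from typing import Callable, List, Optional, Tuple

def west_of_n_pair(x: str,
                   sample: Callable[[str], str],
                   score: Callable[[str, str], float],
                   pref_prob: Callable[[str, str, str], float],
                   n: int = 64,
                   min_confidence: float = 0.0) -> Optional[Tuple[str, str, str]]:
    """Build one (x, y_plus, y_minus) pseudo-preference pair, or None if filtered out."""
    ys: List[str] = [sample(x) for _ in range(n)]
    y_plus = max(ys, key=lambda y: score(x, y))    # Best-of-N
    y_minus = min(ys, key=lambda y: score(x, y))   # Worst-of-N
    confidence = pref_prob(x, y_plus, y_minus)     # P_theta(y_plus > y_minus | x)
    return (x, y_plus, y_minus) if confidence >= min_confidence else None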

Direct Preference Knowledge Distillation

(Li, 2024)

RLEF

Reinforcement learning from (code) execution feedback (RLEF) (Gehring, 2024) uses grounding in inference-time feedback, in the domain of code synthesis from natural language descriptions. Here, feedback is naturally provided by executing the generated code, in the form of error messages and unit test results. Starting from Llama 3.1 models, the authors structure the task of code synthesis as a multi-turn conversation in which an LLM is repeatedly prompted to generate a code solution to a natural language problem description, and they employ Proximal Policy Optimization (PPO) to fine-tune the LLM. They also include a KL penalty in the reward signal, acting both as an entropy bonus and as regularization towards the distribution of the LLM they start from. The authors set the turn limit to allow for 3 LLM attempts at solving each problem.
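A hypothetical sketch of one such multi-turn rollout; generate and run_tests are stand-ins for the LLM call and the code execution harness, and the exact conversation format is an assumption.

from typing import Callable, List, Tuple

def rlef_rollout(problem: str,
                 generate: Callable[[List[str]], str],
                 run_tests: Callable[[str], Tuple[bool, str]],
                 max_turns: int = 3) -> Tuple[List[str], bool]:
    """Multi-turn code-synthesis rollout grounded in execution feedback.

    generate(conversation) produces a code attempt given the conversation so far;
    run_tests(code) returns (passed, feedback), where feedback contains error
    messages / unit-test output that is appended to the conversation.
    """
    conversation: List[str] = [problem]
    for _ in range(max_turns):
        code = generate(conversation)
        conversation.append(code)
        passed, feedback = run_tests(code)
        if passed:
            return conversation, True
        conversation.append(feedback)  # ground the next attempt in execution feedback
    return conversation, False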

Datasets for RLHF

  • Reddit TL;DR summarization dataset (Stiennon et al., 2020). The Reddit TL;DR dataset consists of 129k examples of Reddit posts along with human-written summaries, and 64k pairs of model-generated summaries rated by human labelers.

  • The Anthropic Helpful and Harmless question-answering dialogue dataset (Bai et al., 2022a). The Anthropic HH dataset consists of 170k pairs of model-generated responses to a given conversation context, also rated by human labelers for their helpfulness and harmlessness, with the two rating dimensions representing roughly 70% and 30% of the dataset, respectively.

  • The UltraFeedback conversational dataset (Cui et al., 2023). The UltraFeedback dataset contains 64k prompts, each with four associated responses generated by different models and rated by GPT-4.

GRPO

DeepSeekMath (Shao, 2024)

From PPO to GRPO

Proximal Policy Optimization (PPO) (Schulman, 2017) is an actor-critic RL algorithm that is widely used in the RL fine-tuning stage of LLMs (Ouyang, 2022). In particular, it optimizes LLMs by maximizing the following surrogate objective:

\[\mathcal{J}_{PPO}(\theta) = \mathbb{E}_{[q \sim P(Q), o \sim \pi_{\theta_{old}}(O|q)]} \frac{1}{|o|} \sum_{t=1}^{|o|} \min \left[ \frac{\pi_\theta(o_{t} | q, o_{<t})}{\pi_{\theta_{old}}(o_{t} | q, o_{<t})} A_{t}, \text{clip} \left( \frac{\pi_\theta(o_{t} | q, o_{<t})}{\pi_{\theta_{old}}(o_{t} | q, o_{<t})}, 1 - \epsilon, 1 + \epsilon \right) A_{t} \right] ,\]

where \(\pi_{\theta}\) and \(\pi_{\theta_{old}}\) are the current and old policy models, and \(q, o\) are questions and outputs sampled from the question dataset and the old policy \(\pi_{\theta_{old}}\), respectively. \(\epsilon\) is a clipping-related hyper-parameter introduced in PPO for stabilizing training. \(A_t\) is the advantage, which is computed by applying Generalized Advantage Estimation (GAE) (Schulman, 2015), based on the rewards \(\{r_{\ge t}\}\) and a learned value function \(V_{\psi}\). Thus, in PPO, a value function needs to be trained alongside the policy model. To mitigate over-optimization of the reward model, the standard approach is to add a per-token KL penalty from a reference model to the reward at each token (Ouyang, 2022), i.e.,

\[r_{t} = r_\phi(q, o_{\le t}) - \beta \log\frac{\pi_{\theta}(o_{t}|q, o_{<t})}{\pi_{ref}(o_{t}|q, o_{<t})},\]

where \(r_\phi\) is the reward model, \(\pi_{ref}\) is the reference model, which is usually the initial SFT model, and \(\beta\) is the coefficient of the KL penalty.
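A minimal sketch of the token-level clipped surrogate above (negated so that minimizing it maximizes \(\mathcal{J}_{PPO}\)); the KL-shaped reward itself mirrors the per-token penalty sketched earlier, and all names and toy values here are illustrative.

import torch

def ppo_clipped_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     eps: float = 0.2) -> torch.Tensor:
    """Token-level clipped PPO surrogate.

    logp_new, logp_old: log pi_theta(o_t|q, o_<t) and log pi_theta_old(...), shape (T,).
    advantages: A_t per token, shape (T,).
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy usage over a 4-token output.
loss = ppo_clipped_loss(torch.tensor([-1.0, -0.4, -2.1, -0.9], requires_grad=True),
                        torch.tensor([-1.1, -0.5, -2.0, -1.0]),
                        torch.tensor([0.3, -0.2, 0.8, 0.1]))
print(loss)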

As the value function employed in PPO is typically another model of comparable size to the policy model, it brings a substantial memory and computational burden. Additionally, during RL training, the value function is treated as a baseline in the calculation of the advantage for variance reduction. However, in the LLM context, usually only the last token is assigned a reward score by the reward model, which may complicate the training of a value function that is accurate at each token.

To address this, Group Relative Policy Optimization (GRPO) obviates the need for additional value function approximation as in PPO, and instead uses the average reward of multiple sampled outputs, produced in response to the same question, as the baseline. More specifically, for each question \(q\), GRPO samples a group of outputs \(\{o_1, o_2, \cdots, o_G\}\) from the old policy \(\pi_{\theta_{old}}\) and then optimizes the policy model by maximizing the following objective:

\[\begin{aligned} \mathcal{J}_{GRPO}(\theta) &= \mathbb{E}_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O|q)]} \\ & \frac{1}{G}\sum_{i=1}^G\frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min \left[ \frac{\pi_\theta(o_{i,t} | q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t} | q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} | q, o_{i,<t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right] - \beta \mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right]\right\} , \end{aligned}\]

where \(\epsilon\) and \(\beta\) are hyper-parameters, and \(\hat{A}_{i,t}\) is the advantage calculated based on relative rewards of the outputs inside each group only, which will be detailed in the following subsections.

The group-relative way in which GRPO calculates advantages aligns well with the comparative nature of reward models, which are typically trained on datasets of comparisons between outputs for the same question. Also note that, instead of adding the KL penalty to the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss, avoiding complicating the calculation of \(\hat{A}_{i,t}\). And unlike the KL penalty term used in the PPO reward equation, the KL divergence is estimated with the following unbiased estimator (Schulman, 2020):

\[\mathbb{D}_{KL}\left[\pi_{\theta} || \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})}- \log\frac{\pi_{ref}(o_{i,t}|q,o_{i,<t})}{\pi_{\theta}(o_{i,t}|q,o_{i,<t})} - 1,\]

which is guaranteed to be positive.
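A small sketch of this estimator evaluated at sampled tokens, given per-token log-probs under the current policy and the reference model (illustrative values):

import torch

def kl_k3_estimate(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    # Per-token k3 estimator of KL(pi_theta || pi_ref): ratio - log(ratio) - 1,
    # where ratio = pi_ref / pi_theta evaluated at the sampled token.
    log_ratio = logp_ref - logp_theta
    return torch.exp(log_ratio) - log_ratio - 1.0

# Each term is >= 0 since x - log(x) - 1 >= 0 for x > 0.
print(kl_k3_estimate(torch.tensor([-1.2, -0.3]), torch.tensor([-1.0, -0.5])))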

Outcome Supervision RL with GRPO

Formally, for each question \(q\), a group of outputs \(\{o_1, o_2, \cdots, o_G\}\) are sampled from the old policy model \(\pi_{\theta_{old}}\). A reward model is then used to score the outputs, yielding \(G\) rewards \(\mathbf{r}=\{r_1, r_2, \cdots, r_G\}\) correspondingly. Subsequently, these rewards are normalized by subtracting the group average and dividing by the group standard deviation. Outcome supervision provides the normalized reward at the end of each output \(o_i\) and sets the advantages \(\hat{A}_{i, t}\) of all tokens in the output as the normalized reward, i.e., \(\hat{A}_{i, t} = \widetilde{r}_i = \frac{r_i- \mathrm{mean}(\mathbf{r})}{\mathrm{std}(\mathbf{r})}\), and then optimizes the policy by maximizing the objective defined in the GRPO objective above.
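A minimal sketch of the group-relative advantage computation for one question; the epsilon guard on the standard deviation and the toy shapes are assumptions for illustration.

import torch

def group_relative_advantages(rewards: torch.Tensor, lengths: list[int]) -> list[torch.Tensor]:
    """Outcome-supervised GRPO advantages for one question.

    rewards: shape (G,), one scalar reward per sampled output.
    lengths: number of tokens in each output; every token of output i gets the same
             normalized reward as its advantage.
    """
    normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # eps guards against std() == 0
    return [normalized[i].expand(lengths[i]) for i in range(len(lengths))]

# Toy usage: G = 4 outputs with different lengths.
advs = group_relative_advantages(torch.tensor([1.0, 0.0, 0.5, 2.0]), lengths=[3, 5, 4, 2])
print([a.shape for a in advs])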

\[\begin{array}{l} \textbf{Algorithm: } \text{Iterative Group Relative Policy Optimization} \\ \hline \textbf{Input: } \text{initial policy model } \pi_{\theta_{\text{init}}}; \text{ reward models } r_\phi; \text{ task prompts } \mathcal{D}; \\ \phantom{\textbf{Input: }} \text{hyperparameters } \epsilon, \beta, \mu \\ \textbf{Output: } \pi_\theta \\ \begin{aligned} & \text{1: policy model } \pi_\theta \leftarrow \pi_{\theta_{\text{init}}} \\ & \text{2: } \textbf{for } \text{iteration} = 1, \dots, I \textbf{ do} \\ & \text{3: } \quad \text{reference model } \pi_{ref} \leftarrow \pi_{\theta} \\ & \text{4: } \quad \textbf{for } \text{step} = 1, \dots, M \textbf{ do} \\ & \text{5: } \quad \quad \text{Sample a batch } \mathcal{D}_b \text{ from } \mathcal{D} \\ & \text{6: } \quad \quad \text{Update the old policy model } \pi_{\theta_{old}} \leftarrow \pi_{\theta} \\ & \text{7: } \quad \quad \text{Sample } G \text{ outputs } \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}} (\cdot \mid q) \text{ for each question } q \in \mathcal{D}_b \\ & \text{8: } \quad \quad \text{Compute rewards } \{r_i\}_{i=1}^{G} \text{ for each sampled output } o_i \text{ by running } r_{\phi} \\ & \text{9: } \quad \quad \text{Compute } \hat{A}_{i,t} \text{ for the } t\text{-th token of } o_i \text{ through group relative advantage estimation.} \\ & \text{10: } \quad \quad \textbf{for } \text{GRPO iteration} = 1, \dots, \mu \textbf{ do} \\ & \text{11: } \quad \quad \quad \text{Update the policy model } \pi_{\theta} \text{ by maximizing the GRPO objective} \\ & \text{12: } \quad \quad \textbf{end for} \\ & \text{13: } \quad \textbf{end for} \\ & \text{14: } \quad \text{Update } r_\phi \text{ through continuous training using a replay mechanism.} \\ & \text{15: } \textbf{end for} \end{aligned} \end{array}\]

GRPO Implementations:

References

  1. Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe. Training language models to follow instructions with human feedback. 2022. [PDF].

  2. Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei. Deep reinforcement learning from human preferences. 2017. [PDF].

  3. Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, Paul Christiano. Learning to summarize from human feedback. 2020. [PDF].

  4. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Proximal Policy Optimization Algorithms. 2017. [PDF]

  5. Ralph Bradley, Milton Terry. Rank analysis of incomplete block designs. 1952. Biometrika.

  6. Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos. A General Theoretical Paradigm to Understand Learning from Human Preferences. 2023. [PDF].

  7. Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2023. [PDF].

  8. Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn. West-of-N: Synthetic Preferences for Self-Improving Reward Models. 2024. [PDF].

  9. Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023. arXiv:2307.09288.

  10. Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, Jialu Liu. Statistical Rejection Sampling Improves Preference Optimization. 2023. [PDF].

  11. Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, Nando de Freitas. Reinforced Self-Training (ReST) for Language Modeling. 2023.

  12. Leo Gao, John Schulman, Jacob Hilton. Scaling Laws for Reward Model Overoptimization. 2022. [PDF].

  13. Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, Furu Wei. Direct Preference Knowledge Distillation for Large Language Models. 2024. [PDF].

  14. Jonas Gehring, Kunhao Zheng, Jade Copet, Vegard Mella, Taco Cohen, Gabriel Synnaeve. RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning. 2024. [PDF].

  15. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024. [PDF].

  16. John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. 2015. [PDF]

  17. John Schulman. Approximating KL Divergence. http://joschu.net/blog/kl-approx.html. 2020.