The GRPO objective is presented in the DeepSeekMath paper (submitted February 2024). I will describe my interpretation of the objective function, and then give my views.

\[\begin{split} \mathcal{J}_{GRPO}(\theta) &= \mathbb{E}_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O\vert q)]} \\ & \frac{1}{G}\sum_{i=1}^G\frac{1}{\vert o_i\vert} \sum_{t=1}^{\vert o_i\vert} \Biggl\{ \min \Biggl[ \frac{\pi_\theta(o_{i,t} \vert q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \vert q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t} \vert q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \vert q, o_{i,<t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \Biggr] \\ & \phantom{\frac{1}{G}\sum_{i=1}^G\frac{1}{\vert o_i\vert} \sum_{t=1}^{\vert o_i\vert} \Biggl\{ \min \Biggl[} - \beta \mathbb{D}_{KL}\left[\pi_{\theta} \vert\vert \pi_{ref}\right]\Biggr\} , \end{split}\]

Let me break down this expression, component by component:

Policy $\pi$: The policy is the large language model. In reinforcement learning, the policy is the model that generates the actions; here, the large language model generates actions by generating tokens.

Policy parameters $\theta$: The parameters of the policy. In the context of language models, $\theta$ represents all the neural network weights (typically billions of parameters).

Current policy $\pi_\theta$: This is the language model parameterized by $\theta$ that we’re currently optimizing. When we write $\pi_\theta(o_{i,t} \vert q, o_{i,<t})$, this represents the probability that the model assigns to generating token $o_{i,t}$ given the question $q$ and all previously generated tokens $o_{i,<t}$ in the sequence.

Old policy $\pi_{\theta_{old}}$: This is the policy that generated the tokens that we will be training on. This starts out the same as the current policy. As we update the parameters of the current policy, we are training the current policy on tokens generated by the old policy.

Reference policy $\pi_{ref}$: In the original GRPO implementation, this is the snapshot of the policy taken at the start of each training iteration (it is reset to the current policy once per iteration). See the pseudocode for iterative training below.

Objective function \(\mathcal{J}_{GRPO}(\theta)\): This is the objective function to be maximized by optimizing parameters $\theta$.

Expectation $\mathbb{E}[\cdot]$: The expectation $\mathbb{E}[\cdot]$ means the average over all possible samples. In practice, we estimate this by averaging over randomly sampled data. Specifically, we sample:

  • A prompt $q$ from the question distribution: $q \sim P(Q)$
  • For that prompt, a group of $G$ output sequences from the old policy: \(\{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O \vert q)\)

We then compute the objective for these samples and average the results.

The $\frac{1}{G}\sum_{i=1}^G$ term: This averages the objective over all $G$ sampled outputs in the group. Each question gets $G$ different rollouts, and we aggregate their contributions equally.

The $\frac{1}{\vert o_i \vert}$ term: This averages over all tokens in sequence $i$. The sequence $o_i$ has length $\vert o_i \vert$, so we sum over time steps $t=1$ to $\vert o_i \vert$ and normalize by sequence length.

The reward $R(o_i)$: The reward represents the quality of the generated sequence $o_i$ for the given question $q$. In the context of mathematical reasoning (as in DeepSeekMath), this is typically computed by a reward model $r_\phi$ that evaluates whether the generated solution is correct. The reward is assigned to the entire sequence (not individual tokens), and could be binary (correct/incorrect) or continuous (quality score). The reward drives the optimization by determining which sequences should be favored.

The advantage $\hat{A}_{i,t}$: This is calculated as the group-relative advantage. For token $t$ in sequence $i$:

\[\hat{A}_{i,t} = \tilde{r}_i = \frac{R(o_i) - \frac{1}{G}\sum_{j=1}^G R(o_j)}{\text{std}(R(o_1), \ldots, R(o_G))}\]

where $R(o_i)$ is the reward for sequence $i$. The advantage is computed by subtracting the mean reward across all $G$ sequences in the group and dividing by the group standard deviation to normalize the advantages. Note that the advantage is constant for all tokens in a sequence (it doesn’t depend on $t$).
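As a concrete illustration, here is a minimal NumPy sketch of the group-relative advantage. The small epsilon guarding against a zero standard deviation is my addition, not part of the paper’s formula.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Turn a group of sequence-level rewards into group-relative advantages.

    rewards: array of shape (G,) with one scalar reward per rollout.
    Returns one advantage per rollout; in GRPO this value is then broadcast
    to every token of that rollout.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 rollouts for one question with binary correctness rewards.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approximately [ 1., -1., -1.,  1.]
```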

The probability ratio:

\[\frac{\pi_\theta(o_{i,t} \vert q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \vert q, o_{i,<t})}\]
  • The numerator $\pi_\theta(o_{i,t} \vert q, o_{i,<t})$ is the probability of token $o_{i,t}$ under the current policy $\pi_\theta$ being optimized
  • The denominator $\pi_{\theta_{old}}(o_{i,t} \vert q, o_{i,<t})$ is the probability of the same token under the old policy $\pi_{\theta_{old}}$ from the previous iteration

Note that if $\pi_{\theta_{old}}$ is already 0.99, the probability ratio cannot reach 1.1, because a probability cannot exceed one (the ratio is capped at $1/0.99 \approx 1.01$).
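In implementations, this ratio is typically computed from stored per-token log-probabilities rather than from raw probabilities; a minimal PyTorch sketch (the specific numbers are only illustrative):

```python
import torch

# Log-probabilities of the sampled tokens o_{i,t}: logp_old is stored at rollout
# time, logp_new is recomputed under the current policy at each update.
logp_old = torch.tensor([-4.6052, -0.1054])   # probabilities ~0.010 and ~0.900
logp_new = torch.tensor([-4.4228, -0.0513])   # probabilities ~0.012 and ~0.950

# The ratio is computed in log space for numerical stability.
ratio = torch.exp(logp_new - logp_old)        # roughly [1.20, 1.056] per token
```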

The clipped objective:

\[\min \left[ \frac{\pi_\theta(o_{i,t} \vert q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \vert q, o_{i,<t})} \hat{A}_{i,t}, \text{clip} \left( \frac{\pi_\theta(o_{i,t} \vert q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \vert q, o_{i,<t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right]\]

Factoring out the advantage, with abridged notation:

\[\hat{A} \cdot \min \left[ \frac{\pi_\theta}{\pi_{\theta_{old}}}, \text{clip} \left( \frac{\pi_\theta}{\pi_{\theta_{old}}}, 1 - \epsilon, 1 + \epsilon \right) \right]\]
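A minimal Python sketch of this per-token term (the function name `clipped_term` is my own):

```python
def clipped_term(ratio, advantage, eps):
    """Per-token surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)
```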

This min operation implements the clipping behavior:

When advantage is positive ($\hat{A} > 0$): We will increase $\pi_\theta$ until the ratio reaches $1 + \epsilon$. In other words, we will increase $\pi_\theta$ until¹ it is $(1 + \epsilon) \times \pi_{\theta_{old}}$.

Example 1: $\pi_{\theta_{old}} = 0.010$, $\epsilon = 0.2$, $\hat{A} = 1.0$

| $\pi_{\theta_{old}}$ | $\pi_\theta$ | Ratio | Clipped ratio | Clipped objective |
|---|---|---|---|---|
| 0.010 | 0.008 | 0.800 | 0.800 | 0.800 |
| 0.010 | 0.010 | 1.000 | 1.000 | 1.000 |
| 0.010 | 0.012 | 1.200 | 1.200 | 1.200 |
| 0.010 | 0.100 | 10.000 | 1.200 | 1.200 |
| 0.010 | 0.500 | 50.000 | 1.200 | 1.200 |

Notice that any value of $\pi_\theta \geq 0.012$ has the same objective value (1.200) because they exceed the clip boundary. The clipped objective alone does not constrain how large the probability goes.

Example 2: $\pi_{\theta_{old}} = 0.900$, $\epsilon = 0.2$, $\hat{A} = 1.0$

| $\pi_{\theta_{old}}$ | $\pi_\theta$ | Ratio | Clipped ratio | Clipped objective |
|---|---|---|---|---|
| 0.900 | 0.800 | 0.889 | 0.889 | 0.889 |
| 0.900 | 0.900 | 1.000 | 1.000 | 1.000 |
| 0.900 | 0.950 | 1.056 | 1.056 | 1.056 |
| 0.900 | 1.000 | 1.111 | 1.111 | 1.111 |

The objective keeps increasing until $\pi_\theta$ reaches exactly 1.000 (probabilities cannot exceed one). At that point, the ratio is 1.111, which is still below the clip boundary of 1.2. Note that we are not able to reach $(1 + \epsilon) \times \pi_{\theta_{old}} = 1.08$. Also note that we should not try to push a neural network to a probability of exactly 1.0, as this would require driving the weights and hidden states to extreme values.

When advantage is negative ($\hat{A} < 0$): We will decrease $\pi_\theta$ until the ratio reaches $1 - \epsilon$. In other words, we will decrease $\pi_\theta$ until it is $(1 - \epsilon) \times \pi_{\theta_{old}}$.

Example 3: $\pi_{\theta_{old}} = 0.900$, $\epsilon = 0.2$, $\hat{A} = -1.0$

| $\pi_{\theta_{old}}$ | $\pi_\theta$ | Ratio | Clipped ratio | Clipped objective |
|---|---|---|---|---|
| 0.900 | 0.900 | 1.000 | 1.000 | -1.000 |
| 0.900 | 0.800 | 0.889 | 0.889 | -0.889 |
| 0.900 | 0.720 | 0.800 | 0.800 | -0.800 |
| 0.900 | 0.500 | 0.556 | 0.800 | -0.800 |
| 0.900 | 0.100 | 0.111 | 0.800 | -0.800 |

Notice that any value of $\pi_\theta \leq 0.720$ has the same objective value (-0.800) because they exceed the clip boundary. Outside the clipping window, the ratio term goes flat and the KL regularizer becomes the only force nudging the probabilities.
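The tables in Examples 1 and 3 can be reproduced with the `clipped_term` sketch above (values rounded to three decimals):

```python
eps = 0.2

# Example 1: positive advantage, small old probability.
for p_new in [0.008, 0.010, 0.012, 0.100, 0.500]:
    print(round(clipped_term(p_new / 0.010, 1.0, eps), 3))   # 0.8, 1.0, 1.2, 1.2, 1.2

# Example 3: negative advantage, large old probability.
for p_new in [0.900, 0.800, 0.720, 0.500, 0.100]:
    print(round(clipped_term(p_new / 0.900, -1.0, eps), 3))  # -1.0, -0.889, -0.8, -0.8, -0.8
```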

The KL term:

The KL divergence is subtracted from the objective as a regularization term:

\[\beta \mathbb{D}_{KL}[\pi_{\theta} \vert\vert \pi_{ref}]\]

This penalizes the new policy from diverging too far from the reference policy. The coefficient $\beta$ controls the strength of this constraint.

The GRPO paper estimates the KL divergence with the unbiased reverse-KL estimator (Schulman 2020):

\[\mathbb{D}_{KL}\left[\pi_{\theta} \vert\vert \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t}\vert q,o_{i,<t})}{\pi_{\theta}(o_{i,t}\vert q,o_{i,<t})}- \log\frac{\pi_{ref}(o_{i,t}\vert q,o_{i,<t})}{\pi_{\theta}(o_{i,t}\vert q,o_{i,<t})} - 1\]
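A minimal Python sketch of this estimator, evaluated at a single sampled token (the function name is my own):

```python
import math

def kl_estimate(p_theta, p_ref):
    """Per-token estimate of KL(pi_theta || pi_ref): r - log(r) - 1 with r = p_ref / p_theta."""
    r = p_ref / p_theta
    return r - math.log(r) - 1.0

print(kl_estimate(0.9, 0.9))   # 0.0: no penalty when the policies agree on this token
print(kl_estimate(0.9, 0.5))   # ~0.143: the penalty grows as pi_theta drifts from pi_ref
```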

Example 4: How the KL gradient changes with different reference probabilities at the clip boundary.

Given: $\hat{A} = 0.01$, $\epsilon = 0.20$, $\beta = 0.02$, $\pi_{\theta_{old}} = 0.500$, $\pi_\theta = 0.600 = (1 + \epsilon) \times \pi_{\theta_{old}}$

| $\pi_{ref}$ | $\pi_{\theta_{old}}$ | $\pi_\theta$ | $\frac{dJ}{d\pi_\theta}$ (clip term) | $\frac{dJ}{d\pi_\theta}$ (KL term) | Total gradient |
|---|---|---|---|---|---|
| 0.500 | 0.500 | 0.600$^-$ | +0.020 | -0.0056 | +0.0144 |
| 0.100 | 0.500 | 0.600$^-$ | +0.020 | -0.0278 | -0.0078 |
| 1e-50 | 0.500 | 0.600$^-$ | +0.020 | -0.0333 | -0.0133 |

For this worked example, you can see that the overall gradient at the clip boundary could be positive or negative depending on $\pi_{ref}$. This means that $\pi_{ref}$ can influence whether we push $\pi_\theta$ all the way to the clip boundary or stop somewhere short of it.

However, this rarely happens. The advantage has to be as small as 0.01 for the KL gradient to slightly overcome the clip gradient, and recall that the advantages are normalized so that their standard deviation within a group is one, so such tiny advantages are uncommon.
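These numbers can be checked by differentiating the two terms with respect to $\pi_\theta$ by hand; a small Python sketch of my own working, assuming the boundary is approached from below so that the unclipped ratio branch of the min is the active one:

```python
# Checking Example 4: per-token gradients with respect to pi_theta at the clip boundary.
A_hat, eps, beta = 0.01, 0.20, 0.02
p_old = 0.500
p_theta = (1 + eps) * p_old          # 0.600, the clip boundary (approached from below)

def grad_surrogate(p_old, A_hat):
    # d/dp_theta of (p_theta / p_old) * A_hat, while the ratio is still unclipped
    return A_hat / p_old

def grad_kl_penalty(p_theta, p_ref, beta):
    # d/dp_theta of -beta * (p_ref/p_theta - log(p_ref/p_theta) - 1)
    return -beta * (1.0 / p_theta) * (1.0 - p_ref / p_theta)

for p_ref in [0.500, 0.100, 1e-50]:
    g_clip = grad_surrogate(p_old, A_hat)
    g_kl = grad_kl_penalty(p_theta, p_ref, beta)
    print(f"{p_ref:g}  {g_clip:+.4f}  {g_kl:+.4f}  {g_clip + g_kl:+.4f}")
# 0.5    +0.0200  -0.0056  +0.0144
# 0.1    +0.0200  -0.0278  -0.0078
# 1e-50  +0.0200  -0.0333  -0.0133
```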

If we assume $\pi_\theta = \pi_{\theta_{old}}$, the gradient coefficient on each token is (equation 21 in the GRPO paper)

\[\hat{A}_{i,t} + \beta \left(\frac{\pi_{ref}(o_{i,t} \vert q, o_{i,<t})} {\pi_{\theta}(o_{i,t} \vert q, o_{i,<t})} - 1\right)\]

The gradient is zero when

\[\hat{A}_{i,t} = \beta \left(1 - \frac{\pi_{ref}(o_{i,t} \vert q, o_{i,<t})} {\pi_{\theta}(o_{i,t} \vert q, o_{i,<t})} \right)\]

If the advantage is positive and larger than $\beta$, there are no values of $\pi_{ref}$ and $\pi_{\theta}$ where the gradient is zero, because you would need the ratio of probabilities to be negative. For example, with $\beta = 0.04$ and $\hat{A}_{i,t} = 0.05$, a zero gradient would require $\pi_{ref}/\pi_{\theta} = 1 - \hat{A}_{i,t}/\beta = -0.25$, which no pair of probabilities can satisfy. This means that the KL term does not impose a constraint if the advantage is positive and larger than $\beta$, and we will always want to increase the probability to $(1 + \epsilon) \times \pi_{\theta_{old}}$.

How does the training happen

The GRPO paper provides the following pseudocode for iterative training:

Input: Initial policy model $\pi_{\theta_{\text{init}}}$, Reward models $r_\phi$, Task prompts $\mathcal{D}$, Hyperparameters $\epsilon$, $\beta$, $\mu$

Algorithm:

  • Policy model $\pi_\theta \leftarrow \pi_{\theta_{\text{init}}}$
  • For iteration $= 1, \ldots, I$
    • Reference model $\pi_{ref} \leftarrow \pi_{\theta}$
    • For step $= 1, \ldots, M$
      • Sample a batch $\mathcal{D}_b$ from $\mathcal{D}$
      • Update the old policy model $\pi_{\theta_{old}} \leftarrow \pi_{\theta}$
      • Sample $G$ outputs $\lbrace o_i \rbrace_{i=1}^G \sim \pi_{\theta_{old}} (\cdot \mid q) $ for each question $q \in \mathcal{D}_b$
      • Compute rewards $\lbrace r_i \rbrace_{i=1}^{G}$ for each sampled output $o_i$ by running $r_{\phi}$
      • Compute $\hat{A}_{i,t}$ for the $t$-th token of $o_i$ through group relative advantage estimation
      • For GRPO iteration $= 1, \ldots, \mu$
        • Update the policy model $\pi_{\theta}$ by maximizing the GRPO objective.
          • The gradient coefficient is equation 21 in the GRPO paper.
    • Update $r_\phi$ through continuous training using a replay mechanism

Output: $\pi_\theta$
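The inner update step (“maximizing the GRPO objective”) boils down to computing something like the following. This is only my PyTorch sketch of the objective for one question; all names and defaults are placeholders, and the dummy tensors exist only to make it runnable:

```python
import torch

def grpo_objective(logp_new, logp_old, logp_ref, advantages, mask, eps=0.2, beta=0.04):
    """GRPO objective for one question, following the formula above.

    logp_new, logp_old, logp_ref: (G, T) log-probs of the sampled tokens under the
        current, old, and reference policies.
    advantages: (G,) one group-relative advantage per rollout, shared by its tokens.
    mask: (G, T) 1.0 for real tokens, 0.0 for padding past |o_i|.
    Returns a scalar to be maximized (negate it to use as a loss).
    """
    adv = advantages.unsqueeze(1)                      # broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)

    log_r = logp_ref - logp_new                        # k3-style KL estimate per token
    kl = torch.exp(log_r) - log_r - 1.0

    per_token = surrogate - beta * kl
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1)   # (1/|o_i|) * sum over t
    return per_seq.mean()                                       # (1/G) * sum over i

# Dummy inputs with the right shapes; real values come from rollouts of the old policy.
G, T = 4, 16
logp_old = -torch.rand(G, T) * 5.0
logp_new = logp_old.clone().requires_grad_(True)
logp_ref = logp_old.clone()
advantages = torch.tensor([1.0, -1.0, 1.0, -1.0])
mask = torch.ones(G, T)
grpo_objective(logp_new, logp_old, logp_ref, advantages, mask).backward()
```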

How fast versus how much

As you can see in the pseudocode, you do $\mu$ iterations of GRPO per batch. Each iteration involves computing the gradient (equation 21) and making small updates to the weights in the direction of the gradient. After some iterations, for the tokens whose ratio has reached the clip boundary, the $\hat{A}$ term no longer contributes to the gradient.

Traditionally in machine learning, we usually think of weight or importance as a single knob to define and enforce. For example, you can oversample or undersample a certain subset of data so that your model achieves better performance: you define the weight of that data in the dataset so that the model learns from it more strongly. For each subsystem, there is one scaling knob.

However, there are two different scaling knobs in GRPO - how fast to increase and how much to increase. How fast to increase affects the gradient calculation. How much to increase affects when you stop making the small updates to the weights.

These terms in the objective function affect how fast to increase:

  • G term: The $\frac{1}{G}\sum_{i=1}^G$ - the gradient coefficient is inversely proportional to the number of rollouts.
  • O term: The $\frac{1}{\vert o_i \vert}$ - the gradient coefficient is inversely proportional to the length of the rollout.
  • Size of advantage $\hat{A}$: The gradient coefficient is proportional to the size of the advantage.
  • Reference model $\pi_{ref}$: The KL term usually nudges the gradient only slightly when $\vert \hat{A} \vert \gg \beta$, but Example 4 shows it can flip the sign when the advantage is tiny compared with the KL weight.

These terms in the objective function affect how much to increase:

  • Advantage sign $\text{sign}(\hat{A})$: Whether the advantage is positive (increase probability) or negative (decrease probability)
  • Clip coefficient $\epsilon$: This sets how far we keep rewarding an increase or decrease in probability before the objective goes flat
  • Reference model $\pi_{ref}$: When the KL weight is comparable to the advantage, it can steer whether the ideal probability stops short of the clip boundaries.

If we have only one GRPO iteration ($\mu = 1$), then how much to increase does not matter. You do one round of backpropagation. The clip function is irrelevant.

If we have many GRPO iterations ($\mu \to \infty$), how fast we increase does not really matter² - because all the probabilities will¹ reach their target values anyway (see footnotes for caveats).

According to the DeepSeekMath paper’s implementation section:

The policy model only has a single update following each exploration stage.

This indicates that $\mu = 1$ in their actual training runs. Hence, I am not sure why they include the clip term in their GRPO formulation.

If $\mu$ is more than 1, there are some implementation details that are open to interpretation.

  • Is there a stopping condition? Do you run all $\mu$ iterations regardless of whether you can increase the objective function further?
  • What is the stopping condition? Is it something like all probabilities having reached their clip values? What happens for tokens that can never reach the clip value (probabilities cannot exceed one)? Or do you instead compute something like the objective function value and stop when it no longer increases?
  • Before stopping, do you stop computing gradients for some tokens? There are tokens that have reached the clip value and there are tokens that have not reached the clip value. Do you compute loss only for the tokens that have not reached the clip value?

My views on GRPO

The idea of GRPO is inspiring. You should not merely copy what your idol has done. You should not just look at a pair of actions and train yourself to do more of the better action. You want to always improve relative to yourself, with actual actions from yourself. You analyze a group of actions from yourself and determine which are better and which are worse. You train yourself to produce more of the better actions. As you improve, you will produce better actions on average. You will then need to improve from that new baseline. You will always keep improving. The desire to improve is written in code.

I think we should only be concerned with how much to increase.

It is much easier to control how much to increase. You can choose exactly how much to increase, and you can check whether the probabilities of the sequence have indeed increased. When you roll out from the same prompt again, you can expect to get sequences similar to the preferred sequence.

I do not think the G-term or the O-term should affect how fast to increase or how much to increase. I do not think we should give a smaller weight to an individual sequence just because it is part of a larger group or because it is longer.

I will drop the KL term. In GRPO’s formulation, the KL term barely affects how fast to increase and how much to increase. I used to have some magical thinking that including the KL term would force the model to continue speaking “English” and not devolve into machine language that somehow gets the correct answer to a math question. I should have done the math earlier and looked at the data.

For the advantage term - I will normalize it to one, zero, or negative one. If a sequence is better than the average rollout, it is good. If a sequence is worse than the average rollout, it is bad. There are understandably some sequences that are on the borderline; I will give them an advantage of zero.

For the clipping - I think I will manually specify the target probability that I want to train the model towards. I will do something like having the probability of the good tokens take a square root and the probability of the bad tokens be squared. I might also need to scale the probabilities back down so that they do not sum to more than one³.
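To make the square-root/square idea concrete, here is a minimal sketch (the function name is a placeholder, and the rescaling step itself is left open):

```python
import math

def token_target(p, advantage_sign):
    """Proposed per-token target probability, before any rescaling:
    good tokens move toward sqrt(p), bad tokens toward p**2, neutral tokens stay put."""
    if advantage_sign > 0:
        return math.sqrt(p)
    if advantage_sign < 0:
        return p ** 2
    return p

# Footnote 3's situation: three preferred tokens and one dispreferred token, each at 1/4.
targets = [token_target(0.25, +1)] * 3 + [token_target(0.25, -1)]
print(targets, sum(targets))   # [0.5, 0.5, 0.5, 0.0625] sums to 1.5625, so rescaling is needed
```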

The advantage is still equally applied to all tokens in the sequence. This is like learning all the moves - including the mini-blunders - from a chess match that you have won. We could assign credit only to the tokens that are better than average, but then we would need some way to calculate which tokens those are.

Ultimately, improving an LLM is all about generating more of the good tokens and generating less of the bad tokens, relative to the current average. You should look at your data to understand what the good and bad tokens are so you understand what you are training on and whether you have really trained on them.


Footnotes

  1. If your target probability is more than one, you can never reach your target probability.

  2. If the gradient coefficient is large (from the terms in “how fast to increase”), you could overshoot the clip value. 

  3. Consider three preferred tokens each with a probability of 1/4 and a dispreferred token with a probability of also 1/4. Taking a square root for each of the preferred tokens will give you a total probability of 3/2.