
Deep dive into Group Relative Policy Optimization (GRPO)
A new way to do RLHF for LLMs
Shreyas Subramanian
Amazon Employee
Published Jan 13, 2025
Reinforcement Learning (RL) has become a cornerstone in fine-tuning Large Language Models (LLMs) to align with human preferences. Among the RL algorithms, Proximal Policy Optimization or PPO has been widely adopted due to its stability and efficiency. However, as models grow larger and tasks become more complex, PPO's limitations—such as memory overhead and computational cost—have prompted the development of more advanced methods like Group Relative Policy Optimization (GRPO).
In this blog, we’ll cover the following:
- a refresher on PPO,
- how PPO fits into the LLM training pipeline, and
- a deep dive into GRPO.
Below is a simplified overview of where RL (PPO/GRPO) fits into the LLM training pipeline:
1. Supervised Fine-Tuning (SFT)
In the first stage of the process, the model is fine-tuned on a high-quality dataset of human-written demonstrations. Usually this involves a team of human labelers who write demonstration responses for a set of prompts (mostly prompts collected from real usage of existing models, plus some prompts written by the labelers themselves). This initial fine-tuning produces what's called the SFT model (π_SFT), which serves as the baseline for further optimization.
2. Reward Modeling
The second stage involves creating a "reward" model that can assess the quality of model outputs. A dataset of pairwise comparisons is collected, where human labelers indicate their preference between different model outputs for the same prompt. This comparison data is used to train a reward model (R) that predicts which outputs humans would prefer. This is a crucial step: it creates an automated way to evaluate model outputs according to human preferences, essentially converting human judgments into a scalar reward signal that can be used for reinforcement learning.
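For reference, such a reward model is usually trained with a pairwise, Bradley-Terry style loss over the preferred/rejected comparisons; a typical form (details vary by implementation) is:

$$ \mathcal{L}_{RM} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\Big[ \log \sigma\big( R(x, y_w) - R(x, y_l) \big) \Big] $$

where $y_w$ is the output the labeler preferred over $y_l$, and $\sigma$ is the sigmoid function.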
3. RL Optimization (PPO/GRPO)
The final stage uses reinforcement learning to optimize the model's policy (π_θ), classically with the Proximal Policy Optimization (PPO) algorithm. The process works in three steps (the overall objective is written out just after this list):
- The model generates responses to prompts
- These responses are evaluated by the reward model to compute rewards
- The policy is then optimized to maximize these rewards while staying close to the original SFT model's behavior
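Putting the three steps together, the standard RLHF objective being maximized (omitting optional extras such as a pretraining-loss mix-in) is:

$$ \max_\theta \; \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot \mid x)} \Big[ R(x, y) - \beta \, \log \frac{\pi_\theta(y \mid x)}{\pi_{SFT}(y \mid x)} \Big] $$

where $R$ is the reward model from stage 2 and the $\beta$-weighted KL term keeps the policy close to the SFT model's behavior.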
This process can be iterated over time, with new comparison data being collected from the latest policy to train updated reward models, which are then used to train new policies. The researchers behind the original InstructGPT work note that while most of their comparison data came from their supervised policies, some also came from their PPO policies, suggesting an iterative refinement process.
This "alignment process" teaches the model to follow written instructions according to the preferences of the labelers in the project. Alignment can serve other purposes as well, such as producing more factually correct or less toxic outputs.
Let's get clear on what a "policy" is first, as this could mean multiple things depending on where you come from :D
The "policy model" refers to the language model itself, which acts as the agent deciding which actions to take (i.e., which tokens to generate) for a given prompt. It essentially represents the strategy for generating text that aligns with human preferences, as guided by a reward signal derived from human feedback. The action space of this policy is all the tokens in the model's vocabulary, and the observation space is all possible input token sequences.
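To make this concrete, here is a minimal sketch (using GPT-2 via Hugging Face transformers purely as a stand-in policy) showing that the policy's "action distribution" is simply the next-token distribution over the vocabulary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 as a stand-in policy model for illustration.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = policy(**inputs).logits        # shape: [1, seq_len, vocab_size]

# pi_theta(a | s): a probability over every token in the vocabulary (the action
# space), conditioned on the token sequence observed so far (the state).
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
print(next_token_probs.shape)               # torch.Size([50257]) for GPT-2
```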
Now, PPO is a policy gradient method that optimizes a policy π_θ by maximizing a surrogate objective function (here, built from the learned reward model that approximates human preference). The key idea is to constrain policy updates to prevent large deviations from the previous policy, ensuring stable training.
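For reference, the standard PPO clipped surrogate objective (Schulman et al., 2017) that sits underneath most RLHF implementations is:

$$ J_{PPO}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, \hat{A}_t \big) \Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} $$

where $\hat{A}_t$ is the advantage (estimated with a separate, learned value function) and $\varepsilon$ controls how far the policy is allowed to move in a single update.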
GRPO is an RL algorithm designed to address PPO's limitations. It eliminates the need for a value function model by using **group-based advantage estimation** and integrates **KL divergence** directly into the loss function for improved stability. Key innovations in GRPO include:
**Group-based advantage estimation:** GRPO generates multiple responses for each prompt and uses the mean reward of the group as the baseline:
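Concretely, if $G$ responses $\{o_1, \ldots, o_G\}$ sampled for the same prompt receive rewards $\{r_1, \ldots, r_G\}$ from the reward model, each response's advantage is its reward normalized within the group:

$$ \hat{A}_i = \frac{r_i - \mathrm{mean}\big(\{r_1, \ldots, r_G\}\big)}{\mathrm{std}\big(\{r_1, \ldots, r_G\}\big)} $$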

As the value function employed in PPO is typically another model of comparable size to the policy model, it brings a substantial memory and computational burden. During RL training, that value function is treated as the baseline in the advantage calculation for variance reduction. GRPO's group-based estimate replaces it, which simplifies advantage estimation and reduces memory usage.
**KL divergence in the loss:** GRPO directly incorporates the KL divergence between the current policy and a reference policy (often the SFT model) into the loss function.
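Rather than folding a KL penalty into the reward signal, GRPO adds the KL term to the loss itself, typically estimated per token with the following unbiased estimator (the form used in DeepSeek's GRPO work):

$$ \mathbb{D}_{KL}\big[\pi_\theta \,\|\, \pi_{ref}\big] = \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1 $$

which is guaranteed to be non-negative.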

The group-relative way GRPO calculates advantages aligns well with the comparative nature of reward models, since reward models are typically trained on datasets of comparisons between different outputs for the same question.
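Below is a minimal, self-contained PyTorch sketch of the core GRPO computation for a single prompt: group-relative advantages, a PPO-style clipped ratio, and the per-token KL penalty. The tensor shapes, ε, and β values are illustrative assumptions, and the averaging over tokens is simplified relative to the paper's exact length-normalized formulation.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """Sketch of a GRPO-style loss for one prompt.

    logp_new, logp_old, logp_ref: [G, T] per-token log-probs of the G sampled
        responses under the current, old, and reference (SFT) policies.
    rewards: [G] scalar reward per response from the reward model.
    """
    # Group-relative advantage: normalize rewards within the group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # [G]
    advantages = advantages.unsqueeze(-1)                              # [G, 1], broadcast over tokens

    # Clipped surrogate, as in PPO, but using the group-based advantage.
    ratio = torch.exp(logp_new - logp_old)                             # [G, T]
    surrogate = torch.minimum(ratio * advantages,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)

    # Per-token KL penalty against the reference policy,
    # using the unbiased estimator shown above.
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0                # [G, T]

    # Maximize the surrogate, penalize KL -> minimize the negative.
    return -(surrogate - beta * kl).mean()

# Toy usage with random numbers: a group of 4 responses, 6 tokens each.
G, T = 4, 6
logp_new = torch.randn(G, T, requires_grad=True)
loss = grpo_loss(logp_new, logp_new.detach(), torch.randn(G, T), torch.randn(G))
loss.backward()
print(loss.item())
```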
PPO is a powerful RL algorithm used in LLM training, but it has limitations like memory overhead and training instability. GRPO addresses these limitations in several ways:
- Efficiency: By eliminating the value function, GRPO reduces memory and computational costs.
- Stability: The group-based advantage estimation and KL divergence integration make training more stable.
- Scalability: GRPO is better suited for large-scale models like DeepSeek-V2 and V3, where resource efficiency is critical.
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.