
Deep dive into Group Relative Policy Optimization (GRPO)
A new way to do RLHF for LLMs
In this post, we will cover:
- a refresher on PPO,
- how RLHF fits into the LLM training pipeline,
- a deep dive into GRPO
```
┌───────────────┐
│ High-Quality  │
│ Human-Labeled │
│     Data      │
└───────────────┘
        │
        ▼
┌───────────────┐
│ Fine-Tuned LLM│
│    (π_SFT)    │
└───────────────┘
```
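The first stage of the pipeline, shown above, fine-tunes the base model on high-quality human-labeled demonstrations. In code this is plain supervised fine-tuning with a next-token cross-entropy loss; here is a minimal sketch, using a toy stand-in model and placeholder data rather than any real LLM or dataset:

```python
import torch
import torch.nn.functional as F

# Toy stand-in for a causal LM: embeddings -> linear head over the vocabulary.
vocab_size, hidden = 100, 32
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, hidden),
    torch.nn.Linear(hidden, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# A batch of human-labeled token sequences (placeholder data).
batch = torch.randint(0, vocab_size, (4, 16))

# Next-token prediction: logits at position t are trained to predict token t+1.
logits = model(batch[:, :-1])                  # (B, T-1, vocab)
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),            # flatten for cross-entropy
    batch[:, 1:].reshape(-1),                  # shifted targets
)
loss.backward()
optimizer.step()
```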
```
┌───────────────┐
│   Pairwise    │
│  Comparison   │
│     Data      │
└───────────────┘
        │
        ▼
┌───────────────┐
│ Reward Model  │
│      (R)      │
└───────────────┘
```
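The second stage trains the reward model R on pairwise comparison data: annotators pick the better of two responses to the same prompt, and the model learns to give the preferred response a higher scalar score. A common choice is a Bradley-Terry style loss; a minimal sketch, with placeholder scores standing in for the outputs of a scalar-head reward model:

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the scalar reward of the preferred
    response above that of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Placeholder scores, as if produced by R(prompt, response) for three pairs.
reward_chosen = torch.tensor([1.2, 0.3, 0.8])
reward_rejected = torch.tensor([0.4, 0.5, -0.1])
print(pairwise_reward_loss(reward_chosen, reward_rejected))
```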
```
┌───────────────┐
│   Generate    │
│   Responses   │
└───────────────┘
        │
        ▼
┌───────────────┐
│    Compute    │
│    Rewards    │
└───────────────┘
        │
        ▼
┌───────────────┐
│   Optimize    │
│ Policy (π_θ)  │
└───────────────┘
```
- The model generates responses to prompts
- These responses are evaluated by the reward model to compute rewards
- The policy is then optimized to maximize these rewards while staying close to the original SFT model's behavior, as sketched in the loop below
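This loop is schematic only: every helper in it (generate_responses, compute_rewards, update_policy) is a hypothetical stub standing in for the real generation, reward-model scoring, and PPO/GRPO update code, not an actual library API.

```python
# Schematic RLHF training loop; all helpers below are hypothetical stubs.

def generate_responses(policy, prompts, group_size=4):
    """Stub: sample completions from the current policy pi_theta."""
    return [[f"response {i}" for i in range(group_size)] for _ in prompts]

def compute_rewards(reward_model, prompts, responses):
    """Stub: score each (prompt, response) pair with the reward model R."""
    return [[0.0 for _ in group] for group in responses]

def update_policy(policy, responses, rewards, sft_policy):
    """Stub: one PPO/GRPO step that raises the likelihood of high-reward
    responses while penalizing divergence from the frozen SFT policy."""
    return policy

policy, sft_policy, reward_model = "pi_theta", "pi_sft", "R"   # placeholders
prompts = ["Explain GRPO in one sentence."]

for step in range(3):
    responses = generate_responses(policy, prompts)                  # 1. generate
    rewards = compute_rewards(reward_model, prompts, responses)      # 2. score
    policy = update_policy(policy, responses, rewards, sft_policy)   # 3. optimize
```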
The "policy model" refers tothe language model itself, which acts as the agent that decides what actions to take (i.e., which tokens to generate) based on a given prompt, essentially representing the strategy for generating text that aligns with human preferences as guided by a reward signal from human feedback. The action space of this policy is all the tokens in the model's vocabulary and the observation space is all possible input token sequences.
As the value function employed in PPO is typically another model of comparable size to the policy model, it brings a substantial memory and computational burden. Additionally, during RL training, the value function is treated as a baseline in the calculation of the advantage, for variance reduction. GRPO removes this value model entirely: for each prompt it samples a group of outputs and uses the group's reward statistics as the baseline. This approach simplifies advantage estimation and reduces memory usage.
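A minimal sketch of this group-relative baseline, assuming the common formulation in which each response's reward is normalized by the mean and standard deviation of its group:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Given scalar rewards of G responses sampled for the same prompt, use the
    group statistics as the baseline instead of a learned value model:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards for a group of G = 4 responses to one prompt (placeholder values).
rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])
print(group_relative_advantages(rewards))
# Every token of response i is then assigned the same advantage A_i.
```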
The group-relative way in which GRPO calculates advantages aligns well with the comparative nature of reward models, since reward models are typically trained on datasets of comparisons between outputs to the same question.
- Efficiency: By eliminating the value function, GRPO reduces memory and computational costs.
- Stability: The group-based advantage estimation and KL divergence integration make training more stable (a sketch of the combined objective follows this list).
- Scalability: GRPO is better suited for large-scale models like DeepSeek-V2 and V3, where resource efficiency is critical.
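Putting the pieces together, a GRPO-style update combines a PPO-like clipped surrogate, weighted by the group-relative advantages above, with a KL penalty toward the reference (SFT) policy. The sketch below is one way to write that loss from per-token log-probabilities; the clipping threshold, KL coefficient, and averaging scheme are illustrative assumptions, and exact details vary across implementations:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.04):
    """GRPO-style surrogate: clipped policy ratio weighted by the group-relative
    advantage, plus a KL penalty toward the reference policy.
    All inputs are per-token tensors of matching shape; hyperparameters are
    illustrative placeholders."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)

    # Low-variance, non-negative per-token estimate of KL(pi_theta || pi_ref).
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1

    # Maximize the surrogate and penalize KL -> minimize the negative.
    return -(surrogate - beta * kl).mean()
```

Here the advantages tensor holds each response's group-relative advantage broadcast across that response's tokens, and logp_ref comes from the frozen reference (SFT) model. Unlike many PPO-based RLHF setups, which fold the KL penalty into the per-token reward, this formulation adds it directly to the loss.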
Any opinions in this post are those of the individual author and may not reflect the opinions of AWS.