Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training

May 28, 2025 · arXiv:2505.22257

Authors

Youssef Mroueh, Mattia Rigotti, Jerret Ross, Nicolas Dupuis, Brian Belgodere

Abstract

We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling efficiency, and memory usage.

In addition, a recent analysis of GRPO suggests that estimating the advantage function with off-policy samples could be beneficial. Building on these observations, we adapt GRPO to the off-policy setting.

We show that both on-policy and off-policy GRPO objectives yield an improvement in the reward. This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO.

We then compare the empirical performance of both GRPO variants in post-training with reinforcement learning with verifiable rewards. Our results show that off-policy GRPO either significantly outperforms or performs on par with its on-policy counterpart.
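To make the terms in the abstract concrete, the following is a minimal sketch of a group-relative advantage and a PPO-style clipped surrogate, written in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the sequence-level log-probabilities, the clip value of 0.2, and the toy numbers are all hypothetical, and any regularization term the paper may use is omitted. The only difference between the on-policy and off-policy readings here is which policy produced the samples and therefore supplies `logp_behavior`.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_behavior, advantages, clip=0.2):
    """PPO-style clipped surrogate, averaged over the group.

    logp_behavior holds log-probabilities under the policy that generated
    the samples. Assumption for this sketch: in the off-policy case this is
    an older snapshot of the policy rather than the most recent one.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        total += min(ratio * adv, clipped * adv)  # pessimistic (lower) bound
    return total / len(advantages)


# Toy group of four responses with binary verifiable rewards (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
logp_behavior = [-1.0, -1.2, -0.8, -1.5]
logp_new = [-0.9, -1.3, -0.8, -1.0]
objective = clipped_surrogate(logp_new, logp_behavior, advantages)
```

In this sketch the clip caps how much a single sample can contribute once the current policy has drifted from the sampling policy, which is the situation an off-policy variant has to handle by construction.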
