DS1 spectrogram: REINFORCE++: A Simple and Efficient Approach for Aligning Large Language
  Models

REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models

January 4, 20252501.03262

Authors

Jason Klein Liu,Haotian Xu,Wei Shen,Jian Hu

Abstract

Reinforcement Learning from Human Feedback~(RLHF) plays a crucial role in aligning Large Language Models~(LLMs). The dominant algorithm, Proximal Policy Optimization~(PPO), employs a critic network to estimate advantages, which introduces significant computational and memory overhead.

To address this, a family of critic-free algorithms (e.g., GRPO, RLOO) has emerged. However, these methods typically rely on prompt-level (local) advantage normalization, which suffers from inaccurate advantage estimation, a tendency to overfit, and, as we show, is a theoretically biased estimator. To solve these challenges, we introduce REINFORCE++, a critic-free framework centered on Global Advantage Normalization. By normalizing advantages across the entire global batch rather than small, prompt-specific groups, our method provides a more stable and theoretically sound, effectively unbiased estimate (whose bias vanishes as batch size increases). We introduce two variants: REINFORCE++, a highly efficient and general algorithm ($k \ge 1$) for general-domain RLHF, and REINFORCE++ /w baseline, a robust group-sampling variant ($k > 1$) for complex reasoning tasks.

Our empirical evaluation demonstrates that each variant shows superior stability and performance in its respective domain, outperforming existing methods and even PPO in complex agentic settings.

Resources

Stay in the loop

Every AI paper that matters, free in your inbox daily.

Details

  • © 2026 takara.ai Ltd
  • Content is sourced from third-party publications.