
Binary Rewards and Reinforcement Learning: Fundamental Challenges
Authors
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for improving reasoning in language models, yet models trained with RLVR often suffer from diversity collapse: while single-sample accuracy improves, multi-sample coverage degrades, sometimes falling below that of the base model. We provide a structural account of this phenomenon grounded in the properties of binary rewards.
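As a concrete illustration of this accuracy/coverage tension, the following minimal sketch (hypothetical per-problem success rates, not data from the paper; we read "multi-sample coverage" as pass@$k$) shows how a model can raise single-sample accuracy while its coverage falls below the base model's, simply by losing all probability mass on a problem the base model could solve:

\begin{verbatim}
# Hypothetical per-problem success rates illustrating diversity collapse.
# For a problem solved with per-sample rate p, coverage with k independent
# samples is pass@k = 1 - (1 - p)^k.
base = {"A": 0.2, "B": 0.2}   # base model solves both problems occasionally
rlvr = {"A": 0.9, "B": 0.0}   # RLVR model: better on A, collapsed on B

def pass_at_k(p: float, k: int) -> float:
    """P(at least one success among k independent samples at rate p)."""
    return 1.0 - (1.0 - p) ** k

for name, rates in [("base", base), ("rlvr", rlvr)]:
    pass1 = sum(rates.values()) / len(rates)
    pass16 = sum(pass_at_k(p, 16) for p in rates.values()) / len(rates)
    print(f"{name}: pass@1={pass1:.2f}  pass@16={pass16:.2f}")
# base: pass@1=0.20  pass@16=0.97
# rlvr: pass@1=0.45  pass@16=0.50
\end{verbatim}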
Binary rewards create a fundamental degeneracy for policy gradient methods: the set of distributions maximizing expected reward is infinite, with no distinguished element. KL control resolves this degeneracy by selecting, in the limit $β \to 0$, the filtered model $p_* := a(\cdot \mid \mathcal{Y}_1)$ -- the base model conditioned on validity -- which is the unique fully valid distribution closest to the base model in KL divergence. This selection operates through a nontrivial asymmetry: the tilted distribution $p_{[β]} \propto a(y)\,e^{v(y)/β}$ converges to $p_*$ in forward KL as $β \to 0$, yet $p_*$ cannot serve as a direct optimization target because $\mathrm{KL}(q \,\|\, p_*)$ is infinite for any full-support policy $q$. We develop explicit formulas relating the hyperparameter $β$ to the more interpretable target validity rate $μ$.
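For a binary reward $v \in \{0,1\}$, a formula of the kind mentioned above can be sketched directly (a sketch under our assumption that $ρ := a(\mathcal{Y}_1)$ denotes the base model's validity rate). Summing the tilted weights over valid and invalid outputs gives
\[
μ(β) = \frac{ρ\, e^{1/β}}{ρ\, e^{1/β} + 1 - ρ},
\qquad\text{equivalently}\qquad
β = \left[\log \frac{μ\,(1-ρ)}{(1-μ)\,ρ}\right]^{-1}.
\]
The numeric sketch below (a hypothetical toy, not the paper's experiment) illustrates both the convergence and the asymmetry on a small discrete space:

\begin{verbatim}
# Toy demonstration (assumed setup): the tilted distribution
# p_beta(y) ~ a(y) * exp(v(y)/beta) approaches the filtered model
# p_* = a(. | valid) in forward KL as beta -> 0, while the reverse
# direction KL(q || p_*) is infinite for any full-support q,
# because p_* puts zero mass on invalid outputs.
import numpy as np

rng = np.random.default_rng(0)
n = 10
a = rng.dirichlet(np.ones(n))           # base model over n outputs
v = (np.arange(n) < 3).astype(float)    # binary validity: first 3 valid

p_star = a * v / (a @ v)                # filtered model a(. | valid)

def kl(p, q):
    """KL(p || q); infinite if p has mass where q has none."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

for beta in [1.0, 0.3, 0.1, 0.03]:
    w = a * np.exp(v / beta)
    p_beta = w / w.sum()
    print(f"beta={beta:5.2f}  mu={p_beta @ v:.4f}  "
          f"KL(p*||p_beta)={kl(p_star, p_beta):.3e}  "
          f"KL(p_beta||p*)={kl(p_beta, p_star)}")
\end{verbatim}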
Under model misspecification -- the typical practical regime -- the pressure to decrease $β$ drives the optimizer toward highly concentrated distributions over a small number of valid outputs, collapsing onto ever fewer of them as $β$ decreases, rather than converging to the filtered model. We illustrate this mechanism in a toy autoregressive experiment and discuss how alternative divergences that target $p_*$ directly -- as pursued empirically by \citet{kruszewski_whatever_2026} -- avoid this failure mode by rewarding coverage of $p_*$'s support rather than concentration on high-validity outputs.
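A minimal analogue of this collapse (an assumed toy, not the paper's autoregressive experiment) can be reproduced in a few lines. Outputs are two-bit strings, the valid set is $\{00, 11\}$, and the policy family is a product of independent per-token Bernoullis, which cannot represent $p_*$ (uniform over $\{00, 11\}$) because that target requires correlated tokens. Maximizing $\mathbb{E}_q[v] - β\,\mathrm{KL}(q \,\|\, a)$ over this misspecified family then concentrates on a single valid string as $β$ decreases, exactly the mechanism described above:

\begin{verbatim}
# Misspecification-driven collapse in a 2-token toy (assumed setup).
import numpy as np

strings = [(0, 0), (0, 1), (1, 0), (1, 1)]
valid = np.array([1.0, 0.0, 0.0, 1.0])  # v(y) = 1 iff y in {00, 11}
a = np.full(4, 0.25)                    # uniform base model

def product_policy(p1, p2):
    """q(y) for independent tokens with P(token_i = 1) = p_i."""
    return np.array([(p1 if y1 else 1 - p1) * (p2 if y2 else 1 - p2)
                     for y1, y2 in strings])

def objective(q, beta, eps=1e-12):
    kl = np.sum(q * np.log((q + eps) / a))   # KL(q || a)
    return q @ valid - beta * kl

grid = np.linspace(0.01, 0.99, 99)
for beta in [1.0, 0.3, 0.1, 0.03]:
    p1, p2 = max(((x, y) for x in grid for y in grid),
                 key=lambda ps: objective(product_policy(*ps), beta))
    q = product_policy(p1, p2)
    print(f"beta={beta:5.2f}  q(00)={q[0]:.3f}  q(11)={q[3]:.3f}  "
          f"validity={q @ valid:.3f}")
# As beta shrinks, nearly all mass lands on one valid string,
# instead of splitting 0.5/0.5 between 00 and 11 as p_* does.
\end{verbatim}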