Towards Understanding Specification Gaming in Reasoning Models

May 4, 2026

arXiv:2605.02269

Authors

Kei Nishimura-Gasparian, Robert McCarthy, David Lindner

Abstract

Specification gaming is a critical failure mode of LLM agents. Despite this, there has been little systematic research into when it arises and what drives it. To address this, we build and open source a diverse suite of tasks where models can score highly by taking unintended actions. We find that all tested models exploit their specifications at non-negligible rates in most of our eight settings, including five non-coding settings.

We see the highest rates of specification gaming in Grok 4 and the lowest rates in Claude models. We use our evaluation suite to study what drives specification gaming, and find that: (1) RL reasoning training substantially increases the rate at which models exploit their specifications, (2) increasing RL reasoning budget has a weakly positive effect on exploit rate, and (3) test-time mitigations reduce but do not eliminate specification gaming. Our results suggest that specification gaming is a fundamental challenge arising from RL reasoning training; we release our evaluation suite to support further work on this problem.
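To make the "exploit rate" metric concrete, here is a minimal sketch of how per-model, per-setting exploit rates could be aggregated from evaluation transcripts. The transcript fields, the model and setting names, and the judge criterion (flagging runs whose score exceeds the intended maximum) are all illustrative assumptions, not the paper's actual harness or API.

```python
from collections import defaultdict

def exploit_rate(transcripts, is_exploit):
    """Fraction of transcripts flagged as specification gaming,
    grouped by (model, setting)."""
    counts = defaultdict(lambda: [0, 0])  # (model, setting) -> [exploits, total]
    for t in transcripts:
        key = (t["model"], t["setting"])
        counts[key][1] += 1
        if is_exploit(t):
            counts[key][0] += 1
    return {k: exploits / total for k, (exploits, total) in counts.items()}

# Toy data with a hypothetical judge: flag any run that scored above the
# intended maximum, a simple proxy for "scored highly via unintended actions".
transcripts = [
    {"model": "model-a", "setting": "coding", "score": 1.0, "intended_max": 1.0},
    {"model": "model-a", "setting": "coding", "score": 2.0, "intended_max": 1.0},
    {"model": "model-b", "setting": "chess",  "score": 0.5, "intended_max": 1.0},
]
rates = exploit_rate(transcripts, lambda t: t["score"] > t["intended_max"])
# rates[("model-a", "coding")] -> 0.5; rates[("model-b", "chess")] -> 0.0
```

In practice the judge would be far more involved (e.g. an LLM grader inspecting the transcript), but the aggregation into per-setting rates follows this shape.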
