DS1 spectrogram: DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

2605.29568

Authors

Bibo Cai,Yufei Zhang,Kai Xiong,Zhouhao Sun,Bing Qin

Abstract

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction.

While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn.

In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn.

Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.

Resources

Stay in the loop

Every AI paper that matters, free in your inbox daily.

Details

  • takara.ai
  • Custom AI and machine learning from the Frontier Research Team.
  • © 2026 takara.ai Ltd
  • Content is sourced from third-party publications.