Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

2508.04660

Authors

Noah Ziems, Dilara Soylu, Liheng Lai, Chen Qian, Kaiqiang Song

Abstract

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix together multiple LM calls with distinct prompt templates and other tools, and it is not clear how best to leverage GRPO to improve these systems.

We begin to address this challenge by defining mmGRPO, a simple multi-module generalization of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks against the post-trained LM, and by 5% against prompt optimization on its own.
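To make the module-level grouping concrete, the following is a minimal sketch, not the paper's or DSPy's implementation. It assumes each rollout carries a single scalar reward that is credited to every LM call in that rollout, and that advantages are standardized within each module's group as in standard GRPO. The rollout layout and the names `group_calls_by_module` and `group_relative_advantages` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_calls_by_module(rollouts):
    """Collect LM calls from all rollouts into one group per module.

    Assumed (hypothetical) rollout layout:
        {"reward": float, "calls": [(module_name, prompt, completion), ...]}
    Rollouts may differ in length, and a module may be missing from a
    rollout that was interrupted before reaching it.
    """
    groups = defaultdict(list)
    for rollout in rollouts:
        for module_name, prompt, completion in rollout["calls"]:
            groups[module_name].append(
                {
                    "prompt": prompt,
                    "completion": completion,
                    # Assumption: the rollout-level reward is shared by each call.
                    "reward": rollout["reward"],
                }
            )
    return dict(groups)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two rollouts of a hypothetical two-module program. The second rollout is
# interrupted before the "answer" module runs, so it has fewer calls.
rollouts = [
    {
        "reward": 1.0,
        "calls": [
            ("generate_query", "hop 1", "query A"),
            ("generate_query", "hop 2", "query B"),
            ("answer", "context", "final answer"),
        ],
    },
    {
        "reward": 0.0,
        "calls": [("generate_query", "hop 1", "query C")],
    },
]

groups = group_calls_by_module(rollouts)
advantages = {
    name: group_relative_advantages([call["reward"] for call in calls])
    for name, calls in groups.items()
}
```

In this sketch the `generate_query` group holds three calls drawn from both rollouts, while the `answer` group holds only the one call that was actually made, so the interrupted rollout still contributes training signal to the modules it reached.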

We open-source mmGRPO in DSPy as the dspy.GRPO optimizer.
