Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
¹Beihang University, ²Peking University, ³Beijing University of Posts and Telecommunications
CCPO and SEPO turn a single collaborative outcome into role-specific learning signals for multi-agent LLM reasoning, improving dual-agent mathematical reasoning in several GRPO and GSPO settings.
Abstract
Collaborative multi-agent large language models can solve complex reasoning tasks by decomposing roles, but reinforcement learning for such systems is limited by credit assignment: shared terminal rewards obscure individual contributions and can encourage free-riding. We introduce two optimizer-agnostic credit assignment algorithms for converting joint outcomes into agent-specific learning signals. Counterfactual Credit Policy Optimization (CCPO) estimates an agent's marginal contribution by comparing the realized joint outcome with a counterfactual outcome where that agent is removed. Self-Evaluated Policy Optimization (SEPO) uses constrained self- and peer-evaluations as a verifier-anchored credit signal while keeping the external task outcome dominant.
Motivation
In a sequential Think-Solve collaboration, a Thinker produces an intermediate reasoning trace and a Solver produces the final answer. A shared verifier reward tells the team whether the answer is correct, but not which role helped, harmed, or was redundant. This mismatch can reinforce unhelpful behavior and fails to capture role asymmetry in collaborative LLM training.
Problem
Sparse joint rewards do not identify individual contributions in long textual trajectories.
Goal
Produce role-specific rewards that remain anchored to an external verifier.
Interface
Use the resulting rewards with GRPO, GSPO, REINFORCE++, or other policy-gradient optimizers.
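As a rough illustration of this interface, the sketch below feeds per-agent rewards into a group-relative (mean/std) normalization of the kind GRPO uses to form advantages. The function name `group_relative_advantages`, the reward values, the `eps` constant, and the choice of population standard deviation are all assumptions made for the example and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one agent's rewards within a group of rollouts.

    GRPO-style normalization: subtract the group mean and divide by the
    group standard deviation. `eps` (an assumed constant) avoids division
    by zero when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical role-specific rewards for four rollouts of the same prompt.
thinker_rewards = [1.0, 0.0, 1.0, 0.0]
solver_rewards = [1.0, 1.0, 0.0, 0.0]

# Each role is normalized against its own group, so the two agents can
# receive different advantages for the same rollout.
thinker_adv = group_relative_advantages(thinker_rewards)
solver_adv = group_relative_advantages(solver_rewards)
```

Because the credit signal is only a per-agent scalar reward, the same values could in principle be passed to a different optimizer's advantage estimator without changing how the credit itself is computed.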
Method
Counterfactual Credit Policy Optimization
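For agent \(i\), CCPO compares the realized joint reward \(R_{\mathrm{joint}}^{(j)}\) of rollout \(j\) with the counterfactual reward \(R_{\neg i}^{(j)}\) obtained after removing that agent. The margin, written here as \(\Delta_i^{(j)}\), is used as the raw role-specific credit signal:
\[ \Delta_i^{(j)} = R_{\mathrm{joint}}^{(j)} - R_{\neg i}^{(j)} \]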
Positive margins indicate helpful contribution, while non-positive margins indicate redundancy or harm under the corresponding rollout. In the Think-Solve instantiation, removing the Thinker means asking the Solver to answer directly from the prompt.
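The margin computation can be sketched in a few lines of Python. This is a minimal example assuming binary verifier rewards (1.0 for a correct answer, 0.0 otherwise); the names `ccpo_margins` and `describe_margin` and the reward values are hypothetical, and the counterfactual rewards are assumed to come from Solver-only rollouts on the same prompts.

```python
def ccpo_margins(joint_rewards, counterfactual_rewards):
    """Per-rollout margin: joint reward minus agent-removed reward."""
    if len(joint_rewards) != len(counterfactual_rewards):
        raise ValueError("need one counterfactual reward per joint rollout")
    return [
        r_joint - r_without
        for r_joint, r_without in zip(joint_rewards, counterfactual_rewards)
    ]


def describe_margin(margin):
    """Map a margin to the qualitative reading given in the text."""
    return "helpful" if margin > 0 else "redundant or harmful"


# Hypothetical verifier rewards for four rollouts (assumed binary).
joint = [1.0, 1.0, 0.0, 0.0]        # Thinker + Solver
solver_only = [0.0, 1.0, 0.0, 1.0]  # Thinker removed

thinker_margins = ccpo_margins(joint, solver_only)
labels = [describe_margin(m) for m in thinker_margins]
```

In this toy group, only the first rollout credits the Thinker: the team succeeded where the Solver alone failed. The last rollout yields a negative margin because the Solver alone succeeded where the team did not.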
Self-Evaluated Policy Optimization
SEPO uses constrained self- and peer-evaluations as bounded adjustments around the external verifier reward \(R_{\mathrm{ver}}\). The verifier remains the dominant signal, while self- and peer-assessments redistribute credit within the team.
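The exact SEPO formula is not reproduced here. Purely as an illustration of what a bounded adjustment around \(R_{\mathrm{ver}}\) could look like, the sketch below shifts each agent's reward by a clipped, mean-centered evaluation score. The function `bounded_credit`, the score range, the centering step, and the bound `max_shift` are all assumptions for this example, not the paper's definition.

```python
def bounded_credit(r_ver, eval_scores, max_shift=0.2):
    """Adjust a shared verifier reward with clipped evaluation scores.

    r_ver:       external verifier reward for the rollout (assumed 0 or 1).
    eval_scores: dict mapping agent name -> evaluation score in [0, 1]
                 (assumed to aggregate self- and peer-evaluations).
    max_shift:   assumed bound on how far any agent's reward may move
                 away from r_ver.
    """
    mean_score = sum(eval_scores.values()) / len(eval_scores)
    rewards = {}
    for agent, score in eval_scores.items():
        # Center the score so adjustments redistribute credit, then clip.
        shift = max(-max_shift, min(max_shift, score - mean_score))
        rewards[agent] = r_ver + shift
    return rewards


# Hypothetical evaluations for one correct and one incorrect rollout.
correct = bounded_credit(1.0, {"thinker": 0.9, "solver": 0.5})
incorrect = bounded_credit(0.0, {"thinker": 0.9, "solver": 0.5})
```

With `max_shift` below 0.5 and binary verifier rewards, every agent in a verified-correct rollout still receives more reward than any agent in an incorrect one, which is one simple way to keep the external outcome dominant.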
Results
We evaluate two-agent Think-Solve collaboration on mathematical reasoning benchmarks. The main GRPO comparison uses the same protocol, data split, and verifier for an untrained collaborative policy, shared-reward training (ReMA), and CCPO.
| Model | Untrained | ReMA / Shared | CCPO |
|---|---|---|---|
| qwen2.5-1.5b-instruct | 54.00 | 60.00 | 61.00 |
| llama3.1-8b-instruct | 46.20 | 51.80 | 53.40 |
| qwen2.5-7b-instruct | 74.40 | 75.40 | 77.60 |
| qwen3-4b-base | 46.40 | 78.00 | 79.40 |
Per-dataset results for two models, including SEPO:
| Model | Dataset | Untrained | Shared | CCPO | SEPO |
|---|---|---|---|---|---|
| qwen2.5-1.5b-instruct | MATH500 | 54.00 | 57.80 | 59.20 | 59.20 |
| qwen2.5-1.5b-instruct | AIME25 | - | 6.67 | 6.67 | 6.67 |
| qwen2.5-1.5b-instruct | AMC23 | 32.50 | 42.50 | 42.50 | 35.00 |
| qwen2.5-1.5b-instruct | Gaokao2023 | 41.04 | 46.23 | 47.01 | 46.49 |
| qwen2.5-1.5b-instruct | MinervaMath | 12.50 | 16.18 | 15.44 | 16.91 |
| olmo3-7b-instruct | MATH500 | 82.60 | 87.00 | 84.40 | 87.00 |
| olmo3-7b-instruct | AIME25 | 30.00 | 30.00 | 30.00 | 30.00 |
| olmo3-7b-instruct | AMC23 | 77.50 | 82.50 | 80.00 | 85.00 |
| olmo3-7b-instruct | Gaokao2023 | 72.73 | 73.25 | 73.77 | 73.25 |
| olmo3-7b-instruct | MinervaMath | 27.57 | 27.91 | 30.15 | 26.47 |
- CCPO improves MATH500 performance across all evaluated base models in the GRPO setting.
- CCPO is often beneficial on out-of-distribution benchmarks such as AMC23 and MinervaMath.
- SEPO is competitive in selected GSPO settings, supporting the optimizer-agnostic view of the credit signals.
- Gains vary across models and datasets, so credit assignment should be treated as a design axis rather than a universally dominant reward rule.
Collaboration Diagnostic
To check whether the Solver actually uses the Thinker's message, we remove Agent 1 (the Thinker) at inference time. Full collaboration outperforms the Agent-1-removed setting for all three reported models, suggesting that the learned Solver still benefits from the handoff rather than ignoring the collaborative trace.
| Model | Full collaboration | Agent 1 removed |
|---|---|---|
| qwen2.5-1.5b-instruct | 61.00 | 56.40 |
| llama3.1-8b-instruct | 53.40 | 52.00 |
| qwen2.5-7b-instruct | 77.60 | 76.00 |
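The size of the collaboration gap can be read off the table with simple arithmetic. The snippet below copies the reported scores and computes the per-model difference; the variable names are illustrative only.

```python
# Scores copied from the diagnostic table above.
full_collaboration = {
    "qwen2.5-1.5b-instruct": 61.00,
    "llama3.1-8b-instruct": 53.40,
    "qwen2.5-7b-instruct": 77.60,
}
agent1_removed = {
    "qwen2.5-1.5b-instruct": 56.40,
    "llama3.1-8b-instruct": 52.00,
    "qwen2.5-7b-instruct": 76.00,
}

# Gap in points between full collaboration and the Agent-1-removed setting,
# rounded to two decimals to hide floating-point noise.
gaps = {
    model: round(full_collaboration[model] - agent1_removed[model], 2)
    for model in full_collaboration
}

for model, gap in gaps.items():
    print(f"{model}: +{gap:.2f}")
```

The gap is largest for the smallest model (4.60 points) and smaller for the two larger ones (1.40 and 1.60 points).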
Citation
@article{li2026ccpo,
title={Counterfactual Credit Policy Optimization for Multi-Agent Collaboration},
author={Li, Zhongyi and Tian, Wan and Chen, Jinju and Zhang, Huiming and Liu, Yang and Ban, Yikun and Zhuang, Fuzhen},
year={2026}
}