Preprint · Multi-Agent LLMs · Credit Assignment

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Zhongyi Li¹, Wan Tian², Jinju Chen³, Huiming Zhang¹, Yang Liu², Yikun Ban¹, Fuzhen Zhuang¹

¹Beihang University    ²Peking University    ³Beijing University of Posts and Telecommunications

TL;DR

CCPO and SEPO turn a single collaborative outcome into role-specific learning signals for multi-agent LLM reasoning, improving dual-agent mathematical reasoning in several GRPO and GSPO settings.

Abstract

Collaborative multi-agent large language models can solve complex reasoning tasks by decomposing roles, but reinforcement learning for such systems is limited by credit assignment: shared terminal rewards obscure individual contributions and can encourage free-riding. We introduce two optimizer-agnostic credit assignment algorithms for converting joint outcomes into agent-specific learning signals. Counterfactual Credit Policy Optimization (CCPO) estimates an agent's marginal contribution by comparing the realized joint outcome with a counterfactual outcome where that agent is removed. Self-Evaluated Policy Optimization (SEPO) uses constrained self- and peer-evaluations as a verifier-anchored credit signal while keeping the external task outcome dominant.

Motivation

In a sequential Think-Solve collaboration, a Thinker produces an intermediate reasoning trace and a Solver produces the final answer. A shared verifier reward tells the team whether the answer is correct, but not which role helped, harmed, or was redundant. This mismatch can reinforce unhelpful behavior and fails to capture role asymmetry in collaborative LLM training.

Problem

Sparse joint rewards do not identify individual contributions in long textual trajectories.

Goal

Produce role-specific rewards that remain anchored to an external verifier.

Interface

Use the resulting rewards with GRPO, GSPO, REINFORCE++, or other policy-gradient optimizers.
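
As a rough illustration of this interface, the sketch below standardizes one role's rewards within a group of rollouts for the same prompt, which is the group-relative normalization used in GRPO-style training. The function name, the example reward values, the use of the population standard deviation, and the choice to normalize each role separately are assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Standardize one role's rewards within a rollout group (GRPO-style).

    `eps` guards against division by zero when every rollout in the
    group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical role-specific rewards for four rollouts of one prompt.
thinker_rewards = [1.0, 0.0, 1.0, 0.0]
solver_rewards = [1.0, 1.0, -1.0, -1.0]

# Each role gets its own advantages, even though the rollouts are shared.
thinker_adv = group_normalized_advantages(thinker_rewards)
solver_adv = group_normalized_advantages(solver_rewards)
```

Because the optimizer only sees per-role scalar rewards, the same normalized values could in principle be passed to any policy-gradient update that accepts per-sample advantages.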

CCPO: Counterfactual marginal credit
SEPO: Verifier-anchored self/peer credit
Think-Solve: Role-asymmetric collaboration

Method

Counterfactual Credit Policy Optimization

For agent \(i\), CCPO compares the realized joint reward \(R_{\mathrm{joint}}^{(j)}\) with the counterfactual reward \(R_{\neg i}^{(j)}\) obtained after removing that agent. The margin is used as the raw role-specific credit signal:

$$\Delta_i^{(j)} = R_{\mathrm{joint}}^{(j)} - R_{\neg i}^{(j)}.$$

Positive margins indicate helpful contribution, while non-positive margins indicate redundancy or harm under the corresponding rollout. In the Think-Solve instantiation, removing the Thinker means asking the Solver to answer directly from the prompt.
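
A minimal sketch of the margin computation for the Thinker, assuming a verifier that returns +1 for a correct answer and -1 for an incorrect one (the convention of the SEPO formula, applied here to CCPO as an assumption). The function name and the reward values are hypothetical.

```python
def counterfactual_margins(joint_rewards, counterfactual_rewards):
    """Return Delta_i^(j) = R_joint^(j) - R_{not i}^(j) for each rollout j."""
    if len(joint_rewards) != len(counterfactual_rewards):
        raise ValueError("need one counterfactual reward per joint rollout")
    return [
        r_joint - r_cf
        for r_joint, r_cf in zip(joint_rewards, counterfactual_rewards)
    ]


# Hypothetical verifier rewards for four rollouts (+1 correct, -1 incorrect).
joint = [1, 1, -1, -1]        # Thinker + Solver
solver_only = [-1, 1, -1, 1]  # Thinker removed: Solver answers from the prompt

thinker_margins = counterfactual_margins(joint, solver_only)
print(thinker_margins)  # [2, 0, 0, -2]
```

In this toy group the first rollout credits the Thinker (the team succeeds only with its trace), the two middle rollouts have zero margin because the outcome is the same with or without the Thinker, and the last penalizes it (the Solver alone would have been correct).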

Self-Evaluated Policy Optimization

SEPO uses constrained self- and peer-evaluations as bounded adjustments around the external verifier reward \(R_{\mathrm{ver}}\). The verifier remains the dominant signal, while self/peer assessments redistribute credit within the team:

$$r_i = \begin{cases} R_{\mathrm{ver}} + \lambda_{\mathrm{credit}}\mathrm{bonus}_i, & R_{\mathrm{ver}}=+1,\\ R_{\mathrm{ver}} - \lambda_{\mathrm{blame}}\mathrm{bonus}_i, & R_{\mathrm{ver}}=-1. \end{cases}$$
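
The piecewise reward above can be sketched as follows. The formula is applied as written; the default coefficient values, the assumption that \(|\mathrm{bonus}_i| \le 1\), and the requirement that both coefficients stay below 1 (so the sign of the reward always follows the verifier) are illustrative choices, not values reported in the paper.

```python
def sepo_reward(r_ver, bonus_i, lam_credit=0.2, lam_blame=0.2):
    """Verifier reward plus a bounded self/peer adjustment for agent i.

    r_ver must be +1 (verified correct) or -1 (verified incorrect).
    bonus_i is assumed to lie in [-1, 1] (hypothetical range).
    """
    if not (0 <= lam_credit < 1 and 0 <= lam_blame < 1):
        raise ValueError("coefficients must lie in [0, 1) in this sketch")
    if abs(bonus_i) > 1:
        raise ValueError("bonus_i is assumed to lie in [-1, 1]")
    if r_ver == 1:
        return r_ver + lam_credit * bonus_i
    if r_ver == -1:
        return r_ver - lam_blame * bonus_i
    raise ValueError("r_ver must be +1 or -1")


# Hypothetical evaluations for one agent on a success and on a failure.
reward_on_success = sepo_reward(1, 0.5)
reward_on_failure = sepo_reward(-1, 0.5)
```

Under these assumptions the adjustment never exceeds the coefficient in magnitude, so a rollout the verifier marks correct always yields a positive reward and one it marks incorrect always yields a negative reward.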

Pipeline: (1) Prompt → (2) Thinker reasoning → (3) Solver answer → (4) Verifier reward → (5) Role-specific credit

Results

We evaluate two-agent Think-Solve collaboration on mathematical reasoning benchmarks. The main GRPO comparison uses the same protocol, data split, and verifier for an untrained collaborative policy, shared-reward training (ReMA), and CCPO.

MATH500 exact-match accuracy (%) under GRPO.

| Model | Untrained | ReMA / Shared | CCPO |
|---|---|---|---|
| qwen2.5-1.5b-instruct | 54.00 | 60.00 | 61.00 |
| llama3.1-8b-instruct | 46.20 | 51.80 | 53.40 |
| qwen2.5-7b-instruct | 74.40 | 75.40 | 77.60 |
| qwen3-4b-base | 46.40 | 78.00 | 79.40 |

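Reward-design comparison under GSPO (accuracy, %).

| Model | Dataset | Untrained | Shared | CCPO | SEPO |
|---|---|---|---|---|---|
| qwen2.5-1.5b-instruct | MATH500 | 54.00 | 57.80 | 59.20 | 59.20 |
| qwen2.5-1.5b-instruct | AIME25 | - | 6.67 | 6.67 | 6.67 |
| qwen2.5-1.5b-instruct | AMC23 | 32.50 | 42.50 | 42.50 | 35.00 |
| qwen2.5-1.5b-instruct | Gaokao2023 | 41.04 | 46.23 | 47.01 | 46.49 |
| qwen2.5-1.5b-instruct | MinervaMath | 12.50 | 16.18 | 15.44 | 16.91 |
| olmo3-7b-instruct | MATH500 | 82.60 | 87.00 | 84.40 | 87.00 |
| olmo3-7b-instruct | AIME25 | 30.00 | 30.00 | 30.00 | 30.00 |
| olmo3-7b-instruct | AMC23 | 77.50 | 82.50 | 80.00 | 85.00 |
| olmo3-7b-instruct | Gaokao2023 | 72.73 | 73.25 | 73.77 | 73.25 |
| olmo3-7b-instruct | MinervaMath | 27.57 | 27.91 | 30.15 | 26.47 |

Key observation. In the GRPO comparison, CCPO improves MATH500 accuracy over shared-reward training for all four models, by 1.0 to 2.2 points. Under GSPO the picture is mixed: CCPO and SEPO each match or beat shared rewards on some model-dataset pairs and trail on others. Because both methods only change the reward signal, they remain compatible with other policy-gradient optimizers.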

Collaboration Diagnostic

To check whether the Solver actually uses the Thinker's message, we remove Agent 1 (the Thinker) at inference time, so the Solver answers directly from the prompt. Full collaboration outperforms the Agent-1-removed setting for all three reported models, by 1.4 to 4.6 points, suggesting that the learned Solver still benefits from the handoff rather than ignoring the collaborative trace.

| Model | Full collaboration | Agent 1 removed |
|---|---|---|
| qwen2.5-1.5b-instruct | 61.00 | 56.40 |
| llama3.1-8b-instruct | 53.40 | 52.00 |
| qwen2.5-7b-instruct | 77.60 | 76.00 |
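
The size of the collaboration effect can be read off the table as a simple difference. The snippet below recomputes it from the reported numbers; the variable names are illustrative.

```python
# Accuracy (%) from the diagnostic table: (full collaboration, Agent 1 removed).
results = {
    "qwen2.5-1.5b-instruct": (61.00, 56.40),
    "llama3.1-8b-instruct": (53.40, 52.00),
    "qwen2.5-7b-instruct": (77.60, 76.00),
}

# Gap in percentage points attributable to keeping the Thinker in the loop.
gaps = {
    model: round(full - removed, 2)
    for model, (full, removed) in results.items()
}

for model, gap in gaps.items():
    print(f"{model}: +{gap:.2f} points")
```

The gap is largest for the smallest model (4.60 points) and smaller for the two larger ones (1.40 and 1.60 points).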

Citation

@article{li2026ccpo,
  title={Counterfactual Credit Policy Optimization for Multi-Agent Collaboration},
  author={Li, Zhongyi and Tian, Wan and Chen, Jinju and Zhang, Huiming and Liu, Yang and Ban, Yikun and Zhuang, Fuzhen},
  year={2026}
}