Preprint · Multi-Agent LLMs · Credit Assignment

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

Zhongyi Li¹, Wan Tian², Jinju Chen³, Huiming Zhang¹, Yang Liu², Yikun Ban¹, Fuzhen Zhuang¹

¹Beihang University ²Peking University ³Beijing University of Posts and Telecommunications

TL;DR

CCPO and SEPO turn a single collaborative outcome into role-specific learning signals for multi-agent LLM reasoning, improving dual-agent mathematical reasoning in several GRPO and GSPO settings.

Abstract

Collaborative multi-agent large language models can solve complex reasoning tasks by decomposing roles, but reinforcement learning for such systems is limited by credit assignment: shared terminal rewards obscure individual contributions and can encourage free-riding. We introduce two optimizer-agnostic credit assignment algorithms for converting joint outcomes into agent-specific learning signals. Counterfactual Credit Policy Optimization (CCPO) estimates an agent's marginal contribution by comparing the realized joint outcome with a counterfactual outcome where that agent is removed. Self-Evaluated Policy Optimization (SEPO) uses constrained self- and peer-evaluations as a verifier-anchored credit signal while keeping the external task outcome dominant.

Motivation

In a sequential Think-Solve collaboration, a Thinker produces an intermediate reasoning trace and a Solver produces the final answer. A shared verifier reward tells the team whether the answer is correct, but not which role helped, harmed, or was redundant. This mismatch can reinforce unhelpful behavior and fails to capture role asymmetry in collaborative LLM training.

Problem

Sparse joint rewards do not identify individual contributions in long textual trajectories.

Goal

Produce role-specific rewards that remain anchored to an external verifier.

Interface

Use the resulting rewards with GRPO, GSPO, REINFORCE++, or other policy-gradient optimizers.

CCPO: Counterfactual marginal credit

SEPO: Verifier-anchored self/peer credit

Think-Solve: Role-asymmetric collaboration

Method

Counterfactual Credit Policy Optimization

For agent $i$ and rollout $j$, CCPO compares the realized joint reward $R_{\mathrm{joint}}^{(j)}$ with the counterfactual reward $R_{\neg i}^{(j)}$ obtained after removing that agent. The margin is used as the raw role-specific credit signal:

$$\Delta_i^{(j)} = R_{\mathrm{joint}}^{(j)} - R_{\neg i}^{(j)}.$$

A positive margin indicates a helpful contribution, a zero margin indicates redundancy, and a negative margin indicates harm, each with respect to the corresponding rollout. In the Think-Solve instantiation, removing the Thinker means asking the Solver to answer directly from the prompt.
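As a minimal sketch of the Thinker margin for a single rollout, the snippet below evaluates the Solver once with the Thinker's trace and once without it, then takes the difference. The callables `thinker`, `solver`, and `verifier`, the use of `None` to encode "Thinker removed", and the toy stand-ins are all illustrative assumptions rather than the authors' implementation. The ±1 reward scale is borrowed from the verifier reward used in the SEPO formula.

```python
from typing import Callable, Dict, Optional


def thinker_margin(
    prompt: str,
    thinker: Callable[[str], str],
    solver: Callable[[str, Optional[str]], str],
    verifier: Callable[[str, str], float],
) -> Dict[str, float]:
    """Compute Delta = R_joint - R_without_thinker for one rollout."""
    # Realized joint outcome: Thinker writes a trace, Solver answers with it.
    trace = thinker(prompt)
    r_joint = verifier(prompt, solver(prompt, trace))

    # Counterfactual outcome: Solver answers directly from the prompt.
    # Passing None for the trace is an assumed encoding of "Thinker removed".
    r_without = verifier(prompt, solver(prompt, None))

    return {
        "r_joint": r_joint,
        "r_without_thinker": r_without,
        "delta_thinker": r_joint - r_without,
    }


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_thinker(prompt: str) -> str:
    return "2 + 2 is 4"


def toy_solver(prompt: str, trace: Optional[str]) -> str:
    # With a trace, copy its last token; without one, guess wrongly.
    return trace.split()[-1] if trace else "5"


def toy_verifier(prompt: str, answer: str) -> float:
    return 1.0 if answer == "4" else -1.0


result = thinker_margin("What is 2 + 2?", toy_thinker, toy_solver, toy_verifier)
print(result)
```

In this toy case the Thinker turns a wrong answer into a right one, so the margin is positive. With sampled LLM outputs, both rewards are noisy single-sample estimates, so a real pipeline would likely compute margins over a group of rollouts.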

Self-Evaluated Policy Optimization

SEPO uses constrained self- and peer-evaluations as bounded adjustments around the external verifier reward $R_{\mathrm{ver}}$. The verifier remains the dominant signal, while self/peer assessments redistribute credit within the team:

$$r_i = \begin{cases} R_{\mathrm{ver}} + \lambda_{\mathrm{credit}}\mathrm{bonus}_i, & R_{\mathrm{ver}}=+1,\\ R_{\mathrm{ver}} - \lambda_{\mathrm{blame}}\mathrm{bonus}_i, & R_{\mathrm{ver}}=-1. \end{cases}$$
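The piecewise rule above can be sketched as a small function. The default weights, the clipping of `bonus` to [0, 1], and the function name are assumptions made for illustration; the text here does not say how `bonus_i` is derived from the self- and peer-evaluations or which values the weights take.

```python
def sepo_reward(
    r_ver: int,
    bonus: float,
    lam_credit: float = 0.2,  # hypothetical weight, not from the paper
    lam_blame: float = 0.2,   # hypothetical weight, not from the paper
) -> float:
    """Shape one agent's reward around the verifier outcome r_ver in {+1, -1}."""
    if r_ver not in (1, -1):
        raise ValueError("r_ver must be +1 or -1")

    # Assumed bound: keep the evaluation-derived term in [0, 1] so that it
    # stays a bounded adjustment around the verifier reward.
    bonus = min(max(bonus, 0.0), 1.0)

    if r_ver == 1:
        return r_ver + lam_credit * bonus  # extra credit on success
    return r_ver - lam_blame * bonus       # extra blame on failure


success = sepo_reward(+1, bonus=0.5)
failure = sepo_reward(-1, bonus=0.5)
print(success, failure)
```

Under these assumptions, with weights below 1 and a bonus in [0, 1], the shaped reward always keeps the sign of the verifier reward. This is one way to read the statement that the verifier remains the dominant signal.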

Pipeline: 1. Prompt → 2. Thinker reasoning → 3. Solver answer → 4. Verifier reward → 5. Role-specific credit
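Once each role has its own reward, a group-based optimizer such as GRPO can consume it in place of the shared reward. The sketch below shows the standard GRPO-style step of normalizing rewards within a group of rollouts for the same prompt. The example reward values, the use of the population standard deviation, and the `eps` constant are assumptions; the exact normalization used in the paper is not stated here.

```python
from statistics import fmean, pstdev
from typing import List


def group_normalized_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: center and scale rewards within one rollout group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical role-specific credits for four rollouts of one prompt,
# e.g. Thinker margins on a +/-1 verifier scale.
thinker_credits = [2.0, 0.0, 0.0, -2.0]
thinker_adv = group_normalized_advantages(thinker_credits)
print(thinker_adv)

# A group where every rollout earned the same credit gives zero advantages,
# so it contributes no gradient signal for that role.
flat_adv = group_normalized_advantages([0.0, 0.0, 0.0, 0.0])
print(flat_adv)
```

Each role would get its own advantage vector computed from its own credits, so a Thinker and a Solver taking part in the same rollout can receive different learning signals.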

Results

We evaluate two-agent Think-Solve collaboration on mathematical reasoning benchmarks. The main GRPO comparison uses the same protocol, data split, and verifier for an untrained collaborative policy, shared-reward training (ReMA), and CCPO.

MATH500 exact-match accuracy under GRPO.
Model	Untrained	ReMA / Shared	CCPO
qwen2.5-1.5b-instruct	54.00	60.00	61.00
llama3.1-8b-instruct	46.20	51.80	53.40
qwen2.5-7b-instruct	74.40	75.40	77.60
qwen3-4b-base	46.40	78.00	79.40

Reward-design comparison under GSPO.
Model	Dataset	Untrained	Shared	CCPO	SEPO
qwen2.5-1.5b-instruct	MATH500	54.00	57.80	59.20	59.20
	AIME25	-	6.67	6.67	6.67
	AMC23	32.50	42.50	42.50	35.00
	Gaokao2023	41.04	46.23	47.01	46.49
	MinervaMath	12.50	16.18	15.44	16.91
olmo3-7b-instruct	MATH500	82.60	87.00	84.40	87.00
	AIME25	30.00	30.00	30.00	30.00
	AMC23	77.50	82.50	80.00	85.00
	Gaokao2023	72.73	73.25	73.77	73.25
	MinervaMath	27.57	27.91	30.15	26.47

Key observation. CCPO consistently improves MATH500 accuracy over shared-reward training in the reported GRPO experiments, while remaining compatible with other policy-gradient optimizers.

CCPO improves MATH500 accuracy over shared-reward training for all four evaluated base models in the GRPO setting, by 1.0 to 2.2 points.
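CCPO is often beneficial on out-of-distribution benchmarks such as AMC23 and MinervaMath, though not uniformly: in the GSPO table above it beats shared-reward training on MinervaMath for olmo3-7b-instruct (30.15 vs. 27.91) but trails it on MinervaMath for qwen2.5-1.5b-instruct and on AMC23 for olmo3-7b-instruct.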
SEPO is competitive in selected GSPO settings (it ties for the best MATH500 score on both models and leads on AMC23 for olmo3-7b-instruct), supporting the optimizer-agnostic view of the credit signals.
Gains vary across models and datasets, so credit assignment should be treated as a design axis rather than a universally dominant reward rule.

Collaboration Diagnostic

To check whether the Solver uses the Thinker's message, we remove Agent 1 (the Thinker) at inference time. Full collaboration outperforms the Agent-1-removed setting for all three reported models, by 1.4 to 4.6 points, suggesting that the learned Solver still benefits from the handoff rather than ignoring the collaborative trace.

Model	Full collaboration	Agent 1 removed
qwen2.5-1.5b-instruct	61.00	56.40
llama3.1-8b-instruct	53.40	52.00
qwen2.5-7b-instruct	77.60	76.00
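The size of the drop can be read directly off the table. The short script below subtracts the two columns, using the reported numbers; the variable names are illustrative.

```python
# (full collaboration, Agent 1 removed) accuracies copied from the table above.
diagnostic = {
    "qwen2.5-1.5b-instruct": (61.00, 56.40),
    "llama3.1-8b-instruct": (53.40, 52.00),
    "qwen2.5-7b-instruct": (77.60, 76.00),
}

# Accuracy lost when the Thinker is removed, in percentage points.
drops = {
    model: round(full - removed, 2)
    for model, (full, removed) in diagnostic.items()
}
print(drops)
```

The drop is largest for the smallest model (4.6 points) and smaller for the two larger ones (1.4 and 1.6 points).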

Citation

@article{li2026ccpo,
  title={Counterfactual Credit Policy Optimization for Multi-Agent Collaboration},
  author={Li, Zhongyi and Tian, Wan and Chen, Jinju and Zhang, Huiming and Liu, Yang and Ban, Yikun and Zhuang, Fuzhen},
  year={2026}
}