Opening — Why this matters now

Multi-agent LLM systems are having a moment. From collaborative coding bots to diagnostic committees and AI tutors, orchestration is increasingly the default answer to hard reasoning problems. But there’s an inconvenient truth hiding behind the demos: training multi-agent systems with reinforcement learning is expensive, unstable, and often counterproductive.

This paper proposes a refreshingly heretical alternative. What if agents didn’t learn during training at all — but instead learned from themselves at inference time?

That idea becomes Multi-Agent Test-Time Reinforcement Learning (MATTRL): a framework that replaces weight updates with something cheaper, safer, and surprisingly effective — structured textual experience injected at test time.

Background — The limits of training everything

The recent success of RL-trained reasoning models (think DeepSeek-R1 and its descendants) has reignited interest in reinforcement learning for cognition. Extending this to multi-agent settings seems natural: multiple experts, shared rewards, better outcomes.

In practice, it’s a mess.

Multi-agent RL suffers from two structural problems:

  1. Non-stationarity — agents change while learning together, invalidating each other’s gradients.
  2. Sparse, high-variance rewards — especially painful when reasoning unfolds over long dialogues.

Worse, domain-specific MARL fine-tuning often degrades general capabilities. You win on one benchmark and quietly lose everywhere else.

MATTRL starts from a different premise: don’t touch the weights. Keep the models frozen. Move adaptation entirely to how agents talk, remember, and reuse prior reasoning.

Analysis — What MATTRL actually does

At its core, MATTRL treats collaboration itself as a source of reusable experience.

1. Multi-expert team formation

For each task, a coordinator agent assembles a small team of specialists (medical domains, math roles, pedagogy experts). Roles are grounded in a predefined catalog or generated explicitly for the problem — no free-form role cosplay.

Each agent maintains its own evolving opinion and a convergence flag.
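
To make the setup concrete, here is a minimal sketch of team formation in Python. The role catalog contents, the `propose_roles` coordinator call, and the `Agent` fields are illustrative assumptions; the paper describes the mechanism only at a high level.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    role: str                 # specialist role drawn from the catalog
    opinion: str = ""         # the agent's current, evolving answer
    converged: bool = False   # set once the agent stops revising its opinion

# Hypothetical role catalog; the paper's actual catalog is not reproduced here.
ROLE_CATALOG = {
    "medicine": ["hematologist", "geneticist", "immunologist"],
    "math": ["number theorist", "combinatorics specialist"],
    "education": ["curriculum designer", "assessment expert"],
}

def form_team(task: str, domain: str, propose_roles) -> list[Agent]:
    """Coordinator proposes specialist roles for the task, constrained to the catalog."""
    roles = propose_roles(task, allowed=ROLE_CATALOG[domain])  # coordinator LLM call (assumed)
    return [Agent(role=r) for r in roles]
```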

2. Experience-augmented deliberation

Agents debate in bounded rounds. Before responding, each agent retrieves relevant past experiences — distilled snippets from earlier high-quality reasoning turns — and integrates them into its prompt.

These experiences are not examples. They are short, actionable rules such as:

  • “Anchor rankings on hard discriminators first.”
  • “Clarify mechanistic loci before subtype assumptions.”
  • “State uncertainty explicitly when evidence is insufficient.”

Think of them as procedural memory, not demonstrations.

A lightweight meeting step shares incremental updates to prevent redundant chatter and force alignment.
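
A rough sketch of the deliberation loop follows, assuming helper functions (`retrieve_experiences`, `agent_respond`, `share_updates`) that stand in for retrieval and LLM calls the paper describes but does not name:

```python
def deliberate(agents, task, experience_pool,
               retrieve_experiences, agent_respond, share_updates,
               max_rounds: int = 3):
    """Bounded debate in which every turn is conditioned on retrieved experiences."""
    transcript = []
    for round_idx in range(max_rounds):
        for agent in agents:
            # Pull distilled rules relevant to this agent's role and the task.
            experiences = retrieve_experiences(experience_pool, task, agent.role)
            # The agent answers with the transcript and retrieved rules in its prompt.
            agent.opinion, agent.converged = agent_respond(
                agent, task, transcript, experiences)
            transcript.append({"agent": agent.role,
                               "round": round_idx,
                               "text": agent.opinion})
        # Lightweight meeting step: broadcast only incremental updates.
        share_updates(agents, transcript)
        # Stop early once every agent flags convergence.
        if all(a.converged for a in agents):
            break
    return transcript
```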

3. Report synthesis and decision

Once agents converge (or hit a turn limit), a coordinator synthesizes the discussion into a final report and answer — optionally consulting the experience pool one last time.

Crucially, the experience pool grows at test time, not during training.
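
The synthesis step can be sketched the same way; `synthesize_report` and `distill_rules` are placeholders for LLM calls, and each transcript turn is assumed to carry a credit score (how those scores are produced is the subject of the credit-assignment section below):

```python
def conclude(transcript, experience_pool,
             synthesize_report, distill_rules, top_k: int = 3):
    """Produce the final report and answer, then grow the experience pool at test time."""
    # Coordinator turns the discussion into a final report and answer,
    # optionally consulting the existing experience pool.
    report, answer = synthesize_report(transcript, experience_pool)
    # Distil the highest-credit turns into short reusable rules and add them to
    # the pool; this is the only "learning" step, and it happens at inference.
    best_turns = sorted(transcript, key=lambda t: t["score"], reverse=True)[:top_k]
    experience_pool.extend(distill_rules(best_turns))
    return report, answer
```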

Findings — Does this actually work?

The authors test MATTRL across three domains: medicine, math, and education.

Medicine (RareBench)

  Method               Hit@1   Hit@3   Hit@5   Hit@10   MRR
  MDAgents             0.32    0.49    0.57    0.68     0.46
  RareAgents-Refined   0.35    0.49    0.57    0.70     0.47
  MATTRL               0.39    0.51    0.61    0.75     0.51

MATTRL delivers stronger top-rank precision and broader diagnostic coverage — without retraining.

Math (Humanity’s Last Exam)

  Method         Accuracy
  Single Agent   0.27
  Multi-Agent    0.33
  MATTRL         0.36

Deliberation helps. Experience-conditioned deliberation helps more.

Education (Teaching effectiveness)

  Method           Pre-test   Post-test   Gain
  Single Teacher   0.44       0.60        0.16
  Multi-Teacher    0.44       0.73        0.29
  MATTRL           0.44       0.77        0.33

MATTRL more than doubles learning gains relative to a single teacher (0.33 vs. 0.16).

Credit assignment — The quiet core of the system

Not all dialogue turns are created equal. MATTRL ranks them using a blend of (sketched in code below):

  • Individual utterance quality (scored by an LLM judge)
  • A decayed terminal team reward
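
One plausible reading of that blend, in code; the linear weighting, decay factor, and score ranges are assumptions, not the paper's hyperparameters:

```python
def turn_score(judge_quality: float, team_reward: float,
               turn_index: int, total_turns: int,
               decay: float = 0.9, alpha: float = 0.5) -> float:
    """Blend per-utterance quality with a decayed share of the terminal team reward.

    judge_quality : LLM-judge score for this utterance (e.g. in [0, 1])
    team_reward   : terminal reward for the whole dialogue
    decay, alpha  : hypothetical decay factor and blend weight
    """
    # Turns closer to the end of the dialogue inherit more of the team reward.
    decayed_team = (decay ** (total_turns - turn_index)) * team_reward
    return alpha * judge_quality + (1 - alpha) * decayed_team
```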

The paper compares three attribution schemes:

  • Naive averaging
  • Difference rewards (counterfactual removal of one agent)
  • Shapley-style approximations

The verdict is blunt:

Difference rewards work best.

They isolate decisive contributions without the variance and cost of Shapley estimates, producing sharper experience selection and better top-rank accuracy.
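
For intuition, a difference reward amounts to a counterfactual re-scoring with one agent's turns removed; `team_score` below stands in for whatever evaluator scores a full transcript (an LLM judge or task metric), which is an assumption of this sketch:

```python
def difference_reward(agent_role: str, transcript: list[dict], team_score) -> float:
    """Credit an agent with the drop in team score when its turns are removed."""
    full = team_score(transcript)
    without_agent = team_score([t for t in transcript if t["agent"] != agent_role])
    return full - without_agent
```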

Implications — Why this matters beyond benchmarks

MATTRL quietly reframes what “learning” means for deployed AI systems.

  • No fine-tuning risk: General capabilities stay intact.
  • Domain agility: Systems adapt on the fly to new distributions.
  • Auditability: Experiences are textual, inspectable, and editable.
  • Cost control: Inference-time memory beats training-time compute.

For regulated domains — medicine, finance, education — this is not a minor detail. It’s the difference between deployable systems and research toys.

Perhaps most interestingly, the paper shows that collaboration itself is a data source. Multi-agent systems don’t just solve problems; they generate reusable reasoning assets.

Conclusion — Learning without scars

MATTRL suggests a future where agents improve not by gradient descent, but by remembering how they reasoned well last time.

It’s quieter than training. Less glamorous. And far more practical.

Cognaptus: Automate the Present, Incubate the Future.