When it comes to language model agents, more minds may not always mean merrier results. Multi-agent reinforcement learning (MARL) promises a flexible path for decomposing and solving complex tasks, but coordinating multiple large language models (LLMs) remains riddled with instability, inefficiency, and memory fragmentation.

Enter JoyAgents-R1, a novel framework that proposes an elegant, scalable solution for jointly evolving heterogeneous LLM agents using Group Relative Policy Optimization (GRPO). Developed by researchers at JD.com, JoyAgents-R1 combines memory evolution, policy optimization, and clever sampling strategies to form a resilient multi-agent architecture capable of matching the performance of larger SOTA models with far fewer parameters.

A Hierarchy of Minds: JoyAgents Architecture

The JoyAgents-R1 system employs a hierarchical multi-agent design:

  • A master agent interprets user queries and orchestrates the flow.
  • Specialized sub-agents tackle domain-specific tasks (math, QA, function-calling).
  • Agents execute reasoning in ReAct-style steps, consulting memories and tools.

Each agent is built on a lightweight 3B Qwen2.5 model, fine-tuned and then evolved via reinforcement learning.
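
To make the hierarchy concrete, here is a minimal Python sketch of a master agent routing sub-tasks to domain specialists that reason in ReAct-style steps over their memories. The class names, the plan format, and the `call_llm` / `parse_plan` helpers are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the hierarchical dispatch described above.
# All names and the plan format are illustrative, not from JoyAgents-R1.
from dataclasses import dataclass, field


def parse_plan(plan_text: str):
    """Parse lines of the (assumed) form 'agent_name: sub-task'."""
    for line in plan_text.strip().splitlines():
        name, _, task = line.partition(":")
        if task:
            yield name.strip(), task.strip()


@dataclass
class SubAgent:
    """A domain specialist (math, QA, function-calling) backed by a small LLM."""
    name: str
    system_prompt: str
    memory: list = field(default_factory=list)

    def react_step(self, task: str, call_llm) -> str:
        # ReAct-style turn: consult recent memories, think, answer.
        context = "\n".join(self.memory[-5:])
        prompt = f"{self.system_prompt}\nMemory:\n{context}\nTask: {task}\nThought:"
        return call_llm(prompt)


@dataclass
class MasterAgent:
    """Interprets the user query, decomposes it, and orchestrates sub-agents."""
    sub_agents: dict

    def run(self, query: str, call_llm) -> str:
        plan = call_llm(f"Decompose into 'agent: sub-task' lines: {query}")
        results = [
            self.sub_agents[name].react_step(task, call_llm)
            for name, task in parse_plan(plan)
            if name in self.sub_agents
        ]
        return call_llm(f"Combine these partial results into one answer: {results}")
```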

Why GRPO? And Why It Works Better Here

GRPO replaces the traditional critic model with group-relative advantages computed over sampled rollouts, sidestepping the usual training instability of actor-critic setups. JoyAgents-R1 extends this by:

  • Performing node-wise Monte Carlo sampling across reasoning trajectories.
  • Prioritizing agent updates based on marginal reward variance, targeting only the top-K unstable nodes.

This means the agents that are “most confused” receive the most updates, improving training efficiency while preserving diversity across the group.
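
For a concrete picture of both ideas, here is a short sketch of critic-free, group-relative advantages plus variance-based top-K agent selection. The function names are invented, and plain reward variance stands in for the paper's marginal-reward-variance criterion.

```python
# Sketch of GRPO-style advantages and top-K agent selection.
# Names and the exact selection rule are assumptions, not the paper's code.
import numpy as np


def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """Normalize each rollout's reward against its group's mean and std,
    so no learned critic is required (the core GRPO trick)."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)


def select_top_k_agents(per_agent_rewards: dict, k: int) -> list:
    """Pick the k agents whose sampled rewards vary most, i.e. the
    'most confused' nodes that benefit most from an update."""
    variances = {name: float(np.var(r)) for name, r in per_agent_rewards.items()}
    return sorted(variances, key=variances.get, reverse=True)[:k]


# Toy rewards from node-wise Monte Carlo sampling of each agent's trajectories.
per_agent_rewards = {
    "math":    np.array([0.9, 0.1, 0.8, 0.2]),  # unstable -> high variance
    "qa":      np.array([0.7, 0.7, 0.6, 0.7]),  # stable  -> low variance
    "tooling": np.array([0.3, 0.9, 0.2, 0.8]),
}
to_update = select_top_k_agents(per_agent_rewards, k=2)
advantages = {a: group_relative_advantages(per_agent_rewards[a]) for a in to_update}
```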

Memory as a Free Lunch

In typical multi-agent systems, memory synchronization lags behind model updates. JoyAgents-R1 solves this by:

  • Repurposing GRPO rewards as implicit memory supervision.
  • Dynamically updating memory entries based on performance thresholds and temporal decay.

Over time, only high-value memories remain, reducing redundant reasoning and improving response efficiency.
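
A rough sketch of what this reward-driven memory evolution could look like, assuming an exponential temporal decay and a fixed keep-threshold (both illustrative choices, not the paper's exact schedule):

```python
# Sketch: GRPO rewards seed memory scores; scores decay over time;
# low-value entries are pruned. Decay schedule and threshold are assumptions.
import math
import time
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    content: str
    score: float        # seeded from the GRPO reward of the step that wrote it
    created_at: float


def evolve_memory(memories: list, keep_threshold: float = 0.3,
                  half_life_s: float = 3600.0) -> list:
    """Decay each entry's reward-derived score and keep only entries
    that remain above the threshold."""
    now = time.time()
    survivors = []
    for m in memories:
        age = now - m.created_at
        decayed = m.score * math.exp(-math.log(2) * age / half_life_s)
        if decayed >= keep_threshold:
            survivors.append(MemoryEntry(m.content, decayed, m.created_at))
    return survivors
```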

How Good Is It, Really?

Despite being built on 3B models, JoyAgents-R1:

  • Beats DeepSeek-V3 and Qwen2.5-32B on e-commerce function-calling.
  • Comes close to GPT-4o on collaborative tasks, with just 15B combined parameters.
  • Outperforms larger open-source models on both in-domain and out-of-domain ToolBench tasks.

These results speak volumes about smart architecture over sheer model size.

Ablation Insights: What Makes It Tick

An extensive ablation study confirms:

  • Reinforcement learning (vs. SFT) improves accuracy by ~25%.
  • Using GRPO with top-K updates (not all agents) yields higher performance.
  • Memory integration boosts decision-making quality by 10%.

When a single agent is enough, JoyAgents-R1 scales down gracefully to individual tasks, yet it shines most when collaboration is required.

Final Thoughts: Many Minds, One Joyful Path

JoyAgents-R1 is more than a performance boost—it’s a philosophical shift in how we train and coordinate multiple LLM agents. By letting agents evolve together, selectively intervene, and remember better, it lays the groundwork for robust multi-agent AI systems in domains from e-commerce to collaborative planning.

Cognaptus: Automate the Present, Incubate the Future