When it comes to language model agents, more minds may not always mean merrier results. Multi-agent reinforcement learning (MARL) promises a flexible path for decomposing and solving complex tasks, but coordinating multiple large language models (LLMs) remains riddled with instability, inefficiency, and memory fragmentation.
Enter JoyAgents-R1, a framework from researchers at JD.com that offers a scalable way to jointly evolve heterogeneous LLM agents using Group Relative Policy Optimization (GRPO). By combining memory evolution, policy optimization, and targeted sampling strategies, it builds a resilient multi-agent architecture that matches the performance of far larger SOTA models with only a fraction of the parameters.
A Hierarchy of Minds: JoyAgents Architecture
The JoyAgents-R1 system employs a hierarchical multi-agent design:
- A master agent interprets user queries and orchestrates the flow.
- Specialized sub-agents tackle domain-specific tasks (math, QA, function-calling).
- Agents execute reasoning in ReAct-style steps, consulting memories and tools.
Each agent is built on a lightweight 3B Qwen2.5 model, fine-tuned and then evolved via reinforcement learning.
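To make the division of labor concrete, here is a minimal Python sketch of that hierarchy. The class names, routing policy, and ReAct step are illustrative assumptions, not the authors' implementation; in the real system each agent would call its fine-tuned 3B Qwen2.5 backbone and actual tools.

```python
# Illustrative hierarchy only: class names and the routing rule are assumptions,
# not the authors' actual API.
from dataclasses import dataclass, field


@dataclass
class SubAgent:
    """A domain-specialized agent (e.g., math, QA, function-calling)."""
    name: str
    memory: list = field(default_factory=list)

    def react_step(self, task: str) -> str:
        # One ReAct-style iteration: think, consult memory/tools, then act.
        # A real sub-agent would query its 3B Qwen2.5 backbone here.
        relevant = [m for m in self.memory if task in m.get("key", "")]
        return f"[{self.name}] reasoned about '{task}' using {len(relevant)} memories"


@dataclass
class MasterAgent:
    """Interprets the user query and routes it to specialist sub-agents."""
    sub_agents: dict

    def route(self, query: str) -> list:
        # Placeholder policy: dispatch to every specialist. The real master
        # agent decides this with its own LLM reasoning.
        return list(self.sub_agents)

    def run(self, query: str) -> list:
        return [self.sub_agents[name].react_step(query) for name in self.route(query)]


agents = {n: SubAgent(n) for n in ("math", "qa", "function_calling")}
master = MasterAgent(agents)
print(master.run("What is 17% of 2350?"))
```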
Why GRPO? And Why It Works Better Here
GRPO replaces the traditional critic model with a group-relative advantage estimate, sidestepping the training instability common to actor-critic setups. JoyAgents-R1 extends this by:
- Performing node-wise Monte Carlo sampling across reasoning trajectories.
- Prioritizing agent updates based on marginal reward variance, targeting only the top-K unstable nodes.
This means the agents that are “most confused” get the most attention, improving training efficiency while preserving diversity; a minimal sketch of this selection step follows below.
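The sketch uses assumed function names and a simplified variance criterion standing in for the paper's marginal reward variance; it shows how group-relative advantages remove the need for a critic and how the top-K most unstable agents are chosen for updates.

```python
# Toy sketch: group-relative advantages plus variance-based agent selection.
# Function names and the exact selection criterion are simplifications.
import numpy as np


def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """GRPO-style advantage: normalize each rollout's reward against its
    group's mean and std, so no learned critic is required."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)


def select_top_k_agents(per_agent_rewards: dict, k: int) -> list:
    """Pick the k agents whose sampled rewards vary the most, i.e. the
    'most confused' nodes that benefit most from a policy update."""
    variances = {name: float(r.var()) for name, r in per_agent_rewards.items()}
    return sorted(variances, key=variances.get, reverse=True)[:k]


# Three agents, each with a group of 6 Monte Carlo rollouts per reasoning node.
rng = np.random.default_rng(0)
rollouts = {
    "math": rng.normal(0.6, 0.05, 6),             # stable -> low variance
    "qa": rng.normal(0.5, 0.30, 6),               # unstable -> high variance
    "function_calling": rng.normal(0.7, 0.15, 6),
}
to_update = select_top_k_agents(rollouts, k=1)
advantages = {name: group_relative_advantages(rollouts[name]) for name in to_update}
print(to_update, advantages)
```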
Memory as a Free Lunch
In typical multi-agent systems, memory synchronization lags behind model updates. JoyAgents-R1 solves this by:
- Repurposing GRPO rewards as implicit memory supervision.
- Dynamically updating memory entries based on performance thresholds and temporal decay.
Over time, only high-value memories remain, reducing redundant reasoning and improving response efficiency.
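A rough sketch of that pruning rule, with an assumed threshold, decay schedule, and data layout (none of which come from the paper), might look like this:

```python
# Rough sketch of reward-driven memory evolution with temporal decay.
# Threshold, half-life, and field names are illustrative assumptions.
import math
import time


class EvolvingMemory:
    def __init__(self, keep_threshold: float = 0.3, half_life_s: float = 3600.0):
        self.entries = []                       # each entry: {"text", "reward", "t"}
        self.keep_threshold = keep_threshold
        self.decay = math.log(2) / half_life_s  # exponential temporal decay rate

    def add(self, text: str, grpo_reward: float) -> None:
        # The GRPO reward of the trajectory that produced this memory serves
        # as an implicit supervision signal for how useful the memory is.
        self.entries.append({"text": text, "reward": grpo_reward, "t": time.time()})

    def utility(self, entry: dict) -> float:
        # Usefulness = reward discounted by how long ago the memory was written.
        age = time.time() - entry["t"]
        return entry["reward"] * math.exp(-self.decay * age)

    def evolve(self) -> None:
        # Keep only memories whose decayed utility still clears the threshold.
        self.entries = [e for e in self.entries if self.utility(e) >= self.keep_threshold]


mem = EvolvingMemory()
mem.add("Coupon-stacking rules for electronics orders", grpo_reward=0.9)
mem.add("Dead-end tool call during an order lookup", grpo_reward=0.1)
mem.evolve()
print([e["text"] for e in mem.entries])  # the low-reward memory is pruned
```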
How Good Is It, Really?
Despite being built on 3B models, JoyAgents-R1:
- Beats DeepSeek-V3 and Qwen2.5-32B on e-commerce function-calling tasks.
- Comes close to GPT-4o on collaborative tasks, with just 15B combined parameters.
- Outperforms larger open-source models on both in-domain and out-of-domain ToolBench tasks.
These results suggest that smart architecture can matter more than sheer model size.
Ablation Insights: What Makes It Tick
An extensive ablation study confirms:
- Reinforcement learning (vs. SFT) improves accuracy by ~25%.
- Restricting GRPO updates to the top-K agents (rather than updating all of them) yields higher performance.
- Memory integration boosts decision-making quality by 10%.
When a task needs only one specialist, JoyAgents-R1 scales down gracefully, but it shines when genuine collaboration is required.
Final Thoughts: Many Minds, One Joyful Path
JoyAgents-R1 is more than a performance boost—it’s a philosophical shift in how we train and coordinate multiple LLM agents. By letting agents evolve together, selectively intervene, and remember better, it lays the groundwork for robust multi-agent AI systems in domains from e-commerce to collaborative planning.
Cognaptus: Automate the Present, Incubate the Future