The Joy of Many Minds: How JoyAgents-R1 Unleashes the Power of Multi-LLM Reinforcement Learning

TL;DR for operators

A naming note before the machinery starts: the existing Cognaptus title says JoyAgents-R1, but the arXiv paper itself names the benchmark HiMA-Ecom and the training method HiMA-R1. This revision uses the paper’s terminology, because accuracy is not decorative trim.

The paper is useful for operators because it does not simply say “use more agents.” That slogan is old, cheap, and usually followed by a demo in which three chatbots politely agree with one another until the invoice arrives. The real contribution is more specific: the authors build a hierarchical e-commerce assistant benchmark, then train the master agent and specialised sub-agents jointly with reinforcement learning instead of optimising them as isolated prompt puppets.¹

HiMA-Ecom contains 22.8K instances, including 17.7K memory-bearing samples. Its architecture has a master agent plus specialised sub-agents for question answering, e-commerce function calling, general function calling, and math. The reinforcement-learning split is deliberately harsher than agent-level supervision: each RL example gives only the user query and final system response, not a neat transcript of which agent should have done what. That matters because real operations logs rarely arrive as beautifully annotated agent choreography. Reality, being inconsiderate, tends to provide outcomes.

HiMA-R1 then tackles the ugly part of multi-agent RL: if every agent samples multiple actions across a reasoning path, the joint action space can explode. The paper’s answer is Variance-Reduction GRPO, or VR-GRPO. It samples around an initial trajectory rather than exhaustively exploring every branch, then updates the agents with the largest reward variance. In plainer operational English: do not retrain every worker after every ticket; identify the parts of the workflow that caused the most performance instability and spend update budget there.

The results are strongest where the benchmark resembles the intended product environment: e-commerce function calling, tool routing, and collaborative assistant workflows. HiMA-R1 with a 7B master plus 3B sub-agents reaches 47.6% average accuracy on HiMA-Ecom, matching DeepSeek-R1’s reported average and exceeding DeepSeek-V3’s 41.6% in that table. A smaller version with a 3B master plus 3B sub-agents reaches 44.0%, outperforming a single Qwen2.5-14B model trained with RL, which scores 35.1%. That is not a universal victory parade for tiny models. It is a domain-specific lesson: structured agent teams trained on relevant workflows can beat bigger generalists on the parts of the job where domain routing and tool choice matter.

For business use, the paper points toward an architecture for merchant support, internal service desks, marketplace operations, and API-heavy customer support: specialised agents, shared outcome rewards, memory that evolves with the policy, and evaluation that includes collaboration rather than only isolated question answering. The uncertainty is equally clear. The work is still mostly an e-commerce and tool-calling story, trained on small-to-medium Qwen2.5 backbones, with infrastructure assumptions that include H200 GPUs and carefully constructed data. Operators should read it as a serious design pattern, not a procurement memo.

The expensive problem is not “more agents”; it is coordinating them

Multi-agent AI sounds deceptively simple. Give one model the role of planner, another the role of tool caller, another the role of domain expert, and perhaps a fourth the role of mathematical adult supervision. Then watch them collaborate. Very tidy. Very enterprise. Very likely to go sideways unless the system is trained and evaluated as a system.

The paper starts from a practical observation: vertical AI assistants are already hierarchical in spirit. A customer support assistant does not merely answer text. It recognises intent, retrieves policy, selects APIs, checks order status, calculates fees, and decides when enough information has been collected. In e-commerce, this means a single user request can move from merchant policy to platform API selection to arithmetic and finally to customer-facing response composition.

A monolithic model can attempt all of this. Sometimes it will even succeed, especially when the platform has the budget to throw a frontier model at every interaction. But the operational question is different: can a business train a smaller, modular assistant that routes tasks to specialised sub-agents, learns from final task outcomes, and improves the orchestration rather than just the wording?

HiMA-Ecom exists because ordinary benchmarks do not answer that question. A benchmark for single-turn Q&A cannot tell whether the master agent routed correctly. A function-calling benchmark alone cannot tell whether the system knew when to retrieve policy first. A math benchmark cannot test whether a pricing query should be passed to a math agent only after an e-commerce API returns the relevant deposit amount. The paper’s benchmark tries to capture this full choreography.

That is why a mechanism-first reading is better than a leaderboard-first reading. The scores matter, but they are not the main plot. The main plot is how the authors make hierarchical agent learning tractable enough to train and realistic enough to evaluate.

HiMA-Ecom turns e-commerce support into a trainable agent system

HiMA-Ecom is built around a master-sub-agent structure. The master agent analyses the user query, invokes specialised agents or tools, and decides when to produce the final response. The sub-agents handle narrower roles: question answering, e-commerce function calling, general function calling, and math.

The dataset has two layers, plus a memory-bearing subset that is counted separately in the table:

Data layer	What it contains	Operational role
Agent-specific SFT data	21.7K supervised instances across master, QA, e-commerce function-call, general function-call, and math agents	Teaches each agent its basic role, format, memory use, and tool behaviour
System-level RL data	1.1K collaboration/RL instances, split into 600 training and 500 test samples	Trains the whole multi-agent system from initial query to final answer without intermediate trajectory labels
Memory-bearing data	17.7K samples include memory	Lets agents learn when to reuse prior solutions and when to ignore misleading memory

This design is quietly important. The SFT portion teaches each role. The RL portion then removes the comfort blanket of intermediate supervision. During RL, the system receives only the initial user query and the final response target. It must discover the useful internal route itself.
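The difference between the two layers is easiest to see as record shapes. The sketch below is a minimal illustration, not the paper's actual schema: the class names, field names, and agent labels are all assumptions chosen for readability. The point it makes is structural: an SFT record names an agent and carries a role-level target, while an RL record carries only the user query and the final response.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class SFTExample:
    """Agent-level supervised record (hypothetical field names)."""
    agent: str                      # e.g. "master", "qa", "ecom_fc", "general_fc", "math"
    prompt: str                     # what this agent sees
    target: str                     # what this agent should output
    memory: List[str] = field(default_factory=list)  # optional recalled entries


@dataclass
class RLExample:
    """System-level record: no intermediate routing labels at all."""
    query: str                      # initial user query
    final_response: str             # final answer used to score the whole system


def rl_field_names() -> List[str]:
    """Return the supervised fields available during RL."""
    return [f.name for f in fields(RLExample)]
```

Nothing in `RLExample` says which agent should act or in what order. Under this assumed layout, the route has to be discovered by the policy rather than read from the data.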

That resembles deployment more closely than many agent demos do. In a real support operation, the final ticket resolution is easier to observe than the perfect hidden reasoning path. You may know whether a merchant received the correct answer about deposits, order labels, or settlement configuration. You may not know whether the internal agent should have called the retrieval tool before the e-commerce API. HiMA-Ecom turns that messy operational reality into a training problem.

The benchmark also includes examples that force collaboration. One appendix case asks about the deposit required to open one small personal store selling educational toys, then asks the total for two stores. The system must first invoke an e-commerce function-call agent to check the deposit and then invoke the math agent to multiply it. This is the paper in miniature: value appears when role boundaries are crossed correctly.
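The two-store deposit case can be sketched as a tiny routing program. Everything here is a hypothetical stand-in: the function names, the tool name, and especially the deposit amount, which is a placeholder and not a figure from the paper or from any real platform. The sketch only shows the order of operations: look up first, calculate second.

```python
from typing import Dict, List, Tuple


def ecom_function_call_agent(query: str) -> Dict[str, object]:
    """Hypothetical lookup; the amount below is a placeholder value."""
    return {"tool": "get_store_deposit", "deposit": 1000.0}


def math_agent(unit_amount: float, count: int) -> float:
    """Multiply the per-store deposit by the number of stores."""
    return unit_amount * count


def master_agent(query: str, store_count: int) -> Tuple[float, List[str]]:
    """Route the query: e-commerce function call first, then math."""
    trace: List[str] = []
    lookup = ecom_function_call_agent(query)
    trace.append("ecom_fc")
    total = math_agent(float(lookup["deposit"]), store_count)
    trace.append("math")
    return total, trace


if __name__ == "__main__":
    print(master_agent("Deposit for two educational-toy stores?", 2))
```

If the master called the math agent before the lookup, there would be nothing to multiply. That ordering constraint is what the collaborative examples are testing.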

VR-GRPO trains the bottleneck, not the whole committee

The paper’s core method, HiMA-R1, adapts Group Relative Policy Optimization to hierarchical multi-agent systems. Standard GRPO avoids a critic model by comparing multiple sampled outputs within a group and computing relative advantages from rewards. That is already attractive for LLM training because it avoids maintaining a separate value model.
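The group-relative idea is small enough to write down. The sketch below shows the commonly used GRPO-style normalisation, where each sampled output's advantage is its reward minus the group mean, divided by the group standard deviation. The function name, the use of the population standard deviation, and the `eps` guard are assumptions for illustration; the paper's exact implementation may differ.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each sample relative to its own group; no critic model needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)          # population std of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled outputs for one query: two rewarded, two not.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every sample in a group earns the same reward, all advantages are zero and the group contributes no learning signal. That property is what makes reward variance a natural diagnostic later in the method.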

The difficulty is that multi-agent systems multiply the sampling problem. If every agent can take several possible actions at each step across a trajectory, naïve sampling does not grow politely. It explodes.
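A back-of-the-envelope count shows the scale of the problem. This sketch assumes a simplified model in which every step of a trajectory offers the same number of sampled actions and the steps branch independently; the numbers are illustrative and do not come from the paper.

```python
def joint_action_count(samples_per_step: int, steps: int) -> int:
    """Distinct joint paths if every step branches independently."""
    return samples_per_step ** steps


if __name__ == "__main__":
    for steps in (2, 4, 6, 8):
        print(steps, joint_action_count(4, steps))
```

Under these assumptions, four samples per step over five steps already means 1,024 full rollouts for a single query, and each rollout is a chain of LLM calls.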

HiMA-R1’s answer is initial trajectory-based Monte Carlo sampling. First, the system generates an initial reasoning trajectory. Then it samples alternative actions at nodes along that trajectory while keeping the surrounding path anchored to the original trajectory. Instead of enumerating the full joint action space, it probes the trajectory locally.

This is not merely a computational trick. It changes what the training process is allowed to notice. The system can ask, in effect: if this agent had acted differently at this point, would the final answer have improved? That gives a way to assign learning pressure to specific parts of the route without needing a fully annotated trajectory.
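A toy version of anchored sampling makes the budget visible. The sketch below is a deliberate simplification and not the paper's implementation: actions are plain strings, the reward function is supplied by the caller, and when one node is resampled every other node is held at its original action. All names are hypothetical.

```python
import random
from typing import Callable, Dict, List, Sequence


def sample_around_trajectory(
    initial: Sequence[str],
    candidates: Sequence[Sequence[str]],
    reward_fn: Callable[[List[str]], float],
    samples_per_node: int,
    rng: random.Random,
) -> Dict[int, List[float]]:
    """Resample one node at a time, keeping the rest of the path anchored."""
    rewards_by_node: Dict[int, List[float]] = {}
    for i in range(len(initial)):
        rewards = []
        for _ in range(samples_per_node):
            variant = list(initial)
            variant[i] = rng.choice(list(candidates[i]))  # local alternative
            rewards.append(reward_fn(variant))
        rewards_by_node[i] = rewards
    return rewards_by_node


if __name__ == "__main__":
    initial = ["route_to_ecom_fc", "call_deposit_api"]
    candidates = [["route_to_ecom_fc", "route_to_qa"],
                  ["call_deposit_api", "call_label_api"]]
    reward = lambda path: 1.0 if path == list(initial) else 0.0
    print(sample_around_trajectory(initial, candidates, reward, 4, random.Random(0)))
```

The number of rollouts here is nodes times samples per node, so it grows linearly with trajectory length instead of exponentially. The per-node reward lists are also exactly what a variance-based selection step would consume.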

The second move is marginal benefit-driven updating. HiMA-R1 does not update every participating agent equally. It selects the top-$k$ agents or nodes with the largest reward variance. High variance means the sampled alternatives produced meaningfully different outcomes. Those are the places where learning is likely to pay off.

For an operator, this is the most transferable idea in the paper. Training budget should follow workflow uncertainty. If the math agent is stable but the e-commerce API selector keeps swinging between correct and incorrect tools, update the API selector. If the master agent repeatedly misroutes collaborative tasks, update the master. Blindly improving every component is not strategy. It is just computational gardening.
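The selection rule itself is a few lines. The sketch below assumes per-node reward samples are already available and ranks nodes by population variance; the node labels and reward values are invented for illustration, and the paper's exact variance estimator is not specified here.

```python
from statistics import pvariance
from typing import Dict, List


def select_top_k_by_variance(rewards_by_node: Dict[str, List[float]], k: int) -> List[str]:
    """Return the k nodes whose sampled rewards vary the most."""
    variances = {node: pvariance(r) for node, r in rewards_by_node.items()}
    ranked = sorted(variances, key=lambda node: variances[node], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    rewards = {
        "math": [1.0, 1.0, 1.0, 1.0],               # stable: nothing to learn
        "ecom_api_selector": [1.0, 0.0, 1.0, 0.0],  # swings between right and wrong
        "master": [1.0, 1.0, 1.0, 0.0],             # occasionally misroutes
    }
    print(select_top_k_by_variance(rewards, k=2))
```

In this toy input the math agent has zero variance and is skipped, while the API selector and the master receive the update budget.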

The reward is doing three jobs at once

HiMA-R1’s reward combines three signals: accuracy, format, and efficiency.

Accuracy is task-specific. Math uses exact numerical matching. Question answering and function-calling rely on semantic or name-level matching appropriate to the task. Format reward pushes the agents to produce structured reasoning and tool-call outputs in the expected style. Efficiency reward penalises longer downstream decision chains, encouraging the system to solve tasks without unnecessary tool calls.

That last term deserves attention. In customer operations, a correct answer that takes five unnecessary API calls is not equally good. It is slower, costlier, and more fragile. The paper’s efficiency reward encodes a version of that business reality into training. The agent is not only asked, “Did you answer correctly?” It is also asked, “Did you make the machinery work harder than necessary?”
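One way to picture the three signals is as a weighted sum. The sketch below is an assumption-heavy illustration: the linear form, the weights, the `max_rounds` cap, and the function name are all hypothetical and are not the paper's reward formula. The weights are chosen so that a short wrong answer can never outscore a long correct one.

```python
def combined_reward(
    correct: bool,
    well_formatted: bool,
    rounds: int,
    max_rounds: int = 8,
    w_acc: float = 1.0,
    w_fmt: float = 0.2,
    w_eff: float = 0.2,
) -> float:
    """Accuracy dominates; format and efficiency act as smaller shaping terms."""
    accuracy = 1.0 if correct else 0.0
    fmt = 1.0 if well_formatted else 0.0
    # Fewer decision rounds earn a higher efficiency score, clipped at max_rounds.
    efficiency = 1.0 - min(rounds, max_rounds) / max_rounds
    return w_acc * accuracy + w_fmt * fmt + w_eff * efficiency


if __name__ == "__main__":
    print(combined_reward(True, True, rounds=2))   # correct and short
    print(combined_reward(True, True, rounds=6))   # correct but long
    print(combined_reward(False, True, rounds=1))  # short but wrong
```

Keeping the accuracy weight larger than the other two combined is a design choice in this sketch, not a reported setting. It prevents the efficiency term from rewarding early termination on its own.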

The ablation study supports this interpretation. Removing efficiency reward yields 42.4% accuracy and requires 1,750 update steps, while the full version reaches 44.0% accuracy in 1,112 update steps. The difference is not spectacular theatre, but it is operationally meaningful: less training churn, slightly better accuracy, and a system that is being nudged away from gratuitous reasoning rounds.

There is a useful caution here. Efficiency reward is not the same as making agents terse. A prematurely terminating system can look efficient because it stops thinking early. The paper’s own ablation notes that the weakest baseline uses fewer inference rounds partly because it lacks enough reasoning capacity. In other words, fewer steps are good only when the task is still solved. Otherwise it is just failure with excellent punctuality.

Memory is not a notebook; it is part of the policy loop

Most production agent systems treat memory as a sidecar. The model changes, prompts change, tools change, but the memory store lingers like an old shared drive full of files named “final_v7_real_final.” Eventually, the agent retrieves something that used to be useful and is now quietly harmful.

HiMA-R1 tries to make memory evolve with training. Its memory mechanism uses GRPO rewards as supervisory signals. Strong trajectories can create or reinforce memory entries. Weak trajectories can penalise or eventually delete them. Recalled memories decay over time unless reinforced. The memory buffer is therefore not a static archive; it becomes a performance-weighted operational asset.

The paper calls this a “free lunch” because the reward signal already exists for GRPO training. No separate memory trainer is required. That phrase is a little optimistic — there is still engineering, storage design, retrieval design, and threshold tuning — but the economic point is fair. If rewards already indicate which trajectories worked, those rewards can also help decide which memories deserve to survive.
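The lifecycle described above (create or reinforce on strong rewards, penalise on weak ones, decay on recall, delete below a threshold) can be sketched as a small scored store. The class name, thresholds, decay factor, and scoring arithmetic below are all assumptions for illustration; the paper's actual update rules and retrieval mechanism are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MemoryEntry:
    solution: str
    score: float = 1.0


class RewardScoredMemory:
    """Toy memory buffer whose entries live or die by trajectory rewards."""

    def __init__(self, decay: float = 0.9, delete_below: float = 0.2,
                 good_reward: float = 0.5) -> None:
        self.decay = decay
        self.delete_below = delete_below
        self.good_reward = good_reward
        self.entries: Dict[str, MemoryEntry] = {}

    def update(self, key: str, solution: str, reward: float) -> None:
        """Create or reinforce on a strong reward; penalise on a weak one."""
        entry = self.entries.get(key)
        if reward >= self.good_reward:
            if entry is None:
                self.entries[key] = MemoryEntry(solution)
            else:
                entry.score += reward
        elif entry is not None:
            entry.score -= 1.0 - reward
            if entry.score < self.delete_below:
                del self.entries[key]

    def recall(self, key: str) -> Optional[str]:
        """Return a stored solution; each recall decays the entry's score."""
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry.score *= self.decay
        if entry.score < self.delete_below:
            del self.entries[key]
        return entry.solution
```

The thresholds are where the engineering effort hides. Set `delete_below` too high and useful memories vanish after one bad trajectory; set it too low and stale entries linger.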

The case analysis makes the mechanism concrete. In one example, a math agent retrieves a similar memory but wrongly reuses it despite different numerical values. The bad memory is then penalised and becomes less likely to persist. In another case, the agent retrieves a genuinely matching function-call memory and directly reuses the successful tool pattern. This is the right distinction. Memory should reduce repeated reasoning when the situation is structurally identical, not seduce the model into lazy analogy when the numbers changed.

The ablation evidence is also relevant. Removing memory lowers accuracy from 44.0% to 40.0% and increases update steps from 1,112 to 2,464. The training curve comparison reports faster peak performance with memory: the memory-enabled system peaks at step 140, while the version without memory peaks at step 168. That supports the paper’s narrower claim: memory helps convergence and performance in this benchmark. It does not prove that memory will behave safely in every enterprise knowledge base. Please do not give the intern write access to the memory deletion thresholds and call it governance.
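The ablation figures are easier to compare as relative changes. The helper below is a small convenience sketch (the function name is made up); the inputs are the update-step counts quoted in this article for the full system, the no-efficiency-reward variant, and the no-memory variant.

```python
def step_reduction(baseline_steps: int, full_steps: int) -> float:
    """Fraction of update steps saved by the full system versus an ablation."""
    return 1.0 - full_steps / baseline_steps


if __name__ == "__main__":
    full = 1112
    print(f"vs no efficiency reward: {step_reduction(1750, full):.1%} fewer steps")
    print(f"vs no memory:            {step_reduction(2464, full):.1%} fewer steps")
```

On the quoted numbers, the full system needs roughly 36.5% fewer update steps than the variant without efficiency reward and roughly 54.9% fewer than the variant without memory.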

The evidence: strong where orchestration matters, weaker where scale still dominates

The main HiMA-Ecom comparison is easy to misread. The tempting headline is: small multi-agent system beats giant models. That is too blunt.

The more accurate version is: on this e-commerce-oriented hierarchical benchmark, HiMA-R1’s specialised multi-agent setup changes the cost-performance trade-off, especially on function-calling and collaborative tasks.

On HiMA-Ecom, the reported averages are:

Model or method	Average accuracy	What the comparison mainly tests
GPT-4o	53.0%	Closed-source frontier reference
DeepSeek-R1	47.6%	Larger reasoning-model reference
DeepSeek-V3	41.6%	Larger generalist reference
Qwen2.5-32B	37.4%	Larger open-source single-agent baseline
Qwen2.5-14B with RL	35.1%	Single-agent RL baseline
HiMA-SFT, 7B master + 3B sub-agents	39.2%	Hierarchical system without RL joint evolution
HiMA-R1, 3B master + 3B sub-agents	44.0%	Compact jointly trained multi-agent system
HiMA-R1, 7B master + 3B sub-agents	47.6%	Larger-master jointly trained multi-agent system

The strongest domain-specific result is e-commerce function calling. HiMA-R1 reaches up to 48.0% there, compared with 24.0% for DeepSeek-R1 and 10.0% for DeepSeek-V3 in the reported table. This is exactly where one would expect specialised agents and domain-specific tool training to matter. The system knows the operational surface: platform-style APIs, merchant-support logic, and the difference between retrieving policy and calling a tool.

Math is different. GPT-4o, DeepSeek-R1, and DeepSeek-V3 remain much stronger than the smaller HiMA-R1 configurations on math. That is not a failure of the paper; it is a boundary. Multi-agent structure does not magically compensate for a small specialised math agent in all settings. Architecture helps when the problem is coordination, routing, and domain tool choice. Raw reasoning scale still matters in some subdomains.

The ToolBench results extend the argument beyond the paper’s own e-commerce benchmark. On in-domain ToolBench, HiMA-R1 with 3B agents reports a 51.2% average, above DeepSeek-R1’s 47.1%, DeepSeek-V3’s 43.1%, and Qwen2.5-32B’s 46.8%, though still below GPT-4o’s 55.0%. Out of domain, HiMA-R1 reaches 57.5%, close to GPT-4o’s 59.3% and above DeepSeek-R1’s 51.8% and DeepSeek-V3’s 54.2%.

The point is not that HiMA-R1 is now the universal champion of tool use. The point is that fine-tuned, coordinated smaller agents can be unusually competitive in API-routing environments. That is the business-relevant wedge.

The ablations explain the mechanism better than the leaderboard does

The ablation table is not a side dish. It is where the paper explains which parts of the system actually matter.

Test	Likely purpose	What it supports	What it does not prove
SFT without explicit thinking vs SFT with thinking	Ablation	Structured reasoning output matters for agent decision-making; accuracy rises from 17.2% to 35.0%	It does not prove hidden chain-of-thought should be exposed to users
SFT vs RL-enhanced training	Main mechanism ablation	Joint RL improves system performance beyond supervised role training	It does not isolate every possible RL algorithm alternative
Update all agents vs top-$k$ variance-based updating	Ablation and efficiency test	Selective updating improves accuracy and reduces update steps	It does not prove top-$k$ is optimal across domains or larger models
Remove efficiency reward	Ablation	Efficiency constraints help balance accuracy, reasoning rounds, and training cost	It does not prove one fixed efficiency formula fits all operations
Remove memory	Ablation	Memory improves convergence and final accuracy in this benchmark	It does not prove memory stores are always safe, clean, or governance-ready
Vary number of sub-agents	Sensitivity test	Fewer agents can be better for individual tasks; more agents help collaborative tasks	It does not justify adding agents indiscriminately

The top-$k$ updating results are especially instructive. Updating all nodes gives 40.0% average accuracy. Updating the top-5 nodes gives 44.0%. Updating only top-1 or top-2 is worse. The lesson is not “five is magic.” The lesson is that bottleneck selection must be wide enough to capture system weaknesses but narrow enough to avoid wasting updates on stable parts of the trajectory.

The sub-agent count ablation adds another useful dose of reality. For individual QA and e-commerce function-call tasks, fewer sub-agents perform better: the master plus two sub-agents reaches 26.0% on QA and 55.0% on e-commerce function calling, compared with 22.0% and 48.0% for the full four-sub-agent setup. For collaboration, however, the full set performs best.

That is a clean operational lesson: agent count is not a vanity metric. More agents increase capability breadth, but also routing complexity and interference. Use more agents when the task genuinely requires cross-role composition. Do not create a “senior strategic reflective verifier agent” because the architecture diagram looked lonely.

What businesses can actually infer

The paper directly shows that a hierarchical multi-agent e-commerce assistant can be benchmarked and jointly trained using memory-aware SFT plus system-level RL. It also shows that HiMA-R1 can outperform several larger single-agent baselines on the authors’ e-commerce and ToolBench evaluations, especially where tool selection and domain-specific function calling are central.

Cognaptus’ business inference is narrower and more useful: vertical agent systems should be evaluated as workflows, not as chat personalities.

For e-commerce, marketplace operations, and API-heavy service desks, this implies a practical development path:

1. Define the real operational roles: intent recognition, policy retrieval, API selection, calculation, response synthesis, escalation.
2. Build agent-specific SFT data from clean workflow examples, including tool outputs and memory use.
3. Add system-level outcome data where only the user query and final answer are supervised.
4. Train the system to optimise final task success, not just individual agent fluency.
5. Use reward variance or similar diagnostics to identify which roles cause instability.
6. Treat memory as a scored operational substrate, not a scrapbook.

The return on investment is not “AI agent magic.” It is more mundane, and therefore more credible: better tool routing, less unnecessary API usage, faster convergence during training, and the possibility of using smaller models where domain structure compensates for general model scale.

This is particularly relevant where the workflow is repetitive but not trivial. Merchant support is a good example. Many requests follow recurring patterns — order labels, store deposits, settlement configuration, product audit status — but they still require correct routing and exact tool use. A model that remembers the right past solution, calls the right API, and stops when enough information has been gathered can be economically better than a larger model that performs all reasoning inside one expensive black box.

Where the result stops

The paper’s own limitations are material. The experiments use small-to-medium open-source models, mainly Qwen2.5 backbones, and the domain is e-commerce. Training uses serious infrastructure: the main text mentions 8 NVIDIA H200 GPUs, while the appendix describes RL deployment with 2 H200 GPUs for accelerated inference and real-time weight updates plus 6 GPUs for joint training. This is not a weekend prompt-engineering recipe.

The benchmark is also specialised. That is a strength for e-commerce realism and a weakness for general claims. The results do not prove that the same method will work equally well in legal operations, medical triage, financial advisory, software engineering, or procurement. Those domains have different reward surfaces, tool risks, latency constraints, and memory-governance problems.

The evaluation metrics also deserve sober interpretation. Function-call correctness based on API names, QA semantic similarity thresholds, and exact-match math are reasonable benchmark choices, but production environments often need stricter checks: argument correctness, permission control, compliance constraints, user identity validation, auditability, and recovery from tool failure. A model that selects the right API name can still pass the wrong parameter. Enterprise systems are annoying like that.
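The gap between name-level and argument-level checking is easy to demonstrate. The sketch below uses a made-up tool name and made-up arguments, and it represents a call as a plain dictionary. It is not the benchmark's scoring code, only an illustration of why a stricter production check catches more.

```python
from typing import Any, Dict


def name_match(predicted: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Benchmark-style check: only the API name has to agree."""
    return predicted.get("name") == expected.get("name")


def strict_match(predicted: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Stricter check: the API name and every argument must agree."""
    return (name_match(predicted, expected)
            and predicted.get("arguments") == expected.get("arguments"))


if __name__ == "__main__":
    expected = {"name": "get_store_deposit",
                "arguments": {"store_type": "personal", "category": "toys"}}
    predicted = {"name": "get_store_deposit",
                 "arguments": {"store_type": "enterprise", "category": "toys"}}
    print(name_match(predicted, expected))    # right tool
    print(strict_match(predicted, expected))  # wrong parameter
```

Permission checks, identity validation, and audit logging would sit on top of a check like `strict_match`; they are outside the scope of this sketch.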

Finally, the comparison against frontier models should not be overgeneralised. GPT-4o remains strongest on the HiMA-Ecom average. HiMA-R1’s advantage is concentrated in domain-specialised coordination and tool-calling efficiency. That is enough to matter. It is not enough to declare the end of large generalist models, a genre of claim that should probably be retired for public health reasons.

The real lesson: train the workflow, not the mascot

HiMA-Ecom and HiMA-R1 are valuable because they move the multi-agent discussion away from theatrical collaboration and toward trainable workflow structure. The system has roles. The roles have memories. The agents are updated jointly from final outcomes. Training focuses on unstable trajectory points instead of treating every component as equally guilty.

That is exactly the kind of idea vertical AI needs. Businesses do not need more agent diagrams with arrows, icons, and heroic job titles. They need systems that know when to retrieve, when to call, when to calculate, when to reuse memory, and when to stop.

The paper does not solve all of multi-agent AI. It does something more useful: it turns a common operational fantasy — many specialised AI workers collaborating smoothly — into a benchmarked training problem. Not glamorous. Much better.

Cognaptus: Automate the Present, Incubate the Future.

¹ Junxing Hu et al., “HiMA-Ecom: Enabling Joint Training of Hierarchical Multi-Agent E-commerce Assistants,” arXiv:2506.19846v2, 1 April 2026, https://arxiv.org/abs/2506.19846.
