A customer-support bot fails in the most ordinary way.
It retrieves the right policy document. It identifies the right customer case. It even quotes the correct refund condition. Then, somewhere between retrieval and answer synthesis, it forgets that the customer bought the product through a reseller, not directly from the company. The final answer is plausible, polite, and wrong. The system did not lack information. It lacked coordination.
That is the useful starting point for reading HERA, the framework proposed in “Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts” by Sha Li and Naren Ramakrishnan.[1] The paper is not mainly saying, “Use more agents.” That would be the usual agentic-AI reflex: when one LLM is unreliable, add a committee, give everyone a title, and hope the meeting minutes become intelligence. Very Silicon Valley. Very expensive.
The more interesting claim is narrower and stronger: multi-agent RAG systems should remember why previous executions succeeded or failed, then use that memory to reshape both the agent topology and the role-specific prompts. In HERA, experience is not a chat history dump. It is an operational prior.
That distinction matters because many enterprise RAG systems already have the visible parts: retrievers, rerankers, query rewriters, answer generators, validators, maybe a reflective agent with a reassuring name. What they often lack is a way to improve the coordination pattern without retraining the underlying model. HERA attacks that missing layer.
HERA’s main move is to optimize the workflow, not the model weights
The paper frames HERA as a hierarchical framework for multi-agent RAG. The underlying LLMs remain frozen. HERA instead evolves two things around them:
- the orchestration topology: which agents are selected, in what order, with what dependencies, and whether some steps should run in parallel;
- the role-specific prompts: how each agent should behave after the system has diagnosed its recurring failures.
This is an important shift. In a conventional training story, improvement means updating parameters. In a conventional prompt-engineering story, improvement means editing a prompt by intuition, usually after someone sees a bad answer and mutters darkly at the screen. HERA sits between these two. It treats successful and failed trajectories as data, extracts reusable natural-language insights from them, and feeds those insights back into orchestration and prompt design.
The paper calls this a training-free framework. That phrase should not be misunderstood. HERA is not “free” in the sense of zero computation. It still samples candidate topologies, executes agent trajectories, evaluates results, compares successes and failures, updates an experience library, and sometimes replays failed trajectories with prompt variants. The saving is that it avoids gradient-based model updates. The optimization happens through structured experience and prompts, not through changing the LLM weights.
A compact way to see the architecture is this:
| Layer | What HERA changes | What remains fixed | Practical meaning |
|---|---|---|---|
| Orchestrator | Agent selection, execution order, dependencies | Base orchestrator model weights | The system learns better workflow patterns without retraining the model. |
| Experience library | Query-type insights and utility records | Corpus and agent pool, unless separately changed | Past runs become reusable operational knowledge. |
| Execution agents | Role-specific prompt rules and behavioral principles | Agent LLM weights | Agents improve by sharper instructions tied to their actual failure modes. |
| Topology mutation | Replacement or augmentation of failed structures | General optimization loop | Persistent failure can trigger structural redesign, not just another retry. |
This is why a mechanism-first reading is better than a benchmark-first reading. The reported results are impressive, but the real business question is not “How high is the F1 score?” It is: what kind of system learns from its own work without becoming an ungoverned pile of agent chatter?
HERA’s answer is: a system that turns execution traces into structured coordination knowledge.
The orchestrator learns by comparing trajectories, not by admiring its own answer
The first mechanism is structure-level policy optimization. HERA samples a group of candidate agent sequences for a query. Each sequence is executed. The resulting trajectories are ranked first by task performance, such as F1, and then by efficiency, measured by total input and output tokens consumed across the agents.
Then comes the key step: the orchestrator is asked to explain why successful trajectories worked and failed trajectories did not. The paper describes these explanations as group-relative semantic advantages. In plainer language, HERA replaces a numeric gradient with a structured verbal diagnosis.
That sounds softer than reinforcement learning, but for multi-agent RAG it may be more useful. A scalar reward can say that one trajectory was better. It does not naturally say that the evidence selector should have been placed before the answer generator, or that comparison questions benefit from parallel retrieval followed by serial aggregation. HERA’s reflective comparison can encode such operational patterns directly.
The paper is careful about the comparison condition: insight extraction focuses on groups that contain both successful and failed trajectories. That is not a cosmetic detail. If all sampled workflows succeed, the system learns little about what was necessary. If all fail, it learns little about what works. Mixed groups create contrast, and contrast creates diagnosis.
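The ranking and the mixed-group condition can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Trajectory` fields, the agent names, and the 0.5 success threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    agents: tuple[str, ...]   # ordered agent sequence that was executed
    f1: float                 # task score for the final answer
    tokens: int               # total input + output tokens across agents


def rank_group(group: list[Trajectory]) -> list[Trajectory]:
    """Best first: higher F1 wins, fewer tokens breaks ties."""
    return sorted(group, key=lambda t: (-t.f1, t.tokens))


def is_mixed(group: list[Trajectory], success_f1: float = 0.5) -> bool:
    """True when the group holds at least one success and one failure.

    The 0.5 threshold is an assumption made for this sketch.
    """
    outcomes = {t.f1 >= success_f1 for t in group}
    return outcomes == {True, False}


group = [
    Trajectory(("retriever", "answerer"), f1=0.0, tokens=1800),
    Trajectory(("rewriter", "retriever", "selector", "answerer"), f1=1.0, tokens=5200),
    Trajectory(("retriever", "selector", "answerer"), f1=1.0, tokens=3900),
]

ranked = rank_group(group)
if is_mixed(group):
    # Only contrastive groups are worth sending to the reflection step.
    best, worst = ranked[0], ranked[-1]
    print("contrast:", best.agents, "vs", worst.agents)
```

In this toy group, the three-agent sequence ranks first because it matches the best F1 with fewer tokens, and the two-agent sequence supplies the failure that makes the comparison informative.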
For enterprise teams, this maps to a familiar problem. Many AI workflow logs are forensic, not developmental. They are kept so someone can investigate failure after the fact. HERA suggests a different use: logs should become a source of reusable workflow rules. The system should not only store “what happened.” It should store “what pattern should be preferred next time, for this kind of task.”
The experience library is small institutional memory, not a transcript warehouse
HERA’s experience library stores insights in a Profile–Insight–Utility structure. The profile describes the query characteristics or type. The insight is a natural-language strategy. The utility records how often that insight has helped subsequent orchestration.
This design matters because raw memory is cheap and mostly useless. A transcript warehouse can grow forever while making retrieval slower, noisier, and more embarrassing. HERA tries to keep memory operational by updating it through four actions:
| Operation | What it does | Why it matters |
|---|---|---|
| ADD | Inserts a distinct new insight | Captures a genuinely new coordination lesson. |
| MERGE | Combines similar or complementary insights | Prevents fragmentation of near-duplicate rules. |
| PRUNE | Removes conflicting or low-utility entries | Keeps bad habits from becoming “organizational knowledge.” |
| KEEP | Leaves the library unchanged | Avoids pretending that every run teaches something profound. |
The last point deserves more respect than it usually gets. Systems that “learn from everything” often learn noise with excellent enthusiasm. HERA’s consolidation step is a small but important guardrail: experience must remain generalizable enough to guide future runs.
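A rough data-structure sketch makes the four operations concrete. Everything here is hypothetical: the class and field names are invented, utilities are summed on merge as a simplifying assumption, and in HERA the decision about which operation to apply comes from the model's reflection, not from hand-written calls like these.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    profile: str        # query type the insight applies to
    insight: str        # natural-language strategy
    utility: int = 0    # how often the insight has helped so far


@dataclass
class ExperienceLibrary:
    entries: list[Entry] = field(default_factory=list)

    def add(self, profile: str, insight: str) -> None:
        self.entries.append(Entry(profile, insight))

    def merge(self, i: int, j: int, merged_insight: str) -> None:
        """Replace entries i and j with one combined entry."""
        a, b = self.entries[i], self.entries[j]
        # Summing utilities is an assumption of this sketch.
        combined = Entry(a.profile, merged_insight, a.utility + b.utility)
        self.entries = [e for k, e in enumerate(self.entries) if k not in (i, j)]
        self.entries.append(combined)

    def prune(self, min_utility: int) -> None:
        self.entries = [e for e in self.entries if e.utility >= min_utility]

    def keep(self) -> None:
        """Explicit no-op: this run taught nothing worth storing."""


lib = ExperienceLibrary()
lib.add("comparison", "Retrieve both entities in parallel.")
lib.add("comparison", "Aggregate serially after parallel retrieval.")
lib.entries[0].utility = 3
lib.entries[1].utility = 2
lib.merge(0, 1, "Retrieve both entities in parallel, then aggregate serially.")
lib.add("temporal", "Answer from the first retrieved date.")
lib.prune(min_utility=1)
print([(e.profile, e.utility) for e in lib.entries])
```

The point of writing `keep` as a named method is the same as the point of the KEEP row: doing nothing should be a deliberate outcome, not an accident.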
When a new query arrives, the orchestrator retrieves relevant experience entries by balancing empirical utility and diversity. Utility favors insights that have worked before. Diversity prevents the system from retrieving five versions of the same lesson and calling that wisdom. The result is an experience-driven prior over possible topologies.
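One way to picture the utility-versus-diversity balance is a greedy selection in the style of maximal marginal relevance. This is a swapped-in technique used for illustration, not the paper's retrieval rule; the word-overlap similarity, the weight `lam`, and the example insights are all assumptions.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, a crude stand-in for embedding similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def select_insights(candidates: list[tuple[str, float]], k: int, lam: float = 0.7) -> list[str]:
    """Greedily pick k insights, trading utility against redundancy.

    candidates: (insight text, utility in [0, 1]) pairs.
    lam: weight on utility; (1 - lam) penalizes similarity to earlier picks.
    """
    chosen: list[str] = []
    pool = dict(candidates)
    while pool and len(chosen) < k:
        def score(text: str) -> float:
            redundancy = max((jaccard(text, c) for c in chosen), default=0.0)
            return lam * pool[text] - (1 - lam) * redundancy
        best = max(pool, key=score)
        chosen.append(best)
        del pool[best]
    return chosen


candidates = [
    ("retrieve both entities in parallel then compare", 0.9),
    ("retrieve both entities in parallel then compare them", 0.85),
    ("validate effective dates before answer synthesis", 0.6),
]
picked = select_insights(candidates, k=2)
print(picked)
```

With these numbers the near-duplicate second insight loses to the lower-utility but distinct third one, which is the behavior the paragraph describes.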
This is one of the paper’s more business-relevant ideas. In an enterprise setting, the experience library would not just be a technical cache. It would be a form of process memory: “for contract comparison, retrieve both clauses in parallel, then reconcile obligations”; “for policy exceptions, validate effective dates before answer synthesis”; “for financial figures, hand off arithmetic to a deterministic tool.”
The paper does not test those enterprise cases directly. Its experiments use QA and fact-verification benchmarks over Wikipedia-style corpora. But the architectural lesson is portable: durable AI workflows need curated procedural memory, not only bigger retrieval context.
Role-aware prompt evolution fixes the agent, not just the plan
Better orchestration can still fail if the agents themselves keep making role-specific mistakes. HERA addresses this with Role-aware Prompt Evolution, or RoPE.
RoPE begins with credit assignment. When a trajectory fails, the orchestrator identifies which agent contributed most to the failure. HERA then maintains a buffer of recent failed trajectories for that agent. Prompt variants are generated along behavioral axes such as thoroughness, risk sensitivity, error correction, and heuristic injection. The system re-executes the whole original trajectory with these variants and compares the results.
From that contrastive analysis, HERA extracts two kinds of prompt updates:
| Prompt update type | Time horizon | Example of what it could encode |
|---|---|---|
| Operational rules | Short-term correction | “When comparing two entities, retrieve the target attribute for both before generating the answer.” |
| Behavioral principles | Longer-term strategy | “Prefer explicit decomposition when the query contains hidden dependency between entities.” |
The paper’s own formal expression is simple: the prompt update combines immediate operational corrections with longer-term behavioral principles. Then prompt consolidation integrates selected updates while pruning redundancy and keeping the prompt coherent.
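The replay-and-compare loop can be sketched as follows. This is a deliberately simplified stand-in: it keeps whichever variant scores best on replay, whereas HERA extracts rules and principles from the contrast and consolidates them into the prompt. The buffer size, the variant wording, and the `replay` callback are hypothetical.

```python
from collections import defaultdict, deque
from typing import Callable

AXES = ("thoroughness", "risk sensitivity", "error correction", "heuristic injection")

# Bounded buffer of recent failures per agent (size 5 is an assumption).
failure_buffer: dict[str, deque] = defaultdict(lambda: deque(maxlen=5))


def make_variants(base_prompt: str) -> dict[str, str]:
    """One candidate prompt per behavioral axis (wording is illustrative)."""
    return {axis: f"{base_prompt}\nEmphasize {axis}." for axis in AXES}


def evolve_prompt(agent: str, base_prompt: str, replay: Callable[[str, str], float]) -> str:
    """Replay the agent's buffered failures under each variant; keep the best one.

    replay(prompt, failed_query) is a caller-supplied function that re-runs the
    whole trajectory with the new prompt and returns a score such as F1.
    """
    failures = list(failure_buffer[agent])
    if not failures:
        return base_prompt

    def mean_score(prompt: str) -> float:
        return sum(replay(prompt, q) for q in failures) / len(failures)

    best_prompt, best_score = base_prompt, mean_score(base_prompt)
    for variant in make_variants(base_prompt).values():
        s = mean_score(variant)
        if s > best_score:
            best_prompt, best_score = variant, s
    return best_prompt


failure_buffer["answer_generator"].append("Which of A or B was founded first?")


def fake_replay(prompt: str, query: str) -> float:
    # Stand-in for real execution: pretend only the error-correction variant helps.
    return 1.0 if "error correction" in prompt else 0.0


new_prompt = evolve_prompt("answer_generator", "Answer from the evidence.", fake_replay)
print(new_prompt)
```

The design choice worth noticing is that the score comes from the full trajectory, not from the agent's local output, so a variant only wins if it improves the end result.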
This is a practical improvement over the normal enterprise prompt ritual. The usual process is: user reports bad output; product manager asks for “more caution”; prompt grows by another paragraph; model becomes slightly more verbose and not necessarily more correct. HERA’s RoPE is more disciplined. It asks: which agent failed, under what recurring pattern, and which prompt variant actually improves the full trajectory?
The important word is role-aware. A retriever, evidence selector, context validator, and answer generator should not receive the same generic instruction to “be accurate.” Accuracy is not a job description. The retriever needs search and coverage discipline. The evidence selector needs relevance discrimination. The answer generator needs grounded synthesis. The context validator needs contradiction and sufficiency checks. HERA’s prompt evolution respects those differences.
Topology mutation admits that some failures are structural
The third mechanism is topology mutation. If trajectories consistently fail, for example with F1 equal to zero, HERA explores alternative structures. It may replace a failed agent with another one, or augment the topology with an additional agent. These candidates then re-enter the same gradient-free optimization loop.
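The two mutation moves, replacement and augmentation, can be sketched on a linear agent sequence. The function below is an illustrative assumption: it treats the topology as a simple list, draws helpers from unused agents in a hypothetical pool, and always inserts the helper before the failed agent.

```python
def mutate(topology: list[str], failed_agent: str, agent_pool: list[str]) -> list[list[str]]:
    """Return candidate topologies that replace or augment the failed agent."""
    idx = topology.index(failed_agent)
    spares = [a for a in agent_pool if a not in topology]
    candidates = []
    for spare in spares:
        # Replacement: swap the failed agent for an unused one.
        candidates.append(topology[:idx] + [spare] + topology[idx + 1:])
        # Augmentation: keep the failed agent and insert a helper before it.
        candidates.append(topology[:idx] + [spare] + topology[idx:])
    return candidates


pool = ["retriever", "evidence_selector", "context_validator", "answer_generator"]
failing = ["retriever", "evidence_selector", "answer_generator"]
mutated = mutate(failing, "evidence_selector", pool)
for candidate in mutated:
    print(candidate)
```

Each candidate would then be executed and scored like any other sampled topology.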
This is a small but necessary idea. Some failures cannot be fixed by asking the same workflow to “try harder.” If a pipeline lacks a symbolic calculation step, better wording may not solve arithmetic. If a query requires temporal interval intersection, parallel retrieval plus ordinary answer generation may still produce the union rather than the intersection. The paper’s case study on temporal multi-hop QA shows exactly that: the system retrieves the correct activity periods but returns the union interval, not the intersection. The failure is not retrieval. It is reasoning composition.
That case is valuable because it prevents the article from becoming a victory lap. HERA improves coordination, but coordination is not magic. Some tasks require a different tool, a different reasoning module, or a stricter output contract. Agentic systems fail most dangerously when each component does its local job and the composition is still wrong.
The main evidence says HERA is broad, not unbeatable in every cell
The paper evaluates HERA on six knowledge-intensive benchmarks: HotpotQA, 2WikiMultiHopQA, MuSiQue, AmbigQA, Bamboogle, and HoVer. The first four are in-domain multi-hop or ambiguous QA settings, while Bamboogle and HoVer are used for out-of-distribution evaluation. The system uses Wikipedia as the corpus, BGE retrieval, Qwen-3-14B or Llama-3.1-8B as orchestrator backbones, and GPT-4o-mini as the agent model. The underlying backbones remain frozen.
The headline number is a reported 38.69% average improvement over recent baselines. That is the obvious sentence. It is also the sentence most likely to make readers stop thinking. So let us slow it down.
HERA-Qwen is not the top entry in every metric cell. For example, AceSearcher has a higher HotpotQA F1 in the main comparison table, and ExSearch has stronger Bamboogle F1. But HERA is consistently strong across datasets and metrics, particularly on MuSiQue, AmbigQA, and HoVer, where multi-step reasoning, ambiguity handling, and fact verification stress the coordination layer.
A concise reading of the main result is therefore: HERA’s advantage is not that it wins every isolated metric; it is that it performs broadly well across heterogeneous multi-hop and ambiguity-heavy tasks while preserving efficiency.
| Evidence block | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main benchmark table | Comparison with prior work | HERA is broadly competitive or superior across multiple knowledge-intensive QA and verification datasets. | It does not prove universal superiority across all RAG tasks or enterprise domains. |
| Ablation studies | Component contribution | Experience library and prompt evolution both matter, with prompt evolution often causing larger drops when removed. | It does not isolate every possible interaction with retriever quality or corpus design. |
| Token-efficiency analysis | Efficiency and scaling interpretation | Gains do not appear to come simply from brute-force token expansion. | It does not provide full production cost, latency, or infrastructure benchmarks. |
| Topology evolution metrics | Mechanism analysis | Agent networks move from exploratory, diffuse structures toward compact, high-utility coordination. | It does not prove that the same topology dynamics will emerge in regulated enterprise workflows. |
| Case studies | Error diagnosis and interpretability | Success and failure examples clarify when topology helps and when missing reasoning modules still matter. | Case studies are illustrative, not statistical proof. |
This is the kind of evidence package we should want from an agentic RAG paper. The main table tells us whether the system works. The ablations tell us which parts matter. The token analysis tells us whether the method is merely spending more. The topology metrics tell us whether the internal coordination pattern changes in the intended direction. The case studies tell us how the system fails when it fails.
Not glamorous. Useful.
The ablations show two different failure modes: procedure and grounding
The ablation results are especially important because they prevent a lazy interpretation of HERA as “memory improves RAG.” Yes, memory helps. But the paper’s ablations suggest a more specific split.
Removing the experience library produces consistent but moderate performance declines, typically a relative drop in accuracy of around 6% to 15% in the Qwen-3-14B setting. Removing prompt evolution causes larger and more variable drops on multi-hop QA, reaching a relative drop of up to about 30% on 2WikiQA for both backbones. The authors interpret this as evidence that the two components address different problems: prompt evolution improves procedural correctness, while the experience library improves epistemic reliability.
That distinction is useful for system design.
If an AI workflow often retrieves weak or irrelevant evidence, the problem is not mainly that the answer generator needs a better personality. The experience library and retrieval-aware orchestration matter more. If the system retrieves the right facts but combines them incorrectly, the bottleneck is procedural: decomposition, validation, calculation, or synthesis behavior. RoPE is aimed at that level.
The full HERA framework outperforms both ablated versions by a non-additive margin. In business language: the components reinforce each other. Better agent behavior produces cleaner trajectories; cleaner trajectories produce better stored experience; better stored experience improves future orchestration. This is the compounding loop HERA wants.
The appendix strengthens the same pattern with Llama-3.1-8B. There, removing prompt evolution drops F1 from 58.0 to 51.2 on HotpotQA, from 55.5 to 53.3 on 2WikiQA, and from 60.8 to 58.0 on AmbigQA. Removing the experience library also hurts, including F1 drops from 58.0 to 52.5 on HotpotQA and from 55.5 to 46.75 on 2WikiQA. The exact magnitudes vary, but the story is consistent: HERA is not one trick. It is an interaction between memory and role-specific behavioral repair.
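To put those appendix F1 figures on one scale, it helps to convert them to relative drops. The short script below uses only the numbers quoted above; the helper name is made up for the example. The results come to roughly 11.7%, 4.0%, and 4.6% without prompt evolution, and roughly 9.5% and 15.8% without the experience library.

```python
def relative_drop(full: float, ablated: float) -> float:
    """Percentage of the full system's score lost when a component is removed."""
    return 100.0 * (full - ablated) / full


rows = [
    ("HotpotQA", "no prompt evolution", 58.0, 51.2),
    ("2WikiQA", "no prompt evolution", 55.5, 53.3),
    ("AmbigQA", "no prompt evolution", 60.8, 58.0),
    ("HotpotQA", "no experience library", 58.0, 52.5),
    ("2WikiQA", "no experience library", 55.5, 46.75),
]
for dataset, ablation, full, ablated in rows:
    print(f"{dataset:9s} {ablation:22s} {relative_drop(full, ablated):5.1f}%")
```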
Token efficiency is not a side issue; it is the anti-brute-force argument
Agentic RAG papers often hide a simple bargain: better answers, more tokens. Sometimes that bargain is acceptable. Often it is just a very expensive way to discover that meetings get longer when more agents are invited.
HERA’s token analysis is therefore not a minor supporting chart. It addresses a central business concern: can the system improve reasoning without turning every query into a small cloud-compute festival?
The paper reports an exploration-to-exploitation pattern. On HotpotQA, token consumption initially spikes as the orchestrator activates more agents and explores extended reasoning traces. As experience accumulates, token usage declines and plateaus, suggesting that the system prunes low-efficiency agents and shortens reasoning chains. On 2WikiQA, the paper reports no similar initial exploratory peak, which the authors interpret as evidence that accumulated priors transfer into more efficient orchestration.
The performance-versus-token comparison also matters. HERA-Qwen achieves the highest absolute performance, while HERA-Llama is reported to use roughly 4.7k to 6.6k tokens across most datasets while maintaining competitive F1. CORAG, by contrast, is described as consuming 20k or more tokens across many datasets without converting that cost into consistently better reasoning performance.
For enterprise adoption, this is where the mechanism becomes economically interesting. The value is not simply “higher accuracy.” It is the possibility of cheaper diagnosis: the system learns which agents are worth invoking for which query type. If that holds in production, orchestration memory can reduce waste in exactly the place agentic systems tend to create it: redundant multi-step reasoning.
That remains an inference, not a direct production benchmark. The paper does not measure enterprise latency, cloud cost, human review cost, or failure-remediation cost. But the token evidence supports the direction: HERA’s gains are not explained by indiscriminate context expansion.
The topology analysis is the paper’s most overlooked business clue
HERA introduces topology metrics to study how agent interaction structures evolve. This is easy to skip because it sounds like a research appendix wearing a graph-theory hat. It should not be skipped.
The paper uses transition entropy to measure uncertainty in agent-to-agent transitions. Entropy initially rises and then stabilizes at an intermediate level. The authors interpret this as structured flexibility: the system consolidates useful paths without collapsing into a rigid pipeline.
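For readers who want a feel for the metric, here is one plausible way to compute an entropy over agent-to-agent transitions. The paper's exact definition may differ; this sketch assumes Shannon entropy, in bits, over the empirical distribution of observed `(source, destination)` pairs, and the agent names are invented.

```python
import math
from collections import Counter


def transition_entropy(trajectories: list[list[str]]) -> float:
    """Shannon entropy (bits) of the empirical agent-to-agent transition distribution."""
    counts = Counter(
        (src, dst)
        for traj in trajectories
        for src, dst in zip(traj, traj[1:])
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# A rigid pipeline repeats one transition; an exploratory phase spreads over many.
rigid = [["retriever", "answerer"]] * 4
exploratory = [
    ["retriever", "answerer"],
    ["rewriter", "retriever"],
    ["retriever", "selector"],
    ["selector", "answerer"],
]
print(transition_entropy(rigid), transition_entropy(exploratory))
```

Under this definition a fully rigid pipeline scores zero, and the score rises as the system spreads its runs across more distinct handoffs.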
It also analyzes graph metrics such as number of agents, node efficiency, self-loops, cycles, and diameter. Across learning phases, trajectories move from narrow and shallow structures, to broader exploration, and then toward compact chains with higher per-agent utility and fewer redundant loops. Some cycles remain, presumably where iterative verification still helps.
This is the paper’s systems lesson. A good multi-agent RAG system should not maximize agent count. It should maximize useful coordination. Sometimes that means adding agents. Sometimes it means removing them. Sometimes it means preserving a loop because verification is useful. Sometimes it means cutting the loop because the system is just pacing around the room.
For business readers, topology metrics are also a governance idea. In production, we should not only monitor final answer quality. We should monitor workflow shape:
| Operational signal | Bad version | Better version |
|---|---|---|
| Agent count | More agents for every query | Query-type-specific agent selection |
| Self-loops | Repeated reflection without new evidence | Bounded validation tied to uncertainty |
| Diameter | Long chains by default | Longer chains only for genuinely multi-hop tasks |
| Cycles | Endless revise-and-critique loops | Targeted verification cycles |
| Node efficiency | Many agents with marginal contribution | Fewer agents with clearer responsibility |
This is where HERA turns from a QA method into a design pattern for AI operations. The system becomes inspectable not only by outputs but by its internal coordination behavior.
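A production team could start with very cheap shape signals computed per run. The function below is a hypothetical monitoring sketch, not the paper's metric definitions: it works on a single executed agent sequence, and richer metrics such as diameter or node efficiency would need a graph aggregated across many runs.

```python
def workflow_shape(trajectory: list[str]) -> dict[str, int]:
    """Cheap per-run shape signals computed from the executed agent sequence."""
    steps = list(zip(trajectory, trajectory[1:]))
    return {
        "agent_count": len(set(trajectory)),          # distinct agents invoked
        "chain_length": len(trajectory),              # total steps, a proxy for depth
        "self_loops": sum(a == b for a, b in steps),  # agent hands off to itself
        # Repeat visits of any kind (self-loops or returns to an earlier agent).
        "revisits": len(trajectory) - len(set(trajectory)),
    }


run = ["retriever", "validator", "validator", "retriever", "answerer"]
shape = workflow_shape(run)
print(shape)
```

Tracked over time and broken down by query type, numbers like these would show whether loops and chain length are earning their cost.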
The case studies are useful because they are not all flattering
The paper’s case studies include both successes and failures. The successful comparison case shows a sensible pattern: retrieve independent entity facts in parallel, then aggregate serially for comparison. That is exactly the kind of topology one would hope a multi-agent RAG system learns.
The causal case separates retrieval of causal evidence from reasoning over uncertainty. That matters because causal questions often fail when systems retrieve related facts and then overstate certainty. HERA’s serial dependency-aware workflow is meant to keep evidence grounding before causal synthesis.
The failures are more instructive.
In the temporal case, the system correctly retrieves two time intervals: 1914–1938 and 1921–1944. But the answer generator returns 1914–1944, the union, instead of 1921–1938, the intersection. This is not a retrieval failure. It is a set-operation failure. A business equivalent would be retrieving the correct contract start and end dates for two obligations, then calculating the wrong overlap period. The answer can look authoritative precisely because the evidence is correct.
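The set operation itself is trivial once it is handed to code, which is the point. A minimal sketch, assuming closed year intervals and using the two periods from the case study; the function names are illustrative.

```python
from typing import Optional

Interval = tuple[int, int]  # (start_year, end_year), inclusive


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Overlap of two closed intervals, or None when they do not overlap."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def union_span(a: Interval, b: Interval) -> Interval:
    """Smallest interval covering both inputs: the wrong answer in the case study."""
    return (min(a[0], b[0]), max(a[1], b[1]))


first, second = (1914, 1938), (1921, 1944)
print(intersect(first, second))   # (1921, 1938)
print(union_span(first, second))  # (1914, 1944)
```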
In the intersection case, the query requires identifying a shared property between two people. The intended property is “Soviet,” but the query rewriter shifts toward what they were known for, causing retrieval and evidence selection to emphasize professional roles. The final answer becomes “scientists,” which is plausible but semantically misaligned. This is not random hallucination. It is property-dimension drift.
These failures define the boundary of HERA’s contribution. Better orchestration and prompt evolution reduce many coordination errors. They do not eliminate the need for explicit symbolic tools, schema constraints, typed property extraction, or deterministic verification when the task requires them.
What Cognaptus would infer for enterprise RAG design
The paper directly shows improved benchmark performance, meaningful ablation patterns, token-efficiency evidence, and evolving topology behavior in multi-hop and ambiguous QA settings. From that, Cognaptus would infer five practical design rules for enterprise RAG systems.
First, store comparative execution lessons, not just documents and conversation logs. A useful memory entry should say something like: “For exception-policy questions, validate effective date and jurisdiction before synthesis.” That is operational knowledge. A transcript is just a fossil.
Second, make routing depend on query type. A comparison query, temporal query, ambiguous query, and causal query do not deserve the same agent sequence. Static pipelines are comfortable because they are easy to diagram. They are also how error propagation gets a long-term lease.
Third, evolve prompts by role and evidence. Prompt updates should be tied to failed trajectories and tested variants. “Be more careful” is not an improvement strategy. It is a note someone writes when they do not know which subsystem failed.
Fourth, monitor topology as a production metric. If every query activates every agent, the system is not intelligent; it is bureaucratic. If reflection loops increase without improving outcomes, the system is not thinking harder. It is stalling.
Fifth, route formal reasoning to formal tools. HERA’s temporal failure is a clean warning. When the task requires interval overlap, arithmetic, compliance logic, or structured comparison, the system should use deterministic modules or typed validators. LLM agents can coordinate; they should not be trusted to silently perform every operation inside prose.
Where the paper’s evidence stops
The paper is under review, and its experiments are benchmark experiments, not production deployments. That does not invalidate the results. It just tells us how far to carry them.
The strongest evidence is for multi-hop and ambiguous QA over Wikipedia-style corpora, plus fact verification in HoVer. The system uses curated training samples based on reasoning type and complexity. It uses GPT-4o-mini as agents and Qwen-3-14B or Llama-3.1-8B as orchestrator backbones. Different enterprise corpora, weaker retrievers, domain-specific ontologies, access-control constraints, and noisy internal documents may change the behavior substantially.
The paper also does not settle the operational cost question. Token usage is a useful proxy, but production cost includes latency, parallelism, infrastructure overhead, logging, audit, human escalation, and failure remediation. A system can save tokens and still be awkward to run if orchestration is too complex.
Finally, HERA’s own case studies show that reasoning composition failures can survive successful retrieval. This is not a minor limitation. In legal, finance, procurement, healthcare administration, and compliance workflows, a wrong intersection or property dimension can be more dangerous than a failed retrieval because the answer carries the smell of evidence.
So the correct business reading is not: “HERA solves enterprise RAG.” The correct reading is: “HERA identifies a missing learning layer between static RAG pipelines and expensive model retraining.” That layer is experience-driven orchestration.
The real product idea is an AI workflow that remembers process, not just facts
Most enterprise RAG discussions are still too document-centered. Which vector database? Which chunk size? Which reranker? Which context window? These are real questions, but they mostly concern how the system finds information.
HERA asks a different question: after the system has tried to answer many hard questions, what does it learn about how to organize the work?
That is the more mature direction for agentic AI. The future enterprise assistant should not merely retrieve facts and generate text. It should recognize task structure, choose an appropriate workflow, assign roles, validate intermediate outputs, learn from failures, compress useful experience, and avoid repeating expensive mistakes.
Not because agents are magical. Because process memory is valuable.
HERA’s contribution is to show one plausible way to build that memory without retraining the model: compare trajectories, extract semantic advantages, consolidate experience, evolve prompts, mutate topology, and let the workflow become more selective over time.
The amusing part is that this makes multi-agent RAG look less like artificial intelligence and more like a competent operations team: remember what worked, stop inviting unnecessary people, update role instructions after mistakes, and call in a calculator when the job is arithmetic.
Apparently, the future of AI may involve learning basic management. Progress comes in strange forms.
Cognaptus: Automate the Present, Incubate the Future.
[1] Sha Li and Naren Ramakrishnan, “Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts,” arXiv:2604.00901v2, 2026, https://arxiv.org/abs/2604.00901.