TL;DR for operators
Enterprise RAG does not fail because the chatbot forgot to sound confident. It fails because the answer is often scattered across the least glamorous parts of the company: Slack threads, meeting transcripts, pull requests, document revisions, customer reports, employee metadata, and URLs somebody pasted into a chat six weeks ago.
The HERB benchmark from Salesforce AI Research makes that problem explicit [1]. It creates a synthetic but workflow-grounded enterprise environment with 39,190 artifacts and 1,514 queries, including 815 answerable and 699 unanswerable questions. The point is not to ask whether a model can summarize a clean document. That is the demo version of enterprise AI. HERB asks whether a system can find the right evidence across messy, connected, source-diverse work.
The results are uncomfortable. Standard RAG methods perform poorly. Graph-based methods do not magically fix the problem. The best standard baseline, a hybrid dense-plus-BM25 retriever, reaches an average score of only 20.61. A GPT-4o-based ReAct agent improves the result to 32.96, but that is still far from operational reliability. Long-context models do better when the search space is narrowed to product-specific data, with Gemini-2.5-Flash reaching 76.55, but even then the model is not consistently solving the task. In the oracle setting, where models receive only the linked ground-truth evidence, the best model reaches 85.76, which is better but still not clean.
For operators, the lesson is simple and annoyingly expensive: buying “RAG” is not the same as buying institutional memory. The hard part is not just embedding files. It is source selection, multi-hop retrieval, metadata resolution, temporal reasoning, unanswerability detection, and auditable evidence assembly. In plain terms: the system must know where to look, what to ignore, and when to stop pretending the answer exists.
The business implication is that enterprise AI should be evaluated like an evidence operation. Before deploying a RAG assistant into product, support, legal, finance, or engineering workflows, test whether it can retrieve all necessary evidence, use structured tools correctly, identify missing evidence, and explain what sources support the answer. Otherwise, the system is not grounded. It is merely eloquent near some documents.
The awkward result: better RAG still retrieves the wrong enterprise
The strongest part of HERB is not the dataset description. It is the sequence of failures.
In the full-RAG setting, systems must answer questions by searching across the full enterprise artifact pool. This is the setting closest to the executive fantasy: “Connect it to our company data and let people ask questions.” The paper evaluates zero-shot GPT-4o without retrieval, vector retrieval, hybrid vector-plus-BM25 retrieval, RAPTOR, GraphRAG, HippoRAG 2, Proposition-Graph RAG, and agentic ReAct systems.
The obvious result happens first: zero-shot GPT-4o cannot answer enterprise-specific questions without enterprise-specific evidence. It scores 0 on people, customer, and artifact search categories. This is not embarrassing. It is physics. The model cannot know which fictional employee reviewed a fictional PRD in a synthetic company unless the evidence is in context.
The more interesting result is that adding retrieval helps, but not enough. The simple hybrid retriever is the best standard RAG baseline, with an average score of 20.61. Vector retrieval reaches 16.77. RAPTOR reaches 14.77. GraphRAG reaches 10.31. HippoRAG 2 and Proposition-Graph RAG land at 17.21 and 17.25 respectively. These numbers matter because the benchmark is not asking models to perform mystical cognition. It is asking them to answer operational questions whose evidence exists in the corpus.
The ReAct agent performs best among full-RAG systems. With GPT-4o underneath, it reaches an average score of 32.96. This is a meaningful improvement over the hybrid baseline, especially because the agent has access to structured tools for employee, customer, pull request, and URL lookups. But a score in the low thirties is not the triumphant arrival of enterprise intelligence. It is a reminder that giving a model tools is different from making it competent at evidence work.
The unanswerable-query results make the problem sharper. HERB includes 699 questions that look realistic but lack sufficient supporting evidence. That is not a side feature. In an enterprise, “we do not have enough evidence to answer that” is often the correct answer. It is also the answer that many AI systems are culturally allergic to producing.
The paper reports wide variation in unanswerability detection. GraphRAG correctly identifies 63.41% of unanswerable queries, while the hybrid method reaches only 25.32%. ReAct’s behaviour depends heavily on the underlying model: Gemini-2.5-Flash reaches 63.66%, while o4-mini reaches only 6.01%. That range is not a small implementation wrinkle. It says that more capable-seeming systems may become worse at refusing unsupported answers, depending on how they search and reason.
This is the first business punchline: retrieval accuracy and refusal discipline must be evaluated together. A system that answers more questions by hallucinating across partial evidence is not more useful. It is just more confident while being wrong, which, admittedly, is a venerable management style.
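One way to make that joint evaluation concrete is to score answerable and unanswerable queries separately and report both numbers side by side. The sketch below is a minimal illustration, not HERB's scoring code; the `QueryResult` record, its fields, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class QueryResult:
    answerable: bool  # ground truth: is there sufficient evidence in the corpus?
    refused: bool     # did the system decline to answer?
    score: float      # answer quality in [0, 1]; only meaningful when not refused

def joint_report(results):
    """Report answer quality and refusal discipline as two separate numbers."""
    answerable = [r for r in results if r.answerable]
    unanswerable = [r for r in results if not r.answerable]
    # Refusing an answerable question earns zero rather than being skipped.
    answer_score = sum(0.0 if r.refused else r.score for r in answerable) / max(len(answerable), 1)
    # Share of unanswerable questions the system correctly declined.
    correct_refusal_rate = sum(r.refused for r in unanswerable) / max(len(unanswerable), 1)
    return {"answer_score": answer_score, "correct_refusal_rate": correct_refusal_rate}

# Hypothetical results: two answerable and two unanswerable queries.
results = [
    QueryResult(answerable=True, refused=False, score=1.0),
    QueryResult(answerable=True, refused=True, score=0.0),
    QueryResult(answerable=False, refused=True, score=0.0),
    QueryResult(answerable=False, refused=False, score=0.0),
]
report = joint_report(results)
print(report)
```

Keeping the two numbers apart is the point: a single blended average would let a system trade refusal discipline for answer volume without anyone noticing.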
HERB makes benchmarking look like work, not trivia with an org chart
HERB is designed around a problem that most RAG benchmarks soften: enterprise facts are not usually stored as tidy paragraphs.
The benchmark simulates a simplified software company with 530 employees, 30 products, 120 customer profiles, 302 Slack channels, 33,632 Slack messages, 400 documents, 3,562 pull requests, 575 shared URLs, 321 meeting transcripts, and 50 meeting chats. The data spans planning, development, and support workflows. Queries cover four search intents: content, people, artifacts, and customers.
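The per-source counts can be checked against the headline totals. The tally below assumes that the 39,190 artifact figure includes employee and customer records and excludes products and Slack channels; that reading is inferred from the arithmetic, not stated in the text above.

```python
# Per-source counts as quoted in the paragraph above.
counts = {
    "slack_messages": 33_632,
    "documents": 400,
    "pull_requests": 3_562,
    "shared_urls": 575,
    "meeting_transcripts": 321,
    "meeting_chats": 50,
    "employees": 530,          # assumption: counted as artifacts
    "customer_profiles": 120,  # assumption: counted as artifacts
}

total_artifacts = sum(counts.values())
total_queries = 815 + 699  # answerable + unanswerable

print(total_artifacts, total_queries)
```

Under that assumption the sums match the reported 39,190 artifacts and 1,514 queries. Slack messages alone account for roughly 86% of the pool, which is a useful reminder of where most of the haystack lives.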
That structure matters because the answer to a realistic enterprise question may not live in a single “document.” It may require finding a PRD, locating who reviewed it, checking Slack feedback, reading a meeting transcript, resolving names to employee IDs, and distinguishing current contributors from people merely discussing an older release. In a real company, this is not an exotic edge case. It is Tuesday.
The paper’s data generation process is query-first. Instead of taking random documents and asking an LLM to invent questions after the fact, the authors manually define realistic enterprise query templates and reasoning scenarios, then synthesize workflows that generate the evidence needed to answer them. This design choice is important. Post-hoc synthetic RAG benchmarks can produce questions that are technically multi-hop but operationally artificial. HERB tries to make the question resemble something a person inside a company might actually ask.
The benchmark also deliberately adds enterprise noise. Products can be renamed. Teams can discuss similar features from competitors or open-source projects. Multiple planning teams can appear relevant, although only one continues into development. Public external links can look topically useful while being irrelevant to the current product. Legacy feedback can refer to previous releases created by different teams. These are not decorative annoyances. They are the mechanism by which enterprise retrieval becomes hard.
A conventional retriever sees semantic similarity. An enterprise search system must also understand source role, workflow stage, identity, chronology, and operational relevance. “This Slack message mentions the product” is not enough. The real question is whether that Slack message is part of the evidence chain for the specific query.
That is where the common misconception collapses. The enterprise RAG problem is not simply that the vector database needs better embeddings. Better embeddings help, but they do not automatically know whether a meeting mention was feedback, assignment, historical discussion, customer escalation, or irrelevant chatter. Nor does a longer context window automatically repair poor evidence selection. A longer haystack is still a haystack. It simply has better lighting.
The experiments separate retrieval failure from reasoning failure
The paper’s experimental design is useful because it progressively relaxes the problem. This lets us identify what kind of failure is happening rather than just saying “the model was bad,” which is the analytical equivalent of shrugging in a blazer.
| Test setting | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Full-RAG evaluation over the whole HERB corpus | Main evidence | Current RAG and agentic RAG systems struggle when evidence must be found across heterogeneous enterprise data | That all enterprise RAG deployments will fail at the same rate |
| Product-specific long-context evaluation | Diagnostic comparison | Removing most irrelevant cross-product data improves performance, showing that search-space narrowing helps | That long context alone solves enterprise search |
| Product-specific ReAct RAG | Retrieval comparison under a smaller pool | Even when scoped to product-specific data, retrieval-based agents can trail long-context models | That retrieval should be abandoned |
| Oracle evidence evaluation | Controlled reasoning test | Even with ground-truth evidence, models still make reasoning errors | That retrieval is not the main bottleneck |
| ReAct attempt-budget test | Sensitivity test | More attempts improve performance modestly, but at cost and with refusal trade-offs | That simply retrying solves deep search |
| Manual error analysis | Diagnostic explanation | Failures involve unanswered outputs, incorrect reasoning, incomplete use of context, and shallow tool use | That these are the only possible failure modes |
This progression is the article’s spine. Full-RAG results show the real-world challenge. Long-context results show what happens when irrelevant enterprise sprawl is reduced. Oracle results show whether models can reason when retrieval is removed as a bottleneck. Human analysis then explains the flavour of failure.
In business terms, the benchmark is less a leaderboard than a diagnostic machine. It asks: is the system failing because it cannot find evidence, because it cannot reason over evidence, or because it cannot tell when evidence is missing?
The answer is: yes, with retrieval wearing the largest hat.
Long context helps, but only after someone already narrowed the office
The long-context experiment is one of the most operationally relevant parts of the paper.
Here, the model does not search the full HERB corpus. Instead, it receives all product-specific content concatenated into context. That reduces the input to 3.33% of the original HERB setting. The model still has to find and reason over the relevant information, but the irrelevant cross-product universe has largely been removed.
The results improve sharply for the strongest models. Gemini-2.5-Flash reaches an average score of 76.55. DeepSeek-R1 reaches 58.66. GPT-4o reaches 38.84. Llama-3.1-405B reaches only 18.20. So yes, long context can help. More precisely, long context helps when someone has already performed a useful narrowing operation.
That distinction matters. “Just put everything in the context window” is not an enterprise strategy. It is an architecture diagram drawn before the invoice arrives. The paper’s long-context setting is already kinder than full enterprise search because it uses product-specific filtering. The model is no longer searching across all products, all workflows, all employees, all customers, and all distractors. It is searching inside a smaller office with fewer filing cabinets.
The comparison with product-specific ReAct RAG is revealing. For Gemini-2.5-Flash, long context scores 76.55, while product-specific ReAct RAG scores 41.86. For DeepSeek-R1, long context scores 58.66, while product-specific ReAct RAG scores 34.81. This gap supports the paper’s main diagnosis: retrieval is limiting performance even for strong reasoning models.
But the long-context result should not be overread. Gemini’s 76.55 is strong relative to the RAG baselines, not equivalent to operational certainty. A roughly three-quarters average score in a benchmark is not what a compliance officer, engineering lead, or customer-support director means by “trustworthy.” It is better, not solved.
The practical lesson is that context engineering and retrieval engineering are not alternatives. They are layers. Product scoping, access control, metadata filtering, temporal constraints, and source selection should narrow the search space before the model tries to reason. Long context then becomes useful because it contains the right mess, not the entire company’s mess.
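As a rough sketch of what "narrowing before reasoning" can look like, the filter below applies hard constraints before any semantic ranking takes place. The `Artifact` fields, group names, and sample corpus are hypothetical; a real deployment would push these filters into the index or the connector layer.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Artifact:
    artifact_id: str
    product: str
    source: str      # e.g. "slack", "pr", "meeting", "doc"
    created: date
    acl: frozenset   # groups allowed to read this artifact

def narrow(artifacts, *, product, sources, since, user_groups):
    """Apply hard filters first; ranking only sees what survives."""
    return [
        a for a in artifacts
        if a.product == product
        and a.source in sources
        and a.created >= since
        and a.acl & user_groups  # at least one shared group
    ]

corpus = [
    Artifact("a1", "alpha", "slack", date(2025, 3, 1), frozenset({"eng"})),
    Artifact("a2", "beta", "slack", date(2025, 3, 2), frozenset({"eng"})),   # wrong product
    Artifact("a3", "alpha", "pr", date(2024, 1, 5), frozenset({"eng"})),     # outside time window
    Artifact("a4", "alpha", "doc", date(2025, 4, 1), frozenset({"legal"})),  # no access
]

scoped = narrow(
    corpus,
    product="alpha",
    sources={"slack", "pr", "doc"},
    since=date(2025, 1, 1),
    user_groups=frozenset({"eng"}),
)
print([a.artifact_id for a in scoped])
```

Only `a1` survives. The design choice is that scoping, time, and access are treated as exclusions rather than as ranking signals, so a topically similar but out-of-scope artifact never reaches the model.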
Oracle evidence exposes the second failure: reasoning over the right files
The oracle experiment removes retrieval and noise. The model receives only the ground-truth evidence linked to each question, reducing the input to 0.213% of the original HERB data. This is the “fine, here are the relevant files” setting.
Performance improves. Gemini-2.5-Flash reaches 85.76. Llama-4-Maverick reaches 80.25. o4-mini reaches 80.00. DeepSeek-R1 reaches 76.92. GPT-4o reaches 61.73. These results show that retrieval is a major bottleneck, because removing it produces a large jump.
But no model reaches perfect performance. That is the second uncomfortable finding.
If the right evidence is already in front of the model, remaining errors are not search errors. They are reasoning errors, context-use errors, answer-format errors, or failures to execute the human-designed inference path. The paper’s human analysis makes this concrete.
Among the 815 answerable questions in the oracle setting, Gemini-2.5-Flash scores zero on 71, and DeepSeek-R1 scores zero on 108. The authors examine 50 failed questions per model. DeepSeek-R1 fails to provide a final answer in 15 cases, gives incorrect reasoning in 24, and incompletely uses context in 11. Gemini-2.5-Flash fails to answer in 2 cases, reasons incorrectly in 32, and incompletely uses context in 16.
The recurring error pattern is not “the model did not see the keyword.” It is subtler. For questions about a previous product release, models often infer that current team members who discuss old artifacts were contributors to the earlier work. Correct reasoning requires identifying the people who actually created or reviewed those artifacts. For active bug questions, models may focus on reopened bugs while ignoring unresolved ones, even when both matter.
This is where enterprise RAG becomes less like search and more like investigation. Evidence does not interpret itself. A Slack thread can mention a bug, a meeting can discuss a reverted PR, a customer record can identify the account, and a GitHub link can show whether a fix exists. The model must combine these into an answer while respecting chronology and roles.
So the oracle setting gives us the second business lesson: even perfect retrieval would not eliminate the need for reasoning evaluation. If a vendor shows only retrieval recall, they have shown the system can bring documents to the table. They have not shown it can read the room.
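The scores quoted in this article allow a back-of-the-envelope split between the two failure types. The sketch below treats the gap between product-specific ReAct RAG and oracle as a rough proxy for retrieval loss, and the gap between oracle and a perfect score as residual reasoning loss. It assumes a ceiling of 100 and ignores the fact that the oracle setting also removes noise, so it is an illustration rather than a decomposition the paper reports.

```python
# Average scores quoted in this article.
scores = {
    "Gemini-2.5-Flash": {"react_scoped": 41.86, "oracle": 85.76},
    "DeepSeek-R1": {"react_scoped": 34.81, "oracle": 76.92},
}

CEILING = 100.0  # assumption: a perfect average score is 100

def split_gap(react_scoped, oracle):
    """Rough attribution of lost points to retrieval vs. reasoning."""
    return {
        "retrieval_gap": round(oracle - react_scoped, 2),
        "reasoning_gap": round(CEILING - oracle, 2),
    }

gaps = {model: split_gap(**s) for model, s in scores.items()}
for model, g in gaps.items():
    print(model, g)
```

For both models the retrieval gap comes out larger than the reasoning gap, and neither gap is zero.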
Agents use tools, but mostly like interns with a search box
Agentic RAG is supposed to help because agents can plan, search iteratively, and call structured tools. HERB does show that ReAct improves over standard RAG. But the manual trajectory analysis explains why the improvement is limited.
The authors inspect 50 GPT-4o ReAct trajectories. In 21 cases, the agent uses only unstructured search. In 24 cases, it uses two tools. Only four cases involve three tools, and just one case involves four. The agent always invokes unstructured search, often relying on the first retrieved result instead of performing deeper iterative search.
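Those counts are easier to read as shares. The short calculation below assumes that "only unstructured search" means exactly one tool was used in the trajectory.

```python
# Distinct tools used -> number of trajectories, from the 50 inspected.
tool_counts = {1: 21, 2: 24, 3: 4, 4: 1}

n = sum(tool_counts.values())
shares = {k: v / n for k, v in tool_counts.items()}
mean_tools = sum(k * v for k, v in tool_counts.items()) / n
three_plus_share = sum(v for k, v in tool_counts.items() if k >= 3) / n

print(n, shares, mean_tools, three_plus_share)
```

That works out to 42% of trajectories using a single tool, 48% using two, and only 10% using three or more, for an average of 1.7 tools per trajectory.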
This is a painfully recognizable pattern. The system has tools, but behaves as though the search box is the real product and everything else is optional.
For artifact-related questions, the agent often fails to use structured PR or URL search tools, even when those tools would be relevant. The paper also reports unnecessary tool use, such as resolving an employee name when that output is not needed, and incorrect tool use, such as querying with an employee ID when the tool expects a name.
This matters because many enterprise AI roadmaps now treat “agentic” as a magic adjective. Add tools, add a planner, sprinkle with ReAct, and suddenly the assistant has “workflow intelligence.” HERB suggests a less theatrical view. Tool availability is not tool competence. A system must know when a structured tool is required, what parameter shape it expects, when the returned result is insufficient, and when to continue searching rather than settle for the first plausible mention.
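The "parameter shape" failure is cheap to guard against at the tool boundary. The sketch below rejects an argument that looks like an employee ID before it reaches a name-based lookup. The `emp_<digits>` format and the function name are hypothetical; substitute whatever identifiers your directory actually uses.

```python
import re

# Hypothetical ID format, used only for illustration.
EMPLOYEE_ID = re.compile(r"^emp_\d+$")

def validate_name_argument(value: str) -> str:
    """Return a cleaned name, or raise if the argument has the wrong shape."""
    cleaned = value.strip()
    if not cleaned:
        raise ValueError("name lookup called with an empty argument")
    if EMPLOYEE_ID.match(cleaned):
        raise ValueError(
            f"{cleaned!r} looks like an employee ID; this tool expects a name"
        )
    return cleaned

print(validate_name_argument("  Ada Lovelace "))
try:
    validate_name_argument("emp_42")
except ValueError as err:
    print("rejected:", err)
```

Returning a descriptive error to the agent, instead of an empty result, gives it a chance to pick the right tool on the next step rather than concluding that the person does not exist.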
The attempt-budget test adds another useful detail. Increasing GPT-4o ReAct attempts from 1 to 20 raises average performance from 32.96 to 37.29. That is an improvement, but not a transformation. It also reduces the unanswered rate from 23.03% to 19.17%, which can be good or bad depending on whether the extra answers are supported. More attempts buy some accuracy, but they do not convert shallow search into deep enterprise evidence work.
From an operator’s perspective, this is a cost-quality trade-off. Iteration consumes tokens, time, and tool calls. If the agent’s search policy is weak, more attempts are just more opportunities to be confidently adjacent to the answer.
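The trade-off can be put in rough numbers using the figures above. The calculation treats 20 as a ceiling on attempts, so the per-attempt gain is a lower bound if the agent often stops early; actual token and latency costs are not reported here and are left out.

```python
baseline = {"attempts": 1, "avg_score": 32.96, "unanswered_pct": 23.03}
extended = {"attempts": 20, "avg_score": 37.29, "unanswered_pct": 19.17}

score_gain = round(extended["avg_score"] - baseline["avg_score"], 2)
extra_attempts = extended["attempts"] - baseline["attempts"]
gain_per_extra_attempt = round(score_gain / extra_attempts, 3)
unanswered_drop = round(baseline["unanswered_pct"] - extended["unanswered_pct"], 2)

print(score_gain, gain_per_extra_attempt, unanswered_drop)
```

A gain of 4.33 points spread over 19 additional permitted attempts is about 0.23 points per attempt, alongside a 3.86-point drop in the unanswered rate that still needs to be checked for support.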
The business lesson: evaluate evidence operations, not chatbot fluency
The paper directly shows that current RAG and agentic RAG systems struggle on HERB, that long-context models improve when the data is scoped, and that oracle evidence still leaves reasoning errors. Cognaptus’ business inference is that enterprise AI evaluation must move from “answer quality on demos” to “evidence process under realistic workflow conditions.”
That means testing systems on the work they are actually expected to perform.
| Enterprise requirement | What HERB shows | Business meaning |
|---|---|---|
| Find evidence across heterogeneous sources | Full-RAG systems perform poorly across Slack, meetings, documents, PRs, URLs, employee data, and customer metadata | A single vector index is not enough for operational memory |
| Use structured and unstructured data together | ReAct improves performance but uses tools shallowly | Tool access must be evaluated as behaviour, not listed as a feature |
| Detect missing evidence | Unanswerability results vary widely across methods and models | Refusal quality is part of reliability, not a legal footnote |
| Narrow context intelligently | Product-specific long context improves performance for strong models | Metadata, scoping, and source routing can matter as much as the model |
| Reason over evidence | Oracle performance remains imperfect | Retrieval recall is necessary but insufficient |
| Handle enterprise noise | Distractors include product renames, legacy feedback, competitor documents, and cross-product references | Benchmarks must include confusion that resembles work, not only clean QA |
The immediate procurement implication is blunt: do not ask a vendor whether they “support RAG.” Ask what kinds of evidence chains their system can recover and how they evaluate failure.
A useful enterprise RAG evaluation should include at least five tests.
First, multi-source retrieval. Can the system answer a question that requires Slack plus meeting transcript plus PR metadata plus employee ID resolution? If not, it is a document search assistant with nicer manners.
Second, unanswerability. Can it say “not enough evidence” when the required artifact does not exist? A system that never refuses is not helpful. It is a liability wearing a product tour.
Third, temporal reasoning. Can it distinguish a previous release from the current release, a reopened bug from a resolved bug, and a renamed product from a different product? Enterprise facts decay and mutate. The system must understand that “same name” and “same thing” are not synonyms.
Fourth, tool-use discipline. Does the agent select structured tools when needed, pass correct parameters, and continue searching when the first result is incomplete? Tool traces should be inspectable. If the agent cannot show how it found the answer, auditability is theatrical.
Fifth, evidence completeness. Does the answer cite all necessary supporting sources, or only the first plausible snippet? Many wrong enterprise answers are not fabricated from nothing. They are fabricated from one true fragment inflated into a complete story. The little fragment did nothing wrong. The model got ambitious.
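The fifth test is the easiest to automate once each evaluation question carries a list of required evidence. The sketch below compares the sources an answer cites against that list; the source identifiers are made up, and the function is an illustration rather than a metric defined by HERB.

```python
def evidence_completeness(cited, required):
    """Compare the sources an answer cites with the sources it needed."""
    cited, required = set(cited), set(required)
    missing = required - cited
    recall = 1.0 if not required else len(required & cited) / len(required)
    return {"recall": recall, "missing": sorted(missing), "complete": not missing}

# Hypothetical question needing four pieces of evidence; the answer cites two.
required = {"slack:123", "meeting:45", "pr:678", "employee:eid_9"}
cited = {"slack:123", "pr:678"}

result = evidence_completeness(cited, required)
print(result)
```

An answer that cites half of its required evidence may still happen to be right, but it should be flagged for review, because that is the pattern in which one true fragment gets inflated into a complete story.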
The ROI is cheaper diagnosis, not instant automation
HERB should also change how businesses think about return on investment.
The naive ROI story says RAG reduces search time because employees can ask questions in natural language. That may be true for simple retrieval tasks. But the deeper ROI is not just faster answers. It is cheaper diagnosis of where knowledge work breaks.
If a RAG system fails, HERB-style evaluation helps identify whether the failure came from missing connectors, weak indexing, poor metadata, shallow search planning, incorrect tool use, insufficient context, or reasoning failure. Those distinctions matter for investment decisions.
If retrieval is the bottleneck, buying a larger model may be wasteful. The fix may be better source routing, metadata design, access-control-aware indexing, or workflow-specific retrievers. If reasoning is the bottleneck, better retrieval will not fully solve the problem. The system may need constrained reasoning steps, typed intermediate outputs, validation checks, or task-specific answer schemas. If unanswerability is the bottleneck, the system needs refusal calibration and evidence sufficiency checks, not another prompt that says “be accurate” with the optimism of a corporate values poster.
This is where the benchmark becomes practically useful. It does not merely rank models. It gives operators a vocabulary for system failure.
That vocabulary is valuable because enterprise AI failures are often misdiagnosed. A bad answer gets blamed on “the model,” when the actual cause may be missing customer metadata. A hallucination gets blamed on “retrieval,” when the actual issue is that the system found one relevant source and ignored three contradictory ones. A refusal gets blamed on “safety,” when the real problem is that the evidence chain was broken.
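A per-query triage rule helps keep those diagnoses honest. The sketch below is one possible scheme, not the paper's error taxonomy: it assumes each test query records which evidence was required, which was retrieved, whether the system refused, and whether the answer was judged correct. An unanswerable query is modelled as one with no required evidence.

```python
from enum import Enum

class Diagnosis(Enum):
    OK = "ok"
    CORRECT_REFUSAL = "correct refusal"
    UNSUPPORTED_ANSWER = "answered without sufficient evidence"
    RETRIEVAL_FAILURE = "required evidence not retrieved"
    REASONING_FAILURE = "evidence retrieved but answer wrong or withheld"

def diagnose(required, retrieved, refused, answer_correct):
    """Attribute one query outcome to a single most likely failure stage."""
    required, retrieved = set(required), set(retrieved)
    if not required:  # unanswerable by construction
        return Diagnosis.CORRECT_REFUSAL if refused else Diagnosis.UNSUPPORTED_ANSWER
    if not required <= retrieved:  # something needed never reached the model
        return Diagnosis.RETRIEVAL_FAILURE
    if refused or not answer_correct:
        return Diagnosis.REASONING_FAILURE
    return Diagnosis.OK

print(diagnose({"pr:1", "slack:2"}, {"slack:2"}, refused=False, answer_correct=False))
print(diagnose({"pr:1"}, {"pr:1", "doc:9"}, refused=False, answer_correct=False))
print(diagnose(set(), {"doc:9"}, refused=False, answer_correct=False))
```

Checking retrieval before reasoning is deliberate: a wrong answer built on incomplete evidence is first a retrieval problem, and blaming the model for it leads to the wrong purchase.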
The business value of HERB-style testing is therefore not that every company should copy the benchmark exactly. It is that companies should build internal evaluations that resemble their own evidence paths. For a software company, that may mean Slack, GitHub, docs, and support tickets. For a bank, it may mean policies, transaction notes, CRM records, risk memos, and audit logs. For a hospital, it may mean clinical notes, lab results, discharge summaries, and care-team messages, with far stricter governance, obviously, because “the model seemed confident” is not a clinical protocol.
Where HERB should not be overread
HERB is valuable, but it is not a universal measurement of enterprise intelligence.
First, the dataset is synthetic. The authors put substantial human effort into designing workflows, queries, and reasoning paths, but the resulting environment is still simulated. That is a strength for controlled evaluation and ground-truth traceability. It is also a boundary. Real organizations contain uglier permissions, incomplete records, contradictory documents, private side channels, stale dashboards, and people naming files as though search engines personally offended them.
Second, HERB focuses on software and product-development workflows. The sources (Slack, meeting transcripts, documents, GitHub PRs, URLs, employee metadata, and customer profiles) map well to technical organizations. They do not automatically map to finance, healthcare, law, manufacturing, or government. Those domains need their own workflow modelling and expert-designed query sets.
Third, HERB’s workflow specifications are not fully public, partly to discourage overfitting. That is sensible for benchmark integrity, but it also means companies should avoid treating HERB as a turnkey deployment checklist. The benchmark is a diagnostic reference, not a substitute for internal evaluation.
Fourth, the metrics combine different evaluation styles: Likert-style scoring for content queries and extraction-based F1 for people, customer, and artifact queries. This is reasonable for the benchmark’s mixed task types, but average scores should be interpreted as comparative signals rather than a single universal accuracy number.
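For readers unfamiliar with the extraction-style metric, the sketch below shows a standard set-based F1 over extracted identifiers. The exact matching and normalization rules HERB applies are not described here, so treat this as the generic formula; the employee IDs are invented.

```python
def set_f1(predicted, gold):
    """Standard set-based F1 between predicted and gold identifiers."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical people query: two of three predicted IDs are correct,
# and one required ID is missing.
predicted = {"eid_101", "eid_102", "eid_103"}
gold = {"eid_102", "eid_103", "eid_104"}

f1 = set_f1(predicted, gold)
print(round(f1, 4))
```

A score like this penalizes both extra names and missing names, which behaves quite differently from a graded Likert judgment of a free-text answer. That difference is why the averaged figures are best read as comparative signals.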
Finally, the paper does not prove that enterprise RAG is doomed. It proves something more useful and less dramatic: current approaches are brittle when the task requires deep, source-aware search across heterogeneous workflow data. That is not a reason to abandon RAG. It is a reason to stop pretending a vector database plus a cheerful answer style is an enterprise knowledge system.
The next enterprise assistant must know when the file is missing
The most important word in HERB may be “unanswerable.”
Enterprise AI systems are often designed around answer production. The user asks. The assistant answers. The demo proceeds. Everyone nods. Somewhere in the background, governance quietly starts drinking.
But real enterprise knowledge work includes missing evidence, partial evidence, stale evidence, contradictory evidence, and evidence that exists but should not be accessible to the current user. A reliable system must therefore treat “I cannot support that answer from available evidence” as a first-class outcome.
HERB’s unanswerable queries force that behaviour into the evaluation. This is one of the benchmark’s most business-relevant design choices. In a company, unsupported answers are not merely wrong. They can trigger bad customer commitments, incorrect technical decisions, compliance exposure, and internal trust collapse.
A mature enterprise RAG system should therefore produce three things, not one.
It should produce an answer when evidence is sufficient. It should produce a refusal when evidence is insufficient. And it should produce an evidence trail that lets a human understand which sources were used and which sources were missing. Anything less is not grounded AI. It is autocomplete with enterprise access.
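Those three outputs can be enforced as a response contract rather than left to prompt wording. The sketch below is one hypothetical shape for such a contract: an answer must cite at least one source, and a refusal must state which evidence was missing. The class and field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class GroundedResponse:
    answer: Optional[str]  # None means the system refused
    sources_used: List[str] = field(default_factory=list)
    sources_missing: List[str] = field(default_factory=list)

    def __post_init__(self):
        # An answer without an evidence trail is rejected outright.
        if self.answer is not None and not self.sources_used:
            raise ValueError("an answer must cite at least one source")
        # A refusal must tell the human what could not be found.
        if self.answer is None and not self.sources_missing:
            raise ValueError("a refusal must state which evidence was missing")

    @property
    def refused(self) -> bool:
        return self.answer is None

answered = GroundedResponse("The PRD was reviewed by two engineers.", ["doc:prd-7", "slack:311"])
declined = GroundedResponse(None, sources_missing=["review record for the PRD"])
print(answered.refused, declined.refused)
```

Making refusal a valid, structured outcome means downstream tooling can count it, audit it, and route it to a human instead of treating it as an error.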
Conclusion: the retrieval problem is really a management problem
HERB’s results are easy to summarize and harder to absorb. Standard RAG performs poorly. Agentic RAG helps but remains shallow. Long context helps when the search space is already narrowed. Oracle evidence helps more, but reasoning errors remain. Unanswerability remains uneven and dangerous.
The technical conclusion is that enterprise RAG needs better deep search: source-aware retrieval, structured-tool competence, iterative evidence gathering, temporal reasoning, entity resolution, and refusal calibration.
The business conclusion is sharper. Most organizations do not have a “chatbot problem.” They have an evidence operations problem. Their knowledge is fragmented across tools, workflows, permissions, and time. RAG exposes that fragmentation because it tries to answer questions across it. When the system fails, it is not always because the model is stupid. Sometimes it is because the company’s memory is stored like an archaeological site with Slack notifications.
HERB is useful because it makes that failure measurable. It tells operators to stop evaluating enterprise AI by how fluent the final paragraph sounds and start evaluating how the system found, filtered, combined, and justified evidence.
That is a less glamorous benchmark. It is also much closer to work.
Cognaptus: Automate the Present, Incubate the Future.
[1] Prafulla Kumar Choubey, Xiangyu Peng, Shilpa Bhagavath, Kung-Hsiang Huang, Caiming Xiong, and Chien-Sheng Wu, “Benchmarking Deep Search over Heterogeneous Enterprise Data,” arXiv:2506.23139, 2025. https://arxiv.org/html/2506.23139