Grounded and Confused: Why RAG Systems Still Fail in the Enterprise

If you’ve been following the RAG (retrieval-augmented generation) hype train, you might believe we’ve cracked enterprise search. Salesforce’s new benchmark—HERB (Heterogeneous Enterprise RAG Benchmark)—throws cold water on that optimism. It exposes how even the most powerful agentic RAG systems, armed with top-tier LLMs, crumble when facing the chaotic, multi-format, and noisy reality of business data.

Deep Search ≠ Deep Reasoning

Most current RAG benchmarks focus on shallow linkages: documents tied together via entity overlap or topic clusters. HERB rejects this toy model. It defines Deep Search as more than multi-hop reasoning: answering requires searching across both unstructured and structured sources, such as Slack threads, meeting transcripts, GitHub PRs, and internal URLs. It's what real enterprise users do daily, and it's messy.
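
To make that distinction concrete, here is a minimal sketch of what a single Deep Search query has to touch. It is illustrative Python, not HERB's actual harness: `Artifact`, `deep_search`, and the search callables are hypothetical stand-ins for whatever indexes and structured stores an enterprise actually runs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Artifact:
    source: str   # e.g. "slack", "meeting_transcript", "github_pr", "internal_url"
    text: str


def deep_search(
    query: str,
    unstructured_search: Callable[[str], Iterable[Artifact]],
    structured_lookups: dict[str, Callable[[str], Iterable[Artifact]]],
) -> list[Artifact]:
    """Fan one question out over heterogeneous sources and merge the evidence."""
    evidence: list[Artifact] = []
    # Semantic search over free-text artifacts (Slack, transcripts, docs).
    evidence.extend(unstructured_search(query))
    # Targeted lookups against structured stores (PR metadata, employee records).
    for lookup in structured_lookups.values():
        evidence.extend(lookup(query))
    # Deduplicate; a real system would also rank and issue follow-up queries.
    seen: set[tuple[str, str]] = set()
    merged: list[Artifact] = []
    for art in evidence:
        key = (art.source, art.text)
        if key not in seen:
            seen.add(key)
            merged.append(art)
    return merged
```

The point is structural: one question fans out over semantic search and targeted structured lookups, and the evidence has to be merged before any reasoning happens.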

In contrast to benchmarks like MultiHop-RAG or HotpotQA, HERB simulates realistic workflows across three software lifecycle stages—planning, development, and support—and injects real-world noise: overlapping teams, renamed products, competitor references, and legacy data. This isn’t synthetic trivia; it’s organizational archaeology.

Benchmarked Failures

HERB includes 815 answerable and 699 unanswerable questions grounded in 39,190 enterprise artifacts. The results are humbling:

| System | Avg. Score (out of 100) |
|---|---|
| 0-shot GPT-4o | 4.55 |
| Best Standard RAG | 20.61 |
| Agentic RAG (ReAct + GPT-4o) | 32.96 |

Even ReAct, with tool-augmented multi-hop reasoning, improves only marginally over simpler hybrids. Most errors stem not from poor reasoning but from failed retrieval. These systems often find partial context, reason over it, and output plausible but wrong answers. Worse: they rarely recognize when no answer exists.
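
One way to blunt the "plausible but wrong" failure mode is an explicit abstention gate. The sketch below is an assumption-laden illustration, not anything HERB prescribes: `llm_answer`, `support_score`, and the 0.5 threshold are all placeholders for a real pipeline's generation and evidence-grading steps.

```python
from typing import Callable


def answer_or_abstain(
    question: str,
    evidence: list[str],
    llm_answer: Callable[[str, list[str]], str],
    support_score: Callable[[str, list[str]], float],
    min_support: float = 0.5,   # arbitrary threshold, for illustration only
) -> str:
    """Answer only when the retrieved evidence actually supports one."""
    if not evidence:
        return "No answer found in the available sources."
    draft = llm_answer(question, evidence)
    # Grade how well the evidence backs the draft (e.g. an LLM-as-judge call).
    if support_score(draft, evidence) < min_support:
        return "No answer found in the available sources."
    return draft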

Long-Context LLMs vs Retrieval: A Brutal Tradeoff

To test whether retrieval is the core bottleneck, HERB also runs a “long-context” evaluation: give the LLM the entire product-specific data dump and let it reason. The results are striking:

| Model | Long-Context Score | RAG Score |
|---|---|---|
| Gemini-2.5-Flash | 76.55 | 41.86 |
| DeepSeek-R1 | 58.66 | 34.81 |

Retrieval pipelines fail on the same corpora that long-context models handle with far more grace. But even those models aren't perfect. When given oracle evidence (i.e., only the exact documents needed), the best models (Gemini-2.5) still fall short of 90. Why? Because LLMs do not replicate human reasoning chains. They infer, interpolate, and sometimes hallucinate.
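
The two evaluation modes differ only in what ends up in the prompt. A minimal sketch, assuming a hypothetical `retrieve` function and plain string prompts:

```python
from typing import Callable


def build_rag_prompt(question: str,
                     retrieve: Callable[[str, int], list[str]],
                     k: int = 10) -> str:
    """RAG mode: the model sees only the top-k retrieved chunks."""
    context = "\n\n".join(retrieve(question, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


def build_long_context_prompt(question: str, product_corpus: list[str]) -> str:
    """Long-context mode: the model sees the entire product-specific dump."""
    context = "\n\n".join(product_corpus)
    return f"Context:\n{context}\n\nQuestion: {question}"
```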

The ReAct Agent: Smarter, but Still Superficial

HERB also tracks how the ReAct agent uses its tools. Surprisingly, in over 40% of cases it relies solely on unstructured search and rarely performs deep, iterative lookups. It often settles for the first plausible hit. Even when structured tools (like employee ID mappers or PR metadata extractors) are available, it misuses or underuses them.
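
Auditing that behaviour is straightforward if the agent loop records every tool call. Below is a minimal ReAct-style loop in that spirit; `agent_step` and the `tools` mapping are hypothetical stand-ins, not HERB's implementation.

```python
from collections import Counter
from typing import Callable


def run_react(
    question: str,
    agent_step: Callable[[str, list], tuple],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> tuple[str, Counter]:
    """Minimal ReAct-style loop that also audits which tools get used.

    `agent_step` stands in for an LLM call returning either
    ("answer", final_text) or ("tool", tool_name, tool_input).
    """
    history: list[tuple] = []
    tool_usage: Counter = Counter()
    for _ in range(max_steps):
        action = agent_step(question, history)
        if action[0] == "answer":
            return action[1], tool_usage
        _, tool_name, tool_input = action
        tool_usage[tool_name] += 1  # audit trail: which tools actually get exercised
        observation = tools[tool_name](tool_input)
        history.append((tool_name, tool_input, observation))
    return "No answer within the step budget.", tool_usage
```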

In enterprise settings, this is a fatal flaw. Real users don’t want “likely” answers. They want accountable, traceable ones.

Lessons for Enterprise AI Builders

  1. Retrieval is still the bottleneck. No matter how good your model is, if it can’t find the evidence, it can’t reason correctly.
  2. Shallow evaluations mislead. Benchmarks that skip structured data and realistic distractions inflate model capabilities.
  3. Agent frameworks need procedural depth. Tool-using agents must be trained or incentivized to do real investigations, not surface-level fetch-and-guess routines.
  4. Recognizing ignorance matters. Most models fail to identify unanswerable queries, which is dangerous in high-stakes enterprise tasks.
  5. Long-context is promising but expensive. Until retrieval improves, giving models more (relevant) context might be more practical than chasing perfect RAG; the rough cost sketch below shows what that tradeoff looks like.
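
To see why lesson 5 bites, consider a back-of-the-envelope cost comparison. Every number below is an assumption chosen for illustration, not a figure from the HERB paper:

```python
# Illustrative numbers only: corpus size, chunk size, and price are assumptions,
# not figures from the HERB paper.
corpus_tokens = 2_000_000            # entire product-specific dump
top_k_tokens = 10 * 800              # top-10 retrieved chunks of ~800 tokens each
price_per_m_input_tokens = 1.25      # hypothetical $ per 1M input tokens

long_context_cost = corpus_tokens / 1_000_000 * price_per_m_input_tokens
rag_cost = top_k_tokens / 1_000_000 * price_per_m_input_tokens

print(f"Long-context per query: ${long_context_cost:.2f}")   # $2.50
print(f"RAG per query:          ${rag_cost:.2f}")             # $0.01
```

At these made-up rates the gap is two orders of magnitude per query, before accounting for latency. That is the economics behind "promising but expensive."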

What HERB Means for the Future

HERB is not just another benchmark; it's a call to realism. The next wave of AI systems must not only retrieve better but also reason like a meticulous analyst rather than a hallucinating oracle.

For vendors building enterprise copilots or RAG assistants, HERB should serve as a warning: if your system performs well on existing benchmarks, it might still be useless in the real world.

At Cognaptus, we believe evaluation should match deployment conditions. HERB brings us closer. The next step? Build RAG agents that not only know how to search—but know when to stop.


Cognaptus: Automate the Present, Incubate the Future