TL;DR
Deep Research agents are good at planning over messy data. They are less good at finishing the plan without taking convenient shortcuts, which is awkward if the job involves recall, auditability, or a CFO who dislikes “probably”. Semantic-operator systems have the opposite problem: they can process unstructured records methodically, but their iterator-style execution can be expensive, slow, and clumsy when the answer requires reasoning across files.
The paper behind this article argues that these two worlds are converging into a new analytics runtime for unstructured data: part agent, part query optimizer, part materialized-view system.[1] Its prototype extends Palimpzest with a richer Context abstraction, agentic compute and search operators, and early support for reusing materialized contexts across queries. The experiments are preliminary (two queries, three trials each), but the architectural signal is useful: business value will come less from “Deep Research” as a product label, and more from disciplined execution beneath the chat box.
Deep Research Is Not an Analyst; It Is a Query Planner With Impulse-Control Issues
A familiar enterprise question looks deceptively simple:
“Find the emails that contain first-hand discussion of these transactions.”
Or:
“Compute the ratio of identity-theft reports in 2024 versus 2001 from this folder of public statistics.”
A human analyst knows that these are not just search tasks. The first question needs high recall across a corpus. The second needs cross-file disambiguation: which file is authoritative, which year is being counted, and which superficially relevant statistics are actually the wrong ones. A conventional BI stack wants structured tables. The data lake gives it CSVs, HTML files, PDFs, emails, and the occasional naming convention held together by vibes and legacy procurement.
This is where Deep Research agents seem attractive. Give the agent a question, let it inspect files, write Python, use tools, revise its plan, and produce an answer. Lovely. Until the agent decides that reading four or five files is spiritually equivalent to reading 132 files. The paper calls out this failure mode directly: open Deep Research agents can form a reasonable high-level plan, then execute it poorly by relying on simplistic keyword filters, premature stopping, or manual validation of only a small subset of records.
Semantic-operator systems fail differently. They bring database discipline to unstructured data: natural-language filters, maps, joins, aggregations, and extraction operators over documents. Systems such as Palimpzest, LOTUS, DocETL, and related work treat LLM calls as operators in a query plan rather than as improvisational acts of genius. That is progress. But semantic operators often inherit iterator semantics: process record after record, even when the right answer requires finding the right file first, reasoning across several files, or making an interactive decision about where to look next.
So the paper’s central mechanism is not “agents beat databases” or “databases beat agents”. It is more interesting, and less bumper-sticker friendly:
Deep Research needs to become an analytics runtime, and analytics runtimes need agentic execution semantics.
That is the useful thought. Everything else is implementation detail, albeit important implementation detail.
The Runtime Needs Three Things: Access, Execution, and Reuse
The prototype extends Palimpzest in three main directions. They map neatly to three operational questions every enterprise AI analytics system eventually faces.
| Runtime question | Paper mechanism | Operational meaning |
|---|---|---|
| How does the agent know what data access methods exist? | Context abstraction | Wrap a dataset with a description, iterator access, indexes, top-k retrieval, and custom tools. |
| How does the system combine agentic planning with optimized execution? | compute and search operators | Let agents explore and write code, but give them a tool for generating optimized semantic-operator programs when heavy processing is needed. |
| How does the system avoid paying again for work it already did? | Materialized Context management | Cache and retrieve execution-derived contexts, similar in spirit to materialized views in OLAP systems. |
The Context abstraction is the foundation. In ordinary Palimpzest, a dataset can be iterated over as a collection of records. The paper generalizes this into something more agent-friendly: a Context has a natural-language description, can expose index-based access methods, and can include dataset-specific tools. An agent no longer has to choose between “read everything” and “guess with filenames”. It can search, inspect, iterate, or call a custom tool depending on the question.
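To make the idea concrete, here is a minimal sketch of what such a wrapper could look like. It is not the Palimpzest API: the class name, its fields, the `top_k` method, and the toy term-overlap ranking are all assumptions made for illustration, and the records are invented placeholder strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


@dataclass
class Context:
    """Hypothetical wrapper for illustration only; not the Palimpzest API."""

    description: str                      # natural-language summary of the corpus
    records: List[str]                    # stand-in for files or documents
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)

    def iterate(self) -> Iterator[str]:
        """Iterator access: visit every record in order."""
        return iter(self.records)

    def top_k(self, query: str, k: int = 3) -> List[str]:
        """Toy index lookup: rank records by how many query terms they share."""
        terms = set(query.lower().split())
        scored = [(len(terms & set(r.lower().split())), r) for r in self.records]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [r for score, r in ranked[:k] if score > 0]


# Placeholder corpus; the record strings are invented for the example.
ctx = Context(
    description="Public fraud and identity-theft statistics (illustrative)",
    records=[
        "identity theft reports by year",
        "fraud losses by payment method",
        "consumer reports by state",
    ],
    tools={"count_records": lambda c: len(c.records)},
)

print(ctx.top_k("identity theft reports", k=1))
print(ctx.tools["count_records"](ctx))
```

The point of the sketch is the menu, not the ranking function: the same object offers full iteration, a targeted lookup, and a named tool, so a planner can pick the cheapest access path that answers the question.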
The search and compute operators then make this usable inside a query plan. A search operator enriches a context by finding relevant information and updating the context description with a summary of what was found. A compute operator tries to produce a specific output. Both are physically implemented with CodeAgents, but crucially those agents are not left alone with a Python REPL and a dream. They are given a tool that can execute a natural-language instruction using an optimized semantic-operator program.
That distinction matters. An agent that manually chains semantic filters can still waste money. It may apply filters redundantly, process records that were already filtered out, or choose expensive models where cheaper ones would do. By routing heavy document processing through Palimpzest’s optimizer, the prototype lets the agent plan dynamically while the execution engine handles cost-aware processing.
Finally, the materialized Context idea borrows from a very old database trick: do not recompute what can be reused. When a compute or search operator runs, it creates a new context. The prototype embeds and indexes the descriptions of these materialized contexts, so future queries can retrieve similar prior work. If one query already located the right files and produced a useful execution trace for identity-theft statistics, a later query about another year should not start from zero. Revolutionary? No. That is precisely the point. The most useful AI infrastructure often looks suspiciously like database engineering rediscovering its own furniture.
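The reuse step can be sketched in a few lines. The version below is an assumption-heavy toy: `ContextCache`, `embed`, and the 0.5 similarity threshold are hypothetical, and a bag-of-words vector stands in for whatever embedding model a real system would use.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    """Stand-in embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class ContextCache:
    """Hypothetical store of materialized contexts, indexed by description."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, Counter, object]] = []

    def put(self, description: str, payload: object) -> None:
        self._entries.append((description, embed(description), payload))

    def lookup(self, query: str, threshold: float = 0.5) -> Optional[object]:
        """Return the payload of the most similar description, if similar enough."""
        q = embed(query)
        best = max(self._entries, key=lambda e: cosine(q, e[1]), default=None)
        if best is None or cosine(q, best[1]) < threshold:
            return None
        return best[2]


cache = ContextCache()
cache.put("identity theft reports by year", {"note": "files located by an earlier query"})

hit = cache.lookup("identity theft reports for another year")   # similar wording
miss = cache.lookup("quarterly revenue by region")              # unrelated wording
print(hit, miss)
```

A threshold like this is where the governance questions enter: set it too low and stale or irrelevant work gets reused, set it too high and the cache never pays for itself.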
Why Plain Agents Shortcut and Semantic Operators Overprocess
The paper’s mechanism-first argument is best understood through the two opposite failure modes.
The plain Deep Research agent is flexible. It can inspect file names, write code, make a plan, and adapt. On the Kramabench identity-theft task, this flexibility is valuable because the correct answer is in a single CSV file among 132 CSV and HTML files. The system must identify the right source, extract values for 2001 and 2024, and compute the ratio. A record-by-record semantic pipeline is not naturally elegant here because the task involves global file selection and cross-file reasoning.
But flexibility without execution discipline creates recall risk. On the Enron email task, a plain CodeAgent tended to use simple keyword or regex-like filtering and manually read a few candidate emails. That can yield high precision because the few returned emails may genuinely match. It also misses relevant emails. In business terms, this is the classic “beautiful answer to a smaller question” problem. Legal, compliance, and diligence teams generally do not enjoy discovering later that the system was very precise about the documents it happened to inspect.
Semantic operators, meanwhile, bring discipline by applying LLM-powered filters and maps systematically across the dataset. For the Enron-style filtering task, that is exactly what you want: every email can be evaluated against the natural-language predicate. The problem is cost and latency. If the system applies multiple semantic filters sequentially without checking intermediate outputs or optimizing model choices, it may pay for unnecessary LLM calls across records that should already have been eliminated.
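A toy call counter shows why the execution pattern matters. In this sketch the predicates are plain string checks standing in for LLM calls, and the function names and sample emails are invented; it illustrates only one optimization (not re-evaluating eliminated records), not Palimpzest's actual optimizer.

```python
from typing import Callable, List, Tuple

Predicate = Callable[[str], bool]


def run_naive(records: List[str], filters: List[Predicate]) -> Tuple[List[str], int]:
    """Evaluate every filter on every record, then keep records that pass all."""
    calls = 0
    survivors = []
    for rec in records:
        verdicts = []
        for f in filters:
            calls += 1                 # each call stands in for one LLM invocation
            verdicts.append(f(rec))
        if all(verdicts):
            survivors.append(rec)
    return survivors, calls


def run_pipelined(records: List[str], filters: List[Predicate]) -> Tuple[List[str], int]:
    """Apply filters in sequence, passing only survivors to the next filter."""
    calls = 0
    survivors = list(records)
    for f in filters:
        kept = []
        for rec in survivors:
            calls += 1
            if f(rec):
                kept.append(rec)
        survivors = kept
    return survivors, calls


# Invented sample data; string checks stand in for semantic predicates.
emails = [
    "trader confirms the deal terms",
    "newsletter about deal rumours",
    "lunch menu",
    "trader forwards a news article",
]
mentions_deal: Predicate = lambda e: "deal" in e
is_firsthand: Predicate = lambda e: "confirms" in e

naive_result, naive_calls = run_naive(emails, [mentions_deal, is_firsthand])
piped_result, piped_calls = run_pipelined(emails, [mentions_deal, is_firsthand])
print(naive_calls, piped_calls)
```

Both strategies return the same email, but the naive one spends eight predicate calls where the pipelined one spends six. On four records that is a rounding error; on a corpus where each call is a paid model invocation, it is the budget.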
The prototype’s claim is that the runtime should choose between these behaviours rather than being trapped inside one of them. Agents should plan and adapt. Semantic operators should execute the expensive parts. The optimizer should decide how much model quality is needed and avoid redundant computation. Context should preserve what the system learned.
That is the architecture. The benchmark results are evidence for it, not the entire thesis.
The Kramabench Result Tests Cross-File Reasoning, Not Enterprise Omniscience
The first evaluation uses a query from Kramabench’s legal workload. The dataset contains 132 CSV and HTML files with fraud, identity-theft, and consumer-report statistics. The query asks for the ratio between the number of identity-theft reports in 2024 and 2001. The ground truth is in a single CSV file containing annual statistics from 2001 through 2024.
The authors compare three systems across three trials: a handcrafted semantic-operator program, a SmolAgents CodeAgent with file-listing and file-reading tools, and the prototype’s Palimpzest compute operator.
| System | Percent error | Cost | Runtime |
|---|---|---|---|
| Handcrafted semantic operators | 17.00% | $1.66 | 215.2s |
| CodeAgent | 27.56% | $0.03 | 77.0s |
| Palimpzest compute | 0.02% | $1.17 | 583.0s |
The important interpretation is not “the prototype is fastest”. It is not. On this task, the prototype is slower than both baselines. The win is accuracy under a query that strains ordinary iterator semantics and casual agent execution.
The handcrafted semantic-operator program computed the correct ratio in all three trials, but in two trials it also produced a second ratio because an errant file slipped through a semantic filter. That matters because production analytics systems rarely get credit for returning the right answer plus a plausible wrong answer in the same envelope. The CodeAgent was cheap and fast, but often found the wrong files and computed spurious ratios from non-ground-truth sources.
The compute operator worked by writing optimized Palimpzest programs to search for the 2024 and 2001 identity-theft information, then using Python to compute the final ratio. That is the mechanism in miniature: agentic decomposition, optimized semantic execution, ordinary computation at the end.
For business readers, the lesson is not that this prototype is already an interactive analytics dream. A 583-second runtime is not exactly “instant dashboard refresh”. The lesson is that accuracy on messy analytical questions may require a runtime that can move between search, semantic processing, and deterministic computation without pretending that one interface style solves all problems.
The Enron Result Tests Recall and Execution Efficiency
The second evaluation uses a document-processing task over a 250-email subset of the Enron dataset. The system must filter for emails containing firsthand discussion of specified business transactions. The original task also asks for sender, subject, and summary extraction, but the paper simplifies evaluation to precision, recall, and F1-score for the returned emails.
The comparison is again across three trials. This time, the baselines are a plain CodeAgent and a stronger CodeAgent+ that has tools for unoptimized semantic filters and maps. The prototype uses the compute operator.
| System | F1 | Recall | Precision | Cost | Runtime |
|---|---|---|---|---|---|
| CodeAgent | 50.53% | 46.15% | 88.89% | $0.08 | 37.0s |
| CodeAgent+ | 98.67% | 97.44% | 100% | $3.76 | 1,999.9s |
| Palimpzest compute | 98.67% | 97.44% | 100% | $0.87 | 546.2s |
Here the contrast is cleaner. The plain agent is cheap and fast, but recall is poor. It finds some relevant emails and avoids many false positives, but it misses too much. CodeAgent+ fixes quality by using semantic operators, reaching the same F1, recall, and precision as the prototype. But it does so expensively and slowly. The paper reports that Palimpzest compute preserves the same quality while reducing cost by 76.8% and runtime by 72.7% relative to CodeAgent+.
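The reductions can be rechecked from the rounded figures in the table above. The helper name below is ours, not the paper's, and the inputs are the table values as printed.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Table values: CodeAgent+ versus Palimpzest compute on the Enron task.
cost_reduction = relative_reduction(3.76, 0.87)
runtime_reduction = relative_reduction(1999.9, 546.2)

print(f"cost: {cost_reduction:.1f}%  runtime: {runtime_reduction:.1f}%")
```

From the rounded table values this works out to roughly 76.9% for cost and 72.7% for runtime. The cost figure sits a tenth of a point above the reported 76.8%, presumably because the paper computed it from unrounded costs.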
This is the result that should interest enterprise teams. Not because the dataset is large—it is only 250 emails, deliberately limited to keep costs reasonable—but because the mechanism matches a common production problem. Giving an agent more tools does not automatically make it efficient. An agent can call the right kind of tool in the wrong execution pattern. It may stack filters naively, fail to inspect intermediate outputs, or process records redundantly. Tool access is not optimization. It is merely permission.
Palimpzest’s advantage comes from making the agent write a program for an optimizer-backed semantic execution engine. The engine can avoid redundant computation on already-filtered emails and choose cheaper models for some operators. That is a real systems distinction. In practical terms, the paper is less an argument for “agentic analytics” than for optimizer-mediated agentic analytics. Slightly less catchy. Much more useful.
The Evaluation Is Small, but Its Purpose Is Clear
The paper’s experiments should be read as a prototype demonstration, not a broad benchmark victory lap. That does not weaken the architectural argument; it just bounds it.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Kramabench identity-theft ratio | Main evidence for dynamic execution over cross-file analytics | compute can outperform a handcrafted semantic-operator program and a plain CodeAgent on a query requiring file selection and computation. | It does not show low latency, broad benchmark generality, or production readiness. |
| Enron firsthand-transaction emails | Main evidence for recall plus optimized semantic execution | compute can match high-quality semantic processing while reducing cost and runtime relative to an agent using unoptimized semantic tools. | It does not show scaling behaviour across millions of emails or many query types. |
| Context abstraction | Implementation mechanism | Agents need descriptions, indexes, and tools to choose better access patterns. | It does not solve data freshness, permissions, or governance by itself. |
| Materialized Context reuse | Exploratory physical optimization | Prior execution traces may help future queries, similar to materialized views. | The paper does not provide a full evaluation of reuse policies, invalidation, or long-term cache quality. |
| Logical optimizations such as splitting, merging, and dynamic insertion of search | Future work | The authors identify plausible optimizer extensions. | These are not implemented results in the prototype. |
This matters because the tempting misconception is obvious: “Deep Research is now the new enterprise analytics system.” Not quite. The paper’s title is intentionally provocative, but the evidence supports a narrower and more credible point. Deep Research-like systems become more useful when treated as query-planning runtimes with optimizer support, not as autonomous analysts roaming the data lake with a badge and a token budget.
The distinction is not pedantic. It changes procurement, architecture, and evaluation.
What Businesses Should Infer: Build the Runtime, Not the Demo
A demo can answer one impressive question over a folder of documents. A runtime has to answer the fifth question, the follow-up question, the recurring monthly question, and the legally sensitive question whose answer must be reproducible six months later.
The paper suggests a practical pathway for companies dealing with large unstructured corpora:
- Wrap messy corpora as governed contexts. A Context should not be a vague folder pointer. It should describe the corpus, expose safe access methods, include useful indexes, and provide small, well-documented tools. In enterprise environments, those tools also need permission boundaries. “Search everything” is rarely a governance policy.
- Separate planning from execution. Let agents decide how to decompose the analytical task, but route expensive document processing through semantic operators that can be optimized. The agent should not manually reinvent batch processing in Python every time someone asks a nuanced question.
- Measure recall, not just eloquence. The Enron result is a useful reminder: a plain agent can look good because its returned items are mostly correct. But if it misses half the relevant documents, the business process fails quietly. Compliance and diligence workflows need recall-sensitive evaluation, not just plausible summaries.
- Treat LLM calls as query-plan costs. LLM usage should be optimized like compute, storage, and network movement. Model selection, operator ordering, filtering strategy, and reuse all affect cost and latency. Token spend is not a mystical tax. It is a planning variable.
- Reuse intermediate analytical work. Materialized contexts are potentially powerful because many enterprise questions recur with slight variation. If yesterday’s query identified the right subset of contracts, today’s query should not need to rediscover the same subset from scratch. The hard part, not fully solved here, is cache governance: freshness, invalidation, provenance, and access control.
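The recall point in the list above is easy to operationalize once a labelled set exists. The sketch below is a generic precision, recall, and F1 calculation over document IDs; the IDs are invented and the function is not taken from the paper's evaluation code.

```python
from typing import Set, Tuple


def precision_recall_f1(returned: Set[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Score a returned set of document IDs against a labelled relevant set."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: everything returned is correct, but half the relevant set is missed.
relevant = {"e1", "e2", "e3", "e4"}
returned = {"e1", "e2"}

p, r, f1 = precision_recall_f1(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

This is the “beautiful answer to a smaller question” pattern in numbers: perfect precision, half the relevant documents missing, and a summary that would still read well.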
The business value, then, is not simply “cheaper Deep Research”. It is cheaper, more disciplined, more reusable analytical execution over unstructured data.
That is a different category from chat-with-your-documents. Chat-with-your-documents is an interface. This is an execution layer.
A Useful Architecture Sketch for Enterprise Translation
The paper’s prototype is research code, not a shopping list. Still, its layers translate cleanly into enterprise design language.
| Layer | Research prototype idea | Enterprise analogue | Key question |
|---|---|---|---|
| User interface | Natural-language query passed to compute or search | Analyst chat, notebook, workflow trigger, case-management UI | Is the user asking for exploration, extraction, filtering, or calculation? |
| Planning | CodeAgent decomposes the task and writes code | Agent planner with tool registry and policy constraints | Can the planner choose a safe and auditable route? |
| Data access | Context with iterator, indexes, tools, and description | Data lake wrapper, vector index, metadata catalogue, permission-aware tools | Does the system know how to reach the right data without blindly reading everything? |
| Semantic execution | Optimized Palimpzest programs | LLM-powered filters, maps, joins, extractors with cost-aware execution | Are LLM calls ordered and selected intelligently? |
| Reuse | Materialized contexts indexed by description | Cached analytical artifacts, execution traces, reusable subsets, derived tables | Can future queries reuse prior work without violating freshness or access rules? |
| Governance | Not the paper’s main focus | Audit logs, lineage, policy enforcement, evaluation harnesses | Can the organization trust and reproduce the answer? |
This is where business leaders should resist the urge to buy the surface feature and ignore the machinery. “Deep Research” as a button may be useful. “Deep Research” as a runtime has to answer boring systems questions: What was read? Which filters ran? Which model judged which predicate? What was cached? When does the cache expire? How do we know recall did not collapse?
Boring questions are where enterprise value usually hides. Very inconsiderate of it.
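Those audit questions map naturally onto a record the runtime could emit per query. The schema below is a hypothetical sketch, not something the paper specifies: every field name, the model label, and the sample values are placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class OperatorRun:
    """One semantic operator execution (hypothetical schema)."""

    operator: str        # e.g. "filter" or "map"
    predicate: str       # the natural-language instruction that was judged
    model: str           # placeholder label for the model that judged it
    records_in: int
    records_out: int


@dataclass
class ExecutionTrace:
    """What was read, what ran, and what was cached for a single query."""

    query: str
    files_read: List[str] = field(default_factory=list)
    operator_runs: List[OperatorRun] = field(default_factory=list)
    cache_key: Optional[str] = None
    cache_expires_at: Optional[str] = None   # ISO-8601 timestamp, if cached

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Placeholder values for illustration only.
trace = ExecutionTrace(
    query="emails with first-hand discussion of the transactions",
    files_read=["<email-001>", "<email-002>"],
    operator_runs=[
        OperatorRun("filter", "mentions a listed transaction", "<model-a>", 250, 40),
        OperatorRun("filter", "discussion is first-hand", "<model-b>", 40, 12),
    ],
    cache_key="<materialized-context-id>",
)

print(trace.to_json())
```

A record like this does not make an answer correct, but it makes the answer inspectable, which is what reproducibility six months later actually requires.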
Where the Paper’s Boundaries Actually Matter
The limitations are specific.
First, the evaluation is small. Two queries and three trials per system are enough to demonstrate failure modes and prototype promise. They are not enough to establish general performance across industries, corpora, languages, file types, security regimes, and query distributions. Any vendor converting these tables into a universal benchmark claim would be doing interpretive gymnastics.
Second, the Kramabench result improves accuracy but not latency. The prototype takes 583.0s on that task, against 215.2s for the handcrafted semantic-operator program and 77.0s for the CodeAgent. For interactive analytics, that matters. If the system is used for high-stakes compliance review, the extra time may be acceptable. If it is expected to support rapid exploratory analysis, further optimization is needed.
Third, some optimizer ideas are explicitly future work. Logical rewrites that split or merge search and compute, dynamic insertion of search before failing compute operations, and richer model-selection strategies are discussed as opportunities, not as fully evaluated contributions. The materialized-context reuse mechanism is implemented in preliminary form but not deeply benchmarked.
Fourth, the paper does not solve enterprise governance. Context descriptions and execution traces are useful ingredients, but production deployments need access control, provenance, retention policy, evaluation sets, human review thresholds, and incident handling. The runtime idea makes governance more feasible; it does not magically provide it.
These boundaries do not make the paper less interesting. They make it easier to use correctly.
The Real Shift: From Agentic Answers to Optimized Analytical Execution
The paper’s strongest contribution is conceptual: it places Deep Research agents, semantic-operator systems, and OLAP databases on the same systems map. All three take a declarative request, generate a plan, execute that plan over data, and return a result. The difference is in how plans are represented, optimized, adapted, and reused.
Once seen this way, the future of enterprise AI analytics looks less like a super-agent and more like a hybrid runtime:
- agents for decomposition and adaptation;
- semantic operators for systematic LLM-powered processing;
- indexes and context descriptions for smarter data access;
- optimizers for cost, latency, and model choice;
- materialized contexts for reuse;
- governance layers for trust.
That is less glamorous than “ask anything, instantly”. It is also closer to how serious systems get built.
The prototype does not prove that Deep Research has conquered analytics. It shows why Deep Research, left as an agent pattern, is insufficient. The next useful step is to turn it into infrastructure: something that plans, executes, records, reuses, and optimizes.
In other words, the agent may write the plan. The runtime still has to keep it honest.
Cognaptus: Automate the Present, Incubate the Future.
[1] Matthew Russo and Tim Kraska, “Deep Research is the New Analytics System: Towards Building the Runtime for AI-Driven Analytics,” arXiv:2509.02751, 2025.