Opening — Why this matters now
The past year has crowned a new class of AI tools: “Deep Research” agents. They browse, summarize, and produce long-form reports with suspicious confidence. For a while, that was enough.
But cracks are showing.
Ask these systems anything requiring actual data reasoning—market structure shifts, policy impacts, or cross-domain comparisons—and they begin to hallucinate sophistication. The problem isn’t intelligence. It’s foundations.
Most AI research agents are still glorified web readers.
The paper *Towards Knowledgeable Deep Research* introduces a more uncomfortable idea: real research requires structured knowledge—tables, numbers, relationships—not just text. And more importantly, it requires systems that can think with data, not just quote it.
Background — Context and prior art
Deep Research (DR) agents emerged as the natural evolution of LLM capabilities:
| Capability Layer | Traditional LLM | Deep Research Agent |
|---|---|---|
| Information Access | Static knowledge | Web search + retrieval |
| Reasoning | Single-step | Multi-step workflows |
| Output | Short answers | Long-form reports |
However, as the paper points out, nearly all existing DR systems share a structural limitation:
They rely heavily on unstructured web content and lack meaningful interaction with structured data.
This leads to three predictable failures:
- Weak quantitative reasoning — numbers are cited, not analyzed
- Shallow conclusions — synthesis without computation
- Illusion of rigor — reports look analytical but lack data grounding
In other words, they resemble interns who read everything—but never opened Excel.
Analysis — What the paper actually does
1. Redefining the problem: Knowledgeable Deep Research (KDR)
The authors introduce a stricter formulation:
A research agent must reason over both structured (tables) and unstructured (text) knowledge to generate grounded reports.
Formally, the task expands from:
- “Find and summarize information”
To:
- “Retrieve, compute, validate, and synthesize across heterogeneous knowledge sources”
This is not incremental. It is architectural.
2. The HKA Framework: Divide and specialize
The proposed system—Hybrid Knowledge Analysis (HKA)—is a multi-agent pipeline with explicit functional separation:
| Component | Role | Failure it Fixes |
|---|---|---|
| Planner | Decomposes tasks | Avoids chaotic reasoning |
| Unstructured Analyzer | Handles web/text | Maintains context richness |
| Structured Analyzer | Handles tables + computation | Enables real analysis |
| Writer | Synthesizes outputs | Prevents fragmentation |
The key innovation is not the multi-agent setup (we’ve seen that before), but the Structured Knowledge Analyzer (SKA).
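The role separation in the table above can be sketched as a minimal pipeline. Everything here is an illustrative stand-in—the class names, the naive task split, the placeholder findings—not the paper’s actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    question: str
    needs_tables: bool

class Planner:
    def decompose(self, question: str) -> list[Subtask]:
        # A real planner would call an LLM; here we split naively into
        # one structured and one unstructured subtask.
        return [Subtask(question, needs_tables=True),
                Subtask(question, needs_tables=False)]

class StructuredAnalyzer:
    def analyze(self, task: Subtask) -> str:
        return f"[table-derived finding for: {task.question}]"

class UnstructuredAnalyzer:
    def analyze(self, task: Subtask) -> str:
        return f"[text-derived context for: {task.question}]"

class Writer:
    def synthesize(self, findings: list[str]) -> str:
        return "\n".join(findings)

def run_pipeline(question: str) -> str:
    planner, ska, uka, writer = Planner(), StructuredAnalyzer(), UnstructuredAnalyzer(), Writer()
    findings = [
        (ska if t.needs_tables else uka).analyze(t)
        for t in planner.decompose(question)
    ]
    return writer.synthesize(findings)

report = run_pipeline("How did EV market share shift in 2023?")
```

The point of the separation: the Writer never touches raw tables or raw web text, so each analyzer can fail, retry, or be swapped out independently.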
3. The real breakthrough: treating data as executable
Instead of stuffing tables into prompts (which is inefficient and brittle), the system:
- Converts tables into structured objects
- Generates Python code dynamically
- Executes computations
- Uses vision-language models to interpret outputs
This creates a pipeline where:
Data → Code → Execution → Insight → Narrative
Notably, the system includes retry mechanisms that reduce execution failure rates from 31.7% to 0.51%, and visual analysis errors from 55.5% to 1.7%.
That’s not optimization. That’s operational viability.
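To make the Data → Code → Execution → Insight step concrete, here is a stdlib-only sketch (a real analyzer would likely generate pandas code); the table, column names, and figures are invented for illustration:

```python
import csv
import io

# A small table arrives as text, is parsed into structured records,
# a computation runs over it, and the result becomes a narrative claim.
csv_text = """region,revenue_2022,revenue_2023
North,120,150
South,200,180
West,90,135
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))
for r in rows:
    prev, curr = float(r["revenue_2022"]), float(r["revenue_2023"])
    r["growth_pct"] = (curr - prev) / prev * 100  # year-over-year growth

top = max(rows, key=lambda r: r["growth_pct"])
insight = f"{top['region']} grew fastest at {top['growth_pct']:.1f}%."
# insight → "West grew fastest at 50.0%."
```

The claim at the end is computed, not quoted—exactly the distinction the paper draws between citing numbers and analyzing them.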
4. Evaluation: KDR-Bench (where most papers get lazy)
Instead of generic benchmarks, the authors build a domain-diverse dataset:
| Metric Category | What It Measures | Why It Matters |
|---|---|---|
| General-purpose | Coherence, depth, readability | Surface quality |
| Knowledge-centric | Use of correct data & conclusions | Actual reasoning |
| Vision-enhanced | Use of figures and layout | Multimodal intelligence |
The dataset includes:
- 9 domains
- 41 expert-level questions
- 1,252 structured tables
This is unusually grounded for an LLM paper—and refreshingly difficult.
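A knowledge-centric metric like key point coverage can be approximated crudely as the fraction of expert key points recoverable from a report. The matching rule and example strings below are invented for illustration; the benchmark’s actual judges are far more sophisticated:

```python
def key_point_coverage(report: str, key_points: list[str]) -> float:
    # Naive substring matching; a real judge would use an LLM or
    # semantic similarity rather than literal containment.
    hits = sum(1 for kp in key_points if kp.lower() in report.lower())
    return hits / len(key_points)

report = "EV market share rose 4 points; subsidies drove adoption."
points = ["market share rose", "subsidies drove adoption", "supply constraints"]
coverage = key_point_coverage(report, points)  # 2 of 3 points covered
```

Even this toy version shows why the metric bites: fluent prose that never lands the expert’s key points scores low, regardless of how readable it is.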
Findings — What actually works (and what doesn’t)
1. Structured reasoning is not optional
| System Type | Performance Trend |
|---|---|
| Web-only agents | High readability, low depth |
| Table-only agents | Higher depth, weaker synthesis |
| Hybrid (naive) | Marginal improvement |
| HKA (structured integration) | Strong across all metrics |
The key insight:
Simply adding data access is not enough. You need structured reasoning pipelines.
2. HKA vs industry systems
From the benchmark results:
| Metric | Best Baseline (Gemini DR) | HKA |
|---|---|---|
| General Score | 50.2 | 48.4 |
| Key Point Coverage | 58.3 | 61.7 |
| Supportiveness | ~20–21 (others) | 27.8 |
Interpretation:
- Gemini still wins on general fluency (unsurprising)
- HKA dominates where it matters: data-backed reasoning
A subtle but critical distinction.
3. Vision matters more than expected
When evaluated with multimodal judges:
- HKA outperforms Gemini in overall win rate
- Generates ~5.75 figures per report vs ~2.0 baseline in some domains
Translation: the ability to show reasoning (not just describe it) is now a competitive edge.
4. Ablation reveals the uncomfortable truth
Removing either:
- Structured analyzer → major drop
- Unstructured analyzer → also drops
Conclusion:
Real research is inherently hybrid. Any single-mode system is incomplete.
Implications — What this means for AI builders
1. “Search + summarize” is a dead-end product
If your AI tool:
- Retrieves documents
- Summarizes them
- Calls it “research”
…it’s already obsolete.
The next generation must:
- Execute computations
- Validate outputs
- Generate evidence (charts, tables)
2. Code generation becomes a core reasoning layer
This paper quietly confirms a trend:
The most reliable way for LLMs to reason about data is to write and execute code.
Expect future architectures to standardize:
- Code-first reasoning
- Data schema abstraction
- Execution-aware feedback loops
3. Evaluation frameworks will redefine competition
Benchmarks like KDR-Bench expose a gap:
- Current systems optimize for textual plausibility
- Future systems will optimize for data correctness
This shift will reshape leaderboards—and product claims.
4. Multimodal outputs are not UI polish—they’re cognition
Figures, tables, and layouts are not presentation layers anymore.
They are part of the reasoning process.
That’s a conceptual shift most companies haven’t priced in.
Conclusion — The quiet transition from language to knowledge
The industry has spent two years teaching machines to speak.
Now it has to teach them to think.
This paper makes a simple but disruptive argument:
Intelligence is not the ability to generate text—it’s the ability to reason over structured reality.
And that requires something most LLM systems still avoid:
Discipline.
Not more parameters. Not better prompts.
Just better thinking.
Cognaptus: Automate the Present, Incubate the Future.