Opening — Why this matters now

The past year has crowned a new class of AI tools: “Deep Research” agents. They browse, summarize, and produce long-form reports with suspicious confidence. For a while, that was enough.

But cracks are showing.

Ask these systems anything requiring actual data reasoning—market structure shifts, policy impacts, or cross-domain comparisons—and they begin to hallucinate sophistication. The problem isn’t intelligence. It’s foundation.

Most AI research agents are still glorified web readers.

The paper Towards Knowledgeable Deep Research introduces a more uncomfortable idea: real research requires structured knowledge—tables, numbers, relationships—not just text. And more importantly, it requires systems that can think with data, not just quote it.

Background — Context and prior art

Deep Research (DR) agents emerged as the natural evolution of LLM capabilities:

| Capability Layer | Traditional LLM | Deep Research Agent |
| --- | --- | --- |
| Information Access | Static knowledge | Web search + retrieval |
| Reasoning | Single-step | Multi-step workflows |
| Output | Short answers | Long-form reports |

However, as the paper points out, nearly all existing DR systems share a structural limitation:

They rely heavily on unstructured web content and lack meaningful interaction with structured data.

This leads to three predictable failures:

  1. Weak quantitative reasoning — numbers are cited, not analyzed
  2. Shallow conclusions — synthesis without computation
  3. Illusion of rigor — reports look analytical but lack data grounding

In other words, they resemble interns who read everything—but never opened Excel.

Analysis — What the paper actually does

1. Redefining the problem: Knowledgeable Deep Research (KDR)

The authors introduce a stricter formulation:

A research agent must reason over both structured (tables) and unstructured (text) knowledge to generate grounded reports.

Formally, the task expands from:

  • “Find and summarize information”

To:

  • “Retrieve, compute, validate, and synthesize across heterogeneous knowledge sources”

This is not incremental. It is architectural.

2. The HKA Framework: Divide and specialize

The proposed system—Hybrid Knowledge Analysis (HKA)—is a multi-agent pipeline with explicit functional separation:

| Component | Role | Failure It Fixes |
| --- | --- | --- |
| Planner | Decomposes tasks | Avoids chaotic reasoning |
| Unstructured Analyzer | Handles web/text | Maintains context richness |
| Structured Analyzer | Handles tables + computation | Enables real analysis |
| Writer | Synthesizes outputs | Prevents fragmentation |

The key innovation is not the multi-agent setup (we’ve seen that before), but the Structured Knowledge Analyzer (SKA).
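The division of labor above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the component names mirror the paper's table, but every interface, function name, and routing rule here is a hypothetical stand-in.

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    question: str
    needs_tables: bool  # True routes the sub-task to the Structured Analyzer

def planner(query: str) -> list[SubTask]:
    """Decompose the research query into typed sub-tasks (stubbed heuristic)."""
    return [
        SubTask(f"background: {query}", needs_tables=False),
        SubTask(f"quantitative: {query}", needs_tables=True),
    ]

def unstructured_analyzer(task: SubTask) -> str:
    """Handle web/text evidence; a real system would search and read."""
    return f"[text evidence] {task.question}"

def structured_analyzer(task: SubTask) -> str:
    """Handle tables and computation; a real system would run generated code."""
    return f"[computed result] {task.question}"

def writer(findings: list[str]) -> str:
    """Synthesize per-task findings into a single report."""
    return "\n".join(findings)

def run_hka(query: str) -> str:
    findings = [
        structured_analyzer(t) if t.needs_tables else unstructured_analyzer(t)
        for t in planner(query)
    ]
    return writer(findings)
```

The point of the sketch is the explicit routing: each sub-task is typed at planning time, so structured and unstructured evidence flow through different specialists before the writer ever sees them.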

3. The real breakthrough: treating data as executable

Instead of stuffing tables into prompts (which is inefficient and brittle), the system:

  1. Converts tables into structured objects
  2. Generates Python code dynamically
  3. Executes computations
  4. Uses vision-language models to interpret outputs

This creates a pipeline where:

Data → Code → Execution → Insight → Narrative

Notably, the system includes retry mechanisms that reduce execution failure rates from 31.7% to 0.51%, and visual analysis errors from 55.5% to 1.7%.

That’s not optimization. That’s operational viability.
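The generate-execute-retry loop behind those numbers can be sketched as follows. This is a hedged toy version: `generate_code` stands in for an LLM call (and is deliberately buggy on the first attempt to exercise the retry), so treat it as an assumption about the mechanism, not the paper's actual code.

```python
def generate_code(table: dict, attempt: int) -> str:
    # A real system would prompt an LLM with the table schema and,
    # on retries, with the previous error message. Here, attempt 0
    # returns code with a typo so the retry path is visible.
    if attempt == 0:
        return "result = sum(tabl['revenue'])"  # NameError: 'tabl'
    return "result = sum(table['revenue'])"

def execute_with_retry(table: dict, max_retries: int = 3):
    """Run generated analysis code, feeding failures back into regeneration."""
    last_error = None
    for attempt in range(max_retries):
        code = generate_code(table, attempt)
        namespace = {"table": table}
        try:
            exec(code, namespace)  # execute the generated snippet
            return namespace["result"]
        except Exception as e:
            last_error = e  # in a real system: include this in the next prompt
    raise RuntimeError(f"all retries failed: {last_error}")

revenue_table = {"revenue": [120, 95, 140]}
print(execute_with_retry(revenue_table))  # → 355
```

The first attempt fails with a `NameError`, the second succeeds; multiplied across thousands of generated snippets, that one feedback loop is what turns a 31.7% failure rate into a rounding error.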

4. Evaluation: KDR-Bench (where most papers get lazy)

Instead of generic benchmarks, the authors build a domain-diverse dataset:

| Metric Category | What It Measures | Why It Matters |
| --- | --- | --- |
| General-purpose | Coherence, depth, readability | Surface quality |
| Knowledge-centric | Use of correct data & conclusions | Actual reasoning |
| Vision-enhanced | Use of figures and layout | Multimodal intelligence |

The dataset includes:

  • 9 domains
  • 41 expert-level questions
  • 1,252 structured tables

This is unusually grounded for an LLM paper—and refreshingly difficult.

Findings — What actually works (and what doesn’t)

1. Structured reasoning is not optional

| System Type | Performance Trend |
| --- | --- |
| Web-only agents | High readability, low depth |
| Table-only agents | Higher depth, weaker synthesis |
| Hybrid (naive) | Marginal improvement |
| HKA (structured integration) | Strong across all metrics |

The key insight:

Simply adding data access is not enough. You need structured reasoning pipelines.

2. HKA vs industry systems

From the benchmark results:

| Metric | Best Baseline (Gemini DR) | HKA |
| --- | --- | --- |
| General Score | 50.2 | 48.4 |
| Key Point Coverage | 58.3 | 61.7 |
| Supportiveness | ~20–21 (others) | 27.8 |

Interpretation:

  • Gemini still wins on general fluency (unsurprising)
  • HKA dominates where it matters: data-backed reasoning

A subtle but critical distinction.

3. Vision matters more than expected

When evaluated with multimodal judges:

  • HKA outperforms Gemini in overall win rate
  • Generates ~5.75 figures per report, vs. ~2.0 for the baseline in some domains

Translation: the ability to show reasoning (not just describe it) is now a competitive edge.

4. Ablation reveals the uncomfortable truth

Removing either:

  • Structured analyzer → major drop
  • Unstructured analyzer → also drops

Conclusion:

Real research is inherently hybrid. Any single-mode system is incomplete.

Implications — What this means for AI builders

1. “Search + summarize” is a dead-end product

If your AI tool:

  • Retrieves documents
  • Summarizes them
  • Calls it “research”

…it’s already obsolete.

The next generation must:

  • Execute computations
  • Validate outputs
  • Generate evidence (charts, tables)

2. Code generation becomes a core reasoning layer

This paper quietly confirms a trend:

The most reliable way for LLMs to reason about data is to write and execute code.

Expect future architectures to standardize:

  • Code-first reasoning
  • Data schema abstraction
  • Execution-aware feedback loops

3. Evaluation frameworks will redefine competition

Benchmarks like KDR-Bench expose a gap:

  • Current systems optimize for textual plausibility
  • Future systems will optimize for data correctness

This shift will reshape leaderboards—and product claims.

4. Multimodal outputs are not UI polish—they’re cognition

Figures, tables, and layouts are not presentation layers anymore.

They are part of the reasoning process.

That’s a conceptual shift most companies haven’t priced in.

Conclusion — The quiet transition from language to knowledge

The industry has spent two years teaching machines to speak.

Now it has to teach them to think.

This paper makes a simple but disruptive argument:

Intelligence is not the ability to generate text—it’s the ability to reason over structured reality.

And that requires something most LLM systems still avoid:

Discipline.

Not more parameters. Not better prompts.

Just better thinking.


Cognaptus: Automate the Present, Incubate the Future.