Artificial intelligence is moving past chatbots that answer questions. The next frontier is Deep Research Agents (DRAs) — AI systems that can decompose complex problems, gather information from multiple sources, reason across them, and synthesize their findings into structured reports. But until recently, there was no systematic way to measure how well these agents perform beyond surface-level reasoning.

That is the gap RigorousBench aims to fill.

From Q&A to Reports: The Benchmark Shift

Traditional LLM benchmarks — like GAIA, WebWalker, or BrowseComp — test how accurately a model answers factual questions. This approach works for short-form reasoning but fails for real-world research tasks that demand long-form synthesis and multi-source validation.

RigorousBench redefines evaluation by focusing on report-style generation instead of snippets. Its 214 queries span ten professional domains, from law and finance to health and sustainability. Each entry is accompanied by an expert-built reference bundle, including:

  • QSRs (Query-Specific Rubrics): task-specific evaluation rules measuring factual and logical accuracy
  • GRRs (General-Report Rubrics): universal criteria assessing structure, coherence, and citation quality
  • TSLs (Trustworthy-Source Links): human-curated authoritative sources for verifying retrieved data
  • FAKs / FDKs (Focus Anchor & Deviation Keywords): keyword sets used to detect topic relevance and drift

This design transforms evaluation into something closer to peer review than to multiple-choice testing.
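As a mental model only (not RigorousBench's actual schema), one query's reference bundle can be pictured as a small structured record. Every field name and value below is illustrative:

```python
from dataclasses import dataclass

@dataclass
class ReferenceBundle:
    """Illustrative container for one benchmark entry (field names are assumptions)."""
    query: str                           # the research question posed to the agent
    qsrs: list[str]                      # Query-Specific Rubrics: task-specific checks
    grrs: list[str]                      # General-Report Rubrics: structure, coherence, citations
    tsls: list[str]                      # Trustworthy-Source Links: curated authoritative sources
    focus_anchor_keywords: list[str]     # FAKs: terms the report should stay anchored to
    focus_deviation_keywords: list[str]  # FDKs: terms that signal topic drift

# Hypothetical entry from the "law"/"health" domains mentioned above
bundle = ReferenceBundle(
    query="How will new medical-AI regulation affect device certification timelines?",
    qsrs=["Identifies the relevant risk class", "States transition deadlines correctly"],
    grrs=["Clear section structure", "Every factual claim carries a citation"],
    tsls=["https://example-regulator.gov/guidance"],
    focus_anchor_keywords=["medical AI", "certification", "regulation"],
    focus_deviation_keywords=["consumer chatbots", "advertising policy"],
)
```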

A Multidimensional Evaluation Framework

Instead of scoring by overlap or similarity, RigorousBench applies a three-axis evaluation model:

  1. Semantic Quality (QSR + GRR): Measures factual precision, reasoning, and writing structure.
  2. Topical Focus (1 − Semantic Drift): Rewards on-topic synthesis while penalizing tangents.
  3. Retrieval Trustworthiness: Assesses citation credibility through a TrustworthyBoost, which rewards cited sources that match the curated Trustworthy-Source Links.

These dimensions merge into a single integrated score:

IntegratedScore = Quality × (1 − Drift) × TrustworthyBoost × 100

It’s a simple formula with profound implications: truth and coherence matter as much as fluency.
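For readers who want to see the mechanics, here is a minimal sketch of how the formula could be wired up, with a toy keyword-count heuristic standing in for whatever semantic-drift measure RigorousBench actually uses. All names, ranges, and the drift heuristic are assumptions, not the benchmark's published implementation:

```python
def toy_drift(report: str, anchor_kw: list[str], deviation_kw: list[str]) -> float:
    """Toy stand-in for semantic drift: share of matched keywords that are off-topic.

    This is NOT RigorousBench's method; it only illustrates how FAKs/FDKs
    could feed a drift estimate in [0, 1].
    """
    text = report.lower()
    anchors = sum(kw.lower() in text for kw in anchor_kw)
    deviations = sum(kw.lower() in text for kw in deviation_kw)
    total = anchors + deviations
    return deviations / total if total else 0.0


def integrated_score(quality: float, drift: float, trust_boost: float) -> float:
    """IntegratedScore = Quality × (1 − Drift) × TrustworthyBoost × 100.

    quality:     semantic quality from QSR + GRR grading, assumed in [0, 1]
    drift:       semantic drift, assumed in [0, 1] (0 = fully on topic)
    trust_boost: multiplier rewarding citations that match curated TSLs (assumed ≈ 1.0)
    """
    return quality * (1.0 - drift) * trust_boost * 100.0


# Example: a strong report (quality 0.82) with mild drift (0.10) and a small
# credibility bonus (1.05) lands at roughly 77.5.
print(integrated_score(quality=0.82, drift=0.10, trust_boost=1.05))  # ≈ 77.49
```

The multiplicative form is the point: a report that drifts badly or cites untrustworthy sources cannot buy its way back with fluent writing alone.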

What the Results Reveal

Thirteen models — five DRAs, one hybrid agent, and seven search-augmented LLMs — were benchmarked. The results are telling:

  • Qwen Deep Research achieved the highest overall score, balancing precision and coherence.
  • Sonar Deep Research excelled in topical focus, showing disciplined reasoning.
  • Kimi-K2, though not a DRA, led in writing quality thanks to its massive parameter count.
  • GPT-5 stood out for the credibility of its cited sources.

Yet the study also exposes trade-offs:

  • Efficiency vs. Quality: DRAs generate excellent reports but consume huge token budgets.
  • Decomposition vs. Coherence: Breaking problems into sub-tasks sometimes fractures meaning.
  • Stability vs. Adaptiveness: Even advanced agents like o3 or o4-mini showed erratic retrieval patterns.

Why It Matters

RigorousBench moves AI evaluation closer to how humans judge research quality — not by right answers, but by how well evidence is gathered, reasoning is explained, and uncertainty is managed. It highlights the maturation of LLMs into cognitive systems rather than statistical parrots.

For developers, it provides diagnostic visibility: where the reasoning broke, where the citations failed, where the focus drifted. For the AI community, it redefines what “performance” means in an age when the best models write full research reports.


Cognaptus: Automate the Present, Incubate the Future.