Beyond Answers: Measuring How Deep Research Agents Really Think

A research report is not an answer with extra paragraphs.

That sounds obvious until an enterprise team tries to evaluate a deep research agent by asking whether its final conclusion looks plausible, whether it included citations, and whether the prose sounded confident enough to survive a board deck. Congratulations: the machine has produced something that resembles diligence. Whether it actually performed diligence is the inconvenient question.

This is where Dr. Bench matters. In Dr. Bench: A Multidimensional Evaluation for Deep Research Agents, from Answers to Reports, Yang Yao and colleagues propose a benchmark built specifically for Deep Research Agents, or DRAs: systems expected to decompose complex tasks, retrieve across sources, reason through intermediate steps, and produce structured long-form reports rather than short factual answers.¹ The paper’s real contribution is not that it adds another leaderboard to the already crowded trophy cabinet. It changes what the leaderboard is allowed to measure.

The comparison is the point. Traditional benchmarks ask: did the model return the right answer? Dr. Bench asks a more operationally useful question: did the system perform the work in a way that a serious user could trust, inspect, and pay for without quietly regretting it later?

That difference is not academic decoration. It is the difference between buying an answer machine and buying a research operation.

The old benchmark asks whether the answer lands; Dr. Bench asks whether the work holds together

Most AI benchmarks are designed around outputs that are easy to score. A model answers a question. The evaluator checks exact match, string overlap, similarity, or a judge-model rating. This works reasonably well when the target is a short answer: a date, a named entity, a number, a multiple-choice option, or a compact explanation.

Deep research agents break that pattern. Their output is not a single fact but a report. A useful report has coverage, structure, source quality, topical discipline, synthesis, and proportionality. It may be wrong not because its final conclusion is false, but because it ignored a key sub-question, cited weak sources, wandered into irrelevant material, or spent 25,000 tokens achieving what a disciplined analyst could have achieved in 8,000.

That is the paper’s core correction to the familiar misconception: deep research quality is not the same as longer output plus more citations. Longer output can simply mean more room for drift. More citations can simply mean the model retrieved widely and filtered badly. A polished report can still be operationally useless if it fails the task-specific requirements that motivated the research request in the first place.

Dr. Bench responds by building the evaluation around report behaviour. It contains 214 expert-curated tasks across 10 broad domains, including academia and research, news and current affairs, law and politics, business and finance, technology intelligence, environment and sustainability, history and social sciences, and health and medicine. The domain spread is not a cosmetic detail. It prevents the benchmark from becoming a test of one narrow research style, such as academic survey writing or factual web browsing.

The important design choice is that each task is paired with a reference bundle, not merely a reference answer. That bundle includes query-specific rubrics, general report rubrics, trustworthy-source links, focus-anchor keywords, and focus-deviation keywords. In plain English: the benchmark tries to know what a good report should cover, what good report-writing looks like in general, which sources are authoritative, which concepts should stay central, and which distractions indicate the model has wandered off like a consultant discovering billable scope.

The benchmark is built like an audit file, not a quiz bank

Dr. Bench’s components map neatly onto the failure modes of enterprise research agents.

Benchmark component	What it checks	Enterprise failure it catches
Query-Specific Rubrics	Whether the report satisfies the particular task requirements	The agent writes something impressive but misses the actual assignment
General-Report Rubrics	Whether the report is structured, coherent, rigorous, cited, and readable	The agent dumps information without producing usable analysis
Trustworthy-Source Links	Whether citations align with authoritative sources	The agent cites whatever it found first, which is adorable in interns and expensive in machines
Focus-Anchor Keywords	Whether core concepts remain present and meaningfully discussed	The agent under-covers the central topic
Focus-Deviation Keywords	Whether likely distractions appear inappropriately	The agent drifts into adjacent but irrelevant material

The query-specific rubrics are especially important. Each task has at least eight such rubrics, with a total score of 30. These are not generic “was this helpful?” vibes. They are task-specific checks for factual accuracy, mechanism explanation, source verification, temporal logic, comparative analysis, methodology, and other domain-relevant requirements.

The general-report rubrics then evaluate qualities that apply across tasks: structure, logical clarity, informational depth, citation quality, originality, data usage, analytical rigor, and formatting consistency. The paper reports 48 general-report rubrics with a total score of 73. This makes the benchmark less brittle than one-off answer keys. A good report must satisfy both the local contract of the prompt and the broader standards of report quality.

The trustworthy-source links are the benchmark’s attempt to separate retrieval from mere browsing. Experts designate durable, authoritative, primary or official links where possible. The evaluation then checks whether the URLs a model cites match these sources exactly or, failing that, share the same hostname. That choice is practical. Exact URLs decay. Domains and institutions are more stable. A benchmark that cannot survive link rot becomes a museum exhibit with JSON.
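
The two-tier matching idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the trailing-slash and `www.` normalisation, and the `example.org` / `example.com` URLs are all assumptions made for the example.

```python
from urllib.parse import urlparse


def hostname(url: str) -> str:
    """Return the lower-cased hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def match_level(cited_url: str, trusted_urls: list[str]) -> str:
    """Classify a citation as 'exact', 'hostname', or 'none' (hypothetical tiers)."""
    normalized = cited_url.rstrip("/")
    if any(normalized == t.rstrip("/") for t in trusted_urls):
        return "exact"
    if hostname(cited_url) in {hostname(t) for t in trusted_urls}:
        return "hostname"
    return "none"


# Hypothetical reference bundle with a single trusted link.
trusted = ["https://www.example.org/official/annual-report"]

print(match_level("https://www.example.org/official/annual-report/", trusted))
print(match_level("https://example.org/press/other-page", trusted))
print(match_level("https://example.com/blog/hot-take", trusted))
```

The hostname tier is what lets the check survive link rot: a moved page on the same institutional domain still earns partial recognition, while an unrelated blog does not.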

The focus-anchor and focus-deviation keywords add another layer: topical discipline. The model must not merely produce enough content; it must keep attention on the right content. This matters because deep research agents often decompose tasks into sub-queries. Decomposition improves coverage, but it can also fracture intent. Once the agent starts chasing loosely related threads, the final report may become broad, fluent, and quietly useless.
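
A crude way to see what anchor and deviation keywords buy an evaluator is to count them. The sketch below is a surface-level proxy only; it is not the paper's SemanticDrift computation, and the function names, the whole-phrase matching rule, and the sample text are assumptions.

```python
import re


def keyword_hits(report: str, keywords: list[str]) -> dict[str, int]:
    """Count case-insensitive whole-phrase occurrences of each keyword."""
    text = report.lower()
    return {
        kw: len(re.findall(r"\b" + re.escape(kw.lower()) + r"\b", text))
        for kw in keywords
    }


def focus_summary(report: str, anchors: list[str], deviations: list[str]) -> dict:
    """Summarise anchor coverage and deviation mentions (hypothetical proxy)."""
    anchor_hits = keyword_hits(report, anchors)
    deviation_hits = keyword_hits(report, deviations)
    covered = sum(1 for n in anchor_hits.values() if n > 0)
    return {
        "anchor_coverage": covered / len(anchors) if anchors else 0.0,
        "missing_anchors": [k for k, n in anchor_hits.items() if n == 0],
        "deviation_mentions": sum(deviation_hits.values()),
    }


# Hypothetical report text and keyword lists.
report = (
    "Basel III raises capital requirements. "
    "The liquidity coverage ratio also applies. Crypto prices rose."
)
summary = focus_summary(
    report,
    anchors=["capital requirements", "liquidity coverage ratio", "leverage ratio"],
    deviations=["crypto"],
)
print(summary)
```

Counting mentions cannot tell whether a concept was "meaningfully discussed", which is why a real evaluation needs more than string matching. It does make the two failure directions visible: a missing anchor is under-coverage, a deviation hit is drift.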

The scoring formula rewards quality, punishes drift, and boosts credible retrieval

Dr. Bench combines three main dimensions: Semantic Quality, Topical Focus, and Retrieval Trustworthiness.

The integrated score is conceptually simple:

$$ \mathrm{IntegratedScore} = \mathrm{Quality} \times (1 - \mathrm{SemanticDrift}) \times \mathrm{TrustworthyBoost} \times 100 $$

This formula is useful because it refuses to let one attractive dimension hide another broken one. A report with strong writing quality can still lose ground if it drifts off topic. A focused report can gain credibility if it uses authoritative sources. A citation-heavy report does not automatically win unless those citations align with trusted references.

That multiplicative structure matters. In many enterprise evaluations, teams implicitly use additive thinking: good writing plus many citations plus plausible conclusion equals acceptable. Multiplicative scoring is harsher. It says that weak focus or weak trustworthiness should dampen the value of quality, not merely subtract a few points from a nice essay.

This is closer to how real research review works. A beautifully written market analysis that cites weak sources is not “mostly fine.” A technically accurate policy report that omits the central legal constraint is not “pretty good.” A health research summary that drifts into adjacent wellness chatter is not “comprehensive.” It is off-task with better formatting.
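
To make the multiplicative point concrete, here is a small sketch that applies the formula above to two invented reports and compares it with a naive additive average. The input numbers and the `additive_baseline` function are hypothetical and exist only for contrast; only `integrated_score` follows the formula as stated.

```python
def integrated_score(quality: float, semantic_drift: float, trustworthy_boost: float) -> float:
    """IntegratedScore = Quality x (1 - SemanticDrift) x TrustworthyBoost x 100."""
    return quality * (1.0 - semantic_drift) * trustworthy_boost * 100.0


def additive_baseline(quality: float, semantic_drift: float) -> float:
    """Hypothetical additive alternative: plain average of quality and focus."""
    return (quality + (1.0 - semantic_drift)) / 2.0 * 100.0


# Invented examples: a polished report that drifts badly, and a plainer one that stays on task.
polished_but_drifting = {"quality": 0.9, "semantic_drift": 0.8, "trustworthy_boost": 1.0}
plain_but_focused = {"quality": 0.7, "semantic_drift": 0.3, "trustworthy_boost": 1.03}

for label, r in [("polished/drifting", polished_but_drifting), ("plain/focused", plain_but_focused)]:
    mult = integrated_score(**r)
    add = additive_baseline(r["quality"], r["semantic_drift"])
    print(f"{label:18s} multiplicative={mult:5.1f} additive={add:5.1f}")
```

Under the additive average the drifting report still looks respectable, at roughly 55 against 70. Under the multiplicative formula it falls to roughly 18 against 50, which is the dampening effect described above.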

The main experiment is evidence for system-level differences, not a divine vendor ranking

The main experiment evaluates 13 models: five mainstream deep research agents, one advanced agent model, and seven reasoning models enhanced with web-search tools. This is the paper’s main evidence, not a side experiment. The leaderboard is designed to test whether the benchmark can distinguish report-generation performance across quality, focus, trustworthiness, token usage, and contribution density.

The headline result is that mainstream DRAs generally outperform web-search-tool-augmented reasoning models on integrated report performance. But the more useful finding is not “DRA good, search model bad.” That would be too simple, and therefore suspiciously convenient.

The top integrated scores in the main table are:

Model	Quality	1 - SemanticDrift	TrustworthyBoost	IntegratedScore	Average token usage
Qwen Deep Research	0.6348	0.5248	1.0288	34.6480	9,258
Sonar Deep Research	0.6184	0.5271	1.0238	33.4668	8,254
o3 Deep Research	0.6176	0.5184	1.0171	32.9004	25,038
Kimi-K2 Preview	0.6707	0.4671	1.0153	32.0651	2,079
Grok-4 Search	0.6130	0.4890	1.0283	31.3490	3,012
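
A quick arithmetic check on the table, offered as a sketch rather than a reproduction of the paper's pipeline. It multiplies each row's factors under two readings of the third numeric column: as the retained-focus factor (1 - SemanticDrift) that enters the formula directly, and as raw drift that must first be subtracted from one. Only the first reading lands near the reported IntegratedScore. The remaining small gaps are presumably because the reported score is averaged per task, and a mean of products need not equal a product of means; that explanation is an assumption, not something the table states.

```python
# Rows copied from the table above:
# (model, Quality, third numeric column, TrustworthyBoost, reported IntegratedScore)
ROWS = [
    ("Qwen Deep Research", 0.6348, 0.5248, 1.0288, 34.6480),
    ("Sonar Deep Research", 0.6184, 0.5271, 1.0238, 33.4668),
    ("o3 Deep Research", 0.6176, 0.5184, 1.0171, 32.9004),
    ("Kimi-K2 Preview", 0.6707, 0.4671, 1.0153, 32.0651),
    ("Grok-4 Search", 0.6130, 0.4890, 1.0283, 31.3490),
]


def product(quality: float, focus: float, boost: float) -> float:
    """Product of the three factors, scaled to 0-100."""
    return quality * focus * boost * 100.0


def max_gap(as_focus: bool) -> float:
    """Largest absolute gap between the factor product and the reported score."""
    gaps = []
    for _, quality, col3, boost, reported in ROWS:
        focus = col3 if as_focus else 1.0 - col3
        gaps.append(abs(reported - product(quality, focus, boost)))
    return max(gaps)


print(f"max gap, column read as 1 - drift: {max_gap(True):.2f}")
print(f"max gap, column read as raw drift: {max_gap(False):.2f}")
```

Read as the focus factor, every row's product falls within about half a point of the reported score; read as raw drift, the worst row misses by more than four points.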

Qwen leads overall. Sonar follows closely. o3 Deep Research is strong but expensive in token usage. Kimi-K2 records the highest Quality score among the listed models but does not win overall because its focus and trustworthiness factors weaken the integrated result. GPT-5, meanwhile, records the highest TrustworthyBoost in the main table, showing that source credibility can be a relative strength even when integrated report performance is not the highest.

This is exactly the kind of evidence enterprises should want. Not a single magic number. A capability profile.

A procurement team does not only need to know which model scored highest. It needs to know whether a system is strong because it writes better, cites better, stays focused better, searches more aggressively, or burns a ridiculous amount of compute to get there. These are different operational behaviours. They imply different costs, governance needs, and deployment roles.
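
One way to put cost next to score is a simple ratio of IntegratedScore to average token usage. The sketch below uses the figures from the table above. The ratio is an illustrative measure chosen for this example, not the paper's contribution-density metric.

```python
# (model, IntegratedScore, average token usage), copied from the table above.
ROWS = [
    ("Qwen Deep Research", 34.6480, 9258),
    ("Sonar Deep Research", 33.4668, 8254),
    ("o3 Deep Research", 32.9004, 25038),
    ("Kimi-K2 Preview", 32.0651, 2079),
    ("Grok-4 Search", 31.3490, 3012),
]


def score_per_1k_tokens(score: float, tokens: int) -> float:
    """Illustrative efficiency ratio: score points per 1,000 output tokens."""
    return score / (tokens / 1000.0)


ranked = sorted(ROWS, key=lambda r: score_per_1k_tokens(r[1], r[2]), reverse=True)
for name, score, tokens in ranked:
    print(f"{name:20s} score={score:7.4f} tokens={tokens:6d} per_1k={score_per_1k_tokens(score, tokens):5.2f}")
```

On this ratio the order changes: the shortest reports rank first and o3 Deep Research ranks last. That is not a verdict on which system is better. It shows that the ranking depends on which operational behaviour the buyer is paying for.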

Kimi-K2 is the useful exception because it proves quality alone is not enough

One of the most interesting results is Kimi-K2. It achieves the highest Quality score, yet ranks below Qwen, Sonar, and o3 in IntegratedScore. That is not an inconsistency. It is the benchmark doing its job.

Quality captures report substance and expression. But Dr. Bench’s full score also includes topical focus and retrieval trustworthiness. Kimi-K2’s strong writing and task-completion quality do not fully compensate for its weaker performance on the multiplicative factors. The appendix’s quality-share analysis reinforces this point: a model can look impressive on the quality axis while losing integrated value when credibility and attention are less balanced.

For business users, this is the uncomfortable lesson. The best writer is not necessarily the best researcher. A model can produce fluent, structured, persuasive reports while still being less reliable as a research workflow. This distinction matters in due diligence, regulatory analysis, market intelligence, legal monitoring, medical information synthesis, and policy assessment. In those settings, elegance is not evidence. It is packaging. Useful packaging, yes. Still packaging.

The practical takeaway is not to ignore writing quality. It is to stop treating writing quality as a proxy for process quality. That proxy was always lazy. Dr. Bench simply makes it measurable enough to embarrass us.

The supplementary tests explain behaviour rather than rewriting the main result

The paper includes supplementary dimensions for OpenAI models, including reasoning counts, search counts, and retrieval indices. These are best read as implementation and behavioural diagnostics, not as a second thesis.

The count data show that GPT-4.1 has minimal retrieval activity, with an average search count of 0.3925. GPT-5 is more active, averaging about 12.93 reasoning steps and 10.67 searches. The deep-research variants are much more intensive: o3 Deep Research averages 55.43 reasoning steps and 16.10 searches, while o4-mini Deep Research averages 63.99 reasoning steps and 26.51 searches.

This supports the paper’s claim that deep research agents follow more complex reasoning and retrieval patterns. But it does not prove that more searching is always better. In fact, the token and efficiency results point to a trade-off: more process can produce better reports, but it can also produce cost, latency, and instability.

The retrieval index comparison between o3 Deep Research and o4-mini Deep Research is similarly diagnostic. o4-mini shows a slightly lower RetrievalIndex, which the paper interprets as stronger filtering and citation precision. This is useful, but it should not be overread. RetrievalIndex is a signal about selection behaviour, not a full measure of report correctness.

The appendix’s domain-level table serves a different purpose: robustness and sensitivity across domains. It shows that top models generally perform strongly in Sports & Competitions and Health & Medicine, while maintaining more balanced but varied performance elsewhere. Qwen, Sonar, and o3 appear consistently strong across multiple domains. The lower-performing Claude variants form a weaker floor in the paper’s evaluation.

That does not mean these systems are universally better or worse in all real-world workflows. It means Dr. Bench can reveal domain sensitivity. For enterprise teams, that is more useful than average scores. A model that performs well on health summaries may not be the right model for legal-policy analysis. A model that excels in business and finance may still drift under history-and-social-science tasks. A single leaderboard rank is where nuance goes to die, usually wearing a nice badge.

The comparison with prior benchmarks is the real strategic move

Appendix F compares Dr. Bench with benchmarks such as WebWalker, BrowseComp, GAIA, WideSearch, Deep Research Bench, ResearchQA, ReportBench, and DeepResearch Arena. This comparison is not mere literature review. It clarifies the category shift.

Older or adjacent benchmarks often focus on closed tasks, short string answers, exact matching, F1, recall, or broad LLM-based criteria. Several newer benchmarks address open report-style tasks, but many rely heavily on automatically generated rubrics or generic LLM evaluation criteria.

Dr. Bench positions itself differently: open-ended report tasks, human-authored entries, expert rubrics, keywords, trustworthy links, and evaluation across quality, semantics, and retrieval credibility.

Here is the practical contrast:

Evaluation style	What it rewards	What it misses
Short-answer benchmarks	Getting the final fact right	Report structure, synthesis, source quality, drift, workflow behaviour
Generic LLM-as-judge report scoring	Plausible high-level quality assessment	Transparent task-specific criteria and stable human expectations
Citation consistency checks	Whether claims match cited links	Whether the cited sources are authoritative enough
Dr. Bench-style evaluation	Task completion, report quality, focus, trusted retrieval, efficiency signals	Still depends on curated tasks, rubric design, and judge-model implementation

That last row matters. Dr. Bench is stronger because it is more specific. It is also bounded because it is more specific. The benchmark is a serious evaluation template, not a universal oracle.

What this changes for enterprise AI evaluation

The business implication is straightforward: if an organisation is adopting deep research agents, it should stop evaluating them as chatbots with footnotes.

The paper directly shows that report-style evaluation can distinguish models across semantic quality, topical drift, source trustworthiness, and resource usage. It also shows that DRAs tend to outperform standard search-augmented reasoning models on complex report tasks, while exhibiting meaningful cost and stability trade-offs.

Cognaptus infers a broader operational lesson: enterprise teams need research-agent scorecards, not demo prompts.

A usable scorecard should ask at least five questions:

Task fidelity: Did the report satisfy the actual business question, including hidden sub-requirements?
Source authority: Did it rely on primary, official, durable, or otherwise trusted sources?
Topical discipline: Did it stay focused, or did decomposition create irrelevant branches?
Synthesis quality: Did it integrate evidence into a structured argument rather than paste together search fragments?
Execution efficiency: How many tokens, searches, steps, and citations were required to produce the result?
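
The five questions above can be held in a small data structure so that every evaluated report gets the same treatment. This is a sketch under stated assumptions: the class name, the 0-to-1 scale, and the 0.6 threshold are hypothetical choices, not anything the paper prescribes.

```python
from dataclasses import dataclass, fields


@dataclass
class ResearchScorecard:
    """Hypothetical per-report scorecard; each field is a reviewer score in [0, 1]."""
    task_fidelity: float
    source_authority: float
    topical_discipline: float
    synthesis_quality: float
    execution_efficiency: float

    def weakest(self) -> str:
        """Name of the lowest-scoring dimension."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name

    def failing(self, threshold: float = 0.6) -> list[str]:
        """Dimensions scoring below the (assumed) acceptance threshold."""
        return [f.name for f in fields(self) if getattr(self, f.name) < threshold]


# Invented scores for one report.
card = ResearchScorecard(
    task_fidelity=0.8,
    source_authority=0.4,
    topical_discipline=0.7,
    synthesis_quality=0.9,
    execution_efficiency=0.5,
)
print("weakest:", card.weakest())
print("failing:", card.failing())
```

Keeping the dimensions separate, rather than collapsing them into one number at the outset, is the design choice that matters: it preserves the diagnosis.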

This is especially important for high-value workflows: investment research, regulatory monitoring, competitor intelligence, legal scanning, scientific review, risk analysis, procurement diligence, and policy analysis. In these settings, the cost of a bad report is rarely the text itself. It is the decision that follows the text.

The right enterprise question is not “Which model writes the best report?” It is “Which system produces inspectable research under our domain constraints, at an acceptable cost, with failure modes we can detect before they become decisions?”

Less glamorous. Much more useful. The market will recover.

The cost problem is not just token usage; it is uncontrolled search behaviour

The paper’s discussion identifies two systemic limitations: unstable invocation behaviour and incoherent semantic decomposition. Both matter because they turn deep research from a capability into an operating-risk problem.

Unstable invocation behaviour means the agent’s reasoning and search patterns vary substantially across queries. In business terms, this makes cost and latency unpredictable. A system that sometimes answers in a disciplined path and sometimes goes spelunking through the internet is difficult to price, govern, and deploy at scale.

Incoherent decomposition is subtler. Deep research agents often break a task into sub-queries. That should improve coverage. But if the sub-queries drift semantically, switch language unexpectedly, or become unintelligible to human evaluators, decomposition stops being a strength and becomes a source of noise. The agent is no longer solving the original problem. It is managing a committee of confused sub-problems.

This is why the efficiency-quality trade-off and decomposition-coherence trade-off are central. Enterprises do not merely need deeper agents. They need agents with bounded depth, controlled retrieval, and coherent decomposition. Otherwise “deep research” becomes a polite term for expensive wandering.
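
Unstable invocation is easy to miss if a team only looks at averages. The sketch below compares two invented agents with the same mean search count but very different spread, using the coefficient of variation. The per-query counts are made up for illustration, and this statistic is one reasonable choice, not the paper's measure.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts: list[int]) -> float:
    """Population standard deviation divided by the mean (0.0 if the mean is zero)."""
    avg = mean(counts)
    return pstdev(counts) / avg if avg else 0.0


# Hypothetical searches-per-query logs for two agents; both average 10 searches.
steady_agent = [10, 11, 9, 10, 10]
erratic_agent = [2, 30, 5, 1, 12]

for label, counts in [("steady", steady_agent), ("erratic", erratic_agent)]:
    print(f"{label:8s} mean={mean(counts):5.1f} cv={coefficient_of_variation(counts):.2f}")
```

Both agents would report the same average cost per query. Only the spread reveals which one is hard to price and govern.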

Where Dr. Bench should not be overused

Dr. Bench is a strong evaluation framework, but its boundaries matter.

First, the benchmark contains 214 curated tasks. That is enough for meaningful evaluation, but not enough to represent every enterprise domain, internal data environment, regulatory jurisdiction, or proprietary research workflow. A bank, hospital, construction group, government contractor, or asset manager would still need domain-specific rubrics.

Second, the scoring depends on expert-designed rubrics, keywords, and trusted-source links. That is a strength because it improves interpretability. It is also a design dependency. Different experts may prioritise different sources, evaluation criteria, or levels of acceptable detail.

Third, semantic scoring uses GPT-4o as the LLM judge, with approximately 35% manual verification and reported 99.3% agreement. That is reassuring, but not the same as eliminating judge-model bias. It means the evaluation process is disciplined enough to be useful, not metaphysically pure. Anyone promising metaphysical purity in AI evaluation should be escorted gently away from the procurement meeting.

Fourth, the tested models are specific versions. Model behaviour changes. Vendor systems change. Tool access changes. Search infrastructure changes. A leaderboard is therefore a snapshot, not a constitution.

Finally, the TrustworthyBoost design rewards alignment with curated authoritative links. This is valuable for tasks where stable trusted sources exist. It may be harder to apply in domains where evidence is fragmented, proprietary, rapidly changing, or contested. In those contexts, the same principle should be retained, but the source policy must be adapted.

The business value is better diagnosis, not just better ranking

The most useful thing about Dr. Bench is not that Qwen beats Sonar by a narrow margin, or that Kimi-K2 writes well, or that o3 spends heavily. Those results are interesting, but rankings age quickly.

The durable contribution is diagnostic structure.

Dr. Bench gives evaluators a way to ask where a research agent failed. Did it misunderstand the task? Did it under-cover required subtopics? Did it cite weak sources? Did it drift? Did it search too much? Did it search too little? Did it write well while researching poorly? Did it perform strongly in one domain and weakly in another?

That is what enterprises need. Not another demo where an agent produces a confident 12-page report on market expansion. A useful evaluation must tell the buyer whether the report is grounded, focused, complete, and economically sane.

Deep research agents are moving AI from answers toward workflows. That shift requires a matching shift in measurement. Dr. Bench is not perfect, but it asks the right kind of uncomfortable question: not “Did the model answer?” but “Can we trust the way it researched?”

For business users, that is the difference between automation theatre and operational intelligence. One produces impressive documents. The other survives contact with decisions.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yang Yao, Yixu Wang, Yuxuan Zhang, Yi Lu, Tianle Gu, Lingyu Li, Dingyi Zhao, Keming Wu, Haozhe Wang, Ping Nie, Yan Teng, and Yingchun Wang, “Dr. Bench: A Multidimensional Evaluation for Deep Research Agents, from Answers to Reports,” arXiv:2510.02190, submitted 2 October 2025, revised 29 January 2026.
