Sequential Beats Parallel: When Deep Research Agents Learn to Reflect

A research request usually begins with a deceptively harmless sentence: “Can you give me the full picture?”

Then comes the usual enterprise ritual. Someone breaks the topic into pieces. One person checks competitors. Another checks regulation. Another reads technical reports. Another searches recent news. Everyone works quickly. Everyone returns with fragments. Then one unlucky analyst stitches the fragments into a report and pretends the seams are a design choice.

Deep research agents often copy this workflow. They fan out sub-questions, run searches in parallel, and merge the outputs at the end. It is fast. It is clean. It also has the same old problem: the left hand does not know what the right hand has already discovered.

The paper Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve) argues for a different architecture.¹ Its core claim is not simply that sequential processing is magically better than parallel processing. That would be a nice slogan, and therefore suspicious. The more useful claim is narrower: when a research task depends on discoveries made during the process, the agent should update the research plan while researching, not merely polish the report afterward.

That difference sounds small. It is not. It moves intelligence from the end of the pipeline into the middle.

A faster swarm is still blind if every worker owns a private notebook

Parallel deep research agents solve a real problem: latency. If a complex topic can be decomposed into independent branches, parallelization is attractive. Search regulatory filings while another branch reviews product strategy. Compare market data while another branch reads customer complaints. Good. Nobody is asking an AI agent to rediscover the beauty of waiting.

The paper’s criticism is more specific. Parallel scaling becomes brittle when the branches are not truly independent. One branch may discover that the original framing is wrong. Another may find a source that makes two other searches redundant. A third may reveal that the important question is not “what happened?” but “why did this pattern appear only in one market?”

A static parallel architecture has difficulty using these discoveries because each sub-agent operates inside its assigned task. It may produce useful local findings, but the system has no strong mechanism for converting those findings into a revised global research plan. The final report writer receives fragments. Sometimes good fragments. Sometimes repeated fragments. Sometimes fragments that clearly wanted to talk to each other but were kept in separate rooms for organizational reasons.

Deep Researcher Reflect–Evolve is built around the opposite assumption: research is a feedback loop.

The system starts with a research plan, generates a search query, answers that query, stores the result in a centralized Global Research Context, reflects on the current plan, updates the plan if needed, estimates research progress, and repeats until estimated progress crosses a threshold or the retry budget is exhausted. The final report is then generated in one shot from the full accumulated context.

The simplified loop looks like this:

Research topic
   ↓
Initial research plan
   ↓
Search query generated with global context
   ↓
Candidate Crossover answers the query
   ↓
Global Research Context stores query, answer, and artifacts
   ↓
Planning agent reflects on gaps and redundancy
   ↓
Research plan is updated, or left unchanged
   ↓
Progress is estimated
   ↓  (below threshold: back to query generation)
One-shot report generation

The important object being optimized is not the final prose. It is the research trajectory.
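
The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `run_research`, `answer_query`, `reflect`, and `estimate_progress`, the callback signatures, and the `max_steps` cap are all hypothetical. Only the ordering of the steps and the 90% stopping threshold come from the paper's description, and the toy callbacks stand in for what would be LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ResearchState:
    plan: List[str]  # open sub-questions, in priority order
    context: List[Tuple[str, str]] = field(default_factory=list)  # (query, answer) pairs


def run_research(
    plan: List[str],
    answer_query: Callable[[str, ResearchState], str],
    reflect: Callable[[ResearchState], List[str]],
    estimate_progress: Callable[[ResearchState], float],
    threshold: float = 0.9,  # stop once estimated progress reaches 90%
    max_steps: int = 10,     # hypothetical cap standing in for the retry limit
) -> ResearchState:
    state = ResearchState(plan=list(plan))
    for _ in range(max_steps):
        if not state.plan:
            break
        query = state.plan[0]                  # next query, chosen from the current plan
        answer = answer_query(query, state)    # e.g. web search plus Candidate Crossover
        state.context.append((query, answer))  # write to the shared context
        state.plan = reflect(state)            # the plan may change after every step
        if estimate_progress(state) >= threshold:
            break
    return state


# Toy callbacks (assumptions for illustration only).
def toy_answer(query: str, state: ResearchState) -> str:
    return f"notes on {query}"


def toy_reflect(state: ResearchState) -> List[str]:
    done = {q for q, _ in state.context}
    remaining = [q for q in state.plan if q not in done]
    follow_up = "why only one market?"
    # A discovery in the first answer adds a follow-up question to the plan.
    if "market size" in done and follow_up not in (done | set(remaining)):
        remaining.insert(0, follow_up)
    return remaining


def toy_progress(state: ResearchState) -> float:
    return len(state.context) / 3


final = run_research(["market size", "regulation"], toy_answer, toy_reflect, toy_progress)
print([q for q, _ in final.context])
```

In the toy run, the follow-up question is inserted after the first answer and is researched before "regulation". A static fan-out over the original two sub-questions would never have asked it.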

The paper compares three places to put reflection

The easiest way to read this paper is as a comparison among three orchestration patterns.

Pattern	Where the intelligence sits	Operational advantage	Failure mode
Static parallel research	At the initial decomposition stage	Fast, scalable, easy to schedule	Siloed knowledge and weak mid-course correction
Report-level refinement	After a draft exists	Can improve structure and wording	Too late to fix missing research paths efficiently
Sequential plan reflection	During the research loop	Can redirect search based on discoveries	Potentially slower and harder to control

This is why the headline message needs careful handling. “Sequential beats parallel” is a useful editorial hook, but the architecture itself is not anti-parallel. In fact, one of its two main contributions, Candidate Crossover, uses multiple candidates in parallel for each search query.

So the better distinction is this: parallelism is acceptable when it is local and supervised by shared memory. It becomes risky when it replaces shared memory.

That is the paper’s real architectural idea. Do not let independent agents run off into the forest and return with incompatible souvenirs. Give the system a shared notebook. Then make the planner read the notebook before deciding where to search next.

The Global Research Context is the actual control layer

The paper names the central memory module the Global Research Context. It stores search trajectories, query answers, and contextual artifacts collected from web search. The planning agent uses this memory to avoid redundant searches, detect uncovered areas, and decide whether the plan should change.

This matters because many AI workflow discussions treat memory as a storage convenience. Save previous outputs. Retrieve them later. Add some vector search. Sprinkle “agentic” on top. Done, apparently.

Here, memory is not just retrieval. It is control.

The search agent reads the current research plan and the global context before generating the next query. The planning agent reads the same context when reflecting on progress. The report writer uses the final context for one-shot synthesis. In other words, the memory layer is not a passive archive; it is the shared state that lets the system reason across iterations.
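
To make "memory as control" concrete, here is a small sketch of a shared context that the planner could consult before issuing a query. The class and method names (`GlobalResearchContext`, `is_redundant`, `uncovered`) and the 0.6 overlap cutoff are hypothetical. The word-overlap (Jaccard) check is a deliberately crude stand-in: in the paper, an LLM planning agent reads the context to judge redundancy and gaps, not a token comparison.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ContextEntry:
    query: str
    answer: str
    artifacts: List[str] = field(default_factory=list)  # e.g. source URLs or snippets


@dataclass
class GlobalResearchContext:
    entries: List[ContextEntry] = field(default_factory=list)

    def add(self, query: str, answer: str, artifacts: List[str]) -> None:
        self.entries.append(ContextEntry(query, answer, list(artifacts)))

    @staticmethod
    def _tokens(text: str) -> Set[str]:
        # Lowercased words with trailing punctuation stripped.
        return {w.strip("?.,").lower() for w in text.split() if w.strip("?.,")}

    def is_redundant(self, query: str, min_overlap: float = 0.6) -> bool:
        """Crude stand-in for LLM reflection: Jaccard overlap with earlier queries."""
        new = self._tokens(query)
        for entry in self.entries:
            old = self._tokens(entry.query)
            union = new | old
            if union and len(new & old) / len(union) >= min_overlap:
                return True
        return False

    def uncovered(self, plan: List[str]) -> List[str]:
        """Plan items that no stored query resembles yet."""
        return [item for item in plan if not self.is_redundant(item)]


ctx = GlobalResearchContext()
ctx.add(
    "What is Acme's pricing strategy?",
    "Tiered pricing with enterprise discounts.",
    ["https://example.com/acme-pricing"],  # placeholder URL
)
plan = ["what is acme's pricing strategy", "Which regulators oversee Acme?"]
print(ctx.uncovered(plan))
```

The point of the design is that both the query generator and the planner read the same object, so a rephrased duplicate is caught before it costs a search.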

For enterprise research automation, that is the useful lesson. The hard part is not only getting more documents into the system. The hard part is deciding what the next search should be after the system has learned something.

A market intelligence agent that keeps searching the same competitor because three sub-agents phrased the same question differently is not “thorough.” It is expensive with confidence.

Candidate Crossover is parallelism in a small box

The second contribution is Candidate Crossover. For each search query, the system initializes multiple LLM candidates. In the paper’s evaluation, this is set to $n=3$. The candidates use varied model settings such as temperature and top-k, allowing them to explore different parts of the response space. Each candidate receives the search query and web search artifacts, then generates a concise answer. These answers are merged into a consolidated response that enters the Global Research Context.

The paper uses Tavily for web search, aims to retrieve the top five search results, and filters out search results whose relevance score is below a 30% threshold. These details matter because Candidate Crossover is not just “ask the model three times.” It is a controlled mini-ensemble over a specific search query, with retrieval artifacts and a merging step.
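
A rough sketch of that mini-ensemble follows. It does not call Tavily or any model API: the search results are plain dictionaries with an assumed `score` field, and `generate` and `merge` are hypothetical callbacks standing in for LLM calls. The three sampling settings are invented values. From the paper, the sketch takes only the shape: top five results, a 30% relevance floor, three candidates with varied temperature and top-k, and a merge step.

```python
from typing import Callable, Dict, List

SearchResult = Dict[str, object]  # assumed shape: {"url": str, "content": str, "score": float}
Settings = Dict[str, float]


def filter_results(
    results: List[SearchResult], min_score: float = 0.30, top_n: int = 5
) -> List[SearchResult]:
    """Keep at most top_n results whose relevance score clears the threshold."""
    kept = [r for r in results if float(r["score"]) >= min_score]
    kept.sort(key=lambda r: float(r["score"]), reverse=True)
    return kept[:top_n]


# Hypothetical sampling settings, one per candidate (n = 3).
CANDIDATE_SETTINGS: List[Settings] = [
    {"temperature": 0.2, "top_k": 20},
    {"temperature": 0.7, "top_k": 40},
    {"temperature": 1.0, "top_k": 64},
]


def candidate_crossover(
    query: str,
    results: List[SearchResult],
    generate: Callable[[str, List[SearchResult], Settings], str],
    merge: Callable[[str, List[str]], str],
) -> str:
    artifacts = filter_results(results)
    # Every candidate sees the same query and artifacts; only sampling differs.
    answers = [generate(query, artifacts, s) for s in CANDIDATE_SETTINGS]
    return merge(query, answers)


# Toy stand-ins for the model calls.
def toy_generate(query: str, artifacts: List[SearchResult], settings: Settings) -> str:
    return f"t={settings['temperature']}: {len(artifacts)} sources"


def toy_merge(query: str, answers: List[str]) -> str:
    return " | ".join(answers)


raw = [
    {"url": f"https://example.com/{i}", "content": "...", "score": s}
    for i, s in enumerate([0.91, 0.12, 0.55, 0.30, 0.29, 0.48, 0.77, 0.66])
]
kept_scores = [r["score"] for r in filter_results(raw)]
merged = candidate_crossover("acme pricing", raw, toy_generate, toy_merge)
print(kept_scores)
print(merged)
```

Because the merged answer is a single entry in the shared context, the parallelism stays local: downstream steps see one consolidated response, not three competing ones.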

The authors explicitly distinguish this from the more elaborate self-evolution process in Google’s Test-Time Diffusion Deep Research work. They keep the initial diverse candidate generation and crossover idea, but remove environmental feedback and revision steps to reduce latency and inference-time complexity.

That design choice is practical. It also creates an evidence boundary.

Candidate Crossover sounds like it should improve factual coverage. It probably does in many cases. But the paper does not isolate it with a full ablation showing how much of the final score comes from crossover versus plan reflection versus the base model versus retrieval settings. So, for business interpretation, the right conclusion is not “Candidate Crossover is proven to be the magic component.” The right conclusion is: local parallel sampling may be useful when its outputs are merged into a shared sequential context.

That is less glamorous. It is also more likely to survive procurement.

The benchmark result is competitive, not a coronation

The system is evaluated on DeepResearch Bench, which contains 100 doctoral-level research tasks across 22 fields and two languages. The paper reports performance under the RACE framework, covering comprehensiveness, insight, instruction following, and readability.

The headline number is an overall RACE score of 46.21. That places Deep Researcher Reflect–Evolve above several listed systems, including Claude Research, Nvidia AIQ Research Assistant, Perplexity Research, and Grok Deeper Search in the paper’s comparison table. It is close to OpenAI Deep Research at 46.45, but still below it. It is also below Gemini 2.5 Pro Deep Research at 49.71 and Tavily Research at 52.44.

Model	Overall	Comprehensiveness	Insight	Instruction following	Readability
Tavily Research	52.44	52.84	53.59	51.92	49.21
Gemini 2.5 Pro Deep Research	49.71	49.51	49.45	50.12	50.00
OpenAI Deep Research	46.45	46.46	43.73	49.39	47.22
Deep Researcher Reflect–Evolve	46.21	43.44	45.48	48.99	48.21
Claude Research	45.00	45.34	42.79	47.58	44.66
Nvidia AIQ Research Assistant	40.52	37.98	38.39	44.59	42.63
Perplexity Research	40.46	39.10	35.65	46.11	43.08
Grok Deeper Search	38.22	36.08	30.89	46.59	42.17

The result is strong enough to take the architecture seriously. It is not strong enough to declare the matter settled.

The most interesting pattern is not the overall rank alone. Reflect–Evolve scores 48.21 on readability, higher than OpenAI Deep Research in the table and not far below the two leading systems. Its insight score, 45.48, also exceeds OpenAI Deep Research’s 43.73 in the reported table. But its comprehensiveness score, 43.44, lags behind not only OpenAI, Gemini, and Tavily but also Claude Research (45.34), a system it beats overall.

That combination is revealing. A centralized memory and one-shot synthesis may help the system produce coherent, readable reports with decent analytical depth. But the lower comprehensiveness score suggests the architecture does not automatically guarantee broader coverage than stronger competing systems. Shared memory is not a magic bag of sources. Annoying, but useful to know.
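
The per-dimension picture is easy to verify from the table. The snippet below copies the reported scores and computes where Reflect–Evolve ranks on each dimension, plus its gap to OpenAI Deep Research. The helper names `rank` and `gap` are just illustrative; nothing here goes beyond arithmetic on the published numbers.

```python
DIMENSIONS = ("overall", "comprehensiveness", "insight", "instruction_following", "readability")
OURS = "Deep Researcher Reflect–Evolve"

# Scores copied from the comparison table, in DIMENSIONS order.
SCORES = {
    "Tavily Research": (52.44, 52.84, 53.59, 51.92, 49.21),
    "Gemini 2.5 Pro Deep Research": (49.71, 49.51, 49.45, 50.12, 50.00),
    "OpenAI Deep Research": (46.45, 46.46, 43.73, 49.39, 47.22),
    OURS: (46.21, 43.44, 45.48, 48.99, 48.21),
    "Claude Research": (45.00, 45.34, 42.79, 47.58, 44.66),
    "Nvidia AIQ Research Assistant": (40.52, 37.98, 38.39, 44.59, 42.63),
    "Perplexity Research": (40.46, 39.10, 35.65, 46.11, 43.08),
    "Grok Deeper Search": (38.22, 36.08, 30.89, 46.59, 42.17),
}


def rank(system: str, dimension: str) -> int:
    """1-based rank of a system on one dimension (1 = highest score)."""
    col = DIMENSIONS.index(dimension)
    target = SCORES[system][col]
    return 1 + sum(1 for row in SCORES.values() if row[col] > target)


def gap(system: str, baseline: str, dimension: str) -> float:
    """Score difference (system minus baseline), rounded to two decimals."""
    col = DIMENSIONS.index(dimension)
    return round(SCORES[system][col] - SCORES[baseline][col], 2)


ranks = {d: rank(OURS, d) for d in DIMENSIONS}
gaps = {d: gap(OURS, "OpenAI Deep Research", d) for d in DIMENSIONS}
print(ranks)
print(gaps)
```

Among the eight listed systems, Reflect–Evolve ranks third on insight and readability, fourth overall and on instruction following, and fifth on comprehensiveness. The coherence of the report is ahead of the breadth of the research.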

The evidence mainly supports architecture plausibility

The paper includes several evaluation views: the main leaderboard comparison, language-level comparison, and field-level breakdown across 22 academic disciplines. These should not all be interpreted the same way.

Evidence item	Likely purpose	What it supports	What it does not prove
Main RACE leaderboard table	Main evidence and comparison with prior systems	Reflect–Evolve is competitive with several named deep research agents	That sequential reflection is the sole cause of the score
Four-dimension RACE comparison	Diagnostic breakdown	Strengths and weaknesses differ across comprehensiveness, insight, instruction following, and readability	That all dimensions improved because of the same mechanism
Language comparison	Exploratory or robustness-style check	The system performs across both English and Chinese tasks, with stronger reported Chinese-language performance	That the approach is robust across languages in general
Field-level results across 22 disciplines	Robustness/sensitivity-style breakdown	Performance is not shown only on one narrow topic category	That the system is equally reliable in every domain
Candidate Crossover design details	Implementation detail and mechanism proposal	The system uses local candidate diversity before storing answers in global context	The independent marginal value of crossover

This distinction is important because AI papers often invite a little interpretive inflation. A benchmark table becomes a universal architecture law. A field breakdown becomes proof of generality. A design description becomes a causal result. The paper is valuable enough without that extra decoration.

The direct evidence shows that this particular implementation, powered by Gemini 2.5 Pro and evaluated on DeepResearch Bench, achieves a competitive RACE score. The mechanism proposed by the authors is plausible: shared context plus plan reflection should reduce redundancy and improve adaptation. But the paper does not provide the kind of component-by-component ablation that would let us quantify exactly how much each module contributes.

So the correct reading is not “parallel is dead.” The correct reading is “parallel fan-out without shared reflective state is a weak default for research tasks where discoveries should change the plan.”

That is a much better sentence. Less viral, more deployable. Tragic, really.

The business value is fewer wasted searches, not more impressive agent diagrams

For business users, the paper maps cleanly onto several workflows: market intelligence, competitor tracking, investment due diligence, regulatory monitoring, supplier research, technology scouting, and internal knowledge synthesis.

These tasks have a common shape. The first search often changes the second question. The second question may reveal that the original taxonomy was wrong. A good analyst does not simply execute the original checklist. A good analyst updates the checklist.

This is where sequential plan reflection becomes operationally meaningful.

Business workflow	What usually goes wrong	What the paper’s design suggests
Market intelligence	Repeated summaries of the same public sources	Store search trajectories and force the planner to avoid redundancy
Due diligence	Early findings do not change later investigation	Add explicit reflection checkpoints before continuing
Regulatory monitoring	Reports become collections of jurisdictional fragments	Use global context to compare overlaps and conflicts across regions
Technical scouting	Search branches miss dependencies between concepts	Let discoveries update the research plan, not just the final report
Executive briefing	Final synthesis is coherent but evidence coverage is uneven	Separate research progress estimation from writing quality

The ROI pathway is therefore not “buy a larger agent swarm.” It is more boring and more useful: reduce duplicated search, improve gap detection, preserve discovered facts, and make each subsequent query more informed than the last.

This also changes how teams should evaluate research agents. A demo report can look polished while hiding a poor research trajectory. The more revealing questions are:

Did the agent remember what it had already searched?
Did it update the plan when new evidence appeared?
Did it stop because coverage was sufficient, or because a fixed step count ran out?
Did the final report cite facts gathered throughout the process, or only the most recent branch?
Can the system show its search trajectory, not just its final answer?

The last question is especially important. In business settings, explainability is not only about model internals. It is also about workflow traceability. A research agent should be able to show why it looked where it looked.
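
Several of those questions can be answered mechanically if the agent logs its trajectory. The sketch below assumes a hypothetical trace format (`Step`, `Trace`, and the `stop_reason` labels are invented for illustration) and runs a few checks over it. It says nothing about how the paper's system records its own steps.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    query: str
    plan_changed: bool  # did reflection revise the plan after this step?
    progress: float     # estimated progress after this step, 0.0 to 1.0


@dataclass
class Trace:
    steps: List[Step]
    stop_reason: str  # hypothetical labels: "threshold" or "max_steps"


def audit(trace: Trace) -> Dict[str, object]:
    """Summarize a research trajectory for review."""
    seen = set()
    repeats: List[str] = []
    for step in trace.steps:
        key = step.query.strip().lower()
        if key in seen:
            repeats.append(key)  # exact repeat, ignoring case and outer whitespace
        seen.add(key)
    return {
        "steps": len(trace.steps),
        "repeated_queries": repeats,
        "plan_revisions": sum(1 for s in trace.steps if s.plan_changed),
        "stopped_on_coverage": trace.stop_reason == "threshold",
    }


trace = Trace(
    steps=[
        Step("Acme market share", plan_changed=False, progress=0.3),
        Step("acme market share ", plan_changed=False, progress=0.35),
        Step("Acme regulatory filings", plan_changed=True, progress=0.6),
    ],
    stop_reason="max_steps",
)
report = audit(trace)
print(report)
```

A trace like this one, with a repeated query and a stop caused by the step cap, is a warning sign no matter how polished the final report looks.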

The 90% progress threshold is useful, but also fragile

One detail deserves more attention than it usually gets: the system uses an LLM-as-a-judge to estimate research progress, and the research loop stops when progress crosses a 90% threshold or when maximum retries are exhausted.

This is sensible. Without a stopping rule, sequential research can become endless research. Anyone who has opened “just one more source” at 1:17 a.m. understands the problem intimately.

But progress scoring is also fragile. What does 90% coverage mean for a moving research plan? If the plan is incomplete, a high progress score may simply mean the agent has completed a weak plan. If the judge overestimates coverage, the final report may become readable but shallow. If the judge underestimates coverage, the system may waste cost on redundant exploration.

This does not undermine the architecture. It points to an implementation requirement. For enterprise use, progress scoring should not be a decorative number. It should be tied to visible criteria: required subtopics, source diversity, recency, jurisdictional coverage, stakeholder questions, and unresolved contradictions.

Otherwise, “90% complete” becomes the AI version of “almost done.” A phrase with a proud history of lying.
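
One way to keep the progress number honest is to derive it from an explicit checklist instead of a single opaque judgment. The sketch below is an assumption about how an enterprise team might do this, not a description of the paper's LLM-as-a-judge scoring. The criterion names mirror the list above, and the equal weighting is arbitrary.

```python
from typing import Dict, List


def progress_score(criteria: Dict[str, bool]) -> float:
    """Share of criteria currently satisfied (equal weights, by assumption)."""
    if not criteria:
        return 0.0
    return sum(1 for ok in criteria.values() if ok) / len(criteria)


def unmet(criteria: Dict[str, bool]) -> List[str]:
    """Names of criteria that still block completion."""
    return [name for name, ok in criteria.items() if not ok]


def should_stop(criteria: Dict[str, bool], threshold: float = 0.9) -> bool:
    return progress_score(criteria) >= threshold


# In practice each flag would be set by a judge or a rule; here they are hard-coded.
checklist = {
    "required_subtopics_covered": True,
    "source_diversity": True,
    "recency": True,
    "jurisdictional_coverage": True,
    "stakeholder_questions_answered": True,
    "contradictions_resolved": False,
}

score = progress_score(checklist)
print(f"{score:.2f}", should_stop(checklist), unmet(checklist))
```

With five of six criteria met, the score stays below the 90% threshold and the loop continues. More usefully, the system can say which criterion is holding it back.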

Where this architecture should and should not be used

Sequential plan reflection is most attractive when the research path is uncertain and the cost of missing an important branch is high. It is less attractive when the task is simple, well-decomposed, or latency-sensitive.

Use this pattern when:

the topic is broad and under-specified;
early findings should change later searches;
redundancy is expensive;
the final report needs cross-branch synthesis;
users care about the research trail, not only the final prose.

Be more cautious when:

the question has a fixed schema;
sub-tasks are genuinely independent;
the user needs a fast first-pass answer;
retrieval quality dominates reasoning quality;
the cost of repeated LLM calls is tightly constrained.

The paper does not provide a full cost and latency analysis. It also does not prove robustness across all business domains. The benchmark contains broad academic-style tasks, which are useful for stress-testing synthesis, but not identical to enterprise workflows with private data, compliance requirements, internal taxonomies, and access controls.

That boundary matters. A research agent for public doctoral-level tasks and a research agent for a bank’s internal risk committee may share architecture patterns, but they do not share deployment constraints. The former can search widely. The latter must remember permission boundaries, audit logs, source lineage, and policy rules. The shared notebook needs a lock on the drawer.

The practical lesson: orchestrate curiosity before you automate synthesis

The paper’s strongest contribution is not that it wins every leaderboard row. It does not. Nor is it that parallelism is obsolete. It is not; the system itself uses local parallel candidate generation.

The stronger lesson is architectural: research agents should not be treated as report generators with search attached. They should be treated as adaptive investigation systems. The plan, the memory, the search trajectory, and the stopping rule are part of the intelligence.

That is the difference between an agent that produces a polished document and an agent that conducts something closer to research.

For Cognaptus readers, the business takeaway is straightforward. Before adding more agents, ask whether the system has a shared research context. Before adding more searches, ask whether the next query is informed by the previous ones. Before celebrating a beautiful final report, ask whether the research plan learned anything along the way.

Sequential does not beat parallel because waiting is noble. Sequential beats naive parallelism when reflection changes what should happen next.

That is the useful version of the claim. Less macho. More correct. Naturally, it will be harder to put on a slide.

Cognaptus: Automate the Present, Incubate the Future.

¹ Saurav Prateek, Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve), arXiv:2601.20843, January 2026, https://arxiv.org/abs/2601.20843.
