TL;DR for operators
DEoT is not “a smarter chatbot”. It is a structured analysis workflow for questions where there is no single correct answer: policy impact, market entry, geopolitical risk, crisis response, investment implications, technology disruption, and the usual executive swamp where every answer arrives with a footnote and a headache.
The paper’s useful idea is simple: open-ended analysis needs two motions. First, go wide enough not to miss important dimensions. Then go deep enough not to produce a shallow consultant-flavoured smoothie. DEoT formalises this through a Breadth Engine, a Depth Engine, and an Engine Controller that decides when to branch, when to drill, and when to stop.
The paper reports strong results on its N2Q benchmark of 500 news-derived open-ended questions across biomedicine, economics, geopolitics, industry, and technology. DEoT achieves an overall win rate of 85.5% against GPT-4o and 77.2% against Perplexity AI, with particularly strong results in analytical depth and innovation.[1]
The awkward bit, and therefore the business-relevant bit, is that DEoT’s weakest scores are in practicality and specific arguments. Against GPT-4o, its overall practicality figure is 22.9 and its specific-arguments figure is 46.3. Against Perplexity AI, practicality is 20.6 and specific arguments collapse to 3.5. That is not a rounding error. That is the paper politely telling us: “Excellent at structured exploration; still not great at evidence-grounded execution.”
For operators, the lesson is not “deploy DEoT and fire the analysts”. The lesson is better: design AI analysis systems as controlled exploration loops. Use breadth to map the terrain, depth to investigate priority branches, retrieval to ground claims, validation to stop obvious nonsense from travelling downstream, and human review to convert possible implications into executable decisions. Annoyingly, the human is still not optional.
The real problem is not bad answers; it is flat analysis
Most executive questions do not look like exam questions. They look like this:
- Should we enter this market?
- What happens if this regulation passes?
- Which risks matter if supply chains shift?
- How should we interpret a competitor’s new product launch?
- What are the second-order effects of a geopolitical event?
A normal language model can produce an answer to each. The problem is not silence. The problem is false completeness. The answer has structure, paragraphs, perhaps even bullet points wearing a tie, but the reasoning process is usually hidden and flat. It may mention regulation, competitors, customers, costs, and risk. It may not show whether those branches were systematically explored, which branches were abandoned, what evidence supported each, or why one issue deserved deeper investigation.
DEoT, short for Dual Engines of Thoughts, is designed for that kind of open-ended analytical problem. The framework’s central premise is that open-ended questions require a dynamic balance between horizontal exploration and vertical investigation. This is why the paper compares DEoT with Chain-of-Thought, Tree-of-Thoughts, and Graph-of-Thoughts, but the more revealing analogy is not a chain or tree. It is a mind map with a supervisor.
The mind-map analogy matters because business analysis is rarely a straight line. An analyst starts with a core question, radiates into political, economic, technical, operational, and social branches, then chooses a few branches for deeper work. DEoT tries to mechanise that rhythm.
That is the mechanism-first story here. The reported win rates are interesting, but they are not the actual operating lesson. The actual lesson is that reasoning quality can improve when the system does not merely generate content, but manages analytical direction.
DEoT is a workflow, not a magic model wearing two hats
A reader could easily misunderstand the paper as saying “two engines are better than one model”. That is too crude. DEoT is not a new foundation model. It is an orchestration framework that uses several components to turn a vague analytical question into a layered exploration process.
The architecture has three main parts.
| Component | What it does | Operational translation |
|---|---|---|
| Base Prompter | Refines the raw user question by clarifying ambiguity, adding temporal context, standardising entities, and correcting failed query attempts. | Do not let vague executive prompts enter the system untreated. “How is TSLA doing?” is not a question; it is a cry for structure. |
| Solver Agent | Decomposes the query into one to three tasks, selects tools, executes tasks in dependency order, and validates results. | Turn the analysis into a small workflow before producing prose. |
| Dual-Engine System | Uses an Engine Controller to choose between breadth expansion and depth investigation, then iterates through layered query-answer nodes. | Decide when to scan the map and when to zoom into a branch. |
The Solver Agent is important because it prevents the dual-engine idea from becoming decorative. It has a Planner, Toolbox, Executor, and validation mechanism. The Planner decomposes the problem into a few essential tasks. The Toolbox includes news search, event extraction, historical analysis, information search, and direct reasoning. The Executor runs tasks in order and combines results. The validation layer checks factual accuracy and consistency.
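The plan-then-execute pattern described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `run_plan`, the task dictionary layout, and the `validate` hook are hypothetical names, and the tools are assumed to be plain callables. The only details taken from the text are the one-to-three task limit, execution in dependency order, and a validation check on results.

```python
from graphlib import TopologicalSorter

MAX_TASKS = 3  # the Planner is described as emitting one to three tasks


def run_plan(tasks, tools, validate=lambda output: output is not None):
    """Run planned tasks in dependency order.

    tasks: {task_name: {"tool": tool_name, "needs": [task_name, ...]}}  (hypothetical layout)
    tools: {tool_name: callable taking a dict of upstream results}
    """
    if not 1 <= len(tasks) <= MAX_TASKS:
        raise ValueError("planner must emit between one and three tasks")

    # TopologicalSorter takes {node: predecessors} and yields predecessors first.
    order = TopologicalSorter({name: spec["needs"] for name, spec in tasks.items()})

    results = {}
    for name in order.static_order():
        spec = tasks[name]
        inputs = {dep: results[dep] for dep in spec["needs"]}
        output = tools[spec["tool"]](inputs)
        # Stand-in for the validation layer: reject outputs that fail the check.
        if not validate(output):
            raise ValueError(f"task {name!r} failed validation")
        results[name] = output
    return results
```

In a real system each tool would wrap a search call or a model call, and the validator would be considerably less trusting than `output is not None`.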
Then the Dual-Engine System takes over the analytical trajectory.
The Breadth Engine identifies multiple distinct aspects of the problem. In the paper’s prompt design, it generates impact aspects, categories, reasons, follow-up queries, and priorities. The Depth Engine does the opposite: it generates one focused follow-up question for deeper analysis of a specific issue. The Engine Controller decides which engine should operate next, considering content complexity, current layer, maximum depth, unresolved points, and diminishing returns.
This makes DEoT less like a chatbot and more like a constrained analyst loop:
- Clean the question.
- Plan a few tasks.
- Retrieve or reason with tools.
- Validate intermediate outputs.
- Branch outward across relevant dimensions.
- Drill downward into selected questions.
- Repeat until depth, breadth, or usefulness limits are reached.
- Synthesise the node summaries into a final report.
That is the design contribution. Not “AI thinks like humans now”, which is the kind of phrase that should be escorted out of the room. The more precise claim is that DEoT imitates one useful analyst behaviour: alternating between exploration and focus.
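The loop listed above can be sketched as a small tree walk. Everything here is an assumption for illustration: in the paper the Engine Controller is prompt-driven and weighs content complexity, current layer, maximum depth, unresolved points, and diminishing returns, whereas `choose_engine` below is a crude rule-based stand-in. The `answer`, `breadth`, and `depth` callables are hypothetical hooks for whatever model or tool calls a real system would make.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    query: str
    layer: int
    summary: str = ""
    children: list = field(default_factory=list)


def choose_engine(node, max_depth, unresolved):
    """Rule-based stand-in for the Engine Controller (an assumption, not the paper's logic)."""
    if node.layer >= max_depth:
        return "stop"      # depth limit reached
    if node.layer == 0:
        return "breadth"   # go wide on the root question first
    if unresolved:
        return "depth"     # drill only where something is still open
    return "stop"          # nothing unresolved: treat as diminishing returns


def explore(root_query, answer, breadth, depth, max_depth=3):
    """answer(q) -> (summary, unresolved); breadth(q, summary) -> [queries]; depth(q, unresolved) -> query."""
    root = Node(root_query, layer=0)
    frontier = [root]
    visited = []
    while frontier:
        node = frontier.pop(0)
        node.summary, unresolved = answer(node.query)
        visited.append(node)
        engine = choose_engine(node, max_depth, unresolved)
        if engine == "breadth":
            follow_ups = breadth(node.query, node.summary)
        elif engine == "depth":
            follow_ups = [depth(node.query, unresolved)]
        else:
            follow_ups = []
        for follow_up in follow_ups:
            child = Node(follow_up, node.layer + 1)
            node.children.append(child)
            frontier.append(child)
    return root, visited
```

The returned `visited` list is the raw material for the final synthesis step, and the tree under `root` is the branch map a reviewer can inspect.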
The N2Q benchmark tests open-ended news analysis, not universal intelligence
The paper also introduces N2Q, a benchmark of 500 questions derived from recent news. It covers five domains: biomedicine, economics, geopolitics, industry, and technology, with 100 questions per domain.
The construction process is worth reading carefully because it defines what the results mean. The authors collect recent news articles from sources such as Reuters, BBC, and Financial Times. They use topic modelling and similarity filtering to maintain diversity. GPT-4o then generates a follow-up question from each article, combining a news summary with an open-ended analytical prompt. Human verification is used to check that the questions are challenging and relevant, though the paper gives limited detail on that verification process.
This makes N2Q a useful benchmark for the kind of work many organisations actually ask AI systems to perform: “Given this event, what might happen next?” It is closer to policy briefs, market analysis, risk notes, and strategy memos than to standard fact-recall benchmarks.
But it also narrows interpretation. N2Q is not proof that DEoT is generally superior across all reasoning tasks. It tests news-derived open-ended analysis under a specific evaluation setup. The questions are designed to reward broad implications, causal reasoning, historical comparison, and strategic recommendations. Conveniently, that is exactly where a breadth-depth system should shine. This is not a flaw; it is alignment between benchmark and intended use. It just means operators should not overread the result.
The benchmark asks: can an AI system produce richer open-ended analysis across current-event domains?
It does not ask: can the system make expert decisions, verify every factual claim independently, estimate implementation cost, or survive contact with a CFO who has seen three “strategic transformation” decks before breakfast?
The main evidence says DEoT is strong at exploration
The paper compares DEoT with GPT-4o and Perplexity AI using GPT-4o as the evaluation judge. The outputs are assessed across five criteria: analytical depth, specific arguments, innovation, practicality, and logical coherence. The authors use a dual comparison procedure to reduce order bias: responses are evaluated in both model orders, and a winner is selected for each criterion.
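A both-orders comparison of this kind can be sketched as follows. The `judge` callable and its return values are hypothetical, and the tie rule is an assumption: the article does not say how the paper resolves cases where the two orderings disagree, so this sketch simply declines to award a win.

```python
CRITERIA = (
    "analytical depth",
    "specific arguments",
    "innovation",
    "practicality",
    "logical coherence",
)


def dual_compare(judge, question, ours, baseline):
    """judge(question, first, second, criterion) must return "first" or "second"."""
    verdicts = {}
    for criterion in CRITERIA:
        forward = judge(question, ours, baseline, criterion)   # our answer shown first
        backward = judge(question, baseline, ours, criterion)  # our answer shown second
        ours_wins_forward = forward == "first"
        ours_wins_backward = backward == "second"
        if ours_wins_forward and ours_wins_backward:
            verdicts[criterion] = "ours"
        elif not ours_wins_forward and not ours_wins_backward:
            verdicts[criterion] = "baseline"
        else:
            # Assumption: disagreement across orders is treated as position bias, so no winner.
            verdicts[criterion] = "tie"
    return verdicts


def win_rate(all_verdicts, criterion):
    """Percentage of questions where our answer won the given criterion."""
    wins = sum(v[criterion] == "ours" for v in all_verdicts)
    return 100.0 * wins / len(all_verdicts)
```

A judge that always prefers whichever answer it sees first produces nothing but ties under this scheme, which is the point of running both orders.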
The main evidence is in the comparison tables. The qualitative figures are illustrative. The appendices mainly provide implementation details, especially prompts. There is no ablation study showing exactly how much each component contributes. That absence matters.
Against GPT-4o, DEoT reports:
| Criterion | DEoT win rate, overall (%) |
|---|---|
| Total win rate | 85.5 |
| Analytical depth | 92.5 |
| Innovation | 84.0 |
| Logical coherence | 70.6 |
| Specific arguments | 46.3 |
| Practicality | 22.9 |
Against Perplexity AI, DEoT reports:
| Criterion | DEoT win rate, overall (%) |
|---|---|
| Total win rate | 77.2 |
| Analytical depth | 85.7 |
| Innovation | 91.9 |
| Logical coherence | 74.5 |
| Specific arguments | 3.5 |
| Practicality | 20.6 |
The high-level result is clear: DEoT wins often overall, and it wins especially often on analytical depth and innovation. This is exactly what the architecture predicts. Breadth expansion increases the range of considered dimensions. Depth investigation encourages layered follow-up. A final response agent then synthesises the resulting node summaries. If the judge rewards multi-layered reasoning, causal relationships, and novel angles, DEoT has a structural advantage.
The more interesting result is where DEoT does not win.
Its practicality scores are low in both comparisons. This means the system’s outputs may be analytically rich but less immediately implementable. That is familiar to anyone who has read a strategy memo that identifies seven interdependencies, four stakeholder groups, and absolutely no next Monday morning action. Very impressive. Mildly useless.
The specific-arguments result is even more revealing. Against GPT-4o, DEoT’s overall score is 46.3. Against Perplexity AI, it drops to 3.5. The paper attributes this weakness against Perplexity to Perplexity’s real-time retrieval advantage. That explanation is plausible. A system designed to explore implications may still lose badly to a retrieval-heavy system when the criterion rewards concrete data, examples, and evidence.
For business use, this is the difference between “good thinking structure” and “good substantiation”. DEoT appears better at organising the search space. It is not automatically better at proving its claims.
The qualitative example shows integration, not courtroom evidence
The paper includes a qualitative comparison around a geopolitical question involving President Biden, Israel’s Netanyahu, ceasefire negotiations in Qatar, and broader Middle East dynamics. DEoT is described as producing a more structured and multifaceted analysis, combining historical context, alliance dynamics, conflict resolution, real-time developments, and cross-category conclusions.
This example helps explain the mechanism. DEoT can pull historical parallels into the same frame as current developments and policy implications. GPT-4o may cover multiple points without tying them into a larger analytical structure. Perplexity may be concise and current, but less deep in historical or causal synthesis.
That is useful, but it should be classified properly. This is not the main quantitative evidence. It is a comparison example, useful for making the architecture legible. It shows the kind of output DEoT is optimised to produce: integrated, layered, and multi-perspective.
It does not prove that DEoT is more accurate on geopolitics. Nor does it prove that its recommendations are better. It shows a difference in response shape.
In open-ended business analysis, response shape matters. A structured memo can expose trade-offs better than a flat answer. But response shape is not truth. A beautifully organised falsehood is still false, only now with headings.
The appendices are implementation breadcrumbs, not a second thesis
The appendices are unusually useful because they expose the prompt-level machinery behind the framework. They show how the Base Prompter optimises inputs, how the Planner decomposes tasks, how validation is prompted, how the Breadth and Depth Engines generate follow-up questions, and how the final response is assembled.
This matters because DEoT is not only a concept. It is a practical prompt-and-agent architecture. The prompts reveal several design choices that operators can borrow without copying the entire system.
For example, the Planner is constrained to generate only one to three tasks. That is a sensible limit. Many agent systems collapse not because they lack ambition, but because they have far too much of it. A planner that spawns twelve subtasks for a normal executive query is not “agentic”; it is a meeting invitation in software form.
The news search tool is constrained to retrieve a limited number of articles. The event extractor focuses on what happened, who was involved, when and where it occurred, consequences, and numerical data. The historical analyzer asks for two to three parallels with market, political, and business implications. The validator is instructed to check factual accuracy rather than general quality.
These are implementation details, not robustness tests. They do not prove that each module is necessary. They do, however, show the operational philosophy: constrain each stage, define outputs explicitly, and keep intermediate reasoning products structured enough for later synthesis.
That is the part many enterprise AI deployments miss. They buy a strong model, add a system prompt, connect a search tool, and then wonder why the output behaves like a clever intern with unlimited coffee and no manager. DEoT’s contribution is managerial. It gives the reasoning process a supervisor.
What the paper directly shows, and what business should infer
The paper directly shows that, on the N2Q benchmark and under GPT-4o-as-judge evaluation, DEoT beats GPT-4o and Perplexity AI on overall win rate, analytical depth, innovation, and logical coherence. It also shows that DEoT struggles on practicality and evidence-specific argumentation, especially against Perplexity AI.
Cognaptus infers a narrower but more useful business lesson: open-ended AI systems should be designed as exploration-control systems, not just answer-generation systems. The valuable object is not the final paragraph. It is the workflow that produced the paragraph.
| Paper result | Business interpretation | Boundary |
|---|---|---|
| Strong analytical-depth performance | Use breadth-depth loops for strategy, policy, risk, and market questions where first-order answers are insufficient. | Depth may produce elaboration without better evidence. |
| Strong innovation performance | Structured branching can surface non-obvious angles and second-order effects. | Novelty is not usefulness. Some “creative” outputs will deserve a quiet burial. |
| Better logical coherence than baselines | Intermediate planning and synthesis can reduce fragmented analysis. | Coherence does not guarantee factual accuracy. |
| Weak practicality | Add feasibility, cost, compliance, resource, and implementation modules before using outputs operationally. | Without these modules, DEoT is a briefing assistant, not an execution planner. |
| Weak specific arguments versus Perplexity | Retrieval and evidence grounding must be upgraded if the task depends on factual support. | A better reasoning scaffold cannot compensate for weak sources. |
The cleanest business use case is decision preparation. DEoT-like systems can help analysts generate issue maps, identify missing questions, compare causal pathways, and prepare structured memos. They are useful before the decision meeting, not as a replacement for the decision meeting.
For finance teams, this could mean mapping market reactions to policy changes and then drilling into the most exposed sectors. For logistics firms, it could mean scanning regulatory, fuel, port, insurance, and geopolitical effects before investigating one cost driver. For government or public-sector clients, it could mean mapping stakeholder effects and implementation risks before drafting policy options. For technology strategy, it could mean separating infrastructure, regulation, user adoption, vendor concentration, and capability maturity before pretending “AI transformation” is a plan. We have all suffered enough.
The missing ablation is the uncomfortable silence
The biggest methodological limitation is not that the paper uses an LLM judge. That is important, yes, but expected in open-ended evaluation. The bigger interpretive gap is the lack of ablation.
We do not see a test of DEoT without the Breadth Engine. We do not see DEoT without the Depth Engine. We do not see the framework without validation. We do not see whether the Engine Controller is materially better than a fixed breadth-then-depth schedule. We do not see whether gains come mostly from using multiple tool calls, richer prompts, stronger retrieval, longer outputs, or actual engine coordination.
This does not invalidate the result. It limits the mechanism claim. The architecture is plausible, and the results are aligned with the architecture, but the paper does not isolate which component does the heavy lifting.
For operators, this matters because implementation budgets are finite. If most of the gain comes from better query refinement and retrieval, then building an elaborate dual-engine controller may be theatre. If most of the gain comes from engine switching, then the controller is the core asset. The paper gives us a strong design direction, not a procurement specification.
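An operator who wants the missing evidence can generate the ablation configurations directly. The sketch below is hypothetical: the component names are labels chosen for this example, not flags from the paper, and the paper reports no such runs. It only shows how cheaply the set of variants to test can be enumerated.

```python
from itertools import product

# Hypothetical switch names for the components discussed above.
COMPONENTS = ("breadth_engine", "depth_engine", "validation", "adaptive_controller")


def ablation_grid(components=COMPONENTS):
    """Yield every on/off combination of the named components."""
    for flags in product((True, False), repeat=len(components)):
        yield dict(zip(components, flags))


def leave_one_out(components=COMPONENTS):
    """Return the full system followed by one variant per disabled component."""
    full = dict.fromkeys(components, True)
    return [full] + [{**full, name: False} for name in components]
```

Leave-one-out gives five runs for four components, which is usually the affordable starting point; the full grid of sixteen is for when the budget conversation has gone unusually well.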
Another boundary is evaluation. GPT-4o acts as judge even though it is also one of the compared systems and a component used within DEoT. The dual comparison procedure helps reduce order bias, but it does not fully replace expert evaluation. In domains such as biomedicine, finance, law, or geopolitics, expert review would be needed before treating the outputs as decision-grade.
Finally, N2Q is derived from recent news and generated partly with GPT-4o. That makes it relevant to real-world analytical work, but it also means the benchmark is close to the kind of broad synthetic reasoning that frontier LLMs already handle well. More demanding tests would include expert-scored recommendations, evidence audits, costed implementation plans, and longitudinal checks against what actually happened.
The operating model: breadth first, depth where it pays, evidence always
A practical DEoT-inspired system should not simply copy the paper and declare victory. It should translate the architecture into controls.
First, require query refinement. Many business prompts are under-specified because executives communicate in compressed anxiety. The system should clarify entities, timeframes, geographies, decision criteria, and intended output before analysis begins.
Second, separate exploration from investigation. A breadth pass should identify dimensions and assign priorities. A depth pass should investigate the branches with the highest decision value. The system should record why a branch was prioritised.
Third, attach evidence to claims. DEoT’s specific-argument weakness is the flashing warning light. Retrieval should not be an optional accessory; it should be built into any claim that depends on current facts, numerical estimates, regulations, or market conditions.
Fourth, add practicality filters. If the output includes recommendations, the system should check resources, cost, timing, organisational constraints, compliance risks, and likely implementation blockers. Otherwise the system will generate sophisticated advice that dies on contact with procurement.
Fifth, preserve intermediate nodes. The final answer should not be the only artifact. For serious work, the user should be able to inspect the branch map: what was explored, what was ignored, what was validated, and where uncertainty remains.
This is where the business value lives. Not in a prettier final memo, but in an inspectable analytical process.
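The third and fifth controls above, evidence attached to claims and preserved intermediate nodes, reduce to a small data structure. This is a minimal sketch under stated assumptions: `Claim`, `BranchNode`, the status labels, and both helper functions are hypothetical names for illustration, not anything defined in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    sources: list = field(default_factory=list)  # e.g. URLs or document ids


@dataclass
class BranchNode:
    query: str
    status: str = "explored"      # assumed labels: "explored", "skipped", "validated"
    priority_reason: str = ""     # why this branch was prioritised
    claims: list = field(default_factory=list)
    children: list = field(default_factory=list)


def ungrounded_claims(node):
    """Return (query, claim text) for every claim with no source, depth-first."""
    found = [(node.query, claim.text) for claim in node.claims if not claim.sources]
    for child in node.children:
        found.extend(ungrounded_claims(child))
    return found


def render(node, indent=0):
    """Return the branch map as indented text lines for human review."""
    lines = [f"{'  ' * indent}[{node.status}] {node.query}"]
    for child in node.children:
        lines.extend(render(child, indent + 1))
    return lines
```

Running `ungrounded_claims` before synthesis gives the reviewer a list of assertions that still need a source, which is a more useful artefact than a confident paragraph.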
Conclusion: the future analyst is part cartographer, part miner
DEoT’s best contribution is not that it beats GPT-4o or Perplexity AI in a benchmark. Benchmarks are useful, but they are not strategy. The more durable contribution is the architecture: open-ended analysis improves when a system explicitly manages when to go wide and when to go deep.
That is a serious lesson for enterprise AI. Many organisations are still trying to automate answers. The better target is automating analytical scaffolding: issue discovery, branch prioritisation, evidence gathering, validation, and synthesis. The human analyst then spends less time building the map from scratch and more time judging which route is worth taking.
The paper’s own results keep the claim honest. DEoT is strong at analytical depth and innovation. It is weaker at practicality and evidence-backed specificity. So the correct deployment posture is not “AI analyst in a box”. It is “AI research scaffold with adult supervision”.
Two heads may be better than one. But only if one of them remembers to check the facts, price the recommendation, and ask whether the thing can actually be done before everyone starts admiring the mind map.
Cognaptus: Automate the Present, Incubate the Future.
[1] Fei-Hsuan Yu, Yun-Cheng Chou, and Teng-Ruei Chen, “Dual Engines of Thoughts: A Depth-Breadth Integration Framework for Open-ended Analysis,” arXiv:2504.07872, 2025. https://arxiv.org/abs/2504.07872