Hunt Globally, Miss Nothing: Why Tree-Based AI Agents Beat ‘Run-It-Longer’ Research

Deals are not usually lost because nobody wrote a beautiful market summary.

They are lost because the right asset sat in a regional announcement, under a local-language alias, attached to a company page, trial registry, conference PDF, or corporate filing that nobody searched properly. Then, six months later, the same asset appears in a large-pharma partnership press release, and everyone acts surprised. The surprise is often very well-formatted. That does not make it useful.

This is the practical problem behind the paper Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence.¹ The paper is not mainly about making research agents more eloquent. It is about making them harder to fool by omission.

The distinction matters. Most “Deep Research” tools are impressive at producing coherent reports from retrieved evidence. But asset scouting is not a report-writing task. It is a coverage task. The question is not “Can the system explain what it found?” The question is “Did it find the assets that matter before your competitor did?”

The paper’s answer is that “run it longer” is not enough. A long sequential browse can exhaust obvious sources and then spend additional time walking in circles. Bioptic Agent, the system proposed in the paper, tries a different design: treat research as a structured search tree, preserve discovered candidates and evidence as persistent artifacts, validate candidates against hard criteria, deduplicate aliases, and allocate future search effort toward branches that are still producing new valid assets.

That may sound less glamorous than “an AI analyst that thinks deeply.” It is also closer to how serious scouting actually fails.

The real bottleneck is recall, not prose quality

A normal due-diligence report rewards clarity, synthesis, and citation discipline. Those are useful, but they are downstream qualities. If the system never finds a China-origin biologic disclosed in Chinese sources, no amount of elegant English summary will repair the gap.

The paper frames this as a shift from depth-first research to completeness-first search. In drug asset scouting, the target output is often a set: all qualifying assets that satisfy a multi-constraint query. The query may combine modality, mechanism, indication, stage, geography, ownership, trial signals, and competitive-position constraints. A single missing item can matter commercially.

That is why the paper spends so much energy on benchmark construction. A weak benchmark would ask a model to find assets that are already easy to search. That would mostly reward English-language visibility and familiar names. A better benchmark must make missing things measurable.

The authors therefore build the benchmark backward from validated regional program records. Instead of starting with a user query and asking agents to find whatever they can, they first mine regional drug assets from local-language sources, enrich and validate those assets, then generate investor-style queries for which those assets are correct answers. Direct identifiers such as drug names, codes, trial IDs, unique URLs, and rare aliases are forbidden in the generated query. This prevents the benchmark from becoming a disguised string-matching exercise.
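The no-identifier rule is easy to state and easy to violate by accident. Here is a minimal sketch of a leak check, assuming each gold asset carries a list of its known identifiers. The function name, the normalization rule, and the sample identifiers (`ABC-123`, `NCT00000000`, `examplimab`) are hypothetical and are not taken from the paper.

```python
import re


def leaked_identifiers(query: str, identifiers: list[str]) -> list[str]:
    """Return the forbidden identifiers that appear in the query text."""
    # Strip whitespace, hyphens, and underscores so "abc 123" matches "ABC-123".
    normalized = re.sub(r"[\s\-_]+", "", query).casefold()
    leaks = []
    for ident in identifiers:
        needle = re.sub(r"[\s\-_]+", "", ident).casefold()
        if needle and needle in normalized:
            leaks.append(ident)
    return leaks


# Hypothetical identifiers for one gold asset (illustration only).
forbidden = ["ABC-123", "NCT00000000", "examplimab"]
clean_query = "Phase 2 bispecific antibodies for gastric cancer from China-based developers"
leaky_query = "Any update on abc 123 in gastric cancer?"

print(leaked_identifiers(clean_query, forbidden))
print(leaked_identifiers(leaky_query, forbidden))
```

Substring matching after stripping separators is deliberately blunt. It can flag innocent text, which for a benchmark filter is the cheaper mistake.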

That design choice is the first important contribution. It changes the evaluation target from “Can the model answer a known question?” to “Can the system recover a hard-to-surface asset when the query describes the class of opportunity rather than naming the asset?”

That is closer to actual BD and investor work. Nobody starts a serious scouting project by saying, “Please find this exact drug code I already know.” They ask for assets matching a strategic thesis. The painful part is discovering what qualifies.

The benchmark is built to punish English-centric comfort

The benchmark pipeline begins with a Regional News Miner Agent. It iterates across region, language, source, and development-stage combinations. The curated source table includes the United States, China, Japan, Korea, Brazil, Australia, Germany, France, Spain, and CIS countries, each with region-specific biotech or health-related sources. The point is not decorative multilingualism. The point is source coverage.
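The sweep itself is conceptually a grid. A rough sketch of how such a task list could be enumerated, assuming a per-region configuration of language and sources; the source names, language codes, and stage labels below are placeholders, not the paper's curated table:

```python
from itertools import product

# Hypothetical configuration; the paper's curated source table is richer.
REGION_SOURCES = {
    "China": {"language": "zh", "sources": ["source-cn-1", "source-cn-2"]},
    "Japan": {"language": "ja", "sources": ["source-jp-1"]},
    "Brazil": {"language": "pt", "sources": ["source-br-1"]},
}
STAGES = ["preclinical", "phase 1", "phase 2"]


def mining_tasks(region_sources, stages):
    """Yield one mining task per (region, language, source, stage) combination."""
    for region, cfg in region_sources.items():
        for source, stage in product(cfg["sources"], stages):
            yield {
                "region": region,
                "language": cfg["language"],
                "source": source,
                "stage": stage,
            }


tasks = list(mining_tasks(REGION_SOURCES, STAGES))
print(len(tasks), tasks[0])
```

Making the grid explicit is the useful part: a region or stage that was never swept shows up as a missing row rather than as a silent gap.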

If a search agent begins from English-language global visibility, it will overweight assets already amplified by U.S. or international trade press. That is useful for yesterday’s market map. It is less useful for under-the-radar scouting.

After mining, an Attributes Enrichment Agent validates whether a candidate is actually a drug program and extracts structured evidence: developer, modality, target, mechanism of action, indication, development stage, trial records, regulatory status, and supporting provenance. This step matters because recall-oriented mining is noisy by design. A miner that never returns false positives is probably not exploring hard enough. The system therefore separates discovery from validation rather than pretending one prompt can do both perfectly.
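One way to picture the separation is a structured record plus an explicit acceptance gate. This is a sketch under assumptions: the field names mirror the attributes listed above, but the `EnrichedAsset` class, the `REQUIRED` tuple, and the `accept` rule are hypothetical and not the paper's schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EnrichedAsset:
    name: str
    is_drug_program: bool
    developer: Optional[str] = None
    modality: Optional[str] = None
    target: Optional[str] = None
    mechanism: Optional[str] = None
    indication: Optional[str] = None
    stage: Optional[str] = None
    trial_ids: list[str] = field(default_factory=list)
    regulatory_status: Optional[str] = None
    provenance: list[str] = field(default_factory=list)  # source URLs or citations


# Assumed minimum set of fields a record needs before it counts as validated.
REQUIRED = ("developer", "modality", "indication", "stage")


def accept(asset: EnrichedAsset) -> bool:
    """Keep a mined candidate only if it is a drug program with sourced core fields."""
    has_core = all(getattr(asset, name) for name in REQUIRED)
    return asset.is_drug_program and has_core and bool(asset.provenance)


good = EnrichedAsset(
    name="asset-A", is_drug_program=True, developer="DevCo", modality="antibody",
    indication="gastric cancer", stage="phase 2",
    provenance=["https://example.com/announcement"],
)
noisy = EnrichedAsset(name="Platform X", is_drug_program=False)
print(accept(good), accept(noisy))
```

The miner is allowed to be noisy because the gate is not. That is the whole bargain.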

The Google Search Agent then estimates English-versus-origin-language discoverability. Assets with heavy global amplification are less interesting for this benchmark. The authors want cases where local-language visibility is meaningful and English-centric search may fail.
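How the discoverability estimate is computed is not detailed here, so the following is only an illustration of the idea. It assumes the agent can produce two hit counts per asset, one for English results and one for origin-language results; the function names and the `min_share` threshold are hypothetical.

```python
def origin_language_share(english_hits: int, origin_hits: int) -> float:
    """Fraction of all search hits that come from origin-language results."""
    total = english_hits + origin_hits
    return origin_hits / total if total else 0.0


def is_under_amplified(english_hits: int, origin_hits: int, min_share: float = 0.7) -> bool:
    """Keep assets whose visibility is mostly local (threshold is an assumption)."""
    return origin_hits > 0 and origin_language_share(english_hits, origin_hits) >= min_share


print(is_under_amplified(english_hits=2, origin_hits=18))   # mostly local coverage
print(is_under_amplified(english_hits=50, origin_hits=10))  # already amplified in English
```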

Then comes query generation. Real investor and BD screening queries are clustered into intent categories such as in-licensing screening, indication landscape mapping, target-first landscape mapping, white-space hunting, geography constraints, platform scouting, catalysts, and combination-regimen discovery. Queries are also grouped by difficulty: broad, tight, and complex or multi-hop.

This gives the benchmark a useful shape. It is not merely multilingual trivia. It asks whether an agent can satisfy investor-native constraints under incomplete, heterogeneous, alias-heavy evidence.

A compact way to read the benchmark is this:

Benchmark design choice	What it prevents	What it tests instead
Start from validated regional assets	Ground truth biased toward easy English search	Recovery of assets outside the usual visibility cone
Forbid direct identifiers in generated queries	Name/code lookup disguised as reasoning	Class-based scouting and evidence aggregation
Use investor query templates	Synthetic prompts that feel academic but not commercial	Realistic BD and VC screening intent
Validate attributes with provenance	Hallucinated or stale candidate records	Evidence-backed inclusion and exclusion
Filter for under-amplified assets	Benchmark dominated by already-famous programs	Long-tail discovery where coverage matters

The benchmark is still not a complete map of global drug innovation. The authors acknowledge residual bias: news-based seed selection is not uniform across all assets. Some geographies, modalities, and stages will be easier to mine from public sources than others. But the method is directionally important because it makes a hidden failure visible: standard agents can look competent while missing the very assets the scouting workflow is supposed to surface.

Bioptic Agent turns research into branch management

The paper’s second contribution is the Bioptic Agent architecture. Its core move is simple: do not let the research process collapse into one long conversation.

Sequential agents tend to work like this: search, summarize, append prior findings, ask for more, repeat. That can help at first. But once obvious search angles are exhausted, the loop often revisits the same domains, reformulates similar searches, and produces diminishing returns. The system may look busy. Busy is not the same as complete.

Bioptic Agent instead treats exploration as a tree of directives. Each node represents a search angle. Investigator Agents execute web searches under that directive. Their candidates are merged, validated, deduplicated, scored, and stored. A Coach Agent then expands the tree by generating narrower, non-overlapping child directives based on what has already been found, which queries and domains have already been visited, and which false positives or missed criteria appeared in validation.
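The tree can be sketched with a small node type. This is an assumption-laden illustration: `DirectiveNode` and `add_children` are hypothetical names, and "non-overlapping" is reduced here to rejecting repeated directive text, whereas the paper's Coach Agent reasons over prior findings, visited queries and domains, and validation errors.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DirectiveNode:
    directive: str
    parent: Optional["DirectiveNode"] = None
    children: list["DirectiveNode"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

    def add_children(self, proposals: list[str], seen: set[str]) -> list["DirectiveNode"]:
        """Attach proposed child directives, skipping text already used in the tree."""
        added = []
        for text in proposals:
            key = " ".join(text.casefold().split())  # normalize case and spacing
            if key in seen:
                continue
            seen.add(key)
            child = DirectiveNode(directive=text, parent=self)
            self.children.append(child)
            added.append(child)
        return added


seen: set[str] = set()
root = DirectiveNode("Bispecific antibodies, any region")
added = root.add_children(
    [
        "Search Chinese trial registries",
        "search chinese  trial registries",  # duplicate after normalization
        "Search Korean conference abstracts",
    ],
    seen,
)
print([child.directive for child in added])
```

Each node carries `visits` and `total_reward` so that a later selection rule has something to allocate against.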

This is not “self-reflection” as a motivational poster. It is closer to search portfolio management.

The architecture has several mechanisms that map directly to business failure modes:

Bioptic component	Operational role	Business failure it addresses
Persistent candidate store	Keeps all discovered candidates before validation	Prevents the agent from forgetting partial finds across long searches
Validated asset store	Stores only matched, deduplicated assets	Reduces noisy candidate inflation
Evidence stores	Track executed queries and visited domains	Makes repeated search paths visible
Language-parallel investigators	Search in English and configured local languages	Reduces English-first blind spots
Criteria Match Validator	Checks candidates against query logic with evidence	Reduces false positives and loose interpretation
Deduplication Agent	Resolves aliases, codes, transliterations, and historical names	Prevents the same asset from being counted as several discoveries
Coach Agent	Generates non-overlapping child directives from gaps and errors	Redirects effort toward under-explored branches
UCB-style selection	Balances promising branches and under-tested branches	Allocates compute instead of merely spending it

The phrase “UCB-style” is important. The system does not simply choose the branch that performed best once. It uses a selection rule with an exploitation term and an exploration bonus. In plain language: keep searching branches that produce new valid assets, but also give insufficiently tested branches a chance. That is exactly the trade-off in real scouting. You want to follow productive leads without letting one early-successful angle monopolize the entire budget.
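For readers who have not met the rule before, here is the textbook UCB1 form. The paper says only "UCB-style", so treat the exact formula, the constant `c`, and the function names as assumptions rather than the authors' implementation.

```python
import math


def ucb_score(mean_reward: float, visits: int, total_visits: int, c: float = 1.4) -> float:
    """UCB1-style score: exploitation term plus an exploration bonus."""
    if visits == 0:
        return math.inf  # untested branches are tried first
    return mean_reward + c * math.sqrt(math.log(total_visits) / visits)


def select_branch(stats: dict[str, tuple[float, int]], c: float = 1.4) -> str:
    """Pick the branch with the highest score; stats maps name -> (mean reward, visits)."""
    total = sum(visits for _, visits in stats.values())
    return max(stats, key=lambda name: ucb_score(stats[name][0], stats[name][1], total, c))


# Hypothetical branch statistics: one well-tested branch, one barely tested.
stats = {"registry-search": (0.6, 10), "conference-pdfs": (0.5, 1)}
print(select_branch(stats))         # the exploration bonus favors the barely tested branch
print(select_branch(stats, c=0.0))  # with no bonus, the best observed mean wins
```

The bonus shrinks as a branch accumulates visits, so an early-successful angle has to keep earning its budget.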

The node reward is also aligned with the job. A branch is rewarded for producing newly added, validated, deduplicated assets, adjusted by local precision. That design discourages the cheap trick of dumping hundreds of weak candidates. Volume without validity is just spam wearing a lab coat.
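A toy version of such a reward makes the incentive visible. The multiplicative form below is an assumption; the paper describes the reward as new validated, deduplicated assets "adjusted by local precision" without this article giving the formula.

```python
def node_reward(new_valid_assets: int, candidates_returned: int, candidates_passed: int) -> float:
    """New validated, deduplicated assets scaled by the branch's local precision."""
    if candidates_returned == 0:
        return 0.0
    local_precision = candidates_passed / candidates_returned
    return new_valid_assets * local_precision


# Both branches add three new valid assets; only one buries them in weak candidates.
focused = node_reward(new_valid_assets=3, candidates_returned=4, candidates_passed=3)
spammy = node_reward(new_valid_assets=3, candidates_returned=200, candidates_passed=3)
print(focused, spammy)
```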

Validation is where generic self-correction stops being enough

One of the paper’s more useful ideas is that validators should be tailored to the end task, not treated as generic criticism.

A general self-critique loop may improve consistency, wording, or citation neatness. But a drug scouting query can fail in more specific ways. The asset may have the wrong mechanism. It may be inactive. It may be a platform rather than a drug program. It may be a multi-target asset when the query asks for an exclusive target. It may satisfy an indication criterion but fail the stage criterion. It may appear under a code name that has since changed.

Bioptic Agent’s Criteria Match Validator decomposes a query into criteria, checks each candidate against those requirements, and produces evidence-backed pass or fail rationales. Those rationales are not only used to filter results. They are fed back into the Coach Agent as search diagnostics.
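In code, the decomposition might look like the sketch below. It stands in simple predicates for what the paper implements as an LLM-based validator, and every name in it (`CriterionResult`, `validate`, `failure_summary`, the sample criteria and candidates) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CriterionResult:
    criterion: str
    passed: bool
    rationale: str


def validate(candidate: dict, criteria: dict[str, Callable[[dict], bool]]) -> list[CriterionResult]:
    """Check one candidate against each criterion and attach the recorded evidence."""
    results = []
    for name, check in criteria.items():
        evidence = candidate.get("evidence", {}).get(name, "no evidence recorded")
        results.append(CriterionResult(name, check(candidate), evidence))
    return results


def failure_summary(all_results: list[list[CriterionResult]]) -> dict[str, int]:
    """Count failures per criterion; a Coach-like step could read this as a diagnostic."""
    counts: dict[str, int] = {}
    for results in all_results:
        for result in results:
            if not result.passed:
                counts[result.criterion] = counts.get(result.criterion, 0) + 1
    return counts


criteria = {
    "stage": lambda c: c.get("stage") == "phase 2",
    "single_target": lambda c: len(c.get("targets", [])) == 1,
}
candidates = [
    {"name": "asset-A", "stage": "phase 2", "targets": ["T1"]},
    {"name": "asset-B", "stage": "phase 1", "targets": ["T1", "T2"]},
    {"name": "asset-C", "stage": "phase 2", "targets": ["T1", "T2"]},
]
all_results = [validate(c, criteria) for c in candidates]
matched = [c["name"] for c, rs in zip(candidates, all_results) if all(r.passed for r in rs)]
print(matched, failure_summary(all_results))
```

The summary is the interesting output. If most rejections come from the exclusive-target criterion, the next directives should say so.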

This creates a loop that is more operational than philosophical:

1. Search under a directive.
2. Return candidates.
3. Validate candidates against hard criteria.
4. Deduplicate matched assets.
5. Summarize failure patterns.
6. Generate narrower child directives that avoid repeated mistakes.
7. Allocate future search toward branches that still add valid assets.
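The first part of that loop, from searching through summarizing failures, fits in a few lines once the agents are treated as pluggable functions. A minimal sketch, assuming caller-supplied stubs for the investigator, the validator, and alias canonicalization; none of these signatures come from the paper.

```python
def run_epoch(directive, investigate, validate, canonical, validated, visited):
    """Run one search pass for a directive and report what it added."""
    visited.add(directive)                            # remember the search angle
    candidates = investigate(directive)               # search and return candidates
    passed = [c for c in candidates if validate(c)]   # hard-criteria validation
    new_assets = []
    for candidate in passed:                          # deduplicate via canonical key
        key = canonical(candidate)
        if key not in validated:
            validated[key] = candidate
            new_assets.append(key)
    return {
        "new": new_assets,
        "returned": len(candidates),
        "failures": len(candidates) - len(passed),    # crude failure summary
    }


# Stub agents and fake results, for illustration only.
fake_results = {"d1": ["Asset A", "asset-a", "Asset B", "Platform X"]}
investigate = lambda directive: fake_results.get(directive, [])
validate = lambda name: not name.startswith("Platform")
canonical = lambda name: name.casefold().replace("-", " ")

validated, visited = {}, set()
first = run_epoch("d1", investigate, validate, canonical, validated, visited)
second = run_epoch("d1", investigate, validate, canonical, validated, visited)
print(first, second)
```

Running the same directive twice returns nothing new, which is exactly the signal a budget allocator needs in order to stop paying for it.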

The system is therefore not merely asking, “Was my last answer good?” It is asking, “Which part of the search space did this branch cover, which errors did it make, and where should the next search effort go?”

That is the difference between an assistant and a scouting system.

The headline result is strong, but the mechanism is the real story

The paper evaluates Bioptic Agent and several baselines on a held-out gold test split of 22 query-asset pairs from the Completeness Benchmark. The evaluation uses recall, precision, and F1. Recall checks whether the expected ground-truth asset appears in the predicted list after alias and cross-lingual resolution. Precision checks whether predicted assets satisfy the query criteria.
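The two checks can be sketched for a single query-asset pair as follows. The alias table, the `satisfies_criteria` predicate, and the asset names are invented for illustration, and how the paper aggregates per-query values into the reported figures is not specified here.

```python
def canonical(name: str, aliases: dict[str, str]) -> str:
    """Map a name or code to its canonical asset name (lowercased if unknown)."""
    key = name.strip().casefold()
    return aliases.get(key, key)


def query_metrics(predicted, satisfies_criteria, gold_asset, aliases):
    """Return (recall_hit, precision) for one query-asset pair."""
    pred = {canonical(p, aliases) for p in predicted}
    hit = canonical(gold_asset, aliases) in pred          # recall: was the gold asset found?
    valid = {canonical(p, aliases) for p in predicted if satisfies_criteria(p)}
    precision = len(valid) / len(pred) if pred else 0.0   # share of predictions that qualify
    return hit, precision


# Hypothetical alias table: a development code that resolves to a drug name.
aliases = {"xyz-001": "examplimab"}
predicted = ["XYZ-001", "Otherumab", "Platform Q"]
hit, precision = query_metrics(
    predicted,
    satisfies_criteria=lambda name: name != "Platform Q",
    gold_asset="Examplimab",
    aliases=aliases,
)
print(hit, round(precision, 3))
```

Without the alias step, the gold asset here would count as missed even though the system found it under its code.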

The headline table is clear:

Model	Recall	Precision	F1
Bioptic Agent (GPT-5.2, high)	0.730	0.877	0.797
Gemini 3.1 Deep Think	0.636	0.554	0.592
Gemini 3.1 Pro Deep Research	0.545	0.634	0.586
Claude Opus 4.6 (high)	0.454	0.736	0.562
Gemini 3 Pro Deep Research	0.500	0.512	0.506
OpenAI Deep Research (o4-mini)	0.372	0.713	0.489
GPT-5.2 Pro (high)	0.364	0.648	0.466
Perplexity Sonar Deep Research (high)	0.409	0.481	0.442
GPT-5.2 (high)	0.182	0.683	0.287
Exa Websets (num_matches=500)	0.182	0.515	0.269

Bioptic Agent’s F1 of 0.797 is about 0.2 above the strongest listed baseline, Gemini 3.1 Deep Think at 0.592. More interestingly, Bioptic does not win by sacrificing precision for recall. It reports both the highest precision, 0.877, and the highest recall, 0.730, in the table.
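The F1 figures in the table are consistent with the usual harmonic mean of the reported precision and recall. A quick check for the top two rows, assuming F1 is computed from the aggregate P and R shown rather than averaged per query:

```latex
F_1 = \frac{2PR}{P + R}

\text{Bioptic Agent: } \frac{2(0.877)(0.730)}{0.877 + 0.730} = \frac{1.2804}{1.607} \approx 0.797

\text{Gemini 3.1 Deep Think: } \frac{2(0.554)(0.636)}{0.554 + 0.636} = \frac{0.7047}{1.190} \approx 0.592
```

The gap between the two is therefore roughly 0.205 F1.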

That combination is operationally important. A high-recall but low-precision scouting system creates review burden. A high-precision but low-recall system creates false comfort. The dangerous product demo is the second one: it looks clean because it missed the messy stuff.

The paper’s Figure 1 adds the more important interpretation. Bioptic Agent improves rapidly with additional time and approaches a plateau near 0.80 F1. Sequential scaffold baselines, including a GPT-5.2 loop and an o4-mini-deep-research loop, improve more slowly and saturate earlier. The authors use these sequential scaffolds to test the natural “just run it longer” hypothesis.

This is not a minor ablation. It is the central argument of the paper. Longer browsing helps until the obvious paths are depleted. After that, structure matters.

The ablation is not a second thesis; it explains the saturation problem

The paper also compares Bioptic Agent with a no-tree, language-free ablation. This ablation keeps several components (Coach reflection, validators, and deduplication) but removes the tree structure and disables multilingual parallelism.

The result is not “everything collapses immediately.” That would be too convenient. The ablation achieves comparable quality through roughly the fifth epoch, then saturates. At 10 epochs, the no-tree variant executes 50 Investigator calls, while the full Bioptic setting executes 20 Investigator calls, yet the full system avoids the same saturation pattern.

This is useful because it separates early gains from sustained search quality. Early in the search, many methods can find obvious assets. The harder question is what happens after the first wave. Do additional calls expand coverage, or do they circle around the same sources?

The ablation suggests that tree-based exploration and multilingual rollout matter most when the search begins to run out of easy wins. The agent needs a way to partition the remaining search space, remember where it has already looked, and force attention toward under-explored branches. Without that structure, more calls can simply create more motion.
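Saturation is easy to measure once new valid assets are logged per epoch. A small sketch, assuming cumulative counts of validated assets are recorded after each epoch; the two series below are synthetic numbers made up for illustration and are not results from the paper.

```python
def marginal_yield(cumulative_valid: list[int]) -> list[int]:
    """New valid assets added in each epoch, derived from cumulative counts."""
    return [b - a for a, b in zip([0] + cumulative_valid, cumulative_valid)]


def saturation_epoch(cumulative_valid: list[int], patience: int = 3):
    """First 1-based epoch that starts `patience` straight epochs with no gain, else None."""
    gains = marginal_yield(cumulative_valid)
    for i in range(len(gains) - patience + 1):
        if all(gain == 0 for gain in gains[i:i + patience]):
            return i + 1
    return None


# Synthetic series (not from the paper): one run flattens, one keeps finding assets.
flat_run = [3, 5, 6, 7, 7, 7, 7, 7, 7, 7]
growing_run = [3, 5, 6, 7, 8, 9, 9, 10, 11, 11]
print(saturation_epoch(flat_run), saturation_epoch(growing_run))
```

A metric like this turns "the agent seems to be going in circles" into something a budget policy can act on.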

This is the point many enterprise AI discussions still miss. Compute is not a strategy. Compute becomes strategy only when the system knows where to spend it.

What Cognaptus infers for business use

The paper directly shows that a specialized, tree-based, multilingual scouting scaffold outperforms several strong general-purpose research and find-all systems on a small, domain-specific benchmark for drug asset scouting.

Cognaptus would translate that into a broader design lesson, but with boundaries: high-recall business research should be built as a controlled discovery workflow, not as a long chat session.

The lesson applies especially to workflows with these properties:

Workflow property	Why generic deep research struggles	Useful design response
Open-world set discovery	There is no single known answer	Track candidate coverage explicitly
Multilingual or regional evidence	English search misses early local disclosures	Run language-specific investigators
Alias-heavy entities	Names, codes, and transliterations fragment evidence	Build deduplication into the loop
Multi-constraint screening	Candidates can partially match and still be wrong	Use criteria validators with provenance
Diminishing returns after obvious sources	Sequential search revisits similar paths	Allocate compute through a search tree
High cost of omission	A clean but incomplete answer is dangerous	Optimize for recall under validation, not just polish

This is not only a pharmaceutical lesson. Similar patterns appear in supplier discovery, policy intelligence, emerging-market competitive monitoring, litigation research, acquisition screening, grant and patent landscaping, and technical due diligence. Anywhere the task is “find all relevant entities under messy constraints,” a report-centric agent is the wrong abstraction.

A better abstraction is closer to an evidence operating system: candidate memory, source memory, branch memory, validation, deduplication, and budget allocation.

That sounds less magical. Good. Magic is what vendors call missing edge cases before procurement notices.

What the paper does not prove

The limitations are not decorative; they affect deployment interpretation.

First, the benchmark test split contains 22 query-asset pairs. The paper’s result is strong, but the sample is small. It is enough to make the mechanism interesting. It is not enough to declare a universal law of research agents across all industries.

Second, the benchmark construction itself uses agents and LLM-based components, including generated queries and LLM-as-judge grading calibrated to expert opinions. The pipeline is carefully designed, but it is not the same as a fully independent human-labeled census of all possible assets. The benchmark reduces several biases; it does not abolish bias, because this is research, not baptism.

Third, the domain is specialized. Drug asset scouting has unusually high alias complexity, strict inclusion criteria, and valuable long-tail discoveries. The system design may transfer well to other high-recall workflows, but the exact performance should not be assumed outside similar conditions.

Fourth, operational cost matters. Bioptic Agent uses multiple investigators, validators, deduplicators, and coach steps. The paper argues that structured compute is more useful than naive longer browsing, but real deployment still needs cost controls, latency targets, audit policies, and human review thresholds.

Finally, the system is optimized for public-source scouting. It does not replace confidential diligence, clinical judgment, IP review, regulatory analysis, or deal negotiation. It improves the front end of discovery: finding and organizing candidates that deserve deeper human attention.

That is still valuable. A front door that opens to more of the right rooms is not the whole building, but it is better than a decorative wall.

The strategic takeaway is coverage control

The most useful idea in the paper is not that Bioptic Agent scored 0.797 F1. That number matters, but it is only the surface.

The deeper claim is that high-stakes research agents need coverage control. They must know what they have searched, what they have found, what failed validation, which aliases may refer to the same object, which branches are still under-explored, and where additional compute is likely to produce new valid discoveries.

This is a shift from “AI writes a report” to “AI manages an investigation.”

For business users, the procurement question should therefore change. Do not only ask whether a research tool can cite sources and summarize findings. Ask whether it can preserve candidate memory, search across local-language evidence, validate hard criteria, deduplicate entities, expose omissions, and allocate search effort away from repeated paths.

A polished answer is nice. A complete answer is rarer.

And in asset scouting, rare is where the money usually is.

Cognaptus: Automate the Present, Incubate the Future.

¹ Vlad Vinogradov, Alisa Vinogradova, Luba Greenwood, Ilya Yasny, Dmitry Kobyzev, Shoman Kasbekar, Kong Nguyen, Dmitrii Radkevich, Ilya Shkirenko, Ivan Izmailov, Daniil Anisimov, Roman Doronin, and Andrey Doronichev, “Hunt Globally: Wide Search AI Agents for Drug Asset Scouting in Investing, Business Development, and Competitive Intelligence,” arXiv:2602.15019v5, 2026. https://arxiv.org/abs/2602.15019
