TL;DR for operators
A recent arXiv paper on LLM-based agents for drug-asset due diligence shows something more useful than “AI does research now.” It shows a practical operating pattern: convert past expert memos into a measurable benchmark, send a persistent web-search agent to maximise competitor recall, then pass candidates through a stricter validator before analysts see them.1
The result is commercially meaningful because drug competitor mapping is not a neat database lookup. It is a scavenger hunt across clinical-trial registries, press releases, scientific literature, patents, investor decks, regulatory records, aliases, discontinued assets, adjacent indications, and occasionally the swamp where naming conventions go to die. The paper reports that its Bioptic Agent reaches 0.83 recall on competitor discovery after validator filtering, ahead of OpenAI Deep Research at 0.65 and Perplexity Labs at 0.60. In a private biotech VC case study, the workflow reduced analyst turnaround from roughly 2.5 days to roughly 3 hours.
The important lesson is architectural. General-purpose deep-research tools are useful, but this paper argues they are still too recall-limited for regulated, investor-specific drug landscape work. The winning design is not a clever prompt taped to a browser. It is a workflow with institutional memory, web persistence, alias-aware grading, hard-negative validation, CI/CD evaluation, budget controls, and analyst-in-the-loop review. Less magical assistant, more diligence machine with a clipboard.
The boundary is equally important. The benchmark is derived from private VC diligence memos and reflects what past experts identified, not a complete universal truth set. The system surfaces additional analyst-validated assets outside the historical ground truth, which is promising, but also proves the point: in this domain, “ground truth” is negotiated, evolving, and expensive. This is automation for faster, better-supported analysis, not a regulatory oracle in a hoodie.
The expensive failure mode is missing the competitor, not writing the memo slowly
Drug diligence starts with a deceptively simple question: what else is competing with this asset for the same clinical and commercial space?
That question decides more than slideware. It affects patent strategy, in-licensing and out-licensing arguments, trial-design choices, comparator selection, market-access assumptions, pricing exposure, and the credibility of the investment memo. In the paper’s framing, competitor discovery has become more consequential as health technology assessment and comparator expectations sharpen, especially for oncology medicines and advanced therapies. Missing the wrong asset is not a clerical error. It can become a valuation error.
The paper’s basic object is an indication-to-competitor map. Given an indication, the agent must retrieve the drugs that belong in that competitive landscape and extract attributes such as modality, mechanism of action, targets, development stage, regulatory status, administration route, therapeutic area, other indications, and company information.
That sounds database-shaped. It is not.
The authors emphasise several sources of messiness: paywalled or licensed datasets, fragmented registries, mismatched indication ontologies, drug-name aliases, multilingual and multimodal memos, changing landscapes, and investor-specific definitions of what counts as a competitor. One analyst may include a failed but instructive mechanism. Another may exclude a broad platform without a named candidate. A third may care about line of therapy, regional approval, or preclinical relevance. The spreadsheet survives because the world refuses to become a spreadsheet.
This is why the paper’s main contribution is not just a benchmark score. It is a mechanism for making a historically artisanal workflow measurable enough to improve.
The benchmark starts by mining institutional memory
The authors begin with a corpus of private biotech VC diligence memos. The initial corpus contains 210 unclassified memos spanning 15 years. From that, they select reports from 2019 to 2025 that either perform diligence on companies with at least one drug asset or assess a competitive landscape for a specific indication. This yields 73 reports and 174 drug-asset–indication pairs.
The memos are not conveniently structured. Competitive landscape details appear in free text, tables, images, screenshots, plots, and slide-like artefacts. The system therefore uses a hierarchical parsing agent based on Gemini 2.5 Pro to extract a structure that mirrors the diligence logic:
company
→ asset
→ indication
→ competitor list
→ attributes
The paper then flattens this into an indication-centred mapping:
indication
→ competitor drug
→ attributes
This is an underrated move. Most enterprise AI failures begin by pretending that documents already contain the dataset you wish you had. Here, the paper treats parsing as a first-class technical step. The pipeline uses schema-constrained prompts, verbatim evidence requirements, rejection of unsupported outputs, Pydantic validation, controlled vocabularies with open-world extension, and expert verification of merged indication aliases.
After cleaning and merging, the benchmark retains 105 unique indications with corresponding competitor drug lists. The authors then create three related datasets:
| Dataset | What it measures | Why it matters operationally |
|---|---|---|
| Competitors Dataset | Recall of competitor names per indication | Whether the agent finds the assets experts previously considered relevant |
| Attributes Dataset | Recovery of canonical drug attributes | Whether discovered competitors are useful enough for diligence, not just listed |
| Competitor-Validator Dataset | Binary validity of indication–drug pairs | Whether false positives can be filtered before analyst-facing output |
The key admission is that the competitor lists are expert-curated but not assumed complete. That is not a weakness to wave away; it is the correct description of the domain. In diligence, historical expert memory is often the best available measurement instrument, even when it is incomplete.
The system separates exploration from judgement
The accepted framing for this article is mechanism-first because the central business lesson is the workflow design. The paper does not ask a single model to be simultaneously creative, exhaustive, conservative, and perfectly calibrated. That would be adorable. It instead splits the job.
The explorer is a ReAct-style web agent. It iterates through a Thought-Action-Observation loop, formulates targeted web queries, uses retrieval tools, accumulates findings, and can run step-parallel search fan-out. The authors expose two retrieval tools to the agent: Gemini 2.5 Pro with browsing and Perplexity Sonar. In the ReAct-12-S-20 configuration, the agent performs 12 ReAct iterations and can generate up to 20 parallel web searches per step.
The validator is a separate LLM-as-judge. Given an indication and a drug name, it decides whether the drug belongs in that competitive landscape. It is based on Gemini 2.5 Flash with web search and uses three ReAct attempts. It searches authoritative sources such as clinical trial registries, regulatory filings, scientific literature, conference abstracts, market reports, investor materials, and company press releases.
The validator’s inclusion rules are deliberately stricter than the explorer’s search behaviour. It classifies a drug as a competitor only when there is verifiable evidence for the same indication, including clinical development, approval, failed or discontinued trials, case reports, or, where clinical evidence is absent, clear mechanistic relevance plus publicly documented IND-enabling or preclinical evidence. It excludes purely theoretical mechanisms, wrong identifiers, broad platforms without named candidates, and related-but-distinct indications without a direct link.
This is the paper’s reusable design pattern:
| Stage | Optimisation target | Failure being controlled |
|---|---|---|
| Parse historical memos | Structure messy expert memory | Losing institutional knowledge inside PDFs and slides |
| Resolve aliases | Avoid undercounting true matches | Brand/generic/code-name fragmentation |
| Explorer agent | Maximise recall | Missing long-tail competitors |
| Validator agent | Suppress false positives | Hallucinated or out-of-scope assets |
| CI/CD evaluation | Prevent regression | Prompt drift and scaffold changes that quietly degrade output |
| Analyst review | Keep accountability | Treating probabilistic retrieval as final diligence judgement |
The business implication is straightforward: do not build one “research bot.” Build a search-and-acceptance pipeline. Exploration and validation are different jobs. The paper’s architecture respects that difference.
The validator is doing risk control, not decorative AI governance
The Competitor-Validator is trained through prompt refinement on expert-labelled positive and negative pairs. The negatives are not random nonsense. The authors mine hard near-misses by asking GPT o3-pro to generate competitor candidates, removing known true competitors, sampling remaining candidates, and sending them to memo authors for expert labelling.
That matters because false positives in this task are rarely absurd. The hard cases are plausible: umbrella terms, nearby indications, withdrawn or unrelated programmes, candidates with mechanistic similarity but no indication-level evidence, platform technologies without a specific drug, and naming collisions. In other words, the validator is not filtering “banana” from a list of oncology drugs. It is filtering the kind of almost-right answer that makes an analyst lose an afternoon.
The validator achieves the following performance:
| Split | Precision | Recall | F1 |
|---|---|---|---|
| Validation | 90.7% | 89.5% | 90.1% |
| Test | 90.4% | 85.7% | 88.0% |
These numbers should be read as evidence that the validator is useful enough to serve as an automated precision filter and evaluation component, not as proof that it is a perfect clinical arbiter. The paper uses it both to suppress false positives before results reach client-facing components and to support automated scoring in CI/CD.
That CI/CD point is not glamorous, so naturally it is important. Agent systems degrade when prompts, tools, budgets, models, or source availability change. A diligence workflow that cannot detect regression is not a workflow; it is a recurring surprise invoice.
The main benchmark says persistence beats generic deep research
The paper evaluates several model families: non-web foundation models, web-enabled model calls, Deep Research-style agents, and LLM scaffolding agents such as ReAct and ReAct with Reflexion. All systems are tasked with thorough competitive landscape research.
The main competitor discovery table reports recall on the test split after predicted false positives are suppressed by the Competitor-Validator. This is the deployment-like setting.
| System | Recall after validator filtering |
|---|---|
| Bioptic Agent (Gemini 2.5 Flash) | 0.83 |
| ReAct-12-S-20 (Gemini 2.5 Pro) | 0.78 |
| ReAct-3-Reflexion-3-History (Gemini 2.5 Pro) | 0.77 |
| ReAct-3 (Claude Sonnet 4) | 0.68 |
| o3-pro, no web | 0.67 |
| OpenAI Deep Research | 0.65 |
| GPT-5 | 0.63 |
| Perplexity Labs | 0.60 |
| Gemini 2.5 Pro | 0.59 |
| GPT-4o | 0.56 |
The headline is not simply “Bioptic wins.” The more useful interpretation is that recall comes from persistent, scaffolded search, not from generic intelligence alone. The Bioptic Agent is an ensemble of three ReAct-12-S-20 agents run at Gemini’s default temperature and merged. It is designed to widen retrieval, not merely sound smarter.
This distinction matters for buyers and builders. A generic deep-research product may produce an impressively written report and still miss relevant long-tail competitors. In diligence, a polished omission is still an omission. The prose can wear a tie; the missing asset still damages the analysis.
The hard-sample analysis explains where the gain actually lives
The paper’s hard-sample analysis is best read as a robustness test. It asks what happens as samples become harder, where hardness is defined by the non-web o3-pro baseline’s recall. If a no-web model performs poorly on an indication, that indication is treated as difficult.
The reported pattern is intuitive and operationally useful. Non-web models degrade sharply on harder samples. Simple web-enabled models help, but the performance gap between them and iterative ReAct agents widens as difficulty increases. In the most difficult regimes, where o3-pro recovers no more than 40% of ground-truth competitors, more search iterations provide an advantage.
This is a cleaner business result than the aggregate table alone. Aggregate recall can be inflated by easy, popular, well-linked assets. The hard cases are where diligence risk accumulates: rare indications, under-linked assets, obscure company updates, non-obvious mechanisms, discontinued programmes, and fragmented evidence trails. The paper’s claim is that scaffolded web agents degrade less precisely where the analyst most needs help.
There is also a useful model-selection lesson. The authors report that Gemini 2.5 Flash remains competitive with, or even outperforms, Gemini 2.5 Pro across difficulty levels when web search is available. In this workflow, retrieval policy and scaffold design can matter more than simply using the largest model. Procurement departments may wish to pause briefly before turning that into a slogan, but the direction is hard to ignore.
Attribute extraction is useful, but MoA remains the stubborn field
Finding competitors is only the first step. Analysts also need structured attributes. The paper evaluates attribute extraction using LLM-as-judge graders, with web browsing where external evidence is required.
The comparison between OpenAI Deep Research and a lightly tuned ReAct-12 agent is mixed but informative:
| Attribute | OpenAI Deep Research | ReAct-12 |
|---|---|---|
| Overall | 0.76 | 0.82 |
| Aliases | 0.78 | 0.79 |
| Modality | 0.96 | 1.00 |
| Lead indication | 0.80 | 0.76 |
| Administration route | 0.90 | 0.91 |
| Other indications | 0.14 | 0.43 |
| MoA | 0.61 | 0.61 |
| Targets | 0.84 | 0.84 |
| Status & Stage | 0.84 | 0.92 |
| Therapeutic area | 0.92 | 1.00 |
| Company info | 0.77 | 0.89 |
The ReAct-12 agent performs better overall and is notably stronger on other indications, status and stage, therapeutic area, and company information. It matches OpenAI Deep Research on MoA and targets. MoA remains the weakest attribute for both systems at 0.61.
That weakness is not surprising. Mechanism of action is not a single label in the wild. It can be expressed at different levels of granularity: pathway, target, modality, biological process, downstream effect, or marketing-friendly scientific fog. The paper’s MoA grader decomposes the ground truth into atomic statements and measures coverage. That is a sensible evaluation choice, and it exposes the operational reality: MoA should be reviewed with more care than fields like modality or route of administration.
For business users, the implication is simple. Use the system to assemble the landscape and pre-fill structured attributes. Do not let a low-confidence MoA field silently drive investment thesis, trial comparator logic, or partnering argumentation. That is how spreadsheets become folklore.
The ablations are not a second thesis; they explain the precision–recall trade-off
The appendix reports an unfiltered evaluation regime. Unlike the main table, the Competitor-Validator is not used to suppress model outputs before recall is computed. Instead, its signal contributes to precision calculation. This is an ablation and analysis setting, not the production-like setting.
That distinction matters. Removing a conservative filter naturally raises recall and changes precision. The appendix is therefore not directly comparable to the main benchmark table. It answers a different question: what is the upper-bound extraction behaviour of different scaffolds before conservative acceptance?
The appendix reports three patterns:
| Appendix pattern | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Step-parallel search plus modest ensembling improves coverage | Ablation | Breadth of retrieval drives recall | That unfiltered outputs are safe for production |
| Reflexion improves the precision–recall trade-off | Ablation | Critique loops can prune errors and stabilise results | That more reflection is always better |
| Excessive iteration without step summaries can hurt precision | Sensitivity test | Context overload and search sprawl are real | That longer searches are useless in all settings |
The top F1 configuration in the appendix is a ReAct with Reflexion variant at roughly 0.84 F1. ReAct-12-S-20-Ensemble-3 reaches recall of 0.845, precision of 0.900, and F1 of 0.871. A high-precision ReAct-1-Reflexion-3-S-10 configuration reaches precision of 0.950 with recall of 0.740 and F1 of 0.842. OpenAI Deep Research reaches high precision, reported at 0.933, but lower recall of 0.671 and F1 of 0.780. GPT-5 is reported at precision 0.835, recall 0.636, and F1 0.722.
The business reading is not “pick the highest number.” It is: tune the scaffold to the workflow stage. Use high-recall explorers to build the candidate set. Use validators or high-precision Reflexion settings to verify candidates. Do not confuse an exploratory sweep with an analyst-ready answer.
The production result is cycle-time reduction plus discovery lift
The production section is short but valuable. The authors report deployment behind a lightweight Gradio front end, with analyst-in-the-loop review and a graph-orchestrated back end. They enforce per-tool timeouts and budgets, per-tenant rate limits, version-controlled prompts and datasets, structured logging and metrics, and CI/CD evaluation for competitor discovery and attribute extraction.
The operational case study reports that analyst turnaround per drug asset dropped from roughly 2.5 days to roughly 3 hours, or about 20× faster, while maintaining analyst-verified precision. The authors also report that the system surfaced relevant assets outside the historical ground truth, and VC analysts validated these additional assets as correct and decision-useful.
That last point is important. If the benchmark is built from historical diligence memos, then a system that only recovers memo-listed competitors is useful but bounded by past analyst work. A system that finds additional validated assets suggests discovery lift. Still, the paper does not quantify how often those extra assets change investment decisions, pricing assumptions, clinical comparator strategy, or deal outcomes. That is the next business metric, and probably the one investors should care about most.
A sensible operating dashboard would track:
| Metric | Why it matters |
|---|---|
| Time to first credible competitor list | Measures analyst acceleration |
| Recall against known historical competitors | Measures coverage of institutional memory |
| Validator rejection rate | Measures false-positive pressure |
| Analyst override rate | Measures trust and calibration |
| Newly surfaced validated competitors | Measures discovery lift |
| Decision-changing competitor additions | Measures business value, not just model activity |
| Evidence coverage per accepted competitor | Measures auditability |
The paper gives enough evidence to justify the workflow. It does not yet prove the full economic value chain. That is not a criticism. It is the difference between a strong system paper and a board-level ROI study.
What Cognaptus infers for business use
Here is the clean separation.
| Category | What the paper directly shows | Cognaptus inference | Boundary |
|---|---|---|---|
| Benchmark construction | Private VC memos can be parsed into a structured indication-to-competitor benchmark | Firms can convert past diligence into domain-specific evaluation assets | Only works if past work is accessible, legally usable, and expert-reviewable |
| Competitor discovery | A scaffolded Bioptic Agent reaches 0.83 filtered recall, above Deep Research and Perplexity Labs | Generic deep-research tools may be insufficient for high-recall regulated diligence | Results are on this private benchmark and task definition |
| Validation | A web-grounded LLM judge reaches 88.0% F1 on test for competitor validity | Separating exploration from validation is a practical risk-control pattern | Validator errors still require analyst oversight |
| Hard samples | Web-grounded, iterative agents degrade less on difficult indications | Agent scaffolding matters most in long-tail cases | Difficulty is proxied by o3-pro no-web performance |
| Attribute extraction | ReAct-12 reaches 0.82 overall; MoA remains weak at 0.61 | Structured attribute pre-fill is useful, but MoA needs review | Attribute usefulness depends on downstream tolerance for error |
| Production | Turnaround falls from about 2.5 days to about 3 hours in one VC case study | Diligence teams can compress scan cycles materially | Single case study; no full cost, adoption, or decision-impact analysis |
The bigger implication is that enterprise AI in research-heavy domains should stop treating “answer quality” as a single score. Diligence work has stages: collect, normalise, expand, validate, structure, review, and decide. Each stage has a different risk profile. The paper’s architecture is compelling because it maps model behaviour onto those stages instead of forcing one model to cosplay the entire analyst team.
Where this does not apply cleanly
The paper is strongest for recurring, research-heavy workflows where organisations already possess expert artefacts that can be converted into benchmarks. Biotech VC diligence is almost ideal: expensive analysts, repeated task structure, fragmented evidence, high cost of omissions, and rich historical memos.
It applies less cleanly when there is no historical corpus, no stable definition of the target entity, no expert review capacity, or no tolerance for probabilistic retrieval. A start-up with three messy Notion pages and a dream does not automatically have a benchmark. It has three messy Notion pages and a dream. Different species.
There are also several interpretation boundaries.
First, the ground truth is private, subjective, and incomplete. That is acceptable for measuring recall against past expert knowledge, but it limits external comparability.
Second, public UI evaluations of OpenAI Deep Research and Perplexity Labs are useful but not perfectly controlled against API-based systems and custom scaffolds. The paper acknowledges different evaluation modes. Buyers should read the direction of travel, not pretend this is a universal product ranking.
Third, the validator is itself an LLM system. Its scores are strong enough to support workflow filtering, but not strong enough to remove accountability. In regulated and investment contexts, evidence trails, analyst review, and version control remain mandatory.
Fourth, the production case study reports a dramatic time reduction but does not provide a full cost-benefit analysis. Tool calls, licensed data, analyst review time, false-positive handling, integration cost, and governance overhead all belong in the economic model.
The moat is not the model; it is the measured workflow
The paper’s most useful message is that the “AI research assistant” category is too broad to guide serious implementation. In drug competitor mapping, the hard part is not generating a confident paragraph. The hard part is knowing what evidence to search, how to keep searching when the obvious sources run dry, how to recognise aliases, how to reject plausible nonsense, and how to measure whether the workflow improved.
That is why the mechanism matters more than the headline. The authors build from institutional memory, define recall, mine hard negatives, validate candidate competitors, test hard cases, run ablations, and deploy with operational guardrails. The system is not merely answering. It is being forced to leave fingerprints.
For pharma VC, licensing, clinical strategy, and market access teams, the practical takeaway is not to buy whichever tool writes the nicest research memo. The practical takeaway is to design the diligence workflow around recall risk. Find the long tail first. Validate it ruthlessly. Show the evidence. Track analyst overrides. Keep the benchmark alive.
MoA tells you how the drug works. Moat, in this case, comes from knowing what else is out there before the market, regulator, or counterparty explains it to you at an inconvenient price.
Cognaptus: Automate the Present, Incubate the Future.
-
Vlad Vinogradov et al., “LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence,” arXiv:2508.16571. HTML: https://arxiv.org/html/2508.16571. PDF: https://arxiv.org/pdf/2508.16571. ↩︎