Benchmarks Are From Mars, Workflows Are From Venus: Why AI Research Co‑Pilots Keep Failing in the Wild

Lab meeting. The principal investigator cuts the validation budget from $15,000 to $5,000. The postdoc has already discussed the original plan with an AI research co-pilot. The agent previously suggested a 10-marker flow cytometry panel, bulk RNA-seq validation, and immunofluorescence. Now the researcher returns and says: we need to prioritize.

A useful co-pilot should not simply repeat the original protocol with a smaller price tag. It should remember the hypothesis, preserve the scientific goal, understand the new constraint, propose a cheaper validation path, and know which evidence can be deferred without making the proposal look scientifically flimsy. In other words, it must behave less like a brilliant autocomplete box and more like a collaborator with a working memory, a sense of context, and a modest respect for reality. A rare feature, apparently.

That is the central argument of “From Task Executors to Research Partners: Evaluating AI Co-Pilots Through Workflow Integration in Biomedical Research,” a rapid review of AI benchmarks for preclinical biomedical research.¹ The paper reviews 14 benchmarks published or released between 2018 and October 2025. It finds a striking pattern: the field is getting better at measuring component skills, but it still largely fails to measure whether an AI system can survive the actual shape of research work.

That distinction matters. Biomedical AI systems are often marketed as research assistants, scientific agents, discovery partners, or lab co-pilots. But many evaluation regimes still ask narrower questions: Can the model answer biomedical questions? Can it reconstruct a protocol? Can it detect protocol errors? Can it generate plausible hypotheses? Can it retrieve and cite papers?

Those are not trivial skills. Some are hard. Some are safety-critical. But they are still pieces. The paper’s sharper point is that research is not a pile of isolated pieces. It is a multi-session, constraint-riddled, feedback-heavy process where yesterday’s decision quietly shapes tomorrow’s recommendation. Current benchmarks are good at checking whether the brick is strong. They are much weaker at checking whether the building stands.

The Dr. Martinez case shows the missing unit of evaluation

The most useful part of the paper is not the benchmark catalogue itself. It is the illustrative research journey built around “Dr. Martinez,” a cardiovascular biology postdoc using an AI co-pilot to develop a grant proposal from single-cell RNA sequencing data.

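The scenario spans four sessions across roughly a week: 173 hours from the first session on Monday morning to the last on the following Monday afternoon.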

Session	What the AI appears to do well	What current benchmarks can assess	What they mostly miss
Monday, 9:00 AM	Analyzes scRNA-seq data and identifies an unusual cluster	Data analysis quality, marker interpretation, computational correctness	Whether it asks clarifying questions before overcommitting
Tuesday, 3:30 PM	Refines the hypothesis after the researcher raises a macrophage-contamination concern	Literature retrieval, citation relevance, hypothesis plausibility	Whether it handles critique gracefully instead of defending its first answer
Thursday, 11:00 AM	Revises validation experiments under a $5,000 budget	Protocol design and methodological appropriateness	Whether it navigates constraints creatively rather than merely shrinking the original plan
Monday, 2:00 PM	Integrates the final proposal after PI feedback	Scientific writing quality, proposal structure	Whether it remembers the earlier budget limit and rejected options after a multi-day gap

This case is not an empirical experiment. It is a conceptual stress test. Its purpose is to expose a unit-of-analysis problem: most benchmarks evaluate tasks, while researchers experience workflows.

That sounds obvious until procurement begins. A vendor can show strong scores on literature QA, protocol planning, tool use, or hypothesis generation. A biotech executive may then infer that the system is “ready for research operations.” The paper is basically saying: not so fast. A system can pass several component tests and still fail the workflow in exactly the places users notice first.

The failure may be mundane. It forgets that the lab lacks a platform. It suggests an experiment incompatible with the budget discussed last week. It agrees too quickly with a flawed researcher assumption because sycophancy is cheaper than scientific disagreement. It asks for the same background context again and again. It writes a polished proposal section that quietly contradicts the experimental plan. The prose looks professional. The workflow is leaking.

The review finds progress, but progress in the wrong measurement layer

The authors searched PubMed/MEDLINE, Web of Science, IEEE Xplore, bioRxiv, and arXiv, covering January 1, 2018 to October 31, 2025. They identified 3,247 records, retained 1,803 unique records after deduplication, reviewed 175 full texts, and included 14 benchmarks in the final analysis.

Those 14 benchmarks are not primitive. The review groups them into five evaluation dimensions:

Evaluation dimension	Examples in the review	What this measures well	Why it still falls short
Traditional performance metrics	BLUE, BLURB, BioASQ, PubMedQA, BioLaySumm	Retrieval, classification, QA, summarization, F1-style comparison	Often assumes bounded tasks with stable answers
Multistep reasoning and experimental planning	LAB-Bench, BioPlanner, CRISPR-GPT, BioML-bench	Protocol planning, experimental design, pipeline construction	Still often evaluates completion, not collaboration
Safety and error detection	BioLP-bench, CRISPR-GPT, LAB-Bench contamination controls	Protocol error detection, guardrails, leakage monitoring	Safety under conversation and changing constraints remains under-tested
Knowledge synthesis and discovery	Dyport, ScholarQABench, BioDiscoveryAgent	Hypothesis generation, multi-paper synthesis, closed-loop design elements	Does not fully capture long-term research partnership
Tool-augmented workflows	LAB-Bench, BioDiscoveryAgent, BioML-bench	Use of databases, code execution, literature search, ML pipelines	Tool use is tested in structured settings, not messy institutional workflows

This is a useful map because it avoids a cheap argument. The paper is not saying current biomedical AI benchmarks are useless. Many are sophisticated. LAB-Bench uses a broad 8-category framework with 2,457 evaluation questions across 31 subtasks and includes “human-hard” cloning scenarios. CRISPR-GPT evaluates automated gene-editing experimental design across 288 test cases. ScholarQABench uses 1,451 biomedical questions from PhD experts and highlights citation hallucination problems. BioLP-bench injects critical mistakes into protocols to test whether models can detect failure-causing errors. BioDiscoveryAgent evaluates iterative experimental design and reports improvement over Bayesian optimization in its studied setting.

So the issue is not benchmark laziness. The issue is benchmark geometry.

Most benchmarks slice research into assessable components because components are easier to score. That is how you get leaderboards, reproducibility, and fast iteration. The trade-off is that a component score can become a proxy for a capability it does not actually represent. Welcome to Goodhart’s Law wearing a lab coat.

The benchmark score is not the co-pilot

The likely reader misconception is simple: if a biomedical model performs well on QA, protocol design, hypothesis generation, or tool-use benchmarks, it must be ready to operate as a research co-pilot.

The paper’s correction is more precise: high component performance is necessary but not sufficient. It can tell us that a system has ingredients. It does not tell us whether the system can cook.

A research co-pilot must carry information across sessions. It must update recommendations when a PI changes the budget. It must know when to ask for experimental context. It must keep track of which interpretation was rejected, which marker panel was accepted, which paper became central, and which validation route was no longer feasible. It must also manage the researcher’s experience: not too vague, not too verbose, not too deferential, not too stubborn. A collaborator who is always agreeable is not a collaborator. It is a very expensive rubber stamp.

The paper frames this as a workflow integration gap. Current benchmarks assess outputs; real research depends on processes. Current benchmarks often reward correct answers; real researchers need correction responsiveness, context maintenance, and constraint propagation. Current benchmarks can grade a protocol; real labs need the system to remember why the cheaper protocol was chosen in the first place.

That is a different evaluation object.

The four missing dimensions are operational, not philosophical

The authors propose a process-oriented framework with four core dimensions: dialogue quality, workflow orchestration, session continuity, and researcher experience. These are not decorative UX categories. They are operational requirements for AI systems that claim to support research work.

Missing dimension	What it asks	Operational failure it catches
Dialogue quality	Does the agent ask clarifying questions, explain reasoning, handle correction, push back when needed, and avoid robotic friction?	The agent gives a plausible answer too early, agrees with flawed assumptions, or defends an error
Workflow orchestration	Do later recommendations reflect earlier decisions, constraints, and goals?	The agent generates a strong protocol that no longer matches the hypothesis or budget
Session continuity	Can the agent resume after hours, days, or weeks while retaining the right context?	The agent forgets the $5,000 constraint, reopens rejected options, or requires full context repetition
Researcher experience	Does the interaction support calibrated trust, low cognitive load, usability, and learning?	The output is correct but exhausting, opaque, overtrusted, or underused
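To make the four dimensions feel less like a slide and more like a scoring sheet, here is a minimal sketch of how an expert grader's ratings might be recorded. The dimension names come from the paper's framework; the 1-5 scale, the equal weighting, and the `WorkflowScore` class are illustrative assumptions, not the authors' rubric.

```python
from dataclasses import dataclass
from statistics import mean

# Dimension names follow the paper's framework. The 1-5 scale and the
# equal weighting below are assumptions made for illustration only.
DIMENSIONS = (
    "dialogue_quality",
    "workflow_orchestration",
    "session_continuity",
    "researcher_experience",
)


@dataclass
class WorkflowScore:
    """One expert's ratings for one multi-session scenario (hypothetical)."""

    ratings: dict[str, int]  # dimension name -> rating on a 1-5 scale

    def __post_init__(self) -> None:
        missing = set(DIMENSIONS) - set(self.ratings)
        if missing:
            raise ValueError(f"missing ratings for: {sorted(missing)}")
        for name, value in self.ratings.items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} rating {value} is outside 1-5")

    def overall(self) -> float:
        """Unweighted mean across the four dimensions."""
        return mean(self.ratings[d] for d in DIMENSIONS)

    def weakest(self) -> str:
        """The dimension with the lowest rating (first one wins on ties)."""
        return min(DIMENSIONS, key=lambda d: self.ratings[d])


score = WorkflowScore(
    {
        "dialogue_quality": 4,
        "workflow_orchestration": 3,
        "session_continuity": 1,
        "researcher_experience": 4,
    }
)
print(score.weakest())  # session_continuity
```

Reporting the weakest dimension alongside the mean is a deliberate choice in this sketch: an average can hide exactly the kind of memory failure the framework is meant to surface.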

This framework has a useful business translation: the evaluation target shifts from “Can the model complete the task?” to “Can the system reduce coordination cost without increasing scientific risk?”

That is the practical question for AI deployment in research organizations. In a biotech or pharma environment, the marginal value of an AI co-pilot does not come only from producing a literature summary faster. It comes from compressing the messy handoffs between literature review, data interpretation, experimental planning, protocol revision, budget negotiation, internal review, and documentation. If the system cannot preserve constraints across those handoffs, its speed becomes less useful. Worse, it may accelerate inconsistency.

The figures are conceptual scaffolding, not hidden experiments

The paper includes several diagrams: an integrated research workflow versus isolated benchmark evaluation, four deployment gaps beyond current benchmark scope, a paradigm gap diagram, the Dr. Martinez timeline, and the proposed process-oriented framework.

These figures should be read as conceptual diagrams. They organize the argument; they do not provide experimental validation. This matters because the article is easy to overread. The authors are not showing that workflow-oriented benchmark scores predict publications, patents, grant success, or new discoveries. They are arguing that current evaluation designs systematically miss capabilities that would plausibly determine deployment success.

That is still valuable. In fact, for product and procurement work, conceptual clarity may matter more than another leaderboard column. Many organizations do not fail at AI adoption because they forgot to ask for a benchmark number. They fail because the benchmark number answered the wrong question.

What this means for biotech, pharma, and lab-AI vendors

The business implication is not “ignore benchmarks.” That would be silly, and silliness already has enough venture funding.

The better implication is to treat component benchmarks as entrance exams, not deployment evidence. A biomedical research AI system should still demonstrate competence in retrieval, reasoning, protocol design, safety checks, and tool use. But before it is trusted as a co-pilot, it should also be evaluated through multi-session workflow scenarios.

For biotech and pharma buyers, that suggests a practical evaluation checklist:

Procurement question	Why it matters	Example test
Can the system propagate constraints?	Research decisions are shaped by budget, equipment, timelines, sample availability, and review requirements	Introduce a budget change in Session 2 and check whether Session 4 recommendations still respect it
Can it resume after realistic gaps?	Research work is episodic, not one continuous chat	Pause for one day or one week, then ask the system to continue without restating everything
Can it handle researcher correction?	Scientific collaboration requires updating beliefs, not defending first drafts	Challenge an initial interpretation and check whether the agent integrates the critique substantively
Can it ask useful clarifying questions?	Premature certainty is dangerous in ambiguous experimental settings	Provide incomplete experimental context and measure whether the agent asks before recommending
Can it calibrate trust?	Users need to know what to verify and what can be accepted as routine	Ask the system to label which claims need wet-lab validation, expert review, or simple documentation
Can it reduce cognitive load?	An accurate but exhausting system will not be adopted	Track backtracking, repeated context entry, unnecessary verbosity, and user-rated workload
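The first row of the checklist, constraint propagation, is the easiest to automate in part. The sketch below assumes a grader has already extracted an estimated cost from each recommendation the agent made; the class names, the helper `budget_violations`, and every dollar figure except the $15,000 and $5,000 budgets are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BudgetConstraint:
    limit_usd: float
    introduced_in: int  # session in which the researcher stated the budget


@dataclass
class Recommendation:
    session: int
    description: str
    estimated_cost_usd: float  # assumed to be extracted by a human grader


def budget_violations(constraints, recommendations):
    """Return (recommendation, constraint) pairs where a recommendation
    exceeds the most recent budget stated in the same or an earlier session."""
    violations = []
    for rec in recommendations:
        active = [c for c in constraints if c.introduced_in <= rec.session]
        if not active:
            continue  # no budget had been stated yet
        current = max(active, key=lambda c: c.introduced_in)
        if rec.estimated_cost_usd > current.limit_usd:
            violations.append((rec, current))
    return violations


# Hypothetical transcript: the budget drops in Session 2, and the agent
# quietly reverts to an expensive plan in Session 4.
constraints = [
    BudgetConstraint(15_000, introduced_in=1),
    BudgetConstraint(5_000, introduced_in=2),
]
recommendations = [
    Recommendation(1, "full panel plus bulk RNA-seq", 14_000),
    Recommendation(3, "reduced panel", 4_800),
    Recommendation(4, "reduced panel plus bulk RNA-seq", 9_000),
]

for rec, limit in budget_violations(constraints, recommendations):
    print(f"Session {rec.session}: {rec.description} exceeds ${limit.limit_usd:,.0f}")
```

A check like this only catches the budget case. Equipment, timelines, and sample availability would each need their own extraction step, and that step is where the human judgment lives.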

For vendors, the implication is sharper. A product demo built around isolated task success will look increasingly inadequate. The stronger demo is a scenario: a researcher starts with data, revises a hypothesis, receives new constraints, returns days later, and produces a coherent proposal or protocol package. The product must show its memory, not just its mouth.

For internal AI teams, the paper also suggests how to design pilots. Do not test the system only with one-off prompts. Create standardized workflow vignettes. Include temporal gaps. Inject constraints. Ask domain experts to grade not only outputs but also dialogue behavior and consistency. Track where the agent forgets, over-agrees, contradicts itself, or requires expensive human repair.
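Tracking where the agent fails can start as a plain tally of grader annotations. The following is a minimal sketch, assuming graders tag each observed failure with a session number and one of four labels; the tag names mirror the failure types listed above but are otherwise an invented vocabulary.

```python
from collections import Counter

# Hypothetical tag vocabulary mirroring the failure types named in the text.
FAILURE_TAGS = (
    "forgot_context",
    "over_agreed",
    "self_contradiction",
    "needed_human_repair",
)


def tally_failures(annotations):
    """Count failures per tag and per session.

    `annotations` is an iterable of (session_number, tag) pairs produced
    by domain-expert graders reviewing a pilot transcript.
    """
    by_tag = Counter()
    by_session = Counter()
    for session, tag in annotations:
        if tag not in FAILURE_TAGS:
            raise ValueError(f"unknown failure tag: {tag!r}")
        by_tag[tag] += 1
        by_session[session] += 1
    return by_tag, by_session


# Invented example annotations from one workflow vignette.
annotations = [
    (2, "over_agreed"),
    (4, "forgot_context"),
    (4, "self_contradiction"),
    (4, "forgot_context"),
]
by_tag, by_session = tally_failures(annotations)
print(by_tag.most_common(1))      # [('forgot_context', 2)]
print(by_session.most_common(1))  # [(4, 3)]
```

Counting by session as well as by tag shows whether failures cluster after temporal gaps, which is the pattern a one-off prompt test cannot reveal.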

In business terms, workflow benchmarking estimates the cost of supervision. Component scores estimate the quality of isolated outputs. Both matter, but they are not the same cost center.

The ROI logic is supervision cost, not magic discovery

There is an obvious temptation to turn this paper into a grand claim about AI accelerating biomedical discovery. The paper does not prove that. Its contribution is more grounded and therefore more useful.

A workflow-aware co-pilot could create value through four mechanisms:

Lower context-reconstruction cost. Researchers spend less time re-explaining project history, constraints, and prior decisions.
Lower inconsistency cost. Later outputs are less likely to conflict with earlier assumptions, budgets, or experimental choices.
Lower review burden. Better trust calibration helps researchers focus verification on uncertain or high-risk claims.
Higher process throughput. Data interpretation, literature review, protocol revision, and proposal drafting become a more continuous workflow.

Notice what is missing: guaranteed better science. The paper does not show that workflow-integrated agents produce more valid hypotheses, better experiments, or more successful grants. It shows that current benchmarks cannot adequately tell us whether they would.

That boundary should be preserved. For investors and executives, the near-term business value is not “AI discovers the cure.” It is “AI reduces the friction in research coordination without silently breaking the research logic.” Less cinematic, more useful.

The proposed framework is expensive because reality is expensive

The authors are clear about implementation challenges. Workflow evaluation is harder than component benchmarking for three reasons.

First, ground truth is ambiguous. In a real research workflow, there may be multiple scientifically reasonable paths. One researcher might prioritize mechanism; another might prioritize feasibility. One lab may have access to a sequencing platform; another may rely on flow cytometry. A benchmark must distinguish legitimate methodological diversity from actual failure. “Different” is not automatically wrong. “Forgot the budget constraint” is wrong.

Second, expert evaluation is costly. The paper estimates that a single multi-session scenario may require 4–6 hours of expert time for design, simulation, and evaluation, compared with seconds for automated component metrics. That makes workflow benchmarking expensive. It also makes it meaningful. Many important things refuse to be cheap.

Third, conversational systems are stochastic. The same scenario may unfold differently across runs because of researcher phrasing and model sampling. A serious benchmark would need multiple samples, documented settings, and metrics robust enough to tolerate superficial variation while detecting substantive contradictions.
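The stochasticity point has a simple statistical consequence: one run of a scenario is an anecdote. The sketch below estimates how often a substantive contradiction appears across repeated runs. It assumes some function can replay a scenario under a given seed and report whether a contradiction occurred; `stub_run` is a deterministic placeholder standing in for that function, not a real agent.

```python
import math
from typing import Callable, Iterable


def contradiction_rate(
    run_scenario: Callable[[int], bool], seeds: Iterable[int]
) -> tuple[float, float]:
    """Run the scenario once per seed and return (rate, standard_error).

    `run_scenario(seed)` should return True when the run contains a
    substantive contradiction, such as ignoring a stated budget.
    """
    outcomes = [bool(run_scenario(seed)) for seed in seeds]
    if not outcomes:
        raise ValueError("need at least one run")
    n = len(outcomes)
    rate = sum(outcomes) / n
    # Binomial standard error; a rough guide, and crude for small n.
    std_err = math.sqrt(rate * (1 - rate) / n)
    return rate, std_err


def stub_run(seed: int) -> bool:
    # Placeholder for a real scenario replay: pretends every fourth
    # run contains a contradiction so the example is reproducible.
    return seed % 4 == 0


rate, std_err = contradiction_rate(stub_run, range(20))
print(f"contradiction rate: {rate:.2f} (standard error {std_err:.3f})")
```

Recording the seeds and sampling settings next to the result is what makes a rate like this comparable across systems.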

The paper’s practical answer is a tiered evaluation architecture:

Tier	Evaluation type	Purpose
Tier 1	Automated screening: turn counts, backtracking frequency, goal achievement proxies	Cheaply filter obviously weak systems
Tier 2	Expert assessment of workflow coherence and scientific quality	Evaluate promising systems where human judgment matters
Tier 3	Real-researcher longitudinal studies	Validate whether benchmark performance predicts adoption and deployment value
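The Tier 1 filter can be sketched in a few lines. The metric names follow the table above (turn counts, backtracking frequency, a goal-achievement proxy), but the thresholds, the `Tier1Metrics` class, and the example systems are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tier1Metrics:
    system: str
    turns_to_goal: int   # total conversational turns in the scenario
    backtracks: int      # turns where the user had to undo or restate
    goal_reached: bool   # automated proxy for goal achievement


def passes_tier1(
    m: Tier1Metrics, max_turns: int = 40, max_backtrack_rate: float = 0.25
) -> bool:
    """Cheap automated screen; both thresholds are illustrative assumptions."""
    if not m.goal_reached or m.turns_to_goal <= 0:
        return False
    if m.turns_to_goal > max_turns:
        return False
    return m.backtracks / m.turns_to_goal <= max_backtrack_rate


def shortlist_for_expert_review(metrics: list[Tier1Metrics]) -> list[str]:
    """Systems that survive Tier 1 and are worth Tier 2 expert time."""
    return [m.system for m in metrics if passes_tier1(m)]


# Invented systems and numbers.
candidates = [
    Tier1Metrics("system_a", turns_to_goal=20, backtracks=2, goal_reached=True),
    Tier1Metrics("system_b", turns_to_goal=30, backtracks=12, goal_reached=True),
    Tier1Metrics("system_c", turns_to_goal=18, backtracks=1, goal_reached=False),
    Tier1Metrics("system_d", turns_to_goal=55, backtracks=3, goal_reached=True),
]
print(shortlist_for_expert_review(candidates))  # ['system_a']
```

The point of the cheap tier is to avoid spending expert hours on systems that fail for obvious reasons, not to rank the survivors.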

That funnel is sensible. It also prevents the framework from becoming another beautiful academic instrument that nobody can afford to play.

What the paper directly shows, and what Cognaptus infers

The business reading needs clean separation.

Layer	Claim	Confidence boundary
What the paper directly shows	Existing preclinical biomedical AI benchmarks reviewed in the paper evaluate isolated component capabilities, not full multi-session research collaboration	Supported by the rapid review of 14 benchmarks
What the paper proposes	A four-dimensional process-oriented framework: dialogue quality, workflow orchestration, session continuity, and researcher experience	Conceptual framework grounded in review findings and related evaluation traditions
What Cognaptus infers	Enterprise AI buyers should add workflow tests before treating biomedical AI systems as research co-pilots	Reasonable operational inference, not directly validated by deployment studies
What remains uncertain	Whether scores on workflow-oriented benchmarks predict publications, discoveries, grant success, ROI, or long-term adoption	The paper explicitly notes that empirical validation is still needed

This distinction is important because AI evaluation often suffers from measurement inflation. A benchmark becomes a score; the score becomes a capability; the capability becomes a product claim; the product claim becomes a procurement slide. Somewhere along the way, the original measurement quietly disappears. Very convenient. Very dangerous.

The deeper lesson: evaluate the handoff, not only the answer

The paper’s most transferable insight extends beyond biomedical research. Any AI system that claims to support complex professional work must be judged at the handoff points.

In legal work, the handoff is from case intake to document review to argument drafting. In finance, it is from data ingestion to scenario analysis to risk memo to portfolio action. In operations, it is from diagnosis to recommendation to implementation monitoring. In biomedical research, it is from literature to data to hypothesis to experiment to proposal to iteration.

The failure mode is the same: each module looks good alone, while the workflow quietly loses context between modules. The AI can summarize, classify, draft, and recommend. Then it forgets why the recommendation was constrained in the first place.

That is why the Dr. Martinez case is a better opening than a leaderboard. It reminds us that the unit of productivity is not the answer. It is the continuity of useful work.

Boundary: this is a framework paper, not a deployment verdict

The limitations are not minor footnotes; they define how the paper should be used.

The review is limited to preclinical biomedical research and excludes clinical and translational applications. It uses rapid review methods, so the search is intentionally streamlined rather than exhaustive. Gray literature coverage may miss unpublished industry evaluations or non-English work. Many recent benchmarks lack longitudinal adoption evidence. Most importantly, the proposed workflow framework has not yet been empirically validated across systems, laboratories, or deployment contexts.

So the right conclusion is not that current research co-pilots fail in all real settings. The right conclusion is that current benchmark evidence is insufficient to prove they will succeed in real settings. That is a narrower claim, and a stronger one.

The paper is useful because it gives organizations a better diagnostic lens. Before asking whether a model is “good at biomedical research,” ask which part of research it was tested on. Was it a single answer, a single protocol, a single hypothesis, or a multi-session collaboration with memory, correction, constraints, and user trust?

If the answer is only the first three, then the system may still be valuable. But do not call it a co-pilot yet. At best, it is a talented passenger with a very confident voice.

Conclusion: from leaderboard competence to collaborative reliability

The biomedical AI field has made real progress in evaluating literature understanding, protocol design, safety checks, hypothesis generation, and tool use. The reviewed benchmarks are not empty exercises. They have pushed the field beyond toy tasks and into harder scientific capabilities.

But the next bottleneck is workflow reliability. A research co-pilot must not only answer correctly. It must remember, revise, negotiate, resume, and help the researcher maintain calibrated trust across a project that unfolds over time. That is where current benchmarks are thin.

For businesses building or buying AI research systems, the message is practical: do not retire component benchmarks, but stop pretending they are deployment tests. Add multi-session scenarios. Add constraint propagation. Add correction responsiveness. Add memory decay probes. Add researcher experience measurement. The goal is not a prettier leaderboard. The goal is to measure whether the system can keep the work coherent after the demo ends.

Because in the wild, science does not happen in one prompt. It happens after the budget changes, the PI disagrees, the data gets weird, and the researcher comes back four days later expecting the co-pilot to remember what everyone else already forgot.

Cognaptus: Automate the Present, Incubate the Future.

¹ Lukas Weidener, Marko Brkić, Chiara Baccin, Mihailo Jovanović, Emre Ulgac, Alex Dobrin, Johannes Weniger, Martin Vlas, Ritvik Singh, and Aakaash Meduri, “From Task Executors to Research Partners: Evaluating AI Co-Pilots Through Workflow Integration in Biomedical Research,” arXiv:2512.04854, 2025, https://arxiv.org/abs/2512.04854.
