IRB, API, and a PI: When Agents Run the Lab

TL;DR for operators

Lab work is mostly not white coats and dramatic discoveries. It is protocol design, ethics paperwork, recruitment settings, data cleaning, model diagnostics, figure formatting, reference checking, and the slow discovery that your beautiful hypothesis has politely declined to exist.

That is what makes this paper interesting. Virtuous Machines: Towards Artificial General Science presents an agentic AI system that did not merely write a speculative research proposal. It designed and executed an online human-participant experiment, collected data through Prolific and Pavlovia, analysed the results, produced figures and tables, wrote manuscripts, and ran peer-style review over the outputs.¹

The operational result is impressive but narrower than the inevitable headline version. The system worked within a controlled cognitive psychology setting, using pre-approved tasks and a manually launched participant study. It generated three research manuscripts from one 288-participant data collection, not three fully independent field campaigns. Manual oversight remained at important points, especially around ethics, launch, monitoring, and final figure selection.

For research-heavy organisations, the useful interpretation is straightforward: this is an early template for research operations automation. The near-term value is faster literature synthesis, protocol drafting, power analysis, data cleaning, statistical modelling, figure generation, manuscript assembly, and audit documentation. The less useful interpretation is “AI replaces scientists.” That is the sort of claim that sounds exciting right up until someone asks who is responsible for the false premise embedded on page two.

The paper’s strongest signal is workflow compression. The system reportedly completed full studies in an average of 17 hours of runtime, used an average of 32.5 million tokens per study, examined 1,000–3,000 publications per literature review, and sustained autonomous data-analysis sessions averaging 8 hours 32 minutes. Its weakest signal is conceptual authority. Human expert review found solid writing, rigorous analysis, and strong literature integration, but also theoretical misrepresentations, unsupported interpretive claims, statistical omissions, internal contradictions, and visual/reporting defects.

So the practical lesson is not that the machine is now the principal investigator. It is that the research back office is becoming executable.

The lab was real, but the world was carefully fenced

The easiest way to misunderstand this paper is to picture an AI scientist wandering freely through the garden of knowledge, choosing a mystery, running a lab, and producing truth.

The actual picture is more interesting and less cinematic. The authors built a domain-agnostic agentic system and validated it in cognitive psychology because that field has remote-experiment infrastructure, established task paradigms, and ethically manageable online participation. The system was given access to a Visual Working Memory task, a Mental Rotation Task, and the Vividness of Visual Imagery Questionnaire-2. These were delivered through an HTML/JavaScript experiment hosted on Pavlovia, connected to Prolific for participant recruitment and to SurveyMonkey for questionnaire components.

That framing matters. The “real world” here was not a wet lab, a clinic, a warehouse, or a robotics bench. It was an online behavioural experiment with de-identified participants and predefined task machinery. That still counts as empirical experimentation. It just does not count as unlimited scientific agency. Good fences make good neighbours; in this case, they also make autonomous science less terrifying.

The system generated three lines of inquiry. Study 1 examined whether visual working memory precision and mental rotation performance share representational constraints. Study 2 asked whether imagery vividness shapes serial dependence effects in visual memory and mental rotation. Study 3 investigated whether visual working memory precision predicts broader spatial reasoning. The psychological findings were mostly null or negligible: no clear shared constraint between visual working memory and mental rotation, no greater carryover effects among people with stronger imagery, and negligible links between visual memory precision and spatial reasoning.

Those results are not the paper’s main intellectual payload. The main evidence is that a coordinated AI pipeline could move through the sequence of scientific work and produce complete, reviewable research artefacts. The psychology findings matter because they show the system was not merely producing decorative PDFs. It had to handle noisy participant data, exclusions, reliability issues, mixed-effects models, multiple comparisons, and the deeply glamorous business of deciding whether a small effect is meaningful or merely wearing a statistical costume.

The invention is a workflow engine, not a single genius model

The system is best understood as a research operating system built from agents. A master orchestrator coordinates specialised agents responsible for ideation, methods, implementation, data analysis, re-evaluation, visualisation, manuscript writing, review, and document construction. Under those sit further specialists: archivist agents, coding agents, troubleshooting agents, review agents, caption agents, inspection agents, power-analysis agents, and others.

This hierarchy is not ornamental. It solves a practical problem: scientific work is too long, branching, and failure-prone for one model invocation to handle reliably. A statistical model fails to converge. A literature claim needs checking. A chart looks wrong. A power analysis depends on effect-size assumptions. A citation may be fabricated. A manuscript section contradicts an earlier design choice. One prompt cannot responsibly swallow all of that and call itself science.

The paper’s architecture tries to manage this by decomposing scientific labour into modules and then adding feedback loops. Agents propose outputs, other agents review them, coding agents execute scripts, troubleshooting agents inspect failures, archivist agents retrieve literature, and document agents assemble the final product. The system also uses dynamic retrieval-augmented generation, or d-RAG, to build evolving knowledge repositories around specific research directions rather than relying only on a model’s static memory.
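The propose-and-review loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation, which is proprietary: `ReviewLoop`, `stub_proposer`, and `stub_reviewer` are hypothetical names, and the stubs stand in for what would be model calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ReviewLoop:
    """Hypothetical propose/review cycle with a bounded number of rounds."""
    propose: Callable[[str, List[str]], str]    # (task, feedback so far) -> draft
    review: Callable[[str], Tuple[bool, str]]   # draft -> (accepted, reviewer note)
    max_rounds: int = 3
    log: List[dict] = field(default_factory=list)

    def run(self, task: str) -> str:
        feedback: List[str] = []
        for round_no in range(1, self.max_rounds + 1):
            draft = self.propose(task, feedback)
            accepted, note = self.review(draft)
            # Keep the decision path, not only the final output.
            self.log.append({"round": round_no, "draft": draft,
                             "accepted": accepted, "note": note})
            if accepted:
                return draft
            feedback.append(note)
        raise RuntimeError(f"no accepted draft after {self.max_rounds} rounds; escalate to a human")


def stub_proposer(task: str, feedback: List[str]) -> str:
    # Stand-in for a model call: adds a power analysis only once a reviewer asks for it.
    plan = f"plan for {task}"
    if any("power analysis" in note for note in feedback):
        plan += " + power analysis"
    return plan


def stub_reviewer(draft: str) -> Tuple[bool, str]:
    # Stand-in for a review agent with a single acceptance criterion.
    if "power analysis" in draft:
        return True, "ok"
    return False, "missing power analysis"


loop = ReviewLoop(stub_proposer, stub_reviewer)
result = loop.run("serial dependence study")
print(result)
print(len(loop.log), "rounds logged")
```

The design choice worth noticing is the round limit: a loop that cannot converge raises and hands the problem to a person rather than polishing the same draft indefinitely.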

The authors describe four “human-inspired cognitive operators”: abstraction, metacognition, decomposition, and autonomy. Strip away the philosophical language and the business interpretation is practical:

Operator	What it does in the system	Operational value	Where it can fail
Abstraction	Helps agents form broader research ideas and heuristics rather than only following fixed instructions	Expands the search space for hypotheses	Can generate elegant but poorly grounded theory
Metacognition	Adds self-review and agent-as-judge evaluation over reasoning traces	Improves quality control and documentation	May ratify an early false premise instead of correcting it
Decomposition	Splits research into smaller tasks handled by specialised agents	Makes long workflows executable and inspectable	Interfaces between tasks can become failure points
Autonomy	Allows orchestrators to initiate subagents, replan, and stop based on validation checks	Reduces constant human supervision	Needs strong boundaries around tools, cost, ethics, and authority

This is why the mechanism-first reading matters. The system is not impressive because it wrote three papers. Plenty of language models can write things that resemble papers. Some of them can do so while confidently inventing half the bibliography, which is less “scientific discovery” and more “academic fan fiction with LaTeX.”

The more serious contribution is that the authors connected writing to upstream scientific operations: novelty checks, feasibility assessment, pre-registration-style protocol design, power analysis, recruitment configuration, data analysis, visualisation, and peer-style review. The manuscript is the final artefact. The pipeline is the product.

The implementation detail is where the risk lives

The paper’s figures and appendices should not be read as independent proof that the system understands science. They serve different evidentiary roles.

Paper component	Likely purpose	What it supports	What it does not prove
Architecture diagram	Implementation detail	The system is modular, hierarchical, and agent-based	That each agent’s reasoning is scientifically valid
Cognitive-operator framework	Mechanism explanation	The system has explicit control structures for long workflows	That those operators match human cognition in any deep sense
Ideation workflow	Implementation detail	Hypothesis generation involved novelty, feasibility, and review stages	That the selected hypotheses were intrinsically important
Generated manuscripts	Main evidence	The pipeline produced complete research artefacts	That the manuscripts are publication-ready or theory-reliable
Expert evaluation	Main quality assessment	Human scientists found both strengths and weaknesses	That the system passed formal peer review
The three psychology studies	Demonstration evidence	The system could analyse real participant data	That autonomous science generalises beyond this domain
Safety mechanisms	Implementation and governance detail	The authors considered runtime, package, response, and audit risks	That the system is safe under broader tool access

That distinction matters for buyers, funders, and research leaders. A generated manuscript is not equivalent to a validated discovery. A clean pipeline diagram is not equivalent to operational resilience. A peer-style review is not the same as peer review. And a domain-agnostic architecture demonstrated in one domain is not automatically domain-general deployment.

Still, the implementation details are unusually relevant. The system includes timeout ceilings for autonomous code execution, storage limits, package verification against security criteria, isolated virtual environments, LLM-response safety checks, API rate limits, logging, and checkpointing. Those are not academic garnish. They are the difference between “agentic research assistant” and “expensive chaos goblin with database credentials.”
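A timeout ceiling is the simplest of these controls to illustrate. The sketch below uses Python's standard `subprocess` module to run a snippet in a child process and stop it when it exceeds a time budget. The function name `run_with_ceiling` and the specific limits are assumptions for illustration; the paper does not publish its mechanism, and a timeout alone is not a sandbox.

```python
import subprocess
import sys


def run_with_ceiling(code: str, timeout_s: float) -> dict:
    """Run a Python snippet in a child process; stop it if it exceeds timeout_s seconds."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "returncode": None}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout.strip(), "returncode": proc.returncode}


fast = run_with_ceiling("print(2 + 2)", timeout_s=10)
slow = run_with_ceiling("import time; time.sleep(5)", timeout_s=0.5)
broken = run_with_ceiling("raise SystemExit(3)", timeout_s=10)

print(fast["status"], slow["status"], broken["status"])
```

In a governed deployment this would sit alongside the other controls the paper lists, such as isolated environments, storage limits, and package verification, none of which this sketch attempts.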

The citation-validation step is especially important. The manuscript agents use DOI-based checks to reduce fabricated references. This is a small detail with large implications. In research automation, hallucinated citations are not merely embarrassing. They are supply-chain contamination. Once a fake or misrepresented citation enters a report, later agents may build on it, managers may repeat it, and decisions may quietly inherit its rot.
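A DOI-based check can be sketched as a two-step screen: reject strings that do not look like a DOI, then ask a resolver whether the DOI exists. Everything here is an assumption for illustration: the regular expression is a loose syntactic screen rather than the official DOI grammar, `check_references` is a hypothetical helper, and the resolver is a stub standing in for a registry lookup. The reference keys `ghost2024` and `smith2020` and the DOI under prefix `10.9999` are made up.

```python
import re
from typing import Callable, Dict, List

# Loose syntactic screen (an assumption, not the official DOI grammar):
# "10.", a 4-9 digit registrant code, a slash, then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def check_references(refs: List[Dict[str, str]],
                     resolves: Callable[[str], bool]) -> Dict[str, List[str]]:
    """Sort reference keys into verified, malformed, and unresolved buckets."""
    report: Dict[str, List[str]] = {"verified": [], "malformed": [], "unresolved": []}
    for ref in refs:
        doi = ref.get("doi", "").strip()
        if not DOI_PATTERN.match(doi):
            report["malformed"].append(ref["key"])
        elif not resolves(doi):
            report["unresolved"].append(ref["key"])
        else:
            report["verified"].append(ref["key"])
    return report


# Stub resolver: a fixed set standing in for a real registry lookup.
KNOWN_DOIS = {"10.48550/arXiv.2508.13421"}

refs = [
    {"key": "wehr2025", "doi": "10.48550/arXiv.2508.13421"},
    {"key": "ghost2024", "doi": "10.9999/made-up.2024.001"},  # well-formed but invented
    {"key": "smith2020", "doi": "doi:unknown"},               # not a DOI at all
]

report = check_references(refs, resolves=lambda doi: doi in KNOWN_DOIS)
print(report)
```

Note what the check cannot do. A reference that resolves can still be misquoted, which is why existence checks reduce fabrication without touching misrepresentation.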

The numbers show workflow compression, not free science

The paper reports several operating metrics that make the system legible as infrastructure rather than demo theatre.

The system completed studies in an average of 17 hours of runtime, excluding data collection. Average marginal cost was reported as $114 per research project, excluding $4,500 in human participant payments for the experiment. Each study used an average of 32.5 million tokens. More than 50 agents contributed across stages. Literature review examined between 1,000 and 3,000 scientific publications per study. Data analysis handled 279 heterogeneous raw CSV files and produced 14–23 derived CSV files. The data-analysis agents sustained autonomous sessions averaging 8 hours 32 minutes, generated a mean of 7,696 lines of code per study, and navigated an average of 72 action-observation cycles. Each manuscript ran 7,000–8,000 words with 40–50 verified references.

That is not “AI did my homework.” That is a small research operations department, temporarily compressed into compute.

Reported measure	What it means operationally	Boundary
17 hours average runtime per study	The pipeline can compress multi-stage research production into a short execution window	Excludes participant recruitment time and assumes suitable task infrastructure
$114 average marginal system cost	Compute cost can be low relative to researcher labour	Excludes participant payments, platform costs, governance, setup, and human review
32.5M tokens per study	The workflow is token-intensive and multi-model	Costs and performance depend on model pricing and availability
8h 32m average autonomous analysis sessions	Long coding/statistical workflows can run without constant intervention	Works only if failures remain within tool and troubleshooting capacity
1,000–3,000 publications examined	Literature search can scale beyond ordinary human reading bandwidth	Retrieval volume does not guarantee conceptual judgement
40–50 verified references per manuscript	Citation validation reduces a known LLM failure mode	Verification of existence is not verification of correct interpretation

The temptation is to compare the $114 figure against a human researcher’s salary and declare victory. That would be tidy, dramatic, and wrong in the traditional way.

The relevant cost comparison is not “AI versus scientist.” It is “automated research workflow plus human oversight versus current research workflow plus human bottlenecks.” The system still needed approved experimental infrastructure, ethics approval, manual launch, monitoring, domain expertise, and post hoc judgement. The reported cost is marginal system execution, not full institutional cost.
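The arithmetic behind that distinction is short. The sketch below uses only the figures reported above and assumes, as this article describes, that the $4,500 in participant payments covered the single data collection shared by all three manuscripts. It leaves out platform fees, setup, governance, and human review time, for which no figures are given.

```python
# Reported figures (USD unless noted).
MARGINAL_COST_PER_STUDY = 114.0   # average system execution cost per study
PARTICIPANT_PAYMENTS = 4500.0     # one shared participant data collection
TOKENS_PER_STUDY_M = 32.5         # millions of tokens per study
N_MANUSCRIPTS = 3

compute_total = MARGINAL_COST_PER_STUDY * N_MANUSCRIPTS
reported_total = compute_total + PARTICIPANT_PAYMENTS
per_manuscript = reported_total / N_MANUSCRIPTS
participant_share = PARTICIPANT_PAYMENTS / reported_total
cost_per_m_tokens = MARGINAL_COST_PER_STUDY / TOKENS_PER_STUDY_M

print(f"compute for three studies: ${compute_total:,.0f}")          # $342
print(f"compute plus participants: ${reported_total:,.0f}")         # $4,842
print(f"per manuscript:            ${per_manuscript:,.0f}")         # $1,614
print(f"participant share:         {participant_share:.1%}")        # 92.9%
print(f"implied cost per 1M tokens: ${cost_per_m_tokens:.2f}")      # $3.51
```

Even on these two line items alone, compute is the small part of the bill: participant payments account for roughly 93% of the reported spend.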

The better economic reading is that AI may reduce the cost of producing a first-pass research artefact. It may make null studies cheaper to run and document. It may reduce the labour burden of literature synthesis, pre-analysis planning, data cleaning, and report assembly. It may also increase the volume of low-quality research output if governance is weak. Naturally, the universe never gives us productivity without paperwork.

The three studies are less important than their failure modes

The empirical findings are useful, but not because they rewrite cognitive psychology.

Study 1 found no correlation between individual performance patterns in visual working memory precision and mental rotation performance, despite expected task difficulty effects. The interpretation invoked the reliability paradox: tasks can show robust group-level effects while being poor measures of individual differences.
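One way to see why unreliable individual-level measures hide real relationships is Spearman's classic attenuation formula, under which the observed correlation equals the true correlation multiplied by the square root of the product of the two measures' reliabilities. The sketch below is illustrative only: the true correlation and the reliability values are hypothetical, not figures from the paper.

```python
import math


def attenuated_r(true_r: float, rel_x: float, rel_y: float) -> float:
    """Expected observed correlation given the true correlation and each measure's reliability."""
    for rel in (rel_x, rel_y):
        if not 0.0 < rel <= 1.0:
            raise ValueError("reliabilities must lie in (0, 1]")
    return true_r * math.sqrt(rel_x * rel_y)


# Hypothetical values: the same underlying relationship, measured well and measured poorly.
well_measured = attenuated_r(true_r=0.5, rel_x=0.9, rel_y=0.9)
poorly_measured = attenuated_r(true_r=0.5, rel_x=0.4, rel_y=0.5)

print(round(well_measured, 3))    # 0.45
print(round(poorly_measured, 3))  # 0.224
```

With low reliabilities, a moderate true relationship shrinks by more than half before sampling noise is even considered, so a null individual-differences result can coexist with robust group-level task effects.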

Study 2 found that individuals with stronger imagery vividness did not show greater carryover effects between trials. That challenges a simple version of the idea that imagery and perception share mechanisms in a way that should intensify serial dependence.

Study 3 found negligible relationships between visual working memory precision and broader spatial reasoning abilities, suggesting that apparent links among visual-spatial tasks may reflect general cognitive factors rather than specific shared mechanisms.

For operators, the important point is not whether these psychology conclusions survive future replication. It is that the system handled null-heavy empirical work without simply forcing a breakthrough narrative. Expert reviewers even noted that, when faced with statistically significant findings but small effect sizes, the system sometimes prioritised practical significance over p-value theatre. Humanity may recover from that insult eventually.
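The gap between statistical and practical significance is easy to show with numbers. The sketch below takes the 288-participant sample size mentioned earlier and a hypothetical correlation of 0.12, which is not a result from the paper, and computes the standard t statistic for a Pearson correlation alongside the share of variance explained. The two-tailed 5% critical value of roughly 1.97 for 286 degrees of freedom is an approximation.

```python
import math

N = 288                 # participants in the data collection, before any exclusions
R = 0.12                # hypothetical correlation, chosen for illustration
T_CRITICAL = 1.97       # approximate two-tailed 5% critical value for df = 286

t_stat = R * math.sqrt((N - 2) / (1 - R ** 2))
variance_explained = R ** 2

print(f"t = {t_stat:.2f}, significant at 5%: {t_stat > T_CRITICAL}")
print(f"variance explained: {variance_explained:.2%}")
```

The correlation clears the significance threshold while explaining under 2% of the variance, which is the kind of result that rewards a reviewer, human or otherwise, for reading the effect size before the p-value.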

Yet the same review found that the system overreached elsewhere. It misrepresented theoretical frameworks, made unsupported interpretive claims, omitted discussion of some statistical relationships, treated collinear parameters problematically, introduced unnecessary multiple correlations, and produced visualisation and formatting defects. This is the system’s real profile: statistically capable, procedurally disciplined, rhetorically fluent, but not consistently wise.

That profile should sound familiar. It resembles a competent junior research team with excellent stamina, strong templates, broad literature access, and insufficient tacit judgement. Useful? Absolutely. Ready to run unobserved into sensitive scientific territory? Let’s not hand it the keys and call it tenure.

The business value is research throughput with audit trails

The paper’s commercial relevance is not limited to universities. Any organisation that produces evidence can see the outline of a new operating model.

Pharmaceutical teams run literature reviews, protocol drafts, statistical analysis plans, post-market studies, and internal evidence dossiers. Financial firms test investment hypotheses, alternative data signals, stress scenarios, and model validations. Consumer companies run behavioural studies, pricing tests, UX experiments, and market research. Policy teams process evidence bases and intervention evaluations. Legal and compliance groups assemble research-backed arguments. Strategy teams generate reports whose citations sometimes deserve more supervision than the intern who pasted them.

The common workflow is not “discover truth.” It is: form a question, retrieve relevant prior work, design a test, collect or connect data, analyse, document assumptions, create artefacts, review, and decide. That is precisely the workflow this paper begins to mechanise.

Business function	Near-term use	What humans should retain
R&D and product research	Draft protocols, run literature reviews, prepare analysis plans, generate first-pass reports	Problem framing, feasibility, ethics, interpretation
Market and user research	Automate survey/task design, data cleaning, segmentation analysis, report generation	Sampling judgement, customer context, decision relevance
Finance and risk	Test hypotheses, document model validation, generate reproducible analysis packs	Incentive alignment, risk appetite, regulatory sign-off
Healthcare and life sciences	Build evidence maps, draft study designs, analyse approved datasets	Clinical judgement, patient safety, compliance
Corporate strategy	Produce structured evidence briefs and scenario analyses	Strategic prioritisation and accountability

Cognaptus’s inference is therefore modest but useful: the first durable market is not autonomous science; it is governed research automation. The system becomes valuable when it is wired into approved data sources, reproducible analysis environments, review checkpoints, and organisational accountability. Less glamorous than a robot Nobel laureate, admittedly. Also more likely to survive procurement.

Human oversight belongs at the first premise

The paper’s most important limitation is not that the figures sometimes needed human cleanup. That is annoying, not existential.

The deeper problem is early-stage anchoring. The authors note that poor question formulation or conceptual errors introduced during hypothesis generation and method design can propagate downstream. Once a false premise enters the chain, later verification steps may fail to dislodge it. The system can then build a perfectly polished research object around a conceptual mistake.

This is where human expertise has maximum leverage. Reviewing a final manuscript is useful, but late. By that point, the question, method, analysis structure, and interpretive frame may already have congealed. A human expert can correct a bad figure in minutes. Correcting a bad premise after 17 hours of autonomous elaboration is less charming.

For organisations, this suggests a governance pattern:

Human review should be mandatory at problem formulation.
Ethics and compliance gates should sit before data collection or tool execution.
Statistical plans should be checked before analysis agents begin exploring.
Interpretation should be reviewed by domain experts before publication or decision use.
Audit logs should preserve not only outputs, but decision paths.
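The pattern above can be expressed as a pipeline in which certain stages cannot start without a named sign-off, and every gate decision is written to an audit log. This is a minimal sketch under stated assumptions: the stage names, gate names, and approver addresses are hypothetical, and a real deployment would attach evidence and timestamps to each approval.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    gate: Optional[str] = None  # sign-off required before this stage may start


# Hypothetical stage and gate names mirroring the governance list above.
STAGES = [
    Stage("method_design", gate="problem_formulation_review"),
    Stage("data_collection", gate="ethics_and_compliance"),
    Stage("analysis", gate="statistical_plan_review"),
    Stage("manuscript_assembly"),
    Stage("release", gate="domain_expert_interpretation_review"),
]


def run_pipeline(stages: List[Stage], approvals: Dict[str, str], audit_log: List[dict]) -> str:
    """Run stages in order; return the name of the first blocked stage, or 'complete'."""
    for stage in stages:
        if stage.gate is not None:
            approver = approvals.get(stage.gate)
            if approver is None:
                audit_log.append({"stage": stage.name, "gate": stage.gate, "event": "blocked"})
                return stage.name
            audit_log.append({"stage": stage.name, "gate": stage.gate,
                              "event": "approved", "by": approver})
        audit_log.append({"stage": stage.name, "event": "ran"})
    return "complete"


log: List[dict] = []
partial = {
    "problem_formulation_review": "pi@example.org",
    "ethics_and_compliance": "irb@example.org",
}
blocked_at = run_pipeline(STAGES, partial, log)
print(blocked_at)           # analysis
print(log[-1]["event"])     # blocked

full = dict(partial,
            statistical_plan_review="stats@example.org",
            domain_expert_interpretation_review="expert@example.org")
full_log: List[dict] = []
print(run_pipeline(STAGES, full, full_log))  # complete
```

The ungated stage, manuscript assembly, is where autonomy is cheap. The gated ones are where an early mistake would otherwise compound.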

This is not a case against autonomy. It is a case for placing autonomy where mistakes are cheap and review where mistakes compound.

What this paper does not prove

The paper is careful in places, but the surrounding market will not be. So the boundaries deserve to be stated cleanly.

First, the system did not demonstrate open-ended scientific discovery across arbitrary domains. It demonstrated a sophisticated agentic pipeline in online cognitive psychology, using approved tasks and existing digital infrastructure.

Second, the three manuscripts came from one new participant data collection. Studies 2 and 3 developed independent hypotheses and analyses, but they reused the Study 1 dataset. That is legitimate for demonstration purposes, but it is not equivalent to three separately collected experimental programmes.

Third, human oversight remained material. The experiment was manually launched. Two experimenters monitored participant communications during collection. Final figure variants were selected post hoc. The system operated autonomously across many downstream stages, but not in a vacuum.

Fourth, the workflows and algorithms are proprietary to Explore Science, and the authors affiliated with Explore Science are its employees. That does not invalidate the work. It does mean external reproducibility of the complete system is limited.

Fifth, there is no conventional ablation showing which architectural component mattered most. We do not know how performance would change without mixture-of-agents, without d-RAG, without the peer-style review stage, or with weaker models. The paper is a system demonstration, not a component-isolation study.

Sixth, expert review found both competence and fragility. The system was strong in writing, statistics, screening, literature integration, and methodological structure. It was weaker in theoretical nuance, subtle conceptual distinctions, complete statistical reporting, and presentation quality. That is not a footnote. It is the deployment manual.

From AI scientist to executable research organisation

“Artificial General Science” is a large phrase. The paper uses it to describe systems capable of independently driving scientific inquiry across domains: generating hypotheses, orchestrating experiments, and refining knowledge through evidence. As an aspiration, it is coherent. As a procurement category, it is premature.

The nearer future is more concrete. Research workflows will become more executable. Organisations will increasingly treat literature review, study design, analysis, figure production, and report generation as orchestrated pipelines rather than artisanal document labour. Human experts will move upstream into framing and governance, and downstream into interpretation and accountability. The middle will be progressively automated, inspected, and logged.

That shift is easy to underestimate because it does not look like the mythic version of AI. No lab-coated machine is standing at a bench, muttering Popper into a pipette. Instead, a distributed system is doing the dull work that makes scientific judgement possible: reading too much, checking assumptions, writing code, failing, debugging, documenting, formatting, and asking whether the result actually means anything.

That is not the end of the principal investigator. It is the arrival of a new kind of research staff: tireless, literal, fast, occasionally overconfident, and badly in need of supervision at the exact moment it thinks it has understood the theory.

In other words, very much like the rest of us. Just with better runtime logs.

Cognaptus: Automate the Present, Incubate the Future.

¹ Gabrielle Wehr, Reuben Rideaux, Amaya J. Fox, David R. Lightfoot, Jason Tangen, Jason B. Mattingley, and Shane E. Ehrhardt, “Virtuous Machines: Towards Artificial General Science,” arXiv:2508.13421v2, submitted 19 August 2025 and revised 29 January 2026, https://arxiv.org/abs/2508.13421.
