A research report lands on your desk. It has a neat abstract, respectable tables, clean code attached, and just enough methodological language to sound like someone suffered through the usual academic rituals. Except this time, no one did. An AI scientist system generated the idea, wrote the code, ran the experiments, selected the result, and drafted the paper.

The comfortable assumption is that you can review this output the way you review human-authored work: read the paper, skim the code, check whether the claims look plausible, maybe ask a clever question in the meeting. That assumption is now looking rather under-insured.

In “The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems,” Ziming Luo, Atoosa Kasirzadeh, and Nihar B. Shah test two open-source AI scientist systems, Agent Laboratory and The AI Scientist v2, under controlled conditions designed to reveal where automated research workflows quietly go wrong.[1] The paper is not a generic warning that “AI may hallucinate.” That would be too easy, and therefore insufficiently useful. Its sharper point is this: AI scientist systems can produce flawed research not only by writing false statements, but by making questionable intermediate decisions that disappear before the final manuscript is born.

That is the real mechanism. Automation does not merely accelerate research. It moves scientific judgement upstream into benchmark choice, dataset handling, metric selection, experimental branching, and reward-based selection. Then the final paper launders those choices into a clean narrative. Very tidy. Very dangerous.

The failure is not the paper; it is the invisible path to the paper

The authors study “AI scientist systems” that can automate large parts of the research workflow: hypothesis generation, experiment execution, paper writing, and sometimes peer review. These systems are different from a chatbot helping a researcher draft a paragraph. They are closer to junior research teams made of software: they plan, code, evaluate, revise, and report.

The paper focuses on four methodological pitfalls:

| Pitfall | What it means in research | Why it matters in business |
| --- | --- | --- |
| Inappropriate benchmark selection | Choosing evaluation datasets that make the method look better or are not representative | Vendor or internal model evaluations become inflated before deployment |
| Data leakage or dataset deviation | Letting evaluation information influence development, or silently changing the dataset | Reported performance stops measuring real-world generalisation |
| Metric misuse | Reporting metrics because they look favourable or because prompt order nudged the system | Dashboards improve while the actual objective does not |
| Post-hoc selection bias | Trying multiple candidates and reporting the one that looks best on the test set | Automated experimentation becomes polished cherry-picking at scale |

The interesting part is not that these risks exist. Human researchers already know them, and occasionally practise them with the solemn creativity of people under publication pressure. The interesting part is that AI scientist systems can industrialise them while leaving fewer visible fingerprints.

A human researcher who cherry-picks a benchmark usually knows they are doing it. An autonomous system may simply choose the first four datasets in a prompt, or prefer benchmarks with high visible state-of-the-art performance, and then produce a paper that looks methodologically complete. The misconduct is not always intentional. The damage does not particularly care.

The diagnostic trick: build a research task with no inherited baggage

To test these systems cleanly, the authors needed a task that was not already floating around the internet. Existing benchmarks are contaminated by history: models may have seen them during pretraining, benchmark popularity may influence choices, and “standard practice” may blur whether a decision is good or merely familiar.

So the authors created a synthetic classification task called Symbolic Pattern Reasoning, or SPR. Each data point is a sequence of symbolic tokens combining shapes and colours. A hidden logical rule determines whether the sequence should be accepted or rejected. The rules combine properties such as shape counts, colour positions, parity, and ordering.

That design choice is not decorative. It is the foundation of the paper’s evidence. Because the authors control the rules, difficulty, data distribution, label noise, candidate benchmarks, and metric definitions, they can isolate the specific failure mode under test.

The SPR framework is best understood as a diagnostic wind tunnel. It is not trying to prove that real scientific research is a toy symbol task. It is trying to remove enough real-world noise to see whether the automated research machinery behaves properly when the correct behaviour is knowable.
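To make the task shape concrete, here is a minimal Python sketch of an SPR-style generator. Everything specific in it is an assumption made for illustration: the token vocabulary, the example rule (an even number of triangles and a red first token), and the names `hidden_rule` and `make_dataset` are not taken from the paper, whose actual rules and generator may differ. It only shows why a known rule plus a known noise rate gives the evaluator a ground truth to check against.

```python
import random
from typing import List, Tuple

Token = Tuple[str, str]  # (shape, colour)

# Hypothetical vocabulary; the paper's actual symbols are not reproduced here.
SHAPES = ["triangle", "square", "circle"]
COLOURS = ["red", "green", "blue"]


def hidden_rule(seq: List[Token]) -> bool:
    """Illustrative rule: accept if the triangle count is even AND the first token is red."""
    triangles = sum(1 for shape, _ in seq if shape == "triangle")
    return triangles % 2 == 0 and seq[0][1] == "red"


def make_dataset(n: int, length: int, noise_rate: float, seed: int):
    """Generate n labelled sequences, flipping each label with probability noise_rate."""
    if length < 1:
        raise ValueError("sequences need at least one token")
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    data = []
    for _ in range(n):
        seq = [(rng.choice(SHAPES), rng.choice(COLOURS)) for _ in range(length)]
        label = hidden_rule(seq)
        if rng.random() < noise_rate:
            label = not label  # controlled label noise
        data.append((seq, label))
    return data


sample = make_dataset(n=3, length=4, noise_rate=0.0, seed=0)
for seq, label in sample:
    print(label, seq)
```

Because the rule and the noise rate are both chosen by the experimenter, any reported result can be compared against what is actually achievable on the data.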

The paper then uses SPR to construct targeted tests:

| Test component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Synthetic SPR task | Methodological control | Reduces contamination and lets the authors know the ground truth | That all domains will behave exactly like SPR |
| Benchmark difficulty ladder | Main evidence for benchmark selection | Reveals whether systems choose easy or positionally convenient benchmarks | That every benchmark choice in real research is automatically biased |
| Removing SOTA references from benchmark prompts | Ablation / sensitivity test | Separates SOTA-chasing from positional bias | That prompt design alone solves benchmark selection |
| Metric-order variants | Robustness / sensitivity test | Shows whether metric choices depend on ordering in the prompt | That metric misuse is always intentional |
| Manipulated candidate rankings for reward functions | Main evidence for post-hoc selection bias | Tests whether test performance drives selection even when train/validation evidence conflicts | That every reward function must fail this way |
| Paper-only vs paper-plus-logs/code detection | Proposed remedy evaluation | Shows that workflow artifacts improve pitfall detection | That an LLM classifier is sufficient as a complete governance system |

This is the right way to read the paper. The experiments are not one pile of anecdotes. They are a set of controlled probes aimed at different points in the automated research pipeline.

Benchmark choice is where the blindfold first appears

Benchmark selection sounds administrative. It is not. It decides what kind of reality the system is asked to survive.

The authors created 20 SPR benchmark datasets across five difficulty tiers: simple, moderate, standard, hard, and extreme. The systems were asked to choose four benchmarks for evaluation. The benchmark names were obfuscated with random five-letter codes, and presentation order was randomised. The systems saw state-of-the-art reference performance, but not the true difficulty ranking.

Agent Laboratory did not mainly choose easier datasets. Its failure was cruder and in some ways more embarrassing: it overwhelmingly chose the first four benchmarks listed. Across 1,000 runs with SOTA references, 945 runs made explicit benchmark selections, and 82.4% of those chose the first four listed benchmarks. When SOTA references were removed, the pattern persisted: 927 runs made benchmark selections, and 79.6% still selected the first four.

That removal is important. It functions as an ablation: if the system were chasing visible SOTA numbers, removing those numbers should reduce the bias. Instead, the bias remained. The likely mechanism is positional dependence, not strategic benchmark shopping. In business terms: the research agent behaved less like a scientist and more like a tired intern copying the top of a spreadsheet. Charming, if your valuation does not depend on it.
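For scale, here is a short sketch of how the positional effect could be measured from run logs, and what a position-blind selector would produce. It assumes each run's choice is recorded as the prompt positions of the selected benchmarks (0-based). The log format and the function name `first_k_rate` are hypothetical, and the uniform random pick of 4 from 20 is a reference point assumed here, not a baseline taken from the paper.

```python
from math import comb
from typing import Sequence


def first_k_rate(selections: Sequence[Sequence[int]], k: int = 4) -> float:
    """Share of runs whose chosen prompt positions are exactly the first k."""
    if not selections:
        raise ValueError("need at least one run")
    first_k = set(range(k))
    hits = sum(1 for chosen in selections if set(chosen) == first_k)
    return hits / len(selections)


# Reference point: probability that a uniform random choice of 4 out of 20
# benchmarks lands exactly on the first four listed.
CHANCE_FIRST_FOUR = 1 / comb(20, 4)  # 1 / 4845

# Hypothetical log of four runs (positions in the order shown in the prompt).
runs = [[0, 1, 2, 3], [3, 2, 1, 0], [0, 1, 2, 7], [5, 6, 7, 8]]
print(first_k_rate(runs))           # 0.5
print(f"{CHANCE_FIRST_FOUR:.4%}")   # 0.0206%
```

Against a reference of roughly 0.02%, the reported 82.4% and 79.6% are not subtle.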

The AI Scientist v2 showed a different pattern. In the control condition without SOTA references, the selected benchmark difficulties were more evenly distributed. When SOTA references were visible, however, selection shifted toward easier benchmarks with higher SOTA performance. The paper reports that 47.1% of selections fell into the easy tier with SOTA references, compared with 18.0% without them. The point is not that the system “wanted” to cheat. The point is that visible performance context became a selection cue.

These are two different blind spots with the same operational lesson. One system overweights position. Another overweights visible performance. Both make the benchmark decision less scientific than the final paper may suggest.

For companies using AI agents to generate market studies, model comparisons, product experiments, or investment research, benchmark choice has a direct analogue: which customer segment, dataset, region, scenario, or historical period gets selected for evaluation. A model that looks strong only because the system selected convenient evidence is not strong. It is well-lit.

Dataset discipline fails quietly, not always as classic leakage

The data leakage experiment is easy to misunderstand. The authors did not find that the systems literally peeked at test labels in the way they had defined leakage in advance. That is good news. It is also not the end of the story.

The authors injected label noise into SPR datasets under controlled settings. With 20% or 30% label noise, there is a theoretical upper bound on achievable performance. If a system reports test accuracy meaningfully above that bound, something suspicious is happening.
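The logic of that check can be sketched in a few lines. The sketch assumes a binary task in which each test label is flipped independently with probability p, so that even a model that has learned the hidden rule perfectly agrees with the noisy labels only a fraction 1 - p of the time in expectation. The paper's exact bound may be defined differently; the function names, the test-set size, and the three-standard-error tolerance are all assumptions for illustration.

```python
import math


def noise_ceiling(noise_rate: float) -> float:
    """Expected accuracy of a perfect model against labels flipped at noise_rate."""
    if not 0.0 <= noise_rate < 0.5:
        raise ValueError("noise_rate must be in [0, 0.5)")
    return 1.0 - noise_rate


def exceeds_ceiling(reported_acc: float, noise_rate: float, n_test: int, z: float = 3.0) -> bool:
    """Flag a reported accuracy that sits implausibly far above the ceiling.

    A finite test set can land slightly above 1 - p by chance, so the threshold
    adds z binomial standard errors (an assumed tolerance, not the paper's).
    """
    ceiling = noise_ceiling(noise_rate)
    stderr = math.sqrt(ceiling * (1.0 - ceiling) / n_test)
    return reported_acc > ceiling + z * stderr


# Illustrative inputs: 30% noise and a hypothetical test set of 1,000 examples.
print(exceeds_ceiling(0.71, noise_rate=0.30, n_test=1000))    # False
print(exceeds_ceiling(0.9748, noise_rate=0.30, n_test=1000))  # True
```

A flag from a check like this does not say what went wrong. It says the reported number cannot have come from the dataset as specified, which is the cue to inspect the data-loading code.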

Agent Laboratory mostly stayed below the theoretical baselines. But in one run, it slightly exceeded the 30% noise upper bound, and code inspection showed that it had selected only a subset of the provided benchmark rather than using the complete evaluation set. In another run, it created a synthetic dataset, apparently to speed up training or prototyping. The final generated papers did not disclose these deviations.

The AI Scientist v2 showed a stronger version of the same disease. The authors report several runs where the system subsampled the provided data or synthesized new datasets, despite being given scripts to load the full local benchmark. Some reported test accuracies reached 100% or 97.48% in settings where the theoretical upper bounds were much lower. Again, the core issue is not merely “data leakage” in the narrow textbook sense. It is protocol drift.

That distinction matters. Many governance checklists ask whether the test set was used during training. Useful question. Not enough question. An automated research system can damage validity by silently changing the dataset, shrinking it, fabricating it, or substituting something easier. The final paper can still use ordinary methodological language. It may even sound more confident because the numbers improved. The numbers are always very grateful when reality has been simplified on their behalf.

For business users, this maps directly to AI-generated analytics. Suppose an agent evaluates a churn model, a credit-risk screen, a customer segmentation strategy, or a drug-candidate ranking workflow. It may not “leak” in the obvious way. It may instead filter out difficult cases, synthesize convenient samples, change the time window, drop messy records, or use a proxy dataset while failing to disclose the substitution. The report still looks like an evaluation. It is actually a costume party.

Metric choice reveals whether the system understands the objective

The metric misuse experiment is more subtle. The authors designed two valid SPR metrics: Shape-Weighted Accuracy and Color-Weighted Accuracy. Both are legitimate, but label noise can make one look better than the other. This creates a controlled setting where a system could opportunistically report the favourable metric.

The paper does not find strong evidence of deliberate metric cherry-picking. That is a useful negative result and should not be inflated into a more dramatic claim. What it does find is still operationally uncomfortable: metric selection can be arbitrary, prompt-order sensitive, or substituted with alternatives that were not requested.

Agent Laboratory was highly sensitive to how metrics were presented. When Shape-Weighted Accuracy was listed first, it selected only that metric in 100% of runs across both noise settings. When Color-Weighted Accuracy was listed first, it often reported both metrics instead. The choice was made early in the planning stage and then persisted through the workflow.

The AI Scientist v2 more often reported both specified metrics, especially when Color-Weighted Accuracy was presented first. But in a substantial share of cases, it did not select either of the specified metrics and instead reported other measures such as F1 score, training loss, or self-devised metrics.

This is not the scandalous version of metric misuse. It is the boring version. The system may not be scheming to inflate results; it may simply fail to treat metric specification as a binding contract. Boring failure modes are popular in enterprise systems because they fit neatly into procurement slides until the audit arrives.

For companies, the lesson is straightforward: do not let autonomous systems choose success criteria casually. In business workflows, metric substitution is everywhere. A system asked to optimise customer retention may report engagement. A system asked to reduce risk may report accuracy. A system asked to improve operational resilience may report average uptime while hiding tail failures. The metric is the objective wearing a name tag. If the name tag keeps changing, nobody is accountable.
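Treating the metric specification as a binding contract can be as simple as comparing what was requested with what was reported. The sketch below is a hypothetical check, not part of either system: the function name `check_metric_contract` and the abbreviations `SWA` and `CWA` (standing in for Shape-Weighted and Color-Weighted Accuracy) are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Union


def check_metric_contract(
    required: Iterable[str], reported: Iterable[str]
) -> Dict[str, Union[List[str], bool]]:
    """Compare pre-specified metrics with the metrics a report actually contains."""
    required_set, reported_set = set(required), set(reported)
    missing = sorted(required_set - reported_set)      # requested but absent
    unrequested = sorted(reported_set - required_set)  # present but never requested
    return {
        "missing": missing,
        "unrequested": unrequested,
        # Extra metrics are tolerated but surfaced; missing ones break the contract.
        "compliant": not missing,
    }


# Hypothetical run that substituted its own measures for the requested ones.
result = check_metric_contract(required=["SWA", "CWA"], reported=["F1", "train_loss"])
print(result)
# {'missing': ['CWA', 'SWA'], 'unrequested': ['F1', 'train_loss'], 'compliant': False}
```

The design choice worth noting is that unrequested metrics are listed rather than forbidden: additional measures can be useful, but they should not silently replace the ones the task specified.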

The reward function turns test performance into a temptation engine

The strongest evidence in the paper concerns post-hoc selection bias.

The authors examine how the systems choose among multiple experimental candidates. In proper machine learning practice, test data should be used only for final evaluation, not for iterative selection. If a system sees test performance while choosing which candidate to report, the test set becomes part of the optimisation loop. At that point, “generalisation” begins to mean “we found something that looked good after repeated exposure to the exam.”

The authors constructed sets of five candidate research outputs. In the control condition, candidate quality on train/validation data aligned with test performance. In the manipulated condition, the authors inverted test performance while keeping train/validation performance fixed. This is a clean counterfactual probe: if the reward function is guided by train/validation evidence, its choices should not change much. If it is attracted to test performance, its choices should shift toward candidates that look good on the test set despite weak train/validation results.

They shifted.

For Agent Laboratory, the best train/validation candidate was selected 78.5% of the time in the control condition. After test performance was inverted, that fell to 43.5%. The worst train/validation candidate rose from 1.0% selection in the control condition to 10.0% in the manipulated condition.

The AI Scientist v2 showed an even sharper pattern. In the control condition, it selected the best train/validation candidate 82.0% of the time. In the manipulated condition, that fell to 31.5%. The worst train/validation candidate went from 0.0% in the control condition to 49.0% after receiving the best test performance.

That is the paper’s clearest business-relevant mechanism. The reward function had access to signals it should not have used for selection. Once test performance became visible, it influenced what the system chose to advance or report. The research pipeline did not merely evaluate candidates. It selected a story.

This is the automated version of backtesting a thousand trading strategies and publishing the one with the prettiest equity curve. Everyone knows that trick. The difference is that an AI scientist system can run the trick quickly, package it in formal prose, and leave reviewers staring at the final PDF as if the PDF knows what happened before it was born.
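The counterfactual probe described in this section is easy to reproduce against any selector that can be called as a function. The sketch below treats the reward function as a black box and asks one question: does the chosen candidate change when test scores are inverted and everything else is held fixed? The candidate scores and both reward functions are invented stand-ins, not the systems' actual reward functions or the paper's data.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    name: str
    val_score: float   # train/validation evidence
    test_score: float  # should not influence selection


Reward = Callable[[Candidate], float]


def select(cands: List[Candidate], reward: Reward) -> str:
    """Return the name of the candidate the reward function ranks highest."""
    return max(cands, key=reward).name


def invert_test(cands: List[Candidate]) -> List[Candidate]:
    """Keep validation scores fixed; give the best test score to the worst validation candidate."""
    by_val_ascending = sorted(cands, key=lambda c: c.val_score)
    test_descending = sorted((c.test_score for c in cands), reverse=True)
    return [Candidate(c.name, c.val_score, t) for c, t in zip(by_val_ascending, test_descending)]


def selection_is_test_sensitive(cands: List[Candidate], reward: Reward) -> bool:
    """True if inverting test scores changes which candidate is selected."""
    return select(cands, reward) != select(invert_test(cands), reward)


# Hypothetical candidates whose validation and test quality agree (control condition).
candidates = [
    Candidate("A", 0.90, 0.88), Candidate("B", 0.85, 0.84), Candidate("C", 0.80, 0.79),
    Candidate("D", 0.75, 0.73), Candidate("E", 0.70, 0.69),
]

val_only: Reward = lambda c: c.val_score                          # well-behaved stand-in
leaky: Reward = lambda c: 0.3 * c.val_score + 0.7 * c.test_score  # test-contaminated stand-in

print(selection_is_test_sensitive(candidates, val_only))  # False
print(selection_is_test_sensitive(candidates, leaky))     # True
print(select(invert_test(candidates), leaky))             # E
```

A real reward function will be noisier than these two, so in practice the probe would be repeated many times and read as a shift in selection rates, which is how the paper reports it.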

The final paper is a weak audit surface

The authors then test a practical question: can these pitfalls be detected from the final paper alone?

They build an LLM-based classifier to detect five issues: the original four pitfalls plus dataset fabrication, which emerged during the data-handling experiments. The classifier is evaluated under two settings. In the first, it receives only the task description, final generated paper, and detection rules. In the second, it also receives trace logs and generated code.

This part should be read as a proposed remedy evaluation, not as a claim that LLM classifiers are the final solution to research governance. The labeled dataset is small and partly constructed through controlled perturbations because some failures were rare in naturally generated outputs. Still, the comparison is informative.

Paper-only review performs close to chance overall: 55.0% accuracy and an F1 score of 0.51. With the paper, trace logs, and generated code, overall accuracy rises to 82.0%, and F1 rises to 0.81.

| Pitfall detected | Paper alone accuracy | Paper + logs + code accuracy | Interpretation |
| --- | --- | --- | --- |
| Inappropriate benchmark selection | 55.0% | 85.0% | The benchmark decision is often visible only in workflow traces |
| Data leakage | 55.0% | 90.0% | Code exposes data handling better than prose |
| Metric misuse | 55.0% | 80.0% | Logs clarify what was computed versus what was reported |
| Post-hoc selection bias | 50.0% | 70.0% | Selection bias is hard, but traces make it less invisible |
| Dataset fabrication | 60.0% | 85.0% | Dataset substitution leaves stronger evidence in logs/code than in final text |
| Overall | 55.0% | 82.0% | Final-output review misses much of the process risk |

This is the governance point that matters most. A generated paper is not enough. Code alone may also be insufficient, because code can show what one reported run did without showing what was tried and rejected. The missing artifact is the workflow: plans, candidate branches, data-loading decisions, metric choices, execution logs, reward scores, and selection history.

In human research, peer review is already imperfect. In AI-generated research, reviewing only the final manuscript is worse than imperfect. It is pointed at the wrong object.

The business control is a research flight recorder

The enterprise response should not be “ban AI research agents.” That is the kind of dramatic posture that lasts until someone sees the productivity numbers. The better response is to treat automated research as a controlled process, not a trusted author.

A useful governance model is the research flight recorder. Every material decision in the automated workflow should be logged in a form that an auditor can inspect later.

Minimum artifacts should include:

| Artifact | What it should record | Failure it helps detect |
| --- | --- | --- |
| Dataset manifest | Source, split logic, row counts, filters, hashes, synthetic-data flags | Dataset substitution, subsampling, fabrication |
| Benchmark ledger | Candidate benchmarks, ordering, selection rationale, rejected options | Easy benchmark selection, positional bias |
| Metric registry | Pre-specified primary and secondary metrics with definitions | Metric substitution, prompt-order drift |
| Trial ledger | All attempted candidates, parameters, seeds, train/validation/test results | Post-hoc selection bias |
| Reward-function log | Inputs visible to the reward function and scores assigned | Test-performance contamination in selection |
| Generated code bundle | Exact scripts, environment, dependencies, entrypoints | Leakage, preprocessing errors, undocumented logic |
| Final report diff | What the paper claims versus what logs/code show | Disclosure failure |

This is not bureaucracy for the sake of theatrical compliance. It is the difference between auditing an outcome and auditing a process. The paper’s evidence suggests that many critical failures sit in the process.
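As one illustration, the dataset manifest can be a small record built when the data is handed to the agent and rebuilt from whatever the agent actually evaluated on. The sketch below uses Python's standard `hashlib`; the field names, the row-as-string representation, and the functions `build_manifest` and `deviations` are assumptions for illustration, not a format proposed in the paper.

```python
import hashlib
from typing import Dict, List


def build_manifest(name: str, rows: List[str], source: str, synthetic: bool = False) -> Dict:
    """Record what a dataset is: where it came from, how big it is, and a content hash."""
    payload = "\n".join(rows).encode("utf-8")
    return {
        "name": name,
        "source": source,
        "row_count": len(rows),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "synthetic": synthetic,
    }


def deviations(expected: Dict, observed: Dict) -> List[str]:
    """List the ways the evaluated dataset differs from the one that was provided."""
    issues = []
    if observed["synthetic"] and not expected["synthetic"]:
        issues.append("synthetic data substituted")
    if observed["row_count"] != expected["row_count"]:
        issues.append(f"row count changed: {expected['row_count']} -> {observed['row_count']}")
    elif observed["sha256"] != expected["sha256"]:
        issues.append("content hash changed")
    return issues


# Hypothetical benchmark of four rows; the agent evaluates on only the first two.
full = ["seq_a,1", "seq_b,0", "seq_c,1", "seq_d,0"]
expected = build_manifest("spr_benchmark", full, source="local")
observed = build_manifest("spr_benchmark", full[:2], source="local")
print(deviations(expected, observed))  # ['row count changed: 4 -> 2']
```

The manifest does not prevent subsampling or substitution. It makes them visible, which is the part the final paper did not do.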

For vendors selling AI-generated research, “we provide the final report” should be treated as an inadequate answer. For internal AI teams, “the agent produced a plausible paper” should not count as validation. The procurement question is simple: can the provider reproduce the path from prompt to claim, including the paths not taken?

If not, the output is not research. It is content with tables.

What Cognaptus infers for operating teams

The paper directly shows that two open-source AI scientist systems exhibit several methodological weaknesses under controlled SPR experiments. It directly shows benchmark-selection bias, dataset deviation, arbitrary or order-sensitive metric behaviour, and test-performance-sensitive candidate selection. It also directly shows that access to logs and code substantially improves detection by an LLM-based auditor in the authors’ experimental setup.

For business use, Cognaptus infers three practical rules.

First, separate generation from validation. The agent that proposes experiments should not be the same unchecked authority that selects the winning result. Automated research needs independent evaluation gates, especially where claims affect product launches, investment decisions, regulatory filings, clinical hypotheses, or customer-facing performance guarantees.

Second, quarantine test signals. If reward functions, candidate selectors, or iterative planners can see test performance, they will tend to optimise toward it. Not always maliciously. Often mechanically. But mechanically is enough. Use validation data for iteration; reserve test data for final locked evaluation; log every access.

Third, make disclosure automatic. If a system changes a dataset, drops samples, synthesizes data, changes a metric, or reroutes an experiment, that event should be forced into the final report unless explicitly waived by a human reviewer with a recorded rationale. Silent deviation is where most of the trouble hides.
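The second and third rules lend themselves to small mechanical guards. The sketch below shows one possible shape for each: a vault that releases a test score once, for the final locked evaluation, and logs every request; and a disclosure generator that prints every deviation unless a reviewer has waived it with a recorded rationale. The class and function names are hypothetical, and a production version would need tamper-resistant storage rather than in-memory lists.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


class HoldoutVault:
    """Releases a test score once and logs every access attempt."""

    def __init__(self, test_scores: Dict[str, float]) -> None:
        self._scores = dict(test_scores)
        self._locked = False
        self.access_log: List[Tuple[str, str, str]] = []  # (requester, candidate, outcome)

    def final_evaluation(self, candidate: str, requester: str) -> float:
        if self._locked:
            self.access_log.append((requester, candidate, "denied"))
            raise PermissionError("test set already used for the final evaluation")
        score = self._scores[candidate]  # unknown candidates fail before the vault locks
        self._locked = True
        self.access_log.append((requester, candidate, "granted"))
        return score


@dataclass
class Deviation:
    kind: str
    detail: str
    waived_by: Optional[str] = None
    waiver_rationale: Optional[str] = None


def disclosure_section(deviations: List[Deviation]) -> str:
    """Render deviations for the final report; a waiver counts only with a rationale."""
    lines = [
        f"- {d.kind}: {d.detail}"
        for d in deviations
        if not (d.waived_by and d.waiver_rationale)
    ]
    if not lines:
        return "Protocol deviations: none to disclose."
    return "Protocol deviations:\n" + "\n".join(lines)


# Hypothetical events from one run.
events = [
    Deviation("subsampling", "evaluated on a subset of the provided benchmark"),
    Deviation("synthetic data", "used for a smoke test only",
              waived_by="reviewer", waiver_rationale="not used for reported results"),
]
print(disclosure_section(events))
```

Only the subsampling event reaches the report in this example, because the synthetic-data event carries both a named reviewer and a rationale. A waiver with no rationale would not suppress anything.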

The right operating standard is not “AI-generated research must be perfect.” Human research has not exactly earned that plaque. The standard is traceability: when a claim is made, the organisation must be able to reconstruct how the claim came into existence.

Boundaries: what this paper does not settle

The paper is careful about scope, and the business interpretation should be equally careful.

It tests two systems, Agent Laboratory and The AI Scientist v2, not every current or future AI scientist architecture. It focuses on machine-learning-style research, where benchmarks, splits, metrics, and test performance are central. Other domains—drug discovery, materials science, economics, operations research, legal analytics—will have their own failure modes. Some may be worse because the cost of a false positive is higher. Some may be easier to govern because instrumentation is already built into the workflow.

The diagnostic task is synthetic. That is a strength for causal isolation, not a weakness by default. But it does mean the results should not be read as a direct estimate of failure rates in real-world scientific domains.

The auditing classifier is also not a complete institutional solution. The authors use an LLM-based classifier under controlled conditions, with a small balanced dataset and some artificially constructed positive examples. That helps compare paper-only review against richer artifact review. It does not prove that an LLM auditor can replace expert review, statistical review, security review, or domain-specific replication.

Finally, logs are only useful if they are complete and honest. A system can omit, corrupt, or manipulate traces unless logging is designed as infrastructure rather than decoration. Auditability must be part of the architecture, not a folder created at the end named “logs_final_v3_really_final.”

The lesson: automation expands the audit surface

The obvious lesson is that AI scientist systems need better safeguards. True, but bland. The more useful lesson is that automation changes where scientific risk lives.

In a human-authored paper, reviewers mostly inspect the manuscript because the manuscript is the main public artifact. In an AI-generated research pipeline, the manuscript is only the last compression of a much larger decision process. Compression is lossy. Sometimes the lost bits are exactly the bits that matter.

The paper’s title says the more you automate, the less you see. The practical correction is not to automate less. It is to see more deliberately.

For organisations adopting research agents, the winning capability will not be “generate a paper from a prompt.” That will become cheap. The valuable capability will be generating a defensible research trail: benchmark rationale, dataset provenance, metric discipline, candidate history, reward isolation, code reproducibility, and disclosure by default.

Automate the work, by all means. Just do not automate away the evidence that the work was done properly.

Cognaptus: Automate the Present, Incubate the Future.


1. Ziming Luo, Atoosa Kasirzadeh, and Nihar B. Shah, “The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems,” arXiv:2509.08713, 2025, https://arxiv.org/abs/2509.08713