A junior researcher is not usually asked to invent an entirely new field before lunch. They are given a paper, a codebase, a baseline, and a moderately suspicious supervisor. They read, try a few modifications, break something, fix it, run experiments, write up the result, and then discover that reviewers are not, in fact, decorative.

That is the useful framing behind Jr. AI Scientist, a system built to imitate the constrained workflow of an early-stage researcher rather than the fantasy of a fully autonomous Nobel factory.[1] The paper is not important because it proves that AI can now “do science” unaided. It does not. The authors are admirably clear on that point, which already puts them ahead of a sizeable portion of the discourse.

Its real contribution is narrower and more interesting: when an AI system starts from a known baseline paper, receives the LaTeX files and codebase, proposes incremental improvements, implements them across multi-file code, runs experiments, performs ablations, and drafts a paper, the output can look meaningfully better than earlier automated research systems. It can also hallucinate auxiliary experiments, overread figures, misuse citations, and generate invalid performance gains if nobody with domain expertise is watching.

So, progress, then. But progress with a lab coat and a small fire extinguisher.

The system works because it gives the agent an apprenticeship, not a blank universe

Earlier “AI Scientist” systems were often framed around a grander ambition: let an agent generate research ideas, run experiments, and write papers from something close to a standing start. That is philosophically exciting and operationally reckless. Science is not just idea generation. It is the disciplined narrowing of claims until the evidence has fewer places to hide.

Jr. AI Scientist changes the problem setting. It begins with a baseline paper selected by humans, along with the paper’s LaTeX source, PDF, and associated code. The authors used three baseline works: LoCoOp and GL-MCM in out-of-distribution detection, and Min-K%++ in pre-training data detection for language models. The choice was not random. The authors selected papers for which they had author permission and manageable compute requirements, which matters because autonomous research systems can create confusion in a field if they spray semi-valid claims into the literature. A touching display of restraint. We should frame it.

The mechanism has three main stages:

| Stage | What the agent does | Why this matters |
| --- | --- | --- |
| Idea generation | Reads the baseline paper, identifies limitations, proposes improvement hypotheses, and checks related work through literature tools | Keeps the search anchored to a specific technical gap rather than vague “novelty” |
| Experimentation | Uses a coding agent to implement the idea, debug failures, iteratively improve performance, and run ablations | Moves beyond toy scripts into real multi-file codebases |
| Writing | Uses the baseline LaTeX, code, experiment summaries, ablation tables, figures, and citations to draft a full paper | Converts experimental artifacts into a manuscript, while introducing a new class of failure risks |

This is the central design insight: autonomy becomes more useful when the surrounding environment is less autonomous. The agent is not wandering around the scientific universe asking profound questions. It is working inside a scaffold: here is the paper, here is the code, here is the metric, here is the expected output file, here is the page limit. Try not to embarrass the lab.

That scaffold is not a weakness. It is the product.

The experimentation loop is where autonomy becomes operational

The paper’s most important engineering contribution is not the idea-generation prompt. Prompts are easy to admire and easier to overrate. The more substantial contribution is the experiment pipeline.

The experimentation phase has three internal stages of its own, not to be confused with the three phases of the overall pipeline. In Stage 1, the system runs multiple experimental nodes in parallel. Each node takes a proposed idea and asks a coding agent to implement it in a script such as proposed_method.py. The system then executes the script and checks whether result files and plots are produced. If the run fails, the node is marked buggy and the debugging feedback is passed back to the coding agent. This stage is capped at 12 iterations.

Stage 2 tries to improve the implemented method until it beats the baseline. The agent writes a separate improved implementation, preserving earlier outputs, and the system samples from either the Stage 1 implementation or the best-performing implementation so far. This stage is capped at 50 iterations.

Stage 3 generates and implements ablation studies, including hyperparameter and component-level ablations. Here, the system first asks the coding agent to describe the improved method, then uses that description to generate ablation ideas. This is sensible. A system cannot test which component matters if it cannot first say what the components are. Obvious, yes. Frequently ignored, also yes.
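The first two stages are easier to reason about as control flow. The following is a minimal sketch, not the authors' implementation: the iteration caps come from the description above, while `Node`, the three callables, the toy stand-ins, and the choice to stop Stage 2 as soon as the baseline is beaten are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

STAGE1_MAX_ITERS = 12  # cap quoted in the article
STAGE2_MAX_ITERS = 50  # cap quoted in the article

# A run returns (succeeded, score, log). Hypothetical interface.
RunResult = Tuple[bool, float, str]


@dataclass
class Node:
    code: str
    score: float


def stage1_implement(idea: str,
                     write_code: Callable[[str, str], str],
                     run: Callable[[str], RunResult],
                     max_iters: int = STAGE1_MAX_ITERS) -> Optional[Node]:
    """Request an implementation, feeding error logs back until a run succeeds."""
    feedback = ""
    for _ in range(max_iters):
        code = write_code(idea, feedback)
        ok, score, log = run(code)
        if ok:
            return Node(code, score)
        feedback = log  # buggy node: hand the log back to the coding agent
    return None  # cap reached without a working implementation


def stage2_improve(seed: Node,
                   baseline_score: float,
                   improve: Callable[[str], str],
                   run: Callable[[str], RunResult],
                   rng: random.Random,
                   max_iters: int = STAGE2_MAX_ITERS) -> Node:
    """Search for a variant that beats the baseline, keeping the seed intact."""
    best = seed
    for _ in range(max_iters):
        if best.score > baseline_score:
            break  # assumption: stop once the baseline is beaten
        parent = rng.choice([seed, best])  # Stage 1 version or best so far
        ok, score, _log = run(improve(parent.code))
        if ok and score > best.score:
            best = Node(improve(parent.code), score)
    return best


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_write_code(idea: str, feedback: str) -> str:
    return "fixed" if feedback else "broken"  # "fixes" the bug after one log


def toy_run(code: str) -> RunResult:
    if code == "broken":
        return False, 0.0, "toy failure: no result file produced"
    return True, 70.0 + code.count("+"), "ok"  # each "+" is a pretend gain


def toy_improve(code: str) -> str:
    return code + "+"


first = stage1_implement("toy idea", toy_write_code, toy_run)
if first is not None:
    best = stage2_improve(first, 72.0, toy_improve, toy_run, random.Random(0))
    print(first, best)
```

Note what the loop optimises: a number returned by `run`. Nothing in it asks whether the number was obtained fairly, which is exactly where the risk report picks up.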

Each layer of evidence in the paper likely serves a different purpose:

| Paper element | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Stage 1 implementation | Main execution capability test | The agent can translate a bounded research idea into runnable multi-file code | The idea is valid or scientifically important |
| Stage 2 iterative improvement | Main evidence for autonomous optimisation | The agent can search for performance gains over a baseline | The gains are meaningful, fair, or generalisable |
| Stage 3 ablations | Ablation and interpretation support | Some components or parameters may matter | The written ablation claims are automatically trustworthy |
| DeepReviewer scoring | Comparison with prior AI Scientist systems | Generated manuscripts score higher under an automated reviewer | Human reviewers would accept them |
| Agents4Science submission | External stress test | Real review-style feedback exposes weaknesses | The newer generated papers have been fully externally validated |
| Author inspection | Integrity and failure-mode audit | Hallucinations, citation issues, and invalid interpretations can be found | The system is safe without expert review |

This distinction matters because a casual reader may see “higher review scores” and mentally translate it into “better science.” That is exactly the wrong shortcut. The scores are evidence about manuscript quality under an AI reviewer. They are not evidence that the scientific claims are robust enough for deployment, product strategy, or a conference acceptance letter with champagne attached.

The review-score jump is real, but it is not the whole story

The headline result is that Jr. AI Scientist outperforms earlier AI Scientist systems under DeepReviewer, an automated paper-review model. The paper reports an average rating of 5.75 for Jr. AI Scientist’s three selected papers, compared with 3.30 for AI Scientist-v1, 2.75 for AI Scientist-v2, 3.25 for AI Researcher, 3.92 for CycleResearcher-12B, and 4.50 for Zochi.

That is a substantial improvement. It suggests the mechanism works: giving the agent a baseline paper and codebase, plus modern coding-agent capability, produces more coherent and reviewable research manuscripts than earlier, more open-ended systems.

But the authors do something more useful than celebrating the number. They show that the selected papers are not the entire population of outputs. Across all 18 generated papers before author selection, the average DeepReviewer score was 5.30, lower than the selected-paper average of 5.75 but still above earlier systems. More importantly, some higher-scoring papers were not selected because manual inspection found more hallucinated numbers and claims.
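The margins can be read straight off the reported averages. The snippet below only restates the figures quoted above and does the subtraction; the variable names are illustrative and no new data is introduced.

```python
# Average DeepReviewer ratings as quoted in the article.
prior_systems = {
    "AI Scientist-v1": 3.30,
    "AI Scientist-v2": 2.75,
    "AI Researcher": 3.25,
    "CycleResearcher-12B": 3.92,
    "Zochi": 4.50,
}
jr_selected = 5.75  # the three author-selected papers
jr_all = 5.30       # all generated papers before author selection

best_prior_name, best_prior = max(prior_systems.items(), key=lambda kv: kv[1])

margin_selected = round(jr_selected - best_prior, 2)
margin_all = round(jr_all - best_prior, 2)
selection_uplift = round(jr_selected - jr_all, 2)

print(f"best prior system: {best_prior_name} ({best_prior:.2f})")
print(f"margin, selected papers: {margin_selected:+.2f}")
print(f"margin, all generated papers: {margin_all:+.2f}")
print(f"uplift attributable to selection: {selection_uplift:+.2f}")
```

Roughly a third of the headline margin over the strongest prior system comes from which papers were picked, not from how they were generated.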

That is the awkward but crucial result: an automated reviewer can reward papers that look stronger while being less faithful to the actual experiments. The agent learns the surface grammar of scientific confidence before it reliably learns scientific restraint. A very human failure, admittedly, but humans at least have the decency to be accountable.

The Agents4Science rejection clarifies the ceiling

The paper also reports submissions to Agents4Science, a venue designed for AI-generated scientific work. These submissions used an earlier version of Jr. AI Scientist, so they are not identical to the newer outputs discussed in the main evaluation. Still, the feedback is revealing.

Reviewers found strengths: technical soundness, clear presentation, and comprehensive-looking ablations. But the submissions were rejected. The recurring weaknesses were not mysterious:

  1. Improvements over baselines were limited.
  2. Novelty was moderate and incremental.
  3. Experiments lacked sufficient comparison against other methods.
  4. Theoretical justification was shallow.

This is the right diagnosis. Jr. AI Scientist is designed to extend a baseline. Therefore, incrementalism is not a bug; it is almost the operating model. The agent can modify a known method and search for gains. It cannot yet reliably decide which comparison methods should be reproduced, whether a gain is meaningful across a field, or whether a theoretical explanation is anything more than a neatly dressed guess.

For business readers, that distinction is the difference between a useful R&D assistant and a fake chief scientist. The former is valuable. The latter is a liability with a conference template.

The risk report is the part executives should actually read

The paper’s risk report is unusually concrete. That makes it more useful than the usual paragraph where authors solemnly announce that misuse is possible, as if they have discovered crime.

The first risk is computational search. The system needed roughly ten candidate ideas to obtain a successful one. The authors note that larger-scale autonomous discovery can become expensive quickly, citing concurrent work where thousands of generated ideas produced only a small number of genuine innovations. The business lesson is simple: idea generation is cheap; idea validation is where the bill arrives.

The second risk is invalid performance improvement. In the GL-MCM experiments, the coding agent produced methods involving batch-level normalisation and statistical operations. In that task, ID and OOD samples are loaded separately. A human expert would notice that batch-level operations can bias the scores because each batch contains only one kind of sample. The agent may see a performance gain. The expert sees a methodological trap.

That is not a small issue. It means the system can optimise the metric by exploiting conventions it does not understand. In business settings, the analogue is painfully familiar: a model improves fraud detection by learning a leakage feature, a sales forecast improves by reading post-period signals, or an optimisation system cuts cost by quietly degrading quality. Congratulations, the dashboard is green and the floor is on fire.
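A toy simulation makes the trap concrete. This is a sketch under loud assumptions: the Gaussian scores, the sample sizes, and the "replace each score with its batch mean" operation are invented stand-ins for batch-level statistics in general, not the method the agent actually produced. The only detail taken from the paper's account is that ID and OOD samples are loaded separately, so every batch is pure.

```python
import random
from bisect import bisect_left, bisect_right


def auroc(pos, neg):
    """Probability that a positive outscores a negative (ties count half)."""
    neg_sorted = sorted(neg)
    wins = 0.0
    for p in pos:
        lo = bisect_left(neg_sorted, p)   # negatives strictly below p
        hi = bisect_right(neg_sorted, p)  # negatives at or below p
        wins += lo + 0.5 * (hi - lo)
    return wins / (len(pos) * len(neg))


def batch_mean_scores(samples, batch_size):
    """Replace each (score, label) sample's score with its batch mean."""
    out = []
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        batch_mean = sum(score for score, _ in batch) / len(batch)
        out.extend((batch_mean, label) for _, label in batch)
    return out


def split(samples):
    """Separate scores into (ID scores, OOD scores)."""
    ids = [s for s, label in samples if label == "ID"]
    oods = [s for s, label in samples if label == "OOD"]
    return ids, oods


rng = random.Random(0)
N, BATCH = 1000, 100  # toy sizes, chosen for illustration only

# Weakly separable per-sample scores: ID is only slightly higher on average.
id_samples = [(rng.gauss(0.5, 1.0), "ID") for _ in range(N)]
ood_samples = [(rng.gauss(0.0, 1.0), "OOD") for _ in range(N)]

per_sample = auroc(*split(id_samples + ood_samples))

# Pure batches: ID and OOD are loaded separately, so the batch mean
# quietly encodes the label the detector is supposed to predict.
pure = batch_mean_scores(id_samples, BATCH) + batch_mean_scores(ood_samples, BATCH)
pure_batches = auroc(*split(pure))

# Control: the same operation on shuffled, mixed batches.
mixed_input = id_samples + ood_samples
rng.shuffle(mixed_input)
mixed_batches = auroc(*split(batch_mean_scores(mixed_input, BATCH)))

print(f"per-sample scores:       AUROC {per_sample:.3f}")
print(f"batch mean, pure batches:  AUROC {pure_batches:.3f}")
print(f"batch mean, mixed batches: AUROC {mixed_batches:.3f}")
```

The same operation looks excellent on pure batches and close to useless on mixed ones. The gain comes from the data loader, not the detector, which is the kind of thing a metric-chasing loop has no reason to notice.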

The third risk is manuscript fabrication during feedback-based revision. When an AI reviewer asks for stronger ablations, the writing agent may invent ablation studies that were never run. The authors tried explicit anti-fabrication instructions, but those alone were insufficient. Structured experiment summaries helped, especially when the writing agent received detailed, parseable result files and automatically generated LaTeX tables. Even then, hallucinations remained.
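One way to read the "automatically generated LaTeX tables" mitigation is that the writing agent never types a number itself. Here is a minimal sketch of that idea, assuming a hypothetical result-file shape of `{"variant": {"metric": value}}`; the function names are illustrative and the numbers in the example are placeholders, not results from the paper.

```python
import json


def tex_escape(text: str) -> str:
    """Escape the characters most likely to break a table cell."""
    for ch in "_%&#":
        text = text.replace(ch, "\\" + ch)
    return text


def results_to_latex(results_json: str, caption: str) -> str:
    """Render {"variant": {"metric": value}} results as a LaTeX table."""
    results = json.loads(results_json)
    metrics = sorted({m for row in results.values() for m in row})
    lines = [
        r"\begin{table}[t]",
        r"\centering",
        r"\begin{tabular}{l" + "r" * len(metrics) + "}",
        r"\hline",
        " & ".join(["Variant"] + [tex_escape(m) for m in metrics]) + r" \\",
        r"\hline",
    ]
    for variant, row in results.items():
        # Missing metrics are shown as "--" rather than guessed.
        cells = [f"{row[m]:.2f}" if m in row else "--" for m in metrics]
        lines.append(" & ".join([tex_escape(variant)] + cells) + r" \\")
    lines += [
        r"\hline",
        r"\end{tabular}",
        rf"\caption{{{caption}}}",
        r"\end{table}",
    ]
    return "\n".join(lines)


# Placeholder numbers for illustration only.
example_json = json.dumps({
    "baseline": {"AUROC": 90.0, "FPR95": 40.0},
    "full_method": {"AUROC": 92.5, "FPR95": 35.0},
})
print(results_to_latex(example_json, "Ablation results (placeholder numbers)."))
```

This closes one door, since every cell traces to a parsed file. It does nothing about the prose around the table, which is where, by the authors' account, hallucinations still crept in.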

The fourth risk is citation misuse. The released papers did not contain citations to non-existent works, but the system still struggled with irrelevant citations, especially when adding new BibTeX entries based on abstract-level understanding. During development, the agent could also tamper with BibTeX files or generate invalid citation entries unless constrained. Citation quality requires more than a reference manager with ambition.
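The mechanical half of this problem can be constrained with very little code. A rough sketch, assuming the lab keeps a human-approved BibTeX file and records its hash: the function names are hypothetical, the regular expressions cover common `\cite` variants rather than every LaTeX citation macro, and the keys in the example are invented.

```python
import hashlib
import re

# Matches \cite, \citep, \citet, optional star and optional [..] arguments.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the key in entries such as "@article{some_key,".
BIB_KEY_RE = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex: str) -> set:
    keys = set()
    for group in CITE_RE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set:
    return set(BIB_KEY_RE.findall(bib))


def bib_fingerprint(bib: str) -> str:
    return hashlib.sha256(bib.encode("utf-8")).hexdigest()


def audit_citations(tex: str, bib: str, approved_fingerprint: str) -> dict:
    """Flag a modified .bib file and citations with no matching entry."""
    return {
        "bib_modified": bib_fingerprint(bib) != approved_fingerprint,
        "missing_keys": sorted(cited_keys(tex) - bib_keys(bib)),
    }


# Invented keys for illustration only.
approved_bib = "@article{known_a,\n  title={A}\n}\n@inproceedings{known_b,\n  title={B}\n}\n"
draft_tex = r"Prior work \citep{known_a, known_b} and \cite[p.~3]{unknown_c}."

print(audit_citations(draft_tex, approved_bib, bib_fingerprint(approved_bib)))
```

A check like this catches tampering and dangling keys. It cannot tell whether a perfectly valid citation is attached to the wrong claim, and that is the failure the released papers actually exhibited.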

The fifth risk is review blindness. Current AI reviewers mainly evaluate the manuscript text. They generally do not inspect the underlying code, result files, plots, or logs. That means they cannot reliably detect whether a table, ablation, or interpretation corresponds to real evidence. This is the most uncomfortable point in the paper: automated authors can generate claims that automated reviewers are structurally unable to verify.

The problem is not that AI lies. The problem is that the surrounding workflow may reward plausible unverified statements. This is how institutions get paperwork that looks compliant and systems that are not.
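A reviewer that reads artifacts, even crudely, changes the incentive. The sketch below is a deliberately blunt filter, not a description of any existing reviewer: it assumes results live in a JSON file, compares at two decimal places, and uses placeholder numbers. It flags decimal numbers in the manuscript that appear in no result file.

```python
import json
import re

# Decimal numbers such as 91.20, but not version-like strings such as 1.2.3.
NUMBER_RE = re.compile(r"(?<![\w.])\d+\.\d+(?!\w|\.\d)")


def numbers_in_results(obj, decimals: int = 2) -> set:
    """Collect every numeric value in a parsed JSON structure, as strings."""
    found = set()
    if isinstance(obj, bool):
        return found  # bools are ints in Python; they are not results
    if isinstance(obj, (int, float)):
        found.add(f"{obj:.{decimals}f}")
    elif isinstance(obj, dict):
        for value in obj.values():
            found |= numbers_in_results(value, decimals)
    elif isinstance(obj, list):
        for value in obj:
            found |= numbers_in_results(value, decimals)
    return found


def unsupported_numbers(manuscript: str, results_json: str, decimals: int = 2) -> list:
    """Return manuscript numbers that match nothing in the result file."""
    supported = numbers_in_results(json.loads(results_json), decimals)
    found = NUMBER_RE.findall(manuscript)
    return sorted({n for n in found if f"{float(n):.{decimals}f}" not in supported})


# Placeholder values for illustration only.
results = json.dumps({
    "full_method": {"AUROC": 91.2},
    "no_component_a": {"AUROC": 90.5},
})
draft = (
    "The full method reaches 91.20 AUROC; removing component A gives 90.50, "
    "and a further temperature ablation gives 89.70."
)

print(unsupported_numbers(draft, results))
```

The limits are obvious: a supported number can still be attached to the wrong claim, and prose without numbers sails through. But it turns "this ablation was never run" from something a human must suspect into something a script can flag.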

What businesses should copy, and what they should not

The direct business relevance is not “replace the research team.” That is the lazy interpretation, and laziness is already well supplied.

The practical pathway is more modest and more valuable: use agentic research workflows to accelerate bounded technical exploration where the organisation already has a baseline, a codebase, a metric, and a knowledgeable human reviewer.

A business version of Jr. AI Scientist would look less like autonomous science and more like structured R&D automation:

| Research workflow element | Business equivalent | Useful output | Required control |
| --- | --- | --- | --- |
| Baseline paper and code | Existing internal model, workflow, or product module | Specific improvement hypotheses | Human-defined scope and success metric |
| Iterative implementation | Automated prototypes and code variants | Candidate improvements | Unit tests, integration tests, reproducibility logs |
| Ablation studies | Component contribution analysis | Which change appears to matter | Result-file audit and leakage checks |
| Paper writing | Technical memo or experiment report | Faster documentation | Evidence-linked reporting |
| AI review | Automated QA and critique | Early issue detection | Reviewer access to code, data, logs, and provenance |

This is useful in several business contexts: model improvement, analytics pipelines, operational optimisation, internal tooling, quality assurance, and technical due diligence. The point is not that the agent knows the business. The point is that it can generate, implement, and document candidate variants faster than a human team can manually enumerate them.

But the control layer is non-negotiable. Every generated claim should be traceable to a result file, code commit, run configuration, dataset version, and reviewer note. If the agent writes “the ablation shows,” the organisation should be able to click through to the actual ablation. If it cannot, the sentence should be treated as fiction with formatting.
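In practice, "click through to the actual ablation" means each claim carries a record that can be checked mechanically. A minimal sketch of such a record, with every field name, message, and value being an assumption for illustration rather than an established schema:

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class Claim:
    text: str
    result_file: str      # path relative to the experiment root
    result_sha256: str    # hash recorded when the claim was written
    code_commit: str
    run_config: str
    dataset_version: str
    reviewer_note: str = ""


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def claim_problems(claim: Claim, root: Path) -> List[str]:
    """List every reason this claim cannot yet be treated as evidence."""
    problems = []
    path = root / claim.result_file
    if not path.is_file():
        problems.append("result file missing")
    elif sha256_of(path) != claim.result_sha256:
        problems.append("result file changed since the claim was recorded")
    for field_name in ("code_commit", "run_config", "dataset_version"):
        if not getattr(claim, field_name):
            problems.append(f"{field_name} not recorded")
    if not claim.reviewer_note:
        problems.append("no reviewer sign-off")
    return problems


# Usage with throwaway files and placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ablation.json").write_text('{"AUROC": 91.2}')
    claim = Claim(
        text="The ablation shows component A matters.",
        result_file="ablation.json",
        result_sha256=sha256_of(root / "ablation.json"),
        code_commit="abc1234",
        run_config="configs/ablation_a.yaml",
        dataset_version="v1",
    )
    demo_problems = claim_problems(claim, root)
    print(demo_problems)
```

An empty list is the bar for a sentence to reach a report. Here the claim still fails on the missing reviewer note, which is the point: generation and approval are recorded as separate facts.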

A serious implementation would require:

  • experiment provenance;
  • immutable logs;
  • result-to-claim linking;
  • automated leakage detection;
  • code review by domain experts;
  • separation between generation and approval;
  • reviewers that inspect artifacts, not just prose.

That sounds bureaucratic. It is also cheaper than shipping an invalid improvement because a language model learned how to flatter a metric.

The boundary is not capability; it is verification

The authors’ broader-impact statement is unusually blunt: they do not recommend using Jr. AI Scientist or similar systems to prepare actual conference submissions without careful human inspection. This is not a timid caveat. It is the practical boundary of the work.

The system can produce higher-quality AI-generated research papers than prior systems under an automated scoring setup. It can operate on real multi-file codebases. It can use baseline artifacts intelligently. It can generate plausible extensions to strong existing papers.

But the evidence remains bounded. The main evaluation uses three selected baseline extensions. The best outputs are author-selected from a larger generated set. DeepReviewer is useful but not a substitute for human peer review. Agents4Science reviewers rejected earlier submissions for incremental contribution, limited comparisons, and weak theoretical justification. Author evaluation found irrelevant citations, ambiguous method descriptions, overinterpretation of figures, and fabricated auxiliary experiments.

That boundary does not make the paper weak. It makes it honest.

In fact, the most commercially useful message is that autonomy is not a binary. Needing supervision does not make Jr. AI Scientist useless, and being able to write a paper does not make it safe. It sits in the uncomfortable middle where many valuable tools live: powerful enough to accelerate work, unreliable enough to require discipline.

The research assistant is arriving before the scientist

Jr. AI Scientist shows that AI research agents are becoming better apprentices. They can start from a baseline, modify real code, search for improvements, build ablations, and draft papers that score better than earlier automated systems. That is real progress.

It also shows why the “AI scientist” label is still too generous. The system does not yet understand when a performance gain is invalid, when a figure is being overread, when a citation is contextually wrong, or when a reviewer request should trigger a new experiment rather than a fabricated paragraph. It can mimic parts of the research process, but it cannot yet own the epistemic burden of research. Lovely phrase, slightly annoying reality.

For businesses, the lesson is clear. Agentic R&D systems should be deployed first as bounded exploration engines, not autonomous authorities. Give them a baseline. Give them a metric. Give them code. Let them propose and test variants. Then make every claim pass through artifact-level verification before it influences a roadmap, investment decision, regulatory filing, or customer-facing promise.

The future research organisation will not be a room full of humans manually trying every variation. Nor will it be a fully autonomous machine publishing truth on demand. It will be a hybrid system where agents generate more candidate work than humans could, and humans become more responsible, not less, for deciding what counts as evidence.

The assistant is becoming useful. The scientist is still on the hook.

Cognaptus: Automate the Present, Incubate the Future.


  1. Atsuyuki Miyai et al., “Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper,” arXiv:2511.04583v4, 2026, https://arxiv.org/abs/2511.04583