The Rise of the Self-Evolving Scientist: STELLA and the Future of Biomedical AI

TL;DR for operators

STELLA is not interesting because it calls itself a “self-evolving scientist”. The internet has suffered enough from ambitious nouns. It is interesting because it attacks a real operational bottleneck in biomedical research: the best answer often requires not just reasoning, but finding the right database, building the right analysis environment, running code, checking intermediate results, and deciding when the current workflow is inadequate.

The paper introduces STELLA, a biomedical LLM-agent system that improves through two linked mechanisms: an evolving Template Library of successful reasoning workflows and a dynamic Tool Ocean that can expand when a Tool Creation Agent discovers, tests, and integrates new bioinformatics tools.¹ In reported benchmarks, STELLA reaches about 26% accuracy on Humanity’s Last Exam: Biomedicine, 54% on LAB-Bench DBQA, and 63% on LAB-Bench LitQA. The authors also show test-time gains as the number of trials increases, including HLE Biomedicine rising from about 14% to 26%.

For business readers, the practical implication is not “replace biomedical scientists”. Calm down. The more plausible value is a new operating layer for research informatics: agents that preserve successful analytical workflows, expand tool coverage, and reduce the friction of moving from literature to data to executable analysis. That matters for pharma R&D, translational medicine, clinical bioinformatics, and any organisation drowning in specialised tools held together by senior scientists, undocumented scripts, and polite despair.

The boundary is clear. The paper mainly evaluates computational question-answering and agent workflows. It does not prove autonomous wet-lab discovery, clinical reliability, regulatory readiness, or experimentally confirmed therapeutic value. STELLA is a strong argument for adaptive research agents. It is not yet a robot Nobel laureate in a lab coat.

The real problem is not intelligence; it is tool drift

Biomedical research does not fail because scientists lack curiosity. It fails, slowly and expensively, because the ecosystem around a question is fragmented.

A single research problem may require PubMed search, variant databases, protein structures, single-cell analysis, differential expression tools, pathway enrichment, molecular models, custom scripts, and then someone senior enough to notice that the pipeline answered the wrong version of the question. That last role is usually performed by a human expert, after coffee, and before a grant deadline.

Most AI agents handle this world with a fixed toolbox. They can retrieve, call APIs, run code, or generate reports, but their capabilities are typically pre-defined. In a stable domain, that is tolerable. In biomedical research, where new datasets, models, assays, and repositories constantly appear, static agents become outdated quickly. The agent may still sound confident, which is worse. A stale agent with a fluent voice is just bureaucracy with embeddings.

STELLA’s premise is that biomedical AI should not merely answer questions. It should improve the way it answers questions. That is the mechanism-first story: the architecture matters because biomedical work is not a single reasoning act. It is an evolving chain of decisions, tools, checks, and pivots.

STELLA is built as a research loop, not a chatbot with medical vocabulary

The paper describes STELLA as a multi-agent system with four main roles.

Component	What it does	Why it matters operationally
Manager Agent	Decomposes the research prompt into a reasoning pathway and coordinates the workflow	Turns broad scientific intent into a staged plan
Dev Agent	Builds environments, writes code, runs analyses, trains or applies models, and drafts reports	Makes the agent executable rather than merely verbal
Critic Agent	Reviews intermediate outputs and identifies conceptual or methodological gaps	Prevents early plausible answers from becoming final wrong answers
Tool Creation Agent	Searches for missing capabilities, builds or integrates tools, tests them, and adds them to the Tool Ocean	Allows the system to expand beyond its starting toolkit

The important design choice is the loop among these roles. The Manager plans. The Dev executes. The Critic evaluates. If the Critic identifies a missing capability, the Manager can send the problem to the Tool Creation Agent. A new tool may then be added to the Tool Ocean and used in the current or future workflow.

This is closer to a computational research office than a single assistant. The value is not that every sub-agent has a cute title. The value is separation of responsibilities. Planning, execution, critique, and tool expansion are different failure surfaces. Combining them into one prompt is convenient; separating them is engineering.
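As a rough illustration of that loop, the sketch below wires four plain Python callables together in the order the paper describes: plan, execute, critique, and, when the critique names a missing capability, create a tool and retry. Every name here (`run_research_loop`, `Critique`, the stub agents, the `perturbation_screen` key) is hypothetical and chosen for this example. The paper does not publish this interface, and real agents would be LLM calls rather than stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Tool = Callable[[str], str]


@dataclass
class Critique:
    accepted: bool
    missing_capability: Optional[str] = None  # name of a tool the Critic wants


def run_research_loop(
    prompt: str,
    plan: Callable[[str], List[str]],                      # Manager role
    execute: Callable[[List[str], Dict[str, Tool]], str],  # Dev role
    critique: Callable[[str], Critique],                   # Critic role
    create_tool: Callable[[str], Tool],                    # Tool Creation role
    tool_ocean: Dict[str, Tool],
    max_rounds: int = 3,
) -> str:
    steps = plan(prompt)
    report = ""
    for _ in range(max_rounds):
        report = execute(steps, tool_ocean)
        review = critique(report)
        if review.accepted:
            break
        gap = review.missing_capability
        if gap and gap not in tool_ocean:
            # New tools persist in the shared dict, so later runs can reuse them.
            tool_ocean[gap] = create_tool(gap)
    return report


# --- Stub agents standing in for LLM-backed roles (illustrative only) ---
def stub_plan(prompt: str) -> List[str]:
    return ["describe changes", "explain regulatory logic"]


def stub_execute(steps: List[str], tools: Dict[str, Tool]) -> str:
    if "perturbation_screen" in tools:
        return "mechanistic: " + tools["perturbation_screen"]("resistant state")
    return "descriptive: differential expression only"


def stub_critique(report: str) -> Critique:
    if report.startswith("mechanistic"):
        return Critique(accepted=True)
    return Critique(accepted=False, missing_capability="perturbation_screen")


def stub_create_tool(name: str) -> Tool:
    return lambda target: f"{name} applied to {target}"


ocean: Dict[str, Tool] = {}
final_report = run_research_loop(
    "Explain acquired chemotherapy resistance",
    stub_plan, stub_execute, stub_critique, stub_create_tool, ocean,
)
print(final_report)
```

The first round produces a descriptive report, the critique rejects it and names the gap, and the second round succeeds with the new tool. If `max_rounds` runs out, the sketch returns the last report unaccepted; a production system would need to flag that case rather than pass it through silently.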

The Template Library turns solved problems into reusable strategy

The first self-evolving mechanism is the Template Library. When STELLA completes a successful workflow, the system can distil that workflow into a reusable reasoning template.

This matters because scientific work often has recurring shapes. Drug resistance questions may require comparing resistant and sensitive cell states, building regulatory networks, and identifying a keystone gene. Drug repurposing may require disease signatures, library screening, and binding affinity prediction. Literature-heavy tasks may require retrieval, synthesis, contradiction checking, and evidence grading.

The Template Library is a way to preserve the strategic structure of a successful analysis, not just the final answer. That distinction is important. A knowledge base stores facts. A template library stores ways of moving from question to evidence.
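To make the facts-versus-strategy distinction concrete, here is a minimal sketch of a template store. It assumes a template is an ordered list of analysis steps plus a crude keyword set used for retrieval. The class names, the keyword matching, and the two example templates (paraphrased from the recurring shapes above) are assumptions of this sketch, not the paper's data model.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Template:
    task_family: str
    steps: Tuple[str, ...]    # the strategy, not the answer
    keywords: FrozenSet[str]  # crude retrieval key (an assumption of this sketch)


class TemplateLibrary:
    def __init__(self) -> None:
        self._templates: List[Template] = []

    def add(self, template: Template) -> None:
        self._templates.append(template)

    def best_match(self, question: str) -> Optional[Template]:
        """Return the template sharing the most keywords with the question."""
        words = set(re.findall(r"[a-z]+", question.lower()))
        scored = [(len(t.keywords & words), t) for t in self._templates]
        best_score, best = max(scored, key=lambda pair: pair[0], default=(0, None))
        return best if best_score > 0 else None


library = TemplateLibrary()
library.add(Template(
    "drug_resistance",
    ("compare resistant and sensitive cell states",
     "build regulatory network",
     "identify keystone gene"),
    frozenset({"resistance", "resistant", "relapse"}),
))
library.add(Template(
    "drug_repurposing",
    ("derive disease signature",
     "screen compound library",
     "predict binding affinity"),
    frozenset({"repurposing", "repurpose"}),
))

match = library.best_match("Why did this tumour become resistant to chemotherapy?")
print(match.task_family if match else "no template")
```

Keyword overlap is deliberately naive. A real system would more plausibly retrieve by embedding similarity or let the Manager Agent choose, but the shape of the stored object is the point: steps, not answers.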

For operators, this is where the system starts to resemble an organisational memory layer. Many R&D teams already have this memory, but it lives inside senior scientists, Slack threads, notebooks, local scripts, and the occasional heroic spreadsheet. STELLA formalises the idea that successful research workflows should be captured, reused, and improved.

There is a catch. A template is useful only if it generalises to the next problem without smuggling in assumptions from the previous one. The paper frames the Template Library as a mechanism for learning from experience, but it does not provide a deep component-level ablation showing exactly how much each template contributes across task families. So the business interpretation should be measured: template evolution is a promising architecture for workflow reuse, not yet a complete theory of scientific generalisation.

The Tool Ocean is the sharper idea

The second mechanism, the Tool Ocean, is more operationally disruptive.

Traditional agents are limited by the tools their designers install. STELLA’s Tool Creation Agent can identify a capability gap, search resources such as PubMed or GitHub, build or integrate a tool, test and debug it, and add it to the available toolset. The paper groups tools into three broad categories: database-query functions, interfaces to large biomedical foundation models, and custom analysis scripts. Examples include PubMed, ClinVar, PDB, AlphaFold 3, scGPT, ESM3, and specialised scripts for network analysis or data integration.

This changes the agent from a user of infrastructure into a partial maintainer of infrastructure. That is the business-relevant move.

In biomedical organisations, tool integration is often the quiet tax on research productivity. Teams do not merely need answers; they need environments, dependencies, wrappers, data access, versioning, and enough validation to avoid producing elegant nonsense. If an agent can reduce the manual burden of discovering and integrating tools, it can shorten the distance between a scientific question and an executable analysis.

The phrase “Tool Ocean” is slightly dramatic, but the underlying point is practical. Static toolboxes age. Dynamic tool inventories can, in principle, adapt. The hard part is governance: deciding which newly created tools are reliable, reproducible, secure, licensed appropriately, and safe to use in regulated workflows. An ocean is useful. It is also where people drown.
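One small piece of that governance can be sketched directly: a registry that admits a candidate tool only after it passes its own test cases, and logs every rejection. The sketch assumes tools are plain callables with declared test cases. `ToolRegistry`, `CandidateTool`, and the `gc_content` example are hypothetical; the category identifiers mirror the paper's three groups, but the strings themselves are this sketch's own.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

# The three categories the paper describes; the identifiers are this sketch's own.
CATEGORIES = {"database_query", "foundation_model", "analysis_script"}


@dataclass
class CandidateTool:
    name: str
    category: str
    func: Callable[..., Any]
    test_cases: List[Tuple[tuple, Any]]  # (positional args, expected result)


class ToolRegistry:
    def __init__(self) -> None:
        self.tools: Dict[str, CandidateTool] = {}
        self.rejections: List[Tuple[str, str]] = []  # (tool name, reason)

    def admit(self, candidate: CandidateTool) -> bool:
        reason = self._first_failure(candidate)
        if reason is not None:
            self.rejections.append((candidate.name, reason))
            return False
        self.tools[candidate.name] = candidate
        return True

    @staticmethod
    def _first_failure(candidate: CandidateTool) -> Optional[str]:
        if candidate.category not in CATEGORIES:
            return f"unknown category {candidate.category!r}"
        if not candidate.test_cases:
            return "no test cases supplied"
        for args, expected in candidate.test_cases:
            try:
                actual = candidate.func(*args)
            except Exception as exc:
                return f"raised {type(exc).__name__} on {args!r}"
            if actual != expected:
                return f"expected {expected!r}, got {actual!r} on {args!r}"
        return None


def gc_content(seq: str) -> float:
    """Fraction of G/C bases; returns 0.0 for an empty sequence."""
    return sum(base in "GC" for base in seq.upper()) / len(seq) if seq else 0.0


def gc_content_buggy(seq: str) -> float:
    return sum(base in "GC" for base in seq.upper()) / len(seq)  # fails on ""


cases = [(("ATGC",), 0.5), (("GGCC",), 1.0), (("",), 0.0)]
registry = ToolRegistry()
ok = registry.admit(CandidateTool("gc_content", "analysis_script", gc_content, cases))
bad = registry.admit(CandidateTool("gc_content_v0", "analysis_script", gc_content_buggy, cases))
print(ok, bad, registry.rejections)
```

Passing tests is a necessary gate, not a sufficient one. It says nothing about licensing, security, or whether the test cases themselves are meaningful, which is why a human review layer still has to sit above it.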

The chemotherapy-resistance example shows the loop, not a clinical breakthrough

The paper’s illustrative workflow centres on acquired chemotherapy resistance. STELLA receives a prompt asking it to uncover the mechanism of resistance in a patient’s tumour and propose a targeted re-sensitisation strategy. The Manager Agent creates a reasoning pathway. The Dev Agent preprocesses pre-treatment and post-relapse single-cell RNA sequencing data, annotates cell states, runs differential analysis, and drafts an initial report.

Then the Critic Agent intervenes. The key critique is that describing what changed is not enough. The system needs to explain the regulatory logic and find a “keystone” gene maintaining the resistant state. This triggers the Tool Creation Agent to create a virtual perturbation screening tool using a single-cell perturbation prediction model. The workflow then identifies MTF1 as the master transcription factor controlling the resistance network.

This example is useful, but only if it is read for what it actually demonstrates.

Paper element	Likely purpose	What it supports	What it does not prove
Chemotherapy-resistance workflow	Implementation detail and illustrative case	Shows how Manager, Dev, Critic, and Tool Creation Agent interact in a realistic biomedical analysis pattern	Does not establish MTF1 as a clinically validated target
Critic intervention	Mechanism demonstration	Shows why critique matters: descriptive analysis may be scientifically insufficient	Does not prove the critic is consistently reliable across all domains
Virtual perturbation tool creation	Capability-expansion demonstration	Shows the Tool Ocean concept in action	Does not prove newly created tools are production-grade
Human expert / wet experiment loop in the diagram	Deployment boundary	Acknowledges that expert and experimental feedback remain part of the process	Does not mean such validation was completed in the benchmark evaluation

The correction is simple: STELLA is not being presented as an autonomous wet-lab scientist that discovered a therapy. It is being presented as an adaptive computational agent that can move from descriptive analysis toward a more actionable hypothesis by creating or integrating the missing tool. That is already significant. It is just not the same as curing cancer, despite what a less disciplined headline might imply.

The benchmark results are strongest as evidence of architecture, not magic

The paper evaluates STELLA against leading general LLMs and biomedical agents on three biomedical question-answering benchmarks. The headline results are:

Benchmark	STELLA reported accuracy	Business interpretation	Boundary
Humanity’s Last Exam: Biomedicine	~26%	Strong relative performance on hard biomedical reasoning questions	Evaluation uses a selected set of 50 representative questions following prior sampling practice
LAB-Bench DBQA	~54%	Better handling of database-style biomedical question answering	Uses a 12.5% sampled subset of the benchmark
LAB-Bench LitQA	~63%	Better literature question answering under the benchmark protocol	Multiple-choice format allows structured evaluation but is not equivalent to open-ended research success

The paper reports that STELLA outperforms other tested systems, including general models such as Gemini 2.5 Pro, Claude 4 Opus, DeepSeek-R1, and OpenAI o3, as well as the biomedical agent Biomni. The main results section puts the advantage at up to 8 percentage points, while the abstract says up to 6. That discrepancy is not fatal, but it is a reason not to embellish. The safer reading is that STELLA reports leading benchmark performance by a modest but meaningful margin.

For operators, the magnitude matters. A jump from about 14% to 26% on HLE Biomedicine is large in relative terms (accuracy nearly doubles), but 26% is still not “solved”. A result around 63% on LitQA is useful evidence of capability, but it is not a licence to automate biomedical judgement without review. Benchmark leadership is a signal. It is not a compliance framework.
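The arithmetic behind that caution is short enough to write down. The sketch below takes the approximate HLE Biomedicine figures quoted above and the 50-question sample size from the benchmark table. It assumes both endpoints were measured on that same sample and treats the result as a single binomial sample, which ignores the averaging over three runs, so the standard error is an order-of-magnitude guide only. The function name and output keys are invented for this example.

```python
import math
from typing import Dict


def summarize_gain(before: float, after: float, n_questions: int) -> Dict[str, float]:
    """Absolute and relative gain, plus a rough binomial standard error."""
    std_error = math.sqrt(after * (1 - after) / n_questions)
    return {
        "absolute_gain_pp": round((after - before) * 100, 1),
        "relative_gain_pct": round((after - before) / before * 100, 1),
        "approx_correct_before": round(before * n_questions),
        "approx_correct_after": round(after * n_questions),
        "std_error_pp": round(std_error * 100, 1),
    }


# Approximate figures as quoted in this article; n=50 is the HLE sample size.
hle = summarize_gain(before=0.14, after=0.26, n_questions=50)
for key, value in hle.items():
    print(f"{key}: {value}")
```

On these inputs the gain is 12 percentage points, or roughly 86% in relative terms, which corresponds to about 7 versus 13 correct answers out of 50. The rough standard error on the final figure is around 6 percentage points. A handful of questions carries the headline, which is a reason to treat the number as directional.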

The main evidence is not just that STELLA scores higher. It is that the design appears to convert additional test-time computation into better answers.

Test-time improvement is the paper’s most important performance claim

Figure 2B reports STELLA’s performance as computation budget increases from 1x to 9x trials. The reported results are averages across three independent evaluation runs. HLE Biomedicine rises from about 14% to 26%. LitQA rises from about 52% to 63%. DBQA also improves, reaching about 54%.

This is the closest the paper gets to directly validating the “self-evolving” claim. The result suggests STELLA can use repeated attempts to refine strategies, correct errors, and improve final answers. In ordinary agent terms, this is test-time scaling. In business terms, it suggests a useful trade-off: spend more compute when the question is important enough.

That trade-off is familiar in professional services. You do not spend the same effort on a casual literature query and a high-stakes translational hypothesis. STELLA’s test-time behaviour hints at an agentic version of escalation: cheap pass first, deeper iterative analysis when stakes justify cost.
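A minimal sketch of that escalation policy follows. It assumes a trial is one independent agent run and that answers are combined by majority vote. Majority voting is a stand-in chosen for this example and is not necessarily how STELLA uses its extra trials. The stakes tiers and function names are also hypothetical.

```python
from collections import Counter
from typing import Callable, Tuple

# Hypothetical tiers; 1 and 9 echo the 1x-9x range in the paper's Figure 2B.
TRIALS_BY_STAKES = {"routine": 1, "important": 3, "critical": 9}


def answer_with_budget(
    question: str,
    attempt: Callable[[str, int], str],  # one agent run; second arg is the trial index
    stakes: str,
) -> Tuple[str, float, int]:
    """Run as many trials as the stakes justify and return the majority answer."""
    n_trials = TRIALS_BY_STAKES[stakes]
    answers = [attempt(question, i) for i in range(n_trials)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_trials, n_trials


def noisy_attempt(question: str, trial: int) -> str:
    # Deterministic stand-in for a stochastic agent: answers "A" on every
    # third trial (0, 3, 6, ...) and "B" otherwise.
    return "A" if trial % 3 == 0 else "B"


question = "Which gene maintains the resistant state?"
print(answer_with_budget(question, noisy_attempt, "routine"))
print(answer_with_budget(question, noisy_attempt, "critical"))
```

With this toy agent, the single cheap pass returns "A" with full apparent confidence, while nine trials settle on "B" with a two-thirds vote share. The vote share doubles as a crude signal for when a routine question deserves to be escalated.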

But the interpretation needs discipline. More trials improving benchmark accuracy does not prove that the system is permanently learning in the full organisational sense. It shows performance improvement under increased computation during evaluation. The Template Library and Tool Ocean suggest memory and capability expansion, but the paper does not deeply quantify long-horizon accumulation across many real research programmes. The evidence supports test-time self-improvement; production-grade continual learning remains a separate question.

What STELLA directly shows, what business can infer, and what remains open

The cleanest way to read the paper is to separate three layers.

Layer	Claim	Status
Direct paper evidence	STELLA combines Manager, Dev, Critic, and Tool Creation agents with Template Library and Tool Ocean mechanisms	Directly described in the architecture
Direct paper evidence	STELLA reports leading results on HLE Biomedicine, LAB-Bench DBQA, and LAB-Bench LitQA	Directly reported in benchmarks
Direct paper evidence	Accuracy improves as computation budget increases across trials	Directly reported as a test-time self-evolving effect
Cognaptus inference	Adaptive agent infrastructure could reduce tool-integration and workflow-reuse bottlenecks in biomedical R&D	Plausible operational implication
Cognaptus inference	Template capture could become a form of institutional research memory	Plausible, but depends on governance and generalisation
Still uncertain	Whether STELLA-created tools are consistently reliable, reproducible, compliant, and secure in production	Not established by the reported benchmarks
Still uncertain	Whether benchmark improvements translate into experimentally validated biological discoveries	Not established
Still uncertain	Whether the system remains robust across messy proprietary datasets, incomplete metadata, and regulated clinical workflows	Not established

This separation matters because the worst version of biomedical AI strategy is to confuse a benchmark with a business case. The second-worst version is to dismiss benchmarks entirely because they are not the real world. STELLA deserves neither worship nor eye-rolling. It deserves architectural attention.

The likely business value is in research operations, not replacing scientists

The most immediate application area is not autonomous discovery in the cinematic sense. It is research operations.

A pharma or biotech organisation could use a STELLA-like architecture to make internal research workflows more reusable. Instead of every team rebuilding analysis pipelines for resistance mechanisms, variant interpretation, target discovery, or literature review, successful workflows could become templates. Instead of waiting for a human engineer to wrap every new tool, an agent could propose integrations, run tests, and prepare reviewable tool modules.

That does not remove scientists. It changes where scientists spend time. Less time on finding the right package version. More time on asking whether the analysis is biologically meaningful. A civilised arrangement, frankly.

The economic logic is straightforward:

Technical contribution	Operational consequence	ROI relevance
Template Library	Reuses proven reasoning pathways across similar research problems	Reduces repeated senior-scientist effort
Tool Creation Agent	Identifies and integrates missing computational capabilities	Reduces tool-discovery and pipeline-setup friction
Critic Agent	Challenges intermediate results before final reporting	Lowers risk of plausible but incomplete analysis
Test-time scaling	Allows deeper effort for higher-stakes questions	Enables compute allocation by business priority
Multi-agent separation	Makes planning, execution, critique, and tooling inspectable	Supports audit and process design better than one opaque prompt

The strongest buyer is not necessarily a bench scientist asking casual questions. It may be a computational biology platform team, translational research group, or AI infrastructure team trying to standardise how research questions become executable workflows.

Deployment will be governed by trust infrastructure

STELLA’s architecture creates value precisely where it also creates risk.

If an agent can create tools, then tool validation becomes central. Who approves a new tool? How are dependencies pinned? How are data licences checked? How is output audited? How are failed tool creations logged? How does the system prevent a useful script from becoming an unreviewed clinical decision aid?

For regulated biomedical environments, the Tool Ocean cannot be an uncontrolled soup of scripts and APIs. It needs layers: sandboxed experimentation, review queues, validated tool registries, provenance tracking, access control, and human sign-off for high-impact use. The paper’s figure includes human expert and wet experiment feedback in the loop, which is the right instinct. The business implementation needs to make that loop enforceable, not decorative.
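Those layers can be expressed as a small state machine. The sketch below assumes three stages (sandbox, review, validated), an append-only provenance list, and a rule that only a named human can validate a tool. The stage names, the `promote` and `may_run` functions, and the approver address are hypothetical, and a real deployment would also need access control and dependency pinning, which are not shown.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    SANDBOX = "sandbox"
    IN_REVIEW = "in_review"
    VALIDATED = "validated"


# Allowed moves; review can send a tool back to the sandbox.
TRANSITIONS = {
    Stage.SANDBOX: {Stage.IN_REVIEW},
    Stage.IN_REVIEW: {Stage.VALIDATED, Stage.SANDBOX},
    Stage.VALIDATED: set(),
}


@dataclass
class ToolRecord:
    name: str
    stage: Stage = Stage.SANDBOX
    provenance: List[str] = field(default_factory=list)  # append-only audit trail


def promote(tool: ToolRecord, target: Stage, approver: Optional[str] = None) -> None:
    if target not in TRANSITIONS[tool.stage]:
        raise ValueError(f"{tool.stage.value} -> {target.value} is not allowed")
    if target is Stage.VALIDATED and not approver:
        raise PermissionError("validation requires a named human approver")
    tool.provenance.append(f"{tool.stage.value}->{target.value} by {approver or 'agent'}")
    tool.stage = target


def may_run(tool: ToolRecord, high_impact: bool) -> bool:
    # Exploratory work may use any stage; high-impact work needs a validated tool.
    return tool.stage is Stage.VALIDATED or not high_impact


tool = ToolRecord("virtual_perturbation_screen")
promote(tool, Stage.IN_REVIEW)
try:
    promote(tool, Stage.VALIDATED)  # no approver: refused
except PermissionError as err:
    print("blocked:", err)
promote(tool, Stage.VALIDATED, approver="reviewer@example.org")
print(tool.stage.value, may_run(tool, high_impact=True), tool.provenance)
```

The design choice worth copying is that the sign-off lives in the transition function rather than in a policy document: an agent can request promotion, but the code path to `VALIDATED` does not exist without a human name attached.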

The Template Library also needs governance. A template that worked for one dataset may fail when assumptions change. If templates become organisational memory, they can also become organisational bias. The system should record not only successful workflows, but the conditions under which they succeeded, the datasets used, the assumptions made, and the failure modes observed.

In other words, self-evolving agents require self-disciplining infrastructure. Evolution without selection pressure is just mutation with a nicer slide deck.

The missing ablation is important

The paper’s results support the integrated system, but they do not fully isolate the contribution of each mechanism. We do not get a detailed ablation showing STELLA without the Template Library, without Tool Ocean expansion, without the Critic Agent, or with a fixed toolset under otherwise identical conditions.

That matters for buyers. If most of the performance gain came from stronger base models and more test-time trials, the implementation priority would differ. If the Tool Creation Agent drives the hardest gains, investment should focus on secure tool integration and validation. If the Template Library contributes most to repeated domain workflows, the key product layer is workflow memory and template governance.
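For readers who want to see what such an ablation would involve, the sketch below enumerates a leave-one-out design over the three mechanisms discussed here and defines how a contribution would be computed from the resulting accuracies. It contains no results. The flag names and both functions are assumptions of this sketch, and the paper does not report such an experiment.

```python
from typing import Dict, List, Tuple

# Mechanisms named in the article; the flag names are this sketch's own.
MECHANISMS = ("template_library", "tool_creation", "critic")

Config = Dict[str, bool]


def leave_one_out(mechanisms: Tuple[str, ...]) -> List[Tuple[str, Config]]:
    """The full system plus one run per mechanism with that mechanism disabled."""
    full: Config = {m: True for m in mechanisms}
    runs = [("full", full)]
    for m in mechanisms:
        runs.append((f"without_{m}", {**full, m: False}))
    return runs


def contribution_pp(accuracy: Dict[str, float]) -> Dict[str, float]:
    """Accuracy lost (percentage points) when each mechanism is removed."""
    return {
        name: round((accuracy["full"] - acc) * 100, 1)
        for name, acc in accuracy.items()
        if name != "full"
    }


for name, config in leave_one_out(MECHANISMS):
    print(name, config)
```

The comparison is only meaningful if every configuration runs with the same base model, the same question sample, and the same trial budget. Leave-one-out also misses interactions between mechanisms, which a full factorial design would capture at the cost of eight runs instead of four.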

The current paper is sufficient to make STELLA worth watching. It is not sufficient to decide exactly which subsystem deserves the largest enterprise budget. Annoying, but useful to know before procurement discovers adjectives.

The strategic signal: biomedical AI is moving from answer engines to adaptive infrastructure

The broader shift is clear. Biomedical AI systems are moving beyond retrieval and response. The next useful layer is adaptive infrastructure: systems that can plan, execute, critique, expand tools, and preserve reusable workflows.

STELLA is an example of that shift. It does not merely say, “Here is an answer.” It tries to ask, “What workflow is needed? What tool is missing? What should be checked before the answer is final? What should be remembered for next time?”

That is a better framing for enterprise adoption. Organisations do not need AI agents that pretend to be scientists. They need systems that make scientific work less fragmented, less repetitive, and more inspectable. STELLA’s contribution is to show one architecture for doing that in biomedical research.

The future biomedical AI stack will likely include static curated tools, dynamic tool integration, reasoning templates, code execution, model interfaces, retrieval systems, audit trails, and human validation. The winning systems will not be those with the loudest “autonomous scientist” branding. They will be the ones that make the research process faster without making it ungovernable.

STELLA is not the end of biomedical discovery. It is a credible sketch of the middleware layer between human scientific judgement and the expanding mess of computational biology. In this field, that may be enough to matter.

Cognaptus: Automate the Present, Incubate the Future.

¹ Ruofan Jin, Zaixi Zhang, Mengdi Wang, and Le Cong, “STELLA: Self-Evolving LLM Agent for Biomedical Research,” arXiv:2507.02004, 2025. https://arxiv.org/abs/2507.02004
