SAGA, Not Sci‑Fi: When LLMs Start Doing Science

Science usually fails in a boring way.

Not with explosions. Not with a robot dramatically discovering penicillin 2.0 while violins swell in the background. More often, a research workflow fails because somebody optimized the wrong thing a little too efficiently.

A molecule scores well but is chemically ugly. A nanobody looks good under one predictor but fails to bind. A DNA enhancer activates the target cell line but also lights up the wrong tissue. A separation process reaches high purity by adding pointless unit operations, because the reward function forgot to punish industrial nonsense. The optimizer did its job. Unfortunately, the job description was incomplete.

That is the useful idea behind SAGA, the Scientific Autonomous Goal-evolving Agent introduced in “Accelerating Scientific Discovery with Autonomous Goal-evolving Agents.”¹ The paper is not interesting merely because it adds LLM agents to scientific design. Everyone is adding LLM agents to everything; at this point, a toaster with a planning module is only a funding round away. The more important move is architectural: SAGA treats the objective function itself as something to discover.

This article takes a mechanism-first approach, because a domain-by-domain tour would make SAGA look like a grab bag of demonstrations: antibiotics, nanobodies, DNA sequences, inorganic materials, chemical processes. Impressive, yes. Also cognitively messy. The reusable contribution is not that one agent can dabble in many scientific areas. It is that scientific discovery often needs a second optimization loop above the obvious one.

The inner loop asks: “Given this scoring function, what candidate looks best?”

The outer loop asks the more dangerous question: “Is this still the right scoring function?”

That second question is where the paper earns attention.

The bottleneck is not always the generator

A familiar story says that AI discovery improves when the generator gets stronger. Better model, better search, better candidate. Sometimes that is true. But SAGA targets a different bottleneck: fixed objectives are often incomplete proxies for scientific success.

This matters because optimization systems are very obedient. If a reward function says “maximize predicted activity,” the optimizer may find candidates that exploit model artifacts, chemical shortcuts, or unbalanced trade-offs. The candidate can look excellent by the score and suspicious by the lab notebook. In business language, the KPI was met, and the business still lost money. A classic achievement.

SAGA’s answer is a bi-level structure:

| Layer | Question | Main function | Failure it tries to catch |
| --- | --- | --- | --- |
| Inner loop | What candidate best satisfies the current objectives? | Generate, score, and select candidates | Slow or weak search |
| Outer loop | What objectives should the system optimize next? | Analyze failure modes, propose objectives, implement scoring functions | Reward hacking, missing constraints, bad trade-offs |
| Human-control layer | How much should experts steer the loop? | Co-pilot, semi-pilot, or autopilot mode | Blind automation or excessive manual bottlenecks |

The four core modules make this more concrete. The Planner proposes measurable objectives. The Implementer turns those objectives into executable scoring functions, or rejects objectives that cannot be computed reliably. The Optimizer runs the candidate search. The Analyzer reviews score trends and candidate-level features, identifies bottlenecks, and tells the Planner where the objective set is failing.

This is not “LLM writes a hypothesis and hopes science applauds.” It is closer to an automated version of what disciplined scientists already do: inspect failed candidates, notice that the scoring criteria were incomplete, add a missing constraint, and rerun the search.
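The division of labor among the four modules can be sketched in a few lines. This is a toy illustration under loud assumptions, not the paper's implementation: the candidate is a single number, the scoring functions in `LIBRARY` are invented, the Optimizer is a grid search, and every function name here is hypothetical.

```python
from typing import Callable, Dict, List

Objective = Callable[[float], float]

# Hypothetical scoring-function library; a real one would wrap domain predictors.
LIBRARY: Dict[str, Objective] = {
    "activity": lambda x: x,                  # toy: more x means more "activity"
    "drug_likeness": lambda x: 1.0 - x * x,   # toy: realism degrades as x grows
}

def implementer(names: List[str]) -> List[Objective]:
    """Turn objective names into executable scorers, dropping unknown ones."""
    return [LIBRARY[name] for name in names if name in LIBRARY]

def optimizer(objectives: List[Objective]) -> float:
    """Inner loop: grid search for the candidate with the best summed score."""
    grid = [i / 100 for i in range(101)]
    return max(grid, key=lambda x: sum(f(x) for f in objectives))

def analyzer(best: float) -> List[str]:
    """Flag concerns that the current objective set failed to protect."""
    return ["drug_likeness"] if LIBRARY["drug_likeness"](best) < 0.5 else []

def planner(current: List[str], flagged: List[str]) -> List[str]:
    """Outer loop: add an objective for each flagged concern not yet covered."""
    return current + [name for name in flagged if name not in current]

def run_saga_like_loop(rounds: int = 3) -> Dict[str, object]:
    names = ["activity"]
    history = []
    for _ in range(rounds):
        best = optimizer(implementer(names))
        history.append((list(names), best))
        updated = planner(names, analyzer(best))
        if updated == names:          # nothing left to fix: stop evolving
            break
        names = updated
    return {"objectives": names, "best": best, "history": history}

result = run_saga_like_loop()
```

In this toy, the first round maximizes activity alone and lands on the candidate with the worst drug-likeness. The Analyzer flags it, the Planner adds the missing objective, and the second round settles on a compromise. The inner loop never changed; only its job description did.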

The paper’s useful abstraction is therefore not “agentic science.” It is objective governance.

SAGA turns scientific judgment into an iterative control loop

The paper defines three operating modes, and the distinction is important because the system is not pretending that autonomy is always the mature option.

In co-pilot mode, scientists collaborate with both the Planner and Analyzer. They can revise proposed objectives and critique analysis before execution continues. This is the most controlled form, useful when domain judgment is expensive but essential.

In semi-pilot mode, humans intervene mainly at the Analyzer stage. They review outcomes, provide strategic feedback, and let the system translate that feedback into objectives. This is closer to an expert supervisor guiding an automated team.

In autopilot mode, the Planner, Implementer, Optimizer, and Analyzer run without human intervention. This is attractive when the design space, constraints, and computational validators are mature enough to justify hands-off exploration.

That last condition is not a footnote; it is the point. Autopilot is not magic. It is a workflow mode that becomes reasonable only when the system has reliable ways to score progress and diagnose failure. SAGA does not remove the need for scientific judgment. It redistributes it across the workflow: some judgment appears in the initial goal and context, some in human feedback, some in scoring-function libraries, and some in the Analyzer’s structured critique.
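One way to read the three modes is as a configuration choice about where a human checkpoint sits. The sketch below encodes only what the descriptions above state; the names `Mode`, `HUMAN_CHECKPOINTS`, and `needs_human_review` are hypothetical and do not come from the paper.

```python
from enum import Enum
from typing import Dict, FrozenSet

class Mode(Enum):
    CO_PILOT = "co-pilot"
    SEMI_PILOT = "semi-pilot"
    AUTOPILOT = "autopilot"

# Stages at which a scientist reviews or revises output before the loop continues.
HUMAN_CHECKPOINTS: Dict[Mode, FrozenSet[str]] = {
    Mode.CO_PILOT: frozenset({"planner", "analyzer"}),
    Mode.SEMI_PILOT: frozenset({"analyzer"}),
    Mode.AUTOPILOT: frozenset(),
}

def needs_human_review(mode: Mode, stage: str) -> bool:
    """Return True if the given stage pauses for expert input in this mode."""
    return stage in HUMAN_CHECKPOINTS[mode]
```

Framed this way, moving from co-pilot to autopilot is not a change of architecture. It is the removal of checkpoints, which is only sensible once the scoring functions behind them have earned that trust.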

The paper’s architecture is also modular. The Optimizer can be an LLM-based evolutionary algorithm, a genetic algorithm, reinforcement learning, or another domain-specific engine. SAGA is not claiming that one generative model rules them all. The outer loop is the general idea; the inner loop can be swapped.

That makes the framework more credible. Scientific domains do not share one natural representation. Molecules, proteins, DNA sequences, crystals, and process flowsheets are not interchangeable toys. A general framework must generalize at the level of workflow control, not at the level of pretending every object is just text.
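A swappable inner loop is mostly an interface decision. The following sketch assumes a hypothetical `InnerOptimizer` protocol and two deliberately simple search engines over strings; the paper's actual optimizers (LLM-based evolution, genetic algorithms, reinforcement learning) are far richer, and none of these class names are theirs.

```python
import random
from typing import Callable, Protocol

Scorer = Callable[[str], float]

class InnerOptimizer(Protocol):
    """Anything that can search for a high-scoring candidate string."""
    def search(self, score: Scorer, budget: int) -> str: ...

class RandomSearch:
    def __init__(self, alphabet: str, length: int, seed: int = 0) -> None:
        self.alphabet, self.length = alphabet, length
        self.rng = random.Random(seed)

    def search(self, score: Scorer, budget: int) -> str:
        # Sample `budget` random candidates and keep the best one.
        pool = ["".join(self.rng.choice(self.alphabet) for _ in range(self.length))
                for _ in range(budget)]
        return max(pool, key=score)

class HillClimb:
    def __init__(self, alphabet: str, length: int, seed: int = 0) -> None:
        self.alphabet, self.length = alphabet, length
        self.rng = random.Random(seed)

    def search(self, score: Scorer, budget: int) -> str:
        # Mutate one position at a time, keeping changes that do not hurt.
        current = self.alphabet[0] * self.length
        for _ in range(budget):
            pos = self.rng.randrange(self.length)
            proposal = current[:pos] + self.rng.choice(self.alphabet) + current[pos + 1:]
            if score(proposal) >= score(current):
                current = proposal
        return current

def run_inner_loop(optimizer: InnerOptimizer, score: Scorer, budget: int = 200) -> str:
    """The outer loop only needs this call; it never inspects the engine."""
    return optimizer.search(score, budget)

def gc_fraction(seq: str) -> float:
    """Toy objective: fraction of G and C characters."""
    return sum(base in "GC" for base in seq) / len(seq)

best_random = run_inner_loop(RandomSearch("ACGT", 8), gc_fraction)
best_climb = run_inner_loop(HillClimb("ACGT", 8), gc_fraction)
```

Either engine satisfies the same contract, so the objective-evolving layer above it does not need to know which one is running.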

The antibiotic result shows why “more activity” is not enough

The antibiotic task is the clearest example of objective evolution doing useful work.

SAGA starts with primary goals such as antibacterial activity, novelty, safety, and synthesizability. Candidates are evaluated using biological and chemical criteria, including predicted antibacterial activity, novelty, toxicity, known motif filters, QED drug-likeness, synthetic accessibility, and other drug-likeness filters. The main comparison is not just “did SAGA generate molecules?” but whether it avoided the common failure where activity optimization damages chemical realism.

The paper reports that fixed-objective baselines struggled to balance activity and practical medicinal chemistry. The standalone optimizer version, SAGA-Opt, over-optimized the primary biological objectives and produced candidates with lower drug-likeness. That is the main ablation logic: remove the outer objective-evolving loop, and the system becomes a more ordinary optimizer with more ordinary failure modes.

The Analyzer then does what a useful reviewer should do. It identifies population-level patterns, such as a negative relationship between predicted antibacterial activity and drug-likeness. It also points to structural problems in high-scoring candidates, including metabolically labile motifs such as primary amines, phenols, and morpholines. The Planner responds by adding objectives such as metabolic stability and custom drug-likeness filtering.

This is the mechanism in miniature:

1. Optimize under the current objectives.
2. Inspect where the population is becoming scientifically unattractive.
3. Add a computable objective for the missing concern.
4. Rerun the search under the updated target.
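The inspection step in that loop can start with something as plain as a correlation check across the population. The sketch below is a minimal illustration, assuming made-up scores and a hypothetical `flag_tradeoffs` helper with an arbitrary threshold; the paper's Analyzer is an LLM-driven component that does considerably more than this.

```python
from math import sqrt
from typing import Dict, List, Sequence

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def flag_tradeoffs(scores: Dict[str, List[float]], primary: str,
                   threshold: float = -0.5) -> List[str]:
    """Names of metrics that fall as the primary objective rises."""
    return [name for name, values in scores.items()
            if name != primary and pearson(scores[primary], values) < threshold]

# Invented population scores, one entry per candidate (not data from the paper).
population = {
    "activity": [0.2, 0.4, 0.6, 0.8, 0.9],
    "qed":      [0.8, 0.7, 0.5, 0.4, 0.2],
    "novelty":  [0.5, 0.6, 0.5, 0.6, 0.5],
}
flagged = flag_tradeoffs(population, primary="activity")
```

Here only `qed` is flagged, which is the signal a Planner would need to justify adding a drug-likeness objective rather than simply pushing activity harder.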

The wet-lab validation is the paper’s strongest reality check. The authors synthesized the top 28 molecules designed by SAGA and tested them against E. coli. Four compounds showed more than 80% growth inhibition at 128 μg/mL when combined with polymyxin B nonapeptide, which increases bacterial-cell permeability. Follow-up testing found that compound 4 had an MIC of 16 μg/mL, while the other three had MICs of 128 μg/mL. Among them, compound 8 showed minimal cytotoxicity in both HEK293 and HepG2 human cell lines at its MIC. Compound 8 was also structurally novel, with Tanimoto similarity of 0.28 to known antibiotics, below the paper’s cited novelty threshold of 0.4.

This is not a finished antibiotic. The authors explicitly note that potency and permeability still need improvement. But for a hit-discovery stage, the result is meaningful: SAGA did not merely produce high-scoring fantasies. It produced experimentally testable candidates, one of which combined activity, novelty, and low observed cytotoxicity under the tested conditions.
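The novelty figure quoted above rests on Tanimoto similarity, which for binary fingerprints is the size of the intersection divided by the size of the union. The sketch below uses plain Python sets as stand-in fingerprints. Treating novelty as "maximum similarity to any known compound stays below the threshold" is an assumption about how such a check is typically applied, and `is_novel` is a hypothetical helper, not the paper's code.

```python
from typing import Iterable, Set

def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by total distinct on-bits."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

def is_novel(candidate: Set[int], known: Iterable[Set[int]],
             threshold: float = 0.4) -> bool:
    """True if the candidate's nearest known neighbor is below the threshold."""
    nearest = max((tanimoto(candidate, ref) for ref in known), default=0.0)
    return nearest < threshold

# Invented fingerprints for illustration only.
candidate_bits = {1, 2, 3, 4, 5}
known_library = [{4, 5, 6, 7, 8, 9, 10}, {20, 21, 22}]
novel = is_novel(candidate_bits, known_library)
```

In practice the fingerprints would come from a cheminformatics toolkit rather than hand-written sets, and the choice of fingerprint changes the numbers, so a threshold such as 0.4 is only meaningful alongside the fingerprint definition it was computed with.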

The business interpretation is modest but important. The value is not “AI discovers drugs now, please replace the chemistry department.” The value is earlier triage: fewer reward-hacked candidates, better detection of missing constraints, and a clearer record of why objectives changed.

The nanobody result tests whether a composite objective can beat familiar metrics

The nanobody section asks a related but different question. In protein binder design, there are many in silico metrics: structure-prediction confidence, binding interface scores, hydrogen bonds, salt bridges, buried surface area, epitope contacts, sequence-structure compatibility, and developability penalties. The problem is not a lack of metrics. It is knowing which combination matters for this target.

The target here is PD-L1, a clinically relevant immune checkpoint protein. SAGA designs de novo nanobodies and compares its candidates with outputs from BoltzGen and Germinal. The evaluation spans structural quality, binding interface characteristics, epitope engagement, sequence-structure compatibility, developability, and computational efficiency. To reduce dependence on a single predictor, the paper uses both AlphaFold3 and Boltz2 for structure prediction.

Again, the main story is objective evolution. In one workflow, experts notice that CDR regions have low confidence despite decent global metrics. SAGA responds by adding pLDDT-based structural confidence terms, ProteinMPNN compatibility terms, and a CDR3 alpha-helix constraint. Later, when the helix exists but does not necessarily engage the relevant PD-L1 epitope, SAGA adds hotspot-contact objectives.

That sequence matters. A naive optimizer might stop at “the structure looks confident.” SAGA’s loop asks whether the confident structure is doing the right biological job. Confidence is useful. Functional contact is better.

The experimental validation is compact but important. The authors tested 24 SAGA-designed nanobody candidates using biolayer interferometry and found three true PD-L1 binders, with dissociation constants in the 300–400 nM range. The binders were also novel by CDR3 sequence comparison: their CDR3 sequences shared less than 20% similarity with nanobodies designed by the compared methods and with antibodies in SAbDab. One predicted structure, nanobody A2, used a non-canonical alpha-helical CDR3 rather than the more familiar loop topology.

The most interesting statistic is not only that three binders were found. It is that no single in silico metric significantly separated binders from non-binders, while the composite scoring function evolved by SAGA did, with $p = 0.03$ by the paper’s Mann–Whitney U test.

That result should be read carefully. It does not prove that SAGA has discovered a universal nanobody scoring law. It suggests something narrower and more operationally useful: for this design task, the evolved composite objective captured signal that individual metrics missed. In practical R&D, that is often exactly the missing layer. Teams do not need one perfect metric; they need a disciplined way to assemble, test, and revise imperfect metrics.
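For readers who want to see what a Mann–Whitney comparison looks like on a set this small, here is a minimal exact version. The scores are invented, the one-sided permutation form is an assumption (the summary above does not say which variant the authors ran), and both function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence

def u_statistic(group: Sequence[float], other: Sequence[float]) -> float:
    """Count pairs where a group member outranks an other member (ties count 0.5)."""
    return sum(1.0 if g > o else 0.5 if g == o else 0.0
               for g in group for o in other)

def exact_one_sided_p(group: Sequence[float], other: Sequence[float]) -> float:
    """Share of all relabelings whose U is at least as large as the observed U."""
    pooled = list(group) + list(other)
    observed = u_statistic(group, other)
    at_least_as_extreme = 0
    total = 0
    for chosen in combinations(range(len(pooled)), len(group)):
        picked = set(chosen)
        relabeled = [pooled[i] for i in chosen]
        rest = [pooled[i] for i in range(len(pooled)) if i not in picked]
        total += 1
        if u_statistic(relabeled, rest) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / total

# Invented composite scores: three binders versus six non-binders.
binder_scores = [0.91, 0.88, 0.84]
non_binder_scores = [0.80, 0.72, 0.69, 0.65, 0.61, 0.55]
p_value = exact_one_sided_p(binder_scores, non_binder_scores)  # 1/84, about 0.012
```

With only three binders, the number of possible relabelings caps how small the p-value can get, which is one reason a single significant result on a small panel is better read as a lead than as a law.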

The DNA and materials results show objective evolution beyond wet-lab validation

The functional DNA sequence experiments extend the argument to regulatory sequence design. SAGA is used to design HepG2-specific enhancers. The initial objectives target predicted MPRA activity in HepG2 while minimizing activity in off-target cell lines such as K562 and SKNSH.

The risk is familiar: optimizing expression can produce sequences that score well under one predictor but lack biological plausibility, diversity, or stability. SAGA adds objectives for hepatocyte motif enrichment, GC-content-related stability, and off-target suppression. The paper reports that SAGA outperforms baselines across multiple metrics, with especially large gains in MPRA specificity and motif enrichment under controlled comparisons. It also reports that SAGA-designed enhancers recover liver-specific transcription factor motifs and show supportive signals under independent regulatory predictors such as CAGE-seq and DNase-seq through Enformer-based evaluation.

Here the evidence is computational rather than wet-lab. That does not make it useless, but it changes the interpretation. The DNA results support the claim that objective evolution can improve biological plausibility under held-out and cross-modality predictors. They do not prove in vivo enhancer function. The paper’s own design makes this clear: the evaluation is a blind computational assessment, not a physical assay.
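The enhancer objectives described above are easy to make concrete. This sketch is an assumption-heavy illustration: defining specificity as target activity minus the strongest off-target activity, the GC window of 0.4 to 0.6, and the penalty weight are all invented for the example, and the predicted activities would come from a trained MPRA predictor rather than a hand-typed dictionary.

```python
from typing import Dict

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)

def specificity(predicted: Dict[str, float], target: str = "HepG2") -> float:
    """Target-cell activity minus the strongest off-target activity."""
    off_target = [value for cell, value in predicted.items() if cell != target]
    return predicted[target] - max(off_target)

def enhancer_score(seq: str, predicted: Dict[str, float],
                   gc_low: float = 0.4, gc_high: float = 0.6,
                   gc_penalty: float = 1.0) -> float:
    """Specificity, penalized by how far GC content strays outside a window."""
    gc = gc_content(seq)
    outside = max(0.0, gc_low - gc, gc - gc_high)
    return specificity(predicted) - gc_penalty * outside

# Invented predictor outputs for one candidate sequence.
predicted_activity = {"HepG2": 2.0, "K562": 0.5, "SKNSH": 1.0}
balanced = enhancer_score("GGCCAATT", predicted_activity)
gc_heavy = enhancer_score("GGGGGGGG", predicted_activity)
```

The second call shows the point of the added term: two sequences with identical predicted activity no longer tie once a plausibility constraint enters the score.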

The materials section makes the same architecture work on a different object class. For permanent magnets, SAGA starts with objectives such as magnetic density and low supply-chain risk, using the Herfindahl–Hirschman Index as a proxy for avoiding risky elements such as rare earths. For superhard materials, it starts with bulk and shear modulus, then evolves objectives around hardness, brittleness, and thermodynamic stability. DFT calculations are used for evaluation.
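The Herfindahl–Hirschman Index itself is simple arithmetic: the sum of squared market shares, expressed in percent, so a monopoly scores 10,000 and an even four-way split scores 2,500. The sketch below computes it and then shows one plausible way to roll element-level values up to a compound. Weighting by atomic fraction is an assumption, and the element names and values are placeholders, not real supply data.

```python
from typing import Dict, Sequence

def hhi(shares_percent: Sequence[float]) -> float:
    """Sum of squared market shares, with shares given in percent."""
    if abs(sum(shares_percent) - 100.0) > 1e-6:
        raise ValueError("shares must sum to 100 percent")
    return sum(share * share for share in shares_percent)

def composition_hhi(fractions: Dict[str, float],
                    element_hhi: Dict[str, float]) -> float:
    """Composition-weighted average of element-level HHI values (assumed scheme)."""
    total = sum(fractions.values())
    return sum(fraction / total * element_hhi[element]
               for element, fraction in fractions.items())

# Placeholder elements and values, for illustration only.
element_risk = {"X": 1000.0, "Y": 5000.0}
compound_risk = composition_hhi({"X": 1, "Y": 1}, element_risk)
```

A lower value means production is spread across more suppliers, which is why it works as a rough proxy for supply-chain risk in a scoring function.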

The important point is not that SAGA magically knows materials science. It uses domain tools and scoring functions. The contribution is that it can notice when a fixed target is too narrow. In the superhard-materials example, minimizing formation energy alone can push the search toward stable but mechanically weak regions. SAGA’s Analyzer identifies the misalignment and adds objectives more directly tied to cutting-tool performance, such as hardness and Pugh-ratio-related brittleness.

This is exactly how multi-objective work usually becomes painful: one metric improves by quietly sacrificing another. The outer loop makes that trade-off visible.
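The brittleness objective mentioned for superhard materials can be illustrated with the Pugh ratio, the bulk modulus divided by the shear modulus. Pugh's empirical rule of thumb treats values above roughly 1.75 as ductile and values below it as brittle. The helper names and the moduli below are hypothetical; how SAGA's scoring function actually uses the ratio is not specified here.

```python
def pugh_ratio(bulk_gpa: float, shear_gpa: float) -> float:
    """Bulk modulus over shear modulus (B/G), both in GPa."""
    if shear_gpa <= 0:
        raise ValueError("shear modulus must be positive")
    return bulk_gpa / shear_gpa

def likely_brittle(bulk_gpa: float, shear_gpa: float,
                   threshold: float = 1.75) -> bool:
    """Empirical Pugh criterion: B/G below the threshold suggests brittleness."""
    return pugh_ratio(bulk_gpa, shear_gpa) < threshold

# Invented moduli for two hypothetical candidates.
stiff_candidate = likely_brittle(bulk_gpa=300.0, shear_gpa=250.0)   # B/G = 1.2
soft_candidate = likely_brittle(bulk_gpa=400.0, shear_gpa=100.0)    # B/G = 4.0
```

The tension is visible even in this toy: a high shear modulus helps hardness but pulls B/G down toward the brittle side, so an optimizer that sees only one of the two numbers will happily trade away the other.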

Chemical process design exposes the industrial version of reward hacking

The chemical process design task may be the most business-relevant example, because the failure mode is easy to understand.

The baseline reinforcement-learning agent optimizes product purity for separation flowsheets. Product purity is obviously important. It is also insufficient. If the reward does not penalize unnecessary operations, fragmented flows, or poor robustness across feed compositions, the agent can design processes that look successful under the main target while being awkward or inefficient in practice.

SAGA adds outer-loop analysis over text representations of flowsheets. The Analyzer identifies issues such as excessive process complexity, too many product streams, feed-composition sensitivity, and loopholes where “no separation” can still score deceptively under certain objectives. The Planner then proposes objectives such as process complexity, component recovery, material efficiency, and flow-intensity penalties.
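The gap between a purity-only reward and a revised one fits in a small example. Everything below is hypothetical: the `Flowsheet` fields, the penalty weights, and the multiplicative purity-times-recovery form are assumptions chosen to make the trade-off visible, not the objectives or weights used in the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Flowsheet:
    purity: float             # product purity, 0 to 1
    recovery: float           # fraction of target component recovered, 0 to 1
    n_units: int              # number of unit operations
    n_product_streams: int    # number of outlet product streams

def naive_reward(sheet: Flowsheet) -> float:
    """Baseline reward: purity and nothing else."""
    return sheet.purity

def revised_reward(sheet: Flowsheet, unit_penalty: float = 0.02,
                   stream_penalty: float = 0.01, target_streams: int = 2) -> float:
    """Purity weighted by recovery, minus charges for complexity and fragmentation."""
    extra_streams = max(0, sheet.n_product_streams - target_streams)
    return (sheet.purity * sheet.recovery
            - unit_penalty * sheet.n_units
            - stream_penalty * extra_streams)

# Two invented designs: one bloated but very pure, one lean and slightly less pure.
bloated = Flowsheet(purity=0.99, recovery=0.95, n_units=12, n_product_streams=6)
lean = Flowsheet(purity=0.97, recovery=0.94, n_units=3, n_product_streams=2)
```

Under `naive_reward` the bloated design wins. Under `revised_reward` the lean design wins comfortably, because the twelve unit operations are finally being charged to someone.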

This is not just a chemical engineering story. It is a clean example of enterprise AI failure. If the system optimizes a narrow KPI, it will often discover ways to satisfy the dashboard while irritating reality. In operations, logistics, pricing, compliance, customer service, and procurement, the same pattern appears under different costumes.

The useful business lesson is not “let agents make process designs.” It is: every automated optimization loop needs a second loop that audits whether the target still represents the business goal.

How to read the evidence without getting dazzled

Because the paper covers five domains, it is easy to overread the breadth as the main contribution. A better reading separates the evidence by purpose.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| SAGA vs. SAGA-Opt | Ablation | The outer objective-evolving loop improves over fixed-objective optimization | That every proposed objective is scientifically optimal |
| Antibiotic wet-lab tests | Main experimental validation | SAGA can produce testable, novel hit candidates with activity and safety signals | That the discovered compound is a ready drug |
| Nanobody BLI tests | Main experimental validation | SAGA-designed candidates can bind PD-L1, and the evolved composite score may rank binders better than single metrics | Universal binder-scoring validity |
| DNA held-out and cross-modality predictors | Computational generalization and robustness-style evidence | Designed sequences are not merely optimizing one MPRA predictor | Actual in vivo function |
| Materials DFT evaluation | Independent computational validation | Candidate structures satisfy relevant predicted physical-property criteria | Manufacturability or commercial deployment |
| Chemical process task | Implementation and workflow evidence | SAGA can guide an RL process designer away from narrow reward failures | Industrial-scale process readiness |
| Supplementary ablations and objective-transfer tests | Robustness, implementation checks, and exploratory extensions | The Implementer and evolved objectives can be useful beyond one run | Full reliability across unknown domains |

This table is where the paper becomes more credible, not less. The strongest claims are supported by physical validation in antibiotics and nanobodies. The broader domain claims are supported by computational evaluations and workflow demonstrations. Those are valuable, but they sit at a different evidentiary level.

Good interpretation keeps those levels separate. Hype mixes them into soup.

The business value is objective governance, not automated genius

For business readers, SAGA should not be filed under “AI scientist replaces scientists.” That category is mostly good for conference panels and mild investor fever.

The practical category is adaptive optimization governance.

Many companies already optimize against proxy metrics. Lead scoring, churn prediction, credit risk, inventory planning, ad bidding, fraud detection, supply-chain routing, and R&D prioritization all depend on measurable objectives that only approximate the real goal. When those proxies are incomplete, the optimizer can create hidden damage: low-quality leads, customer annoyance, compliance risk, fragile recommendations, or operational complexity that was never charged to the reward function.

SAGA suggests a reusable enterprise pattern:

1. Define the high-level business or scientific goal.
2. Start with known objectives and trusted scoring functions.
3. Run optimization.
4. Analyze where the outputs are bad despite good scores.
5. Add or revise objectives.
6. Implement the new checks as executable scoring functions.
7. Repeat, with human involvement where judgment remains expensive.

This turns “the model made a weird recommendation” into a structured workflow question: which missing objective allowed that recommendation to look good?

That is a much better question than “should we use a bigger model?” Bigger models are wonderful. They are also perfectly capable of optimizing the wrong thing with greater fluency.
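Answering "which missing objective allowed that recommendation to look good?" requires a record of when each objective entered the system and why. The sketch below is one minimal way to keep that record. The `ObjectiveLedger` class, its fields, and the lead-scoring example are all hypothetical and are not part of SAGA.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

@dataclass(frozen=True)
class ObjectiveChange:
    round_index: int
    added: Tuple[str, ...]
    removed: Tuple[str, ...]
    rationale: str
    approved_by: str          # a reviewer's name, or "autopilot"

@dataclass
class ObjectiveLedger:
    active: List[str]
    changes: List[ObjectiveChange] = field(default_factory=list)

    def revise(self, round_index: int, rationale: str,
               add: Sequence[str] = (), remove: Sequence[str] = (),
               approved_by: str = "autopilot") -> None:
        """Apply a change to the objective set and log why it happened."""
        if not rationale.strip():
            raise ValueError("every objective change needs a recorded rationale")
        for name in remove:
            self.active.remove(name)
        for name in add:
            if name not in self.active:
                self.active.append(name)
        self.changes.append(ObjectiveChange(
            round_index, tuple(add), tuple(remove), rationale, approved_by))

    def why(self, name: str) -> Optional[str]:
        """Return the most recent rationale for adding an objective, if any."""
        for change in reversed(self.changes):
            if name in change.added:
                return change.rationale
        return None

# Invented example from a lead-scoring workflow.
ledger = ObjectiveLedger(active=["conversion_probability"])
ledger.revise(1, "top-scored leads churned within 30 days",
              add=["retention_90d"], approved_by="ops-review")
```

The refusal to accept an empty rationale is the whole design: an objective that cannot explain its own existence is how proxy metrics quietly accumulate.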

Where SAGA still depends on the world being computable

The limitations deserve precision, because SAGA’s boundaries are not cosmetic.

First, SAGA works best when objectives can be computationally scored or connected to expert or lab feedback. If the desired property cannot be measured, simulated, predicted, or reviewed in a usable loop, the system has nothing reliable to optimize against. The paper acknowledges that problems without computational validation would require human experts or autonomous lab-in-the-loop systems.

Second, the current tasks still predefine the design space. SAGA is asked to design small molecules, nanobodies, DNA sequences, materials, or flowsheets. It does not begin from a broad goal such as “cure this disease” and autonomously decide whether the appropriate modality is a small molecule, antibody, RNA therapy, cell therapy, or something else. That higher-level problem formulation remains outside the current scope.

Third, the Implementer is powerful but not infallible. Turning a natural-language objective into executable code is itself a risk surface. The paper includes checks such as implementation validation, Dockerized execution, and comparisons between human-implemented and agent-implemented scoring functions in some ablations. Still, in production settings, generated scoring functions would need review, version control, tests, and audit trails. Yes, the boring software engineering part returns. It always does.

Finally, objective evolution can introduce instability. The chemical process section makes this visible: changing objectives and weights can affect reinforcement-learning training behavior. Adding more objectives is not automatically better. A good outer loop needs restraint, not just creativity.
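The review burden on generated scoring functions, raised in the third limitation above, can be partly automated with reference cases. The sketch below checks a scorer against known inputs before it is allowed into the loop. The `validate_scorer` helper, the tolerance, and the deliberately buggy `generated_gc` function are all invented for illustration and do not reproduce the paper's validation checks.

```python
import math
from typing import Callable, List, Sequence, Tuple

def validate_scorer(scorer: Callable[[str], float],
                    reference_cases: Sequence[Tuple[str, float]],
                    lower: float = 0.0, upper: float = 1.0,
                    tol: float = 1e-6) -> List[str]:
    """Return a list of problems; an empty list means the scorer passed."""
    problems = []
    for candidate, expected in reference_cases:
        try:
            value = scorer(candidate)
        except Exception as exc:  # generated code can fail in arbitrary ways
            problems.append(f"{candidate!r}: raised {type(exc).__name__}")
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{candidate!r}: returned non-numeric {value!r}")
        elif not math.isfinite(value):
            problems.append(f"{candidate!r}: returned non-finite {value!r}")
        elif not lower <= value <= upper:
            problems.append(f"{candidate!r}: {value} outside [{lower}, {upper}]")
        elif abs(value - expected) > tol:
            problems.append(f"{candidate!r}: got {value}, expected {expected}")
    return problems

def reference_gc(seq: str) -> float:
    """Human-written GC-content scorer used as ground truth."""
    return sum(base in "GC" for base in seq.upper()) / len(seq)

def generated_gc(seq: str) -> float:
    """Stand-in for an agent-written scorer; it ignores C and lowercase input."""
    return seq.count("G") / len(seq)

CASES = [("GGCC", 1.0), ("ATAT", 0.0), ("gcAT", 0.5)]
reference_problems = validate_scorer(reference_gc, CASES)
generated_problems = validate_scorer(generated_gc, CASES)
```

The buggy scorer fails two of the three cases while still returning tidy numbers between 0 and 1, which is the uncomfortable part: a wrong scoring function rarely crashes. It just quietly optimizes something else.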

The conclusion: the agent learns what success should mean

SAGA’s strongest idea is not that LLMs can “do science” in a cinematic sense. It is that parts of scientific judgment can be organized into a repeatable control loop.

The paper shows this loop across several scientific settings. In antibiotics, objective evolution helps avoid chemically unattractive high-activity candidates and leads to experimentally validated hits. In nanobodies, it assembles a composite scoring function that distinguishes binders better than individual metrics in the tested set. In DNA and materials, it improves computationally evaluated plausibility across multi-objective design tasks. In process engineering, it catches the industrially familiar problem of optimizing purity while ignoring complexity.

The broader lesson is almost embarrassingly practical: discovery improves when the system is allowed to revise what it is trying to optimize.

That does not make SAGA a fully autonomous scientist. It makes it something more useful at this stage: a framework for making optimization less naive.

For businesses building AI agents, that is the part worth stealing. Do not just build agents that act. Build agents that inspect the consequences of their own scoring rules, propose better ones, and keep humans in the loop where reality is still too expensive to compress into a number.

Because if the KPI is wrong, the optimizer will not save you.

It will help you fail faster, with excellent documentation.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yuanqi Du et al., “Accelerating Scientific Discovery with Autonomous Goal-evolving Agents,” arXiv:2512.21782, 2026.
