Scientific Discovery

SAGA, Not Sci‑Fi: When LLMs Start Doing Science

Science usually fails in a boring way. Not with explosions. Not with a robot dramatically discovering penicillin 2.0 while violins swell in the background. More often, a research workflow fails because somebody optimized the wrong thing a little too efficiently. A molecule scores well but is chemically ugly. A nanobody looks good under one predictor but fails to bind. A DNA enhancer activates the target cell line but also lights up the wrong tissue. A separation process reaches high purity by adding pointless unit operations, because the reward function forgot to punish industrial nonsense. The optimizer did its job. Unfortunately, the job description was incomplete. ...

From Benchmarks to Beakers: Stress‑Testing LLMs as Scientific Co‑Scientists

Benchmarks are clean. Research is not. A benchmark asks a model to answer a question, then politely stops. A research workflow asks the model to form a hypothesis, test it, read the result, notice what went wrong, adjust the plan, and try again without wandering into scientific nonsense. One is a quiz. The other is a beaker with a budget, a deadline, and a surprisingly expensive simulation queue. ...

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science

Procurement meetings have a habit of turning AI agents into theatre. A vendor shows a polished research assistant. It finds papers, writes a summary, cites sources, maybe generates a small experiment plan. Everyone nods. Someone says “agentic workflow.” Someone else says “autonomous discovery.” A budget appears. The machine is declared practically scientific, which is convenient, because the machine itself has not yet been asked to survive the boring parts of science: retrieval under controlled conditions, code execution, data analysis, experimental reproduction, hypothesis testing, and the small matter of completing all required steps without wandering into the digital bushes. ...

Forecasting a Smarter Planet: How EarthLink Reimagines Climate Science with Self-Evolving AI Agents

TL;DR for operators: Climate work is not short of data. It is short of usable pathways through data. EarthLink, the system studied in this paper, is best understood as an orchestration layer for climate science: it plans analyses, retrieves relevant data, generates code, runs diagnostics, checks results, produces reports, and stores validated query-code-result patterns for reuse.1 ...

The Rise of the Self-Evolving Scientist: STELLA and the Future of Biomedical AI

TL;DR for operators: STELLA is not interesting because it calls itself a “self-evolving scientist”. The internet has suffered enough from ambitious nouns. It is interesting because it attacks a real operational bottleneck in biomedical research: the best answer often requires not just reasoning, but finding the right database, building the right analysis environment, running code, checking intermediate results, and deciding when the current workflow is inadequate. ...