AI for Science

The Test Suite Passed. The Physics Did Not.

TL;DR for operators: Nguyen’s paper is not another “AI writes code” victory lap. It is more useful than that. It documents a 12-work-day, 57-session case in which a physicist supervised Claude Code, using Sonnet and Opus models, to build clax-pt, a JAX implementation of a differentiable one-loop perturbation theory module validated against the established C reference code class-pt.1 ...

Laws and Order: Turning LLM Brainstorming into a Research Hypothesis Workflow

Brainstorming Is Cheap; Research Judgment Is Not. Brainstorming with an LLM is easy. Ask for ten research ideas, wait a few seconds, and receive a confident menu of things that sound just plausible enough to be dangerous. Turn up the temperature and the machine becomes “creative.” Wonderful. We have successfully automated the whiteboard intern. ...

Agents in the Lab: When Bayesian Adversaries Keep AI Scientists Honest

Lab work has an old rule: never trust the first beautiful result. It may be correct. It may also be a measurement artifact wearing a lab coat. That rule becomes more important when the “research assistant” is an LLM that can write code, invent tests, explain errors, and occasionally hallucinate with the confidence of a junior consultant who has just discovered PowerPoint. The paper “AI-for-Science Low-code Platform with Bayesian Adversarial Multi-Agent Framework” takes this problem seriously.1 Its central claim is not that scientific automation needs a larger model, a longer prompt, or another cheerful agent named “Planner.” The claim is sharper: in AI-assisted scientific coding, both the generated code and the generated tests are uncertain. If the validator is also an LLM, then the system has not solved hallucination. It has merely hired hallucination as compliance staff. ...

From Prompt Engineering to Context Engineering: Why Typed Graphs Beat Chatty Agents in the Lab

A lab workflow is a terrible place to discover that your AI agent has been “remembering” chemistry as a conversation. That sounds unkind. It is also the point. In a casual chatbot, losing track of context means an awkward answer. In computational chemistry, losing track of context can mean a wrong molecular geometry, a missing imaginary-frequency check, an invalid charge or multiplicity, or a pKa estimate that looks numerically confident while being scientifically useless. The model did not necessarily become stupid. The workflow around it treated state as text. ...

Agents All the Way Down: When Science Becomes Executable

A lab does not fail because the scientist forgot how to think. It fails more often for duller reasons: the data table is in the wrong format, the simulation script only works on one cluster, the instrument queue is opaque, the boundary condition was changed but not logged, the literature trail cannot be reconstructed, and the “promising result” lives in someone’s notebook like a small hostage. ...

When LLMs Stop Guessing and Start Calculating

A simulation job does not care how elegant the prompt was. It cares whether the input files are valid, whether the parameters are compatible, whether the previous step produced the right intermediate state, whether the solver converged, and whether the final number actually means what the workflow says it means. This is where the romance of “AI scientists” usually meets the concrete wall of scientific computing. The model can sound like a postdoc. The machine still wants the correct INCAR tag. ...

Packing a Punch: How Model‑Based AI Outperformed Decades of Sphere‑Packing Theory

Expensive experiments have a nasty habit: they punish enthusiasm. In many AI success stories, the hidden luxury is cheap feedback. Generate a million candidates, test them quickly, keep the survivors, call it discovery. This is the comfortable world of many coding benchmarks, puzzle solvers, and evolutionary search systems. It is not the world of high-precision semidefinite programming, where evaluating one candidate can take days and where a bad guess does not merely waste a GPU minute; it quietly burns a serious slice of the research budget. ...

Peer Review Meets Power Tools: How AI Is Quietly Rewriting Scientific Workflows

Research begins with a familiar nuisance: too many papers, too little time, and a creeping suspicion that the most relevant idea is hiding three fields away under someone else’s terminology. Then comes the second nuisance: even after finding the idea, someone must turn it into a hypothesis, a collaborator list, an experiment plan, a protocol, a result, a reviewable claim, and eventually a publishable manuscript. ...