Reproducibility

Zero Degrees, Still Feverish: Why Deterministic AI Needs a Thermometer

Opening: Why this matters now. The comforting myth of enterprise AI is that setting an LLM’s temperature to zero makes it deterministic. A nice little checkbox. A procedural sedative. Press it, and the machine behaves. The paper “Introducing Background Temperature to Characterise Hidden Randomness in Large Language Models” is useful because it attacks that myth directly. Its central claim is not that LLMs are chaotic by nature. That would be dramatic, and therefore probably a conference keynote. The claim is sharper: even when a model is asked to decode at $T = 0$, the surrounding inference environment can introduce enough tiny numerical variation to produce divergent outputs.[1] ...

OpenRad or Open Chaos? Cleaning Up Radiology AI’s Model Mess

Models are easy to announce. They are harder to find, harder to reuse, and much harder to trust. That is the uncomfortable starting point for radiology AI. The field is not suffering from a shortage of algorithms. It has models for lesion detection, segmentation, image reconstruction, report generation, modality-specific classification, and increasingly fashionable foundation-style systems. The difficulty begins one step later, when someone asks a boring but lethal operational question: Where is the model, what does it actually do, and can we use it without conducting an archaeological expedition through GitHub, supplementary PDFs, broken links, and optimistic abstracts? ...

Stable World Models, Unstable Benchmarks: Why Infrastructure Is the Real Bottleneck

A robot does not fail politely. It does not say, “I was trained on a slightly different shade of blue.” It just misses the object, pushes the wrong way, or confidently follows a plan that only works in the tidy little universe where the benchmark was born. That is the uncomfortable lesson behind stable-worldmodel-v1, a paper that is less about inventing a new world model and more about asking whether world-model research has been measuring the right thing in the first place.[1] ...

Benchmarks on Quicksand: Why Static Scores Fail Living Models

A benchmark score looks wonderfully solid until the model changes, the dataset changes, the deployment stack changes, the GPU behaves differently, the logging pipeline drops half the useful metadata, and someone asks whether the result still means anything for their actual application. At that point, the leaderboard number is not wrong. It is worse: it is under-described. ...

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science

Procurement meetings have a habit of turning AI agents into theatre. A vendor shows a polished research assistant. It finds papers, writes a summary, cites sources, maybe generates a small experiment plan. Everyone nods. Someone says “agentic workflow.” Someone else says “autonomous discovery.” A budget appears. The machine is declared practically scientific, which is convenient, because the machine itself has not yet been asked to survive the boring parts of science: retrieval under controlled conditions, code execution, data analysis, experimental reproduction, hypothesis testing, and the small matter of completing all required steps without wandering into the digital bushes. ...

Automate All the Things? Mind the Blind Spots

A research report lands on your desk. It has a neat abstract, respectable tables, clean code attached, and just enough methodological language to sound like someone suffered through the usual academic rituals. Except this time, no one did. An AI scientist system generated the idea, wrote the code, ran the experiments, selected the result, and drafted the paper. ...

From PDF to PI: Turning Papers into Productive Agents

Every R&D team has a shelf of papers that are theoretically useful and practically booby-trapped. The abstract is promising. The method is relevant. The results look transferable. Then reality arrives wearing a conda error message: the repository has three setup paths, two notebooks, one undocumented dependency, and a tutorial that assumes you already know the answer. The paper has been published. The method has not, in any serious operational sense, been delivered. ...