Research Automation

FIRE-BENCH: Playing Back the Tape of Scientific Discovery

A demo can make an AI research agent look impressive in ten minutes. Give it a task, watch it create files, install packages, run experiments, generate tables, and write something that sounds like a conclusion. Productivity theater, now with terminal logs. The harder question is less cinematic: did it actually discover the right thing? ...

When Papers Learn to Draw: AutoFigure and the End of Ugly Science Diagrams

A diagram is often where a paper stops being private reasoning and becomes public knowledge. Before that point, the author may have a method, a theorem, a pipeline, or a system architecture. The reader has only paragraphs. Then one good figure appears, and the fog lifts. The method has stages. The variables have roles. The arrows tell us what depends on what. The paper becomes less of a swamp. ...

Infinite Tasks, Finite Minds: Why Agents Keep Forgetting—and How InfiAgent Cheats Time

A report is not finished because the model “understands” the assignment. It is finished because the system still knows, two hundred actions later, which documents were read, which notes were trustworthy, which sections remain unfinished, and which half-baked intermediate answer should not accidentally become the final one. That is the boring part of agentic AI. Naturally, it is also the part most systems quietly fail at. ...

Causality, But Make It Massive: How DEMOCRITUS Turns LLM Chaos into Coherent Causal Maps

Maps are useful because they are not the territory. Nobody opens Google Maps and assumes the blue line has physically repaired the road. Sensible people use it to orient themselves, notice routes, avoid obvious mistakes, and decide where to inspect more carefully. That is the cleanest way to read DEMOCRITUS, the system described in Large Causal Models from Large Language Models.[1] It does not make LLMs magically perform causal inference. It does not estimate treatment effects. It does not solve confounding. It does not turn a pile of text into scientific truth by sprinkling geometry on top, though that would be a very efficient way to sell consulting decks to executives with poor impulse control. ...

Error 404: Peer Review Not Found — How LLMs Are Quietly Rewriting Scientific Quality Control

Deadline. That is the simplest way to understand why modern AI papers contain mistakes. Not because researchers suddenly forgot algebra. Not because reviewers are lazy. Not because the field has collectively decided that proofs are decorative furniture. The more boring explanation is also the more important one: the AI publication machine has scaled faster than the quality-control machinery around it. ...

Agents Without Prompts: When LLMs Finally Learn to Check Their Own Homework

Instructions are usually treated as the beginning of an AI workflow. A user, developer, or system designer writes a prompt. The model produces an output. Then, if the output looks wrong, someone writes another prompt telling the model how to check it, another prompt telling it how to repair it, and eventually a small mountain of prompt glue accumulates around what was supposed to be an automated system. ...

Diversity Pays: Why AI Research Agents Need More Than One Good Idea

Budget has a way of making AI agents less magical. On a slide, an AI research agent looks like a neat loop: read the task, propose an idea, write code, run an experiment, improve, repeat. In production, it looks more like a slightly caffeinated junior researcher with terminal access: sometimes brilliant, sometimes stubborn, and occasionally determined to spend four hours failing at the same doomed approach because the first idea sounded respectable. ...

The Benchmark Awakens: AstaBench and the New Standard for Agentic Science

Procurement meetings have a habit of turning AI agents into theatre. A vendor shows a polished research assistant. It finds papers, writes a summary, cites sources, maybe generates a small experiment plan. Everyone nods. Someone says “agentic workflow.” Someone else says “autonomous discovery.” A budget appears. The machine is declared practically scientific, which is convenient, because the machine itself has not yet been asked to survive the boring parts of science: retrieval under controlled conditions, code execution, data analysis, experimental reproduction, hypothesis testing, and the small matter of completing all required steps without wandering into the digital bushes. ...

Automate All the Things? Mind the Blind Spots

A research report lands on your desk. It has a neat abstract, respectable tables, clean code attached, and just enough methodological language to sound like someone suffered through the usual academic rituals. Except this time, no one did. An AI scientist system generated the idea, wrote the code, ran the experiments, selected the result, and drafted the paper. ...

From PDF to PI: Turning Papers into Productive Agents

Every R&D team has a shelf of papers that are theoretically useful and practically booby-trapped. The abstract is promising. The method is relevant. The results look transferable. Then reality arrives wearing a conda error message: the repository has three setup paths, two notebooks, one undocumented dependency, and a tutorial that assumes you already know the answer. The paper has been published. The method has not, in any serious operational sense, been delivered. ...