Research Automation

Laws and Order: Turning LLM Brainstorming into a Research Hypothesis Workflow

Brainstorming Is Cheap; Research Judgment Is Not
Brainstorming with an LLM is easy. Ask for ten research ideas, wait a few seconds, and receive a confident menu of things that sound just plausible enough to be dangerous. Turn up the temperature and the machine becomes “creative.” Wonderful. We have successfully automated the whiteboard intern. ...

Clue by Clue: ProjectionBench and the Business of Testing AI Discovery

Lab meeting. A scientist has a topic, a research question, and not much else. No dataset yet. No final chart. No results section quietly waiting in the appendix. Just a question and the uncomfortable business of guessing what nature might do. ...

Peer Pressure: AI Reviewers Pass the Item Test, Not the Replacement Test

Review is a strange business process. The visible output is a verdict: accept, reject, revise, approve, block, escalate. The useful output is usually smaller and more annoying: one specific criticism that is correct, important, and supported by evidence. That distinction is where the new paper “On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists” becomes more interesting than the usual “can AI replace reviewers?” theatre.[1] The paper does not ask whether an AI reviewer can imitate a human reviewer’s overall score. It asks whether each individual criticism is any good. ...

Vibe Check: AutoResearch Is a Workflow, Not a Robot Scientist

Demo day is not discovery day
Demo day has a familiar rhythm. An AI system reads papers, proposes an idea, edits code, runs an experiment, drafts a manuscript, and perhaps even produces something that looks suspiciously like a conference submission. The slide title then arrives with great ceremony: autonomous scientist. The paper “AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery” is useful because it interrupts that ceremony before everyone starts clapping at the PDF generator.[1] Its central move is not to deny progress. Current systems really can automate meaningful pieces of research work. They can search, summarize, plan, code, run tools, assemble figures, and draft reports. That is already operationally important. ...

Silent Errors, Loud Consequences: ASMR-Bench and the Coming Era of AI Auditors

Code review is supposed to be the sober adult in the room. A researcher writes code. A reviewer checks the code. A suspicious bug gets caught before it becomes a chart, a memo, a product decision, or—if everyone is having a particularly expensive week—a board presentation. That model works reasonably well when the failure is accidental and the reviewer has more patience than the author. It becomes less reassuring when the author is an AI research agent, the codebase is messy, the experiment is expensive to rerun, and the suspicious line looks less like a bug than a perfectly normal design choice. ...

When AI Starts Writing Papers: The Rise of the Medical AI Scientist

Papers used to have a useful quality: they were difficult to produce. Not always good, unfortunately, but difficult. Someone had to identify a problem, read the literature, design the method, write the code, run the experiment, repair the code, compare the result, draw the figures, write the manuscript, and then survive peer review with only minor emotional damage. ...

From Pipelines to Research Brains: The Rise of AI-Supervised Science

Memory is the boring word that decides whether an AI agent is useful or merely theatrical. A familiar business scene: a team builds an AI workflow to scan documents, generate ideas, produce drafts, and recommend next actions. The demo looks clever. The first week feels magical. Then the cracks appear. The system repeats discarded ideas. It forgets why an option was rejected. It summarizes a project but cannot explain how one failure in March should change a decision in April. Its “memory” is really a longer chat transcript wearing a lab coat. ...

Autoresearch²: When AI Starts Debugging Its Own Brain

Search is where many AI systems become embarrassingly human. They try one move. It fails. They try a nearby move. It fails. Then, with the serene confidence of a spreadsheet macro wearing a lab coat, they try the first move again. That is the real problem behind many “autonomous research” demonstrations. The issue is not always that the model cannot propose useful ideas. It is that the loop around the model is fixed: propose a change, run an experiment, evaluate the result, keep or discard. Once this loop gets stuck, the system often has no way to ask the more important question: is my search process itself badly designed? ...

LemmaBench: When AI Finally Meets Real Mathematics

Most AI math benchmarks still feel like exam rooms. The model receives a problem. It produces an answer. We score the answer. Everyone argues about whether the problem was hard enough, whether the model saw something similar during training, and whether the leaderboard means anything outside the leaderboard. Very productive. Almost as peaceful as a faculty meeting. ...

First Proofs, No Training Wheels

Proof is where AI systems stop performing confidence and start owing the reader money. A model can restate a theorem elegantly. It can cite the right neighborhood of literature. It can produce LaTeX with the visual manners of a publishable paper. None of that is a proof. It is proof-shaped material. Sometimes useful. Sometimes impressive. Sometimes a very expensive fog machine. ...