
Wrong on Purpose: FalsifyBench and the Agent Skill We Keep Forgetting

A good analyst should occasionally try to break their own idea. Not performatively. Not with a decorative “on the other hand” paragraph. Actually break it. Ask the kind of question that could make the current hypothesis collapse, then watch whether the evidence forces a better one. That simple discipline is the center of FalsifyBench: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games, a new paper by Leonardo Bertolazzi, Katya Tentori, and Raffaella Bernardi.¹ The paper is framed around scientific reasoning, but its practical message travels well beyond science. If an AI agent cannot test outside its own current belief, it may look careful while doing something much less impressive: confirming the first plausible story it invented. ...

June 8, 2026 · 17 min · Zelina

Clue by Clue: ProjectionBench and the Business of Testing AI Discovery

Lab meeting. A scientist has a topic, a research question, and not much else. No dataset yet. No final chart. No results section quietly waiting in the appendix. Just a question and the uncomfortable business of guessing what nature might do. ...

June 3, 2026 · 16 min · Zelina

Vibe Check: AutoResearch Is a Workflow, Not a Robot Scientist

Demo day is not discovery day. Demo day has a familiar rhythm. An AI system reads papers, proposes an idea, edits code, runs an experiment, drafts a manuscript, and perhaps even produces something that looks suspiciously like a conference submission. The slide title then arrives with great ceremony: autonomous scientist. The paper AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery is useful because it interrupts that ceremony before everyone starts clapping at the PDF generator.¹ Its central move is not to deny progress. Current systems really can automate meaningful pieces of research work. They can search, summarize, plan, code, run tools, assemble figures, and draft reports. That is already operationally important. ...

June 3, 2026 · 17 min · Zelina

When AI Starts Writing Papers: The Rise of the Medical AI Scientist

Papers used to have a useful quality: they were difficult to produce. Not always good, unfortunately, but difficult. Someone had to identify a problem, read the literature, design the method, write the code, run the experiment, repair the code, compare the result, draw the figures, write the manuscript, and then survive peer review with only minor emotional damage. ...

March 31, 2026 · 16 min · Zelina

From Pipelines to Research Brains: The Rise of AI-Supervised Science

Memory is the boring word that decides whether an AI agent is useful or merely theatrical. A familiar business scene: a team builds an AI workflow to scan documents, generate ideas, produce drafts, and recommend next actions. The demo looks clever. The first week feels magical. Then the cracks appear. The system repeats discarded ideas. It forgets why an option was rejected. It summarizes a project but cannot explain how one failure in March should change a decision in April. Its “memory” is really a longer chat transcript wearing a lab coat. ...

March 26, 2026 · 15 min · Zelina

Strings Attached: When AI Starts Solving Physics

Mistakes are cheap now. That is both the promise and the problem of modern AI research. A large language model can produce a plausible derivation, a plausible proof, a plausible business plan, and a plausible explanation of why the previous three are brilliant. This is useful, until one remembers that “plausible” is the favorite costume of “wrong.” ...

March 8, 2026 · 14 min · Zelina

When LLMs Learn Physics: Taming Symbolic Regression in Materials Science

Formula discovery sounds like the part of science where artificial intelligence should behave like a heroic mathematician: stare at data, discover a law, and write down a clean equation while everyone else politely applauds. That is the cinematic version. The actual engineering problem is less glamorous and much more useful. Symbolic regression already searches for equations. Given enough variables, operators, constants, and patience, it can produce formulas that fit data. The trouble is that “fits data” and “means something physically” are not the same sentence. In a high-dimensional materials dataset, symbolic regression can wander through a forest of plausible-looking algebra and return a formula that is accurate, ornate, and scientifically suspicious. A spreadsheet can also produce a trendline. We do not usually call that physics. ...

March 1, 2026 · 16 min · Zelina

AIRS-Bench: When AI Starts Doing the Science, Not Just Talking About It

A benchmark is supposed to be a ruler. In AI, it often becomes a trophy shelf. A model gets a higher score, a chart moves up and to the right, and everyone politely pretends the hard part has been settled. That ritual works when the task is narrow: classify an image, answer a question, pass a coding test, retrieve a document. But it becomes much less comforting when the system being evaluated is no longer just answering. It is planning experiments, writing code, debugging failures, training models, interpreting results, and deciding what to try next. ...

February 9, 2026 · 19 min · Zelina

FIRE-BENCH: Playing Back the Tape of Scientific Discovery

A demo can make an AI research agent look impressive in ten minutes. Give it a task, watch it create files, install packages, run experiments, generate tables, and write something that sounds like a conclusion. Productivity theater, now with terminal logs. The harder question is less cinematic: did it actually discover the right thing? ...

February 5, 2026 · 14 min · Zelina

Learning to Discover at Test Time: When Search Learns Back

A leaderboard usually treats an AI model like a very fast intern: give it a problem, let it try many times, keep the best answer, and politely ignore the graveyard of failed attempts. That is useful. It is also a little strange. A human engineer does not merely try 25,600 variations of a GPU kernel while keeping the same brain. After the first few failures, she learns which bottlenecks matter. After a lucky partial success, she changes how she thinks about the problem. After enough attempts, the search process is no longer just sampling. It has become learning. ...

January 24, 2026 · 18 min · Zelina