Scientific Reasoning

The Solver Was Fine. The Premises Got Lost.

TL;DR for operators: SciR is a benchmark for a problem that enterprise AI teams keep trying to flatten into one metric: can a model reason scientifically? The more useful question is less flattering and more operational: did the model fail because it could not infer the answer, or because it could not recover the premises from the scientific mess placed in front of it? ...

Laws and Order: Turning LLM Brainstorming into a Research Hypothesis Workflow

Brainstorming Is Cheap; Research Judgment Is Not
Brainstorming with an LLM is easy. Ask for ten research ideas, wait a few seconds, and receive a confident menu of things that sound just plausible enough to be dangerous. Turn up the temperature and the machine becomes “creative.” Wonderful. We have successfully automated the whiteboard intern. ...

Code That Thinks, Models That Don’t: What SymPyBench Reveals About LLM Scientific Reasoning

Calculator. That is the boring object hiding inside many “AI reasoning” debates. In technical work, the uncomfortable question is not whether a language model can explain a formula with academic confidence. It is whether the model can still get the answer right after the numbers change, the wording shifts, the unit conversion becomes annoying, and no multiple-choice option politely waves from the corner saying, “Pick me.” ...

Scientific Reasoning Under the Microscope: How PRiSM Stress-Tests the New Generation of Multimodal Models

Grades are comforting. A model solves 80% of the benchmark, the leaderboard smiles, the demo team relaxes, and someone in procurement quietly starts asking whether the engineering team still needs that many humans. This is usually the part where reality coughs politely. ...