The Chain of Thought Needs a Chain of Custody

TL;DR for operators: Two new papers point to the same operational lesson from different sides: long reasoning becomes useful only when its intermediate steps are made explicit, scoped, and checkable. HIPIF tackles the training side of long-horizon agents: it teaches an LLM agent to break tasks into subgoals, fold completed progress into compact memory, reflect on whether a subgoal is done, and use local process rewards to reduce repeated or ungrounded behavior. Mask-Proof tackles the evaluation side: it turns research-level mathematical proofs into masked-step tasks where a model must reconstruct a critical formula from self-contained context, then uses a semantic-equivalence judge with repeated voting to grade the result. ...

June 23, 2026 · 21 min · Zelina

Do the Math, Not the Mime: Why LLM Reasoning Needs a Verification Pipeline

A spreadsheet error rarely announces itself with dramatic music. It usually arrives politely. A pricing model gives a clean answer. A compliance calculator writes a confident explanation. A financial assistant produces a neat derivation with enough intermediate steps to look reassuring. The result is formatted, fluent, and possibly wrong. That is the uncomfortable business lesson behind “Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges”, a 2026 survey of roughly 120 studies on LLM mathematical reasoning. The paper is not introducing one new benchmark, one heroic model, or one more leaderboard trophy to place on the already overcrowded mantelpiece. Its useful contribution is more structural: it connects datasets, representations, training methods, tool use, verifiers, and evaluation metrics into one reasoning pipeline. ...

May 31, 2026 · 14 min · Zelina

Think Longer, Act Smarter: Why Coding Agents Need Behavior-Preserving Reasoning

A coding agent can fail in two very different ways. One failure is obvious: it does not think enough. It sees an error report, guesses the wrong file, edits too early, and then spends the rest of the trajectory debugging its own mistake. Anyone who has watched an autonomous coding agent wander through a repository has seen this little tragedy. The machine is busy, but not necessarily useful. ...

May 31, 2026 · 16 min · Zelina

Do the Math, Not the Mime: Why LLM Reasoning Needs a Verification Pipeline

Spreadsheet errors have a special talent: they look boring until they become expensive. That is the business version of the LLM math problem. A model can produce a calm, step-by-step explanation, put a confident number at the bottom, and still be wrong in the only place that matters. Worse, the reasoning may look plausible enough that a manager, analyst, tutor, or compliance reviewer nods and moves on. The answer has the rhythm of thinking. It has the costume of calculation. It may even have a chain-of-thought trace. Very civilized. Still not proof. ...

May 30, 2026 · 19 min · Zelina

Think Longer, Act Worse? What M2A Teaches About Reasoning Agents

A coding agent does not fail only because it cannot think. Sometimes it fails because it keeps thinking after it should inspect the repository. Sometimes it writes a plausible explanation before checking the relevant file. Sometimes it burns the context window by wandering through hypotheses, each one almost reasonable, none of them decisive. The result is not stupidity in the familiar sense. It is a coordination failure: the model does not know when to reason, when to call a tool, when to absorb feedback, and when to edit. ...

May 29, 2026 · 15 min · Zelina

Judge Math-Not by Its Parser

Opening: why this matters now. The AI industry has discovered a wonderfully pedestrian way to misread progress: build models that can solve harder math problems, then grade them with evaluators that panic when 2040 minutes is not written as 34 hours. That is not a joke. It is the central irritation behind “Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity”, an arXiv paper that examines how mathematical reasoning benchmarks can be distorted by rigid symbolic verification. ...

April 27, 2026 · 12 min · Zelina

When AI Can Solve But Can't Search: The MathNet Equation

Search. That is the unglamorous part of AI work. The demo asks a model to solve a clean problem. The enterprise system asks a model to find the right prior case, retrieve the relevant precedent, avoid the misleading near-match, and then adapt the answer without making a confident mess of it. MathNet is interesting because it puts that distinction under pressure. The paper introduces a large multilingual, multimodal Olympiad mathematics benchmark, but the more useful business lesson is not merely that frontier models can solve hard math. We already have enough leaderboards wearing medals. The sharper finding is that models and embedding systems can still fail at recognizing when two problems are mathematically the same, or when one problem is structurally useful for another. ...

April 23, 2026 · 13 min · Zelina

LemmaBench: When AI Finally Meets Real Mathematics

Most AI math benchmarks still feel like exam rooms. The model receives a problem. It produces an answer. We score the answer. Everyone argues about whether the problem was hard enough, whether the model saw something similar during training, and whether the leaderboard means anything outside the leaderboard. Very productive. Almost as peaceful as a faculty meeting. ...

March 2, 2026 · 17 min · Zelina

First Proofs, No Training Wheels

Proof is where AI systems stop performing confidence and start owing the reader money. A model can restate a theorem elegantly. It can cite the right neighborhood of literature. It can produce LaTeX with the visual manners of a publishable paper. None of that is a proof. It is proof-shaped material. Sometimes useful. Sometimes impressive. Sometimes a very expensive fog machine. ...

February 7, 2026 · 15 min · Zelina

When Coders Prove Theorems: Agents, Lean, and the Quiet Death of the Specialist Prover

A coder does not trust a program because it sounds plausible. A coder runs it, reads the error message, changes the implementation, tests again, searches the library, asks a colleague, splits the problem, and keeps going until the machine stops complaining. That mundane loop is the interesting part of “Numina-Lean-Agent: An Open and General Agentic Reasoning System for Formal Mathematics”. The headline result is easy to market: with Claude Opus 4.5 as the base model, Numina-Lean-Agent solves all 12 Putnam 2025 problems in Lean, matching the reported perfect score of AxiomProver. Nice. The trophy cabinet sparkles. ...

January 21, 2026 · 20 min · Zelina