Inference-Time Scaling

Process Reward Agents — When Reasoning Learns to Judge Itself (Before It’s Too Late)

Reasoning systems have a familiar failure mode: they can sound calm while quietly walking off a cliff. A model begins with a plausible assumption, adds a second plausible sentence, then a third. By the time the final answer arrives, the mistake is no longer obvious because it has been wrapped in a competent-looking explanation. In low-stakes writing, this is annoying. In medicine, finance, compliance, or legal reasoning, it is a process failure masquerading as intelligence. ...

Mind Your Mode: Why One Reasoning Style Is Never Enough

Enterprise workflows rarely fail because nobody “thought step by step.” They fail because the wrong kind of thinking is applied for too long. A compliance analyst does not review an incident report the same way she reconciles a spreadsheet. A software engineer does not debug production latency with the same mindset used to design a product roadmap. A CFO does not evaluate a warehouse automation proposal by “being creative” all the way through, unless the board has a strong appetite for interpretive dance. ...

Thinking in Branches: Why LLM Reasoning Needs an Algorithmic Theory

A manager asks an AI system for a risk assessment. It gives a plausible answer. The manager asks again with a slightly different prompt. Another plausible answer appears, with different reasoning. Ask five more times and the system scatters clues across the attempts like a consultant who has read the documents but refuses to assemble the memo in one draft. ...

Many Minds Make Light Work: Boosting LLM Physics Reasoning via Agentic Verification

TL;DR for operators: A familiar enterprise AI failure looks like this: the model gives a confident answer, the formatting is exquisite, the explanation sounds like a gifted teaching assistant, and one equation quietly takes the project into a ditch. Physics is an unusually good place to study that failure because being clear is not enough. The system must interpret the situation, select the right principle, keep the units straight, calculate correctly, and not hallucinate a helpful-but-illegal assumption because the prompt looked lonely. ...

Tool Up or Tap Out: How Multi-TAG Elevates Math Reasoning with Smarter LLM Workflows

TL;DR for operators: Most tool-using LLM workflows still behave like an intern with a favourite spreadsheet: they call one tool, trust the result, and hope the formatting does not catch fire. Multi-TAG proposes a more disciplined pattern. At each reasoning step, the model does not simply choose between chain-of-thought, Python, or WolframAlpha. It asks several tool-backed executors to propose candidate next steps, checks which candidates lead to the same estimated final answer, and then selects the shortest completion among the candidates that agree. That is the useful idea: not “give the model tools,” but “make tools disagree in a controlled way, then use agreement as a verification signal.” ...
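
For readers who prefer to see the selection rule in code, here is a minimal Python sketch of the step described above: group candidates by estimated final answer, keep the group with the most agreement, and pick the shortest completion inside it. This is an illustration of the idea as summarised here, not the Multi-TAG authors' implementation. The `Candidate` class, its field names, the `select_next_step` function, and the tie-breaking rule (when answers tie on votes, all tied candidates stay in the pool) are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # All field names are hypothetical, chosen for illustration only.
    executor: str          # e.g. "cot", "python", "wolframalpha"
    step: str              # the proposed next reasoning step
    completion: str        # rollout from this step to a final answer
    estimated_answer: str  # the final answer that rollout reaches


def select_next_step(candidates):
    """Pick the shortest completion among candidates whose answers agree."""
    if not candidates:
        raise ValueError("need at least one candidate")

    # Count how many candidates land on each estimated final answer.
    votes = Counter(c.estimated_answer for c in candidates)
    top = max(votes.values())

    # Assumption: answers tied for the most votes all remain eligible.
    leading = {answer for answer, n in votes.items() if n == top}
    agreeing = [c for c in candidates if c.estimated_answer in leading]

    # Among agreeing candidates, prefer the shortest completion.
    return min(agreeing, key=lambda c: len(c.completion))


if __name__ == "__main__":
    proposals = [
        Candidate("cot", "Expand the square.", "x" * 30, "42"),
        Candidate("python", "Evaluate numerically.", "x" * 10, "42"),
        Candidate("wolframalpha", "Solve symbolically.", "x" * 3, "41"),
    ]
    chosen = select_next_step(proposals)
    print(chosen.executor, chosen.estimated_answer)
```

Note that the shortest completion overall belongs to the dissenting candidate, yet it is not chosen: length only matters after agreement has filtered the pool.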