Reasoning

Think Fast, Act Faster: How 'Thinking-by-Doing' Is Rewiring LLM World Models

Feedback is addictive. Give an AI agent a tool, an API, a database, a browser, a simulator, or a workflow environment, and the temptation is obvious: let it keep poking the world until something works. It tries. It observes. It corrects. It tries again. Compared with a model sitting alone in a prompt box, imagining every possible transition in its head, this looks much healthier. Less hallucinated planning, more contact with reality. Very grown-up. ...

Recurrent Revival: How Retrofitted Depth Turns LLMs Into Deeper Thinkers

Compute is the bill that arrives after every AI strategy meeting. Everyone wants stronger reasoning. Fewer hallucinations. Better mathematical reliability. More robust planning. The usual menu is familiar: train a bigger model, sample more answers, generate longer chain-of-thought, bolt on a verifier, or pray to the GPU procurement gods. Elegant, in the way an invoice can be elegant. ...

Backtrack to Breakthrough: Why Great AI Agents Revisit

Search is easy. Knowing when to go back is harder. That is the useful irritation inside GSM-Agent, a new benchmark for studying agentic reasoning under controlled conditions.1 The paper takes grade-school maths problems from GSM8K, removes the premises from the prompt, hides those premises in a searchable document database, and asks an LLM agent to recover the facts before solving the problem. The arithmetic is not supposed to be impressive. That is the point. If a model fails here, we cannot calmly blame differential geometry, PhD-level law, or some mysteriously adversarial enterprise workflow. The agent simply did not find and use the facts. ...

Branching Out of the Box: Tree‑OPO Turns MCTS Traces into Better RL for Reasoning

A search tree is expensive to build. Once you have paid for it, using only the final answers is a little like buying an aircraft engine and admiring the packaging. That is the useful instinct behind Tree-OPO, a paper that asks whether Monte Carlo Tree Search traces from a stronger teacher model can be reused not merely as demonstrations, but as a structured curriculum for training a smaller reasoning policy.1 The idea is not to run MCTS at inference time and call that progress. Nor is it to imitate a teacher’s logits until the student develops the personality of a photocopier. The paper’s more interesting move is subtler: take the partial reasoning states produced by search, let the student complete from those prefixes, and compute advantages in a way that respects where each prefix sits in the tree. ...

Razor Burn: Why LLMs Nick Themselves on Induction and Abduction

Diagnosis is where AI systems start to look clever, then suddenly start charging consultancy rates. Give a model a handful of symptoms, incident logs, customer complaints, or audit traces, and ask it what explains them. It will usually produce something plausible. Sometimes several plausible things. Occasionally an entire decorative shrubbery of plausible things. The practical question is not whether the model can invent an explanation. That bar is underground. The harder question is whether it can find the simplest explanation that accounts for the evidence without adding unnecessary machinery. ...

Put It on the GLARE: How Agentic Reasoning Makes Legal AI Actually Think

TL;DR for operators: GLARE is useful because it attacks the boring but expensive failure mode in legal AI: the model jumps to the familiar label, decorates the guess with legal-sounding prose, and hopes nobody asks whether a nearby charge would have fit better. The paper proposes an agentic legal judgment prediction framework that does three things in sequence: it expands the set of candidate charges, retrieves precedents with explicit reasoning paths rather than just similar facts, and performs targeted legal search when the model detects a knowledge gap.1 That mechanism matters more than the branding. GLARE is not “RAG, but with legal documents.” It is closer to a small operating procedure for legal reasoning: widen the hypothesis space, compare alternatives, then fetch the missing premise. ...

Knows the Facts, Misses the Plot: LLMs’ Knowledge–Reasoning Split in Clinical NLI

TL;DR for operators: A model that can answer clinical fact-checking questions is not necessarily a model that can reason clinically. That is the inconvenient result of “The Knowledge-Reasoning Dissociation: Fundamental Limitations of LLMs in Clinical Natural Language Inference”, which introduces CTNLI, a controlled clinical NLI benchmark paired with Ground Knowledge and Meta-Level Reasoning Verification probes.1 ...

Thinking in Circles: How Self-Questioning LLMs Learn Without Labels

TL;DR for operators: Self-Questioning Language Models, or SQLM, tests a tempting idea: can a language model improve its reasoning ability without being handed a curated training set of questions and answers? The answer in this paper is: partly, in narrow settings, if the training loop is engineered carefully enough.1 The mechanism is not mystical self-awareness. A model is split into two roles. One role proposes questions from a single topic prompt. The other tries to solve them. Reinforcement learning then updates the system using proxy rewards: majority-vote agreement for arithmetic and algebra, and proposer-generated unit tests for coding. The proposer is rewarded for problems that are not too easy and not too hard; the solver is rewarded for answers that pass the available proxy. ...
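
For readers who want the proxy-reward idea in concrete form, here is a minimal sketch of the arithmetic case. It is an illustration under stated assumptions, not the paper's implementation: the function names, the binary reward values, and the `min_agree` / `max_agree` thresholds are all hypothetical. The only parts taken from the summary above are that the solver is scored by majority-vote agreement and that the proposer is rewarded when a problem is neither too easy nor too hard.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most common answer among the sampled solver outputs."""
    return Counter(answers).most_common(1)[0][0]


def solver_rewards(answers):
    """Proxy reward: 1.0 for samples that agree with the majority, else 0.0."""
    if not answers:
        return []
    majority = majority_answer(answers)
    return [1.0 if a == majority else 0.0 for a in answers]


def agreement(answers):
    """Fraction of sampled answers that match the majority answer."""
    rewards = solver_rewards(answers)
    return sum(rewards) / len(rewards) if rewards else 0.0


def proposer_reward(answers, min_agree=0.25, max_agree=0.9):
    """Hypothetical difficulty filter (thresholds are assumptions).

    Unanimous agreement suggests the question was too easy; very low
    agreement suggests it was too hard or ill-posed. Only the middle
    band earns the proposer a reward.
    """
    return 1.0 if min_agree <= agreement(answers) <= max_agree else 0.0


if __name__ == "__main__":
    samples = ["4", "4", "5", "4"]  # four sampled answers to one proposed question
    print(solver_rewards(samples))
    print(proposer_reward(samples))
```

Note what the sketch makes obvious: nothing here checks whether the majority answer is correct. Agreement is a proxy, which is exactly why the result holds only in narrow settings.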

The Two Minds of Finance: Testing LLMs for Divergence and Discipline

TL;DR for operators: Finance teams do not ask AI systems to do one kind of thinking. They ask them to imagine plausible futures, extract investable implications, choose between similar explanations, and avoid being seduced by the prettiest narrative. Those are not the same task. A model can be fluent, plausible, and still strategically dull. Finance has a long tradition of rewarding that, but we do not need to automate the habit. ...

The Retrieval-Reasoning Tango: Charting the Rise of Agentic RAG

TL;DR for operators: Static RAG is still useful. It is also no longer the whole game. The paper behind this article argues that retrieval and reasoning are converging into a more tightly coupled architecture: reasoning can improve retrieval, retrieval can improve reasoning, and agentic systems can interleave both over multiple steps.1 That sounds like a neat academic symmetry until you put it inside an enterprise workflow, where every extra retrieval call means latency, cost, permissions, ranking risk, and one more place for the machine to confidently ingest rubbish. ...