Mechanistic Interpretability

No Structure, No Glory: Why AI Cognition Has to Be Shown, Not Named

TL;DR for operators: AI systems are now sold with labels that sound increasingly cognitive: reasoning, planning, agency, memory, autonomy, sometimes even the more theatrical hints of machine consciousness. Lovely. The marketing department has discovered philosophy. The useful question is not whether the label feels exciting. It is whether the system realizes an internal organization that could actually support the claimed capability. ...

Binding Obligations: Why AI Fails When the Relationships Slip

TL;DR for operators: AI systems are getting better at producing outputs that look structured: code, CAD, diagrams, workflows, compliance memos, procurement recommendations, and decision traces. That is not the same as keeping the structure right. Two recent arXiv papers make this point from opposite ends of the problem. One looks inside language models and finds evidence for a compact retrieval-conditioned rebinding mechanism: the model does not necessarily rewrite its whole internal world after a state change; it can preserve old representations and redirect retrieval when the answer is needed.[1] The other builds an engineering benchmark for Text-to-CAD and shows that models can pass earlier surface gates (executable code, plausible geometry) while still failing the practical tests of functionality, manufacturability, and assemblability.[2] ...

Heads You Lose: Why Ablation-Reversible Interpretability Doesn’t Transfer

TL;DR for operators: The paper is a useful slap on the wrist for anyone tempted to turn an interpretability result into an operational control too quickly.[1] It asks a simple question: when an attention head looks important, contains readable information, and can restore model behaviour after ablation, does that mean it carries a transferable representation of the computation? ...

Lie Detectors Are Late: Why AI Oversight Needs Commitment Tracing

Sales agents, investment advisors, negotiators, and procurement bots share one annoying trait: the dangerous moment often arrives before the final sentence. By the time the agent says, “This product is ideal for your risk profile,” or “We have a stronger competing offer,” the operational system has already lost the more interesting battle. The model did not become risky at the punctuation mark. It drifted, selected a path, rationalized a move, and only then produced the polished message that everyone pretends to audit. ...

Less Chain, More Thought: The Coming Control Layer for LLM Reasoning

Enterprise AI has spent the last two years discovering a mildly inconvenient truth: a model that explains itself at length is not necessarily reasoning well. It may be reasoning. It may be narrating. It may also be producing a confident procedural bedtime story with a spreadsheet attached. ...

Follow the Heads, Not the Hype: How LLMs Route Deductive Reasoning

Policy rules are boring until a chatbot applies the wrong one. A customer asks whether they qualify for a refund. The rule says refunds require purchase within 30 days, unused condition, and no prior replacement claim. The model answers confidently. It even writes a neat step-by-step explanation. Wonderful. The explanation looks like reasoning. It may even be correct. ...

High Entropy, Low Drama: The Internal Fingerprint of LLM Reasoning

Scores are comforting. They fit neatly into leaderboards, procurement decks, and internal model-comparison spreadsheets. One model gets 71.5, another gets 72.9, and someone in the meeting says, “So the second one reasons better.” Maybe. Or maybe the model merely passed a particular checkpoint more often. That is useful, but it is not the same as knowing whether the model has learned a controllable reasoning process. A thermometer tells you the patient is hot; it does not explain the infection. Benchmarks are the thermometer. The paper “Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models” tries to look for something closer to the infection mechanism or, less dramatically, the internal process signature behind “slow thinking” in large reasoning models.[1] ...

Same Maps, Different Moves: Why LLMs Can Converge Without Understanding

Meetings are useful theatre. Everyone can nod at the same slide, repeat the same market keywords, and still leave the room with incompatible plans. The agreement was real. The shared understanding was not. Large language models may be doing something uncomfortably similar. The paper “Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning” studies whether models that look similar internally are actually reasoning in similar ways.[1] This matters because a tempting story has been building around representational convergence: as models scale, their internal representations become more alike, perhaps because they are converging toward a shared statistical model of reality. That story is elegant. It is also a little too convenient, which is usually where expensive mistakes begin. ...

Follow the Heads, Not the Hype: How LLMs Route Deductive Reasoning

A compliance bot does not fail only when it gives the wrong final answer. It can fail earlier, in a quieter and more expensive place: it selects the wrong premise, stops collecting evidence too soon, matches the wrong rule, and then writes a perfectly fluent explanation of a decision that was already broken three steps ago. Very elegant. Very useless. ...

Turning Heads: Why AI Still Gets Lost When It Turns Around

A room is a cruelly simple test for artificial intelligence. Put a person inside it. Tell them they are facing an avocado. Ask them to turn right by 270 degrees, then left by 90 degrees. Give them a few observations along the way. After the final turn, ask what they can see. ...