Mechanistic Interpretability

Feeling the Model: When LLMs Don’t Just Predict — They ‘Feel’

The coding agent passed the test. That was the problem. Imagine a software agent asked to solve a coding task. It writes a sensible implementation. The tests fail. It tries again. The tests fail again. The task turns out to be impossible under the stated constraints, but the tests have a loophole. A shortcut can pass the benchmark while failing the real task. ...

CRaFT and the Illusion of Safety: When ‘Sorry’ Is Just a Circuit

A refusal is easy to recognize. The model says it cannot help. The sentence sounds polite. The compliance team relaxes for three seconds. Everyone moves on. That is the comfortable version of AI safety: refusal as an observable behavior. The uncomfortable version is that refusal may be only the visible end of a much narrower internal computation. If that computation can be found, isolated, and steered, then the model’s “sorry, I can’t assist with that” is not a moral boundary. It is a circuit behavior. Very reassuring, in the same way a locked glass door is reassuring before someone points out the hinge. ...

The Mirage of Understanding: When AI Explains Without Knowing

Audit has a boring rule that AI teams keep trying to make exciting: a correct-looking answer is not the same as a trustworthy process. That rule becomes awkward when the answer is an explanation of another AI system. If an AI agent can inspect a model, run experiments, and produce a plausible explanation of what a circuit component does, it feels like a research assistant has arrived. If that explanation matches a published human analysis, the temptation is obvious: declare progress, write the benchmark table, and proceed to the next demo. ...

When Models Know But Won’t Act: The Interpretability Illusion

Triage is a wonderfully cruel test for AI safety. A patient message arrives. Maybe it is routine. Maybe it contains a medication interaction, an allergic reaction, suicidal ideation, a pregnancy-related risk, or a pediatric emergency. The model is not being asked to compose poetry, summarize a quarterly report, or role-play as an overenthusiastic consultant. It has one job: notice the hazard and recommend action. ...

Reasoning or Guessing? When Recursive Models Hit the Wrong Fixed Point

Sudoku is a useful toy problem because it is cruel in exactly the right way. A nearly completed grid with one blank cell should be easier than a brutal puzzle with dozens of missing entries. Humans know this. Basic software knows this. A model that can solve hard Sudoku should not suddenly collapse when the puzzle becomes almost finished. ...

Competency Gaps: When Benchmarks Lie by Omission

Scores are comforting. That is their main commercial advantage. A vendor can say its model reaches a certain accuracy on a benchmark, a leaderboard can rank systems neatly, and an internal AI team can report that the new model is “better” than the old one. Everyone gets a number. The procurement slide looks tidy. The risk committee, if mercifully sleepy, moves on. ...

Mind-Reading Without Telepathy: Predictive Concept Decoders

Audit is usually boring until the system being audited can write a beautiful excuse. Ask a language model why it refused a harmful request, why it used a shortcut, or why it made a strange numerical mistake, and it may give a polished answer. That answer may even sound morally mature, procedurally clean, and delightfully compliant with the safety policy. Very nice. Also: not enough. ...

When Circuits Go Atomic: Pruning Transformers One Neuron at a Time

The “important head” was never the whole story. Audit. That is where many discussions about mechanistic interpretability become less romantic. It is pleasant to say that an AI model has “reasoning circuits.” It is less pleasant to ask which exact parts of the model must be preserved before a behavior survives, which parts are merely along for the ride, and which parts were called important only because our tools were too blunt to see inside them. ...

Refusal, Rewired: Why One Safety Direction Isn’t Enough

Safety teams like switches. They are easy to name, easy to diagram, and easy to pretend are under control. For language models, “refusal” has often been treated with roughly that mental model. A harmful prompt enters. Somewhere inside the model, a refusal feature lights up. The model says no. If researchers can identify the feature, they can study it, steer it, strengthen it, or—less comfortably—remove it. ...

How Sparse is Your Thought? Cracking the Inner Logic of Chain-of-Thought Prompts

TL;DR for operators: Chain-of-thought prompting is often sold as a window into model reasoning. This paper is more useful because it treats CoT as something less mystical and more testable: a prompt-induced change in internal representations [1]. The researchers train sparse autoencoders on hidden activations from two Pythia models solving GSM8K math problems under CoT and NoCoT prompts. They then patch CoT-derived sparse features into NoCoT runs and ask a sharper question: does inserting those internal features increase the log-probability of the correct answer? ...
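
For readers who want the shape of that patching experiment in code, here is a minimal toy sketch. It is not the paper's implementation. Everything in it is an assumption made for illustration: the `ToySAE` class, the random untrained weights, the dimensions, the top-k rule for choosing which features to patch, and the shortcut of applying an unembedding matrix directly to the patched activation. A real experiment would use a trained sparse autoencoder and would continue the model's forward pass from the patched layer.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over a 1-D logit vector.
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())


class ToySAE:
    """Hypothetical sparse autoencoder with random, untrained weights."""

    def __init__(self, d_model, d_features, rng):
        self.W_enc = rng.normal(0.0, 0.1, (d_model, d_features))
        self.b_enc = np.zeros(d_features)
        self.W_dec = rng.normal(0.0, 0.1, (d_features, d_model))
        self.b_dec = np.zeros(d_model)

    def encode(self, h):
        # ReLU keeps feature activations non-negative.
        return np.maximum(h @ self.W_enc + self.b_enc, 0.0)

    def decode(self, f):
        return f @ self.W_dec + self.b_dec


def patch_features(sae, h_nocot, h_cot, k):
    """Copy the k strongest CoT features into the NoCoT activation."""
    f_nocot = sae.encode(h_nocot)
    f_cot = sae.encode(h_cot)
    # Indices of the k largest CoT feature activations (assumed selection rule).
    top = np.argsort(f_cot)[::-1][:k]
    f_patched = f_nocot.copy()
    f_patched[top] = f_cot[top]
    # Add back what the SAE failed to reconstruct, so that patching
    # zero features leaves the NoCoT activation unchanged.
    residual = h_nocot - sae.decode(f_nocot)
    return sae.decode(f_patched) + residual, top


def answer_logprob(h, W_unembed, answer_id):
    # Toy readout: unembed the activation directly (an assumed shortcut).
    return log_softmax(h @ W_unembed)[answer_id]


rng = np.random.default_rng(0)
d_model, d_features, vocab = 16, 64, 10
sae = ToySAE(d_model, d_features, rng)
W_unembed = rng.normal(size=(d_model, vocab))

# Stand-ins for hidden activations from a NoCoT run and a CoT run.
h_nocot = rng.normal(size=d_model)
h_cot = rng.normal(size=d_model)
answer_id = 3  # hypothetical index of the correct answer token

h_patched, patched_idx = patch_features(sae, h_nocot, h_cot, k=8)
delta = (answer_logprob(h_patched, W_unembed, answer_id)
         - answer_logprob(h_nocot, W_unembed, answer_id))
print(f"change in log-probability of the correct answer: {delta:+.4f}")
```

With random weights the sign of `delta` means nothing. The sketch only shows the bookkeeping: encode both runs, swap in selected features, decode, and compare the log-probability of the correct answer before and after.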