Mechanistic Interpretability

TL;DR for operators: Debugging. That is the useful mental entry point, not “AI transparency,” which has become a conference badge phrase with slightly better lighting. The paper at the centre of this article, “Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small,” shows that a real linguistic behaviour in a transformer can be decomposed into a circuit of internal components, then tested using causal interventions rather than admired through colourful attention maps.[1] The task is indirect object identification: given a sentence where two names appear and one is repeated, the model predicts the other name. Small grammar problem, large interpretability bill. ...

TL;DR for operators: Dates look harmless. They sit in spreadsheets, contracts, forecasts, audit trails, delivery plans, and board decks pretending to be objective little integers. The problem is that a language model may not treat them as just integers. A new paper, “The Other Mind: How Language Models Exhibit Human Temporal Cognition,” studies how 12 large language models judge similarity between years from 1525 to 2524.[1] The authors find that larger models often organise years around a subjective reference point near the recent present, rather than simply comparing numerical distance. The models also show logarithmic compression: years farther from that reference point become less finely distinguished, in a pattern reminiscent of the Weber-Fechner law in human perception. ...

Circuits of Understanding: A Formal Path to Transformer Interpretability

The Clock Inside the Machine: How LLMs Construct Their Own Time