Circuits of Understanding: A Formal Path to Transformer Interpretability

TL;DR for operators: Debugging. That is the useful mental entry point, not “AI transparency,” which has become a conference badge phrase with slightly better lighting. The paper at the centre of this article, “Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small,” shows that a real linguistic behaviour in a transformer can be decomposed into a circuit of internal components, then tested using causal interventions rather than admired through colourful attention maps.[1] The task is indirect object identification: given a sentence where two names appear and one is repeated, the model predicts the other name. Small grammar problem, large interpretability bill. ...

July 30, 2025 · 14 min · Zelina

Steering by the Token: How GRAINS Turns Attribution into Alignment

TL;DR for operators: GrAInS is not “fine-tuning, but cheaper.” That framing misses the point and commits the usual business sin of turning a mechanism into a procurement slogan. The paper’s useful claim is more specific: token-level attribution can be converted into an inference-time steering signal. Instead of retraining model weights, GrAInS identifies which text or image tokens most strongly push the model toward preferred or dispreferred outputs, builds layer-wise steering vectors from those activation shifts, and applies normalized edits during inference.[1] ...

July 26, 2025 · 16 min · Zelina