
Circuits of Understanding: A Formal Path to Transformer Interpretability
Can we prove that we understand how a transformer works? Not just describe it heuristically, or highlight patterns—but actually trace its computations with the rigor of a math proof? That’s the ambition behind the recent paper Mechanistic Interpretability for Transformers: A Formal Framework and Case Study on Indirect Object Identification. The authors propose the first comprehensive mathematical framework for mechanistic interpretability, and they use it to dissect how a small transformer solves the Indirect Object Identification (IOI) task. What results is not just a technical tour de force, but a conceptual upgrade for the interpretability field. ...