Transformer-Circuits

TL;DR for operators

The paper is a useful corrective for anyone tempted to turn an interpretability result into an operational control too quickly.[1] It asks a simple question: when an attention head looks important, contains readable information, and can restore model behaviour after ablation, does that mean it carries a transferable representation of the computation? ...
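To make the terms in that question concrete, here is a minimal toy sketch of what "ablating" a head, "restoring" behaviour by patching it back in, and "transferring" a cached head output to a different input can look like. Everything in it is an assumption made for illustration: the two linear "heads", the weights, the inputs, and the helper names `head_outputs` and `run` are hypothetical, and none of it reproduces the paper's models or experiments.

```python
# Toy layer: the output is the sum of per-head contributions.
# Each head is a small weight matrix W[o][i] (hypothetical values).

def head_outputs(x, heads):
    """Return one contribution vector per head for input x."""
    return [[sum(xi * w for xi, w in zip(x, row)) for row in W] for W in heads]

def run(x, heads, overrides=None):
    """Sum head contributions; overrides maps head index -> replacement vector."""
    outs = head_outputs(x, heads)
    for idx, vec in (overrides or {}).items():
        outs[idx] = list(vec)
    return [sum(vals) for vals in zip(*outs)]

heads = [
    [[1.0, 0.0], [0.0, 1.0]],    # head 0
    [[0.5, 0.5], [1.0, -1.0]],   # head 1
]
x_clean = [1.0, 2.0]
x_other = [3.0, -1.0]

clean = run(x_clean, heads)                         # unmodified behaviour
ablated = run(x_clean, heads, {0: [0.0, 0.0]})      # zero-ablate head 0
cached = head_outputs(x_clean, heads)[0]            # head 0's clean output
restored = run(x_clean, heads, {0: cached})         # patch it back in
transferred = run(x_other, heads, {0: cached})      # reuse it on another input

print(clean)        # [2.5, 1.0]
print(ablated)      # [1.5, -1.0]
print(restored)     # [2.5, 1.0]
print(transferred)  # [2.0, 6.0]
```

In this toy, ablating head 0 changes the output and patching the cached value back restores it exactly, yet reusing the same cached value on a different input does not reproduce the clean output. That only illustrates why restoration and transfer are separate measurements; it says nothing about what the paper itself finds.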