Heads You Lose: Why Ablation-Reversible Interpretability Doesn’t Transfer

TL;DR for operators: The paper is a useful slap on the wrist for anyone tempted to turn an interpretability result into an operational control too quickly.¹ It asks a simple question: when an attention head looks important, contains readable information, and can restore model behaviour after ablation, does that mean it carries a transferable representation of the computation? ...

June 17, 2026 · 17 min · Zelina