When Models Know But Won’t Act: The Interpretability Illusion
Opening — Why this matters now

There is a quiet assumption baked into most AI governance frameworks: if we can see what a model is thinking, we can fix it when it goes wrong. It’s a comforting idea. Regulators like it. Engineers build tooling around it. Consultants sell it. Unfortunately, this paper demonstrates something far less convenient: models can know the right answer internally—and still fail to act on it. ...