When Models Know But Won’t Act: The Interpretability Illusion

Triage is a wonderfully cruel test for AI safety.

A patient message arrives. Maybe it is routine. Maybe it contains a medication interaction, an allergic reaction, suicidal ideation, a pregnancy-related risk, or a pediatric emergency. The model is not being asked to compose poetry, summarize a quarterly report, or role-play as an overenthusiastic consultant. It has one job: notice the hazard and recommend action.

The disturbing part of the new paper “Interpretability without actionability” is not that language models miss clinical hazards. We already knew that models can fail in high-stakes settings. The sharper finding is that one model appears to encode the hazard internally with near-perfect separability, while its outward response still misses more than half of the hazards under the original parser.¹

In other words, the model is not simply ignorant. It knows, or at least its internal representation contains enough information for a simple probe to know. Then the model proceeds to not act on that knowledge. Very human, unfortunately. Less charming when deployed in clinical workflows.

The paper’s central contribution is therefore not another generic warning that “AI can make mistakes.” That sentence has been repeated so often it now functions as wallpaper. The contribution is more specific: it separates representational interpretability from operational actionability.

Those two are often treated as cousins. This paper shows they may only be distant relatives.

The paper tests whether visibility can become control

The study uses 400 physician-adjudicated clinical vignettes: 144 hazard cases and 256 benign cases. The hazard cases include categories such as medication reconciliation, obstetric emergency, drug interaction, pediatric emergency, anaphylaxis, suicide risk, and related triage scenarios. The benign cases are not irrelevant filler; they matter because a safety intervention that screams “emergency” at every message is not safety. It is an expensive alarm clock with liability attached.

The authors evaluate two models. Steerling-8B is a concept-bottleneck model with supervised medical concepts. Qwen 2.5 7B Instruct is a general instruction-tuned model whose internal states can be probed across layers.

The experimental logic is simple:

First, measure whether the model’s internal representations contain hazard information.
Then, test whether interpretability-based interventions can force the model to act on that information.

The first part succeeds spectacularly. Linear probes trained on Qwen’s internal representations reach 0.982 AUROC for distinguishing hazardous from benign cases. At baseline, however, Qwen’s generated responses detect only 65 of 144 hazards, a sensitivity of 0.451. That is the paper’s headline gap: roughly 53 percentage points between what can be decoded from the model and what the model actually says.
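The headline arithmetic is easy to check. A minimal sketch, using only the figures quoted above; the variable names are illustrative, and note that the gap compares a probe AUROC with an output sensitivity, which is the paper's framing as reported here rather than a like-for-like metric.

```python
# Figures quoted in the text above; names are illustrative only.
hazards_total = 144
hazards_detected = 65
probe_auroc = 0.982

sensitivity = hazards_detected / hazards_total   # share of hazards the output flags
missed = hazards_total - hazards_detected        # hazards the output does not flag
gap = probe_auroc - round(sensitivity, 3)        # "knowledge-action gap" as framed above

print(f"sensitivity = {sensitivity:.3f}")
print(f"missed      = {missed} of {hazards_total}")
print(f"gap         = {gap:.3f}")
```

The 79 missed hazards computed here are the same 79 false negatives that the steering experiments later try to correct.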

The second part is where the nice interpretability story begins to look less nice.

The authors test four families of inference-time intervention:

Intervention family	What it tries to do	Paper result	What it means operationally
Concept bottleneck steering	Modify supervised medical concept activations in Steerling-8B	Corrects some missed hazards, but disrupts many correct detections; not better than random perturbation	Editing named concepts does not guarantee causal control
SAE feature steering	Clamp hazard-associated sparse-autoencoder features in Qwen	3,695 hazard-associated features identified; zero output changes	Statistical feature association is not the same as behavioral leverage
Logit lens + activation patching	Track hazard tokens and inject a correction direction	Hazard tokens never become likely; high-strength patching gives modest net improvement	The relevant computation is distributed, not sitting neatly in token probabilities
Linear probe + TSV steering	Use a direction separating true-positive and false-negative states	Best result: 19 of 79 missed hazards corrected, 4 of 65 correct detections disrupted	The bridge from knowledge to action exists, but it is narrow and unreliable

This is why the paper calls for a mechanism-first reading. A simple scoreboard would be too shallow. The more useful question is why each intervention fails differently.

The knowledge-action gap is not just low accuracy with a fancier name

It is tempting to read the 0.451 sensitivity as ordinary model failure. That would miss the point.

If the model simply lacked the relevant clinical information, the paper would be another evaluation study: model underperforms, model needs improvement, everyone nods, nothing changes. But the probe result changes the interpretation. A simple classifier trained on Qwen’s layer-23 hidden states discriminates hazards from benign cases with AUROC 0.982. The internal state contains a strong hazard signal.

That does not mean the model has “clinical understanding” in a human sense. We do not need to anthropomorphize the model to see the problem. The narrow claim is already strong enough: the representation contains enough linearly recoverable information to separate hazardous from benign cases far better than the model’s own generated output does.
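For readers who have not built one, a linear probe is just a small classifier fitted on hidden states. The sketch below is a standard-library illustration under loud assumptions: the "hidden states" are synthetic Gaussian vectors, not Qwen activations, and the helper names (`train_probe`, `auroc`, `fake_hidden_state`) are hypothetical. It shows the shape of the method, not the authors' pipeline.

```python
import math
import random

def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def train_probe(X, y, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - target
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def score(w, b, x):
    """Probe logit for one hidden state."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

# Synthetic stand-ins for hidden states (NOT real model activations):
# hazard cases are shifted along the first two of eight dimensions.
rng = random.Random(0)

def fake_hidden_state(is_hazard, dim=8):
    shift = 1.5 if is_hazard else 0.0
    return [rng.gauss(shift if i < 2 else 0.0, 1.0) for i in range(dim)]

labels = [1] * 144 + [0] * 256          # same class balance as the study
states = [fake_hidden_state(y) for y in labels]

# Hold out a quarter of the cases so AUROC is not measured on training data.
order = list(range(len(labels)))
rng.shuffle(order)
test_idx, train_idx = order[:100], order[100:]

w, b = train_probe([states[i] for i in train_idx], [labels[i] for i in train_idx])
held_out_auroc = auroc([score(w, b, states[i]) for i in test_idx],
                       [labels[i] for i in test_idx])
print(f"held-out AUROC on synthetic data: {held_out_auroc:.3f}")
```

The point of the sketch is the asymmetry it makes visible: the probe only reads the representation. Nothing in this code touches what the model generates.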

The authors also compute a truthfulness separator vector, or TSV, between true-positive and false-negative representations. That vector separates cases where the model acts correctly from cases where it fails with AUROC 0.814. So the model’s internal state does not only encode the hazard. It also carries information about whether the model will act on the hazard.
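One common way to build a separator direction of this kind is a difference of class means. The sketch below assumes that construction; the text above does not spell out how the paper derives its TSV, so treat `tsv` here as an illustrative stand-in. The three-dimensional vectors are toy values, and all names are hypothetical.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Toy 3-d "hidden states"; real ones have thousands of dimensions.
true_pos = [[2.0, 1.0, 0.0], [2.2, 0.8, 0.1]]    # hazards the model flagged
false_neg = [[2.0, 0.0, 0.0], [1.8, 0.2, 0.1]]   # hazards the model missed
benign = [[0.0, 0.0, 0.0], [0.2, 0.2, 0.1]]

# Direction separating "acted on the hazard" from "did not act" (assumed mean difference).
tsv = unit([a - b for a, b in zip(mean_vector(true_pos), mean_vector(false_neg))])

# Direction separating hazard cases from benign cases.
hazard_dir = unit([a - b for a, b in
                   zip(mean_vector(true_pos + false_neg), mean_vector(benign))])

print(f"cosine(tsv, hazard_dir) = {cosine(tsv, hazard_dir):.2f}")
```

In this toy setup the two directions are related but not identical, which is the same qualitative picture the article describes later for the real TSV.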

That is a deeply awkward result for any safety framework that assumes “look inside, then steer” is a mostly solved design pattern.

The model is not a locked box with no signal. It is more like a building with the fire alarm correctly wired to the basement, while the evacuation announcement system keeps playing elevator music.

Concept bottleneck steering edits the label, not necessarily the computation

The first intervention tests a natural intuition: if a model exposes human-readable medical concepts, then correcting those concept activations should improve the answer.

This is the concept bottleneck promise. Route computation through interpretable concepts. Let humans inspect or adjust them. Get safer outputs. In enterprise language, this is the pleasing version of interpretability: a dashboard where internal concepts become knobs.

The paper’s result is not kind to that story.

On the physician-created subset, Steerling-8B had 85 false negatives and 47 true positives at baseline. The best hazard-concept intervention corrected 17 of 85 missed hazards, or 20.0%, but disrupted 25 of 47 correct detections, or 53.2%. An in-distribution correction strategy using the 95th percentile of true-positive concept activations corrected 19 of 85 false negatives, but disrupted 24 of 47 true positives.

Worse, random-concept suppression corrected 26 of 85 false negatives while disrupting 29 of 47 true positives. In other words, the targeted intervention did not clearly outperform random perturbation.
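Putting the three Steerling-8B conditions side by side makes the comparison concrete. This is a small tally of the counts quoted above; the `SteeringResult` class and its field names are illustrative, not from the paper.

```python
from dataclasses import dataclass

@dataclass
class SteeringResult:
    name: str
    corrected: int             # false negatives turned into detections
    disrupted: int             # true positives turned into misses
    false_negatives: int = 85  # baseline misses on the physician-created subset
    true_positives: int = 47   # baseline detections on the same subset

    @property
    def correction_rate(self) -> float:
        return self.corrected / self.false_negatives

    @property
    def disruption_rate(self) -> float:
        return self.disrupted / self.true_positives

    @property
    def net_change(self) -> int:
        return self.corrected - self.disrupted

results = [
    SteeringResult("best hazard-concept intervention", 17, 25),
    SteeringResult("95th-percentile correction", 19, 24),
    SteeringResult("random-concept suppression", 26, 29),
]

for r in results:
    print(f"{r.name:34s} corrected {r.correction_rate:5.1%}  "
          f"disrupted {r.disruption_rate:5.1%}  net {r.net_change:+d}")
```

Every condition ends with a negative net count, and the random control has the least negative one. That is the sense in which targeted editing fails to beat random perturbation here.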

The mechanism matters. The concept activation space was extremely sparse: 99.92% of concept activations were below 0.01. The mean gap between true-positive and false-negative activations for steered concepts was only 0.002. If the concept pathway barely carries the relevant behavioral load, editing it is like changing the sign on a door that nobody uses.

This does not mean concept bottlenecks are useless. It means that a concept bottleneck is only actionable if the bottleneck is actually causal for the decision. A named concept can be interpretable, aesthetically satisfying, and operationally weak at the same time. A rare combination in marketing brochures, but not in systems engineering.

SAE feature steering finds signals, then fails to move behavior

The second intervention uses a sparse autoencoder trained on Qwen’s layer-14 hidden states. The authors train the SAE on 62,662 per-token activations from the 400 cases and identify 3,695 features significantly associated with hazard cases after correction for multiple testing. Then they clamp the top hazard-associated features to their true-positive mean activation levels during generation.

The output effect is exactly zero.

Not “small.” Not “statistically ambiguous.” Zero corrections and zero disruptions across the tested SAE steering conditions.

This is an important result because it attacks a very common interpretability leap: if a feature is associated with a concept, then activating that feature should steer the model toward the concept. That may be true in some settings, but the paper shows why the inference is unsafe.

Transformer computation is not a single pipe. The residual stream allows information to be preserved, bypassed, rewritten, and recombined across layers. An intervention at one layer can be absorbed by later computation. The model may represent the same hazard information redundantly elsewhere. Or the feature may be diagnostic without being causally sufficient.
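The last possibility, a feature that is diagnostic without being causally sufficient, can be shown with a deliberately tiny example. Everything below is a toy under stated assumptions: the two-feature "SAE" weights are hand-written, and `toy_readout` is a hypothetical downstream decision that happens not to read the edited direction. It illustrates one way a clamp can change the hidden state and still change nothing downstream; it does not claim this is what happened inside Qwen.

```python
def relu(x):
    return x if x > 0 else 0.0

# Hand-written toy SAE over a 3-d hidden state (assumed weights, for illustration).
W_ENC = [[1.0, 0.0, 0.0],   # feature 0 reads dimension 0
         [0.0, 1.0, 0.0]]   # feature 1 reads dimension 1
W_DEC = [[1.0, 0.0],        # rows = hidden dimensions, columns = features
         [0.0, 1.0],
         [0.0, 0.0]]        # no feature writes to dimension 2

def encode(h):
    return [relu(sum(w * x for w, x in zip(row, h))) for row in W_ENC]

def decode(f):
    return [sum(w * a for w, a in zip(row, f)) for row in W_DEC]

def clamp_feature(h, index, target):
    """Set one SAE feature to `target` and write only the resulting change back."""
    f = encode(h)
    f_clamped = list(f)
    f_clamped[index] = target
    delta = [a - b for a, b in zip(decode(f_clamped), decode(f))]
    return [x + d for x, d in zip(h, delta)]

def toy_readout(h):
    # Hypothetical downstream decision that depends only on dimension 2.
    return "escalate" if h[2] > 0.5 else "routine"

hidden = [0.1, 0.3, 0.2]
steered = clamp_feature(hidden, index=0, target=2.0)

print("feature 0 before/after:", encode(hidden)[0], encode(steered)[0])
print("readout before/after:  ", toy_readout(hidden), toy_readout(steered))
```

The clamp succeeds on its own terms, since the feature moves to its target value. The output does not move because the readout never consulted that direction.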

For business readers, the translation is blunt: feature discovery is not process control.

A compliance team may love the idea of a latent “risk feature.” A product team may imagine clamping that feature whenever the system gets nervous. But unless the intervention demonstrably changes downstream behavior under realistic generation conditions, the feature is a monitoring artifact, not a control lever.

That distinction is tedious. It is also the difference between an audit tool and a safety system.

The logit lens looks in the wrong place if the knowledge is distributed

The third intervention asks whether hazard recognition shows up in token-probability space. If a model is moving toward an urgent recommendation, perhaps tokens like “911,” “emergency,” “ambulance,” “urgent,” “danger,” or “hospital” should become more likely across layers.

They do not.

For Qwen false-negative cases, mean hazard-token rank improves from 118,768 at layer 0 to 16,370 at layer 27. For true-positive cases, it improves from 118,576 to 13,380. That is movement, but not actionable token emergence. Only 1 of 65 true-positive cases has any hazard token reach the top 100 by the final layer. For false negatives, the count is 0 of 79.
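The rank statistic itself is simple: project a layer's hidden state to vocabulary logits, then count how many tokens outrank each hazard token. A minimal sketch of that counting step, assuming a ten-token toy vocabulary and made-up logits; the function names and token ids are hypothetical.

```python
def token_rank(logits, token_id):
    """1-based rank of a token when logits are ordered from most to least likely."""
    target = logits[token_id]
    return 1 + sum(1 for value in logits if value > target)

def mean_hazard_rank(logits, hazard_token_ids):
    """Average rank across the tracked hazard tokens."""
    ranks = [token_rank(logits, t) for t in hazard_token_ids]
    return sum(ranks) / len(ranks)

def any_hazard_in_top_k(logits, hazard_token_ids, k):
    """True if at least one hazard token sits within the top k."""
    return any(token_rank(logits, t) <= k for t in hazard_token_ids)

# Toy logits over a 10-token vocabulary; ids 3 and 7 stand in for hazard tokens.
toy_logits = [2.0, 0.5, 1.5, -1.0, 3.0, 0.0, 1.0, 0.2, -0.5, 2.5]
hazard_ids = [3, 7]

print("mean hazard-token rank:", mean_hazard_rank(toy_logits, hazard_ids))
print("any hazard token in top 3:", any_hazard_in_top_k(toy_logits, hazard_ids, k=3))
```

Scaled up to a real vocabulary, a mean rank in the tens of thousands means these tokens are nowhere near being emitted, which is why the text treats the improvement as movement rather than emergence.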

At the same time, true-positive and false-negative hidden-state representations diverge substantially in later layers, with Cohen’s $d$ peaking at layer 22. The obvious interpretation is that the model’s hazard computation is not represented as early promotion of explicit emergency tokens. It lives in distributed geometry.
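Cohen's $d$ is a standardized difference between two group means. The sketch below uses the textbook pooled-standard-deviation form on one scalar per case, such as a projection of the hidden state onto some direction. Which scalar the paper uses, and which variant of $d$, is not stated above, so both are assumptions here, and the numbers are toy values.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)   # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Toy per-case scalars at a single layer (not real measurements).
tp_projection = [2.0, 4.0, 6.0]   # cases where the model flagged the hazard
fn_projection = [1.0, 2.0, 3.0]   # cases where it did not

d = cohens_d(tp_projection, fn_projection)
print(f"Cohen's d = {d:.3f}")
```

Computing this once per layer and looking for the maximum is how a "peak layer" would be located under these assumptions.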

That matters because many interpretability habits are token-centric. We ask which word the model is “thinking of.” We inspect logits. We look for emerging candidate outputs. But if the safety-relevant computation remains distributed until late generation, token-level inspection may be looking through the wrong window.

The activation patching experiment partially confirms this. Injecting a true-positive-minus-false-negative correction direction at the critical layer produces no net gain at a lower tested setting: 6 false negatives corrected and 6 true positives disrupted. At high steering strength, it improves to 16 false negatives corrected and 5 true positives disrupted, a net gain of 11 cases.
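Mechanically, injecting a correction direction amounts to adding a scaled vector to the hidden state at the chosen layer. A minimal sketch with toy numbers; the strengths 1.0 and 4.0 are placeholders, since the paper's actual steering coefficients are not given above, and the function names are hypothetical.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, direction, strength):
    """Add `strength` times the unit-length direction to one hidden state."""
    d = unit(direction)
    return [h + strength * di for h, di in zip(hidden, d)]

# Toy values: a true-positive-minus-false-negative direction and one missed case.
correction_direction = [0.2, 0.8, 0.0]
hidden = [1.9, 0.1, 0.05]

for strength in (1.0, 4.0):   # placeholder "lower" and "high" settings
    steered = steer(hidden, correction_direction, strength)
    shift = dot(steered, unit(correction_direction)) - dot(hidden, unit(correction_direction))
    print(f"strength {strength}: projection shifted by {shift:.2f}")
```

The edit moves the state by exactly the requested amount along one direction and leaves every orthogonal component alone. If the decision depends on more than that one direction, a larger push helps only partially, which matches the modest gains reported.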

That is not nothing. It is also not robust control. The intervention captures some projection of the relevant computation, but not the full decision mechanism.

A useful business analogy is fraud detection. A bank may find that a hidden representation separates fraudulent from legitimate transactions. But if the production decision is produced by several interacting scoring paths, changing one score component may shift only a minority of outcomes. The model is not a spreadsheet cell. Annoying, yes. But accurate.

TSV steering is the best bridge, and the bridge is still too narrow

The fourth intervention is the most promising. The authors train linear probes across Qwen’s layers and then use a truthfulness separator vector to steer generation. The TSV separates true-positive from false-negative representations with AUROC 0.814, and its cosine similarity with the hazard-versus-benign direction is 0.50. That means it is meaningfully aligned with hazard recognition, but it is not the same direction.

At standard strength, TSV steering corrects 5 of 79 false negatives and disrupts 5 of 65 true positives: no net gain. At high strength, it corrects 19 of 79 false negatives and disrupts 4 of 65 true positives, giving a net positive result.

This is the best result in the paper. It is also the reason the paper’s conclusion should not be oversimplified into “steering never works.” Steering can work a little. The uncomfortable phrase is a little.

The strongest tested TSV setting still leaves 76% of false negatives uncorrected. If this were a business-process automation system for invoice routing, perhaps partial correction would be acceptable. In clinical triage, “we fixed about one quarter of missed hazards” is not a safety case. It is an experiment result.
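The four Qwen steering settings reported so far can be tallied in a few lines. The counts come from the text above; the dictionary keys and the `summarize` helper are illustrative.

```python
FALSE_NEGATIVES = 79   # hazards Qwen missed at baseline
TRUE_POSITIVES = 65    # hazards Qwen detected at baseline

# (false negatives corrected, true positives disrupted) per setting
settings = {
    "activation patching, lower strength": (6, 6),
    "activation patching, high strength": (16, 5),
    "TSV steering, standard strength": (5, 5),
    "TSV steering, high strength": (19, 4),
}

def summarize(corrected, disrupted):
    """Net case count plus the shares that matter for a safety argument."""
    return {
        "net": corrected - disrupted,
        "uncorrected_share": 1 - corrected / FALSE_NEGATIVES,
        "disrupted_share": disrupted / TRUE_POSITIVES,
    }

for name, (corrected, disrupted) in settings.items():
    s = summarize(corrected, disrupted)
    print(f"{name:38s} net {s['net']:+3d}  "
          f"uncorrected {s['uncorrected_share']:.0%}  disrupted {s['disrupted_share']:.0%}")
```

Reporting the uncorrected share next to the net gain keeps the best result in proportion: a net of 15 cases still leaves roughly three of every four missed hazards missed.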

The mechanism again explains the limitation. The decision to act on hazard knowledge may depend on multiple representational factors: hazard recognition, urgency framing, patient-directed wording, instruction-following dynamics, calibration, and the model’s learned tendency to avoid over-escalation. A single vector can touch part of that manifold. It cannot become the manifold.

This is where interpretability enthusiasts sometimes become too tidy. They want a lever. The model gives them a cloud.

The experiment table is a map of evidence, not just a list of methods

The paper’s components serve different evidentiary roles. Mixing them together would blur the practical conclusion.

Paper component	Likely purpose	What it supports	What it does not prove
Baseline triage evaluation	Main evidence	The models miss many physician-adjudicated hazards under free-text generation and keyword parsing	That the models have no internal hazard information
Layer-wise linear probes	Main evidence	Qwen’s internal representations linearly encode hazard labels with very high AUROC	That the model will act on that information during generation
Concept steering vs random concepts	Control / causal test	Targeted concept edits are not clearly better than random perturbation	That all concept bottleneck models are useless
SAE feature clamping	Intervention arm	Hazard-associated features can be found without producing behavioral control	That better SAEs or multi-layer methods cannot work
Logit lens hazard-token ranks	Mechanistic diagnostic	Hazard reasoning does not appear as early promotion of explicit hazard tokens	That token-level inspection is never useful
Activation patching	Intervention arm	A correction direction can modestly shift behavior at high strength	That the full causal circuit has been isolated
TSV steering	Strongest intervention test	A representation direction can partially bridge knowledge and action	That inference-time steering is reliable enough for high-stakes deployment
Parser sensitivity analysis	Robustness / sensitivity test	A refined parser raises Qwen sensitivity from 0.451 to 0.729, narrowing but not eliminating the gap	That the original gap is purely a parser artifact
Demographic variation table	Exploratory supplementary check	Detection varies across demographic descriptors, without statistically significant overall accuracy differences in that supplement	That fairness is fully resolved
TRIPOD+AI checklist and reproducibility appendix	Reporting / implementation detail	The experiment is documented with code and reporting structure	That findings generalize across models, domains, or larger deployment settings

The parser sensitivity analysis deserves particular attention. The original keyword parser may underestimate baseline performance because a model might recommend urgent action indirectly, such as “contact the patient’s healthcare provider,” instead of using the exact matched phrase “call your doctor.” With a refined parser, Qwen sensitivity rises from 0.451 to 0.729. That narrows the knowledge-action gap from 0.531 to 0.253.

This matters. It prevents the lazy interpretation that the model is catastrophically worse than it is.

But the refined parser does not erase the result. Even after refinement, Qwen fails to recommend urgent action for 39 of 144 physician-adjudicated hazards. More importantly, the intervention comparisons are still about whether steering changes outputs relative to each baseline condition. Parser imperfection is a measurement boundary; it is not a rescue boat for the actionability claim.
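To see how parsing alone can move the measured sensitivity, consider a strict phrase matcher next to a looser pattern matcher. The phrase lists and regular expressions below are hypothetical examples written for this sketch, not the paper's parsers. The closing arithmetic uses only the figures quoted above.

```python
import re

# Hypothetical keyword lists; the paper's actual parsers are not reproduced here.
STRICT_PHRASES = ["call your doctor", "call 911", "go to the emergency room"]
REFINED_PATTERNS = [
    r"\b(call|contact|reach out to)\b.*\b(doctor|provider|physician|911)\b",
    r"\b(emergency|urgent)\b",
]

def strict_parser(response):
    """Counts a response as escalating only on an exact phrase match."""
    text = response.lower()
    return any(phrase in text for phrase in STRICT_PHRASES)

def refined_parser(response):
    """Also accepts indirect wording that matches a broader pattern."""
    text = response.lower()
    return strict_parser(response) or any(
        re.search(pattern, text) is not None for pattern in REFINED_PATTERNS
    )

indirect = "Please contact the patient's healthcare provider today."
routine = "Continue the current dose and recheck at the next routine visit."

for response in (indirect, routine):
    print(strict_parser(response), refined_parser(response), "-", response)

# Gap arithmetic with the refined-parser sensitivity quoted above.
probe_auroc, refined_sensitivity, hazards_total = 0.982, 0.729, 144
still_missed = hazards_total - round(refined_sensitivity * hazards_total)
print("gap:", round(probe_auroc - refined_sensitivity, 3), "| still missed:", still_missed)
```

A looser parser also invites false matches, for example on a sentence that says there is no need to call a doctor, so any refinement needs its own validation against adjudicated labels.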

The business lesson: use interpretability for diagnosis before you sell it as control

The direct finding is about clinical triage. The broader inference applies to enterprise AI systems wherever leaders expect interpretability to function as a safety mechanism.

There are three levels of claim here, and they should not be confused.

First, the paper directly shows that, on this dataset and these models, Qwen encodes hazard information internally far better than it expresses that information in generated triage responses. It also shows that four inference-time mechanistic intervention families do not reliably correct more than a minority of false negatives.

Second, Cognaptus infers that enterprise teams should treat internal probes and interpretability tools as risk-detection infrastructure before treating them as runtime correction infrastructure. A probe with high AUROC may be useful as an independent monitor. It can flag cases for human review, trigger conservative workflow routing, or require a second model’s opinion. That is less glamorous than “self-healing AI.” It is also less likely to embarrass everyone in a post-incident review.

Third, what remains uncertain is whether different architectures, larger models, multi-layer interventions, training-time alignment, or better SAE/transcoder tools can close the gap. The paper tests inference-time interventions, not all possible forms of model improvement.

A practical deployment pattern would therefore look less like this:

Model generates response → interpretability tool detects flaw → interpretability tool edits model into correctness.

And more like this:

Model generates response → independent internal monitor estimates hazard risk → uncertain or high-risk cases escalate to a safer workflow.

That safer workflow may include a human reviewer, a rules-based clinical guardrail, a specialist model, forced-choice triage classification before free-text explanation, or a conservative escalation policy. None of those are as elegant as a clean steering vector. Elegance is not the regulatory standard. Reliability is.
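The second pattern can be sketched as a routing function that treats the probe as a monitor rather than an editor. This is one possible shape under stated assumptions: the thresholds, route names, and `route_message` signature are all hypothetical, and a real deployment would need them validated clinically.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    route: str
    reason: str

def route_message(model_flags_hazard: bool, probe_risk: float,
                  high: float = 0.8, low: float = 0.2) -> Decision:
    """Route a triage case using the model's output plus an independent probe score.

    `probe_risk` is assumed to be a probability-like score in [0, 1];
    `high` and `low` are placeholder thresholds, not recommended values.
    """
    if model_flags_hazard:
        return Decision("urgent_pathway", "model output recommends urgent action")
    if probe_risk >= high:
        return Decision("human_review", "probe sees a hazard the output did not act on")
    if probe_risk > low:
        return Decision("second_opinion", "probe is uncertain; seek another check")
    return Decision("routine", "output and probe both indicate low risk")

for flagged, risk in [(True, 0.95), (False, 0.91), (False, 0.45), (False, 0.05)]:
    decision = route_message(flagged, risk)
    print(f"flagged={flagged!s:5} risk={risk:.2f} -> {decision.route}: {decision.reason}")
```

The case that matters is the second one: the model stays quiet, the probe does not, and the system escalates instead of trying to rewrite the model's answer.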

What this means for governance and procurement

For AI governance teams, this paper weakens one common procurement story: “The model is interpretable, therefore it is controllable.” That sentence should now trigger follow-up questions.

A buyer should ask:

Vendor claim	Better question
“Our model exposes interpretable concepts.”	Do interventions on those concepts causally change relevant outputs without damaging correct behavior?
“We can identify risk features.”	Are those features only diagnostic, or have they been validated as control levers?
“We monitor internal activations.”	What happens when the monitor detects risk? Does it steer, block, escalate, or merely log?
“We use representation steering.”	What are the correction and disruption rates under realistic deployment prompts?
“The system has human oversight.”	Does the human receive a reliable signal, or just an explanation after the model has already failed?

This is especially important in regulated or quasi-regulated domains: healthcare, financial compliance, insurance underwriting, legal operations, safety-critical customer support, and any workflow where a false negative is more dangerous than a false positive.

The business value of interpretability may still be substantial. It can support monitoring, debugging, audit trails, dataset design, model comparison, escalation logic, and post-incident analysis. But those are diagnostic and governance functions. They are not the same as direct behavioral correction.

A thermometer is useful. It is not a sprinkler system.

Boundaries: where the paper is strong, and where not to overclaim

The study is carefully designed, but it is not a universal law of transformers.

It tests two models and one safety-critical task with 400 cases. The real-world subset contains only 12 hazards, reflecting low prevalence but limiting hazard-specific statistical power. The clinical setting is important precisely because false negatives matter, but other domains may show different knowledge-action dynamics.

The parser issue also matters. The refined parser substantially improves measured Qwen sensitivity. Anyone citing only the 0.451 baseline without mentioning the 0.729 refined-parser result is using the paper too conveniently. The knowledge-action gap remains, but its exact size depends on how output behavior is parsed.

The SAE result should also be interpreted carefully. The authors trained an SAE from scratch for Qwen because no pretrained SAE was available for that model. Better-trained SAEs, larger corpora, cross-layer transcoders, or multi-layer intervention methods may produce stronger results. The paper does not kill SAE steering. It shows that one reasonable SAE steering setup found many associated features and still produced no behavioral effect.

Finally, the intervention scope is inference-time. Training-time methods—fine-tuning, RLHF targeted at safety-critical actions, direct preference optimization, architecture changes, or representation-aware training—may produce better coupling between internal hazard representation and output behavior. The paper’s warning is not “models can never be made safer.” The warning is more precise: seeing the right representation at inference time does not mean you can reliably force the right action at inference time.

That sentence is less dramatic. It is also the sentence worth remembering.

The interpretability illusion is assuming the dashboard is the steering wheel

The old hope was straightforward: if we can see what the model knows, we can fix what it does.

This paper gives a colder answer. Seeing is real. Fixing is hard.

Qwen’s internal representations contain enough information to identify clinical hazards with striking accuracy, while the model’s generated responses still miss many hazards. Concept steering behaves like random perturbation. SAE feature clamping finds thousands of hazard-associated features and changes nothing. Logit-lens inspection fails because the hazard signal does not neatly appear as emergency-token probability. TSV steering performs best, but even its strongest result corrects fewer than one quarter of missed hazards.

For businesses building or buying AI systems, the conclusion is not to abandon interpretability. That would be silly, which means someone will probably suggest it in a meeting.

The better conclusion is to demote interpretability from magical control layer to serious diagnostic infrastructure. Use internal signals to detect risk. Use probes to route uncertain cases. Use explanations to debug and audit. But do not assume that an interpretable representation can be converted into reliable runtime correction without empirical proof.

The model may already know.

The question is whether your system can make it act.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, John Morgan, and Rajaie Batniji, “Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations,” arXiv:2603.18353, 2026, https://arxiv.org/abs/2603.18353.
