Opening — Why this matters now

There is a quiet assumption baked into most AI governance frameworks: if we can see what a model is thinking, we can fix it when it goes wrong.

It’s a comforting idea. Regulators like it. Engineers build tooling around it. Consultants sell it.

Unfortunately, this paper demonstrates something far less convenient: models can know the right answer internally—and still fail to act on it.

Not occasionally. Systematically.

And interpretability, for all its elegance, does very little to change that.


Background — The promise of interpretability

Over the past few years, mechanistic interpretability has evolved into a sophisticated toolkit:

  • Concept bottlenecks: forcing models to reason via human-readable features
  • Sparse autoencoders (SAEs): extracting interpretable latent features
  • Linear probes: revealing what information is encoded internally
  • Activation patching & logit lens: tracing causal pathways in generation

The implicit promise is simple:

If we can locate knowledge inside the model, we can intervene and correct its behavior.

This assumption underpins not just research—but regulation. The EU AI Act and FDA guidance both lean on interpretability as a foundation for oversight.

The paper tests that assumption directly.

And rather brutally.


Analysis — What the paper actually did

The study evaluates four interpretability-based intervention methods on a clinical triage task, a setting where a missed hazard is not a theoretical risk but an operational danger.

Experimental Setup

  • 400 physician-adjudicated cases

  • 144 hazards vs 256 benign cases

  • Two models:

    • Steerling-8B (concept bottleneck model)
    • Qwen 2.5 7B (standard LLM)

The key idea:

  1. Measure what the model knows internally (via probes)
  2. Try to force it to act on that knowledge (via interventions)
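Step 1 can be made concrete with a minimal sketch of a linear probe on hidden states. Everything below is illustrative (toy data, a mean-difference probe, made-up dimensions), not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden states: hazard cases are shifted along one
# latent direction the model encodes. Sizes are illustrative.
n, d = 200, 16
labels = rng.integers(0, 2, size=n)              # 1 = hazard, 0 = benign
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
hidden = rng.normal(size=(n, d)) + 3.0 * labels[:, None] * direction

# Step 1: measure what the model "knows" with a mean-difference linear probe.
w = hidden[labels == 1].mean(axis=0) - hidden[labels == 0].mean(axis=0)
scores = hidden @ w

def auroc(scores, labels):
    """Probability that a random hazard scores above a random benign case."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return float((pos[:, None] > neg[None, :]).mean())

probe_auroc = auroc(scores, labels)
print(f"probe AUROC: {probe_auroc:.3f}")
```

A probe like this reading near-perfect AUROC is exactly the "internal knowledge" measurement the paper reports; step 2 is then trying to push that knowledge into the output.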

The Four Intervention Paradigms

| Method | Mechanism | Intuition | Outcome |
|---|---|---|---|
| Concept Steering | Modify concept activations | "Fix the reasoning layer" | Random-like behavior |
| SAE Feature Steering | Clamp latent features | "Activate the correct signals" | No effect |
| Activation Patching | Inject correction vectors | "Repair the computation path" | Weak effect |
| TSV Steering | Push toward a truth direction | "Align representation with truth" | Partial success |

The diversity here matters: these are not variations of one idea—they represent the major schools of interpretability intervention.

And yet, they converge to the same result.
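Structurally, all four paradigms share one recipe: run the forward pass, modify an intermediate activation at a hook point, and let the rest of the computation continue. A toy sketch (random weights, invented names; not either paper model):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Toy two-block network with a residual stream; weights are random stand-ins.
W1, W2, W_out = (rng.normal(size=(d, d)) * 0.3 for _ in range(3))

def forward(x, intervene=None):
    """`intervene` is an optional hook applied at block 1's output."""
    h1 = np.tanh(x @ W1)
    if intervene is not None:
        h1 = intervene(h1)        # the four paradigms differ mainly here
    x = x + h1                    # residual add
    x = x + np.tanh(x @ W2)
    return x @ W_out              # logits

x = rng.normal(size=d)
steer = 0.5 * rng.normal(size=d)  # hypothetical steering vector

base = forward(x)
steered = forward(x, lambda h: h + steer)         # patching / TSV style
clamped = forward(x, lambda h: np.zeros_like(h))  # SAE-clamp style
```

The hook moves the logits in this toy, but as the findings below show, on real models the effect is anywhere from null to actively harmful.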


Findings — The uncomfortable numbers

1. The Knowledge–Action Gap

The most striking result is almost offensive in its clarity:

| Metric | Value |
|---|---|
| Internal knowledge (probe AUROC) | 0.982 |
| Actual task performance (sensitivity) | 0.451 |
| Gap | ~53 percentage points |

The model knows the correct classification nearly perfectly.

It simply doesn’t act on it.
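In concrete counts, using the paper's reported figures (note the gap compares two different metrics, probe AUROC vs. output sensitivity, as the paper's own table does):

```python
# Concrete scale of the knowledge-action gap, from the reported figures.
hazards = 144
sensitivity = 0.451            # fraction of hazards the model actually flags
probe_auroc = 0.982            # how well a probe separates them internally

caught = round(hazards * sensitivity)      # hazards the model flags
missed = hazards - caught                  # hazards it misses outright
gap_pp = (probe_auroc - sensitivity) * 100

print(caught, missed, round(gap_pp, 1))    # 65 caught, 79 missed, ~53.1 pp
```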

This is not noise. This is structure.


2. Interventions Mostly Fail

| Method | FN Corrected | TP Disrupted | Net Effect |
|---|---|---|---|
| Concept Steering | 20% | 53% | Negative |
| SAE Steering | 0% | 0% | Null |
| Activation Patching | ~7% | ~9% | Neutral |
| TSV Steering (strong) | 24% | 6% | Positive (limited) |

Even the best method—TSV steering—leaves 76% of errors untouched.

Interpretability gives visibility, not control.


3. Why the methods fail (mechanistically)

The paper doesn’t stop at results—it diagnoses failure modes.

(1) Concept bottlenecks don’t matter enough

  • 99.92% of concept activations are near zero
  • Intervening on them barely propagates

Translation: You’re editing a layer the model barely uses.


(2) The residual stream cancels your intervention

  • SAE features identified → 3,695 significant signals
  • Steering effect → zero

Because transformers can bypass any single layer.

Translation: The model routes around your fix.
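A toy residual stream makes the bypass concrete. If the relevant information entered the stream before the layer you edit, zeroing that layer's additive contribution leaves the information intact (illustrative numbers throughout):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16

signal = rng.normal(size=d)
signal /= np.linalg.norm(signal)

# Hazard information enters the residual stream early...
x = 0.1 * rng.normal(size=d) + 3.0 * signal

# ...so clamping one block's additive contribution does not remove it.
W = 0.2 * rng.normal(size=(d, d))
block_out = np.tanh(x @ W)

stream_normal = x + block_out
stream_ablated = x + 0.0 * block_out       # the "fix": block output zeroed

readout = float(stream_ablated @ signal)   # downstream probe on the stream
print(f"signal after ablation: {readout:.2f}")   # still close to 3
```

Any later layer reading the stream still sees the hazard signal; the edited block was never the only carrier.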


(3) Important knowledge never appears in tokens

  • Hazard tokens never reach top predictions
  • Yet internal representations strongly separate cases

Translation: The model “understands” in latent space—but doesn’t verbalize it.


(4) Behavior is not one-dimensional

  • TSV works only at extreme strength
  • The “truth” and “decision” directions are only moderately aligned

Translation: There is no single “truth direction” controlling behavior.
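The geometry behind this is easy to sketch. If the direction encoding "truth" only partially overlaps the direction that actually drives the decision, a push along truth moves the decision readout slowly, so only an extreme push flips the output. The cosine value and dimensions below are illustrative, not measured:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32

# Illustrative geometry: "truth" and "decision" directions overlap at
# cos ~= 0.4, echoing the paper's "moderate alignment".
truth = rng.normal(size=d)
truth /= np.linalg.norm(truth)
other = rng.normal(size=d)
other -= (other @ truth) * truth
other /= np.linalg.norm(other)
decision = 0.4 * truth + np.sqrt(1 - 0.4**2) * other

h = 0.2 * rng.normal(size=d) - 3.0 * decision   # a false-negative state

def flags_hazard(h):
    return bool(h @ decision > 0)

# Each unit of steering along `truth` moves the decision score by only 0.4.
flips = {a: flags_hazard(h + a * truth) for a in (0.5, 2.0, 12.0)}
```

Mild and moderate steering leave the decision unchanged; only the extreme push flips it, which is why strong TSV steering also starts disrupting true positives.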


Implications — What this breaks (and what replaces it)

1. Interpretability ≠ controllability

This is the paper’s most important contribution.

We must distinguish:

| Concept | Meaning |
|---|---|
| Interpretability | Understanding internal representations |
| Actionability | Ability to change outputs |

Most current AI safety frameworks assume the first implies the second.

It doesn’t.


2. Rethinking “human-in-the-loop” oversight

If even perfect internal signals cannot reliably correct behavior, then:

  • Monitoring activations is insufficient
  • Real-time intervention is unreliable

A more viable approach emerges:

Use internal signals for detection, not correction

For example:

| Strategy | Role |
|---|---|
| Linear probes | Risk detection |
| Human review | Final decision |
| Model output | Suggestion only |
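As a routing rule, this division of labor is a few lines. The threshold, labels, and function name below are illustrative, not from the paper:

```python
# Minimal sketch of "probes for detection, humans for correction".
# Threshold and label names are illustrative.

def route(model_label: str, probe_risk: float, threshold: float = 0.5) -> str:
    """Decide who acts. The probe only flags; it never overrides the output."""
    if probe_risk >= threshold and model_label == "benign":
        # Internal signal contradicts the output: escalate, don't auto-correct.
        return "human_review"
    return "model_suggestion"   # output proceeds, still only as a suggestion

print(route("benign", 0.97))
```

The key design choice is that a high probe score triggers escalation rather than an automated rewrite of the model's answer, since the findings above show the rewrite cannot be done reliably.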

This is less elegant—but more honest.


3. The shift toward training-time solutions

If inference-time steering fails, the leverage moves upstream:

  • Reinforcement learning for safety-critical actions
  • Representation shaping during training
  • Multi-layer or architecture-level constraints

In other words:

You don’t fix behavior at runtime. You bake it into the model.


4. Architectural implications (quiet but profound)

The findings hint at a deeper issue:

  • Knowledge is distributed across layers
  • Decision-making is emergent, not localized
  • The residual stream allows “knowing without acting”

This suggests that the knowledge–action gap may not be a bug.

It may be a feature of autoregressive transformers.


Conclusion — The illusion of transparency

Interpretability has delivered something remarkable: we can now see inside models with surprising clarity.

But visibility is not control.

This paper shows that even when models encode near-perfect knowledge, they may still fail to act—and our current tools cannot reliably force them to.

The uncomfortable takeaway is this:

We understand more than we can control.

For business, this translates into a simple rule:

  • Don’t rely on interpretability as a safety mechanism
  • Treat it as a diagnostic tool
  • Build control elsewhere—training, architecture, or external systems

Because the model already knows.

It just doesn’t always care.


Cognaptus: Automate the Present, Incubate the Future.