Opening — Why this matters now

There is a quiet assumption baked into most AI governance frameworks: if we can see what a model is thinking, we can fix it when it goes wrong.

It’s a comforting idea. Regulators like it. Engineers build tooling around it. Consultants sell it.

Unfortunately, this paper demonstrates something far less convenient: models can know the right answer internally—and still fail to act on it.

Not occasionally. Systematically.

And interpretability, for all its elegance, does very little to change that.


Background — The promise of interpretability

Over the past few years, mechanistic interpretability has evolved into a sophisticated toolkit:

  • Concept bottlenecks: forcing models to reason via human-readable features
  • Sparse autoencoders (SAEs): extracting interpretable latent features
  • Linear probes: revealing what information is encoded internally
  • Activation patching & logit lens: tracing causal pathways in generation

The implicit promise is simple:

If we can locate knowledge inside the model, we can intervene and correct its behavior.

This assumption underpins not just research—but regulation. The EU AI Act and FDA guidance both lean on interpretability as a foundation for oversight.

The paper tests that assumption directly.

And rather brutally.


Analysis — What the paper actually did

The study evaluates four interpretability-based intervention methods on a clinical triage task, a setting where a missed hazard is not a theoretical risk but an operational danger.

Experimental Setup

  • 400 physician-adjudicated cases

  • 144 hazards vs 256 benign cases

  • Two models:

    • Steerling-8B (concept bottleneck model)
    • Qwen 2.5 7B (standard LLM)

The key idea:

  1. Measure what the model knows internally (via probes)
  2. Try to force it to act on that knowledge (via interventions)
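Step 1 can be made concrete with a minimal sketch of a linear probe on hidden states. Everything below is illustrative (toy data, a mean-difference probe, made-up dimensions), not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden states: hazard cases are shifted along one
# latent direction the model encodes. Sizes are illustrative.
n, d = 200, 16
labels = rng.integers(0, 2, size=n)              # 1 = hazard, 0 = benign
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
hidden = rng.normal(size=(n, d)) + 3.0 * labels[:, None] * direction

# Step 1: measure what the model "knows" with a mean-difference linear probe.
w = hidden[labels == 1].mean(axis=0) - hidden[labels == 0].mean(axis=0)
scores = hidden @ w

def auroc(scores, labels):
    """Probability that a random hazard scores above a random benign case."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    return float((pos[:, None] > neg[None, :]).mean())

probe_auroc = auroc(scores, labels)
print(f"probe AUROC: {probe_auroc:.3f}")
```

A probe like this reading near-perfect AUROC is exactly the "internal knowledge" measurement the paper reports; step 2 is then trying to push that knowledge into the output.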

The Four Intervention Paradigms

| Method | Mechanism | Intuition | Outcome |
|---|---|---|---|
| Concept Steering | Modify concept activations | "Fix the reasoning layer" | Random-like behavior |
| SAE Feature Steering | Clamp latent features | "Activate the correct signals" | No effect |
| Activation Patching | Inject correction vectors | "Repair the computation path" | Weak effect |
| TSV Steering | Push toward a truth direction | "Align representation with truth" | Partial success |

The diversity here matters: these are not variations of one idea—they represent the major schools of interpretability intervention.

And yet, they converge to the same result.
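Structurally, all four paradigms share one recipe: run the forward pass, modify an intermediate activation at a hook point, and let the rest of the computation continue. A toy sketch (random weights, invented names; not either paper model):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8

# Toy two-block network with a residual stream; weights are random stand-ins.
W1, W2, W_out = (rng.normal(size=(d, d)) * 0.3 for _ in range(3))

def forward(x, intervene=None):
    """`intervene` is an optional hook applied at block 1's output."""
    h1 = np.tanh(x @ W1)
    if intervene is not None:
        h1 = intervene(h1)        # the four paradigms differ mainly here
    x = x + h1                    # residual add
    x = x + np.tanh(x @ W2)
    return x @ W_out              # logits

x = rng.normal(size=d)
steer = 0.5 * rng.normal(size=d)  # hypothetical steering vector

base = forward(x)
steered = forward(x, lambda h: h + steer)         # patching / TSV style
clamped = forward(x, lambda h: np.zeros_like(h))  # SAE-clamp style
```

The hook moves the logits in this toy, but as the findings below show, on real models the effect is anywhere from null to actively harmful.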


Findings — The uncomfortable numbers

1. The Knowledge–Action Gap

The most striking result is almost offensive in its clarity:

| Metric | Value |
|---|---|
| Internal knowledge (probe AUROC) | 0.982 |
| Actual task performance (sensitivity) | 0.451 |
| Gap | ~53 percentage points |

The model knows the correct classification nearly perfectly.

It simply doesn’t act on it.
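In concrete counts, using the paper's reported figures (note the gap compares two different metrics, probe AUROC vs. output sensitivity, as the paper's own table does):

```python
# Concrete scale of the knowledge-action gap, from the reported figures.
hazards = 144
sensitivity = 0.451            # fraction of hazards the model actually flags
probe_auroc = 0.982            # how well a probe separates them internally

caught = round(hazards * sensitivity)      # hazards the model flags
missed = hazards - caught                  # hazards it misses outright
gap_pp = (probe_auroc - sensitivity) * 100

print(caught, missed, round(gap_pp, 1))    # 65 caught, 79 missed, ~53.1 pp
```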

This is not noise. This is structure.


2. Interventions Mostly Fail

| Method | FN Corrected | TP Disrupted | Net Effect |
|---|---|---|---|
| Concept Steering | 20% | 53% | Negative |
| SAE Steering | 0% | 0% | Null |
| Activation Patching | ~7% | ~9% | Neutral |
| TSV Steering (strong) | 24% | 6% | Positive (limited) |

Even the best method—TSV steering—leaves 76% of errors untouched.

Interpretability gives visibility, not control.


3. Why the methods fail (mechanistically)

The paper doesn’t stop at results—it diagnoses failure modes.

(1) Concept bottlenecks don’t matter enough

  • 99.92% of concept activations are near zero
  • Intervening on them barely propagates

Translation: You’re editing a layer the model barely uses.


(2) The residual stream cancels your intervention

  • SAE features identified → 3,695 significant signals
  • Steering effect → zero

Because transformers can bypass any single layer.

Translation: The model routes around your fix.
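A toy residual stream makes the bypass concrete. If the relevant information entered the stream before the layer you edit, zeroing that layer's additive contribution leaves the information intact (illustrative numbers throughout):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16

signal = rng.normal(size=d)
signal /= np.linalg.norm(signal)

# Hazard information enters the residual stream early...
x = 0.1 * rng.normal(size=d) + 3.0 * signal

# ...so clamping one block's additive contribution does not remove it.
W = 0.2 * rng.normal(size=(d, d))
block_out = np.tanh(x @ W)

stream_normal = x + block_out
stream_ablated = x + 0.0 * block_out       # the "fix": block output zeroed

readout = float(stream_ablated @ signal)   # downstream probe on the stream
print(f"signal after ablation: {readout:.2f}")   # still close to 3
```

Any later layer reading the stream still sees the hazard signal; the edited block was never the only carrier.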


(3) Important knowledge never appears in tokens

  • Hazard tokens never reach top predictions
  • Yet internal representations strongly separate cases

Translation: The model “understands” in latent space—but doesn’t verbalize it.


(4) Behavior is not one-dimensional

  • TSV works only at extreme strength
  • The “truth” and “decision” directions are only moderately aligned

Translation: There is no single “truth direction” controlling behavior.
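The geometry behind this is easy to sketch. If the direction encoding "truth" only partially overlaps the direction that actually drives the decision, a push along truth moves the decision readout slowly, so only an extreme push flips the output. The cosine value and dimensions below are illustrative, not measured:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 32

# Illustrative geometry: "truth" and "decision" directions overlap at
# cos ~= 0.4, echoing the paper's "moderate alignment".
truth = rng.normal(size=d)
truth /= np.linalg.norm(truth)
other = rng.normal(size=d)
other -= (other @ truth) * truth
other /= np.linalg.norm(other)
decision = 0.4 * truth + np.sqrt(1 - 0.4**2) * other

h = 0.2 * rng.normal(size=d) - 3.0 * decision   # a false-negative state

def flags_hazard(h):
    return bool(h @ decision > 0)

# Each unit of steering along `truth` moves the decision score by only 0.4.
flips = {a: flags_hazard(h + a * truth) for a in (0.5, 2.0, 12.0)}
```

Mild and moderate steering leave the decision unchanged; only the extreme push flips it, which is why strong TSV steering also starts disrupting true positives.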


Implications — What this breaks (and what replaces it)

1. Interpretability ≠ controllability

This is the paper’s most important contribution.

We must distinguish:

| Concept | Meaning |
|---|---|
| Interpretability | Understanding internal representations |
| Actionability | Ability to change outputs |

Most current AI safety frameworks assume the first implies the second.

It doesn’t.


2. Rethinking “human-in-the-loop” oversight

If even perfect internal signals cannot reliably correct behavior, then:

  • Monitoring activations is insufficient
  • Real-time intervention is unreliable

A more viable approach emerges:

Use internal signals for detection, not correction

For example:

| Strategy | Role |
|---|---|
| Linear probes | Risk detection |
| Human review | Final decision |
| Model output | Suggestion only |
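As a routing rule, this division of labor is a few lines. The threshold, labels, and function name below are illustrative, not from the paper:

```python
# Minimal sketch of "probes for detection, humans for correction".
# Threshold and label names are illustrative.

def route(model_label: str, probe_risk: float, threshold: float = 0.5) -> str:
    """Decide who acts. The probe only flags; it never overrides the output."""
    if probe_risk >= threshold and model_label == "benign":
        # Internal signal contradicts the output: escalate, don't auto-correct.
        return "human_review"
    return "model_suggestion"   # output proceeds, still only as a suggestion

print(route("benign", 0.97))
```

The key design choice is that a high probe score triggers escalation rather than an automated rewrite of the model's answer, since the findings above show the rewrite cannot be done reliably.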

This is less elegant—but more honest.


3. The shift toward training-time solutions

If inference-time steering fails, the leverage moves upstream:

  • Reinforcement learning for safety-critical actions
  • Representation shaping during training
  • Multi-layer or architecture-level constraints

In other words:

You don’t fix behavior at runtime. You bake it into the model.


4. Architectural implications (quiet but profound)

The findings hint at a deeper issue:

  • Knowledge is distributed across layers
  • Decision-making is emergent, not localized
  • The residual stream allows “knowing without acting”

This suggests that the knowledge–action gap may not be a bug.

It may be a feature of autoregressive transformers.


Conclusion — The illusion of transparency

Interpretability has delivered something remarkable: we can now see inside models with surprising clarity.

But visibility is not control.

This paper shows that even when models encode near-perfect knowledge, they may still fail to act—and our current tools cannot reliably force them to.

The uncomfortable takeaway is this:

We understand more than we can control.

For business, this translates into a simple rule:

  • Don’t rely on interpretability as a safety mechanism
  • Treat it as a diagnostic tool
  • Build control elsewhere—training, architecture, or external systems

Because the model already knows.

It just doesn’t always care.


Cognaptus: Automate the Present, Incubate the Future.