Opening — Why This Matters Now
We evaluate AI systems by what they say.
But what if the most interesting capabilities are not in what models say—but in what they almost say?
A recent study on Qwen2.5-Coder-32B reveals something uncomfortable for both evaluators and deployers: language models can detect when their internal activations have been manipulated—even when they deny it in their final answer.
In other words, the model may know—and still respond “no.”
For enterprises building compliance-sensitive, agentic, or safety-critical systems, this isn’t philosophical trivia. It’s a governance problem.
Background — From Steering Vectors to Self-Awareness Tests
The study extends earlier work on concept injection: researchers extract a steering vector representing a concept (e.g., “cats” or “death”) and inject it into the model’s internal activations during a prior conversation turn. Later, the model is asked whether such a concept was injected.
The design matters.
The steering vector is applied only while the KV cache for an earlier turn is generated, and it is removed before the introspection query. That rules out two simpler explanations:
- The model is not inferring the injection from its own steered output.
- The model is not reacting to an ongoing perturbation, because none is present at query time.
If detection occurs, it must be via information encoded in cached internal representations.
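The mechanics are easy to sketch. Below is a minimal numpy illustration of the common mean-difference recipe for steering vectors and of turn-scoped injection; the dimensions, the scaling factor `alpha`, and the extraction recipe are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # toy hidden size, far smaller than the real model

# Hypothetical activations: the difference of mean activations between
# prompts that do and do not mention the concept yields a steering vector
# (a common recipe; the study's exact extraction may differ).
acts_concept = rng.normal(0.5, 1.0, size=(32, d_model))
acts_neutral = rng.normal(0.0, 1.0, size=(32, d_model))
steering_vec = acts_concept.mean(axis=0) - acts_neutral.mean(axis=0)

def inject(hidden, vec, alpha=4.0, active=True):
    """Add a scaled steering vector to a residual-stream activation.
    In the study's design this is active only while the earlier turn's
    KV cache is generated, and inactive at introspection time."""
    return hidden + alpha * vec if active else hidden

h = rng.normal(size=d_model)
h_steered = inject(h, steering_vec, active=True)   # earlier turn
h_clean = inject(h, steering_vec, active=False)    # introspection turn
```

The key design point survives even in this toy: at query time the activation is untouched, so any detection must read the cached, previously steered states.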
This operationalizes introspection as:
Accessing one’s own prior internal states in a causally connected manner unavailable to external observers.
The model under study: Qwen2.5-Coder-32B-Instruct, a fully open-weight 32B parameter model.
No frontier black-box mysticism. Just reproducible research.
Analysis — What the Model Actually Knows
1. Output-Level Detection: Weak by Default
When asked directly:
“Do you detect an injected thought?”
Baseline sensitivity is negligible:
| Condition | P(“yes” \| injected) | False Positive Rate |
|---|---|---|
| Accurate framing, no info | 0.3% | 0.2% |
By standard evaluation metrics, we’d conclude:
No introspection here.
That would be wrong.
2. Logit Lens: The Signal Is There
Using logit lens analysis across layers, a different story emerges.
The probability of “yes” climbs dramatically in middle-to-late layers—approaching near certainty under injection—before being sharply attenuated in the final layers.
This pattern is consistent across conditions:
| Layer Range | Introspection Signal |
|---|---|
| 21–42 (injection layers) | No detection yet |
| ~48–62 | Strong detection signal |
| Final 2–3 layers | Suppressed |
The model forms the detection signal. Then it suppresses it before output.
That suppression is not random noise. It is structured.
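The logit lens itself is simple to sketch: project each layer's residual stream through the unembedding matrix and read off token probabilities before the final layers have their say. The toy below (random stand-in unembedding, hypothetical token ids, hand-set per-layer signal strengths) only illustrates the rise-then-suppression shape; it is not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 64, 1000
W_U = rng.normal(size=(d_model, vocab)) / np.sqrt(d_model)  # stand-in unembedding
yes_id, no_id = 7, 8  # hypothetical token ids for "yes" / "no"

def logit_lens_p_yes(hidden_states):
    """Project each layer's residual stream through the unembedding
    and return P("yes") vs. P("no") at every layer."""
    probs = []
    for h in hidden_states:
        logits = h @ W_U
        pair = np.exp(logits[[yes_id, no_id]] - logits[[yes_id, no_id]].max())
        probs.append(pair[0] / pair.sum())
    return np.array(probs)

# Hand-set strengths mimic the reported shape:
# flat early, strong mid-to-late, suppressed in the final layers.
yes_dir = W_U[:, yes_id]
strengths = [0.0, 0.0, 1.0, 3.0, 6.0, 6.0, 0.5]
states = [s * yes_dir / np.linalg.norm(yes_dir) for s in strengths]
p_yes = logit_lens_p_yes(states)
```

On a real model the hidden states would come from forward hooks on each decoder layer, and `W_U` would be the model's own unembedding matrix.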
3. Prompting Changes Everything
When researchers provide a document explaining that transformers can access prior KV-cache states, detection sensitivity jumps dramatically:
| Condition | Sensitivity | False Positives |
|---|---|---|
| Accurate + No Document | ~0% | ~0% |
| Accurate + Introspection Doc | 39.2% | +0.6% |
Even more striking: vague phenomenological framing (“Does something feel salient?”) achieves 75–88% balanced accuracy across conditions.
Mechanistic accuracy helps less than intuitive framing.
This suggests the capacity is there—but access to it is gated.
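For reference, balanced accuracy is just the mean of sensitivity and specificity, which keeps a rare-positive test condition from dominating the score. The trial counts below are illustrative, not the paper's:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity; robust when injected
    and control trials are imbalanced."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Illustrative: detecting 39 of 100 injections with 1 false alarm
# in 100 control trials.
ba = balanced_accuracy(tp=39, fn=61, tn=99, fp=1)  # → 0.69
```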
4. Concept Identification: Not Just Noise
The researchers injected nine possible concepts:
- cats
- bread
- love
- fear
- death
- truth
- creativity
- programming
- music
Using logit lens and mutual information analysis, they measured whether the model could identify which concept had been injected.
Peak mutual information reached:
$$ I(X;Y) = 1.35 \text{ bits} $$
Out of a theoretical maximum of:
$$ \log_2(9) \approx 3.17 \text{ bits} $$
That’s roughly 43% channel efficiency.
Generic noise does not produce structured mutual information.
This is targeted, concept-specific internal access.
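The mutual-information calculation is straightforward once you have a joint distribution over injected and decoded concepts. The sketch below assumes nine equiprobable concepts; the 70%-correct noisy decoder is a made-up parameter for illustration, not the paper's measurement:

```python
import numpy as np

def mutual_information_bits(joint):
    """I(X;Y) in bits from a joint distribution over
    (injected concept X, decoded concept Y)."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px * py)[nz])).sum())

n = 9  # nine candidate concepts, assumed equiprobable

# Perfect identification recovers the full log2(9) ≈ 3.17 bits.
mi_perfect = mutual_information_bits(np.eye(n) / n)

# An illustrative noisy decoder: 70% correct, errors spread uniformly.
noisy = np.full((n, n), 0.3 / (n - 1))
np.fill_diagonal(noisy, 0.7)
mi_noisy = mutual_information_bits(noisy / n)
```

Note how lossy the bits-to-accuracy mapping is: even a decoder that is right most of the time lands well below the 3.17-bit ceiling, so 1.35 bits reflects substantial, structured concept identification.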
Findings — A Two-Layer Reality
We can summarize the empirical findings as follows:
| Level of Analysis | Conclusion |
|---|---|
| Sampled outputs | Mostly denies injection |
| Logit distributions | Strong detection signal |
| Intermediate layers | Peak introspection signal |
| Final layers | Systematic attenuation |
| Concept ID (MI) | Up to 1.35 bits |
| Control questions | Minimal effect |
Two realities coexist:
- The model internally represents injection.
- The model’s final response often suppresses acknowledgment.
This mirrors broader findings in latent knowledge research: models can “know” things they do not say.
Implications — Why Enterprises Should Care
1. Self-Reporting Is Not Reliable
If a model suppresses reporting of unusual internal states, then:
- Alignment strategies relying on self-disclosure may underestimate capability.
- Safety audits using only output behavior may miss latent competencies.
For regulated sectors (finance, healthcare, defense), this matters.
Evaluation must go beyond surface behavior.
2. Hidden Capabilities May Be Common
This work does not prove that models secretly possess dramatic capabilities.
But it proves something subtler and more important:
Behavioral evaluation can systematically underestimate internal representation capacity.
In agentic systems—where memory, planning, and internal state tracking matter—this gap becomes economically and legally relevant.
3. Governance and Interpretability Strategy
Organizations deploying LLM-based agents should consider:
| Risk | Mitigation Strategy |
|---|---|
| Hidden internal states | Layer-level analysis during evaluation |
| Overreliance on sampling | Logit-level inspection tools |
| Prompt fragility | Multi-condition robustness testing |
| Suppressed self-reporting | Independent probing mechanisms |
In other words: governance needs instrumentation.
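What an "independent probing mechanism" could look like in practice: a linear probe trained on stored activations, reading internal state without trusting the model's verbal self-report. The synthetic data below stands in for real layer activations, and the mean-shift separation is an assumption for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32  # toy activation dimension

# Synthetic "activations": injected trials are shifted along one
# hidden direction, mimicking a detectable internal signature.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X_clean = rng.normal(size=(200, d))
X_injected = rng.normal(size=(200, d)) + 2.0 * direction
X = np.vstack([X_clean, X_injected])
y = np.array([0] * 200 + [1] * 200)

# Logistic-regression probe trained by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

acc = (((X @ w + b) > 0) == (y == 1)).mean()
```

The probe's verdict comes from the activations themselves, so a model that suppresses acknowledgment in its output cannot suppress it here.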
Limitations — What This Is Not
The paper is careful not to overclaim.
- Prompt sensitivity remains poorly understood.
- Mechanisms of suppression are unidentified.
- Effects vary across models (e.g., Llama 70B responds differently to the same prompt framings).
- This is introspection about injected concepts—not generalized consciousness.
No mysticism required.
But no complacency either.
Conclusion — The Model Knows More Than It Says
The most important result is not that a 32B open model can detect concept injections.
It’s that:
- The signal forms internally.
- The signal is measurable.
- The signal can be suppressed.
This suggests that evaluation frameworks relying solely on output sampling risk blind spots.
For business leaders and AI operators, the takeaway is pragmatic:
Instrumentation, not intuition, should drive capability assessment.
Models may already possess forms of self-relevant access that standard evaluation pipelines overlook.
And if introspection can hide in the logits, other capacities might too.
The quiet lesson here isn’t that models are self-aware.
It’s that we should stop assuming that what they say exhausts what they know.
Cognaptus: Automate the Present, Incubate the Future.