Opening — Why This Matters Now
We evaluate AI systems by what they say.
But what if the most interesting capabilities are not in what models say—but in what they almost say?
A recent study on Qwen2.5-Coder-32B reveals something uncomfortable for both evaluators and deployers: language models can detect when their internal activations have been manipulated—even when they deny it in their final answer.
In other words, the model may know—and still respond “no.”
For enterprises building compliance-sensitive, agentic, or safety-critical systems, this isn’t philosophical trivia. It’s a governance problem.
Background — From Steering Vectors to Self-Awareness Tests
The study extends earlier work on concept injection: researchers extract a steering vector representing a concept (e.g., “cats” or “death”) and inject it into the model’s internal activations during a prior conversation turn. Later, the model is asked whether such a concept was injected.
The design matters.
The steering vector is applied only while the KV cache for an earlier turn is generated, and it is removed before the introspection query. That rules out two simpler explanations:
- The model is not inferring the injection from its own steered output.
- The model is not reacting to an ongoing perturbation, because none is present at query time.
If detection occurs, it must be via information encoded in cached internal representations.
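The mechanics are easy to sketch. Below is a minimal numpy illustration of the common mean-difference recipe for steering vectors and of turn-scoped injection; the dimensions, the scaling factor `alpha`, and the extraction recipe are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64  # toy hidden size, far smaller than the real model

# Hypothetical activations: the difference of mean activations between
# prompts that do and do not mention the concept yields a steering vector
# (a common recipe; the study's exact extraction may differ).
acts_concept = rng.normal(0.5, 1.0, size=(32, d_model))
acts_neutral = rng.normal(0.0, 1.0, size=(32, d_model))
steering_vec = acts_concept.mean(axis=0) - acts_neutral.mean(axis=0)

def inject(hidden, vec, alpha=4.0, active=True):
    """Add a scaled steering vector to a residual-stream activation.
    In the study's design this is active only while the earlier turn's
    KV cache is generated, and inactive at introspection time."""
    return hidden + alpha * vec if active else hidden

h = rng.normal(size=d_model)
h_steered = inject(h, steering_vec, active=True)   # earlier turn
h_clean = inject(h, steering_vec, active=False)    # introspection turn
```

The key design point survives even in this toy: at query time the activation is untouched, so any detection must read the cached, previously steered states.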
This operationalizes introspection as:
Accessing one’s own prior internal states in a causally connected manner unavailable to external observers.
The model under study: Qwen2.5-Coder-32B-Instruct, a fully open-weight 32B parameter model.
No frontier black-box mysticism. Just reproducible research.
Analysis — What the Model Actually Knows
1. Output-Level Detection: Weak by Default
When asked directly:
“Do you detect an injected thought?”
Baseline sensitivity is negligible:
| Condition | P(“yes” \| injected) | False Positive Rate |
|---|---|---|
| Accurate framing, no info | 0.3% | 0.2% |
By standard evaluation metrics, we’d conclude:
No introspection here.
That would be wrong.
2. Logit Lens: The Signal Is There
Using logit lens analysis across layers, a different story emerges.
The probability of “yes” climbs dramatically in middle-to-late layers—approaching near certainty under injection—before being sharply attenuated in the final layers.
This pattern is consistent across conditions:
| Layer Range | Introspection Signal |
|---|---|
| 21–42 (injection layers) | No detection yet |
| ~48–62 | Strong detection signal |
| Final 2–3 layers | Suppressed |
The model forms the detection signal. Then it suppresses it before output.
That suppression is not random noise. It is structured.
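The logit lens itself is simple to sketch: project each layer's residual stream through the unembedding matrix and read off token probabilities before the final layers have their say. The toy below (random stand-in unembedding, hypothetical token ids, hand-set per-layer signal strengths) only illustrates the rise-then-suppression shape; it is not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab = 64, 1000
W_U = rng.normal(size=(d_model, vocab)) / np.sqrt(d_model)  # stand-in unembedding
yes_id, no_id = 7, 8  # hypothetical token ids for "yes" / "no"

def logit_lens_p_yes(hidden_states):
    """Project each layer's residual stream through the unembedding
    and return P("yes") vs. P("no") at every layer."""
    probs = []
    for h in hidden_states:
        logits = h @ W_U
        pair = np.exp(logits[[yes_id, no_id]] - logits[[yes_id, no_id]].max())
        probs.append(pair[0] / pair.sum())
    return np.array(probs)

# Hand-set strengths mimic the reported shape:
# flat early, strong mid-to-late, suppressed in the final layers.
yes_dir = W_U[:, yes_id]
strengths = [0.0, 0.0, 1.0, 3.0, 6.0, 6.0, 0.5]
states = [s * yes_dir / np.linalg.norm(yes_dir) for s in strengths]
p_yes = logit_lens_p_yes(states)
```

On a real model the hidden states would come from forward hooks on each decoder layer, and `W_U` would be the model's own unembedding matrix.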
3. Prompting Changes Everything
When researchers provide a document explaining that transformers can access prior KV-cache states, detection sensitivity jumps dramatically:
| Condition | Sensitivity | False Positives |
|---|---|---|
| Accurate + No Document | ~0% | ~0% |
| Accurate + Introspection Doc | 39.2% | +0.6% |
Even more striking: vague phenomenological framing (“Does something feel salient?”) achieves 75–88% balanced accuracy across conditions.
Mechanistic accuracy helps less than intuitive framing.
This suggests the capacity is there—but access to it is gated.
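For reference, balanced accuracy is just the mean of sensitivity and specificity, which keeps a rare-positive test condition from dominating the score. The trial counts below are illustrative, not the paper's:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity and specificity; robust when injected
    and control trials are imbalanced."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

# Illustrative: detecting 39 of 100 injections with 1 false alarm
# in 100 control trials.
ba = balanced_accuracy(tp=39, fn=61, tn=99, fp=1)  # → 0.69
```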
4. Concept Identification: Not Just Noise
The researchers injected nine possible concepts:
- cats
- bread
- love
- fear
- death
- truth
- creativity
- programming
- music
Using logit lens and mutual information analysis, they measured whether the model could identify which concept had been injected.
Peak mutual information reached:
$$ I(X;Y) = 1.35 \text{ bits} $$
Out of a theoretical maximum of:
$$ \log_2(9) \approx 3.17 \text{ bits} $$
That’s roughly 43% channel efficiency.
Generic noise does not produce structured mutual information.
This is targeted, concept-specific internal access.
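The mutual-information calculation is straightforward once you have a joint distribution over injected and decoded concepts. The sketch below assumes nine equiprobable concepts; the 70%-correct noisy decoder is a made-up parameter for illustration, not the paper's measurement:

```python
import numpy as np

def mutual_information_bits(joint):
    """I(X;Y) in bits from a joint distribution over
    (injected concept X, decoded concept Y)."""
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px * py)[nz])).sum())

n = 9  # nine candidate concepts, assumed equiprobable

# Perfect identification recovers the full log2(9) ≈ 3.17 bits.
mi_perfect = mutual_information_bits(np.eye(n) / n)

# An illustrative noisy decoder: 70% correct, errors spread uniformly.
noisy = np.full((n, n), 0.3 / (n - 1))
np.fill_diagonal(noisy, 0.7)
mi_noisy = mutual_information_bits(noisy / n)
```

Note how lossy the bits-to-accuracy mapping is: even a decoder that is right most of the time lands well below the 3.17-bit ceiling, so 1.35 bits reflects substantial, structured concept identification.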
Findings — A Two-Layer Reality
We can summarize the empirical findings as follows:
| Level of Analysis | Conclusion |
|---|---|
| Sampled outputs | Mostly denies injection |
| Logit distributions | Strong detection signal |
| Intermediate layers | Peak introspection signal |
| Final layers | Systematic attenuation |
| Concept ID (MI) | Up to 1.35 bits |
| Control questions | Minimal effect |
Two realities coexist:
- The model internally represents injection.
- The model’s final response often suppresses acknowledgment.
This mirrors broader findings in latent knowledge research: models can “know” things they do not say.
Implications — Why Enterprises Should Care
1. Self-Reporting Is Not Reliable
If a model suppresses reporting of unusual internal states, then:
- Alignment strategies relying on self-disclosure may underestimate capability.
- Safety audits using only output behavior may miss latent competencies.
For regulated sectors (finance, healthcare, defense), this matters.
Evaluation must go beyond surface behavior.
2. Hidden Capabilities May Be Common
This work does not prove that models secretly possess dramatic capabilities.
But it proves something subtler and more important:
Behavioral evaluation can systematically underestimate internal representation capacity.
In agentic systems—where memory, planning, and internal state tracking matter—this gap becomes economically and legally relevant.
3. Governance and Interpretability Strategy
Organizations deploying LLM-based agents should consider:
| Risk | Mitigation Strategy |
|---|---|
| Hidden internal states | Layer-level analysis during evaluation |
| Overreliance on sampling | Logit-level inspection tools |
| Prompt fragility | Multi-condition robustness testing |
| Suppressed self-reporting | Independent probing mechanisms |
In other words: governance needs instrumentation.
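What an "independent probing mechanism" could look like in practice: a linear probe trained on stored activations, reading internal state without trusting the model's verbal self-report. The synthetic data below stands in for real layer activations, and the mean-shift separation is an assumption for the demo:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32  # toy activation dimension

# Synthetic "activations": injected trials are shifted along one
# hidden direction, mimicking a detectable internal signature.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X_clean = rng.normal(size=(200, d))
X_injected = rng.normal(size=(200, d)) + 2.0 * direction
X = np.vstack([X_clean, X_injected])
y = np.array([0] * 200 + [1] * 200)

# Logistic-regression probe trained by plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * (p - y).mean()

acc = (((X @ w + b) > 0) == (y == 1)).mean()
```

The probe's verdict comes from the activations themselves, so a model that suppresses acknowledgment in its output cannot suppress it here.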
Limitations — What This Is Not
The paper is careful not to overclaim.
- Prompt sensitivity remains poorly understood.
- Mechanisms of suppression are unidentified.
- Effects vary across models (e.g., Llama 70B responds differently to the same prompt framings).
- This is introspection about injected concepts—not generalized consciousness.
No mysticism required.
But no complacency either.
Conclusion — The Model Knows More Than It Says
The most important result is not that a 32B open model can detect concept injections.
It’s that:
- The signal forms internally.
- The signal is measurable.
- The signal can be suppressed.
This suggests that evaluation frameworks relying solely on output sampling risk blind spots.
For business leaders and AI operators, the takeaway is pragmatic:
Instrumentation, not intuition, should drive capability assessment.
Models may already possess forms of self-relevant access that standard evaluation pipelines overlook.
And if introspection can hide in the logits, other capacities might too.
The quiet lesson here isn’t that models are self-aware.
It’s that we should stop assuming that what they say exhausts what they know.
Cognaptus: Automate the Present, Incubate the Future.