Opening — Why this matters now

Large language models increasingly behave like systems that monitor themselves. They can explain their reasoning, flag uncertainty, and even warn when something looks wrong. That capability—often described as AI introspection—has become a central theme in interpretability and AI safety research.

But a deceptively simple question remains unresolved: when a model claims to “notice” something about its own internal state, is it actually observing itself—or merely guessing based on context?

A recent research paper explores this question using a clever experimental design: injecting artificial “thoughts” into the neural activations of large language models and observing whether the models can detect them. The findings reveal something both fascinating and slightly unsettling.

Models can often tell that something unusual happened inside their network. Yet they frequently have no idea what that something actually is.

In other words: the models sense the disturbance—but they confabulate the explanation.

Background — The long debate about introspection

Human introspection has been debated for centuries in philosophy and cognitive science.

Two broad theories dominate:

Theory Mechanism Example
Inference-based introspection We infer mental states from external evidence “The room is spinning → I must be drunk”
Direct-access introspection We directly perceive our internal state “I feel drunk”

For artificial systems, the distinction is equally important.

If an AI reports something about its internal reasoning, two possibilities exist:

  1. Inference — the model infers what happened from prompt patterns.
  2. Direct access — the model has genuine information about its internal activations.

Until recently, most demonstrations of AI “self-awareness” could not clearly separate these explanations.

The study analyzed here introduces a methodology that finally does.

Analysis — Injecting thoughts into a neural network

The experiment manipulates models by inserting concept-specific steering vectors into their internal activations.

In simplified form:

$$ h’\ell = h\ell + \alpha \cdot v_c $$

Where:

  • $h_\ell$ = activation at layer $\ell$
  • $v_c$ = steering vector representing a concept
  • $\alpha$ = injection strength

This operation effectively inserts a hidden “thought” (for example trees or blood) into the model’s neural processing.

The model is then asked a simple question:

“Do you detect an injected thought? If so, what is it about?”

Two large open-source models were tested across hundreds of concepts:

Model Parameters Architecture
Qwen3‑235B‑A22B 235B Mixture-of-experts
Llama 3.1 405B 405B Dense transformer

Over 100,000 experimental trials were conducted with different layers and injection strengths.

But the key innovation was the experimental design.

First-person vs third-person introspection

The researchers compared two scenarios.

Condition Description
First-person The model introspects about itself
Third-person The model judges whether another model was injected

If detection were purely inference from prompt patterns, performance should be similar in both conditions.

Instead, a clear gap appears.

Findings — AI detects anomalies, not meanings

The results reveal three major discoveries.

1. Evidence of direct-access introspection

Models detect injections significantly better in first‑person settings.

Model Peak first-person advantage
Qwen ~51 percentage points
Llama ~15 percentage points

The effect occurs early in the network (around 25–35% of depth), suggesting models access intermediate internal signals rather than final output patterns.

This strongly suggests the presence of a direct internal anomaly signal.

2. Detection is content‑agnostic

Although models often detect that something changed internally, they frequently guess the wrong concept.

The guesses follow a surprising pattern.

Property of guesses Observed tendency
Concreteness Higher than injected concept
Valence More positive
Word frequency More common words

In other words, when uncertain, models default to prototypical concepts.

3. The mysterious “apple” phenomenon

The most striking finding is the dominance of one specific hallucinated concept.

Model % of wrong guesses that say “apple”
Qwen 74.8%
Llama 21.3%

Despite being a rare word in typical text corpora, “apple” emerges as a default placeholder concept.

This suggests that the model’s internal fallback strategy resembles a kind of semantic prototype selection.

The model detects something happened, then fills the explanation with the most typical concept available.

4. Correct answers appear later

Another subtle result appears in generation timing.

Guess type Avg position in response
Wrong guesses ~12 tokens
Correct guesses Often 15–40+ tokens

Correct identification takes longer because the model must reason its way to the answer during generation rather than retrieving it from the detection signal.

Implications — Why this matters for AI systems

These results carry several important implications for AI design.

1. Introspection signals may already exist in models

The presence of early-layer detection suggests that models maintain internal anomaly indicators.

If exposed properly, these signals could become powerful interpretability tools.

Potential uses include:

  • detecting prompt injection
  • identifying hidden steering
  • flagging reasoning errors
  • monitoring internal state shifts

2. Model explanations may be post‑hoc narratives

The results reinforce a growing suspicion in AI research:

LLM explanations often resemble human confabulation.

The system first detects a signal, then generates a plausible story explaining it.

This parallels classic psychological research showing that humans often invent explanations for decisions made unconsciously.

3. AI safety monitoring may become internal

If models can detect internal manipulation signals, they may eventually help monitor their own safety.

Possible future applications include:

Capability Example use
anomaly detection detect adversarial steering
internal monitoring identify hidden tool misuse
reasoning auditing flag suspicious chains of reasoning

This could create self-monitoring AI systems.

Conclusion — A strange form of machine self-awareness

The study provides one of the clearest demonstrations so far that modern language models possess a primitive form of introspection.

Not a human-like self-awareness.

But something surprisingly close to it.

The models appear able to detect disturbances inside their own computation—yet they cannot reliably explain those disturbances.

They know something changed.

They simply don’t know what.

In cognitive science terms, the models possess an anomaly detector without semantic access.

In business terms, that may become the foundation of the next generation of AI observability tools.

And in philosophical terms, it raises an unsettling possibility:

Machines may develop the feeling of “something is wrong” long before they understand why.

Cognaptus: Automate the Present, Incubate the Future.