Opening — Why this matters now
There is a quiet shift happening in AI.
Not in model size, not in benchmarks—but in delegation. We are beginning to let AI systems explain other AI systems.
It sounds efficient. It also sounds dangerous.
Because once explanation becomes automated, the question is no longer whether the system is correct. It becomes whether we can even tell.
This paper—Pitfalls in Evaluating Interpretability Agents—does something unfashionable. It questions whether the apparent success of agentic AI is, in part, an illusion.
Background — Context and prior art
Interpretability has always been a human bottleneck.
Understanding how a neural network works requires iteration: forming hypotheses, running experiments, refining conclusions. It is slow, expensive, and deeply contextual.
So naturally, we tried to automate it.
Early approaches were modest—LLMs analyzing outputs from predefined experiments. But the latest generation goes further: fully agentic systems that design experiments, test hypotheses, and produce explanations autonomously.
On paper, this looks like progress.
In practice, it creates a new problem: how do you evaluate a system that generates open-ended explanations with no clear ground truth?
The default answer has been replication. If the agent can reproduce findings from prior research, we assume it understands.
That assumption does not survive scrutiny.
Analysis — What the paper actually does
The authors construct an interpretability agent that behaves like a researcher.
It does three things:
- Iteratively proposes hypotheses
- Designs experiments (logit lens, attention patterns, activation patching)
- Produces functional explanations of model components
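Of the tools listed above, activation patching is the easiest to see in miniature: cache an activation from a "clean" run and splice it into a "corrupted" run, then check how much of the clean behavior it restores. Here is a toy sketch on a two-layer linear model (the model and variable names are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer "model": hidden = W1 @ x, output = W2 @ hidden.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

def forward(x, patch_hidden=None):
    """Run the model, optionally replacing the hidden activation
    with a cached one (the core move of activation patching)."""
    hidden = W1 @ x
    if patch_hidden is not None:
        hidden = patch_hidden
    return W2 @ hidden

clean_x = rng.normal(size=3)
corrupt_x = rng.normal(size=3)

# Cache the clean run's hidden activation.
clean_hidden = W1 @ clean_x

# Patching it into the corrupted run restores the clean output
# exactly in this linear toy: the hidden layer carries everything
# the output depends on.
patched = forward(corrupt_x, patch_hidden=clean_hidden)
print(np.allclose(patched, forward(clean_x)))  # True
```

In a real transformer the same pattern runs through forward hooks on a specific layer and position, but the logic is identical.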
Then, they test it across six well-known circuit analysis tasks.
At first glance, the results look strong.
The agent performs comparably to human-authored explanations—and, tellingly, a simpler one-shot baseline matches it in many cases.
But this is where the story turns.
The evaluation method—replicating human explanations—begins to unravel under closer inspection.
Findings — Where evaluation quietly breaks
1. High performance… with uncomfortable caveats
The system achieves reasonably high accuracy across multiple metrics.
| Metric | What it Measures | Result Trend |
|---|---|---|
| Component Functionality Accuracy | Matching human explanations per component | Moderate–High |
| Cluster Functionality Accuracy | Matching grouped behaviors | Moderate |
| Component Assignment Accuracy | Structural alignment with expert clusters | Moderate |
Yet none of these metrics captures how the agent arrived there.
And that turns out to matter.
2. The replication trap
Replication assumes the original explanation is correct.
The paper shows that is not always true.
In one example, a head labeled as a “previous-token” mechanism only behaved that way 42% of the time when tested broadly.
So what exactly is the agent replicating?
Sometimes, an incomplete theory.
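A label like "previous-token head" can be stress-tested directly: over a broad batch, how often is the head's strongest attention target actually the preceding token? A sketch with synthetic attention scores (the 60% rate and array shapes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def previous_token_rate(attn):
    """Fraction of (non-initial) positions whose strongest attention
    target is the immediately preceding token.
    attn: (batch, seq, seq) attention scores for one head."""
    batch, seq, _ = attn.shape
    top = attn.argmax(axis=-1)           # (batch, seq) strongest target
    hits = top == (np.arange(seq) - 1)   # did position t pick t-1?
    return hits[:, 1:].mean()            # position 0 has no predecessor

# Synthetic head: forced to peak on the previous token at ~60% of
# positions, random elsewhere (scores left unnormalized for brevity).
batch, seq = 100, 8
attn = rng.random(size=(batch, seq, seq))
forced = rng.random(size=(batch, seq)) < 0.6
for b in range(batch):
    for t in range(1, seq):
        if forced[b, t]:
            attn[b, t, t - 1] = 2.0      # dominates the row

rate = previous_token_rate(attn)
print(f"behaves as a previous-token head {rate:.0%} of the time")
```

A head like the one in the paper's example would score around 42% on this kind of broad test—far from the near-100% the label implies.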
3. Outcome without process
Two systems can reach the same conclusion in completely different ways.
The agent explores hypotheses, tests edge cases, and adapts.
The one-shot baseline simply reads outputs and guesses.
Under current evaluation, they look similar.
Which is another way of saying—the evaluation cannot distinguish reasoning from pattern matching.
4. Memorization masquerading as intelligence
This is the most uncomfortable finding.
When directly prompted, the underlying model could recall entire circuit structures from memory, including exact component roles and terminology.
Even when not explicitly recalling, it could infer plausible explanations with minimal evidence.
The implication is subtle but critical:
The system may not be discovering explanations.
It may be retrieving them.
5. Noise reveals the truth
To test this, the authors inject noise into experimental data.
If the system relies on reasoning, performance should degrade.
If it relies on memorization, it should remain stable.
The result is mixed:
- Low noise → performance stable (possible memorization or robust inference)
- High noise → performance collapses (evidence matters)
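The diagnostic above is easy to sketch: an evidence-based explainer must degrade as its inputs are corrupted, while a memorizing one that replays cached answers stays flat. A minimal simulation (the task, prototypes, and noise levels are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Ground truth: each "component" has one of three functions, and its
# experimental evidence vector points toward that function's prototype.
prototypes = np.eye(3)
labels = rng.integers(0, 3, size=200)
evidence = prototypes[labels] + 0.05 * rng.normal(size=(200, 3))

def evidence_based(ev):
    """Classify each component from its (possibly noisy) evidence."""
    return ev.argmax(axis=1)

def memorized(_ev):
    """Ignore the evidence entirely and replay cached labels."""
    return labels

for sigma in (0.0, 0.3, 1.5):
    noisy = evidence + sigma * rng.normal(size=evidence.shape)
    acc_reason = (evidence_based(noisy) == labels).mean()
    acc_memo = (memorized(noisy) == labels).mean()
    print(f"noise={sigma}: evidence-based={acc_reason:.2f}, "
          f"memorized={acc_memo:.2f}")
```

The memorizer is unshakeable at every noise level; the evidence reader collapses once noise drowns the signal. A real agent landing between those two curves is exactly the ambiguous middle ground the paper reports.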
The system is neither purely reasoning nor purely memorizing.
It operates somewhere in between.
Which makes evaluation even harder.
6. A new metric: functional interchangeability
To escape human bias, the authors propose an intrinsic evaluation method.
The idea is simple:
If two components perform the same function, swapping them should not change model behavior.
This leads to a measurable distance:
$$ \mathrm{dist}(h_1, h_2) = \frac{1}{2}\left(\sqrt{\mathrm{JSD}_{KQ}} + \sqrt{\mathrm{JSD}_{OV}}\right) $$
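This distance is straightforward to compute once each head's KQ and OV behaviors are summarized as probability distributions (how those distributions are built is paper-specific; the inputs below are placeholders):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen–Shannon divergence between two distributions."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def head_distance(kq1, kq2, ov1, ov2):
    """dist(h1, h2) = (sqrt(JSD_KQ) + sqrt(JSD_OV)) / 2, where each
    head's KQ and OV behavior is summarized as a distribution."""
    return 0.5 * (np.sqrt(jsd(kq1, kq2)) + np.sqrt(jsd(ov1, ov2)))

# Identical heads are distance zero; divergent heads are not.
a = [0.7, 0.2, 0.1]
b = [0.1, 0.2, 0.7]
print(head_distance(a, a, a, a))      # 0.0
print(head_distance(a, b, a, b) > 0)  # True
```

Taking the square root of each JSD term turns the divergence into a proper metric, which is what makes the downstream clustering analysis well-defined.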
Using this, they compute cluster quality via silhouette scores.
| Cluster Type | Quality (Relative) |
|---|---|
| Random | Low (negative) |
| Expert | High |
| Agentic | Slightly above random |
| One-shot | Similar to agentic |
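The silhouette comparison in the table can be reproduced in miniature: given a precomputed distance matrix, a clustering that matches the true structure scores near 1, while an arbitrary assignment scores near or below 0. A self-contained sketch (the 1-D toy data stands in for head distances):

```python
import numpy as np

def silhouette(dist, labels):
    """Mean silhouette score from a precomputed distance matrix.
    s(i) = (b_i - a_i) / max(a_i, b_i), where a_i is the mean distance
    to i's own cluster and b_i the smallest mean distance to another."""
    labels = np.asarray(labels)
    n = len(labels)
    scores = []
    for i in range(n):
        same = (labels == labels[i]) & (np.arange(n) != i)
        if not same.any():
            scores.append(0.0)  # singleton-cluster convention
            continue
        a = dist[i, same].mean()
        b = min(dist[i, labels == c].mean()
                for c in set(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

rng = np.random.default_rng(3)
# Two well-separated groups of points on a line.
pts = np.concatenate([rng.normal(0, 0.1, 10), rng.normal(5, 0.1, 10)])
dist = np.abs(pts[:, None] - pts[None, :])

good = [0] * 10 + [1] * 10  # matches the true structure
bad = [0, 1] * 10           # cuts across it
print(silhouette(dist, good) > silhouette(dist, bad))  # True
```

By this measure the paper's expert clusters sit near the "good" end, while agentic and one-shot clusters land only slightly above random assignment.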
The takeaway is blunt.
Even when agents look correct, their internal structure is often weak.
Implications — What this means for business and AI strategy
There is a broader pattern here.
As AI systems become more autonomous, evaluation becomes less reliable.
Three implications follow.
1. Accuracy is no longer enough
Matching outputs is cheap.
Understanding process is expensive.
Most current AI deployments optimize for the former.
That trade-off will not hold in high-stakes environments.
2. Domain knowledge is the real bottleneck
The agent’s workflow—hypothesis, experiment, refinement—mirrors human reasoning.
But without grounded domain knowledge, it drifts.
This reinforces a familiar conclusion:
The value in agentic AI is not in the model itself.
It is in the workflow and the data surrounding it.
3. Evaluation becomes a first-class system
If you cannot evaluate an agent reliably, you cannot trust it.
Which means evaluation must evolve alongside capability.
Not as an afterthought.
But as infrastructure.
Conclusion — The quiet risk
Most systems fail loudly.
This one fails quietly.
It produces plausible explanations. It matches prior work. It appears competent.
And yet, under the surface, it may not understand what it is explaining.
That is the real risk of agentic AI.
Not that it is wrong.
But that it is convincingly right for the wrong reasons.
Cognaptus: Automate the Present, Incubate the Future.