Audit has a boring rule that AI teams keep trying to make exciting: a correct-looking answer is not the same as a trustworthy process.

That rule becomes awkward when the answer is an explanation of another AI system. If an AI agent can inspect a model, run experiments, and produce a plausible explanation of what a circuit component does, it feels like a research assistant has arrived. If that explanation matches a published human analysis, the temptation is obvious: declare progress, write the benchmark table, and proceed to the next demo.

The paper behind this article, Pitfalls in Evaluating Interpretability Agents, is useful because it refuses that temptation.[1] It builds an agentic system for automated circuit analysis, shows that the system can look competitive when evaluated against prior human explanations, and then spends most of its intellectual effort asking whether that success means what we think it means.

The answer is not “the agent is useless.” That would be too easy, and also wrong. The answer is more uncomfortable: replication-based evaluation can make an interpretability agent look like it understands, even when the evaluation cannot separate genuine experimental reasoning from ambiguity, memorization, or informed guessing.

That is the mirage. The agent may not be hallucinating. The benchmark may not be fake. The published reference may not be worthless. Yet the overall evaluation can still overstate what has actually been demonstrated. Conveniently, this is exactly the kind of problem that governance teams discover after procurement, not before. Excellent timing, as always.

The paper compares two scorecards, not just two systems

The obvious way to read the paper is as a study of an interpretability agent. That is only half right.

The more useful reading is comparative: the paper contrasts different kinds of evaluation logic.

| Evaluation logic | What it asks | What it rewards | What it can miss |
| --- | --- | --- | --- |
| Replication-based evaluation | Does the agent’s final explanation match a published human explanation? | Agreement with known expert labels | Ambiguous labels, memorized answers, weak research process |
| Process-aware evaluation | Did the agent design meaningful experiments and update hypotheses from evidence? | Research behavior, not just final text | Harder to automate and score consistently |
| Behavior-grounded intrinsic evaluation | Do components grouped together behave similarly under intervention? | Functional coherence inside the model | Limited scope; does not replace human interpretation |

This distinction matters because the paper’s central result is a reversal. At first, the agentic system appears competitive. Then the authors inspect what the evaluation is actually measuring, and the picture becomes less flattering to the benchmark rather than simply less flattering to the model.

That is the important move. The paper is not saying, “agents fail.” It is saying, “our usual way of deciding whether agents succeed may be too shallow.”

For business readers, this is the difference between testing whether an AI system can produce the right slide and testing whether it has followed a defensible analytical workflow. The first is cheap. The second is what matters when the slide becomes part of a risk report, model audit, investment memo, or regulatory file.

What the agent actually does

The technical setting is automated circuit analysis. In mechanistic interpretability, a “circuit” is a set of model components—often attention heads or MLPs—that jointly support a task. Human researchers try to explain the role of those components: which token a head attends to, what information it writes into the residual stream, whether it helps copy a name, identify a year, suppress an incorrect answer, or transmit positional information.

The paper focuses on a specific stage of this pipeline. It does not ask the agent to discover the circuit components from scratch. The components are assumed to be known. The task is to assign functional interpretations to those components and cluster components with similar roles.

The system has three stages.

First, a researcher specifies the task and provides the circuit to analyze. Second, a Claude Opus 4.1-based research agent analyzes each component independently. It receives task prompts, uses interpretability tools, proposes hypotheses, asks for further experiments, and stops when it produces a final hypothesis. Third, another LLM step clusters the component-level descriptions into shared functional groups.
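The per-component loop in the second stage can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Policy` callable stands in for the LLM, the tool registry stands in for real interpretability tools, and every name here (`analyze_component`, `Step`, `Analysis`, the head label) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces, not taken from the paper.
Tool = Callable[[str, str], str]  # (component, prompt) -> textual result

@dataclass
class Step:
    hypothesis: str
    tool: str
    prompt: str
    result: str

@dataclass
class Analysis:
    component: str
    steps: List[Step] = field(default_factory=list)
    final_hypothesis: str = ""

# A policy inspects the analysis so far and returns
# (current hypothesis, tool to call, prompt to run it on, done?).
Policy = Callable[[Analysis], Tuple[str, str, str, bool]]

def analyze_component(component: str, tools: Dict[str, Tool],
                      propose: Policy, max_iters: int = 5) -> Analysis:
    """Iterate hypothesis -> experiment until the policy stops or the budget runs out."""
    analysis = Analysis(component)
    for _ in range(max_iters):
        hypothesis, tool_name, prompt, done = propose(analysis)
        analysis.final_hypothesis = hypothesis  # always keep the latest hypothesis
        if done:
            break
        result = tools[tool_name](component, prompt)
        analysis.steps.append(Step(hypothesis, tool_name, prompt, result))
    return analysis

# Toy stand-ins so the sketch runs end to end.
def toy_policy(analysis: Analysis) -> Tuple[str, str, str, bool]:
    if not analysis.steps:
        return ("attends to the previous token", "attention",
                "When Mary and John went", False)
    return ("attends to the previous token", "", "", True)

toy_tools = {
    "attention": lambda comp, prompt: f"{comp}: attention over {len(prompt.split())} tokens"
}

run = analyze_component("head 3.2", toy_tools, toy_policy)
print(run.final_hypothesis, len(run.steps))
```

Keeping every `Step` rather than only the final hypothesis is deliberate: the recorded steps are what a process-level evaluation would later inspect.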

The agent’s tools are standard in this kind of work: attention patterns, logit lens, activation patching, and token-position inspection. The important part is not that these tools exist. The important part is that the agent can choose how to use them. It is not merely reading a static data dump; it can decide to test a hypothesis on new prompts or compare a clean prompt against a counterfactual.

That makes the system closer to a junior researcher than a summarizer. A junior researcher with an unusually expensive memory and no body, but still.

The first scorecard says the agent looks competitive

The paper evaluates the system across six prior circuit-analysis tasks from the literature:

| Task | Model | Circuit size | Reported clusters |
| --- | --- | --- | --- |
| Indirect Object Identification | GPT-2-Small | 18 heads | 6 |
| Indirect Object Identification | Pythia-160M | 14 heads | 7 |
| Greater-Than | GPT-2-Small | 8 heads, 4 MLPs | 2 |
| Acronyms | GPT-2-Small | 8 heads | 3 |
| Colored Objects | GPT-2-Medium / GPT-2-XL (task family as reported in the paper) | 27 heads | 3 |
| Entity Tracking | LLaMA-7B | 64 heads | 4 |

The evaluation uses three metrics. Component Functionality Accuracy asks whether individual component explanations map to the correct human-labeled cluster. Cluster Functionality Accuracy asks whether the agent’s cluster explanations map to the right expert cluster. Component Assignment Accuracy asks whether the grouping of components aligns structurally with expert-defined clusters, using optimal matching because cluster names are arbitrary.
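To make the "optimal matching" idea concrete, here is a small sketch of an assignment accuracy under the best one-to-one mapping between predicted and expert clusters. It is an illustration only: the paper's exact definition may differ, the function name and the example head labels are made up, and the brute-force search over permutations is a simplification that is only practical for a handful of clusters (the Hungarian algorithm is the usual choice beyond that).

```python
from itertools import permutations
from typing import Dict

def component_assignment_accuracy(predicted: Dict[str, str],
                                  expert: Dict[str, str]) -> float:
    """Fraction of components placed correctly under the best one-to-one
    mapping from predicted cluster names to expert cluster names."""
    components = list(expert)
    pred_labels = sorted({predicted[c] for c in components})
    exp_labels = sorted(set(expert.values()))

    # Pad the expert side with None so surplus predicted clusters can stay unmatched.
    size = max(len(pred_labels), len(exp_labels))
    exp_padded = exp_labels + [None] * (size - len(exp_labels))

    best = 0
    for perm in permutations(exp_padded, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        correct = sum(1 for c in components if mapping[predicted[c]] == expert[c])
        best = max(best, correct)
    return best / len(components)

# Hypothetical example: cluster names differ, but the structure mostly agrees.
predicted = {"h1": "A", "h2": "A", "h3": "B", "h4": "B"}
expert = {"h1": "mover", "h2": "mover", "h3": "inhibitor", "h4": "mover"}
print(component_assignment_accuracy(predicted, expert))  # 0.75
```

Here the best mapping is A to "mover" and B to "inhibitor", which gets three of four components right; the names themselves never enter the score.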

The authors also compare the agentic system with a one-shot baseline. The baseline receives precomputed experimental outputs and produces explanations in a single pass. It cannot iteratively refine hypotheses or design follow-up experiments.

The result is deliberately inconvenient: the agentic system is competitive, but greater autonomy does not consistently improve the final benchmark scores. Across the six tasks, both systems often achieve moderate to high agreement with expert explanations, yet neither perfectly matches the human analyses. The agent made 6,382 tool calls in total, averaging about 14.2 calls and 4.5 iterations per component analysis, but the final-score comparison does not consistently reward that extra research behavior.

This is where a weaker article would say, “agentic AI is overhyped.” That misses the better point. The agent may indeed be doing useful research-like work. The problem is that the chosen scorecard is not sensitive enough to distinguish useful research-like work from a competent one-shot answer.

A benchmark that cannot tell the difference between exploration and answer matching is not a benchmark of research ability. It is a benchmark of final-answer resemblance. That is not nothing. It is also not what the marketing brochure will call it.

The first crack: expert labels are not ground truth, only expensive hypotheses

Replication-based evaluation assumes that published human explanations are the reference answer. That assumption is practical. It is also fragile.

The paper shows why. In the Pythia-160M IOI task, one attention head had been labeled as a previous-token head. The agent’s experiments did not consistently reveal that behavior. The authors then tested the head more broadly on 150 examples from the PILE dataset. The previous token received the highest attention score in only 42% of cases.
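A check of this kind can be approximated in a few lines. The sketch below assumes attention weights are already available as plain lists, one per query position, and counts how often the previous token receives the highest weight. The function name and data layout are assumptions for illustration; the paper's exact counting procedure is not specified here.

```python
from typing import List, Sequence, Tuple

def previous_token_top_rate(rows: Sequence[Tuple[int, List[float]]]) -> float:
    """rows: (query_position, attention weights over key positions 0..query_position).
    Returns the fraction of rows where the top-attended key is the previous token."""
    hits = 0
    total = 0
    for query_pos, weights in rows:
        if query_pos == 0:
            continue  # the first token has no previous token
        top_key = max(range(len(weights)), key=weights.__getitem__)
        hits += int(top_key == query_pos - 1)
        total += 1
    return hits / total if total else 0.0

# Hypothetical attention rows for one head.
rows = [
    (2, [0.1, 0.7, 0.2]),             # top key 1 == previous token -> hit
    (3, [0.6, 0.1, 0.2, 0.1]),        # top key 0 -> miss
    (1, [0.3, 0.7]),                  # top key 1 is the query itself -> miss
    (4, [0.0, 0.1, 0.1, 0.7, 0.1]),   # top key 3 == previous token -> hit
]
print(previous_token_top_rate(rows))  # 0.5
```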

That number does not automatically prove the original label was wrong. It does something subtler: it shows that a single functional label may be too clean for a behavior that is input-dependent. If a component behaves like a previous-token head in some contexts but not in the majority of them, then “previous-token head” is not a universal fact. It is a useful compression, and compressions lose information. Interpretability labels are not magical just because a human wrote them in a paper.

The appendix makes the same point more forcefully for the Entity Tracking task. A large group of heads had been labeled as value fetcher heads, supposedly retrieving the object associated with a queried box. The authors re-examined this group across 500 examples using attention analysis, logit lens analysis, and counterfactual patching. Some heads behaved poorly under that label: for example, several attended to the correct object only a minority of the time, or assigned high probability to incorrect objects.

This is not a small editorial wrinkle. If the “gold standard” is ambiguous, then disagreement with it is not automatically failure. Sometimes the agent is wrong. Sometimes the expert label is incomplete. Sometimes both are pointing at different abstractions of the same mechanism.

For business evaluation, this is familiar. Ground truth in enterprise AI is often a spreadsheet assembled by busy people, not a divine artifact. Compliance categories, customer-intent labels, procurement-risk tags, fraud rationales, medical notes, sales-stage definitions: many are partly subjective. Scoring an agent against them may tell you whether it conforms to institutional language, not whether it has discovered the best explanation.

The second crack: final-answer metrics erase the research process

The agentic system’s strongest advantage is not necessarily its final text. It is the path it can take.

In the example run, the agent analyzes an attention head from the IOI task, proposes multiple hypotheses, and designs targeted experiments to test whether the behavior depends on giving verbs, name order, multiple people, or more general social-interaction patterns. That is research behavior. It is not merely “summarize the table.”

Yet the extrinsic evaluation mostly compares final explanations against published descriptions. Under that setup, an agent that runs follow-up experiments and a one-shot system that makes a plausible inference from static outputs can look similar.

This is the paper’s most business-relevant point. Agentic systems are often sold on process: they plan, inspect, call tools, revise, and verify. But many evaluations still score only the final answer. That creates a mismatch between the capability being purchased and the capability being tested.

A useful evaluation of an agent should therefore preserve traces of reasoning behavior. Not hidden chain-of-thought theater, but auditable operational traces: which tools were called, which hypotheses were tested, which evidence caused a revision, which edge cases were explored, and which uncertainty remained unresolved.
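One way to picture such a trace is as a small structured record that can be serialized and reviewed later. The sketch below is only an assumption about what the fields might look like; `TraceEvent`, `AgentTrace`, the tool names, and the head label are hypothetical and do not come from the paper.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class TraceEvent:
    step: int
    tool: str                 # which tool was called
    hypothesis_tested: str    # what the agent was trying to confirm or refute
    evidence_summary: str     # what came back
    revised_hypothesis: bool  # did this evidence cause a revision?

@dataclass
class AgentTrace:
    component: str
    events: List[TraceEvent] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)  # unresolved uncertainty

    def revision_count(self) -> int:
        return sum(e.revised_hypothesis for e in self.events)

    def tools_used(self) -> List[str]:
        return sorted({e.tool for e in self.events})

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)

trace = AgentTrace(component="head 9.6")
trace.events.append(TraceEvent(1, "attention_pattern", "attends to the subject name",
                               "peak on the indirect object instead", True))
trace.events.append(TraceEvent(2, "activation_patching", "moves the indirect object name",
                               "patching changes the answer logit", False))
trace.open_questions.append("behavior on prompts with three names")

print(trace.revision_count(), trace.tools_used())
```

A reviewer can then ask process-level questions (how many revisions, which tools, what was left open) without reading the final explanation at all.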

For interpretability agents, that trace is part of the output. For enterprise agents, it should be part of the control surface.

The third crack: memorization can wear a lab coat

The most uncomfortable issue is memorization.

The agent uses Claude Opus 4.1. The judge uses GPT-5. Both are large models trained on broad corpora. The paper asks a simple question: when the system appears to reproduce published circuit findings, is it reasoning from experimental evidence, or is it recalling what it has seen before?

The authors directly prompted Claude to list circuit components and their functionalities without running experiments. For the IOI circuit, Claude could recall the exact components and their attributed roles. GPT-5, used as a judge, also showed signs of direct recall for IOI.

This matters because memorization did not announce itself politely. Claude did not necessarily say, “Ah yes, I remember this paper.” Most generated explanations did not simply copy original terminology. Performance was also not perfect, which makes the behavior easier to miss. A model can remember enough to bias the result while still making errors. Apparently even memorization now has plausible deniability.

The Entity Tracking case adds another wrinkle. Claude did not explicitly report familiarity with the paper’s conclusions, but it could still produce a high-level characterization broadly consistent with the published interpretation from minimal cues. That may be informed guessing rather than memorization. For evaluation purposes, the distinction matters less than one might hope. In both cases, the agent’s apparent discovery may be partly coming from prior exposure or a small hypothesis space rather than fresh experimental reasoning.

This is especially dangerous for research-agent benchmarks. Published tasks are finite. Famous tasks are more likely to appear in training data. The better-known the benchmark, the more likely the model has seen clues. The benchmark becomes a closed-book exam administered to someone who may have memorized last year’s answer key. Not ideal. Very modern.

The noise test narrows the accusation without fully clearing the suspect

The paper does not stop at “maybe memorization.” It runs a noise sensitivity analysis to see whether the systems actually rely on experimental evidence.

The design is straightforward. The authors inject noise into the outputs of the tools the agent uses. For logit lens, they permute probability vectors. For patching, they permute token-wise probability differences. For attention patterns, they permute attention distributions. The noisy result is constructed as:

$$ (1 - \alpha) \cdot \text{original\_result} + \alpha \cdot \text{permuted\_result} $$

where $\alpha$ controls the amount of noise.
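As a rough sketch of that mixing step, assuming a tool output is a flat list of numbers such as a probability vector: the function name and the use of a seeded `random.Random` are illustrative choices, not details from the paper.

```python
import random
from typing import List, Sequence

def add_permutation_noise(values: Sequence[float], alpha: float,
                          rng: random.Random) -> List[float]:
    """Blend a vector with a random permutation of itself:
    (1 - alpha) * original + alpha * permuted."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    permuted = list(values)
    rng.shuffle(permuted)
    return [(1 - alpha) * v + alpha * p for v, p in zip(values, permuted)]

rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
noisy = add_permutation_noise(probs, alpha=0.5, rng=rng)
print(noisy, sum(noisy))
```

Because the permuted vector has the same entries as the original, the blend keeps the same total, so a probability vector stays a probability vector at every value of `alpha`. At `alpha = 0` the tool output is untouched; at `alpha = 1` it is a pure shuffle.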

If the systems were merely recalling the IOI circuit, performance should remain relatively stable even as tool outputs become corrupted. If they depend on experimental evidence, performance should degrade as noise increases.

The result is mixed in an informative way. At high noise levels, both the agentic and one-shot systems show substantial performance degradation. That suggests experimental evidence matters. At lower noise levels, degradation is limited, which could reflect memorization, robustness to noise, or the ability to recover structure despite partial corruption.

So the correct interpretation is not “it was all memorization.” The paper is more precise than that. Memorization alone cannot fully explain the behavior across noise regimes. But memorization and informed guessing remain serious enough that replication-based evaluation needs explicit contamination checks.

That nuance is useful for business practice. Many AI audits are framed as binary: the system either reasons or hallucinates, either retrieves or understands, either uses evidence or fakes it. Real systems are messier. They can combine partial memory, pattern recognition, weak evidence, strong priors, and genuine tool use in the same answer.

Evaluation has to be designed for that mixture.

The intrinsic metric asks components to prove they are functionally similar

After diagnosing the pitfalls of human-label replication, the paper proposes a proof-of-concept intrinsic evaluation. The goal is to evaluate cluster quality without relying on expert explanations.

The core idea is elegant. If two attention heads implement the same function, then swapping one head’s internal circuitry with the other should not change the model’s behavior very much. If the swap changes predictions substantially, the heads may not be functionally interchangeable.

The authors focus on attention heads and separately swap two parts of their circuitry: KQ circuits, which influence where a head attends, and OV circuits, which influence what information a head writes. They then measure the Jensen-Shannon distance between the model’s original next-token distribution and the distribution after the swap. The distance between two heads is defined as:

$$ \text{dist}(h_1, h_2) = \frac{1}{2}\left(\sqrt{JSD_{KQ}(h_1, h_2)} + \sqrt{JSD_{OV}(h_1, h_2)}\right) $$
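The formula can be worked through numerically. The sketch below treats JSD as the Jensen-Shannon divergence between two next-token distributions and uses base-2 logarithms so the divergence lies in [0, 1]. Both the log base and the way divergences would be aggregated over prompts are assumptions made here, not details taken from the paper.

```python
import math
from typing import Sequence

def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two distributions (base-2 logs)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x_i == 0 contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def head_distance(jsd_kq: float, jsd_ov: float) -> float:
    """dist(h1, h2) = 0.5 * (sqrt(JSD_KQ) + sqrt(JSD_OV))."""
    return 0.5 * (math.sqrt(max(jsd_kq, 0.0)) + math.sqrt(max(jsd_ov, 0.0)))

# Hypothetical next-token distributions before and after each swap.
original = [0.7, 0.2, 0.1]
after_kq_swap = [0.6, 0.3, 0.1]
after_ov_swap = [0.1, 0.2, 0.7]

d = head_distance(js_divergence(original, after_kq_swap),
                  js_divergence(original, after_ov_swap))
print(round(d, 3))
```

Two heads whose swaps leave the output distribution unchanged get a distance of 0; a swap that moves all probability mass elsewhere pushes the corresponding term toward 1.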

Using this distance matrix, they compute silhouette scores for different clusterings: expert clusters, agentic clusters, one-shot clusters, and random clusters. A better clustering should place functionally similar heads together and dissimilar heads apart.
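For readers who want to see how a silhouette score falls out of a precomputed distance matrix, here is a dependency-free sketch. It follows the standard definition (with the common convention that a singleton cluster scores 0); the toy distance matrix is invented, and the paper's own tooling is not specified here.

```python
from typing import Dict, List, Sequence

def silhouette_score(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette over all items, given pairwise distances and cluster labels."""
    clusters: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score 0 by convention
            continue
        a = sum(dist[i][j] for j in own) / len(own)        # mean distance within own cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)  # nearest other cluster
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy matrix: heads 0 and 1 are interchangeable, as are heads 2 and 3.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
good = silhouette_score(dist, [0, 0, 1, 1])  # groups match the structure
bad = silhouette_score(dist, [0, 1, 0, 1])   # groups cut across it
print(round(good, 3), round(bad, 3))
```

The well-matched grouping scores close to 1, while the grouping that splits interchangeable heads goes negative, which is the same qualitative pattern the paper reports for expert versus random clusterings.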

The results are modest but meaningful. Random clusterings produce negative silhouette scores, as expected. Expert clusters score higher. Agentic and one-shot systems average around zero or slightly above random, but some individual runs achieve relatively high scores. The intrinsic metric also tends to correlate positively with component assignment accuracy in most tasks, though the IOI Pythia-160M case is a clear exception.

This is not a finished evaluation solution. It is more like a useful instrument on a dashboard. It cannot tell the whole story, and it does not cover all component types or all interpretability claims. But it does change the direction of evidence. Instead of asking only whether the explanation resembles a human label, it asks whether the grouped components behave like they belong together.

That is a better question. Still imperfect, but better.

What each experiment is really doing

The paper is easy to misread if every result is treated as another “finding” of equal status. The experiments have different roles.

| Paper element | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Six-task extrinsic benchmark | Main evidence | The agentic and one-shot systems can often approximate published explanations | That the agent genuinely reasoned, or that autonomy reliably improves outcomes |
| One-shot baseline comparison | Comparison with prior/simple workflow | Final-answer scores may not reward iterative research behavior | That one-shot analysis is always as good as agentic analysis |
| Pythia previous-token recheck | Ambiguity diagnostic | Some expert labels may be incomplete or context-dependent | That all human circuit labels are unreliable |
| Entity Tracking appendix analysis | Deeper ambiguity audit | Several “value fetcher” heads do not consistently match the published label | That the agent’s alternative label is always better |
| Direct memory prompts | Memorization diagnostic | Claude and GPT-5 show signs of recalling at least IOI-related findings | That all results are memorized |
| Noise sensitivity analysis | Robustness/sensitivity test | Tool evidence affects performance, especially at high noise | A clean separation between reasoning and memorization |
| Swap-based silhouette metric | Exploratory intrinsic evaluation | Behavior-grounded cluster quality can complement human-label replication | A complete replacement for expert evaluation |

This table is the article’s practical hinge. The paper is not one big scoreboard. It is a sequence of pressure tests applied to a tempting evaluation method.

The business lesson: evaluate agents as workflows, not answer machines

For companies building or buying agentic AI systems, the paper points to a simple operational rule: do not evaluate an agent only by whether its final answer matches a reference answer.

That may be adequate for narrow extraction tasks. It is not adequate for open-ended analytical systems that claim to investigate, interpret, diagnose, or explain.

A stronger evaluation stack should separate five questions.

| Evaluation question | Practical control | Why it matters |
| --- | --- | --- |
| Did the final answer match a trusted reference? | Output-level benchmark | Useful baseline, but vulnerable to label ambiguity and memorization |
| Did the agent gather and use relevant evidence? | Tool-call logs, experiment plans, source traces | Distinguishes investigation from fluent guessing |
| Did the agent revise hypotheses when evidence changed? | Process audit, adversarial evidence, counterfactual tests | Tests whether the agent is responsive to evidence rather than anchored to priors |
| Does the claimed structure behave coherently? | Intrinsic or behavioral validation | Reduces dependence on subjective labels |
| Could the system have known the answer from training or retrieval? | Contamination checks, holdout tasks, private test sets | Prevents “research” from becoming recall with better formatting |

This applies beyond interpretability.

In legal research, an agent that matches a known memo may not have reasoned through the case law. In financial analysis, an agent that reproduces a market narrative may be echoing public commentary. In compliance, an agent that assigns the expected risk category may be conforming to a flawed policy label. In scientific discovery, an agent that replicates a published finding may be drawing from latent memory rather than experiment.

The pattern is the same: final agreement is useful, but not sufficient.

For AI governance teams, this means evaluation must become infrastructure. Not a one-time benchmark. Not a decorative “human in the loop.” Infrastructure means repeatable tests, trace capture, contamination screening, process review, and behavior-grounded validation where possible.

The annoying truth is that this makes agentic AI less plug-and-play than vendors prefer. The more autonomy a system has, the more evaluation surface it creates. If the agent can choose tools, design experiments, and update its own plan, then the evaluation must inspect those choices. Otherwise, autonomy becomes a branding layer wrapped around unverified behavior.

Where this paper stops short

The paper’s boundaries are important.

First, the circuit components are assumed to be known. The system is not solving the full interpretability pipeline from discovery to explanation. It is analyzing already identified components. That makes the problem narrower, although still difficult.

Second, the tasks are drawn from well-known circuit-analysis studies. That is necessary for comparison, but it also increases the risk of memorization and informed guessing. The paper recognizes this. Future evaluations need private or newly generated tasks where training exposure is much less likely.

Third, the intrinsic metric focuses on attention-head interchangeability through KQ and OV swaps. This is a meaningful behavior-grounded idea, but it does not cover every type of component, every kind of mechanism, or every level of explanation. A cluster can be behaviorally coherent under one intervention and still be poorly described in natural language.

Fourth, silhouette scores are a cluster-quality signal, not a full audit of explanation faithfulness. They can help compare runs and detect coherent groupings. They do not certify that the English explanation attached to the cluster is correct.

These limitations do not weaken the paper. They keep it honest. The contribution is not a universal evaluation framework. It is a careful demonstration that the comfortable framework we might have reached for first—replication against published explanations—is not strong enough on its own.

The better benchmark is less flattering, which is why it is useful

The most useful benchmarks are often the least flattering ones.

A flattering benchmark says the agent matches prior work and therefore looks intelligent. A useful benchmark asks whether the prior work is unambiguous, whether the agent used evidence, whether it might have memorized the answer, whether its process is auditable, and whether its proposed structure survives behavioral tests.

That is a less convenient story. It is also the story serious AI deployment needs.

The paper’s broader lesson is not limited to interpretability agents. As AI systems move from answering questions to conducting workflows, evaluation has to move from answer matching to process and behavior. Otherwise, we will keep rewarding systems for producing familiar conclusions while failing to ask how they got there.

The mirage of understanding is not that the agent is always wrong. It is that the agent can be convincingly right for reasons the benchmark cannot see.

For a chatbot, that is irritating. For an AI system used to audit another AI system, it is a governance problem wearing a lab coat.



1. Tal Haklay, Nikhil Prakash, Sana Pandey, Antonio Torralba, Aaron Mueller, Jacob Andreas, Tamar Rott Shaham, and Yonatan Belinkov, “Pitfalls in Evaluating Interpretability Agents,” arXiv:2603.20101, 2026. https://arxiv.org/abs/