Same Maps, Different Moves: Why LLMs Can Converge Without Understanding

Meetings are useful theatre. Everyone can nod at the same slide, repeat the same market keywords, and still leave the room with incompatible plans. The agreement was real. The shared understanding was not.

Large language models may be doing something uncomfortably similar.

The paper “Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning” studies whether models that look similar internally are actually reasoning in similar ways.¹ This matters because a tempting story has been building around representational convergence: as models scale, their internal representations become more alike, perhaps because they are converging toward a shared statistical model of reality. That story is elegant. It is also a little too convenient, which is usually where expensive mistakes begin.

The paper’s central correction is precise: cross-model representational convergence does not automatically mean cross-model reasoning convergence. The models may align while encoding the input, diverge while producing the answer, and even contain decodable correctness information that barely affects their final prediction. In business terms, two systems can inspect the same case file in a similar way and still make different decisions for different reasons. Similar intake is not similar judgment.

This is not a paper about LLMs being useless. It is more annoying than that. It says some of our favorite diagnostic shortcuts may be pointing at the wrong layer of the problem.

The useful question is not whether models become similar, but where similarity lives

The background concept is the Platonic Representation Hypothesis: different models, trained under different conditions, may develop increasingly similar internal representations as they scale. In the optimistic version, this suggests that capable models are converging toward shared structure in the data-generating world. If that were broadly true, it would support a comfortable bundle of assumptions: interpretability findings might transfer across models, ensembles might be designed around representational diversity, and model similarity metrics might say something deep about shared understanding.

The authors do not dismiss representational convergence. They narrow it.

They study 16 language models from 8 families, ranging from 1.5B to 72B parameters. The main cohort contains 14 instruction-tuned models; two larger models, LLaMA-3.1-70B and Qwen-2.5-72B, are used for scale validation. The task set contains 800 reasoning problems, 200 each from GSM8K, ARC-Challenge, TruthfulQA, and HellaSwag. The paper records hidden states across layers and compares models using Centered Kernel Alignment, or CKA, with mutual nearest neighbors and SVCCA used as complementary checks.
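For readers who have not met CKA before, a minimal sketch of the linear variant in dependency-free Python may help. It is an illustration under stated assumptions, not the authors' pipeline: the function names and toy hidden states are hypothetical, a linear kernel is assumed, and a real analysis would use an array library over many more activations than these nested lists.

```python
import math


def center_columns(X):
    """Subtract each feature's mean over examples (rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def gram(X):
    """Example-by-example inner products (linear kernel)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]


def frobenius_inner(A, B):
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def linear_cka(X, Y):
    """Linear CKA between two sets of hidden states.

    X is n x d1 and Y is n x d2; row i of both must come from the same example.
    """
    K = gram(center_columns(X))
    L = gram(center_columns(Y))
    return frobenius_inner(K, L) / math.sqrt(
        frobenius_inner(K, K) * frobenius_inner(L, L)
    )


# Toy hidden states for four prompts (hypothetical values).
model_a = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
# model_b is model_a rotated by 90 degrees and scaled by 2.
model_b = [[-2.0 * y, 2.0 * x] for x, y in model_a]
# model_c encodes the same prompts along one unrelated axis.
model_c = [[1.0], [-1.0], [1.0], [-1.0]]

print(linear_cka(model_a, model_b))
print(linear_cka(model_a, model_c))
```

The rotated, rescaled copy scores essentially 1 because linear CKA ignores rotations and uniform rescaling of the feature space, which is what lets it compare models with different hidden bases and widths. A high score says two models arrange the examples similarly. It says nothing about whether either one answers correctly.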

The important design choice is not the metric itself. It is the stratification. The authors do not ask only whether representations are similar. They ask whether similarity changes by:

Stratification axis	What it tests	Why it matters
Problem difficulty	Do models converge more when they succeed or when they fail?	Distinguishes shared understanding from shared confusion.
Computational stage	Do models align before or after the decision point?	Separates input encoding from answer generation.
Causal relevance	Does shared information actually affect predictions?	Separates readable features from operational mechanisms.

That third row is the whole article. A business system does not need a model that merely contains a signal somewhere in its residual stream, like a receipt stuffed into the wrong drawer. It needs the signal to influence the decision reliably.

Failed problems can make models look more alike

The first result is the paper’s cleanest slap in the face: models are more representationally similar on problems they collectively fail than on problems they collectively solve.

In the main difficulty analysis, hard problems are those answered correctly by 0 to 4 of the 14 core models. Easy problems are those answered correctly by 10 to 14 models. The hard set produces mean CKA of 0.897. The easy set produces mean CKA of 0.830. Mutual nearest neighbors confirm the direction: 0.645 for hard problems versus 0.584 for easy ones.
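The binning rule is simple enough to write down. The sketch below assumes a boolean record of which model solved which problem; the function names and the "medium" label for the unnamed 5-to-9 band are illustrative choices, not the paper's terminology.

```python
def difficulty_bin(n_correct, n_models=14, hard_max=4, easy_min=10):
    """Label a problem by how many of the core models solved it.

    Thresholds follow the article: hard = 0-4 correct, easy = 10-14 correct.
    The "medium" label for 5-9 is an assumption added for completeness.
    """
    if not 0 <= n_correct <= n_models:
        raise ValueError("n_correct must lie between 0 and n_models")
    if n_correct <= hard_max:
        return "hard"
    if n_correct >= easy_min:
        return "easy"
    return "medium"


def bin_problems(correct_matrix):
    """Group problem indices by difficulty.

    correct_matrix[p][m] is True when model m answered problem p correctly.
    """
    bins = {"hard": [], "medium": [], "easy": []}
    for problem_idx, row in enumerate(correct_matrix):
        label = difficulty_bin(sum(row), n_models=len(row))
        bins[label].append(problem_idx)
    return bins


# Hypothetical correctness records for three problems and 14 models.
toy = [
    [True] * 2 + [False] * 12,   # 2 of 14 correct
    [True] * 7 + [False] * 7,    # 7 of 14 correct
    [True] * 13 + [False] * 1,   # 13 of 14 correct
]
print(bin_problems(toy))
```

Similarity would then be averaged separately over the hard and easy index sets, which is the comparison behind the 0.897 versus 0.830 figures.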

Under a naive convergence-as-understanding story, this is backwards. One would expect models to align most when they successfully capture the structure of the problem. Instead, they align more when they fail. Apparently, confusion can be wonderfully synchronized.

The paper calls this difficulty inversion. The result also survives a useful objection: maybe models look similar simply because they give the same answers. The authors condition on answer agreement and find the opposite. Model pairs giving different answers show higher CKA than pairs giving the same answer. Both-wrong pairs also show especially high similarity. That does not look like shared competence. It looks like shared pre-decision processing followed by divergent or weakly grounded output behavior.

The domain breakdown is important. The inversion appears in science, truthfulness, and commonsense reasoning. Mathematics is the exception: GSM8K follows the expected direction, with harder math problems producing lower CKA and easier problems producing higher CKA. The authors interpret this as a sign that mathematical reasoning is more algorithmically constrained. There are fewer valid ways to carry out certain arithmetic procedures, so successful models may be forced into more similar strategies.

This exception should not be brushed aside as a nuisance. It is a boundary condition. In domains with constrained solution paths, representational convergence may better track successful computation. In domains with many plausible reasoning paths, successful models can diverge while failed models collapse into similar uninformative patterns. That distinction matters for enterprise use because most business reasoning is not long division. Policy interpretation, contract review, customer prioritization, credit-risk narratives, procurement decisions, and executive summaries all allow multiple paths from evidence to conclusion. Some of those paths are good. Some merely sound employed.

Models agree before they decide, then diverge when the answer is made

The second finding explains why the first one is possible. The authors split each model’s layers into pre-decision and post-decision stages, using the point at which a correctness probe first exceeds chance accuracy as the stage boundary. Pre-decision layers show high representational alignment: CKA of 0.875. Post-decision layers collapse to 0.274. The gap is 0.601, and 89 of 91 model pairs show a pre-post gap above 0.40.
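A rough sketch of that stage split, under loudly stated assumptions: chance is taken as 0.5 for a binary correctness probe, the boundary layer itself is counted as post-decision, and one shared boundary is applied to a layer-aligned CKA curve for a model pair. The source does not spell out these details, and every number below is made up for illustration.

```python
from statistics import mean


def decision_layer(probe_accuracy, chance=0.5, margin=0.0):
    """Return the first layer index where probe accuracy exceeds chance.

    `chance` and `margin` are assumptions; the article only says the probe
    "first exceeds chance accuracy". Returns None if no layer qualifies.
    """
    for layer, acc in enumerate(probe_accuracy):
        if acc > chance + margin:
            return layer
    return None


def pre_post_gap(layer_cka, boundary):
    """Mean CKA before the boundary, from the boundary onward, and their gap."""
    if boundary is None:
        raise ValueError("no decision layer was found")
    pre, post = layer_cka[:boundary], layer_cka[boundary:]
    if not pre or not post:
        raise ValueError("boundary must leave at least one layer on each side")
    pre_mean, post_mean = mean(pre), mean(post)
    return pre_mean, post_mean, pre_mean - post_mean


# Hypothetical per-layer numbers for one model pair (not from the paper).
toy_probe_accuracy = [0.50, 0.49, 0.58, 0.66, 0.71]
toy_layer_cka = [0.91, 0.88, 0.35, 0.27, 0.22]

boundary = decision_layer(toy_probe_accuracy)
pre, post, gap = pre_post_gap(toy_layer_cka, boundary)
print(boundary, round(pre, 3), round(post, 3), round(gap, 3))
```

The point of the split is that a single average over all five toy layers would land near the middle and hide both the high early agreement and the late collapse.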

This is the paper’s mechanism-first turning point. Similarity is concentrated before the model commits to an answer.

A useful way to read this is through the distinction between intake and judgment. During intake, models process the same prompt, the same tokens, and the same surface structure. Because they share the transformer architecture and exposure to similar language statistics, their hidden representations can align. During judgment, they must select premises, weight cues, retrieve procedures, suppress distractors, and prepare the next-token path. That is where model-specific computation becomes visible.

This is why aggregate similarity metrics can be misleading. If a metric averages across the wrong stage, it can report convergence while hiding the fact that the action-relevant part of the computation has already split apart. The models agree in the lobby. The boardroom is chaos, but with better GPUs.

The implication for interpretability is immediate. Interpretability findings about input-processing circuits may transfer more plausibly across architectures. Findings about late-layer reasoning circuits are a different animal. If post-decision representations diverge sharply, then a mechanism discovered in one model may not be a safe explanation for another model’s final reasoning behavior, even when the two models look similar at earlier layers.

Correctness can be readable without being used

The third finding is subtler and more operationally dangerous. The paper trains transfer probes: a linear classifier trained on one model’s hidden states to predict correctness, then evaluated on another model’s hidden states. The probe achieves 66% accuracy, above a 55% permutation baseline and slightly above the 62.9% majority-class baseline. The signal is not huge, but it is consistent: it exceeds the majority-class rate for 78 of 91 model pairs.
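The shape of a transfer-probe experiment can be sketched in a few lines. This is a deliberately simplified stand-in, not the paper's procedure: it swaps in a difference-of-class-means probe for whatever linear classifier the authors trained, it assumes both models' hidden states already live in a shared space of equal dimension (the article does not say how differing widths are handled), and all data and names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_mean_difference_probe(states, labels):
    """Fit a linear probe whose direction is the difference of class means.

    A simple stand-in for a trained linear classifier (an assumption).
    Returns (weights, bias); the prediction is `dot(w, h) + b > 0`.
    """
    correct = [h for h, y in zip(states, labels) if y]
    wrong = [h for h, y in zip(states, labels) if not y]
    mu_correct, mu_wrong = column_means(correct), column_means(wrong)
    w = [c - r for c, r in zip(mu_correct, mu_wrong)]
    midpoint = [(c + r) / 2 for c, r in zip(mu_correct, mu_wrong)]
    return w, -dot(w, midpoint)


def probe_accuracy(w, b, states, labels):
    hits = sum((dot(w, h) + b > 0) == bool(y) for h, y in zip(states, labels))
    return hits / len(labels)


def majority_baseline(labels):
    """Accuracy of always predicting the more common class."""
    rate = sum(bool(y) for y in labels) / len(labels)
    return max(rate, 1 - rate)


# Hypothetical hidden states and correctness labels (1 = answered correctly).
source_states = [[2.0, 1.0], [3.0, 1.0], [2.0, 2.0], [0.0, 1.0], [-1.0, 0.0]]
source_labels = [1, 1, 1, 0, 0]
target_states = [[1.8, 0.9], [2.5, 1.5], [0.2, 0.8],
                 [1.5, 0.2], [-0.5, 0.3], [2.2, 1.0]]
target_labels = [1, 1, 0, 0, 0, 1]

# Train on the source model, evaluate on the target model.
w, b = fit_mean_difference_probe(source_states, source_labels)
transfer_acc = probe_accuracy(w, b, target_states, target_labels)
print(transfer_acc, majority_baseline(target_labels))
```

The comparison that matters is the last line: transfer accuracy only means something relative to what a probe with no information would score, which is why the paper reports both a permutation baseline and a majority-class baseline.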

So correctness-related information is shared enough to be read across models. That sounds encouraging, until the ablation arrives.

The authors then identify a correctness-predictive subspace and ablate it by projecting activations onto the orthogonal complement. If this shared correctness information were causally important, removing it should often flip predictions. The main flip rate is only 1.5%. Under an expanded protocol using problems where at least 10 of 14 models are correct, the flip rate rises to 5.5%, with tested model rates ranging from 3.0% to 7.0%.
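Projecting onto the orthogonal complement has a compact form: given an orthonormal basis q_1, ..., q_k for the subspace, the ablated state is h minus the sum of (h · q_i) q_i. The sketch below shows that operation and a flip-rate count. The toy readout, the states, and the function names are hypothetical, and how the authors actually estimate the correctness subspace is not reproduced here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(vectors, tol=1e-12):
    """Gram-Schmidt: turn spanning vectors into an orthonormal basis."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def ablate(h, basis):
    """Remove the component of h that lies in the span of `basis`."""
    out = list(h)
    for q in basis:
        coeff = dot(h, q)
        out = [oi - coeff * qi for oi, qi in zip(out, q)]
    return out


def flip_rate(states, basis, predict):
    """Fraction of states whose prediction changes after ablation."""
    flips = sum(predict(h) != predict(ablate(h, basis)) for h in states)
    return flips / len(states)


# Hypothetical readout: this toy model decides using dimension 0 only.
def predict(h):
    return h[0] > 0


states = [[1.0, 2.0, 0.0], [-1.0, 3.0, 1.0], [2.0, -1.0, 0.0], [-2.0, 0.5, 1.0]]

# A direction the toy model never reads, and one it depends on.
unused_subspace = orthonormalize([[0.0, 2.0, 0.0]])
used_subspace = orthonormalize([[1.0, 0.0, 0.0]])

print(flip_rate(states, unused_subspace, predict))
print(flip_rate(states, used_subspace, predict))
```

In the toy case, information along dimension 1 could be perfectly decodable by a probe and still produce zero flips when removed, because the readout never consults it. That is the pattern a low flip rate points to.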

That is the paper’s most useful warning for practical AI evaluation: decodability is not deployment. A model can contain information that a probe can read, while the model itself barely uses that information to decide.

This is not a pedantic distinction. Many AI governance workflows implicitly treat “the model encodes X” as evidence that “the model uses X.” That leap is unsafe. A diagnostic feature can be present, statistically extractable, and still operationally weak. A dashboard that detects a correctness-like signal in hidden states may be informative, but it is not automatically a control mechanism.

The head-level ablation adds another layer. When the authors ablate individual attention heads, some heads have much larger effects: maximum flip rates reach 43% to 63% in multi-head attention models and about 20% in grouped-query attention models. But the paper is careful: zeroing a whole head disrupts broad computation, so these rates reflect general head importance rather than correctness-specific causality. The useful point is architectural: full-hidden-state CKA can miss head-level computational differences that matter for behavior.

In short, the shared feature may be non-causal, while the causal feature may be architecture-specific. Lovely. Exactly the kind of neat interpretability shortcut executives should not be sold.

Diffuse attention explains why shared failure looks like convergence

The authors propose a mechanism for the difficulty inversion: hard problems produce more diffuse, higher-entropy attention. When attention becomes diffuse, heads aggregate information more uniformly across tokens. That makes representations less model-specific and more alike across architectures. The paper reports a consistent relationship across six tested models: harder problems produce higher-entropy attention patterns, with the average entropy-difficulty correlation summarized as Pearson $r = 0.43$ in the main findings table.

The intuition is simple. On solvable problems, models may focus on different decisive cues and develop model-specific computational paths. On unsolvable problems, attention becomes less selective. The model is not converging on truth. It is failing in a generic way.
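Attention entropy is just Shannon entropy over one query's attention weights, and the reported relationship is a Pearson correlation between that entropy and problem difficulty. A small sketch, with hypothetical weights and scores, and with natural-log entropy assumed since the article does not state the base or how entropies are pooled across heads and layers:

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    Weights are renormalized to sum to 1; zero weights contribute nothing.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return sum(p * math.log(1 / p) for p in probs if p > 0)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical attention rows over four tokens.
focused = [0.85, 0.05, 0.05, 0.05]   # one decisive cue
diffuse = [0.25, 0.25, 0.25, 0.25]   # nothing stands out

print(attention_entropy(focused), attention_entropy(diffuse))

# Hypothetical per-problem values: share of models that failed, mean entropy.
difficulty = [0.1, 0.3, 0.5, 0.7, 0.9]
mean_entropy = [0.9, 1.1, 1.0, 1.3, 1.35]
print(pearson(difficulty, mean_entropy))
```

The uniform row reaches the maximum possible entropy for four tokens, log 4. A correlation computed this way describes co-movement only, so it cannot by itself show that diffuse attention causes the similarity.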

The random-model baseline sharpens the argument. Randomly initialized models show higher CKA than trained models: 0.864 versus 0.612 in the paper’s summary. This means that some apparent convergence is not learned wisdom. It is architectural commonality. Transformers share causal masking, residual streams, layer normalization, and related design constraints. Training then differentiates models as they develop specialized circuits.

The embedding-layer baseline cuts in the opposite direction: embedding layers show near-zero CKA, while mid-layers reach 0.944 and deep layers 0.917. So the convergence is not simply tokenizer overlap. It emerges through transformations inside the network. But it emerges against a strong architectural backdrop, not necessarily because every model has climbed the same mountain of truth and planted the same tiny flag.

The cleanest interpretation is this: language models converge more reliably on shared perception of the input than on shared reasoning over the input. Perception-like processing is not trivial. It is still useful. But it is not the same as decision logic.

The appendix mostly stress-tests the mechanism, not a second thesis

The appendix is useful because it tells us which parts of the argument are robust and which parts remain method-sensitive.

Appendix test	Likely purpose	What it supports	What it does not prove
Sample-size stability	Robustness check	Difficulty inversion is stable for subsets of 100+ problems per domain; 50-problem subsets lose power in TruthfulQA.	It does not eliminate the need for broader benchmarks.
Kernel CKA	Metric sensitivity test	The inversion holds under centered, uncentered, and RBF kernel CKA; kernel CKA gives a larger hard-easy gap.	It does not make CKA a causal metric.
Probe regularization	Ablation/sensitivity test	Transfer accuracy ranges from 63% to 68%, while ablation flip rate stays below 2%.	It does not prove nonlinear correctness features are irrelevant.
Prompt sensitivity	Robustness check	The inversion persists under zero-shot direct, chain-of-thought instruction, and few-shot prompting; standard deviation across formats is below 4%.	It does not cover agentic tool-use workflows.
MNN and SVCCA validation	Comparison across similarity metrics	Alternative metrics confirm the inversion and the mathematics exception.	It does not resolve all known limitations of representation similarity measures.
Answer-agreement analysis	Alternative-explanation test	Different-answer pairs show higher CKA than same-answer pairs in 83 of 91 model pairs.	It does not fully explain why specific errors occur.
Expanded causal ablation	Power and scope extension	Flip rates remain low, 3% to 7% across tested models, under a broader high-agreement set.	It still depends on a linear correctness-subspace assumption.

This is a disciplined appendix. It does not magically remove every limitation. It does make the main story harder to dismiss as a one-metric artifact.

The base-model control is especially relevant. The authors repeat the difficulty inversion analysis on four non-instruction-tuned base models. The inversion not only remains; it becomes stronger. Hard problems where 0 of 4 base models are correct show CKA of 0.962, while easy problems where all 4 are correct show CKA of 0.827. The monotonic pattern across bins suggests instruction tuning is not causing the phenomenon. If anything, alignment training may attenuate it.

For practitioners, that matters because it blocks a lazy explanation: “This is just RLHF weirdness.” No. The paper’s evidence points deeper, toward transformer representational dynamics and the difference between input processing and answer computation.

What the paper directly shows, and what Cognaptus infers for business use

The paper directly shows three dissociations across its tested setting: difficulty inversion, generation-stage divergence, and epiphenomenal correctness, meaning correctness information that can be decoded from hidden states but barely affects the prediction. It also gives mechanistic evidence that diffuse attention and architectural baselines help explain why raw convergence numbers can overstate shared reasoning.

Cognaptus infers three business implications from that evidence.

First, model selection should not rely on representational similarity as a proxy for reasoning diversity. If two models have similar internal representations during input processing, they may still diverge at output generation. Conversely, if they look highly similar on hard cases, that may reflect shared confusion rather than robust agreement. For ensemble design, the better operational signal is behavioral diversity under controlled task slices: where do models disagree, what kinds of errors do they make, and do those errors decorrelate on cases that matter?

Second, interpretability transfer should be stage-aware. A finding about input processing may travel across models. A finding about late-layer reasoning or decision preparation should be treated as model- and architecture-sensitive until tested. This is particularly important as deployed systems mix different attention architectures, including multi-head attention and grouped-query attention. A hidden-state-level similarity score can miss head-level causal differences.

Third, AI safety and audit workflows should distinguish “encoded,” “extractable,” and “causally used.” These are not synonyms. A probe may extract correctness-related information, while ablation shows the model barely depends on it. A compliance monitor that detects a signal is not automatically a steering mechanism. Detection is not control. Enterprises regularly forget this, usually right before procurement discovers the invoice has a personality.

A practical framework looks like this:

Business question	Better diagnostic	Why raw similarity is insufficient
Which models should we ensemble?	Output-level error decorrelation by task type and difficulty	High representation similarity may reflect shared input encoding or shared failure.
Can an interpretability finding transfer?	Layer/stage-specific validation plus architecture checks	Pre-decision alignment does not imply post-decision alignment.
Does the model use a safety-relevant signal?	Causal intervention, counterfactual testing, or controlled ablation	Probe decodability does not prove causal deployment.
Are two models “reasoning the same way”?	Process tracing across stages, not one aggregate CKA score	Similar hidden states can hide divergent generation paths.
Are hard-case agreements reassuring?	Error taxonomy and independent verification	Hard cases may create homogenized representations through diffuse attention.
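The first row of that framework, output-level error decorrelation, needs nothing more than predictions and gold labels. The sketch below is one plausible way to measure it, using the correlation between binary error indicators and the both-wrong rate. The metric choice, function names, and toy answers are assumptions made for illustration, not something the paper prescribes.

```python
import math


def error_vector(predictions, gold):
    """1 where the model is wrong, 0 where it is right."""
    return [int(p != g) for p, g in zip(predictions, gold)]


def error_correlation(errors_a, errors_b):
    """Pearson (phi) correlation between two binary error vectors.

    Returns None when either model has no variation in its errors,
    because the correlation is undefined in that case.
    """
    n = len(errors_a)
    mean_a, mean_b = sum(errors_a) / n, sum(errors_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(errors_a, errors_b))
    var_a = sum((a - mean_a) ** 2 for a in errors_a)
    var_b = sum((b - mean_b) ** 2 for b in errors_b)
    if var_a == 0 or var_b == 0:
        return None
    return cov / math.sqrt(var_a * var_b)


def both_wrong_rate(errors_a, errors_b):
    """Share of problems that both models miss."""
    return sum(a and b for a, b in zip(errors_a, errors_b)) / len(errors_a)


# Hypothetical multiple-choice answers on one task slice.
gold = ["A", "B", "C", "D", "A", "B"]
model_x = ["A", "B", "C", "A", "A", "C"]
model_y = ["A", "B", "C", "B", "A", "D"]   # fails where model_x fails
model_z = ["B", "B", "C", "D", "C", "B"]   # fails elsewhere

err_x, err_y, err_z = (error_vector(m, gold) for m in (model_x, model_y, model_z))

print(error_correlation(err_x, err_y), both_wrong_rate(err_x, err_y))
print(error_correlation(err_x, err_z), both_wrong_rate(err_x, err_z))
```

All three toy models have the same accuracy. Pairing model_x with model_y adds nothing on the cases that matter because they miss the same problems, while model_x and model_z never fail together. Running this per task type and difficulty bin, rather than once over a whole benchmark, is what makes it a slice-level diagnostic.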

This does not mean representation metrics are useless. It means they should be treated as diagnostic instruments with a limited field of view. A thermometer is useful. It is not a business strategy, and it should not be asked to approve a loan.

Where the result applies, and where it should not be overextended

The evidence is broad enough to be taken seriously but not universal enough to be converted into doctrine.

The model set spans 1.5B to 72B parameters, but not frontier proprietary systems at much larger or undisclosed scales. The tasks cover math, science, truthfulness, and commonsense reasoning, but not long-horizon enterprise workflows with tools, retrieval, memory, planning, and human feedback loops. The causal ablation assumes a linear correctness subspace; nonlinear mechanisms could behave differently. The attention-entropy account is correlational, not a direct causal intervention. CKA itself has known sensitivities, although the paper mitigates these with alternative metrics and robustness checks.

The mathematics exception is also a real boundary. In algorithmically constrained domains, successful models may converge more because the solution space is narrow. That means the paper should not be read as “similar representations never matter.” The better rule is conditional: similarity is more meaningful when the domain forces a narrow set of valid computational strategies; it is less meaningful when many plausible reasoning paths exist.

For business AI, that distinction is valuable. Spreadsheet calculation, formal verification, symbolic transformation, and narrow procedural tasks may benefit from convergence-based assumptions more than legal interpretation, customer support escalation, market narrative analysis, or policy summarization. Different tasks deserve different diagnostic instruments. This is not glamorous. It is also cheaper than discovering after deployment that your “diverse” ensemble is a committee of models making the same confused face.

The safer lesson: inspect the route, not just the resemblance

The paper’s title, Convergence Without Understanding, is not just a warning about a technical metric. It is a warning about an organizational habit. Businesses like proxies: benchmark scores, similarity measures, dashboards, compliance labels, risk badges. Proxies are convenient because they compress uncertainty. They are dangerous when the compression removes the mechanism.

This paper shows that LLMs can share representations without sharing reasoning, encode correctness without using it, and appear more aligned precisely where they fail. The mechanism is not mystical. It is staged: shared input encoding, divergent generation, diffuse attention on hard problems, architectural baselines, and weak causal deployment of correctness-like signals.

That staged view is the practical contribution. It tells evaluators where to look:

Separate input processing from answer generation.
Test hard cases differently from easy cases.
Treat probe-readability as evidence, not control.
Validate interpretability transfer at the relevant layer and architecture.
Build ensembles around behavioral error diversity, not aesthetic internal dissimilarity.

Representational convergence may still be real. It may still be useful. But it is not a certificate of shared understanding. In enterprise AI, that should be the default assumption anyway. When two systems agree internally and disagree operationally, the safe response is not philosophical excitement. It is process control.

The models may have the same map. They are still taking different roads.

Cognaptus: Automate the Present, Incubate the Future.

¹ Muhammad Usama and Dong Eui Chang, “Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning,” arXiv:2605.23315, 2026. https://arxiv.org/html/2605.23315
