No Structure, No Glory: Why AI Cognition Has to Be Shown, Not Named

TL;DR for operators

AI systems are now sold with labels that sound increasingly cognitive: reasoning, planning, agency, memory, autonomy, sometimes even the more theatrical hints of machine consciousness. Lovely. The marketing department has discovered philosophy.

The useful question is not whether the label feels exciting. It is whether the system realizes an internal organization that could actually support the claimed capability.

Two recent papers make that point from very different levels. One studies how Transformers trained only on local pairwise comparisons can form a rank-aligned internal geometry that supports transitive inference.¹ The other develops a formal account of what would be required for a simulated or artificial system to realize consciousness-relevant causal-computational organization, rather than merely imitate behavior at the boundary.²

Read together, they form a clean logic chain:

Step	Question	What the papers contribute
1	Can a model’s output hide the real mechanism?	Yes. Accuracy alone can saturate while internal structure reveals how generalization is achieved.
2	What kind of structure matters?	For ordinal reasoning, a low-dimensional rank geometry can carry the relevant relation.
3	Is this structure universal?	No. In a pretrained LLM, ordinal structure appears domain by domain, not as one magical “ordinality” switch.
4	What happens when the claim becomes larger, such as artificial consciousness?	Boundary behavior becomes even less adequate. The relevant question becomes whether the system realizes intrinsic causal-computational organization.
5	What should operators do with this?	Demand evidence of structure, stability, intervention sensitivity, and boundary adequacy before accepting cognitive labels.

This is not an article about whether today’s AI is conscious. It is about something more useful and less likely to embarrass us later: how to stop confusing outputs with organization.

The problem now: AI labels are getting cheaper

A few years ago, calling an AI system “intelligent” was already enough to cause mild conceptual indigestion. Now systems are “agentic,” “reasoning,” “self-improving,” “autonomous,” “memory-augmented,” “reflective,” and occasionally “proto-conscious,” depending on how allergic the vendor is to restraint.

The business problem is straightforward. Organizations are being asked to make procurement, governance, automation, and risk decisions based on claims that sit somewhere between technical evidence and brand positioning. A model passes a benchmark, gives a fluent explanation, solves a demo task, or behaves as though it has a plan. From there, people infer a capability.

Sometimes that inference is reasonable. Often it is just theatre with an API key.

The papers in this cluster are useful because they push the evaluation question inward. They ask, in different ways: what organization inside the system supports the behavior?

That question sounds academic until it is not. In production, output quality can degrade under distribution shift. A model can pass a test without learning the intended rule. An agent can appear coordinated while relying on brittle prompt patterns. A simulation can resemble a thing without realizing the structure that makes the thing what it is. The difference between “works on the demo” and “has the relevant organization” is where expensive mistakes like to live.

The first paper gives us a tractable empirical case: ordinal reasoning in Transformers. The second gives us a much broader realization criterion: when should an artificial or simulated system count as realizing a consciousness-relevant organization rather than merely simulating it? The distance between those topics is large, but the bridge is precise.

Do not evaluate cognitive claims only at the boundary. Extract the structure.

The empirical anchor: a model learns a line, not just answers

The first paper, Emergent Ordinal Geometry in Transformers Trained on Local Comparisons, studies transitive inference. The task is familiar: if the system learns that $A < B$ and $B < C$, can it infer that $A < C$ without seeing that comparison directly?

The authors train small Transformers only on adjacent comparisons in a hidden total order. The model sees local relations, not distant pairs. Evaluation happens on held-out distant comparisons. So success cannot be simple memorization of every pair. The system must reconstruct the global order from local evidence.
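The train/test split described above can be sketched in a few lines. This is a minimal illustration, not the paper's data pipeline: the function name `make_comparison_splits`, the use of integers as stand-ins for entities, and the choice to include both presentation orders of each pair are all assumptions made here. In a real setup the entities would be opaque tokens that do not reveal their rank.

```python
from itertools import combinations

def make_comparison_splits(n_items):
    """Split pairwise comparisons over a hidden total order 0 < 1 < ... < n-1.

    Training examples are adjacent in rank (distance 1). Held-out examples are
    the distant pairs (distance >= 2). Each example is (a, b, label), where
    label is 1 if a < b in the hidden order and 0 otherwise.
    """
    train, held_out = [], []
    for i, j in combinations(range(n_items), 2):  # i < j by construction
        examples = [(i, j, 1), (j, i, 0)]  # assumption: both orders are shown
        if j - i == 1:
            train.extend(examples)
        else:
            held_out.extend(examples)
    return train, held_out

train, held_out = make_comparison_splits(7)
print(len(train), "training examples;", len(held_out), "held-out examples")
```

No held-out pair ever appears in training, so a model that answers the distant comparisons correctly cannot be doing so by recalling them.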

The interesting part is not merely that the model generalizes. The interesting part is how.

The paper reports that successful out-of-distribution generalization coincides with a geometric reorganization of the entity embeddings. The embeddings collapse toward a one-dimensional manifold whose principal axis recovers the hidden rank order. In plain business English: the model does not just learn isolated facts. It organizes the items along an internal line.
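One way to check for this kind of geometry is to ask two questions of the embeddings: how much variance sits on the first principal axis, and does the ordering along that axis match the hidden ranks? The sketch below does that on synthetic data. Everything in it is an assumption for illustration: the embeddings are made up, the helper names are hypothetical, and power iteration stands in for whatever analysis the authors actually ran.

```python
import math
import random

def top_principal_axis(rows, iters=200, seed=0):
    """First principal axis via power iteration on the covariance matrix.

    Returns (unit axis, explained-variance ratio, projections of each row).
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - mean[k] for k in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]  # random start vector
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    explained = eigenvalue / sum(cov[k][k] for k in range(d))
    projections = [sum(c[k] * v[k] for k in range(d)) for c in centered]
    return v, explained, projections

def spearman(xs, ys):
    """Spearman rank correlation, assuming no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        out = [0] * len(vals)
        for pos, i in enumerate(order):
            out[i] = pos
        return out
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical embeddings: rank times a fixed direction, plus small noise.
rng = random.Random(1)
direction = [0.6, 0.8, 0.0, 0.0]
true_ranks = list(range(10))
embeddings = [[r * direction[k] + rng.gauss(0, 0.05) for k in range(4)]
              for r in true_ranks]

axis, explained, projections = top_principal_axis(embeddings)
rho = spearman(projections, true_ranks)
print(f"explained variance on first axis: {explained:.3f}")
print(f"|Spearman| between projection and hidden rank: {abs(rho):.3f}")
```

The absolute value is taken because the sign of a principal axis is arbitrary. On this toy data nearly all the variance falls on one axis and the order along it matches the hidden ranks, which is the pattern the paper describes for models that generalize.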

That matters because behavior alone would miss the mechanism. Once accuracy is high, a standard dashboard has little left to say. “Correct” and “also correct” are not exactly an interpretability program. The authors therefore examine confidence and geometric separation. They find that both increase with rank distance, echoing the symbolic distance effect from cognitive science: comparisons between far-apart items are easier than comparisons between nearby ones.

This is a small but important methodological move. The system’s output is not treated as the whole story. The paper looks for a representational object that explains why the behavior works.
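The distance effect mentioned above is easy to see once items sit on a line. The sketch below assumes a hypothetical one-dimensional coordinate per item and a logistic readout of the gap between two items; neither the coordinates nor the readout comes from the paper, and `confidence_by_distance` is a name invented here.

```python
import math
from collections import defaultdict

def confidence_by_distance(coords, scale=1.0):
    """Mean confidence in the correct answer 'a < b', grouped by rank distance.

    coords[i] is the position of the item with true rank i on the learned line.
    Confidence is modeled (as an assumption) by a logistic function of the gap.
    """
    buckets = defaultdict(list)
    n = len(coords)
    for a in range(n):
        for b in range(a + 1, n):
            margin = coords[b] - coords[a]
            buckets[b - a].append(1 / (1 + math.exp(-scale * margin)))
    return {dist: sum(vals) / len(vals) for dist, vals in sorted(buckets.items())}

# Hypothetical, slightly uneven positions along the recovered axis.
coords = [0.0, 0.9, 2.1, 3.0, 4.2, 4.9]
for dist, conf in confidence_by_distance(coords).items():
    print(f"rank distance {dist}: mean confidence {conf:.3f}")
```

Accuracy would read 100% for every row here, since every gap has the right sign. The graded signal only shows up in the confidence column, which is why the paper looks past accuracy once it saturates.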

In simplified form, the hidden lesson is:

$$ \text{Generalization} \neq \text{good answers alone} $$

A better version is:

$$ \text{Generalization} \approx \text{good answers supported by stable internal structure} $$

The approximation sign is doing real work. The paper does not prove a universal theory of reasoning. It shows a concrete mechanism in a narrow setting.

And narrow is fine. Narrow mechanisms are how serious knowledge usually begins. Grand claims are cheaper; structure is more expensive.

The minimum useful standard: behavior plus geometry

The ordinal-geometry paper is especially helpful because it separates three things that are often blurred in AI evaluation:

Evaluation layer	What it asks	Why it matters
Output behavior	Does the model answer correctly?	Necessary, but often too coarse once performance saturates.
Representational geometry	Is there an internal structure aligned with the task relation?	Helps explain how the model generalizes beyond seen examples.
Causal mechanism	Do interventions on that structure change behavior as expected?	Needed before calling the structure mechanistic rather than merely diagnostic.

The paper reaches the first two layers and is careful about the third. It shows correlational evidence of rank-aligned geometry, both in a small synthetic setup and in probes of a pretrained Qwen2.5-1.5B model. It does not claim to have shown that intervening on the recovered axis causally controls model behavior. The authors explicitly leave causal interventions, broader model sweeps, and more domains for future work.

That caution is not a weakness. It is what separates analysis from the usual interpretability cosplay.

The pretrained LLM part adds an important complication. The authors probe ordinal domains such as digits, months, and size adjectives. Ordinal structure appears, but it is domain-dependent. Size adjectives and digits behave more like orderable domains; months are weaker, plausibly because the calendar is cyclic rather than a simple line. The model does not appear to have one universal “ordinality direction” shared across all domains. Instead, each domain appears to occupy its own representational axis.

That is the business-relevant punchline hidden inside the technical result.

A capability can be real without being universal.

A model may represent order in one domain and represent it differently, weakly, or awkwardly in another. This is exactly the kind of finding that should make procurement teams nervous about broad capability labels. “The model understands order” is not a serious statement. Which order? In which domain? At which layer? With what stability? Under what intervention?

Tedious questions, yes. Also the questions that save money.
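The "which order, in which domain" question can be turned into a small diagnostic: estimate a rank axis per domain, then measure how well the axes line up. The sketch below is a toy under stated assumptions. The embeddings are invented, the domain names are borrowed from the paper only as labels, and `rank_axis` is a hypothetical helper that takes the direction along which embeddings co-vary with rank.

```python
import math

def rank_axis(embeddings):
    """Unit direction along which embeddings co-vary with rank.

    Rows are assumed to be listed in true rank order.
    """
    n, d = len(embeddings), len(embeddings[0])
    mean_rank = (n - 1) / 2
    mean_vec = [sum(e[k] for e in embeddings) / n for k in range(d)]
    axis = [sum((r - mean_rank) * (e[k] - mean_vec[k])
                for r, e in enumerate(embeddings))
            for k in range(d)]
    norm = math.sqrt(sum(x * x for x in axis))
    return [x / norm for x in axis]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical embeddings: each domain is well ordered, but along a different direction.
digits = [[r, 0.1 * r, 0.0] for r in range(5)]
sizes = [[0.1 * r, r, 0.0] for r in range(5)]

digit_axis, size_axis = rank_axis(digits), rank_axis(sizes)
print(f"cosine between domain axes: {cosine(digit_axis, size_axis):.3f}")
```

Both toy domains are perfectly ordered along their own axis, yet the cosine between the axes is low. A single "ordinality" probe trained on one domain would say little about the other, which is the practical meaning of a capability being real without being universal.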

The conceptual escalation: simulation is not the issue; realization is

The second paper, Intrinsic Computational Functionalism and Simulated Consciousness, is operating at a much higher altitude. It is not about ordinal reasoning. It is about artificial or simulated consciousness, and specifically about a common objection: a simulated brain is no more conscious than simulated water is wet.

That analogy has intuitive force. It is also, in the authors’ framing, too crude to settle the matter.

The paper distinguishes between external simulation, implementation, realization, and duplication. This distinction matters because people often slide between them. A static description of a brain is not a conscious brain. A movie of a brain is not conscious. A lookup table that merely reproduces outputs is not automatically conscious. Fine. Nobody needs to throw a conference dinner over that.

But the authors argue that those failures do not prove that no artificial or simulated system could realize consciousness-relevant organization. The real question is whether the system physically realizes the relevant intrinsic causal-computational structure.

The paper develops Intrinsic Causal-Computational Realization, or ICCR. Under ICCR, a candidate system must preserve more than input-output behavior. It must preserve intrinsic state individuation, transition structure, counterfactual intervention profiles, relevant readouts, and an adequate agent-body-world boundary.

This is where the connection to the ordinal-geometry paper becomes useful. In the simple Transformer case, we already saw that outputs are not enough. Internal geometry matters. For consciousness, the authors argue, boundary behavior is even less sufficient. The relevant structure must include internal mechanisms and how they respond under interventions.

In other words:

$$ \text{Boundary equivalence} \not\Rightarrow \text{realization} $$

A system can match outputs while failing to preserve the internal organization that a theory considers relevant. Conversely, an artificial system should not be dismissed merely because it is “a simulation” if it physically realizes the relevant structure at the appropriate grain.

This is not a proof that current AI systems are conscious. The paper is explicit about that. It is a conditional framework. If consciousness is grounded in intrinsic causal-computational organization, and if a candidate artificial system realizes that organization under ICCR, then substrate difference alone is not a sufficient reason to deny the corresponding consciousness-relevant properties.

That is a precise claim. It is also a useful antidote to two lazy positions:

“It behaves like a mind, therefore it has one.”
“It runs on a computer, therefore it cannot.”

Both are shortcuts. Both avoid the hard part: specifying the structure that matters.
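The gap between matching outputs and preserving internal organization can be shown with a deliberately tiny example. The sketch below is not an implementation of ICCR, and nothing in it bears on consciousness. It only illustrates the narrower point that two systems can agree at the boundary while differing in whether they have an internal state that responds to intervention. The class names and the parity task are assumptions chosen for brevity.

```python
from itertools import product

class ParityMachine:
    """Stateful mechanism: one internal bit, updated by each input bit."""

    def run(self, bits, intervene_at=None, forced_state=None):
        state, outputs = 0, []
        for t, bit in enumerate(bits):
            if t == intervene_at:
                state = forced_state  # intervention on the internal variable
            state ^= bit
            outputs.append(state)
        return outputs

class LookupTable:
    """Boundary mimic: maps each whole input history to a stored output list."""

    def __init__(self, reference, max_len):
        self.table = {
            seq: reference.run(seq)
            for n in range(max_len + 1)
            for seq in product((0, 1), repeat=n)
        }

    def run(self, bits):
        # No internal state takes part in producing this answer.
        return list(self.table[tuple(bits)])

machine = ParityMachine()
mimic = LookupTable(machine, max_len=4)

same_at_boundary = all(
    machine.run(seq) == mimic.run(seq)
    for n in range(5)
    for seq in product((0, 1), repeat=n)
)
print("identical outputs on all tested inputs:", same_at_boundary)
print("normal run:      ", machine.run([1, 0, 1]))
print("after intervention:", machine.run([1, 0, 1], intervene_at=1, forced_state=0))
```

On every tested input the two are indistinguishable. Only the first has a variable you can set mid-run and watch propagate; the second offers no state to intervene on at all. Boundary tests alone cannot tell them apart, which is the shape of the problem the paper formalizes at far greater depth.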

The shared lesson: structure before status

The two papers are not making the same argument. One is empirical and narrow. The other is formal and philosophical. That is exactly why they work together.

The first paper shows that even a modest cognitive behavior, transitive inference, can involve internal organization that output accuracy alone does not reveal. The second paper says that for a much stronger claim, consciousness-relevant realization, internal organization is not optional. It is the object of evaluation.

The combined lesson is not “Transformers are conscious.” Please do not write that on a slide. The lesson is that serious cognitive claims require structure-aware evidence.

Here is the chain:

Chain step	Technical meaning	Business interpretation
Local evidence can produce global behavior	A Transformer trained on adjacent comparisons can generalize to distant comparisons.	A system may infer beyond the exact examples in its training or prompt context.
Generalization can correspond to internal organization	Successful models form a rank-aligned low-dimensional geometry.	Ask what representation supports the capability, not just whether the answer was right.
Structure may be domain-specific	Qwen2.5-1.5B shows ordinal geometry, but not one shared universal ordinal axis.	Do not generalize capability claims across domains without evidence.
Boundary behavior is insufficient for stronger cognitive claims	ICCR requires internal mechanisms, interventions, readouts, and adequate boundaries.	For high-risk AI claims, demos and benchmarks are not enough.
The evaluation target shifts	From “what did it output?” to “what organization did it realize?”	Governance should demand evidence of mechanism, stability, and causal relevance.

The business reader does not need to become a philosopher of consciousness. But they do need to understand why surface equivalence is a trap.

A chatbot that gives a good answer may not have the process you think it has. An agent that completes a workflow may not have robust task understanding. A model that scores well on a benchmark may not carry a portable internal representation of the relevant concept. A simulation that resembles a system may or may not realize the structure that matters.

The useful move is to stop asking status questions first.

Is it intelligent? Is it reasoning? Is it agentic? Is it conscious?

These are not useless questions, but they are usually premature. Better questions come first:

What internal variables support the behavior?
Are those variables stable across contexts?
Are they aligned with the task relation or only correlated with the dataset?
Do interventions on them change the output in predicted ways?
Does the claimed capability survive domain shift?
Does the system boundary include the variables that actually matter?
Is the relevant structure intrinsic to the system, or imposed by our interpretation?

Less glamorous. More likely to be true.

What the papers show versus what operators should infer

The distinction matters because business commentary has a habit of converting cautious research into executive folklore.

Here is the clean separation.

Topic	What the papers show	What operators may infer	What operators should not infer
Ordinal reasoning	A small Transformer trained on local comparisons can develop rank-aligned embedding geometry, and a pretrained LLM shows domain-dependent ordinal geometry.	Some generalization is supported by discoverable internal structure.	The model has human-like reasoning in general.
Accuracy	Accuracy can saturate while confidence and geometry reveal graded structure.	Benchmarks should be supplemented with internal diagnostics.	A high score is enough to certify a capability.
Domain transfer	Ordinal geometry appears differently across digits, months, and size adjectives.	Capability evidence is domain-specific unless shown otherwise.	A capability found in one domain automatically transfers.
Simulation and consciousness	ICCR defines a conditional realization standard based on intrinsic causal-computational organization.	Strong claims require specifying mechanisms, interventions, readouts, and boundaries.	Current digital AI is conscious, or cannot be conscious, on the basis of substrate labels alone.
Governance	Neither paper is a procurement checklist.	Both papers support structure-aware evaluation as a practical discipline.	A vendor’s cognitive label should be treated as evidence. Cute, but no.

This separation is the difference between useful synthesis and intellectual confetti.

The ordinal paper gives evidence of an internal representation behind a specific learned behavior. The consciousness paper gives a formal standard for what would have to be preserved when the claim is no longer “the model orders items” but “the system realizes a consciousness-relevant organization.”

The bridge is structure. Not vibes. Not labels. Not output mimicry. Structure.

A practical framework: the Structure-First AI evaluation stack

For operators, the immediate value is not metaphysics. It is evaluation discipline.

A structure-first evaluation stack would ask progressively stronger questions:

Level	Evaluation question	Example evidence
1. Behavioral competence	Does the system perform the task on held-out cases?	Accuracy, latency, error rates, task completion, calibrated confidence.
2. Representational alignment	Is there an internal representation aligned with the claimed concept or relation?	Probes, embedding geometry, activation patterns, feature attribution, latent-state diagnostics.
3. Domain stability	Does the representation survive across domains, formats, languages, or operating regimes?	Cross-domain tests, layer-wise comparisons, drift monitoring, transfer diagnostics.
4. Causal relevance	Do interventions on the representation change behavior as expected?	Activation patching, ablation, counterfactual editing, controlled perturbation.
5. Mechanism adequacy	Does the system preserve the internal mechanism required by the claim?	Transition analysis, intervention-readout profiles, recurrent or memory-state tests, tool-use trace analysis.
6. Boundary adequacy	Does the evaluated system include all relevant environment, body, memory, user, workflow, and feedback variables?	End-to-end workflow tests, agent-environment modeling, permissions and state audits, temporal continuity checks.
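Level 4 is the one most often skipped, so a toy version may help. The sketch below removes one direction from a set of made-up embeddings and checks whether a simple comparison readout still works. It is an illustration under assumptions: the embeddings, the linear readout, and the helper names are all hypothetical, and a real test would apply ablation or activation patching to an actual model rather than to hand-built vectors.

```python
def project_out(vec, axis):
    """Remove the component of vec along a unit-length axis."""
    coeff = sum(v * a for v, a in zip(vec, axis))
    return [v - coeff * a for v, a in zip(vec, axis)]

def accuracy(embeddings, readout):
    """Fraction of pairs a < b for which the toy readout gives a positive margin."""
    correct = total = 0
    n = len(embeddings)
    for a in range(n):
        for b in range(a + 1, n):
            margin = sum((eb - ea) * w
                         for ea, eb, w in zip(embeddings[a], embeddings[b], readout))
            correct += margin > 0
            total += 1
    return correct / total

# Hypothetical embeddings: dimension 0 carries rank, dimension 1 is a nuisance feature.
embeddings = [[float(r), 0.3 * (-1) ** r] for r in range(6)]
readout = [1.0, 0.1]      # assumed linear readout used by the toy comparator
rank_axis = [1.0, 0.0]    # candidate structure under test
control_axis = [0.0, 1.0]  # control direction of the same dimensionality

baseline = accuracy(embeddings, readout)
ablated_rank = accuracy([project_out(e, rank_axis) for e in embeddings], readout)
ablated_control = accuracy([project_out(e, control_axis) for e in embeddings], readout)

print(f"baseline: {baseline:.2f}")
print(f"rank axis removed: {ablated_rank:.2f}")
print(f"control axis removed: {ablated_control:.2f}")
```

The control matters as much as the ablation. Behavior that breaks when the candidate axis is removed, but survives removal of an unrelated direction, is the kind of evidence that moves a structure from "diagnostic" toward "load-bearing."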

Most commercial AI evaluation stops at Level 1, perhaps with a decorative Level 2 if someone says “embedding space” during a meeting.

That is insufficient for serious claims.

For ordinary automation, Level 1 and part of Level 3 may be enough. If the system is extracting invoice totals, one does not need a theory of consciousness. Mercifully.

For higher-risk claims, the bar rises. If a vendor says the system “reasons,” you should want evidence that its internal process tracks the relevant variables rather than shortcuts. If a product is sold as “agentic,” you should want evidence of stable state, planning structure, feedback handling, and recovery under perturbation. If someone starts gesturing toward “human-like” capability, you should ask which human-like organization they have identified, where it is represented, and how they know it is causal.

This will not make you popular with the demo team. That is acceptable.

Why this matters for AI governance

Many AI governance programs are still built around outputs: toxicity, accuracy, factuality, bias, privacy leakage, latency, cost. These are necessary. They are also incomplete.

Output controls tell you what happened at the boundary. They do not always tell you what capability the system has, what shortcut it used, or whether the same performance will hold under changed conditions.

The two papers suggest a more mature governance posture:

Treat cognitive claims as structure claims.

This changes the burden of proof.

A model provider should not merely say, “The system can reason across documents.” It should be able to show what evidence supports that: retrieval dependency, cross-document variable binding, conflict resolution behavior, stability under paraphrase, and ideally internal diagnostics showing that the relevant relational structure is represented.

An agent vendor should not merely say, “The system plans autonomously.” It should show state persistence, subgoal decomposition, counterfactual recovery, tool-use discipline, and failure-mode containment.

A foundation-model lab should not merely say, “The model has emergent understanding.” It should identify the domains where internal structure appears, where it does not, and what interventions confirm causal relevance.

This is not anti-AI. It is anti-sloppiness. A subtle distinction, though apparently not subtle enough for everyone.

The misconception to avoid: outputs either prove everything or prove nothing

The common debate around AI cognition tends to bounce between two theatrical extremes.

One side sees fluent behavior and upgrades the system to something psychologically rich. The model explains a joke, writes code, solves a puzzle, and suddenly we are discussing whether it has beliefs. This is premature. Good output does not specify the mechanism.

The other side sees that the system is artificial, digital, statistical, simulated, or trained, and dismisses deeper capability by category. This is also premature. Substrate labels do not, by themselves, specify the absence of relevant organization.

The papers jointly block both shortcuts.

The ordinal-geometry paper says: look inside. The model’s success becomes more meaningful when tied to a rank-aligned internal structure, but even then the result is bounded, domain-sensitive, and not yet fully causal.

The ICCR paper says: look inside more rigorously. For consciousness-relevant claims, the relevant issue is not whether the system is called a simulation. It is whether it realizes the intrinsic causal-computational organization that the theory says matters.

This is a more demanding position than either hype or dismissal. Naturally, it will annoy both camps. That is usually a good sign.

The managerial version: ask for the load-bearing structure

A practical buyer or executive does not need to ask vendors about ICCR over lunch. That may be illegal in some jurisdictions of common sense.

But they should ask for the load-bearing structure behind any cognitive claim. For example:

Vendor claim	Better question
“The model understands hierarchy.”	What representation of hierarchy did you identify, and does it transfer across domains?
“The agent can plan.”	What internal state tracks subgoals, constraints, and recovery paths?
“The system has memory.”	Which stored variables affect future behavior, and how are they updated or forgotten?
“The model reasons symbolically.”	What evidence rules out shallow pattern matching on the benchmark?
“The simulation captures the process.”	Which causal mechanisms and intervention profiles are preserved?
“The AI behaves like an expert.”	Which expert-relevant distinctions are represented, and under what distribution shift do they fail?

The question is not whether the system has an impressive label. The question is what must be true internally for that label to be deserved.

This also helps with build-versus-buy decisions. If the claimed capability is not backed by visible structure, treat it as fragile. It may still be useful, but price it as behavior under tested conditions, not as general intelligence in a box.

The limitation: structure extraction is still hard

The obvious objection is that structure-aware evaluation is difficult.

Correct. That is why it is valuable.

The ordinal paper uses a clean task where the hidden relation is known and the internal geometry can be probed. Real enterprise systems are messier. The relevant concepts may not have a single axis. They may be distributed, sparse, compositional, or context-dependent. The same model may use different mechanisms for superficially similar tasks.

The ICCR paper goes further and admits that it does not yet provide a complete empirical method for extracting intrinsic structure from arbitrary physical systems. It specifies a realization constraint: relevant structure must be intrinsic, mechanism-enriched, intervention-sensitive, and boundary-adequate. But the actual extraction problem remains a research program.

That limitation matters. It means operators should not pretend they can fully certify deep cognitive properties today. It also means they should not retreat to output-only evaluation as if the hard thing being hard makes the easy thing sufficient.

A reasonable posture is graduated evidence.

For low-risk automation, behavior may be enough. For medium-risk decisions, require robustness and domain testing. For high-risk autonomy, require internal diagnostics and causal tests. For claims about human-like cognition or consciousness, require a clearly specified theory of the relevant organization. If that sounds demanding, good. The claim is demanding.
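A graduated posture can be written down as a simple lookup, which at least forces the mapping to be explicit. The sketch below ties the risk tiers in this section to the six levels of the evaluation stack. The specific mapping is an assumption made here, not something either paper prescribes, and the tier names and `missing_evidence` helper are hypothetical. For claims about human-like cognition or consciousness, the levels are in addition to a specified theory of the relevant organization, not a substitute for it.

```python
EVIDENCE_LEVELS = {
    1: "behavioral competence",
    2: "representational alignment",
    3: "domain stability",
    4: "causal relevance",
    5: "mechanism adequacy",
    6: "boundary adequacy",
}

# Assumed mapping from risk tier to required evidence levels.
REQUIRED_BY_TIER = {
    "low_risk_automation": {1},
    "medium_risk_decision": {1, 3},
    "high_risk_autonomy": {1, 2, 3, 4},
    "human_like_or_consciousness_claim": {1, 2, 3, 4, 5, 6},
}

def missing_evidence(tier, provided_levels):
    """Names of the evidence levels a claim at this tier still lacks."""
    required = REQUIRED_BY_TIER[tier]
    return [EVIDENCE_LEVELS[lvl] for lvl in sorted(required - set(provided_levels))]

# A vendor offers only benchmark results (level 1) for an autonomous agent.
print(missing_evidence("high_risk_autonomy", {1}))
```

The value is less in the code than in the argument it provokes: someone has to decide which tier a deployment belongs to and defend the list of required evidence.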

The combined conclusion: cognition is not a sticker

The two papers form a useful progression.

First, a narrow empirical result: a Transformer can learn transitive inference from local comparisons by organizing entities along a rank-like geometry. This turns a behavioral capability into a representational finding.

Second, a broad formal claim: if artificial or simulated systems are to be evaluated for consciousness-relevant realization, the relevant standard cannot be boundary behavior or substrate rhetoric. It must be intrinsic causal-computational organization, including mechanisms, interventions, readouts, and boundaries.

Together, they point toward the same operational principle:

The more cognitive the claim, the more structural the evidence must be.

This is the standard AI evaluation now needs. Not because every product is secretly a mind. Not because every benchmark is useless. Not because philosophy should run procurement, though it would at least make the meetings stranger.

It is needed because AI systems are increasingly being deployed in roles where surface behavior is mistaken for internal competence. The cost of that mistake rises with autonomy, scope, and institutional reliance.

So the next time a system is described as reasoning, agentic, human-like, reflective, or conscious-adjacent, the right response is not awe. It is not dismissal either.

It is a simple request:

Show me the structure.

Cognaptus: Automate the Present, Incubate the Future.

¹ Nishit Singh, “Emergent Ordinal Geometry in Transformers Trained on Local Comparisons,” arXiv:2606.01269v2, 2 June 2026, https://arxiv.org/abs/2606.01269.
² Ryota Kanai and Shuqin Ma, “Intrinsic Computational Functionalism and Simulated Consciousness,” arXiv:2606.15348v1, 13 June 2026, https://arxiv.org/abs/2606.15348.
