Mind the Readout: Why AI Gets Smarter When We Stop Worshipping the Output

The current AI industry has a strangely theatrical relationship with intelligence. We judge models by the visible performance: the answer they print, the image they reconstruct, the attention map they expose, the number of reasoning steps they perform, the architectural flourish in the diagram. If the output looks sophisticated, we call the system capable. If the output looks wrong, we assume the capability is missing. This is convenient, measurable, and often completely misleading. Naturally, it is popular.

Three recent papers, taken together, make a quieter but more useful point: the business value of AI depends less on surface complexity than on whether the right internal signal is formed, preserved, routed, and read out efficiently. One paper shows that an apparent medical-triage failure may originate at the output interface rather than in missing clinical representation.¹ Another shows that embodied agents can benefit from blurry but task-relevant visual foresight rather than expensive photorealism.² A third shows that an apparently richer quaternion attention mechanism can be simplified because much of its component-wise freedom is redundant under the model’s existing structural coupling.³

The common lesson is not “models secretly know everything,” “world models solve robotics,” or “attention is wasteful.” That would be the standard conference-lobby version, and therefore too broad to be useful. The sharper lesson is this:

AI systems fail or scale poorly when organisations confuse visible output complexity with useful internal representation.

For business leaders, that distinction matters now because AI is moving from impressive prototypes into expensive operational systems. Once AI is embedded in clinical decision support, warehouse robotics, customer operations, voice enhancement, industrial inspection, or financial workflow automation, a vague sense that “the model did badly” is not enough. You need to know where the failure lives. Is the model missing the concept? Is the interface distorting the concept? Is the architecture carrying useful signal, or merely carrying cost in a nice jacket?

That is the chain these papers create.

| Step | Question | Paper role | Business interpretation |
| --- | --- | --- | --- |
| 1 | Did the system fail because it lacks the relevant representation, or because the readout distorted it? | Diagnostic anchor | Evaluation formats are not neutral instruments. |
| 2 | How should a system build better task-relevant internal representations? | Constructive pathway | Useful foresight beats decorative fidelity. |
| 3 | When can architectural machinery be compressed without losing capability? | Efficiency test | Extra degrees of freedom are valuable only when they add new signal. |

This is representation discipline. Less glamorous than “agentic general intelligence,” yes. Also more likely to survive a budget meeting.

The first problem: the interface can make competence look absent

The clinical triage paper investigates a familiar and uncomfortable issue: large language models can behave differently when asked to answer the same clinical case in different formats. In particular, constrained multiple-choice triage settings can produce alarming-looking failures compared with freer natural-language outputs.

The important move in the paper is that it does not stop at behaviour. Behaviour alone tells you that outputs changed. It does not tell you whether the model’s internal clinical representation changed.

The authors test two hypotheses. The first is an encoding-shift hypothesis: perhaps the multiple-choice scaffold changes the way the model represents the clinical case itself. The second is an output-mapping hypothesis: perhaps the clinical representation is still present, but the final answer-selection stage is hijacked by the output format.

Their evidence supports the second reading. Using sparse autoencoder features across Gemma 3 4B IT, Gemma 3 12B IT, and Qwen3-8B, they find that medical features still fire on the shared clinical narrative across formats. The clinical content is not simply erased by the multiple-choice prompt. But at the decision token—the point where the model must emit the forced letter—the representation changes. Medical features go silent, while scaffold and format features dominate the final letter logits.

This is a small technical distinction with a large operational consequence. If you only observe the wrong answer, you might conclude that the model lacks clinical knowledge. If you inspect the internal flow, the problem looks more like a bad handoff from preserved clinical representation to constrained answer interface.
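To make that distinction concrete, here is a minimal sketch of the kind of check the paragraph describes. It assumes you already have per-token sparse autoencoder feature activations and a labelling of which features count as medical and which as scaffold. The feature names, the activation values, and the `feature_share` and `mean_share` helpers are all hypothetical and invented for illustration; none of this is the paper's data or code.

```python
from typing import Dict, List, Set

# Hypothetical feature groups; a real audit would take these from an SAE feature dictionary.
MEDICAL: Set[str] = {"chest_pain", "dyspnea"}
SCAFFOLD: Set[str] = {"option_letter", "mcq_format"}


def feature_share(activations: Dict[str, float], group: Set[str]) -> float:
    """Fraction of total activation mass carried by the features in `group`."""
    total = sum(activations.values())
    if total == 0:
        return 0.0
    return sum(v for name, v in activations.items() if name in group) / total


def mean_share(tokens: List[Dict[str, float]], group: Set[str]) -> float:
    """Average share of `group` across a non-empty list of token positions."""
    return sum(feature_share(t, group) for t in tokens) / len(tokens)


# Toy activations (invented numbers): two narrative tokens, then the forced-letter token.
narrative_tokens = [
    {"chest_pain": 3.0, "dyspnea": 1.0, "option_letter": 0.0, "mcq_format": 1.0},
    {"chest_pain": 2.0, "dyspnea": 2.0, "option_letter": 0.5, "mcq_format": 0.5},
]
decision_token = {"chest_pain": 0.1, "dyspnea": 0.0, "option_letter": 3.0, "mcq_format": 0.9}

narrative_medical = mean_share(narrative_tokens, MEDICAL)
decision_medical = feature_share(decision_token, MEDICAL)
decision_scaffold = feature_share(decision_token, SCAFFOLD)

# Medical features strong on the narrative but outweighed by scaffold features at the
# decision token is the pattern that points at the readout rather than the encoding.
readout_suspect = narrative_medical > 0.5 and decision_scaffold > decision_medical

print(f"medical share: narrative={narrative_medical:.2f}, decision={decision_medical:.3f}")
print("suspect output-stage mapping:", readout_suspect)
```

The 0.5 threshold is arbitrary. In practice the comparison would be run across many cases and formats, and a share statistic like this is a screening signal, not a validated clinical monitor.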

The paper is careful about scope. It does not claim that free-text triage is safe for deployment. It does not claim that sparse autoencoder features are validated clinical monitors. It does not magically convert consumer LLMs into doctors, which will disappoint exactly the wrong people. It says something narrower and more practical: a forced-letter benchmark may be measuring output-stage mapping failure rather than pure clinical competence.

For businesses, that is not a footnote. It is a warning label.

Many AI procurement and validation processes treat evaluation format as an inert measurement device. It is not. A benchmark is an interface. The format can change what gets activated at the moment of decision. A dropdown, a fixed-choice form, a JSON schema, a yes/no compliance field, a routing category, a triage label, a “select one of A-D” instruction—each can become part of the model’s working context.

That means a failed evaluation can mean several different things:

| Observed failure | Possible actual failure | What to audit |
| --- | --- | --- |
| Wrong final answer | Missing domain representation | Content-level activations, retrieval, training coverage |
| Wrong final answer | Bad readout interface | Output schema, label mapping, answer constraint placement |
| Overconfident label | Flattened uncertainty | Whether the interface permits deferral or conditional answers |
| Inconsistent performance across formats | Prompt scaffold interference | Format-sensitive behaviour and decision-token diagnostics |
| Poor benchmark score | Misaligned benchmark design | Whether the benchmark measures the operational task or a proxy artefact |

The business translation is blunt: before declaring a model incompetent, unsafe, or ready, locate the failure. Otherwise, you are not evaluating the AI system. You are evaluating the interaction between the model, the prompt, the label space, and your own fondness for tidy spreadsheets.
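A first-pass way to locate the failure is to score the same cases under several output formats and separate the cases that fail everywhere from the cases that fail only under some formats. The sketch below assumes the labels have already been extracted from each format's output. The `results` rows, the format names, and the bucket names are hypothetical illustrations, not data from any of the papers.

```python
from collections import Counter
from typing import Dict, List

FORMATS = ["free_text", "multiple_choice", "json"]

# Hypothetical audit rows: the gold label and the label produced under each format.
results: List[Dict[str, str]] = [
    {"gold": "emergency", "free_text": "emergency", "multiple_choice": "routine", "json": "emergency"},
    {"gold": "routine", "free_text": "routine", "multiple_choice": "routine", "json": "routine"},
    {"gold": "urgent", "free_text": "urgent", "multiple_choice": "routine", "json": "urgent"},
    {"gold": "emergency", "free_text": "urgent", "multiple_choice": "urgent", "json": "urgent"},
]


def accuracy_by_format(rows: List[Dict[str, str]]) -> Dict[str, float]:
    """Accuracy of each output format against the gold label."""
    return {f: sum(r[f] == r["gold"] for r in rows) / len(rows) for f in FORMATS}


def classify_failures(rows: List[Dict[str, str]]) -> Counter:
    """Bucket each case by whether its correctness depends on the output format."""
    buckets: Counter = Counter()
    for r in rows:
        correct = [r[f] == r["gold"] for f in FORMATS]
        if all(correct):
            buckets["correct_in_every_format"] += 1
        elif any(correct):
            buckets["format_sensitive"] += 1       # candidate readout or scaffold problem
        else:
            buckets["wrong_in_every_format"] += 1  # candidate missing representation
    return buckets


accuracy = accuracy_by_format(results)
buckets = classify_failures(results)
print(accuracy)
print(dict(buckets))
```

Behavioural buckets like these only say where to look next. Confirming that a format-sensitive case is a readout problem still needs the internal inspection described above.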

The second problem: better representations are not necessarily prettier

The ResDreamer paper moves from diagnosis to construction. If the clinical triage paper asks, “Where did the useful signal go?”, ResDreamer asks, “What kind of signal should a system build in the first place?”

The paper works in model-based reinforcement learning for dynamic 3D environments, using Minecraft-style combat tasks from MineDojo. These settings are hard because the agent must act under partial information, handle moving adversaries, respond to projectiles, and make high-frequency decisions. This is not a polite static benchmark where the answer waits patiently while the model composes a thoughtful paragraph.

The tempting modern solution is to add more visible reasoning: language-based chain-of-thought, video generation, photorealistic imagination, domain-specific annotations, or large external priors. Some of those can help in some settings. But for fast embodied control, they can also be too slow, too expensive, or too detached from the control loop.

ResDreamer’s design is interesting because it deprioritises photorealistic visual prediction. The authors explicitly argue that visual reasoning representations do not need to be beautiful. They need to be informative. The model uses a hierarchical recurrent state-space architecture in which higher layers learn to reconstruct the residuals of lower layers. Predictable content does not need to be endlessly retransmitted upward. The higher layer receives what the lower layer failed to explain: surprise, error, novelty, the part of the scene that still deserves attention.

Then the flow reverses. Higher-level residual predictions modulate lower-level visual foresight, giving the lower layer blurry but useful hints about what may happen next. The paper describes this as bidirectional communication between world-model layers: reconstruction errors go upward; predictive visual hints go downward.
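The two-way flow can be illustrated with a deliberately tiny scalar toy. This is not ResDreamer's recurrent state-space model: the persistence predictor standing in for the lower layer, the moving-average estimator standing in for the higher layer, and the function name `mean_abs_error` are all assumptions made for illustration. The only point is to show residuals going up and a coarse predictive hint coming back down.

```python
from typing import List


def mean_abs_error(signal: List[float], use_hint: bool, alpha: float = 0.5) -> float:
    """Mean one-step prediction error of a two-level toy predictor.

    Lower level: predicts that the next observation equals the previous one.
    Higher level: keeps a running estimate of what the lower level fails to
    explain (the residual) and, if `use_hint` is set, feeds it back down.
    """
    prev_obs = signal[0]
    residual_estimate = 0.0  # higher-level state
    errors = []
    for obs in signal[1:]:
        lower_pred = prev_obs
        hint = residual_estimate if use_hint else 0.0   # top-down predictive hint
        errors.append(abs(obs - (lower_pred + hint)))
        residual = obs - lower_pred                     # bottom-up: unexplained part
        residual_estimate += alpha * (residual - residual_estimate)
        prev_obs = obs
    return sum(errors) / len(errors)


# A steady ramp: the lower level alone is always one step behind.
ramp = [float(t) for t in range(10)]
without_hint = mean_abs_error(ramp, use_hint=False)
with_hint = mean_abs_error(ramp, use_hint=True)

print(f"lower level only:      {without_hint:.3f}")
print(f"with higher-level hint: {with_hint:.3f}")
```

The hint is never exact, since the moving average lags the true residual, yet it still cuts the error substantially on this toy signal. That is the sense in which a blurry hint can be useful.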

That is a more mature view of representation. Intelligence is not the production of a high-resolution future movie. It is the formation of action-relevant anticipation.

The paper’s experiments report stronger sample and parameter efficiency than baselines in the tested online visual RL tasks, with the 100M×2 model solving the high-difficulty Shulker combat task where the baselines did not. The ablations matter too. Removing residual connections hurts performance, and using only subsets of the mechanism underperforms the full design. Feeding all stacked latent states to the heads also performs worse under the reported training regime, which is a useful reminder that “more internal state” is not automatically better. More state can mean more instability. The model, rudely, refuses to respect PowerPoint intuition.

The link to the clinical paper is not the domain. Clinical triage and Minecraft combat are obviously different problems. The link is the representation principle.

The clinical paper says: useful internal content may exist upstream but fail at readout.

ResDreamer says: useful internal content should be built around what matters for future action, not around surface fidelity.

This matters for businesses building AI into physical or simulated environments: robotics, fleet control, manufacturing inspection, logistics, security monitoring, autonomous warehousing, digital twins, and operations simulation. These systems do not need an internal Hollywood rendering of the future. They need compact signals that improve decisions under time pressure.

A warehouse robot does not need to imagine a perfect cinematic shot of a falling box. It needs to anticipate that the box will obstruct a lane. A factory inspection system does not need a photorealistic future frame of a conveyor belt. It needs early warning that a defect pattern is likely to propagate. A call-centre operations model does not need to narrate every possible customer branch. It needs to preserve the handful of signals that change staffing, escalation, or refund risk.

The best representation is not the richest one. It is the one that preserves the right distinctions at the right time.

The third problem: complexity often rephrases what the model already knows

The quaternion attention paper supplies the compression step in the chain. It asks a deceptively useful engineering question: when an architecture appears more expressive, is that expressiveness actually contributing to performance?

Quaternion neural networks represent four related components as one entity, using quaternion algebra to preserve structured coupling. Earlier quaternion self-attention mechanisms computed component-wise scores and applied separate softmax operations to each component. On paper, this looks expressive. Each component can attend differently. Diversity! Richness! Four times the attention maps, four times the vibes.

The paper asks whether that independence is necessary.

Its proposed shared-score quaternion self-attention computes one real-valued score using a quaternion inner product and shares the resulting attention distribution across all four components. This reduces score-computation multiplications by 75% and reduces softmax operations from four to one. The empirical results show comparable quality in the tested speech enhancement setting, with real-time factor reductions of up to 44.3% on GPU and 58.1% on CPU relative to the quaternion baseline. The authors also report consistent trends across CIFAR-100 and SST-2 experiments.

The theoretical explanation is the useful part. Because quaternion linear projections already mix the components, the four component-wise score matrices and the single shared score draw from the same interaction subspace. In simpler language: the architecture already entangles the components before attention. Splitting the attention into four separate component-wise score paths may not expand the meaningful interaction space; it may simply apply more nonlinear machinery to interactions already available.
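A small sketch makes the cost difference visible. It assumes a simplified form of both mechanisms. Each token is a single quaternion rather than a vector of quaternions, scaling is omitted, and the component-wise variant scores with the four components of the Hamilton product of the query and the conjugated key. Those are assumptions for illustration and may differ from the exact formulations in the paper and in earlier work. Under these assumptions one query-key pair costs 16 real multiplications for the Hamilton product against 4 for the inner product, which is consistent with the 75% figure quoted above.

```python
import math
from typing import List, Sequence, Tuple

Quat = Tuple[float, float, float, float]

HAMILTON_MULTS = 16  # real multiplications in one Hamilton product
INNER_MULTS = 4      # real multiplications in one quaternion inner product


def hamilton(p: Quat, q: Quat) -> Quat:
    """Hamilton product of two quaternions given as (real, i, j, k)."""
    a, b, c, d = p
    e, f, g, h = q
    return (
        a * e - b * f - c * g - d * h,
        a * f + b * e + c * h - d * g,
        a * g - b * h + c * e + d * f,
        a * h + b * g - c * f + d * e,
    )


def conj(q: Quat) -> Quat:
    return (q[0], -q[1], -q[2], -q[3])


def inner(p: Quat, q: Quat) -> float:
    """Real-valued inner product; equals the real part of hamilton(p, conj(q))."""
    return sum(x * y for x, y in zip(p, q))


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def component_wise_attention(query: Quat, keys: List[Quat], values: List[Quat]) -> Quat:
    """Assumed baseline: four score components, four separate softmaxes."""
    scores = [hamilton(query, conj(k)) for k in keys]
    out = []
    for c in range(4):
        weights = softmax([s[c] for s in scores])
        out.append(sum(w * v[c] for w, v in zip(weights, values)))
    return tuple(out)


def shared_score_attention(query: Quat, keys: List[Quat], values: List[Quat]) -> Quat:
    """One real score per key, one softmax, weights shared by all four components."""
    weights = softmax([inner(query, k) for k in keys])
    return tuple(sum(w * v[c] for w, v in zip(weights, values)) for c in range(4))


# Toy inputs, invented for illustration.
query = (1.0, 0.0, 0.5, -0.5)
keys = [(1.0, 0.0, 0.0, 0.0), (0.0, 1.0, 0.0, 0.0), (0.5, 0.5, 0.5, 0.5)]
values = [(1.0, 2.0, 3.0, 4.0), (0.0, 1.0, 0.0, 1.0), (2.0, 2.0, 2.0, 2.0)]

shared = shared_score_attention(query, keys, values)
component = component_wise_attention(query, keys, values)
mult_reduction = 1 - INNER_MULTS / HAMILTON_MULTS

print("shared-score output:  ", [round(x, 3) for x in shared])
print("component-wise output:", [round(x, 3) for x in component])
print(f"score multiplications saved per query-key pair: {mult_reduction:.0%}")
```

The shared score is exactly the real component of the component-wise score, so the real output channel is identical in both variants. The other three channels differ only in which attention weights they use. Whether that extra freedom buys anything is the empirical question the paper tests, and a toy like this cannot answer it.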

This is the opposite of the usual “more knobs, more power” instinct. Sometimes more knobs simply let the system rediscover the same setting less efficiently.

The authors describe a separation-of-concerns principle: use Hamilton products in linear layers for structured feature coupling, and use efficient quaternion inner products in attention for consistent alignment. That principle generalises well beyond quaternions. It says: put complexity where it creates new signal, not where it duplicates signal already created elsewhere.

For business AI systems, this is directly relevant to inference cost. Model serving expenses do not care that your architecture was mathematically elegant. Latency does not admire your component-wise softmax. Edge devices are notoriously unmoved by theoretical expressiveness.

The question for deployment is simple:

Does this architectural feature add task-relevant representational capacity, or does it merely create an expensive alternative path to the same interaction?

That question should be asked before scaling, not after the cloud bill becomes a board-level artefact.

The chain: diagnose, construct, compress

Together, the papers form a practical chain for AI architecture and governance.

First, diagnose whether the failure is in the representation or the readout. The clinical triage paper shows that the same clinical content can be preserved in the narrative representation while the final constrained answer token is dominated by scaffold features. This matters because surface errors are ambiguous. You cannot fix what you have not localised.

Second, construct internal representations around task-relevant signal. ResDreamer shows that residual surprise and predictive hints can be more useful than photorealistic fidelity. The representation does not need to flatter human eyes. It needs to help the policy act.

Third, compress away redundant machinery. The quaternion attention paper shows that independent component-wise attention can be replaced by shared scores when pre-mixing already supplies the relevant interaction structure. The architecture should not be paid by the layer.

Here is the combined operating framework:

| Layer of discipline | Core question | Failure mode if ignored | Practical control |
| --- | --- | --- | --- |
| Representation diagnosis | Does the model internally encode the relevant signal? | Mistaking readout failure for missing competence | Activation probes, format comparisons, representation audits |
| Interface design | Does the output format preserve or distort that signal? | Benchmarks that measure scaffold artefacts | Schema tests, label-space design, uncertainty handling |
| Signal construction | Does the model build representations around what changes decisions? | Beautiful but low-value internal simulations | Task-relevant foresight, residual modelling, ablations |
| Compute allocation | Does each architectural component add new useful capacity? | Paying for redundant expressiveness | Complexity audits, latency tests, matched-capacity baselines |
| Deployment governance | Are claims bounded by tested regimes? | Overgeneralising from narrow benchmarks | Scope statements, prospective validation, stress tests |

A compact way to express the business rule is:

$$ \text{Operational AI value} \propto \frac{\text{task-relevant signal preserved at decision time}} {\text{interface distortion} + \text{redundant compute}} $$

This is not a law of nature. It is a management heuristic. But it is more useful than the current industry formula, which often appears to be:

$$ \text{AI strategy} = \text{bigger model} + \text{longer prompt} + \text{hope} $$

A bold method, historically.

What the papers show—and what they do not

The distinction between evidence and interpretation matters here.

The clinical triage paper shows, in its tested models and 60-vignette corpus, that medical-domain features remain active on the clinical narrative across formats, while decision-token behaviour under multiple-choice conditions is dominated by scaffold features. It does not prove that LLMs are clinically safe, that free text is always better, or that SAE features are ready for hospital governance.

The ResDreamer paper shows, in its tested online RL environments, that a hierarchical residual world model with predictive visual hints can improve sample and parameter efficiency, and that its components matter in ablations. It does not prove that the same design will transfer cleanly to every robotics environment, every simulator, or every real-world deployment setting. Fixed foresight horizons remain a limitation.

The quaternion attention paper shows, in the tested speech enhancement, vision, and text classification regimes, that shared-score quaternion attention can preserve quality while reducing computation, and that component-wise attention is structurally redundant under pre-mixing from quaternion linear projections. It does not prove that shared scores are always sufficient for larger models, longer-context tasks, multimodal fusion, 3D rotation estimation, or architectures without the same pre-mixing assumptions.

The business interpretation is therefore not “these exact methods should be installed everywhere.” It is more valuable than that: these papers point to a recurring design pattern.

Before adding complexity, ask where the signal is. Before trusting an evaluation, ask whether the interface changed the readout. Before paying for architectural expressiveness, ask whether it expands the useful interaction space or merely repackages it.

The anti-hype lesson: more is not the same as better

AI teams tend to reach for addition. Add a larger model. Add a reasoning trace. Add a hierarchy. Add a retrieval layer. Add a structured output schema. Add multimodal context. Add a second verifier. Add a dashboard, because apparently no system is complete until it can be ignored in chart form.

Sometimes addition is correct. ResDreamer adds hierarchical residual structure because it creates a better representation channel. But addition is not inherently progress. The clinical triage paper shows that an added multiple-choice scaffold can distort the final decision mapping. The quaternion attention paper shows that added component-wise attention freedom can be computationally wasteful when the interaction structure is already present.

The dividing line is whether the addition carries new task-relevant signal.

That is the managerial lesson. AI architecture should be governed like capital allocation. Every layer, format, reasoning step, retrieval call, attention mechanism, or output constraint should justify itself by showing what signal it preserves, exposes, routes, or compresses. Otherwise it is not intelligence infrastructure. It is ornamental compute.

What businesses should do differently

The practical implications are immediate.

First, evaluate across interfaces. If an AI system performs differently under free text, multiple choice, JSON, tool calls, or label selection, do not treat one format as the truth by default. Format sensitivity is itself a diagnostic result.

Second, test the handoff between representation and decision. Many operational failures occur not because the system lacks upstream information, but because the final interface forces a brittle mapping. This is especially important in regulated contexts where labels are coarse and real-world answers are conditional.

Third, measure whether internal representations improve decisions, not whether they look impressive. In embodied AI, simulation, or visual operations, task-relevant foresight may be blurry, compressed, or residual. That is fine. A useful warning signal beats a gorgeous hallucinated future.

Fourth, run redundancy audits. For every costly architectural component, ask what happens under a simpler matched alternative. The answer will not always be flattering. Good. Flattery is not a performance metric.

Fifth, preserve scope discipline. A method validated on Gemma and Qwen triage prompts, Minecraft combat tasks, or quaternion speech enhancement is evidence, not destiny. The right response is not universal adoption. It is targeted experimentation with clear boundary conditions.
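The fourth recommendation, the redundancy audit, is easy to turn into a repeatable check. The sketch below compares a costly component against a simpler matched alternative and returns a verdict. The `Variant` fields, the thresholds, and the example numbers are all hypothetical. Sensible tolerances depend on the task and on how noisy the quality metric is.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Variant:
    name: str
    quality: float     # higher is better, for example task accuracy
    latency_ms: float  # lower is better


def redundancy_audit(
    baseline: Variant,
    simpler: Variant,
    max_quality_drop: float = 0.01,
    min_speedup: float = 0.10,
) -> Dict[str, Union[str, float]]:
    """Compare a costly component with a simpler matched alternative."""
    quality_drop = baseline.quality - simpler.quality
    speedup = 1.0 - simpler.latency_ms / baseline.latency_ms
    if quality_drop > max_quality_drop:
        verdict = "keep baseline"    # the extra machinery appears to carry signal
    elif speedup >= min_speedup:
        verdict = "prefer simpler"   # similar quality at meaningfully lower cost
    else:
        verdict = "no clear win"
    return {"verdict": verdict, "quality_drop": quality_drop, "speedup": speedup}


# Invented measurements for illustration only.
baseline = Variant("component-wise attention", quality=0.912, latency_ms=40.0)
simpler = Variant("shared-score attention", quality=0.909, latency_ms=24.0)

report = redundancy_audit(baseline, simpler)
print(report["verdict"], f"(speedup {report['speedup']:.0%})")
```

A real audit would repeat the measurement across seeds and hardware targets before acting on the verdict, which is the scope discipline the fifth point asks for.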

Representation discipline is the new operational discipline

The next phase of AI adoption will not be won by organisations that merely buy bigger systems. It will be won by organisations that understand where intelligence lives inside the system and where it gets lost, distorted, or duplicated.

That requires a less theatrical view of AI. The output is not the mind. The benchmark is not the task. The attention map is not automatically useful. The prettiest prediction is not necessarily the best signal. And the most expressive architecture may simply be charging rent on redundancy.

The three papers point toward a more sober operating model: diagnose the readout, build the signal, compress the machinery.

That is not as exciting as claiming that every model is one prompt away from general intelligence. It is, however, much closer to how useful systems actually get built.

Cognaptus: Automate the Present, Incubate the Future.

1. David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, and Shlomo Berkovsky, “Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate,” arXiv:2605.29889, 2026. https://arxiv.org/abs/2605.29889
2. Yuanfei Xu, Lin Liu, Wengang Zhou, Mingxiao Feng, and Houqiang Li, “Self-supervised Hierarchical Visual Reasoning with World Model,” arXiv:2605.17537, 2026. https://arxiv.org/abs/2605.17537
3. Shogo Yamauchi, Tohru Nitta, and Hideaki Tamori, “Quaternion Self-Attention with Shared Scores,” arXiv:2605.24920, 2026. https://arxiv.org/abs/2605.24920
