Perspective Without Rewards: When AI Develops a Point of View

AI agents do not need feelings to become difficult to read.

That is already enough trouble.

A long-running agent can enter a workflow, absorb context, make decisions, and gradually behave as though the situation has a particular “shape.” The system may not merely react to the latest input. It may carry forward a learned orientation: this client is risky, this process is stable, this market regime is noisy, this user wants speed more than precision. In ordinary product language, we call that “context.” In engineering dashboards, we often reduce it to memory, state, embeddings, or hidden activations. In philosophical language, one might be tempted to call it a perspective.

Tempted, yes. Reckless, not yet.

Hongju Pae’s paper, Minimal Computational Preconditions for Subjective Perspective in Artificial Agents, takes a careful route through this awkward territory.¹ The paper does not claim to build a conscious machine. It does something narrower and more useful: it asks what minimal computational structure would be required for an artificial agent to have something like a perspective, understood not as self-report, not as reward preference, and not as a personality wrapper, but as a slow, global, history-shaped internal condition that modulates behavior.

That distinction matters. Business readers should not file this under “sentient AI arrives in a grid-world, please update procurement policy.” The practical reading is less cinematic and more operational: if future agents maintain slow latent states that shape decisions across time, then organizations will need ways to observe, diagnose, and govern those states. The paper’s value is not that it proves AI subjectivity. It proposes a testable architecture for something much closer to agent observability.

The mechanism is the story.

The paper defines perspective as a slow condition, not a spoken opinion

The easiest misunderstanding is to treat “perspective” as an explicit belief. A language model says, “From my perspective…” and the interface briefly cosplays as a person. That is not what the paper means.

Pae begins from phenomenology, where perspective is not a sentence about the world but the manner in which the world is given. The same object can appear threatening, promising, irrelevant, sufficient, or lacking. The paper’s point is that perspective is not a later commentary added after perception. It shapes what becomes salient before reflective judgment begins.

For computation, this turns into four requirements.

| Phenomenological requirement | Computational translation | Why it matters for agents |
| --- | --- | --- |
| Globality | The variable should influence broad processing, not one isolated decision rule. | It should shape behavior across contexts rather than act like a local feature flag. |
| Pre-reflective transparency | The variable should modulate action without becoming an explicit metacognitive report. | The agent need not “know” its perspective in order to act through it. |
| Functional consequence | The variable should affect information weighting and downstream behavior. | Otherwise it is decorative latent dust. |
| Temporal persistence | The variable should change more slowly than immediate policy reactions. | Perspective should carry history, not merely echo the last observation. |

This is the useful translation step. The paper does not try to solve consciousness in one leap. It narrows the target to a structural precondition: a slow internal variable that globally constrains how the agent processes its world.

That sounds modest. In machine consciousness work, modesty is a rare renewable resource.

The architecture separates fast action from slow stance

The agent has two internal layers. One is a fast-changing perceptual latent, $z_t$, which handles momentary features relevant for action. The other is a global latent, $g_t$, intended to operationalize perspective. The policy selects actions based on both:

$$ \pi(a_t \mid z_t, g_t) $$

This is the central asymmetry. The policy answers the immediate question: what should I do now? The global latent carries a slower question: what kind of world am I still in?
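To make the asymmetry concrete, here is a minimal sketch of a policy conditioned on both latents. The linear scoring head, the dimensions, and the weight values are illustrative assumptions, not the paper's network. The sketch only shows that the same momentary input $z_t$ can yield different action preferences under different values of $g_t$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(z, g, weights):
    """Return pi(a | z, g) as one probability per action.

    `weights` has one row per action, scoring the concatenated input [z, g].
    A linear head is an assumption made for illustration.
    """
    x = list(z) + list(g)
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return softmax(logits)

# Hypothetical weights: 3 actions, 2-dimensional z, 2-dimensional g.
weights = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, -2.0, 0.0],
    [0.5, 0.5, 0.0, 1.0],
]

z_t = [0.2, -0.4]      # same momentary features in both cases
g_a = [0.5, -0.2]      # one slow "stance"
g_b = [-0.5, 0.2]      # a different slow "stance"

probs_a = policy(z_t, g_a, weights)
probs_b = policy(z_t, g_b, weights)
```

With these made-up numbers the preferred action flips from the first to the second when only `g` changes, which is the sense in which the global latent conditions behavior without being the action itself.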

The architecture matters because it prevents the proposed “perspective” from collapsing into a normal action policy. If every internal representation is optimized directly for short-term behavior, then the model has no clean way to distinguish a stable interpretive stance from a tactical control variable. Everything becomes performance machinery wearing a philosophical hat.

Pae therefore uses two technical devices to separate fast policy dynamics from slow perspective dynamics.

First, the policy computation uses gradient blocking. The action logits are computed from a policy state in which gradients are stopped, so policy updates do not directly push the global latent into becoming a short-term action-optimization device. Second, $g_t$ is updated through a damped recurrent mechanism and regularized with a smoothness penalty:

$$ L_{\text{smooth}}(t) = \text{MSE}(g_t, \text{stopgrad}(g_{t-1})) $$

In plain English, the latent is encouraged to move slowly. It can adapt, but it is not supposed to twitch every time the agent sees a new observation.
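A rough sketch of what "damped and smooth" can mean in practice follows. The specific update rule (an exponential moving average toward a bounded drive signal), the damping rate `alpha`, and the drive values are assumptions chosen for illustration; the paper's recurrent mechanism may differ. Plain Python has no autodiff, so the stop-gradient appears only as a comment.

```python
import math

def damped_update(g_prev, drive, alpha=0.05):
    """Move g a small step toward a bounded drive signal.

    g_t = (1 - alpha) * g_{t-1} + alpha * tanh(drive)
    A small alpha gives slow dynamics. This rule is a hypothetical stand-in.
    """
    return [(1.0 - alpha) * gp + alpha * math.tanh(d)
            for gp, d in zip(g_prev, drive)]

def smoothness_loss(g_t, g_prev):
    """MSE(g_t, stopgrad(g_{t-1})).

    In an autodiff framework g_prev would be detached so that only g_t
    receives gradient. Here it is simply a number.
    """
    return sum((a - b) ** 2 for a, b in zip(g_t, g_prev)) / len(g_t)

g = [0.0, 0.0, 0.0]
drive = [2.0, -1.0, 0.5]   # held constant to expose the slow convergence
losses = []
for _ in range(100):
    g_next = damped_update(g, drive)
    losses.append(smoothness_loss(g_next, g))
    g = g_next
```

After one step the latent has moved only about 5% of the way toward its target, and it takes on the order of a hundred steps to settle. That is the intended behavior: adaptation without twitching.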

This is also why the paper avoids external rewards. The agent is trained in a reward-free environment using one-step prediction error. The policy is updated using prediction error as an internal cost, not a task reward. That design choice is not moral purity. It is methodological hygiene. If the agent were rewarded for reaching a goal, any slow latent structure could simply be a hidden strategy variable optimized for that reward. By removing extrinsic rewards, the paper tries to make the latent’s behavior about environmental regularity rather than task success.

The result is an agent whose behavior is still functional. It learns to prefer more predictable regions. But its internal organization is not simply “maximize reward because the designer said so.” The paper wants to see whether a slow global latent can emerge as an organizing structure under predictive pressure.

The grid-world is small because the measurement problem is the point

The experiment uses a discrete grid-world divided into three vertical zones. The zones have the same layout and action affordances. They differ only in observation noise. In the default regime, $Z_0$ is high-noise, $Z_1$ is intermediate, and $Z_2$ is low-noise.

The training objective induces the agent to prefer the low-noise zone, because that region is easier to predict. The paper trains for 48,000 steps across 200 episodes, reports medians across five random seeds, and uses a warm-up period before the actor term is enabled. After training, the agent’s occupancy concentrates near-exclusively in the low-noise zone $Z_2$.

That result is not the main claim. It is a baseline.
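Why the low-noise zone wins under prediction-error pressure can be shown with a small simulation. This is a sketch under stated assumptions: the grid is taken as 5 rows by 9 columns with three equal-width vertical bands, noise is modeled as Gaussian jitter added to the position coordinates, and the "predictor" is an ideal one that already knows the true cell. The paper's observation model may be different; only the noise levels come from the text.

```python
import random

SIGMA_A = {0: 0.6, 1: 0.3, 2: 0.05}   # default-regime noise per zone (from the text)
N_ROWS, N_COLS = 5, 9                  # three columns per zone is an assumption

def zone_of(col):
    """Map a column index to zone 0, 1, or 2 (assumed equal-width bands)."""
    return col // (N_COLS // 3)

def observe(row, col, rng, sigma_by_zone=SIGMA_A):
    """Return the true position plus zone-dependent Gaussian noise (assumed model)."""
    s = sigma_by_zone[zone_of(col)]
    return (row + rng.gauss(0.0, s), col + rng.gauss(0.0, s))

def mean_sq_error(col, rng, n=20000):
    """Average squared error of an ideal predictor that knows the true cell."""
    total = 0.0
    for _ in range(n):
        o_row, o_col = observe(2, col, rng)
        total += (o_row - 2) ** 2 + (o_col - col) ** 2
    return total / n

rng = random.Random(0)
errors = {zone_of(c): mean_sq_error(c, rng) for c in (1, 4, 7)}
```

Even a perfect predictor cannot get below the noise floor, which here is roughly $2\sigma^2$ per observation. An agent that treats prediction error as cost therefore has a standing reason to drift toward $Z_2$, with no external reward involved.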

The real test comes later, when the environment switches regimes. In Regime A, the noise values are:

$$ \sigma_{Z_0}=0.6,\quad \sigma_{Z_1}=0.3,\quad \sigma_{Z_2}=0.05 $$

In Regime B, the configuration is inverted:

$$ \sigma_{Z_0}=0.05,\quad \sigma_{Z_1}=0.3,\quad \sigma_{Z_2}=0.6 $$

The agent begins with a 150-step warm-up in Regime A. Then the environment alternates between Regime A and Regime B over a 550-step testing phase. The main reported results use a switching period of $P=40$ timesteps, with additional period settings specified for timescale sensitivity.
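The switching protocol is simple enough to write down. In the sketch below, the noise values, warm-up length, test length, and period come from the text. Two things are assumptions: that the 550 test steps are counted after the warm-up rather than including it, and that the test phase opens in Regime B. The second is exposed as a parameter because the text does not say.

```python
NOISE = {
    "A": {0: 0.6, 1: 0.3, 2: 0.05},
    "B": {0: 0.05, 1: 0.3, 2: 0.6},
}
WARMUP, TEST_STEPS, PERIOD = 150, 550, 40

def regime_at(t, first_test_regime="B"):
    """Regime label at global timestep t (0-indexed).

    The warm-up is always Regime A. Which regime opens the test phase is an
    assumption, so it is passed in explicitly.
    """
    if t < WARMUP:
        return "A"
    block = (t - WARMUP) // PERIOD
    other = "A" if first_test_regime == "B" else "B"
    return first_test_regime if block % 2 == 0 else other

def switch_times(first_test_regime="B"):
    """Timesteps at which the regime differs from the previous step."""
    total = WARMUP + TEST_STEPS
    return [t for t in range(1, total)
            if regime_at(t, first_test_regime) != regime_at(t - 1, first_test_regime)]

switches = switch_times()
sigma_now = NOISE[regime_at(160)]   # per-zone noise in force at step 160
```

Under these assumptions the first switch lands at step 150 and later ones follow every 40 steps. Those indices are what the later switch-aligned analysis needs.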

This is a controlled setting, almost aggressively controlled. There are no customers, no tools, no language, no business workflow, no reward targets, no multi-agent negotiation, and no hidden drama. Good. The paper is trying to isolate a mechanism. If the experiment had started with a full enterprise agent and three dashboards, the result would have been more impressive and less interpretable, which is a common way to manufacture confusion at scale.

Hysteresis is the diagnostic, not the decoration

The paper’s key measurement is switch-aligned hysteresis.

Hysteresis means that the path of adaptation depends on history. If a variable moves differently after an A→B transition than after a B→A transition, it is not merely reading the current environment. It is carrying traces of the previous regime.

To measure this, Pae compresses the high-dimensional global latent $g_t$ into a signed projection called the $g$-score:

$$ g\text{-score}(t)=\langle g_t,\hat{u}\rangle $$

Here, $\hat{u}$ is the normalized direction between mean global latents under Regime A and Regime B. This turns the slow latent into a scalar trajectory that can be compared around regime switches.
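A minimal sketch of the projection, assuming $\hat{u}$ points from the Regime A mean toward the Regime B mean, so that the score rises as the latent moves toward B. That sign convention is an inference from the reported direction of change, not something the text states, and the sample vectors are invented.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def unit_direction(g_under_a, g_under_b):
    """u_hat = (mu_B - mu_A) / ||mu_B - mu_A||. The sign convention is an assumption."""
    mu_a, mu_b = mean_vector(g_under_a), mean_vector(g_under_b)
    diff = [b - a for a, b in zip(mu_a, mu_b)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def g_score(g_t, u_hat):
    """Signed projection <g_t, u_hat>."""
    return sum(x * u for x, u in zip(g_t, u_hat))

# Invented two-dimensional latents collected under each regime.
g_a = [[0.0, 1.0], [0.2, 0.8]]
g_b = [[1.0, 0.0], [0.8, 0.2]]
u_hat = unit_direction(g_a, g_b)

score_a = g_score(mean_vector(g_a), u_hat)
score_b = g_score(mean_vector(g_b), u_hat)
```

The scalar discards most of the latent's structure on purpose. It keeps only the component that distinguishes the two regimes, which is what makes trajectories around switches comparable.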

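For the fast policy side, the paper uses normalized policy entropy, which builds on the Shannon entropy of the action distribution:

$$ H_\pi(t)=-\sum_{a\in A}\pi(a\mid s_t)\log \pi(a\mid s_t) $$

Here $s_t$ is the state the policy conditions on at time $t$. Normalization conventionally divides this sum by its maximum, $\log |A|$, so that the value lies in $[0, 1]$.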
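For reference, a short sketch of the entropy signal. It assumes the usual normalization, dividing Shannon entropy by $\log |A|$ so that a uniform policy scores 1 and a deterministic one scores 0. The text does not spell out the paper's exact normalization, so treat that step as an assumption.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of an action distribution divided by log|A|.

    Zero-probability actions contribute nothing, following the convention
    that 0 * log(0) = 0.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

uniform = normalized_entropy([0.25, 0.25, 0.25, 0.25])
peaked = normalized_entropy([0.97, 0.01, 0.01, 0.01])
certain = normalized_entropy([1.0, 0.0, 0.0, 0.0])
```

Because the score depends only on the current action distribution, it carries no built-in memory. That makes it a reasonable foil for a variable that is supposed to carry history.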

The comparison is elegant because it does not ask whether the whole agent changes. Of course it changes. The question is whether the slow latent and the policy-level signal change in different ways.

They do.

After A→B switches, the $g$-score rises gradually. After B→A switches, it declines along a different directional trajectory. The pattern is asymmetric and history-dependent. Policy entropy, by contrast, remains noisy and comparatively direction-insensitive. In the paper’s interpretation, this supports a dissociation between slow accumulated perspective dynamics and fast reactive policy adjustment.
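The switch-aligned comparison can be sketched as follows. The trace here is synthetic: a slow variable relaxing toward +1 in Regime B and toward -1 in Regime A, with invented and deliberately unequal rates. It stands in for a recorded $g$-score and says nothing about the paper's actual numbers. What the sketch shows is the bookkeeping: cut a window after each switch, average by switch direction, and check whether the two mean curves mirror each other.

```python
def aligned_windows(series, switch_times, horizon):
    """Windows of `horizon` steps after each switch, relative to the value at the switch."""
    windows = []
    for t0 in switch_times:
        if t0 + horizon <= len(series):
            base = series[t0]
            windows.append([series[t0 + k] - base for k in range(horizon)])
    return windows

def mean_curve(windows):
    """Average the windows step by step."""
    return [sum(w[k] for w in windows) / len(windows) for k in range(len(windows[0]))]

# Synthetic stand-in for a g-score trace (rates are made up for illustration).
PERIOD, N_BLOCKS = 40, 8
series, value = [], -1.0
a_to_b, b_to_a = [], []
for block in range(N_BLOCKS):
    in_b = block % 2 == 1
    target, rate = (1.0, 0.05) if in_b else (-1.0, 0.15)
    if block > 0:
        (a_to_b if in_b else b_to_a).append(len(series))
    for _ in range(PERIOD):
        series.append(value)
        value += rate * (target - value)

rise = mean_curve(aligned_windows(series, a_to_b, PERIOD))
fall = mean_curve(aligned_windows(series, b_to_a, PERIOD))
# If the two directions were mirror images, this would be zero at every step.
asymmetry = [r + f for r, f in zip(rise, fall)]
```

Running the same procedure on an entropy trace gives the contrast signal: a direction-insensitive variable should produce aligned curves that look alike regardless of which way the switch went.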

The result can be summarized carefully:

| Test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Training-zone occupancy | Main behavioral baseline | The agent learns to prefer predictable low-noise regions under prediction-error pressure. | It does not establish perspective by itself. |
| Regime switching | Main stress test | The environment can force the agent through structured contextual changes. | It does not simulate realistic deployment complexity. |
| $g$-score hysteresis | Main evidence for slow latent history | The global latent adapts directionally and gradually after switches. | It does not prove consciousness or subjective experience. |
| Policy entropy comparison | Contrast signal | Fast policy dynamics are more reactive and less direction-specific than $g_t$. | It does not exhaust all possible policy-level diagnostics. |
| Additional switching periods | Timescale sensitivity probe | The design recognizes that persistence depends on switch cadence. | The paper does not turn this into a broad robustness study. |

The most important phrase is “what it does not prove.” Hysteresis is evidence of a slow history-shaped internal variable. It is not evidence that the model has a private inner life. A thermostat can have lag. A market-making system can have inventory memory. A risk model can exhibit regime persistence. None of these facts require candles, incense, or a philosophy department.

What makes Pae’s result more interesting is not mere lag. It is the combination of architectural globality, gradient-separated slow dynamics, reward-free training, and switch-aligned directional measurement. The paper is trying to specify what kind of internal variable would be worth examining if one wanted a computational precondition for perspective.

That is a much narrower claim—and therefore a much stronger one.

The business relevance is agent observability, not artificial consciousness

For Cognaptus readers, the practical lesson is not that businesses should begin auditing whether their AI agents have “points of view” in a moral sense. Please do not create that committee. It will have minutes, subcommittees, and no useful output.

The useful question is operational:

Does this agent have a slow internal state that shapes behavior across time, and can we observe when that state drifts, lags, or becomes misaligned with the current regime?

That question matters for long-running agents. In a business workflow, the equivalent of a “regime switch” may be a new client segment, a policy update, a change in risk tolerance, a sudden supply disruption, a revised compliance rule, or simply a conversation that has moved from exploration to execution. A purely reactive system may fail because it forgets too much. A slow-state system may fail because it remembers the wrong thing for too long.

The paper points toward a diagnostic layer for this problem.

| Technical idea in the paper | Business translation | Practical use |
| --- | --- | --- |
| Slow global latent $g_t$ | Persistent agent stance or regime memory | Track whether an agent remains anchored to a prior context. |
| Gradient-blocked separation | Distinction between control updates and state interpretation | Reduce accidental coupling between short-term optimization and long-horizon orientation. |
| Prediction-error training | Internal organization around environmental regularity | Diagnose whether an agent is adapting to structure or chasing reward proxies. |
| Switch-aligned hysteresis | Test for history-dependent adaptation | Measure whether an agent responds differently depending on previous workflow state. |
| Policy entropy comparison | Separation between fast uncertainty and slow stance | Avoid mistaking momentary uncertainty for deeper state drift. |

This is especially relevant for AI agents that manage workflows rather than answer isolated prompts. In single-turn prompting, “state” is mostly a packaging problem. In multi-step autonomy, state becomes governance infrastructure. The agent’s hidden orientation can affect escalation thresholds, evidence weighting, tool choice, and interpretation of ambiguous inputs.

A future enterprise agent might need dashboards not only for accuracy, cost, and latency, but for persistence: how strongly is the system carrying prior context into new conditions? How quickly does it adapt when the environment changes? Does it overreact to local noise or underreact to structural shifts?
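One way such a persistence readout might look is an adaptation-lag metric: how many steps a monitored slow variable needs to cover most of its eventual shift after a known context change. Everything in this sketch is hypothetical, including the function name, the 90% threshold, the settling window, and the two example traces. It is a possible starting point for a dashboard metric, not something the paper proposes.

```python
def adaptation_lag(trace, switch_index, settle_window=10, fraction=0.9):
    """Steps after `switch_index` until the trace covers `fraction` of its eventual shift.

    The eventual level is estimated as the mean of the last `settle_window`
    points. Returns None if the trace does not shift or never gets there.
    """
    start = trace[switch_index]
    final = sum(trace[-settle_window:]) / settle_window
    shift = final - start
    if shift == 0:
        return None
    for k, value in enumerate(trace[switch_index:]):
        if (value - start) / shift >= fraction:
            return k
    return None

# Hypothetical slow-state traces after a context change at index 0.
fast_agent = [1.0 - 0.5 ** k for k in range(60)]
slow_agent = [1.0 - 0.95 ** k for k in range(60)]

fast_lag = adaptation_lag(fast_agent, 0)
slow_lag = adaptation_lag(slow_agent, 0)
```

Neither a short lag nor a long one is good in itself. A short lag on a noisy signal means overreaction, and a long lag after a real structural shift means the agent is still acting on a world that has gone.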

That is the business version of the paper’s mechanism. Not “the agent is conscious.” More like: “the agent has a slow internal stance, and if we cannot inspect it, debugging will become theater with logs.”

The mechanism also warns against naive memory design

Most product discussions about agent memory are too flat. They ask whether the model remembers more or less. The harder question is what kind of memory should influence what kind of decision.

Pae’s architecture suggests a useful separation.

Fast perceptual state should support immediate responsiveness. Slow global state should preserve longer-term regularities. Policy behavior should be conditioned by both, but the slow state should not be directly yanked around by every short-term policy update. That is a design principle worth stealing, with attribution and less metaphysical packaging.

In business systems, an analogous architecture could separate:

- short-term task context;
- long-term client or process state;
- policy and compliance constraints;
- uncertainty or entropy signals;
- regime-change detectors;
- slow-state drift monitors.

The key is not to call all of this “memory.” Memory is storage. Perspective-like state is interpretation. It changes what incoming information means for the next action.

A procurement agent that remembers a supplier’s past delays may simply retrieve facts. A procurement agent whose internal stance toward that supplier changes may interpret ambiguous delivery updates more cautiously, escalate sooner, or demand stronger confirmation. Those are different systems. One has a database. The other has a history-shaped operating posture.

Whether we should want the second system depends on the domain. In low-risk automation, it may be overengineering. In high-stakes workflow management, the absence of such instrumentation may become a liability.

The result is narrow, and that is exactly why it is usable

The limitations are not footnotes to be sprinkled like seasoning. They define the usable boundary of the paper.

First, the environment is tiny. A 5×9 grid-world with three noise zones and two regimes is a clean diagnostic setup, not a proxy for enterprise reality. The result shows that the proposed mechanism can exhibit structured hysteresis under controlled regime shifts. It does not show that the same measurement will scale cleanly to language agents, tool-using systems, or organizations with messy feedback loops.

Second, the implementation uses only exteroceptive inputs. The paper notes that an earlier design included interoceptive and bodily signals, but the current implementation excludes them for clarity. That matters because many phenomenological accounts treat embodied regulation as central to subjectivity. The paper’s implementation is intentionally minimal. Minimal is not complete.

Third, the evidence is based on five random seeds. The paper reports medians and interquartile-range (IQR) bands, which is reasonable for a compact exploratory study but does not amount to a broad empirical benchmark. There are no large-scale ablations across architecture families, no direct comparison with alternative slow-state mechanisms, and no extensive sweep showing when the hysteresis signature appears or disappears.

Fourth, the global latent is designed to be slow. Damping and smoothness regularization are part of the architecture. Therefore, the evidence should not be read as “slowness mysteriously emerged.” The stronger and more accurate reading is that a deliberately slow, globally conditioning latent shows direction-dependent, history-sensitive adaptation under regime switching, while policy entropy remains comparatively reactive.

Finally, the paper itself identifies a missing causal test: direct intervention on $g_t$, such as a $do(g)$ operation, while holding other components fixed. That would help determine whether the latent is merely correlated with perspective-like dynamics or causally responsible for shaping behavior. For business adoption, this distinction is not academic decoration. If an internal state is to be monitored, adjusted, or governed, correlation is useful; causal control is better.
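What such an intervention test could look like, in miniature: overwrite $g$, hold $z$ and the policy weights fixed, and measure how far the action distribution moves. The toy linear policy, the weights, and the use of total variation distance are all assumptions made for this sketch. The paper names the $do(g)$ test as missing and does not implement it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(z, g, w_z, w_g):
    """Toy pi(a | z, g) with separate linear weights for z and g (hypothetical)."""
    logits = [sum(a * b for a, b in zip(rz, z)) + sum(a * b for a, b in zip(rg, g))
              for rz, rg in zip(w_z, w_g)]
    return softmax(logits)

def total_variation(p, q):
    """Half the L1 distance between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def do_g_effect(z, g_observed, g_forced, w_z, w_g):
    """Shift in the action distribution when g is overwritten and all else is fixed."""
    return total_variation(policy(z, g_observed, w_z, w_g),
                           policy(z, g_forced, w_z, w_g))

w_z = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w_g_used = [[2.0], [-2.0], [0.0]]      # a policy that reads g
w_g_ignored = [[0.0], [0.0], [0.0]]    # a policy that ignores g

z = [0.3, -0.1]
effect_used = do_g_effect(z, [-1.0], [1.0], w_z, w_g_used)
effect_ignored = do_g_effect(z, [-1.0], [1.0], w_z, w_g_ignored)
```

A latent that merely correlates with regime would look like the second case: forcing it changes nothing downstream. A latent worth governing looks like the first.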

What to take from the paper

The paper’s strongest contribution is not a claim about machine consciousness. It is a disciplined mapping:

1. Define perspective as a global, pre-reflective, temporally persistent structure.
2. Implement it as a slow latent state separated from fast policy updates.
3. Test it using regime-switch hysteresis rather than behavioral performance alone.

That is a valuable pattern for AI systems research. It avoids two common failures. One is pure philosophy with no measurement. The other is pure engineering with no conceptual target. Pae’s paper tries to connect the two through an operational mechanism.

For business readers, the takeaway is equally specific. As agents become longer-running and more autonomous, the interesting risk will not only be hallucination at the output layer. It will be hidden state formation: the gradual development of internal orientations that shape future behavior. Some of those orientations will be useful. Some will be stale. Some will be misaligned with new regimes. Some will be invisible until they fail.

The next generation of agent governance may therefore need to monitor more than prompts, tools, and outputs. It may need to monitor the agent’s slow variables: what it is carrying forward, how quickly it adapts, and whether its internal stance still fits the world it is acting in.

That is not consciousness.

It is already enough work.

Cognaptus: Automate the Present, Incubate the Future.

¹ Hongju Pae, “Minimal Computational Preconditions for Subjective Perspective in Artificial Agents,” arXiv:2602.02902, 2026. The paper is listed as accepted to the AAAI 2026 Spring Symposium on Machine Consciousness. URL: https://arxiv.org/abs/2602.02902
