Agents Without Time: When Reinforcement Learning Meets Higher-Order Causality

Handoffs Are Where Fixed Time Sneaks Into Agent Design

Handoffs look harmless. One agent collects evidence, another checks it, a third decides, and a fourth sends the answer to a customer, robot, trader, or dashboard. The workflow diagram has arrows. The arrows have a direction. Someone decided which component acts first.

Usually that decision is treated as engineering housekeeping. In Matt Wilson’s paper, it becomes the point of the story.¹

The paper is not offering a new reinforcement learning algorithm. It is not a benchmark where a clever agent beats another clever agent after a weekend of GPU therapy. It is a formal paper, and the formal claim is stronger, stranger, and more useful for architecture thinking: deterministic finite-memory agents in POMDPs correspond exactly, up to behavioral equivalence, to one-input process functions from higher-order causality.

That sentence sounds like it was assembled in a laboratory without windows. The business version is simpler:

an agent policy is not just a rule for choosing actions; it can be treated as a higher-order object that plugs into an environment, and the way it plugs in determines what kinds of causal structure the system can express.

Once the paper moves to decentralized multi-agent systems, this matters. Some coordination failures are not caused by weak models, poor prompts, or insufficient context windows. They are caused by the architecture forcing somebody to act “first” when the task itself does not naturally have a first mover. Charming. We built a ceiling and then blamed the furniture.

The Mechanism: An Agent Policy Is a Process You Can Plug an Environment Into

The paper starts from deterministic POMDPs. A POMDP has hidden state, actions, observations, transition dynamics, and rewards. The agent does not directly see the environment state, so it keeps memory. In Wilson’s setup, an agent-state policy has two components:

Agent component	What it does	Why it matters
Policy $\pi : M \to A$	Chooses an action from memory	This is the outward-facing decision
Update $U : M \times A \times \Omega \to M$	Updates memory after action and observation	This is how the agent carries history forward

The paper stays deterministic. That boundary matters. No stochastic policy gradients are hiding in the basement. The goal is not to train such agents but to reveal a structural equivalence.

Now place this beside a one-input process function. In the paper’s setting, a process function is a higher-order object: it can be evaluated on another function. It is constrained by a unique fixed-point condition, which guarantees that when it is connected to another process, the resulting loop has exactly one consistent solution.

That fixed-point condition is the bridge. It is what allows a process function to be interpreted as a higher-order operation without producing logical nonsense. In physics language, process functions arise as the classical deterministic limit of higher-order quantum operations. In AI language, Wilson shows that they can encode agent-state policies.

The core construction is beautifully plain once translated:

$$ w_{[\pi,U]}(m,o) = (U(m,\pi(m),o), \pi(m)). $$

Read it slowly. The process function receives memory $m$ and observation $o$. It outputs the next memory and the action. The action comes from $\pi(m)$; the memory update uses the observation once the environment has responded.

Then the environment itself can be packaged as a function: action plus state goes in; observation, next state, and reward come out. When the process function is contracted with the environment function, the result is exactly the one-step evaluation of the agent interacting with the POMDP.
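A minimal sketch of that contraction, assuming toy one-bit sets for memory, action, observation, and state. The names `make_process_function`, `contract`, and `direct_step`, and the toy `pi`, `U`, and `env`, are hypothetical illustrations rather than anything defined in the paper. The point is only that closing the loop by searching for the single consistent observation gives the same answer as the ordinary act, observe, update step.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical toy types: memory, action, observation and state are all ints.
Policy = Callable[[int], int]                          # pi : M -> A
Update = Callable[[int, int, int], int]                # U : M x A x Omega -> M
Env = Callable[[int, int], Tuple[int, int, float]]     # (a, s) -> (o, s', r)


def make_process_function(pi: Policy, U: Update):
    """Build w(m, o) = (U(m, pi(m), o), pi(m))."""
    def w(m: int, o: int) -> Tuple[int, int]:
        a = pi(m)                  # the action never looks at o
        return U(m, a, o), a
    return w


def contract(w, env: Env, m: int, s: int, observations: Iterable[int]):
    """Close the loop for one step by finding the unique consistent observation."""
    consistent = [o for o in observations if env(w(m, o)[1], s)[0] == o]
    assert len(consistent) == 1, "the loop should have exactly one solution"
    o = consistent[0]
    m_next, a = w(m, o)
    _, s_next, r = env(a, s)
    return m_next, s_next, r


def direct_step(pi: Policy, U: Update, env: Env, m: int, s: int):
    """The ordinary act, observe, update cycle, for comparison."""
    a = pi(m)
    o, s_next, r = env(a, s)
    return U(m, a, o), s_next, r


# Toy one-bit agent and environment (illustrative only).
pi = lambda m: m
U = lambda m, a, o: o
env = lambda a, s: (a ^ s, a ^ s, 1.0 if a == s else 0.0)
w = make_process_function(pi, U)
```

Calling `contract(w, env, m, s, (0, 1))` and `direct_step(pi, U, env, m, s)` on the same memory and state should agree for every one-bit pair, which is the sense in which the contraction "is" the rollout.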

This is the first major result: behavioral-equivalence classes of deterministic agent-state policies are in one-to-one correspondence with one-input process functions. Two policies may use different internal update descriptions, but if they behave identically against every deterministic POMDP, they induce the same process function.

That is not analogy. It is accounting.

The Fixed-Point Condition Rebuilds Policy and Memory

The clever part is that the process-function side does not merely imitate agent structure after the author waves at it politely. The unique fixed-point condition forces a decomposition.

For a one-input process function, the component that supplies the input to the environment cannot depend on the environment’s output from the same step. Otherwise, one can construct a feedback situation with multiple fixed points, breaking uniqueness. In agent terms, the action must be chosen from current memory, not from an observation that has not yet been produced.
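The argument can be checked by brute force in the smallest possible case. The sketch below assumes one-bit actions and observations, holds memory fixed, and enumerates every deterministic one-bit environment; `count_fixed_points` and `is_valid` are hypothetical helper names. An action rule that ignores the same-step observation always closes the loop in exactly one way, while a rule that peeks at the observation yields two solutions against one environment and none against another.

```python
from itertools import product

BITS = (0, 1)

# Every deterministic one-bit environment, as a lookup table a -> o.
ENVIRONMENTS = list(product(BITS, repeat=2))

# Every candidate action rule, as a lookup table o -> a.
ACTION_RULES = list(product(BITS, repeat=2))


def count_fixed_points(action_rule, env_table):
    """Count observations o that are consistent with themselves around the loop."""
    return sum(1 for o in BITS if env_table[action_rule[o]] == o)


def is_valid(action_rule):
    """Unique fixed-point condition: exactly one solution against every environment."""
    return all(count_fixed_points(action_rule, e) == 1 for e in ENVIRONMENTS)


constant_rule = (1, 1)   # action chosen before the observation exists
peeking_rule = (0, 1)    # action copies the same-step observation

identity_env = (0, 1)    # environment echoes the action
negating_env = (1, 0)    # environment flips the action

valid_rules = [rule for rule in ACTION_RULES if is_valid(rule)]
```

In this one-bit setting the valid rules are exactly the constant ones, which is the toy version of "policy reads memory, not the observation it is about to cause."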

This is exactly the policy/update split:

AI interpretation	Process-function interpretation	Operational meaning
Memory $M$	External state passed through the process	What the agent carries across rounds
Action $A$	Input supplied to the environment	What the agent commits to this step
Observation $\Omega$	Output returned by the environment	What the agent can learn after acting
Policy $\pi$	Component independent of same-step observation	Decision before feedback
Update $U$	Component dependent on observation	Learning after feedback
One-step rollout	Process-function contraction	Running the closed loop once

This is why the paper’s first result is not just decorative category theory. It says the familiar “act, observe, update” cycle already has the shape of a higher-order causal construction.

The callback matters later. In a single-agent setting, the fixed-point condition reconstructs the ordinary timeline. In a multi-agent setting, the same mathematical machinery allows us to ask whether a fixed timeline is necessary at all.

The Category Theory Is a Constraint Language, Not Decorative Algebra

The next part of the paper generalizes process functions into a category of types, called $\mathbf{PF}$, and shows that it is $\ast$-autonomous. For readers who do not spend weekends whispering to monoidal categories, here is the useful translation: the paper builds a formal language for composing systems while keeping track of what kind of information flow is allowed.

The category-theoretic results are not empirical evidence. They are the compositional infrastructure that lets the later multi-agent claim be stated cleanly. The important identifications are:

Formal structure	AI-side interpretation	Why it matters
Ordinary function space	Single POMDP-style transformation	Basic input-output behavior
Product / tensor-like composition	Independent components	Decentralized parts remain separated
Observation-independent dec-POMDP type	Each agent’s observation excludes other agents’ same-step actions	No within-step signaling
Multi-input process function	Higher-order multi-party strategy	Can represent non-definite causal order

Observation independence is especially important. In a decentralized POMDP, multiple agents act locally and receive local observations. Observation independence says that agent $i$’s current observation does not depend on agent $j$’s current action for $j \neq i$. This is the deterministic version of a no-signaling constraint inside one environment step.

That constraint sounds restrictive, but it is common in decentralized control. A robot, sensor, branch office, or local decision unit may see its own local state before it sees another unit’s latest action. Communication may happen across rounds, but not magically inside the same instant.
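For a small deterministic environment, observation independence can be tested by enumeration. This is a rough sketch under assumed conventions: `obs_fn(state, joint_action)` returns one observation per agent, and the function name `is_observation_independent` and both example environments are hypothetical. An agent's observation may still depend on its own action; only other agents' same-step actions are forbidden.

```python
from itertools import product


def is_observation_independent(obs_fn, states, action_sets):
    """Check that agent i's observation ignores every other agent's current action.

    obs_fn(state, joint_action) -> tuple with one observation per agent.
    """
    n_agents = len(action_sets)
    for state in states:
        for i in range(n_agents):
            seen = {}  # own action -> observation first seen for it
            for joint in product(*action_sets):
                o_i = obs_fn(state, joint)[i]
                if seen.setdefault(joint[i], o_i) != o_i:
                    return False  # same own action, different observation
    return True


STATES = list(product((0, 1), repeat=2))   # one hidden bit per agent
ACTIONS = [(0, 1), (0, 1)]                 # two agents, one-bit actions

# Each agent sees its own hidden bit combined with its own action.
local_env = lambda s, joint: (s[0] ^ joint[0], s[1] ^ joint[1])

# Agent 0 is shown agent 1's current action: within-step signaling.
leaky_env = lambda s, joint: (joint[1], s[1])
```

Here `is_observation_independent(local_env, STATES, ACTIONS)` holds and the same call on `leaky_env` fails, because agent 0 can read agent 1's move inside the step.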

The paper notices that this exact structure is also the natural domain for multi-input process functions used to model indefinite causal order. That is the hinge: observation-independent decentralized POMDPs and indefinite-causal-order process functions share the same formal socket.

Once the sockets match, a new question becomes legal:

Can a decentralized strategy with indefinite causal order outperform every strategy forced into a definite causal order?

Wilson’s answer is yes, in a constructed proof-of-principle POMDP.

The Separation: One Fixed First Mover Costs a Quarter of the Reward Mass

The paper’s concrete separation uses a majority-GYNI game. GYNI stands for “guess your neighbor’s input,” because apparently formal methods researchers also deserve a little mischief.

There are three agents. Each receives one bit, $x_i \in \{0,1\}$. Together they output three bits, $y_1,y_2,y_3$. The reward rule depends on the majority bit of $x=(x_1,x_2,x_3)$. The paper embeds this game into a deterministic decentralized POMDP:

Element	Construction in the paper	Purpose
Hidden/global state	$x \in \{0,1\}^3$ plus a counter $k$	Keeps the game input fixed across rounds
Local observation	Agent $i$ observes $x_i$	Enforces observation independence
Local action	Agent $i$ outputs $y_i$	Produces the game answer
Warm-up rounds	First $n$ rounds give zero reward	Separates memory accumulation from rewarded play
Reward rounds	Rounds $n+1$ to $T$ score the majority-GYNI rule	Measures causal-order advantage

The proof then isolates the bottleneck. Under between-rounds decentralization, each agent’s memory at round $t$ remains a function only of its own local bit $x_i$. That is Lemma 6.3. It is not an ablation; it is the information-flow invariant that makes the upper bound possible.
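The invariant is easy to see in simulation. The sketch below is an illustration of the idea, not the paper's proof: it assumes three agents given as `(policy, update)` pairs, an environment in which agent $i$ observes only $x_i$, and no communication channel at all. The helper names and the `copy_agent` are hypothetical. A deliberately leaky environment is included to show what the check looks like when the invariant breaks.

```python
from itertools import product

BITS = (0, 1)
INPUTS = list(product(BITS, repeat=3))


def local_observe(x, actions):
    """Assumed environment: agent i sees only its own bit x_i."""
    return x


def simulate(agents, x, rounds, observe=local_observe):
    """Run three (policy, update) agents jointly; return each memory trajectory."""
    mems = [0, 0, 0]
    traces = [[0], [0], [0]]
    for _ in range(rounds):
        actions = tuple(pi(m) for (pi, _), m in zip(agents, mems))
        obs = observe(x, actions)
        mems = [U(m, a, o) for (_, U), m, a, o in zip(agents, mems, actions, obs)]
        for trace, m in zip(traces, mems):
            trace.append(m)
    return traces


def memory_is_local(agents, rounds, observe=local_observe):
    """True if agent i's memory trajectory is determined by x_i alone."""
    for i in range(3):
        seen = {}  # own bit -> trajectory first seen for it
        for x in INPUTS:
            trace = simulate(agents, x, rounds, observe)[i]
            if seen.setdefault(x[i], trace) != trace:
                return False
    return True


# Hypothetical agent: always plays 0 and stores its latest observation.
copy_agent = (lambda m: 0, lambda m, a, o: o)
agents = [copy_agent] * 3

# A leaky environment in which agent 0 is shown agent 1's bit instead.
leaky_observe = lambda x, actions: (x[1], x[1], x[2])
```

With `local_observe`, `memory_is_local(agents, 4)` holds however long the warm-up runs; swapping in `leaky_observe` makes it fail, since agent 0's memory now tracks a bit that is not its own.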

Next comes Lemma 6.4. If one agent’s output is fixed as a function only of its own bit, then no matter what the other two outputs are, the success probability is at most $3/4$ under uniform inputs. The appendix truth table supports this counting argument; it is a verification detail for the majority-GYNI rule, not a separate experiment.
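The counting argument can be reproduced in a few lines. One loud assumption: the article does not spell out the win rule, so `target` below uses a common form of the three-party majority game from the process-function literature (when the majority of inputs is 0, each agent must output a neighbor's bit; when it is 1, the negation of the other neighbor's bit). The paper's labeling may differ. The sketch gives the other two agents the benefit of the doubt and asks only how often one locally constrained agent can be right.

```python
from fractions import Fraction
from itertools import product

BITS = (0, 1)
INPUTS = list(product(BITS, repeat=3))


def majority(x):
    return int(sum(x) >= 2)


def target(x):
    """Required outputs (y1, y2, y3) under the ASSUMED majority-game rule."""
    a, b, c = x
    if majority(x) == 0:
        return (c, a, b)
    return (1 - b, 1 - c, 1 - a)


def best_local_success(i):
    """Best chance that agent i is correct when y_i = f(x_i), under uniform inputs.

    The other agents are assumed to be always right, so this is an upper
    bound on the joint success probability.
    """
    best = Fraction(0)
    for f in product(BITS, repeat=2):          # f[bit] is agent i's output
        hits = sum(1 for x in INPUTS if f[x[i]] == target(x)[i])
        best = max(best, Fraction(hits, len(INPUTS)))
    return best


bounds = [best_local_success(i) for i in range(3)]
```

Under the assumed rule every entry of `bounds` comes out to $3/4$, so the cap does not depend on which agent is unlucky enough to be the one with a purely local output.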

Now combine this with definite causal order. In a definite-ordered multi-input process function, there is always some party that is first in the causal order. That party’s current output cannot depend on the other parties’ current outputs. In the majority-GYNI POMDP, that first party is trapped: its action is ultimately a function only of its own bit. Therefore every rewarded round is capped at expected reward $3/4$.

The paper states the bound as:

$$ J_{\text{definite}} \leq \frac{3}{4}\sum_{t=n+1}^{T}\gamma^{t-1}. $$

Then the indefinite-causal-order strategy uses the Lugano process, known in the process-function literature to win the majority-vote GYNI game perfectly. In this construction, it achieves reward $1$ in every rewarded round:

$$ J_{\text{indefinite}} = \sum_{t=n+1}^{T}\gamma^{t-1}. $$
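Both halves of that claim, logical consistency and a perfect score, can be checked exhaustively. The formula for the Lugano process and the win rule below are assumptions taken from the wider process-function literature, not quoted from Wilson's paper, and the function names are hypothetical. The strategy sketched here is the simplest one: each agent feeds its own bit into the process and outputs whatever the process hands back.

```python
from itertools import product

BITS = (0, 1)
INPUTS = list(product(BITS, repeat=3))


def lugano(a, b, c):
    """ASSUMED form of the Lugano process: what each party receives, given what all send."""
    return ((1 - b) & c, (1 - c) & a, (1 - a) & b)


def majority(x):
    return int(sum(x) >= 2)


def target(x):
    """Required outputs under the ASSUMED majority-game rule."""
    a, b, c = x
    if majority(x) == 0:
        return (c, a, b)
    return (1 - b, 1 - c, 1 - a)


def has_unique_fixed_points():
    """Check the unique fixed-point condition against all local bit-to-bit functions."""
    local_functions = list(product(BITS, repeat=2))   # lookup tables received -> sent
    for fa, fb, fc in product(local_functions, repeat=3):
        solutions = 0
        for a, b, c in INPUTS:
            x, y, z = lugano(a, b, c)
            if (fa[x], fb[y], fc[z]) == (a, b, c):
                solutions += 1
        if solutions != 1:
            return False
    return True


def play(x):
    """Each agent sends its own bit into the process and outputs what it receives."""
    return lugano(*x)


wins_everywhere = all(play(x) == target(x) for x in INPUTS)
```

No agent acts "first" in this loop: each one's received value depends on the other two, yet every choice of local functions still yields exactly one consistent assignment.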

So the gap is not a vague “maybe better coordination.” It is a clean proof: definite causal order loses at least one quarter of the available discounted reward mass in this constructed task, while an indefinite-order process gets all of it.
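For a concrete sense of scale, the discounted sum has a closed form whenever $\gamma \neq 1$, and the gap follows by subtracting the two expressions above:

$$ \sum_{t=n+1}^{T}\gamma^{t-1} = \gamma^{n}\,\frac{1-\gamma^{T-n}}{1-\gamma}, \qquad J_{\text{indefinite}} - J_{\text{definite}} \geq \frac{1}{4}\,\gamma^{n}\,\frac{1-\gamma^{T-n}}{1-\gamma}. $$

As an illustrative choice of numbers (not taken from the paper), $\gamma = 0.9$, $n = 2$, $T = 5$ gives a reward mass of $0.81 + 0.729 + 0.6561 = 2.1951$, so the definite-order ceiling is about $1.646$ and the gap is at least about $0.549$.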

Here is how to read the paper’s technical evidence without mistaking it for a product demo:

Paper component	Likely purpose	What it supports	What it does not prove
Proposition 4.2 and Theorem 4.3	Main formal correspondence	Deterministic agent-state policies match one-input process functions up to behavioral equivalence	Learnability or neural implementation
Theorems 5.5–5.7 and Appendix B	Formal compositional infrastructure	POMDP composition, decentralization, and observation independence can be expressed by process-function types	That enterprise systems should implement category theory directly
Lemmas 6.3–6.4	Bottleneck analysis	Definite-order decentralized strategies have a local-information limitation	A broad empirical limit for all multi-agent RL
Theorems 6.5–6.6	Main separation result	Indefinite causal order can strictly outperform definite order in the constructed dec-POMDP	Practical construction, efficient training, or deployment readiness
Appendix C truth table	Verification detail	The $3/4$ counting argument in the GYNI game	Robustness across real-world tasks

No ablation table is hiding here. No benchmark suite. The evidence is proof-based: definitions, correspondences, and a constructed separation. For this paper, that is the right kind of evidence.

What This Means for Business Agent Orchestration

The business implication is not “buy quantum reinforcement learning before your competitor does.” Please do not put that on a slide unless the slide is evidence in a negligence case.

The useful implication is architectural: causal order should be treated as a design variable in multi-agent systems, not as an invisible default.

Most enterprise agent systems are built as ordered pipelines. Data extraction before verification. Verification before decision. Decision before execution. Execution before monitoring. That is often sensible. Sometimes it is also a bottleneck disguised as discipline.

Wilson’s paper gives us a formal reason to be suspicious of fixed order when three conditions appear together:

multiple agents have local observations;
same-step communication is restricted or expensive;
the correct action depends on a mutual dependency pattern that no single first mover can resolve locally.

In those cases, the coordination problem may not be solved by giving the first agent a longer prompt. The first agent is first. That is the problem.

A practical business interpretation looks like this:

Design question	Ordinary orchestration answer	Process-function-inspired answer
Who acts first?	Pick a sequence	Ask whether the task actually admits a first mover
How do agents share information?	Messages between steps	Explicitly separate within-step and between-round dependencies
Why does performance plateau?	Model weakness or missing data	Possible causal-order ceiling
What should be optimized?	Prompts, tools, and routing	Prompts, tools, routing, and causal architecture
What is the ROI relevance?	Better outputs	Cheaper diagnosis of structural coordination failure

This is especially relevant for agentic systems that coordinate evidence rather than merely process documents. Fraud teams, trading systems, incident response agents, supply-chain monitors, robotic fleets, and research assistants all have versions of the same problem: each local component sees part of the world, but the final decision may depend on relationships among parts.

The paper does not say these systems need literal indefinite causal order. It says fixed causal order can be a real mathematical restriction. That is enough to change how we diagnose failures.

A Causal-Order Audit for Multi-Agent Systems

A reasonable takeaway is not to rebuild the software stack around process functions tomorrow morning. The responsible move is simpler: audit where causal order is being assumed.

Audit checkpoint	Question to ask	Warning sign
First-mover constraint	Which component must commit before others act?	One agent repeatedly makes low-confidence early decisions
Observation boundary	What can each component see in the same step?	Local agents act without decisive cross-context
Memory locality	Is memory shared globally or kept locally?	Agents repeat mistakes because relevant state is partitioned
Communication timing	Is communication allowed within a decision cycle or only after it?	Delayed messages arrive after the key decision
Reward bottleneck	Could success require mutual dependency among agents’ outputs?	Prompt improvements help slightly but plateau quickly
Architecture alternative	Can the workflow be reformulated as iterative consistency, negotiation, or joint resolution?	A pipeline is used only because pipelines are easy to draw

The paper’s mechanism suggests a broader discipline for agent design: separate the logical dependency graph from the execution schedule. Sometimes they match. Often they are merely forced to match because the workflow tool wants arrows.

That distinction matters. A process can be implemented sequentially while still conceptually solving a joint fixed-point problem. Conversely, a system can run in parallel while still embedding a hidden first-mover assumption. Parallelism is not the same as causal flexibility. Very annoying, but true.
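One way to make the dependency-graph half of that distinction explicit is to write down which components need which same-step outputs and ask whether any valid first mover exists. The sketch below is a hypothetical audit helper, not something from the paper; it uses Python's standard `graphlib` module, and the component names are made up. A cycle in the graph means no ordering of the arrows can respect the dependencies as stated.

```python
from graphlib import CycleError, TopologicalSorter


def first_movers(depends_on):
    """Return the components that can act first, or None if the graph is cyclic.

    depends_on maps each component to the set of components whose
    same-step output it needs before acting.
    """
    sorter = TopologicalSorter(depends_on)
    try:
        sorter.prepare()              # raises CycleError on mutual dependency
    except CycleError:
        return None
    return set(sorter.get_ready())    # nodes with no unmet dependencies


# An ordinary pipeline: extraction has nothing upstream of it.
pipeline = {
    "extract": set(),
    "verify": {"extract"},
    "decide": {"verify"},
}

# A mutual dependency: every component needs another one's current output.
mutual = {
    "agent_a": {"agent_c"},
    "agent_b": {"agent_a"},
    "agent_c": {"agent_b"},
}
```

`first_movers(pipeline)` returns `{"extract"}`, while `first_movers(mutual)` returns `None`. A `None` here does not mean the task is unsolvable; it means any sequential schedule is an implementation choice layered on top of a joint consistency problem, which is worth knowing before tuning prompts.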

Boundaries: This Is Not a Drop-In Algorithm, and Definitely Not a Time Machine

The limitations are not cosmetic. They define the correct use of the paper.

First, the framework is deterministic. The agent-state policies, POMDPs, and process functions are treated in a deterministic setting. Most industrial RL and agentic AI systems are stochastic, approximate, neural, and full of operational compromises. The paper’s correspondence is therefore a formal bridge, not an implementation manual.

Second, the separation result is a constructed proof-of-principle. The majority-GYNI dec-POMDP is designed to expose a causal-order gap. That is valuable, but it is not evidence that every decentralized business workflow has a $25\%$ reward ceiling waiting to be heroically liberated by category theory.

Third, the paper does not show how to efficiently learn indefinite-causal-order strategies. Wilson explicitly leaves open whether practical observation-independent decentralized POMDPs exist where such strategies outperform traditional definite-ordered ones, and how efficiently those strategies could be constructed or learned.

Fourth, the quantum-AI pathway is speculative in the precise academic sense, not in the LinkedIn sense. The paper motivates a possible fully quantum generalization of POMDPs, where decision-making agents correspond to quantum super-channels or process matrices. It does not establish a near-term quantum advantage for enterprise agents.

The clean boundary is this:

Directly shown by the paper	Reasonable Cognaptus inference	Still uncertain
Deterministic agent-state policies correspond to one-input process functions	Agent architecture can be analyzed as higher-order process composition	How this scales to stochastic neural agents
Observation-independent dec-POMDPs align with multi-input process functions	Some multi-agent tasks should be designed around information-flow constraints, not only model quality	Which real tasks exhibit meaningful causal-order ceilings
Majority-GYNI creates a strict definite-vs-indefinite performance gap	Fixed workflow order can be a structural bottleneck	How to learn or deploy indefinite-order strategies efficiently
The categorical framework composes these objects cleanly	There may be a useful formal language for auditing agent orchestration	Whether practitioners will tolerate the notation without fleeing

The last point is not a theorem, but one does develop instincts.

The Architecture Can Be the Ceiling

The most important lesson of this paper is not that agents can “escape time.” They cannot. Your Kubernetes cluster remains tragically chronological.

The lesson is that time order and information dependency are not the same thing. In ordinary agent engineering, we often collapse them into one workflow graph because the graph is convenient. Wilson’s paper shows, at a formal level, that this collapse can matter. Once decentralized agents are placed inside observation constraints, requiring a definite first mover can reduce attainable reward.

For Cognaptus-style business interpretation, this shifts the question from “How do we make the agent smarter?” to a sharper one:

What causal structure have we forced the agent system to obey, and is that structure part of the task or merely part of our implementation?

That is the useful discomfort. It does not sell a product. It makes a product team ask better questions before spending another month polishing a pipeline that is mathematically condemned to hesitate at the first handoff.

The paper’s contribution is therefore not a new agent recipe. It is a new diagnostic lens: agent behavior, environment interaction, decentralized observation, and causal order can be placed in one formal frame. Once they are in the same frame, architecture stops being background plumbing and becomes part of the optimization problem.

Cognaptus: Automate the Present, Incubate the Future.

¹ Matt Wilson, “Agent policies from higher-order causal functions,” arXiv:2512.10937. https://arxiv.org/abs/2512.10937
