A booking agent is not dangerous because it can “reason.” It is dangerous because it can remember the wrong thing, forget the right thing, loop politely forever, or book the flight before the human has actually confirmed. The philosophy department may enjoy debating whether this counts as intention. The operations team has a simpler question: can we know, before deployment, what behaviours this system can produce?
That is the useful centre of Are Agents Probabilistic Automata? A Trace-Based, Memory-Constrained Theory of Agentic AI by Roham Koohestani, Ziyou Li, Anton Podkopaev, and Maliheh Izadi.1 The paper does not try to make agentic AI more magical. It does something more impolite and more valuable: it asks what class of machine an agent becomes once we specify what kind of persistent memory it is allowed to use.
The answer is not “all agents are just finite automata,” nor “LLMs are secretly Turing machines in a trench coat.” The paper’s argument is subtler. It classifies agent architectures by their enforced memory interfaces, then maps those interfaces to the kind of trace language the system can generate and the kind of probabilistic verification model one can reasonably build.
That distinction matters. In the current agent economy, memory is usually sold as a feature. Long-term memory, scratchpads, vector stores, files, task histories, persistent plans: each makes the agent feel more capable. The paper asks the less flattering question: what did that memory do to your assurance story?
The mechanism: memory interface determines behavioural class
The paper’s central mechanism can be stated without ceremony:
- If an agent has only bounded persistent memory, its feasible interaction traces are regular.
- If it has stack-scoped memory, its feasible traces are context-free.
- If it has unbounded read/write persistent memory, its feasible traces become recursively enumerable, with the usual undecidability problems arriving right on schedule.
The important word is “interface.” The authors are not classifying agents by brand name, benchmark score, or whether the demo used a dramatic progress bar. They classify the controller by what memory operations the architecture actually enforces.
A bounded-memory agent has a finite control state and only constant-bounded information that can influence future decisions. Think of a slot-filling booking controller with states such as Get_Destination, Get_Date, Confirm, and Book. If the relevant values are discretised or bounded, all control-relevant persistence can be encoded into a finite state space.
A stack-scoped agent has finite control plus a last-in, first-out stack. This is the natural shape of hierarchical task execution: call a subtask, complete it, return to the caller. It can nest tasks, but it cannot rummage arbitrarily through an ever-growing notebook.
A Turing-complete, or TC, agent has finite control plus an unbounded persistent store that it can read and write without a stack discipline. Files, arbitrary notes, databases, unbounded scratchpads, and open-ended rereading all belong here when they are not structurally bounded.
The result is a clean, slightly brutal table:
| Enforced memory interface | Trace support class | Quantitative model | Business question it enables | Boundary |
|---|---|---|---|---|
| Bounded persistent memory | Regular | Finite Markov chain or MDP after abstraction | What is the probability of reaching an unsafe state, such as booking without confirmation? | Requires finite alphabets, bounded state, and an abstraction that preserves control-relevant distinctions. |
| Stack-scoped memory | Context-free | Probabilistic pushdown model or pushdown MDP | Can hierarchical call-return behaviour stay within safe regions? | Does not automatically prove termination; stack behaviour still needs assumptions or bounded analysis. |
| Unbounded read/write memory | Recursively enumerable | General unbounded verification inherits undecidability | What can be checked only under bounds, monitors, or finite abstractions? | General verification is not recovered by optimism, dashboards, or saying “agentic” in a confident voice. |
This is the paper’s best move. It replaces the vague question “how autonomous is the agent?” with the sharper question “what memory discipline does the architecture enforce?”
Why traces are the right object, not inner monologues
The paper uses traces as its semantic object. A trace is a sequence of observable interactions: perceptions, actions, tool calls, responses, and state transitions depending on the chosen abstraction. This is a better target than the agent’s private reasoning because enterprises rarely need to certify the poetry of a hidden chain of thought. They need to know whether the deployed system can call book, transfer, delete, approve, or notify in the wrong order.
That is also where the paper avoids a common trap. It is not saying an LLM agent is literally a deterministic acceptor from an automata textbook. It is saying that, after defining a controller architecture, an environment, a finite abstraction, and observable traces, the support of the trace distribution falls into known formal classes under known memory constraints.
Support means the set of traces with nonzero probability. Probabilities enter separately. The paper treats the policy component, such as an LLM, as stochastic: it selects among architecturally enabled actions. The automata correspondence classifies which traces are feasible at all; the probabilistic model then supports quantitative questions about how likely certain trace regions are.
That separation is essential. “Can this happen?” and “how likely is it to happen?” are not the same question. An enterprise risk register that confuses the two is not doing risk management. It is doing vibes with columns.
Regular agents: boring is verifiable
The first formal result concerns bounded-memory, or Regular, agents. Under an exact finite abstraction that preserves the control-relevant distinctions, the paper constructs a finite automaton whose language matches the feasible abstract interaction traces. Because the state space and alphabets are finite, the trace support is regular.
This is not glamorous. It is, however, operationally useful.
For a regular booking controller, the architecture can be designed so that every persistent control variable is bounded: destination known or unknown, budget category, confirmation state, search result status, and so on. The resulting system may still use an LLM internally to select phrasing or choose among permitted transitions. But the LLM does not get to invent a new persistent control universe.
That design gives verification something to bite. Once the closed loop is abstracted into a finite model, the analysis can be framed as a Markov chain if the environment is treated as fully probabilistic, or as a Markov decision process if tool outcomes or environment behaviours include nondeterminism.
The business translation is simple: if a workflow does not require unbounded memory, do not give it unbounded memory. A refund approval agent, travel booking agent, claim triage assistant, onboarding assistant, or compliance checklist bot often needs bounded state plus careful tool mediation. It does not always need a self-extending research diary.
This is where the paper’s “right-sizing” principle becomes useful. The weakest sufficient controller is not a philosophical preference. It is an assurance strategy. Smaller behavioural class, smaller trace envelope, better verification leverage.
Context-free agents: hierarchy without a junk drawer
Some tasks need nested structure. A travel agent may need to plan a trip, check a calendar, search flights, compare hotels, return to the original plan, and then ask for confirmation. A software maintenance agent may need to inspect a failing test, enter a debugging subtask, return to the parent repair plan, and then run validation.
That is where stack-scoped memory matters.
The paper’s Context-Free Agent has finite control plus a stack. Its persistent memory is LIFO: the agent can push a task frame, work inside it, and pop back to the caller. The paper shows that the feasible trace support is context-free and that the probabilistic analogue can be represented using probabilistic pushdown models or pushdown MDPs.
This is not merely a technical flourish. It draws a line between hierarchical planning and general memory sprawl. A stack gives the agent structured recursion. It does not give the agent a general-purpose warehouse full of notes it can inspect in arbitrary order.
That distinction is precisely what many agent frameworks blur. A “manager-worker” architecture sounds structured, but if each worker can write arbitrary files, read any previous note, and call itself through a loop with no enforced bound, the stack story has quietly left the building. The label says hierarchy. The memory interface says something else.
For business systems, context-free architectures are appealing when the task genuinely has nested call-return structure: case handling, subtask delegation, recursive document review, or multi-stage planning where each subtask should return control cleanly. The benefit is not that pushdown systems are magically safe. They are not. The paper explicitly notes that termination is not implied merely by pushdown structure. The benefit is that the architecture preserves enough structure for established reachability-style analysis to remain meaningful.
TC agents: capability expands, assurance shrinks
The third class is where modern agent design usually wants to go: unbounded read/write persistent memory. The agent can accumulate files, notes, databases, intermediate artefacts, and arbitrary histories. It can later reread them in a non-stack manner. It can iterate without a global structural bound.
In the paper’s model, this class induces recursively enumerable trace support. Conversely, any recursively enumerable language can be realised by such an architecture. That is the formal way of saying: once you give the controller unbounded general memory, you have crossed into universal-computation territory.
This does not mean TC agents are bad. It means they are expensive to assure in the general case.
Open-ended research assistants, code agents that iteratively inspect repositories, autonomous analysts that build and revise persistent artefacts, and multi-step investigation systems may genuinely need general read/write persistence. The paper is not arguing that every system should be flattened into a finite-state toy. It is arguing that the cost of power should be visible.
For TC agents, general unbounded verification inherits undecidability barriers. Practical analysis therefore moves to bounded executions, finite abstractions, step limits, token budgets, storage quotas, runtime monitors, and restricted interfaces. These are not afterthoughts. They are the price of using the more powerful architecture.
The old draft of this article framed the point as “assurance collapses into empirical testing and runtime monitoring.” That was too blunt. The paper’s claim is more disciplined. Verification does not vanish; it becomes conditional. You can verify bounded horizons. You can verify finite abstractions. You can verify mediated shells. You cannot take an unrestricted, unbounded read/write agent and expect general algorithmic guarantees to appear because procurement asked nicely.
The case study is illustration, not experimental evidence
The paper includes a booking-controller case study. Its purpose is not to prove that one agent performs better than another. There are no benchmark scores, ablation tables, or runtime comparisons hiding in the appendix. The case study is an illustrative mapping: same external task, same broad tool interface, different memory constraints.
That distinction matters because the business lesson does not come from measured accuracy. It comes from architectural consequence.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Formal definitions of trace semantics and abstraction | Main theoretical setup | Shows how agent behaviours can be treated as trace distributions after abstraction | Does not guarantee any abstraction is automatically faithful. |
| Regular-agent theorem | Main correspondence result | Bounded persistent memory yields regular trace support and finite probabilistic models | Does not imply the real system is safe without checking the model and abstraction. |
| Context-free-agent theorem | Main correspondence result | Stack-scoped memory yields context-free support and pushdown-style probabilistic models | Does not imply termination or simple verification for every property. |
| TC-agent theorem | Main correspondence result | Unbounded read/write persistence yields recursively enumerable support and undecidability barriers | Does not say TC agents are unusable; it says general unbounded assurance is not available. |
| Multi-agent extension | Theoretical extension | Shows finite local agents remain bounded under finite composition, but shared unbounded memory can lift the system to TC behaviour | Depends on scheduler, interleaving, atomicity, and shared-memory assumptions. |
| Booking controller | Illustrative case | Demonstrates right-sizing across the three memory classes | Does not compare real products or empirical performance. |
This is why a mechanism-first reading is better than a summary. The paper is not trying to win a leaderboard. It is giving engineering teams a classification device: inspect the memory interface, infer the behavioural class, then choose the appropriate assurance method.
Multi-agent systems: the shared notebook is the trapdoor
One of the paper’s most practical extensions concerns multi-agent systems. If several bounded-memory agents are composed under standard interleaving semantics, their product remains bounded-memory, though the global state space can grow multiplicatively. That is inconvenient but still finite.
Similarly, context-free agents can be treated with pushdown-style composition under suitable conditions, such as separate stacks and finite-state scheduling.
The class changes when agents share an unbounded, arbitrarily readable and writable store. Even if each individual agent has only finite control, the group plus shared memory can simulate a Turing machine. The shared store becomes the tape; the agents provide finite control; the scheduler supplies sequencing.
This is not an exotic edge case. Many practical multi-agent systems are built around shared scratchpads, shared task boards, shared vector stores, shared files, or shared project memory. The system diagram says “collaboration.” The automata-theoretic interpretation may say “congratulations, you built the harder class.”
For enterprise design, this means local simplicity is not enough. A team of simple agents can become globally difficult to verify if the shared memory substrate is unrestricted. Governance therefore has to inspect not only each agent’s prompt and tool list, but also the shared persistence layer: who can write, who can read, what can grow, what is bounded, and what access discipline is enforced.
The business value is assurance architecture, not academic nostalgia
It would be easy to treat this paper as a retrocomputing exercise: Chomsky hierarchy meets agentic AI, everyone nods respectfully, then goes back to shipping a ReAct loop with a database and a prayer.
That would miss the point.
The practical value is an assurance architecture pattern:
- Specify the task interface: perceptions, actions, tool calls, and unsafe regions.
- Decide what persistent information must influence future control.
- Choose the weakest memory class sufficient for that task.
- Enforce that memory class at the framework level.
- Build a finite or pushdown abstraction when available.
- Ask quantitative reachability questions over that model.
- For TC designs, impose bounds, monitors, or mediated shells before making assurance claims.
The paper’s examples of quantitative questions are exactly the kind enterprises should be asking: What is the probability of reaching an unsafe region? What is the maximum probability of an unsafe call under nondeterministic tool outcomes? Can the system call book before confirm? Can it enter an excessive retry loop? Can it approve an action after losing the required context?
Notice what is absent: “Does the agent understand?” Understanding may be a fine research topic. It is not the first control question for a system with API access.
What Cognaptus infers for implementation
The paper directly shows a formal correspondence between memory-constrained controller architectures and automata-theoretic trace classes. It also gives a right-sizing guideline and a booking-controller illustration.
The business inference is that memory permissions should become a first-class part of agent governance. Not a footnote in a prompt. Not a configuration buried in a framework. A design object.
A practical enterprise checklist would ask:
| Governance question | Why it matters |
|---|---|
| Is all persistent control-relevant information bounded? | If yes, a finite abstraction may support finite-state probabilistic checking. |
| Is unbounded persistence restricted to LIFO task frames? | If yes, pushdown-style reasoning may apply to hierarchical workflows. |
| Can the agent write arbitrary artefacts and reread them later? | If yes, the architecture may be TC in the unbounded model. |
| Are tool arguments discretised or abstracted into finite alphabets? | The formal mapping depends on finite observable symbols or faithful abstraction. |
| Can external tools themselves perform unbounded computation? | The paper classifies the controller, not the computational power of external oracles. |
| Is shared memory available across agents? | Shared unbounded memory can lift a bounded multi-agent system into TC behaviour. |
| Are step limits, token budgets, retry limits, and storage quotas explicit? | Bounds can recover finite-horizon analysis even for otherwise powerful systems. |
The ROI is not “formal methods make AI safe.” That sentence should be confiscated and recycled. The ROI is narrower and better: formal memory constraints reduce the behavioural envelope, make failures easier to state, and allow some risk questions to be answered with models rather than anecdotes.
The boundary: abstraction is where the hard work hides
The paper’s limitation section is important because it prevents the theory from becoming decorative certainty.
First, LLM context is itself a form of memory. If a supposedly bounded controller quietly includes a growing conversation history, retrieved documents, or hidden state in the prompt, then the claimed memory discipline may not hold. The finite-state or pushdown mapping requires that all control-relevant information respect the intended interface.
Second, abstraction fidelity is non-negotiable. If the abstraction omits variables that affect enabled actions or tool arguments, the resulting model may be unsound for the property being checked. A finite model that forgets the thing that causes the failure is not elegant. It is wrong.
Third, tool and environment behaviour matters. The paper assumes finite or abstracted alphabets and treats tools and environments as nondeterministic or computable processes. It does not bound the computational power of external tools. If a controller calls an external system that performs open-ended computation, the controller classification alone is not the whole assurance story.
Fourth, multi-agent semantics require care. Scheduling, interleavings, atomic read/write assumptions, and shared stores all influence the induced trace model. A neat architecture diagram is not an operational semantics.
Finally, decidable does not mean small. Finite-state models can explode. Pushdown models can still be hard. Practical use requires abstraction, compositional reasoning, symbolic techniques, and engineering restraint. Yes, restraint. A rare enterprise feature.
The memory of thought is still engineering
The paper’s quiet provocation is that “agentic” behaviour may be less mysterious when viewed through memory. The agent’s apparent intelligence expands when it can preserve, revisit, and transform information across time. But the assurance problem expands with it.
Bounded memory gives regular behaviour and finite probabilistic models. Stack-scoped memory gives hierarchical structure and pushdown models. Unbounded read/write memory gives maximum flexibility and the old undecidability demons, now wearing SaaS pricing.
The lesson is not to avoid powerful agents. It is to stop granting power accidentally.
If the task is a bounded workflow, build a bounded controller. If the task is hierarchical, enforce stack discipline. If the task genuinely requires open-ended persistence, admit that it is in the harder class and design the assurance case around bounds, monitoring, abstraction, and interface mediation.
In other words: memory is not just where the agent stores its thoughts. It is where the business stores its risk.
Cognaptus: Automate the Present, Incubate the Future.
-
Roham Koohestani, Ziyou Li, Anton Podkopaev, and Maliheh Izadi, “Are Agents Probabilistic Automata? A Trace-Based, Memory-Constrained Theory of Agentic AI,” arXiv:2510.23487, 2025. https://arxiv.org/abs/2510.23487 ↩︎