When Memory Lies and Rules Save It: Rethinking LLM Agents in Closed Worlds

Memory is usually sold as the adult upgrade for LLM agents.

Give the agent a past. Give it a vector database. Give it episodes, reflections, mistakes, summaries, and a long enough context window to remember every tiny embarrassment. Surely it will become more reliable.

The RPMS paper is useful because it interrupts that comforting story with a less fashionable point: memory can make an agent worse when the world has hard action rules.¹

That is not a small caveat. It is the difference between a chatbot that says something wrong and an automation agent that repeatedly does the wrong thing while sounding perfectly reasonable. In closed-world environments—household simulators, science-task simulators, warehouse workflows, ticketing systems, robotic cells, ERP procedures—the problem is rarely that the model has no idea what should happen next. The problem is that the next action must be executable under the current state.

A human can say, “put the apple in the fridge.” An agent must know whether it is already at the fridge, whether it is holding the apple, whether the fridge is open, whether the environment expects the exact command “put apple in/on fridge 1”, and whether some previous failed action quietly invalidated its belief about the world.

That is where memory starts lying. Not because the stored episode is false, but because it may be true for the wrong state.

Closed worlds punish plausible actions

The paper’s core contribution is not merely that RPMS improves scores on ALFWorld and ScienceWorld. Scores matter, of course; benchmarks are how machine learning papers pay rent. But the more interesting contribution is diagnostic.

The authors identify a coupled failure cycle in embodied LLM planning:

Failure mode	What it means	Why it compounds
P1: invalid action generation	The agent proposes an action that violates environment preconditions.	The action may look linguistically correct while being impossible in the simulator.
P2: state drift	The agent’s internal belief diverges from the actual environment state.	Later decisions are made from a corrupted view of what is held, open, visible, or completed.

The two failures feed each other. If the agent’s state is wrong, it generates invalid actions. Invalid actions often produce sparse feedback such as “Nothing happens.” Sparse feedback then gives the agent little information to repair its state. The next action is built on a worse belief. The loop is not dramatic. It is worse: it is mechanically boring and therefore easy to miss.

This matters because many agent designs still treat planning as a reasoning problem. They ask the model to think harder, reflect after failure, retrieve similar experiences, or use a bigger model. RPMS asks a more operational question: before the agent acts, does it know whether the action is valid here?

That “here” is doing the work.

A memory from a successful trajectory may say that heating an object worked after going to the microwave. Fine. But if the current agent is not holding the object, or is in the wrong location, or has confused the destination with the transformation tool, the remembered sequence becomes dangerous advice. In a closed world, experience is not portable unless its preconditions travel with it.

RPMS is not memory plus rules; it is conflict management

RPMS stands for Rule-Augmented Memory Synergy. The name sounds like another hybrid architecture, but the paper is sharper than the name. RPMS does not simply dump rules and retrieved examples into the prompt and hope the model politely reconciles them. It builds a decision pipeline around three questions:

What does the agent currently believe about the state?
Which rules determine whether actions are executable?
Which memories are compatible with the current state, and what happens when memory conflicts with rules?

That sequence is the mechanism.

The architecture has five main steps:

Observation + goal
      ↓
Lightweight belief-state update
      ↓
Rule retrieval + episodic memory retrieval
      ↓
State-consistent filtering of memory
      ↓
Rules-first arbitration
      ↓
LLM action selection

The important design choice is restraint. RPMS does not try to reconstruct a complete symbolic world model. For ALFWorld, it tracks only the variables needed for action validity and memory screening: current location, hand state, open or closed containers, and last observed object locations. The update is deterministic, based on observation text patterns rather than another LLM call. In ScienceWorld, the same principle becomes room, inventory, visible objects, state flags, and a sub-goal progress pointer.

This is not glamorous. Good. Glamour is often where agent reliability goes to die.

The belief state is intentionally small because it serves a narrow function: make preconditions checkable. If the environment requires the agent to hold an object before heating it, the agent must track whether it is holding the object. If closed containers must be opened before access, the agent must track container state. Anything else is optional until proven necessary.
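To make the idea concrete, here is a minimal sketch of a deterministic, pattern-based belief update. It is not the paper's implementation: the `BeliefState` fields follow the variables listed above, but the class name, the function name, and the observation patterns are all assumptions chosen for illustration, and real simulator text may be worded differently.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class BeliefState:
    """Only the variables needed to check preconditions and screen memories."""
    location: Optional[str] = None
    holding: Optional[str] = None
    open_containers: Set[str] = field(default_factory=set)
    last_seen: Dict[str, str] = field(default_factory=dict)  # object -> receptacle


# Hypothetical observation patterns; real simulator text may differ.
ARRIVE = re.compile(r"You arrive at (?P<loc>[\w ]+?)\.")
TAKE = re.compile(r"You pick up the (?P<obj>[\w ]+?) from the (?P<src>[\w ]+?)\.")
PUT = re.compile(r"You put the (?P<obj>[\w ]+?) in/on the (?P<dst>[\w ]+?)\.")
OPEN = re.compile(r"You open the (?P<box>[\w ]+?)\.")
CLOSE = re.compile(r"You close the (?P<box>[\w ]+?)\.")


def update_belief(state: BeliefState, observation: str) -> BeliefState:
    """Apply one observation to the belief state using text patterns only."""
    obs = observation.strip()
    if m := ARRIVE.match(obs):
        state.location = m["loc"]
    elif m := TAKE.match(obs):
        state.holding = m["obj"]
        state.last_seen.pop(m["obj"], None)
    elif m := PUT.match(obs):
        if state.holding == m["obj"]:
            state.holding = None
        state.last_seen[m["obj"]] = m["dst"]
    elif m := OPEN.match(obs):
        state.open_containers.add(m["box"])
    elif m := CLOSE.match(obs):
        state.open_containers.discard(m["box"])
    # Anything else, including "Nothing happens.", leaves the belief unchanged.
    return state
```

The design choice worth noticing is the last comment: a failed action produces no state change, so the belief does not drift just because the agent assumed its own action succeeded.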

The rule manual turns “commonsense” into executable constraints

The paper’s rule system is hierarchical:

Rule tier	What it contains	Operational role
Universal rules	General search, failure recovery, and state-management principles.	Prevents repetitive wandering and basic state mistakes.
Domain rules	Task-type procedures, such as find → take → transform → place.	Gives the agent a canonical workflow for the goal type.
Environment rules	Action semantics and precondition-effect pairs specific to ALFWorld or ScienceWorld.	Encodes what the simulator actually accepts, not what ordinary language suggests.

The third tier is where many LLM agents quietly fail. In ALFWorld, heat and cool are atomic actions: the object remains in hand. A model might intuitively try to put an apple into a microwave or fridge before transforming it. That sounds reasonable in the real world. In the simulator, it is wrong.

RPMS injects rules such as: heating requires holding the object and being at the microwave; heating keeps the object in hand; do not manually put the object into the microwave. This is not “old symbolic AI” replacing LLMs. It is the boring layer that tells the LLM which actions are legal.
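The heating rules just quoted can be written as a small executable check. This is a hedged sketch, not the paper's rule format: the function name `check_action`, the action string syntax (`heat X with microwave N`, `put X in/on Y`), and the violation messages are assumptions made for the example.

```python
import re
from typing import List, Optional

# Assumed action syntax, for illustration only.
HEAT = re.compile(r"^heat (?P<obj>.+) with (?P<tool>microwave \d+)$")
PUT = re.compile(r"^put (?P<obj>.+) in/on (?P<dst>.+)$")


def check_action(action: str, location: Optional[str], holding: Optional[str]) -> List[str]:
    """Return the list of rule violations for a proposed action (empty if legal)."""
    problems: List[str] = []
    heat = HEAT.match(action)
    if heat:
        # Heating requires holding the object and being at the microwave.
        if holding != heat["obj"]:
            problems.append(f"must be holding {heat['obj']} before heating")
        if location != heat["tool"]:
            problems.append(f"must be at {heat['tool']} before heating")
    put = PUT.match(action)
    if put and put["dst"].startswith("microwave"):
        # Heating is atomic and keeps the object in hand.
        problems.append("do not put objects into the microwave; heat is atomic")
    return problems
```

An empty list means the proposed action passes these three rules; anything else is a reason to reject the action before it reaches the simulator.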

For business systems, this is the closest part of the paper to immediate product design. Most operational workflows already have hidden preconditions:

Business workflow	Hidden precondition example	RPMS-style lesson
Customer support automation	A refund cannot be issued before verifying order status and policy eligibility.	Retrieve policy and case history, but enforce eligibility rules before action.
Warehouse task planning	An item cannot be moved before location, inventory, and equipment availability are confirmed.	Track state explicitly; do not rely on plausible natural-language plans.
Finance operations	A payment cannot be approved before role, amount, vendor, and documentation checks pass.	Treat rules as executable constraints, not as background context.
Software agents	A deployment cannot proceed before tests, permissions, and environment status are validated.	Make state gates explicit before allowing tool calls.

The paper does not prove these business deployments. It proves the mechanism in simulated environments. The inference for business use is narrower but valuable: when actions have preconditions, rules are not a decorative safety layer. They are part of the agent’s reasoning substrate.

Memory becomes useful only after state screening

RPMS stores three types of ALFWorld memories: success snippets, failure lessons, and verified schemas. After the first learning round, the ALFWorld memory contains 233 entries: 64 success snippets, 157 failure lessons, and 12 verified schemas. ScienceWorld adds “critical failure” entries for actions that can immediately terminate the episode with a score penalty, such as focusing on the wrong object.

A weaker architecture would retrieve memories by goal similarity and inject them into the prompt. RPMS adds state-consistent filtering. Candidate memories must pass a compatibility check against the current belief state. In ALFWorld, the lightweight signature checks hand occupancy. If a memory assumes the agent is holding something and the current belief says the hand is empty, the memory is filtered out.

This may sound crude. It is crude in the right way. The authors are not claiming that hand occupancy captures full semantic equivalence. They are using a cheap compatibility filter for a high-frequency failure mode. The point is not to make retrieval philosophically pure. The point is to stop obviously incompatible experience from entering the decision context.

That is a useful correction to the way many teams discuss agent memory. The question is not only “Can we retrieve relevant past episodes?” It is “Relevant under which state?”

A memory can match the goal and still violate the situation.
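The hand-occupancy screen described in this section is simple enough to sketch in a few lines. The `MemoryEntry` structure and the `assumes_holding` field are hypothetical; the paper does not specify its data layout here. The sketch also treats the check symmetrically (a memory that assumes an empty hand is dropped when the hand is full), which is an assumption beyond the one direction the text describes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MemoryEntry:
    text: str
    # True: entry assumes an object is in hand; False: assumes an empty hand;
    # None: entry makes no assumption about the hand state.
    assumes_holding: Optional[bool]


def filter_by_hand_state(memories: List[MemoryEntry], hand_occupied: bool) -> List[MemoryEntry]:
    """Keep only memories whose hand-state assumption matches the current belief."""
    return [
        m for m in memories
        if m.assumes_holding is None or m.assumes_holding == hand_occupied
    ]


candidates = [
    MemoryEntry("Heat the mug at the microwave, then carry it to the shelf.", True),
    MemoryEntry("Search countertops before opening cabinets.", None),
    MemoryEntry("Pick up the mug from the sink basin first.", False),
]
usable = filter_by_hand_state(candidates, hand_occupied=False)
```

With an empty hand, the first entry is screened out even though it may match the goal perfectly.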

Rules-first arbitration is the part product teams should steal

After filtering, RPMS still assumes that surviving memory may conflict with rules. The arbitration module handles this in two layers:

Arbitration layer	What it does	Why it matters
Hard filter	Removes memory entries whose suggested actions directly violate active rules.	Prevents recalled experience from overriding executability.
Soft annotation	Flags possible but non-direct conflicts for the LLM to judge.	Preserves useful advice without pretending every ambiguity can be resolved mechanically.

This is the paper’s most transferable design principle. Memory is treated as evidence, not authority. Rules get priority when there is a direct conflict.

That sounds obvious until one looks at many agent prototypes, where retrieved documents, remembered conversations, tool traces, and policy snippets are all thrown into one context window and the LLM is trusted to sort it out. This is not governance. This is a pile.

RPMS imposes a hierarchy: hard constraints first, experience second, LLM judgment last where ambiguity remains. That hierarchy is why the paper deserves attention beyond embodied planning.

In a business process, the equivalent hierarchy would be:

Regulatory or policy constraint
      > workflow state
      > retrieved precedent or case history
      > model-generated recommendation

The point is not to make the agent less intelligent. It is to stop intelligence from being applied to an invalid action space.
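As a rough illustration of the two arbitration layers in this section, the sketch below drops memories whose suggested actions directly violate a rule and attaches warnings to memories that merely touch a sensitive topic. How RPMS actually detects conflicts is not described in this article, so the prefix matching, the keyword matching, and all names (`Memory`, `arbitrate`, `forbidden_prefixes`, `caution_terms`) are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Memory:
    advice: str
    suggested_actions: List[str]
    warnings: List[str] = field(default_factory=list)


def arbitrate(memories: List[Memory], forbidden_prefixes: List[str],
              caution_terms: List[str]) -> List[Memory]:
    """Hard-filter direct rule violations, then soft-annotate possible conflicts."""
    kept: List[Memory] = []
    for mem in memories:
        # Hard filter: any suggested action that matches a forbidden pattern
        # removes the whole entry (entry-level granularity).
        if any(a.startswith(p) for a in mem.suggested_actions for p in forbidden_prefixes):
            continue
        # Soft annotation: keep the entry but flag it for the LLM to judge.
        flags = [f"possible rule conflict: mentions '{t}'"
                 for t in caution_terms if t in mem.advice]
        kept.append(Memory(mem.advice, list(mem.suggested_actions), mem.warnings + flags))
    return kept
```

Note that the hard filter works on whole entries, so one bad suggested action discards any useful advice stored alongside it.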

The main ALFWorld result: rules dominate the gain

The headline ALFWorld result uses 134 unseen tasks under a single-trial protocol. With Llama 3.1 8B, ReAct reaches 35.8% success. RPMS reaches 59.7%, a +23.9 percentage-point improvement. Scaling the backbone preserves the direction: Llama 3.1 70B improves from 72.4% to 88.1%, and Claude Sonnet 4.5 improves from 86.6% to 98.5%.

Those numbers are the main evidence that the architecture improves execution under matched protocol conditions. But the ablation is more informative than the headline.

Condition	Rules	Memory	ALFWorld success rate
Baseline	Off	Off	35.8%
Memory-only	Off	On	41.0%
Rules-only	On	Off	50.7%
Full RPMS	On	On	59.7%

The first lesson is not subtle: rules contribute more than memory. Rules-only adds +14.9 percentage points over baseline. Memory-only adds +5.2 percentage points. Full RPMS adds +23.9 percentage points.

The second lesson is more interesting. The combined improvement is larger than the simple sum of the two individual gains: +23.9 points versus +20.1 points. That suggests positive interaction, but it should not be oversold as magic synergy. The practical reading is cleaner: rules make memory safer and more usable; memory adds experience once rules constrain what experience is allowed to recommend.

That is the mechanism-first reading. The paper is not saying “memory plus rules is good.” It is saying memory becomes much less stupid when rules decide what it is allowed to do. A modest distinction, but in agent engineering, modest distinctions often save the system.
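The arithmetic behind the ablation reading can be reproduced directly from the four success rates in the table. The variable names are illustrative; the numbers are the ones reported above.

```python
# ALFWorld success rates (%) from the factorial ablation table.
success = {"baseline": 35.8, "memory_only": 41.0, "rules_only": 50.7, "full": 59.7}


def gain(condition: str) -> float:
    """Percentage-point gain of a condition over the baseline."""
    return round(success[condition] - success["baseline"], 1)


rules_gain = gain("rules_only")
memory_gain = gain("memory_only")
full_gain = gain("full")

# If the two factors were purely additive, the full system would gain this much.
additive_gain = round(rules_gain + memory_gain, 1)

# Whatever is left over is the interaction between rules and memory.
interaction = round(full_gain - additive_gain, 1)

print(f"rules: +{rules_gain}, memory: +{memory_gain}, full: +{full_gain}")
print(f"additive expectation: +{additive_gain}, interaction: +{interaction}")
```

The interaction term is positive but small relative to the rules-only gain, which is why the cautious reading is "rules make memory usable" rather than "synergy".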

Per-task results expose the misconception

The misconception this article targets is that adding memory automatically improves an agent. The paper contradicts it directly.

The per-task ALFWorld table shows why.

Task type	Baseline	Memory-only	Rules-only	Full RPMS	Interpretation
Look	55.6%	11.1%	33.3%	66.7%	Unfiltered memory severely hurts a simple task type; arbitration recovers performance.
Place	33.3%	70.8%	75.0%	79.2%	Experience and rules both help procedural placement.
Clean	29.0%	38.7%	41.9%	54.8%	Full system provides the strongest gain.
Cool	42.9%	52.4%	61.9%	47.6%	A real exception: Full RPMS underperforms Rules-only.
Heat	30.4%	26.1%	52.2%	65.2%	Rules are especially useful for strict transformation semantics.
Two-object	29.4%	41.2%	35.3%	41.2%	Memory helps, but full system does not exceed memory-only here.

The Look result is the cleanest warning sign. With memory alone, Look success falls from the 55.6% baseline to 11.1%. That is not a rounding error wearing a hat. It means retrieved experience can distract the agent from simple state-grounded action. Meanwhile, Full RPMS recovers Look performance to 66.7%, supporting the argument that memory must be filtered and arbitrated.

The Cool result is also important because it prevents the article from becoming a victory lap. Rules-only reaches 61.9%, while Full RPMS drops to 47.6%. The authors attribute this to conflict between stored failure lessons and environment-specific action constraints. In business language: a case history can conflict with policy logic, and the conflict resolver may still be too coarse.

That exception makes the paper more credible, not less. It shows that the architecture is not a universal “add memory safely” button. It is a controlled design whose components still need better granularity.

The arbitration test shows why naive combination fails

The arbitration comparison isolates what happens when rules and memory are both available but conflict handling changes.

Arbitration mode	ALFWorld success rate	Likely purpose of the test
None	47.8%	Ablation of conflict management; tests whether simple combination works.
Soft-only	53.0%	Tests whether warnings without hard removal are enough.
Hard-only	55.2%	Tests whether removing direct rule violations helps.
Hard + Soft	59.7%	Tests the full arbitration mechanism.

The key comparison is not merely 59.7% versus 47.8%. It is 47.8% versus Rules-only at 50.7%. Naively combining rules and memory performs worse than using rules alone.

That is the paper’s most product-relevant result. More context can reduce reliability when the architecture lacks priority rules. This should make anyone building retrieval-heavy agents slightly uncomfortable, which is healthy. A little discomfort is cheaper than a failed deployment.

Hard filtering removes memories that directly violate rules. Soft annotation preserves potentially useful but suspicious entries with warnings. The full version performs best because it separates two kinds of conflict: contradictions that should be eliminated, and ambiguities that should be surfaced.

For business automation, that maps neatly onto operational design:

Conflict type	Example	Better handling
Direct violation	A retrieved precedent recommends approving a refund that current policy forbids.	Remove it from the action recommendation path.
Ambiguous tension	A past case suggests escalation, but current status is incomplete.	Flag it and require further verification.
No conflict	Past case matches current state and policy.	Use it as guidance.

This is not a fancy agent trick. It is basic decision hygiene, finally applied to LLM planning.

ScienceWorld supports transfer, not a model-controlled comparison

The paper adapts RPMS to ScienceWorld, a different environment with longer horizons, a richer command vocabulary, and continuous scores rather than binary success. The evaluation uses GPT-4 on 241 episodes across 26 tasks. The results follow the same direction:

Condition	Rules	Memory	ScienceWorld average score
Baseline	Off	Off	44.9
Memory-only	Off	On	46.0
Rules-only	On	Off	51.3
Full RPMS	On	On	54.0

This is transfer evidence for the architecture. It is not a clean model-controlled comparison with ALFWorld, because ALFWorld primarily uses Llama 3.1 8B while ScienceWorld uses GPT-4. The paper says this clearly, and the article should not pretend otherwise.

The pattern still matters. Rules again contribute the larger single-factor gain: +6.4 points versus +1.1 for memory-only. Full RPMS reaches 54.0, above both single-factor conditions. The ScienceWorld adaptation also adds critical-failure memories, because some actions can immediately produce a severe negative outcome. That is a useful implementation detail rather than a second thesis: when the environment has catastrophic local mistakes, memory should emphasize the mistakes that end the episode, not every ordinary inconvenience.

The cross-environment result supports the general mechanism: action validity and state-aware mediation travel better than ungrounded recall.

How to read the evidence without over-reading it

The paper contains several result types. They do not all support the same claim.

Evidence item	Likely purpose	What it supports	What it does not prove
Main ALFWorld comparison across Llama 8B, Llama 70B, and Claude Sonnet 4.5	Main evidence	RPMS improves single-trial success under matched prompting and decoding within each model.	It does not prove RPMS beats all prior systems under their protocols.
Factorial ablation on ALFWorld	Ablation	Rules dominate memory as a single factor; full system outperforms either alone.	It does not show memory is always helpful.
Per-task ALFWorld analysis	Diagnostic evidence	Memory-only can help some tasks and harm others.	It does not fully explain every task-specific exception.
Arbitration comparison	Ablation of conflict management	Naive rule-memory combination is weaker than rules-only; hard plus soft arbitration matters.	It does not prove the arbitration policy is optimal.
ScienceWorld adaptation	Transfer evidence	The same rule-dominant pattern appears in a structurally different environment.	It is not a controlled backbone comparison against ALFWorld.
Execution efficiency appendix	Implementation/effect analysis	RPMS reduces average steps from 37.3 to 30.9, mainly by avoiding wasted failed trajectories.	It does not imply successful tasks become much shorter.
Failure-mode appendix	Diagnostic analysis	Rules shift failures from timeouts toward earlier wrong completions, suggesting more decisive behavior.	It does not mean early termination is always desirable.
Learning curve appendix	Sensitivity/robustness test	Memory gains saturate; rules amplify memory effectiveness; disabling state-consistent filtering hurts.	It does not solve bootstrapping memory without prior trajectories.
Statistical significance appendix	Statistical support	Rules-only and soft-only RPMS show significant paired gains over baseline; rule-grounded system significantly beats memory-only.	The full hard+soft variant’s exact paired test is not reported in the same way.

This distinction matters because papers often tempt readers to flatten every table into “the method works.” A better reading is: the main result shows improvement; the ablations identify the mechanism; the appendix explains failure behavior and robustness; the ScienceWorld test supports transfer but not universal generality.

The business value is not “better memory”; it is cheaper invalid-action prevention

For business automation, the paper’s value is not that companies should copy RPMS line by line. ALFWorld and ScienceWorld are simulators. Real workflows have messy APIs, ambiguous policy language, changing rules, human exceptions, and delightful edge cases invented by operations teams at 4:55 p.m. on a Friday.

The more useful lesson is architectural:

Do not ask an LLM agent to choose the best action
until the system has defined which actions are valid.

This changes the design priority for enterprise agents.

Instead of starting with a large memory system, teams should first define:

The action vocabulary: what the agent can actually do.
The preconditions for each action: what must be true before the action is allowed.
The belief state: what the system must track to evaluate those preconditions.
The retrieval filter: which past cases are compatible with the current state.
The arbitration hierarchy: which source wins when rules, memory, and model preference conflict.

Only after that does a vector database become useful. Otherwise, it is just a very fast way to retrieve plausible distractions.
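A minimal sketch of the first three items on that list, using the refund example from earlier in the article: an explicit action vocabulary, named preconditions per action, and a tracked state that gates which actions the model is allowed to choose from. Every name here (`ACTIONS`, `unmet_preconditions`, `valid_actions`, the state keys) is hypothetical and not taken from the paper or any product.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
Precondition = Tuple[str, Callable[[State], bool]]

# Action vocabulary with named preconditions (all names are illustrative).
ACTIONS: Dict[str, List[Precondition]] = {
    "issue_refund": [
        ("order status verified", lambda s: s.get("order_verified", False)),
        ("policy eligibility confirmed", lambda s: s.get("refund_eligible", False)),
    ],
    "request_order_details": [],
    "escalate_to_human": [],
}


def unmet_preconditions(action: str, state: State) -> List[str]:
    """List the preconditions that block an action in the current state."""
    return [label for label, holds in ACTIONS[action] if not holds(state)]


def valid_actions(state: State) -> List[str]:
    """Actions the agent may choose from; the LLM only ranks within this set."""
    return [name for name in ACTIONS if not unmet_preconditions(name, state)]


state = {"order_verified": True}
allowed = valid_actions(state)
blocked_by = unmet_preconditions("issue_refund", state)
```

Here the refund is withheld from the model's options until eligibility is confirmed, and `blocked_by` gives a reason that can be logged or shown to a human reviewer.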

The ROI pathway is also more modest than “higher benchmark score.” In operational systems, invalid actions are expensive because they trigger rework, human review, workflow dead ends, customer frustration, compliance risk, or silent non-completion. RPMS-style design reduces waste by catching invalidity before action selection, not by making the model sound more competent after failure.

That is a less glamorous value proposition. It is also easier to explain to a CFO.

Where the result applies, and where it does not

The paper is strongest for closed-world or semi-closed-world environments. These are settings where actions are discrete, preconditions can be listed, and state can be tracked well enough to validate action feasibility.

Good fits include:

Good fit	Why RPMS-style design helps
Robotic or warehouse task planning	Physical actions have location, object, tool, and safety preconditions.
Customer-service workflow execution	Policies and case status determine which actions are valid.
Internal IT automation	Tool calls require permissions, environment state, and dependency checks.
Finance and procurement operations	Approval actions depend on document completeness, authority, thresholds, and audit trails.
Software engineering agents	Code modification and deployment steps have test, branch, permission, and environment gates.

Weak fits are equally important. RPMS is less directly applicable to open-ended research, creative writing, market interpretation, or advisory tasks where the action space is not fixed and “validity” is a matter of judgment rather than precondition satisfaction. One can still borrow the governance idea, but the rule manual becomes harder to specify and easier to abuse.

The paper also leaves practical challenges:

The rule manual is partly human-authored, so incorrect rules can systematically mislead the agent.
The memory module requires prior interaction trajectories; it does not bootstrap useful experience from nothing.
The state-consistent filter is intentionally lightweight and may miss finer incompatibilities.
The arbitration works at the memory-entry level, which can be too coarse when one retrieved entry contains both useful and conflicting information.
Real-world systems need rule auditing, memory review, sandbox validation, and human escalation paths before deployment.

These limitations do not weaken the core insight. They define the engineering agenda.

The useful agent is not the one that remembers more

RPMS is a good paper because it points to an unromantic truth about agents: reliability often comes from saying no.

No, do not use that memory; the current state does not match.

No, do not follow that remembered sequence; it violates an action rule.

No, do not let the LLM decide freely; the environment has preconditions.

The benchmark gains are useful. The architectural lesson is more useful. Closed-world agents fail when plausible language drifts away from executable action. Memory does not solve that drift by itself. Sometimes memory accelerates it.

The replacement principle is simple:

Memory should advise.
Rules should constrain.
State should decide whether the advice applies.

That principle is not only for embodied simulators. It is a design pattern for business automation agents that must operate inside policies, tools, forms, APIs, and workflows. In those worlds, “Nothing happens” is not a cute simulator response. It is a ticket stuck in limbo, a refund mishandled, a deployment blocked, or a process that quietly fails while everyone assumes the agent handled it.

More memory will not fix that.

More discipline might.

Cognaptus: Automate the Present, Incubate the Future.

¹ Zhenhang Yuan, Shenghai Yuan, and Lihua Xie, “RPMS: Enhancing LLM-Based Embodied Planning through Rule-Augmented Memory Synergy,” arXiv:2603.17831, 2026. https://arxiv.org/abs/2603.17831
