Policy Gradients Grow Up: Teaching RL to Think in Domains

The problem is not that RL cannot plan. It is that it keeps learning the wrong object.

A warehouse robot can learn to pick up box A from shelf B and move it to station C. Very impressive, until tomorrow’s warehouse has different boxes, different shelves, and a new station name. The action label changed. The task structure did not.

That is the small, brutal problem behind a large part of reinforcement learning generalization. If a policy learns actions as if they were stable labels, it learns something that disappears when the instance changes. In classical planning, the objects change from problem to problem. The number of blocks, packages, keys, rooms, passengers, or locations changes. The names of ground actions change with them. A policy that says “apply this action” is therefore not a general policy. It is a local habit wearing a lab coat.

The paper “Learning General Policies with Policy Gradient Methods” by Simon Ståhlberg, Blai Bonet, and Hector Geffner attacks this problem directly.¹ Its main move is deceptively simple: use standard actor-critic policy-gradient machinery, but make the policy score state transitions rather than instance-specific ground actions. The policy does not primarily ask, “Which named action should I take?” It asks, “Which successor state is a good next state?”

That sounds like a technical detail. It is not. It is the paper’s central mechanism.

The usual temptation, when reinforcement learning fails to generalize, is to change the reinforcement learning algorithm. Try PPO. Try SAC. Tune harder. Add experience replay. Perform the sacred ritual of baseline multiplication. The paper’s more useful message is sharper: in these planning domains, the failures are often not “RL failed” failures. They are representation failures, logical expressiveness failures, or objective mismatch failures.

That distinction matters because it changes what one should fix.

Generalization starts by refusing to learn instance-specific action names

Generalized planning is about learning a policy that works across many instances of the same domain. Not one Blocks problem, but Blocks problems with different numbers of blocks and different goal configurations. Not one Logistics instance, but many arrangements of packages, trucks, cities, airplanes, and destinations.

A classical planning domain provides a compact symbolic description: predicates, objects, initial facts, goal facts, and action schemas. A specific instance grounds these into actual states and actions. The trouble is that ground actions are not stable across instances. With five packages, there are different possible actions than with fifty. With a different set of cities, the action names and arguments change again.

The authors adopt an old but powerful idea from generalized planning: represent general policies as classifiers over transitions $(s, s')$. A good transition is one that moves the system toward the goal without falling into dead ends or cycles. If every non-goal state has at least one good transition, and following good transitions cannot loop forever, the policy solves the class of problems.

This is where the actor-critic adaptation enters.

The policy is still a stochastic policy. The critic is still a value estimator. The updates still look like actor-critic updates. But the “action” seen by the policy is effectively a successor state. In a deterministic planning model, each applicable action leads to a successor state, so the policy can assign probabilities over possible successor states instead of over action names.

That shift makes the learning target more general. A transition from “package not delivered” to “package loaded into the right transport chain” can be recognized across different objects and instance sizes. A literal ground action cannot.
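To make the shift concrete, here is a minimal Python sketch of a policy that assigns probabilities to successor states rather than to action names. Everything named here is an assumption for illustration: `successor_policy`, `toy_score`, and the atom names are hypothetical, and the hand-written scorer stands in for the learned neural readout the paper uses.

```python
import math
from typing import Callable, FrozenSet, List, Tuple

Atom = Tuple[str, ...]          # e.g. ("at", "pkg", "shelf")
State = FrozenSet[Atom]


def successor_policy(
    state: State,
    successors: List[State],
    score: Callable[[State, State], float],
) -> List[float]:
    """Return pi(s' | s) as a softmax over the scores of the successor states."""
    logits = [score(state, nxt) for nxt in successors]
    top = max(logits)                       # subtract the max for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def toy_score(state: State, nxt: State) -> float:
    """Hypothetical scorer: number of goal atoms the transition newly makes true."""
    goal = {("at", "pkg", "depot")}
    return float(len((nxt & goal) - (state & goal)))


s = frozenset({("at", "pkg", "shelf")})
options = [
    frozenset({("at", "pkg", "depot")}),    # reaches the goal atom
    frozenset({("at", "pkg", "floor")}),    # does not
]
probs = successor_policy(s, options, toy_score)
print(probs)
```

Nothing in `successor_policy` mentions an action name or a fixed number of choices. The list of successors can be any length, which is the property the paragraph above relies on.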

The paper implements this using graph neural networks adapted to relational structures. Planning states are sets of atoms: things like clear(block1), at(package2, cityA), or their goal-augmented versions. The GNN computes embeddings for objects by passing messages through the relational structure induced by predicates and atoms. These object embeddings feed two readouts: one for the value function $V(s)$ and one for the transition-scoring policy $\pi(s' \mid s)$.

The result is a hybrid that neither symbolic-AI nostalgia nor deep-RL boosterism can easily own. The system uses a symbolic planning language to define the state space and successor states. It uses neural function approximation to learn value and policy functions. The grown-up part is that it does not pretend these two pieces are interchangeable.
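The message-passing idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the paper's architecture: embeddings are single floats, the per-predicate `weights` stand in for learned networks, and the max aggregation and sum readout are choices made for this sketch only.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, ...]   # predicate name followed by object names


def relational_embeddings(
    objects: List[str],
    atoms: List[Atom],
    weights: Dict[str, List[float]],
    rounds: int = 2,
) -> Dict[str, float]:
    """Toy relational message passing with one scalar embedding per object.

    weights[pred][i] scales the message sent to the i-th argument of `pred`.
    """
    emb = {obj: 1.0 for obj in objects}
    for _ in range(rounds):
        inbox: Dict[str, List[float]] = {obj: [] for obj in objects}
        for pred, *args in atoms:
            context = sum(emb[a] for a in args)      # combine the atom's arguments
            for pos, arg in enumerate(args):
                inbox[arg].append(weights[pred][pos] * context)
        # Aggregate incoming messages with max (a permutation-invariant choice).
        emb = {
            obj: emb[obj] + (max(inbox[obj]) if inbox[obj] else 0.0)
            for obj in objects
        }
    return emb


def value_readout(emb: Dict[str, float]) -> float:
    """Sum over objects, so the readout is defined for any number of objects."""
    return sum(emb.values())


weights = {"on": [0.5, -0.5], "clear": [0.1]}   # hypothetical, not learned
small = relational_embeddings(
    ["a", "b"], [("on", "a", "b"), ("clear", "a")], weights
)
large = relational_embeddings(
    ["a", "b", "c"], [("on", "a", "b"), ("on", "b", "c"), ("clear", "a")], weights
)
print(value_readout(small), value_readout(large))
```

The same `weights` process a two-block and a three-block state, and renaming the objects leaves the readout unchanged. The parameters attach to predicates, never to object names.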

The mechanism is a three-part bargain

The paper’s approach works because three choices reinforce one another.

| Design choice | Technical role | Why it helps generalization | What it still assumes |
| --- | --- | --- | --- |
| Score successor states instead of ground actions | Policy is defined over $(s, s')$ transitions | The policy target survives changes in object names and instance size | The system must know or generate possible successor states |
| Encode states as relational structures with a GNN | Object embeddings are computed from predicates and atoms | Same network can process different numbers of objects | GNN expressiveness is limited by what its message passing can represent |
| Train with actor-critic updates | Standard policy-gradient machinery learns the transition policy and value function | Avoids combinatorial feature-pool search used in some logical methods | Training still depends on generated reachable states and a known lifted model |

This table is also the business translation, before anyone gets too excited and starts naming a product “GeneralAgentGPT Enterprise Ultimate.” The paper is not saying that raw RL agents can wander through a company’s ERP system and spontaneously learn reusable operations. The paper is saying that when a task is represented in a stable relational language, policy-gradient methods can learn policies that transfer across many instances of that task.

That is narrower. It is also more useful.

In enterprise automation, the analogous move is to avoid training agents on brittle screen coordinates, button names, or one-off workflow scripts. Instead, represent the workflow as a state transition system: invoice received, vendor matched, exception flagged, approval pending, payment scheduled. The labels of individual documents change. The workflow relations do not, or at least they change less often. The agent’s job should be to prefer transitions that move the case toward completion without unsafe loops.

This is a Cognaptus inference, not a direct result of the paper. The paper tests symbolic planning benchmarks, not messy enterprise software. But the mechanism maps cleanly: stable state language first, learning algorithm second. Reverse the order and you get an expensive demo.

The main evidence is coverage, not a leaderboard beauty contest

The experiments cover ten planning domains: Blocks, Blocks-multiple, Delivery, Grid, Gripper, Logistics, Miconic, Reward, Spanner, and Visitall. Training uses small instances; testing uses larger or different instances from the same domain. The authors evaluate learned policies in both stochastic and deterministic modes. Coverage is the share of test instances in which the policy reaches the goal within the step limit. Plan quality is the ratio of learned plan length to optimal plan length, computed over the instances for which Fast Downward can supply an optimal reference plan.

The main evidence is Table 1. Its purpose is not merely to say “our method beats some baseline.” The more interesting purpose is diagnostic: show where the mechanism generalizes out of the box, then use the failures to identify which part of the mechanism is under strain.

The headline result is strong but uneven. Standard Actor-Critic reaches deterministic coverage of 87% overall across 220 test instances, while All-Actions Actor-Critic reaches 88%. In six of the ten domains, the learned policy solves all or nearly all test instances in at least one mode. Blocks, Blocks-multiple, Miconic, and Visitall hit perfect or near-perfect coverage. Delivery and Gripper also perform strongly, with All-Actions Actor-Critic solving all deterministic test cases in both.

That is the “yes” part of the paper: policy-gradient methods can learn nearly perfect general policies in several classical planning domains when the policy is represented correctly.

But the more informative part is the “not always.” Grid, Logistics, Reward, and Spanner do not reach near-perfect coverage in the main setup. Logistics is especially revealing. Standard Actor-Critic in deterministic mode solves 17 of 22 Logistics instances, but with terrible plan quality: $77.4 = 23839 / 308$, meaning plans roughly 77 times longer than optimal over the instances for which the optimal planner produced a reference plan. All-Actions Actor-Critic in deterministic mode solves only 8 of 22 Logistics instances, also with poor plan quality. That is not a small tuning blemish. That is the system telling us something structural.

The paper listens.
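The two metrics are simple enough to write down. The sketch below assumes that plan quality is a ratio of summed lengths, which is what the reported form $77.4 = 23839 / 308$ suggests. The function names and the per-instance records are hypothetical; only the Logistics figures come from the article.

```python
from typing import List, Optional, Tuple

# (policy plan length, or None if unsolved; optimal length, or None if no reference)
Result = Tuple[Optional[int], Optional[int]]


def coverage(results: List[Result]) -> float:
    """Share of instances in which the policy reached the goal."""
    return sum(1 for policy_len, _ in results if policy_len is not None) / len(results)


def plan_quality(results: List[Result]) -> float:
    """Summed policy plan length over summed optimal length, on instances with both."""
    pairs = [(p, o) for p, o in results if p is not None and o is not None]
    return sum(p for p, _ in pairs) / sum(o for _, o in pairs)


# Hypothetical per-instance records, for illustration only.
results = [(12, 10), (30, 15), (None, 20), (8, None)]
print(coverage(results))        # 0.75
print(plan_quality(results))    # 1.68

# The Logistics figures reported above for Standard Actor-Critic, deterministic mode.
print(f"{17 / 22:.0%}")         # 77%
print(f"{23839 / 308:.1f}")     # 77.4
```

Coverage near 77% paired with a quality ratio near 77 is the pattern worth noticing. The policy usually arrives, but by a route no one would call a plan.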

The failures are not random; they point to the representation

The authors identify two major causes behind the difficult domains.

The first is GNN expressiveness. The relational GNN can represent certain logical features, roughly those expressible in the two-variable fragment of first-order logic with counting, often denoted $C^2$. Some planning features need a richer logic, such as the three-variable fragment $C^3$. In Logistics, the policy needs to know whether a vehicle carrying a package is in the city where the package must ultimately be delivered. That relation involves the package, the vehicle, and the city. A GNN that cannot express the necessary three-way relation is not going to learn the right general rule by sheer optimism. Neural networks are powerful, yes. They are not exempt from representational geometry. Annoying, but useful.

The second cause is the optimality/generalization tradeoff. Reinforcement learning usually optimizes expected cost, which in these deterministic planning settings often means shorter plans. But some domains admit compact general policies without admitting compact optimal general policies. The paper points to Logistics and Grid as examples where optimal planning is NP-hard, while non-optimal general policies can still exist. If the learning objective keeps pushing toward optimality, it may push away from the kind of policy that generalizes.

This is the paper’s strongest conceptual correction. The business equivalent is familiar: the shortest workflow is not always the most reusable workflow. A human team often standardizes on a slightly longer but robust procedure because it works across more cases and is easier to audit. A learning system optimizing local efficiency can miss the general operating policy. Congratulations, the machine has rediscovered bad process design, but faster.

The Logistics repair is the paper’s cleanest lesson

The most important experimental extension is the Logistics repair. This is not a separate thesis; it is an ablation-like diagnostic intervention. The authors ask: if the failure is caused by a missing representational feature and a bad objective, can we fix those while keeping the basic actor-critic method?

First, they add derived predicates indicating whether a package is in the correct city, regardless of whether it is on a plane, a truck, or at an incorrect location within that city. This gives the GNN access to a feature it could not easily infer. With this added predicate, Logistics reaches 91% deterministic coverage, with plan quality $4.53 = 1612 / 356$ over 18 reference-solvable cases.

That is already a major improvement, but not enough. The plans are still much longer than optimal.
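A derived predicate of this kind is cheap to compute outside the network. The sketch below is a guess at its shape, not the authors' definition: the predicate names `at`, `in`, `in-city`, `goal-at`, and `in-goal-city` are hypothetical, and the state is a plain set of tuples.

```python
from typing import Set, Tuple

Atom = Tuple[str, ...]


def add_in_goal_city(atoms: Set[Atom]) -> Set[Atom]:
    """Add ("in-goal-city", pkg) for each package currently in its goal city.

    Assumed atom shapes: ("at", obj, loc), ("in", pkg, vehicle),
    ("in-city", loc, city), and ("goal-at", pkg, loc) for goal atoms.
    """
    city_of = {a[1]: a[2] for a in atoms if a[0] == "in-city"}
    location = {a[1]: a[2] for a in atoms if a[0] == "at"}
    carrier = {a[1]: a[2] for a in atoms if a[0] == "in"}

    derived = set(atoms)
    for a in atoms:
        if a[0] != "goal-at":
            continue
        pkg, goal_loc = a[1], a[2]
        # A loaded package is wherever its truck or plane is.
        holder = carrier.get(pkg, pkg)
        current_city = city_of.get(location.get(holder))
        if current_city is not None and current_city == city_of.get(goal_loc):
            derived.add(("in-goal-city", pkg))
    return derived


state = {
    ("in-city", "depot1", "paris"),
    ("in-city", "airport1", "paris"),
    ("in-city", "depot2", "rome"),
    ("at", "truck1", "airport1"),
    ("in", "pkg1", "truck1"),        # loaded, wrong location, right city
    ("goal-at", "pkg1", "depot1"),
    ("at", "pkg2", "depot2"),        # wrong city
    ("goal-at", "pkg2", "depot1"),
}
augmented = add_in_goal_city(state)
print(("in-goal-city", "pkg1") in augmented, ("in-goal-city", "pkg2") in augmented)
```

The function chains package, vehicle, location, and city, which is the multi-object join the GNN struggled to express. Handing over the result as a single atom about the package puts it back within reach.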

Second, the authors change the training objective. Instead of optimizing expected cost, the modified version optimizes the probability of reaching the goal without entering a cycle. The paper notes that this involves an additional parametric function approximating that probability. With both the derived predicate and the new reachability-without-cycles objective, Logistics reaches 100% deterministic coverage with plan quality $1.11 = 410 / 368$ over 19 reference-solvable cases.

This is the part executives should read twice, preferably before asking whether “more RL” will solve their automation failures.

The algorithm did not need to become more fashionable. The state language needed one missing concept. The objective needed to stop worshipping shortest paths. Once those two pieces were corrected, the same basic learning family became far more effective.
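The reachability-without-cycles quantity can be computed exactly on a tiny graph, which helps show what the objective rewards. This is only an illustration: the paper approximates the probability with a parametric function, while the sketch below enumerates paths and would not scale. The function name and the example policy are hypothetical.

```python
from typing import Dict, FrozenSet

Policy = Dict[str, Dict[str, float]]   # state -> {successor: probability}


def prob_goal_without_cycle(
    policy: Policy,
    state: str,
    goals: FrozenSet[str],
    visited: FrozenSet[str] = frozenset(),
) -> float:
    """Probability that following `policy` from `state` reaches a goal
    before revisiting any state already on the current path."""
    if state in goals:
        return 1.0
    visited = visited | {state}
    total = 0.0
    for nxt, p in policy.get(state, {}).items():
        if nxt in visited:
            continue                  # stepping back onto the path counts as failure
        total += p * prob_goal_without_cycle(policy, nxt, goals, visited)
    return total                      # states with no successors (dead ends) give 0.0


# Hypothetical two-state example: each state either reaches the goal or bounces back.
policy = {
    "a": {"b": 0.5, "goal": 0.5},
    "b": {"a": 0.5, "goal": 0.5},
}
p = prob_goal_without_cycle(policy, "a", frozenset({"goal"}))
print(p)   # 0.75
```

A policy scores 1.0 here exactly when every path it can sample reaches the goal without repeating a state. Path length plays no part, which is the point of moving away from expected cost.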

| Intervention | Likely purpose | Result reported | What it supports | What it does not prove |
| --- | --- | --- | --- | --- |
| Main actor-critic experiments across ten domains | Main evidence | 87–88% deterministic coverage overall; near-perfect coverage in six domains | Policy-gradient methods can learn general policies when policies score transitions | Does not show universal RL generalization |
| Trajectory-length comparison in Gripper | Sensitivity / implementation test | Longer sampled trajectories hurt performance in this setting | For generalized planning with no privileged initial states, one-step transition sampling can be better | Does not prove long rollouts are generally bad in RL |
| Logistics derived predicate | Diagnostic repair / representation ablation | 91% coverage, plan quality 4.53 | Missing relational features explain part of the failure | Does not remove the objective mismatch |
| Logistics derived predicate plus reachability objective | Diagnostic repair / objective ablation | 100% coverage, plan quality 1.11 | Generality may require optimizing safe reachability, not shortest expected cost | Does not prove the same repair works in all domains |
| Spanner tabular critic test | Exploratory diagnostic | 100% coverage, plan quality 1.09 | Failure likely relates to insufficient sampling of dead-end states | Tabular evaluation does not scale beyond small instances |

The Spanner result deserves a quick note because it prevents a lazy interpretation. The authors observe that Spanner failures occur when test instances have more spanners or nuts than training instances. They test a tabular version of All-Actions Actor-Critic, storing transition probabilities and solving the linear Bellman equation for the policy value. That gives accurate values for dead-end states and reaches 100% coverage in both stochastic and deterministic modes, with plan quality $1.09 = 315 / 290$ over 10 reference-solvable cases.

This does not mean tabular critics are the answer. The authors explicitly note that tabular evaluation only works for small instances and only worked in one other domain, Visitall. Its value is diagnostic: the original actor-critic versions likely did not sample dead-end states enough. Again, the failure is interpretable.

That is rare enough in deep RL that one should pause and enjoy it.
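The tabular evaluation step, storing transition probabilities and solving a linear Bellman equation, can be sketched with exact arithmetic. Two assumptions are mine rather than the paper's: a discount factor `gamma` below 1 keeps the system solvable, and a dead end is modeled as a state that loops on itself forever. The function name is hypothetical.

```python
from fractions import Fraction
from typing import Dict, Set

Policy = Dict[str, Dict[str, Fraction]]   # state -> {successor: probability}


def evaluate_policy(policy: Policy, goals: Set[str], gamma: Fraction) -> Dict[str, Fraction]:
    """Solve V = c + gamma * P V exactly, with unit step cost and V(goal) = 0."""
    states = sorted(set(policy) | {n for succ in policy.values() for n in succ} | goals)
    index = {s: i for i, s in enumerate(states)}
    n = len(states)

    # Augmented matrix [I - gamma * P | c].
    rows = [[Fraction(0)] * (n + 1) for _ in range(n)]
    for s in states:
        i = index[s]
        rows[i][i] = Fraction(1)
        if s in goals:
            continue                                    # goal row stays V = 0
        rows[i][n] = Fraction(1)                        # unit cost per step
        successors = policy.get(s) or {s: Fraction(1)}  # assumed: dead end self-loops
        for nxt, p in successors.items():
            rows[i][index[nxt]] -= gamma * p

    # Gauss-Jordan elimination.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        pivot_value = rows[col][col]
        rows[col] = [x / pivot_value for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col]
                rows[r] = [x - factor * y for x, y in zip(rows[r], rows[col])]
    return {s: rows[index[s]][n] for s in states}


half = Fraction(1, 2)
values = evaluate_policy({"s": {"goal": half, "dead": half}}, {"goal"}, Fraction(9, 10))
print(values)   # dead = 10, goal = 0, s = 11/2
```

The dead end receives its full cost immediately, with no need to visit it many times during training. That is the property the diagnostic exploits, and the cubic cost of the solve is why it stays a diagnostic.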

Why one-step transition sampling beats longer rollouts here

A smaller but revealing result concerns trajectory length. In ordinary RL practice, sampling longer trajectories is common. You start somewhere, roll out several steps, update along the way, repeat. In this paper’s generalized planning setup, the authors instead use sampled trajectories of length one: sample a state, sample or consider successor states, update, then move on to another sampled state.

They report that longer sampled trajectories were detrimental in their setting, including in vanilla and tuned PPO implementations they tried. Figure 1, using Gripper, supports this as a sensitivity-style result rather than a main claim about all reinforcement learning.

The reason is tied to the problem structure. Generalized planning does not privilege one initial state. The goal is to learn a policy that works across a broad class of reachable states and instances. Long rollouts can over-emphasize the distribution induced by the current policy from particular starts. One-step transition sampling keeps the learner focused on local transition quality across the broader sampled state space.
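A tabular sketch shows what length-one sampling looks like in code. The assumptions are loud ones: the update below takes the expectation over all successors of the sampled state, in the spirit of an all-actions update, and may differ from the paper's exact rule. The graph, learning rates, and function names are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]      # non-goal state -> successor states


def softmax(logits: List[float]) -> List[float]:
    top = max(logits)
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def one_step_update(
    s: str,
    graph: Graph,
    theta: Dict[str, Dict[str, float]],
    value: Dict[str, float],
    lr_actor: float = 0.5,
    lr_critic: float = 0.5,
) -> None:
    """Update actor and critic at state s using all of its successors."""
    successors = graph[s]
    probs = softmax([theta[s][n] for n in successors])
    # Unit step cost; states without a stored value (goals here) count as 0.
    q = [1.0 + value.get(n, 0.0) for n in successors]
    expected = sum(p * c for p, c in zip(probs, q))
    value[s] += lr_critic * (expected - value[s])          # critic moves toward target
    for n, p, c in zip(successors, probs, q):
        theta[s][n] -= lr_actor * p * (c - expected)       # descend expected cost


def train(graph: Graph, steps: int = 2000, seed: int = 0) -> Tuple[dict, dict]:
    rng = random.Random(seed)
    theta = {s: {n: 0.0 for n in succ} for s, succ in graph.items()}
    value = {s: 0.0 for s in graph}
    states = list(graph)
    for _ in range(steps):
        # Length-one trajectory: draw a fresh state each step instead of following
        # the current policy from a fixed initial state.
        one_step_update(rng.choice(states), graph, theta, value)
    return theta, value


graph = {"a": ["b", "goal"], "b": ["a", "goal"]}
theta, value = train(graph)
probs_a = softmax([theta["a"][n] for n in graph["a"]])
print(dict(zip(graph["a"], probs_a)))
```

Every state in the pool is visited at the same rate regardless of what the current policy would do, so no single start state dominates the updates.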

For business automation, the analogy is useful but should not be overextended. Training an agent only on full end-to-end workflow traces can overfit the most common paths. If the real goal is robust handling across many workflow states, it may be better to train and test local transition decisions: from exception state to resolution state, from incomplete form to validation request, from matched invoice to approval routing. The full journey still matters. But the reusable competence often sits in transition judgment.

The paper’s real contribution is not “RL beats symbolic planning”

The lazy headline would be: deep reinforcement learning now learns general planners. That is not quite what the paper says, and it would be less interesting if it did.

The paper’s more serious contribution is methodological. It shows that policy-gradient methods become much more intelligible when placed inside a planning language that tells us what states, goals, transitions, and domains are. The planning representation gives RL something it usually lacks: a precise language for talking about what should generalize.

This is why the authors can diagnose failures instead of merely reporting them. When Grid, Logistics, Reward, and Spanner underperform, the analysis does not collapse into “the neural net failed.” Grid and Logistics expose logical expressiveness limits. Reward exposes the limited reach of a fixed number of GNN layers: information can travel only as far as there are message-passing rounds. Logistics also exposes the cost of optimizing shortest expected paths when compact optimal general policies may not exist. Spanner exposes a sampling problem around dead-end states.

That is a better research pattern than the standard baseline derby. A baseline derby tells you which method won under yesterday’s experimental conditions. A mechanism diagnosis tells you what to change tomorrow.

What this means for business automation agents

The direct result belongs to classical planning benchmarks. The business interpretation must therefore be careful.

What the paper directly shows:

Standard actor-critic policy-gradient methods can learn strong general policies in several symbolic planning domains when policies are represented over successor states.
Relational GNNs can provide size-general representations over objects and predicates.
Some failures can be repaired by adding derived predicates or changing the objective away from shortest expected cost toward cycle-free reachability.
The most valuable failure analysis comes from understanding the domain language, not from blindly swapping RL algorithms.

What Cognaptus infers for business use:

Enterprise agents should be designed around stable workflow state representations, not just UI-level actions.
Generalization should be tested across instance families: more vendors, more document types, more exception classes, more approval chains.
Derived predicates are operational gold. “Invoice is from known vendor,” “shipment is blocked by customs,” “customer request needs compliance review,” and “case is waiting on third-party evidence” are not cosmetic labels. They are the concepts that make reusable policies possible.
Objectives should be designed for reliable completion and auditability, not merely shortest path or lowest immediate cost.

What remains uncertain:

The paper assumes a known lifted planning model. Most real enterprise systems do not hand you one politely wrapped in PDDL.
Successor states must be generated or known. In GUI automation, web workflows, or cross-system operations, that can be expensive or unreliable.
Symbolic state extraction remains a major bottleneck. If the system cannot reliably know what state it is in, transition scoring becomes elegant nonsense.
Safety, permissions, data privacy, and human escalation rules are outside the benchmark setting but central in production.

So the business lesson is not “deploy actor-critic agents everywhere.” Please do not. The lesson is: before asking which learning algorithm to use, ask whether the agent has the right state language, the right transition abstraction, and the right objective.

The uncomfortable message for agent builders

The paper is quietly hostile to a common agent-building fantasy: give the model tools, give it rewards, and let it figure out reusable behavior.

Maybe it will, in toy environments. In domains where the objects, goals, and action opportunities vary systematically, generalization often comes from the domain representation. The learning method needs something stable to learn over. In this paper, that stable object is the state transition.

This has an awkward implication for businesses. The expensive part of agent deployment may not be model access. It may not even be fine-tuning. It may be the boring ontology work: defining workflow states, relations, derived predicates, transition legality, failure states, and completion criteria. In other words, the thing everyone wants to skip because it looks like documentation.

Naturally, that is the thing that makes the learning problem coherent.

The paper also suggests a better way to evaluate enterprise agents. Do not only ask whether the agent completes a demo workflow. Ask whether the learned policy still works when the number of objects changes, when the order of subtasks changes, when a new exception type appears, or when the shortest route conflicts with the safest general procedure. A system that survives those tests is beginning to learn the domain. A system that fails them has probably learned the demo.

Boundaries: where this paper should not be overread

The work is impressive because it narrows the problem enough to understand it. That is also its boundary.

The benchmarks are classical planning domains with symbolic states, generated reachable state spaces, and known domain models. The actor-critic policies operate with access to possible successor states. The GNN receives relational structures, not raw enterprise screens, emails, PDFs, voice calls, or inconsistent database entries. Production agents would need perception, state extraction, schema maintenance, exception handling, authorization controls, logging, and human override mechanisms.

The paper also does not claim that GNNs magically remove the need for domain engineering. In fact, the Logistics repair shows the opposite. A derived predicate was needed because the GNN could not express a required relation. In business terms, you may need to add the right operational concept before the learning system can behave intelligently. Intelligence, in this case, begins with giving the model the vocabulary it was missing.

Finally, plan quality matters. Coverage alone is not enough. A policy that reaches the goal through absurdly long paths may be technically successful and operationally useless. The Logistics result makes this visible: before the repair, deterministic coverage could look respectable while plan quality was dreadful. After the repair, both coverage and quality improved. That combination is the standard worth caring about.

From action habits to domain policies

The title says policy gradients grow up, but the maturity does not come from policy gradients alone. It comes from putting them in the right representational discipline.

The paper’s causal chain is clear:

Ground actions do not generalize because their names and counts change across instances.
State transitions can generalize because they describe reusable movement through a domain.
Relational GNNs can encode variable-size symbolic states, but only within expressiveness limits.
Actor-critic methods can learn useful transition policies when the representation and objective are aligned.
When results fail, the right repair may be a derived predicate or a different objective, not a newer RL acronym.

For Cognaptus readers building or evaluating AI automation systems, this is the useful takeaway: generalization is not a vibe. It is a property of the representation, the objective, and the test distribution. If the agent is trained on brittle action labels, it will learn brittle habits. If it is trained over meaningful state transitions, it has a chance to learn a domain policy.

That chance is not automatic. But at least it is now pointed at the right object.

Cognaptus: Automate the Present, Incubate the Future.

¹ Simon Ståhlberg, Blai Bonet, and Hector Geffner, “Learning General Policies with Policy Gradient Methods,” arXiv:2512.19366, 2025.
