Opening — Why this matters now
Reinforcement learning keeps winning benchmarks, but keeps losing the same argument: it doesn’t generalize. Train it here, deploy it there, and watch confidence evaporate. Meanwhile, classical planning—decidedly uncool but stubbornly correct—has been quietly producing policies that provably work across arbitrarily large problem instances. This paper asks the uncomfortable question the RL community often dodges: can modern policy-gradient methods actually learn general policies, not just big ones?
The answer, inconveniently for everyone involved, is yes—but only if we stop pretending actions are the right abstraction.
Background — Two traditions talking past each other
Classical planning defines general policies as solutions that work across all instances of a domain, regardless of how many objects appear. You don’t learn “move block A onto block B”; you learn structural rules about states. Logical and combinatorial methods do this well, but scale poorly and demand heavy symbolic engineering.
Deep RL, by contrast, scales effortlessly and optimizes beautifully—but overfits like it’s being paid by the parameter. Most RL policies bind directly to ground actions, which change with every instance. That’s fatal for generalization.
The key insight borrowed from planning is simple and brutal: actions are instance-specific; state transitions are not.
What the paper actually does
The authors reformulate policy learning as a state-transition classification problem. Instead of choosing an action, the policy assigns probabilities to successor states: which transitions are “good” and should be followed.
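To make that concrete, here is a minimal sketch of a transition-level policy. The names (`score_fn`, the embedding arguments) are placeholders of mine rather than the paper's code; any network that scores (state, successor) pairs fits the slot:

```python
import torch
import torch.nn.functional as F

def transition_policy(score_fn, state_emb, successor_embs):
    # Score every candidate successor s' of s and normalize with a softmax,
    # so the "action" is simply the choice of which successor to move to.
    scores = torch.stack([score_fn(state_emb, s_next) for s_next in successor_embs])
    return F.softmax(scores, dim=0)  # pi(s' | s): one probability per successor
```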
To make this work at scale, they combine:
- Actor–critic policy gradients, used almost verbatim
- Graph Neural Networks (GNNs), adapted to represent relational planning states
- Policies over successor states (\pi(s' \mid s)), rather than actions
Two variants are explored:
| Algorithm | Key Idea | Tradeoff |
|---|---|---|
| AC-1 | Sampled actor–critic | Model-free, slower convergence |
| AC-M | All-actions actor–critic | Uses full successor set, faster but model-dependent |
Crucially, training does not rely on long trajectories. Updates are performed on single transitions. This is not a hack—it’s a deliberate rejection of trajectory bias when no state is privileged.
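Here is a hedged reconstruction of what those single-transition updates could look like, one per variant from the table above. The exact update rules, signs, and discounting are my assumptions, not the paper's equations:

```python
import torch

def ac1_update(policy_logits, chosen_idx, v_s, v_next, cost, gamma=0.99):
    # Sampled (AC-1-style) update from a single (s, s') transition.
    delta = (-cost + gamma * v_next.detach()) - v_s           # TD error as advantage
    log_pi = torch.log_softmax(policy_logits, dim=0)[chosen_idx]
    return -delta.detach() * log_pi + delta.pow(2)            # actor loss + critic loss

def acm_update(policy_logits, v_successors, v_s, costs, gamma=0.99):
    # All-actions (AC-M-style) update: requires the model to enumerate every
    # successor of s, but replaces the sampled gradient with the full expectation.
    pi = torch.softmax(policy_logits, dim=0)
    q = -costs + gamma * v_successors.detach()                # Q value per successor
    actor = -(pi * (q - v_s.detach())).sum()                  # expected advantage
    critic = ((pi.detach() * q).sum() - v_s).pow(2)           # regress V(s) to its target
    return actor + critic
```

The second function needs access to all successors of `s`, which is exactly the model-dependence noted in the table.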
Architecture — Why GNNs are doing the heavy lifting
Planning states are relational objects, not vectors. The GNN encodes each object, passes messages through predicates, and aggregates embeddings into:
- A value function (V(s))
- A transition-level policy (\pi(s' \mid s))
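A minimal sketch of such an encoder, under my own assumptions rather than the paper's exact architecture (the transition-policy head is omitted here; it would score successor embeddings as in the earlier sketch):

```python
import torch
import torch.nn as nn

class RelationalStateEncoder(nn.Module):
    """Message-passing encoder over a relational planning state.

    Each ground atom p(o_i, o_j) exchanges messages between the embeddings of
    o_i and o_j through a per-predicate layer; summing object embeddings gives
    a state embedding that is invariant to object order and object count.
    """

    def __init__(self, predicates, dim=32, rounds=3):
        super().__init__()
        self.dim = dim
        self.rounds = rounds
        self.msg = nn.ModuleDict({p: nn.Linear(2 * dim, dim) for p in predicates})
        self.update = nn.GRUCell(dim, dim)
        self.value_head = nn.Linear(dim, 1)

    def forward(self, num_objects, atoms):
        # atoms: list of (predicate_name, i, j) over object indices 0..num_objects-1
        h = torch.zeros(num_objects, self.dim)
        for _ in range(self.rounds):
            inbox = [[] for _ in range(num_objects)]
            for pred, i, j in atoms:
                msg = self.msg[pred](torch.cat([h[i], h[j]]))
                inbox[i].append(msg)
                inbox[j].append(msg)
            m = torch.stack([torch.stack(msgs).sum(0) if msgs else torch.zeros(self.dim)
                             for msgs in inbox])
            h = self.update(m, h)                  # per-object embedding update
        state_emb = h.sum(dim=0)                   # permutation- and size-invariant
        return state_emb, self.value_head(state_emb)   # embeddings and V(s)
```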
Because the architecture is invariant to object permutations and graph size, the learned policy naturally transfers to larger instances—up to the expressivity limits of the network.
And yes, those limits matter.
Findings — Where it works, where it breaks
Out of ten benchmark planning domains:
- 6 domains achieved near-perfect generalization out of the box
- Blocks, where optimal planning is notoriously NP-hard, was solved with 100% coverage (non-optimal but correct plans)
Failures were not random. They fell into two clean categories:
1. Expressivity limits of GNNs
Standard message-passing GNNs are roughly as expressive as C2, the two-variable fragment of first-order logic with counting. Domains like Logistics and Grid require relational reasoning beyond that fragment (C3). The fix? Derived predicates. Add the missing relational signal as extra atoms in the state, and performance jumps.
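As an illustration of the trick, here is a derived predicate computed as a transitive closure, in the spirit of `above` over `on` in Blocks; the specific predicates the paper adds for Logistics and Grid may differ:

```python
def add_transitive_closure(atoms, base="on", derived="above"):
    # Close base(x, y) transitively into derived(x, y) atoms and append them to
    # the state, injecting relational signal a C2-bounded GNN cannot compute itself.
    edges = {(x, y) for p, x, y in atoms if p == base}
    closure = set(edges)
    changed = True
    while changed:  # naive fixpoint iteration
        new = {(x, z) for (x, y) in closure for (y2, z) in edges if y == y2}
        changed = not new <= closure
        closure |= new
    return atoms + [(derived, x, y) for (x, y) in sorted(closure)]
```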
2. Optimality vs generality
Some domains admit compact general policies—but no compact optimal ones. RL optimizes expected cost, so it overcommits to optimality and sacrifices coverage. Changing the cost structure to optimize goal reachability without cycles restores generalization.
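One way to read that change, sketched purely as my own interpretation rather than the paper's formulation:

```python
def transition_reward(s_next, visited, is_goal, cycle_penalty=1.0):
    # Reward reaching the goal, penalize revisiting a state (a cycle), and stay
    # neutral on path length, so the learner is not pushed toward shortest plans.
    if is_goal:
        return 1.0
    if s_next in visited:
        return -cycle_penalty
    return 0.0
```

With an undiscounted episodic return, every acyclic goal-reaching trajectory then scores the same, which is exactly the latitude a compact general (but non-optimal) policy needs.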
Failures, in other words, were diagnosable and correctable.
Why this matters for business AI
This paper quietly dismantles a false dichotomy:
Symbolic systems generalize but don’t scale. RL scales but doesn’t generalize.
The result shows that policy-gradient methods can learn domain-level reasoning—if you give them:
- The right abstraction (states, not actions)
- The right inductive bias (relational structure)
- The right objective (general success, not local optimality)
For anyone building autonomous agents, planning systems, or long-horizon automation workflows, this is not academic trivia. It’s a blueprint for systems that transfer.
Conclusion — RL didn’t fail, we just asked it the wrong question
Deep RL’s generalization problem was never about gradients or optimizers. It was about language. Once policies are expressed over invariant structures, actor–critic methods behave far more like planners than gamblers.
Symbolic reasoning and gradient descent are not rivals. They are complementary solvers—and pretending otherwise has cost the field a decade.
Cognaptus: Automate the Present, Incubate the Future.