Opening — Why this matters now
Reinforcement learning keeps winning benchmarks, but keeps losing the same argument: it doesn’t generalize. Train it here, deploy it there, and watch confidence evaporate. Meanwhile, classical planning—decidedly uncool but stubbornly correct—has been quietly producing policies that provably work across arbitrarily large problem instances. This paper asks the uncomfortable question the RL community often dodges: can modern policy-gradient methods actually learn general policies, not just big ones?
The answer, inconveniently for everyone involved, is yes—but only if we stop pretending actions are the right abstraction.
Background — Two traditions talking past each other
Classical planning defines general policies as solutions that work across all instances of a domain, regardless of how many objects appear. You don’t learn “move block A onto block B”; you learn structural rules about states. Logical and combinatorial methods do this well, but scale poorly and demand heavy symbolic engineering.
Deep RL, by contrast, scales effortlessly and optimizes beautifully—but overfits like it’s being paid by the parameter. Most RL policies bind directly to ground actions, which change with every instance. That’s fatal for generalization.
The key insight borrowed from planning is simple and brutal: actions are instance-specific; state transitions are not.
What the paper actually does
The authors reformulate policy learning as a state-transition classification problem. Instead of choosing an action, the policy assigns probabilities to successor states: which transitions are “good” and should be followed.
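To make that concrete, here is a minimal sketch of a transition-level policy. The names (`score_fn`, the embedding arguments) are placeholders of mine rather than the paper's code; any network that scores (state, successor) pairs fits the slot:

```python
import torch
import torch.nn.functional as F

def transition_policy(score_fn, state_emb, successor_embs):
    # Score every candidate successor s' of s and normalize with a softmax,
    # so the "action" is simply the choice of which successor to move to.
    scores = torch.stack([score_fn(state_emb, s_next) for s_next in successor_embs])
    return F.softmax(scores, dim=0)  # pi(s' | s): one probability per successor
```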
To make this work at scale, they combine:
- Actor–critic policy gradients, used almost verbatim
- Graph Neural Networks (GNNs), adapted to represent relational planning states
- Policies over successor states (\pi(s' \mid s)), rather than actions
Two variants are explored:
| Algorithm | Key Idea | Tradeoff |
|---|---|---|
| AC-1 | Sampled actor–critic | Model-free, slower convergence |
| AC-M | All-actions actor–critic | Uses full successor set, faster but model-dependent |
Crucially, training does not rely on long trajectories. Updates are performed on single transitions. This is not a hack—it’s a deliberate rejection of trajectory bias when no state is privileged.
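Here is a hedged reconstruction of what those single-transition updates could look like, one per variant from the table above. The exact update rules, signs, and discounting are my assumptions, not the paper's equations:

```python
import torch

def ac1_update(policy_logits, chosen_idx, v_s, v_next, cost, gamma=0.99):
    # Sampled (AC-1-style) update from a single (s, s') transition.
    delta = (-cost + gamma * v_next.detach()) - v_s           # TD error as advantage
    log_pi = torch.log_softmax(policy_logits, dim=0)[chosen_idx]
    return -delta.detach() * log_pi + delta.pow(2)            # actor loss + critic loss

def acm_update(policy_logits, v_successors, v_s, costs, gamma=0.99):
    # All-actions (AC-M-style) update: requires the model to enumerate every
    # successor of s, but replaces the sampled gradient with the full expectation.
    pi = torch.softmax(policy_logits, dim=0)
    q = -costs + gamma * v_successors.detach()                # Q value per successor
    actor = -(pi * (q - v_s.detach())).sum()                  # expected advantage
    critic = ((pi.detach() * q).sum() - v_s).pow(2)           # regress V(s) to its target
    return actor + critic
```

The second function needs access to all successors of `s`, which is exactly the model-dependence noted in the table.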
Architecture — Why GNNs are doing the heavy lifting
Planning states are relational objects, not vectors. The GNN encodes each object, passes messages through predicates, and aggregates embeddings into:
- A value function (V(s))
- A transition-level policy (\pi(s' \mid s))
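A minimal sketch of such an encoder, under my own assumptions rather than the paper's exact architecture (the transition-policy head is omitted here; it would score successor embeddings as in the earlier sketch):

```python
import torch
import torch.nn as nn

class RelationalStateEncoder(nn.Module):
    """Message-passing encoder over a relational planning state.

    Each ground atom p(o_i, o_j) exchanges messages between the embeddings of
    o_i and o_j through a per-predicate layer; summing object embeddings gives
    a state embedding that is invariant to object order and object count.
    """

    def __init__(self, predicates, dim=32, rounds=3):
        super().__init__()
        self.dim = dim
        self.rounds = rounds
        self.msg = nn.ModuleDict({p: nn.Linear(2 * dim, dim) for p in predicates})
        self.update = nn.GRUCell(dim, dim)
        self.value_head = nn.Linear(dim, 1)

    def forward(self, num_objects, atoms):
        # atoms: list of (predicate_name, i, j) over object indices 0..num_objects-1
        h = torch.zeros(num_objects, self.dim)
        for _ in range(self.rounds):
            inbox = [[] for _ in range(num_objects)]
            for pred, i, j in atoms:
                msg = self.msg[pred](torch.cat([h[i], h[j]]))
                inbox[i].append(msg)
                inbox[j].append(msg)
            m = torch.stack([torch.stack(msgs).sum(0) if msgs else torch.zeros(self.dim)
                             for msgs in inbox])
            h = self.update(m, h)                  # per-object embedding update
        state_emb = h.sum(dim=0)                   # permutation- and size-invariant
        return state_emb, self.value_head(state_emb)   # embeddings and V(s)
```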
Because the architecture is invariant to object permutations and graph size, the learned policy naturally transfers to larger instances—up to the expressivity limits of the network.
And yes, those limits matter.
Findings — Where it works, where it breaks
Out of ten benchmark planning domains:
- 6 domains achieved near-perfect generalization out of the box
- Blocks, where optimal planning is notoriously NP-hard, was solved with 100% coverage (non-optimal but correct plans)
Failures were not random. They fell into two clean categories:
1. Expressivity limits of GNNs
Standard message-passing GNNs are roughly as expressive as C2, the two-variable fragment of first-order logic with counting. Domains like Logistics and Grid require relational reasoning beyond that fragment (C3). The fix? Derived predicates. Add the missing relational signal as extra atoms in the state, and performance jumps.
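As an illustration of the trick, here is a derived predicate computed as a transitive closure, in the spirit of `above` over `on` in Blocks; the specific predicates the paper adds for Logistics and Grid may differ:

```python
def add_transitive_closure(atoms, base="on", derived="above"):
    # Close base(x, y) transitively into derived(x, y) atoms and append them to
    # the state, injecting relational signal a C2-bounded GNN cannot compute itself.
    edges = {(x, y) for p, x, y in atoms if p == base}
    closure = set(edges)
    changed = True
    while changed:  # naive fixpoint iteration
        new = {(x, z) for (x, y) in closure for (y2, z) in edges if y == y2}
        changed = not new <= closure
        closure |= new
    return atoms + [(derived, x, y) for (x, y) in sorted(closure)]
```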
2. Optimality vs generality
Some domains admit compact general policies—but no compact optimal ones. RL optimizes expected cost, so it overcommits to optimality and sacrifices coverage. Changing the cost structure to optimize goal reachability without cycles restores generalization.
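One way to read that change, sketched purely as my own interpretation rather than the paper's formulation:

```python
def transition_reward(s_next, visited, is_goal, cycle_penalty=1.0):
    # Reward reaching the goal, penalize revisiting a state (a cycle), and stay
    # neutral on path length, so the learner is not pushed toward shortest plans.
    if is_goal:
        return 1.0
    if s_next in visited:
        return -cycle_penalty
    return 0.0
```

With an undiscounted episodic return, every acyclic goal-reaching trajectory then scores the same, which is exactly the latitude a compact general (but non-optimal) policy needs.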
Failures, in other words, were diagnosable and correctable.
Why this matters for business AI
This paper quietly dismantles a false dichotomy:
Symbolic systems generalize but don’t scale. RL scales but doesn’t generalize.
The result shows that policy-gradient methods can learn domain-level reasoning—if you give them:
- The right abstraction (states, not actions)
- The right inductive bias (relational structure)
- The right objective (general success, not local optimality)
For anyone building autonomous agents, planning systems, or long-horizon automation workflows, this is not academic trivia. It’s a blueprint for systems that transfer.
Conclusion — RL didn’t fail, we just asked it the wrong question
Deep RL’s generalization problem was never about gradients or optimizers. It was about language. Once policies are expressed over invariant structures, actor–critic methods behave far more like planners than gamblers.
Symbolic reasoning and gradient descent are not rivals. They are complementary solvers—and pretending otherwise has cost the field a decade.
Cognaptus: Automate the Present, Incubate the Future.