A workflow agent usually looks clever right up to the moment one service is down, one permission changes, or one customer case arrives with the wrong sort of mess attached.
Then the question becomes painfully simple: did the model learn a plan, or did it learn the usual route?
That distinction is the centre of “Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective”, an ICLR 2026 paper by Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, and Wei Chen.[^1] The paper is not another victory lap for reinforcement learning. It is more useful than that. It asks what, mechanically, changes when a language model is trained for planning with reinforcement learning rather than supervised fine-tuning.
The answer is not “RL teaches reasoning”, the standard fairy tale sold in three slides and a procurement deck. The answer is narrower and sharper: RL can improve planning because it changes the data the model learns from. Policy-gradient methods improve mainly by exploring and training on successful generated paths. But they can also collapse toward a single preferred path even after training accuracy is perfect. Q-learning can preserve multiple valid paths and support off-policy learning, but only when the reward is shaped around the process of planning rather than the final outcome alone.
That is the useful lesson for builders. Not “use RL”. Not “SFT is dead”. Not “Q-learning will save agents, bless its ancient Atari soul”. The lesson is that the training signal determines what kind of graph the model internalises, and the graph it internalises determines whether the agent has options when reality refuses to follow the demo script.
The paper turns planning into path-finding, which is exactly the right simplification
The paper studies planning by stripping away language semantics and modelling the task as path-finding on a directed graph $G=(V,E)$. Each node is represented as a token. A valid plan is a path from a source node $s$ to a target node $t$. In Blocksworld, for example, each node is a block configuration and each edge is a legal single move. In a tool-use agent, the same abstraction maps naturally onto a graph of valid tool calls, dependencies, and state transitions.
This abstraction is simple, but not simplistic. It isolates the part of planning that business systems actually care about: choosing a legal sequence of actions that reaches a target. A customer-service workflow, a claims-processing pipeline, a compliance escalation chain, or a multi-tool retrieval process can all be viewed as graph navigation. The vocabulary changes. The structural question does not.
The authors use this graph setup to compare three training mechanisms:
| Training mechanism | What the model learns from | What the paper investigates |
|---|---|---|
| Supervised fine-tuning | Fixed demonstrations of valid paths | Whether demonstrations produce true reachability knowledge or memorised co-occurrences |
| Policy-gradient RL | The model’s own generated rollouts, rewarded by final success | Whether exploration explains RL’s advantage, and whether diversity survives |
| Q-learning | State-action values trained from outcome or process rewards | Whether value learning can preserve structure, diversity, and off-policy usefulness |
The empirical setup is deliberately controlled. The main experiments use a one-layer, single-head Transformer, with a 100-node Erdős-Rényi graph and sampled paths for reachable source-target pairs. The paper also tests Blocksworld-derived graphs and additional Erdős-Rényi variants. This is not a frontier model evaluation. It is a microscope.
And microscopes are allowed to be small. Their job is not to be the organism.
SFT learns the observed shortcut, not the hidden map
The first mechanism is supervised fine-tuning. In the paper’s setup, SFT trains the model on sampled paths such as:
$s,\ t,\ s,\ a,\ b,\ c,\ t,\ \text{end}$
The model sees a target, a current node, and the next node that happened to appear in the demonstration. The theoretical result characterises the stable point of SFT: under the paper’s modelling assumption, the model’s next-token distribution becomes the empirical distribution of observed triples.
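The sequence format above can be made concrete with a short Python sketch. It assumes the layout shown in the example (source, target, the path itself, then an end token); the function names and the `END` constant are illustrative, not taken from the paper.

```python
from typing import List, Tuple

END = "end"  # hypothetical name for the end-of-sequence token


def encode_path(path: List[str]) -> List[str]:
    """Encode a path [s, ..., t] as: source, target, the full path, then END."""
    source, target = path[0], path[-1]
    return [source, target] + path + [END]


def training_triples(path: List[str]) -> List[Tuple[str, str, str]]:
    """Return the (target, current, next) triples the model is supervised on."""
    target = path[-1]
    return [(target, cur, nxt) for cur, nxt in zip(path, path[1:])]


tokens = encode_path(["s", "a", "b", "c", "t"])
triples = training_triples(["s", "a", "b", "c", "t"])
print(tokens)
print(triples)
```

Each triple is one supervised prediction: given the target and the current node, predict the node that followed in the demonstration.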
In plain form, if the training data contains counts $n(t,j,k)$ for target $t$, current node $j$, and next node $k$, SFT learns something like:
$P(k \mid t, j) = \dfrac{n(t,j,k)}{\sum_{k'} n(t,j,k')}$
when that target-current context appears in the data. If it does not appear, the model has no useful constraint.
That is not planning. That is frequency-conditioned imitation.
The important detail is transitivity. A graph may imply that node $t$ is reachable from node $j$ through a chain of legal moves. But if that relationship is not sufficiently represented in the demonstrations, SFT does not infer the full reachability structure. It can learn that certain next moves often followed certain target-current pairs. It does not necessarily learn the underlying map.
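A minimal Python sketch of that stable point, assuming the model simply normalises the observed counts per (target, current) context. The toy demonstrations and the function name are hypothetical; the point is only that a reachable pair can have no entry at all.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def empirical_next_step(paths: List[List[str]]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Normalise counts n(t, j, k) into a next-step distribution per (t, j)."""
    counts = defaultdict(Counter)
    for path in paths:
        target = path[-1]
        for cur, nxt in zip(path, path[1:]):
            counts[(target, cur)][nxt] += 1
    return {
        ctx: {k: n / sum(c.values()) for k, n in c.items()}
        for ctx, c in counts.items()
    }


demos = [
    ["s", "a", "t"],
    ["s", "a", "t"],
    ["s", "b", "t"],
    ["a", "t", "u"],  # shows the edge t -> u, so u is reachable from s
]
table = empirical_next_step(demos)
print(table[("t", "s")])     # next-step frequencies seen for target t at node s
print(("u", "s") in table)   # False: reachable via s -> a -> t -> u, but never demonstrated
```

The edges needed to reach `u` from `s` all appear in the data, yet the context (target `u`, current `s`) was never observed, so the frequency table says nothing about it.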
For business agents, this is the difference between learning “when refund cases appear, call Tool A then Tool B” and learning “from this customer state, these three legal next actions can still reach resolution”. The first is useful until the usual route breaks. The second is planning.
The paper’s Blocksworld adjacency visualisation reinforces this point. Some adjacency relations that appear in the SFT data are still not strongly captured, especially when their frequency is low. The figure is not the main theorem; it is a diagnostic illustration. Its purpose is to show why the theoretical stable-point result matters in a more recognisable planning setting.
SFT is not useless. It gives the model a base policy and some structural hints. But in this paper, SFT alone is a parrot with a route memory. Sometimes that is enough. Often, it is just enough to be dangerous.
Policy-gradient wins because it creates new successful data
Policy-gradient RL is the next mechanism. The paper’s first important result is almost rude in its simplicity: with a 0-1 outcome reward and no KL regularisation, and under the paper’s modelling assumptions, each policy-gradient update is equivalent to supervised fine-tuning on the successful paths generated during exploration.
That sounds like a downgrade until the second half lands. The dataset is no longer fixed.
SFT trains on a static sample of demonstrations. PG trains on rollouts produced by the model itself. As the model improves, it can discover correct paths that were not present in the original SFT dataset. Those successful paths then become new training material. The performance gain comes less from a mystical “reasoning objective” and more from exploration-driven data augmentation.
This is the paper’s most practical correction to the lazy slogan “RL generalises”. RL generalises here because it searches, finds additional successful trajectories, and trains on them. The model gets more of the graph.
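The loop can be sketched in a few lines of Python. This is a count-based stand-in rather than a gradient implementation: it assumes a tabular policy with additive smoothing so that unseen moves keep some probability, and it "trains" by adding successful rollouts to the counts. The graph, the smoothing value, and all names are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy graph with three valid routes from s to t.
EDGES = {("s", "a"), ("s", "b"), ("a", "b"), ("a", "t"), ("b", "t")}
NODES = ["s", "a", "b", "t"]


def rollout(counts, source, target, rng, smoothing=0.5, max_len=6):
    """Sample a path from a count-based policy; smoothing supplies exploration."""
    path = [source]
    while path[-1] != target and len(path) < max_len:
        ctx = (target, path[-1])
        weights = [counts[ctx][n] + smoothing for n in NODES]
        path.append(rng.choices(NODES, weights=weights)[0])
    return path


def is_success(path, target):
    """0-1 outcome reward: every step is a real edge and the path ends at target."""
    return path[-1] == target and all(step in EDGES for step in zip(path, path[1:]))


def train_on_successes(demos, source, target, rounds=200, seed=0):
    rng = random.Random(seed)
    counts = defaultdict(Counter)
    known = set()

    def fit(path):
        # "SFT step": add the path's (target, current, next) triples to the counts.
        known.add(tuple(path))
        for cur, nxt in zip(path, path[1:]):
            counts[(target, cur)][nxt] += 1

    for demo in demos:
        fit(demo)
    for _ in range(rounds):
        path = rollout(counts, source, target, rng)
        if is_success(path, target):  # failed rollouts contribute nothing
            fit(path)
    return known


known = train_on_successes([["s", "a", "t"]], "s", "t")
print(sorted(known))
```

The dataset starts with one demonstrated route and can only grow through rollouts that happen to succeed. Set `smoothing` close to zero and the sketch degenerates into re-fitting the original demonstration, which is the conditional nature of the benefit.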
The business translation is straightforward. Suppose an internal agent is trained on historical workflow traces. The historical traces contain the routes employees usually took, not all routes that would have worked. Policy-gradient training can discover additional valid action sequences if the environment allows safe exploration and if success is verifiable. That is valuable.
But it also means the benefit is conditional. If exploration is narrow, expensive, unsafe, or rewarded incorrectly, PG does not automatically acquire a better map. It becomes SFT on whatever lucky successes it happened to generate. The magic was data generation all along. Slightly embarrassing for the magic, but excellent for engineering.
Perfect training accuracy can hide a one-path policy
The paper’s second PG result is the uncomfortable one: diversity collapse.
In the theoretical analysis, PG without KL regularisation can drive the probability of wrong paths toward zero. That is the good news. But after training accuracy reaches 100%, output diversity can continue to decline. The model increasingly concentrates probability on fewer valid next steps. In the paper’s experiments, it can end up producing only one correct path per source-target pair.
This is not failure by accuracy. It is failure by optionality.
That distinction matters because most enterprise evaluations still worship final-answer accuracy. For short tasks, that may be tolerable. For long-horizon agents, it is negligence wearing a dashboard. If a model knows only one path through a workflow, it is fragile even when the path is correct. A single unavailable API, rate limit, permission mismatch, data-quality problem, or policy exception can invalidate the route.
The paper measures diversity as the average number of distinct correct paths generated over repeated sampling trials for the same source-target pair. In the paper’s Figure 2, PG without KL maintains perfect training accuracy while diversity keeps falling. When diversity diminishes, test accuracy can degrade with continued training. That is the practical warning: the model can look increasingly competent on the training objective while becoming less adaptable.
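That diversity metric is easy to instrument. The Python sketch below follows the definition as described here (distinct correct paths over repeated samples, averaged over source-target pairs); the sampler interface, the trial count, and the toy graph are assumptions for illustration.

```python
from itertools import cycle

# Hypothetical toy graph with two valid routes from s to t.
EDGES = {("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")}


def is_correct(path, source, target):
    """A path is correct if it starts at source, ends at target, and uses only real edges."""
    return (
        len(path) > 1
        and path[0] == source
        and path[-1] == target
        and all(step in EDGES for step in zip(path, path[1:]))
    )


def average_distinct_correct_paths(sample_path, pairs, trials=20):
    """Average, over (source, target) pairs, of distinct correct paths seen in `trials` samples."""
    total = 0
    for source, target in pairs:
        distinct = set()
        for _ in range(trials):
            path = tuple(sample_path(source, target))
            if is_correct(path, source, target):
                distinct.add(path)
        total += len(distinct)
    return total / len(pairs)


def collapsed_policy(source, target):
    return ["s", "a", "t"]  # always the same corridor


_routes = cycle([["s", "a", "t"], ["s", "b", "t"]])


def diverse_policy(source, target):
    return next(_routes)  # alternates between the two valid routes


collapsed_score = average_distinct_correct_paths(collapsed_policy, [("s", "t")])
diverse_score = average_distinct_correct_paths(diverse_policy, [("s", "t")])
print(collapsed_score, diverse_score)  # 1.0 2.0
```

Both stand-in policies score 100% on success. Only the diversity number separates them.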
In product terms, “100% train success” may mean “the agent has found one comfortable corridor and is now painting over all the doors”.
KL regularisation preserves options by tying the model to its base policy
The paper then analyses KL regularisation, the common trick of penalising the trained policy for drifting too far from the base model. Mechanistically, KL acts as a diversity-preserving term. It keeps probability mass closer to the base model’s distribution, which can prevent valid but lower-probability paths from disappearing.
But this is not free. If the base model is already reasonably capable and diverse, KL can help generalisation by preserving useful alternatives. If the base model is weak, KL can restrain learning and cap improvement. The paper’s Figure 2(d) shows the trade-off: stronger KL improves output diversity but limits training accuracy.
This is an unusually useful result because it explains why KL regularisation seems both essential and annoying in practice. It is not a moral principle. It is a leash. A leash is useful when the dog knows the park. Less useful when the dog is confidently walking into traffic.
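A small numeric sketch shows how the leash works. It assumes the common form of the penalty, expected success minus a coefficient times KL(policy || base), over a single next-step choice; the paper’s exact objective may differ, and the coefficient `beta` and the distributions are made up for illustration.

```python
import math


def kl_divergence(policy, base):
    """KL(policy || base) over next steps; assumes base[k] > 0 wherever policy[k] > 0."""
    return sum(p * math.log(p / base[k]) for k, p in policy.items() if p > 0)


def regularised_objective(expected_success, policy, base, beta):
    """Assumed objective: reward minus a KL penalty for drifting from the base policy."""
    return expected_success - beta * kl_divergence(policy, base)


base = {"a": 0.5, "b": 0.5}        # base model keeps both valid next steps
collapsed = {"a": 1.0, "b": 0.0}   # trained policy that abandoned b
balanced = {"a": 0.5, "b": 0.5}    # trained policy that kept both

# Both next steps are valid here, so expected success is 1.0 either way.
print(regularised_objective(1.0, collapsed, base, beta=0.1))
print(regularised_objective(1.0, balanced, base, beta=0.1))
```

With equal success, the collapsed policy pays a penalty of `beta * log(2)` and the balanced one pays nothing, so the regulariser favours keeping the second route. The same term would also resist moving away from a base distribution that is wrong, which is the other half of the trade-off.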
For business deployment, the implication is not “always add KL”. It is:
| Base model condition | KL effect | Operational interpretation |
|---|---|---|
| Base model has broad, decent coverage | Preserves alternative valid paths | Useful for resilience and generalisation |
| Base model has poor coverage or wrong priors | Restricts necessary updates | May protect bad habits |
| Evaluation tracks only final success | KL trade-off may be invisible | Diversity loss can be missed |
| Evaluation tracks path diversity and recovery | KL can be tuned intelligently | The trade-off becomes measurable |
The important unit is not the coefficient itself. The important unit is the behaviour it protects or prevents.
Q-learning works when the reward describes the process, not just the trophy
The paper’s Q-learning section is the sharpest part of the mechanism story.
Q-learning tries to estimate action values: from this state, how good is this next action? In the paper, the model logits are used to approximate the Q-function. The authors study two reward designs.
The first is outcome reward: reward the model only if the full generated path is correct. This sounds reasonable. It is also where Q-learning goes wrong. The paper shows that with outcome-only reward, Q-learning suffers from Q-value bias. At stable points, the logits collapse into values that depend only on the target, losing the state-action structure needed for planning. Empirically, Q-learning with outcome rewards collapses toward near-zero train and test accuracy.
So no, “just reward success” is not enough. The model needs to know which transitions were structurally valid.
The second design is process reward. Here the reward exposes intermediate structure: reaching the target is rewarded, and invalid transitions to non-adjacent nodes are penalised. Under persistent exploration, Q-learning with process rewards converges to values that recover the relevant adjacency and reachability structure. Feasible next nodes converge to similar high values, which preserves action diversity.
This matters because it changes what the model is learning. PG learns from successful whole trajectories. Q-learning with process rewards learns which local transitions preserve reachability to the target. That is closer to learning the map rather than memorising the itinerary.
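The process-reward idea can be illustrated with tabular Q-learning on a toy graph. This is a sketch under stated assumptions, not the paper’s setup: the paper approximates Q with model logits, whereas this uses a lookup table, and the reward values (+1 for reaching the target, -1 for a non-adjacent move, 0 otherwise), the discount of 1, and the uniform random behaviour policy are all choices made here for illustration.

```python
import random

NODES = ["s", "a", "b", "c", "t"]
# Hypothetical graph: a and b lead to the target, c is a legal dead end.
EDGES = {("s", "a"), ("s", "b"), ("s", "c"), ("a", "t"), ("b", "t")}


def process_reward(cur, nxt, target):
    """Return (reward, episode_done) for attempting the move cur -> nxt."""
    if (cur, nxt) not in EDGES:
        return -1.0, True   # invalid transition to a non-adjacent node
    if nxt == target:
        return 1.0, True    # reached the target through a valid edge
    return 0.0, False       # valid intermediate step


def q_learning(target="t", episodes=2000, lr=0.5, gamma=1.0, seed=0):
    rng = random.Random(seed)
    q = {n: {m: 0.0 for m in NODES} for n in NODES}
    starts = [n for n in NODES if n != target]
    for _ in range(episodes):
        cur = rng.choice(starts)
        for _step in range(len(NODES)):
            nxt = rng.choice(NODES)  # uniform behaviour policy: persistent, off-policy exploration
            reward, done = process_reward(cur, nxt, target)
            bootstrap = 0.0 if done else gamma * max(q[nxt].values())
            q[cur][nxt] += lr * (reward + bootstrap - q[cur][nxt])
            if done:
                break
            cur = nxt
    return q


q = q_learning()
print({k: round(v, 2) for k, v in q["s"].items()})
```

From `s`, the two moves that keep the target reachable (`a` and `b`) should end up with similar high values, while the non-adjacent jump to `t` and the dead end `c` end up low. The experience comes from a uniform random policy rather than from the learned values, which is the off-policy property in miniature.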
The paper also shows an important off-policy advantage. Q-learning with process rewards can train from rollouts generated by a different policy, including a base model. The authors explicitly connect this to practical settings where rollouts from quantised models or large-batch generation are effectively off-policy. For production systems, that matters because the actor that generates experience is often not identical to the model being updated. Cost, latency, quantisation, and infrastructure batching all get in the way of clean textbook on-policy training. Reality, predictably, has poor respect for algorithm diagrams.
The experiments validate mechanisms, not a leaderboard
The empirical sections are best read as mechanism checks. They are not trying to prove that a tiny Transformer on graph tokens is the next enterprise agent platform. The point is to test whether the theoretical behaviours appear in controlled training dynamics.
| Evidence component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Figure 1, Blocksworld adjacency weights | Mechanism illustration | SFT under-captures some adjacency relations; RL variants capture more structure | That the exact same magnitude transfers to large commercial LLMs |
| Figure 2, PG training curves | Main evidence | PG improves over continual SFT through exploration; PG without KL collapses diversity; KL trades accuracy for diversity | That KL has one universal optimal coefficient |
| Figure 3, Q-learning versus PG | Main comparison | Process-reward Q-learning improves test accuracy and preserves diversity; outcome-only Q-learning fails | That Q-learning is always operationally superior |
| Figure 4, Q-learning logits heatmap | Mechanism validation | Process rewards raise logits for feasible next nodes and recover graph structure | That learned values are robust to messy real-world state definitions |
| Figures 5 and 6, attention maps | Assumption validation / robustness | The trained Transformers mainly attend to target and current nodes, supporting the paper’s modelling assumption | That all deep LLMs behave this cleanly |
| Figures 7 and 8, additional Erdős-Rényi splits | Robustness / sensitivity test | KL can balance learning new pairs and forgetting old ones; Q-learning may converge more slowly when initial performance is poor | That the training recipe is plug-and-play |
This table matters because the paper has several different kinds of evidence. Some figures support the main theorem-to-experiment chain. Some test assumptions. Some probe sensitivity. Some connect the abstraction to Blocksworld. Treating all of them as “results” would flatten the paper into a much less useful summary.
The appendix also matters. The attention-map analysis supports the assumption that the trained Transformer’s next-token prediction depends primarily on the target and current node. The additional Erdős-Rényi experiments show that KL regularisation has opposing effects: weak or no KL helps learn new RL training pairs, while no KL also encourages forgetting of previously learned SFT knowledge. A well-chosen small KL coefficient performs best in that setting. Meanwhile, Q-learning can converge more slowly when the initial model performs poorly on new training pairs.
That last point is a necessary brake on enthusiasm. Q-learning with process rewards is promising in the paper’s abstraction, but it is not advertised as frictionless. If the initial model generates many failures, the value-learning process can be slower. Ancient methods rarely become modern miracles without paperwork.
The business lesson is to evaluate routes, not just arrivals
For business systems, the paper’s central value is diagnostic. It gives leaders and builders a better vocabulary for asking whether an agent has learned a workflow structure or merely overfitted a successful route.
The direct finding is technical: in graph-based planning, SFT learns observed co-occurrence patterns; PG gains from exploration but can collapse diversity; Q-learning with process rewards can preserve diverse valid actions and train off-policy, while outcome-only Q-learning fails.
The Cognaptus inference is operational: agent evaluation should include path-level metrics.
A practical evaluation suite for workflow agents should measure at least five things:
- Final task success on unseen goals. This remains necessary. It is just not sufficient.
- Distinct valid paths per target. If repeated sampling produces only one valid route, the agent may be brittle even when accurate.
- Recovery under injected failures. Disable a tool, add latency, change a permission, or remove an intermediate data source. A planning-capable agent should reroute rather than hallucinate victory.
- Illegal transition rate. In business workflows, invalid transitions are not harmless. They are compliance incidents, data corruption, customer confusion, or all three if the week is feeling theatrical.
- Reward auditability. If process rewards are used, teams must know which intermediate behaviours are rewarded, penalised, or ignored. Reward design becomes part of system governance, not a tuning footnote.
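Two of these metrics, illegal transition rate and recovery under injected failures, can be sketched in Python once the workflow is available as a set of legal edges. The planner interface and the breadth-first stand-in agent below are hypothetical; a real harness would call the deployed agent instead.

```python
from collections import deque


def illegal_transition_rate(paths, edges):
    """Fraction of attempted steps that are not legal edges."""
    steps = [step for path in paths for step in zip(path, path[1:])]
    if not steps:
        return 0.0
    return sum(step not in edges for step in steps) / len(steps)


def recovers_after_failure(plan, edges, disabled_edge, source, target):
    """Disable one edge and check the agent still returns a fully valid path."""
    remaining = edges - {disabled_edge}
    path = plan(source, target, remaining)
    return (
        path is not None
        and path[0] == source
        and path[-1] == target
        and all(step in remaining for step in zip(path, path[1:]))
    )


def bfs_plan(source, target, edges):
    """Stand-in agent: breadth-first search over the currently available edges."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for a, b in sorted(edges):
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    return None


EDGES = {("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")}
rate = illegal_transition_rate([["s", "a", "t"], ["s", "t"]], EDGES)
rerouted = recovers_after_failure(bfs_plan, EDGES, ("a", "t"), "s", "t")
stuck = recovers_after_failure(bfs_plan, {("s", "a"), ("a", "t")}, ("a", "t"), "s", "t")
print(rate, rerouted, stuck)
```

The useful comparison is between an agent that passes `recovers_after_failure` across many disabled edges and one that only passes when its favourite route is intact.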
This shifts the evaluation question from “Did the agent complete the task?” to “How many structurally valid ways does the agent know to complete the task, and what happens when the favourite one fails?”
That is a better question. It is also a less flattering one, which is usually how we know it is useful.
Reward design becomes process design
The paper’s strongest business implication is not that every company should immediately implement Q-learning. It is that long-horizon automation needs explicit process structure.
If a workflow can be represented as states and legal transitions, then process rewards become feasible. A claims agent can be rewarded for valid document checks, valid escalation paths, and correct closure conditions. A procurement agent can be penalised for skipping approval states. A legal research agent can be rewarded for moving from jurisdiction identification to authority retrieval to citation validation rather than jumping straight to a confident paragraph wearing a fake moustache.
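As a rough illustration, a procurement workflow written down as states and legal transitions yields per-step rewards almost for free. The state names, the transition table, and the reward values in this Python sketch are hypothetical.

```python
# Hypothetical procurement process graph: state -> set of legal next states.
LEGAL = {
    "request": {"manager_approval"},
    "manager_approval": {"finance_approval", "rejected"},
    "finance_approval": {"purchase_order", "rejected"},
}
GOAL = "purchase_order"


def step_rewards(trace):
    """Score each transition: -1 if illegal, +1 if it legally reaches the goal, else 0."""
    rewards = []
    for cur, nxt in zip(trace, trace[1:]):
        if nxt not in LEGAL.get(cur, set()):
            rewards.append(-1.0)   # e.g. skipping an approval state
        elif nxt == GOAL:
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    return rewards


compliant = step_rewards(["request", "manager_approval", "finance_approval", "purchase_order"])
shortcut = step_rewards(["request", "purchase_order"])
print(compliant)  # [0.0, 0.0, 1.0]
print(shortcut)   # [-1.0]
```

An outcome-only reward would score both traces as reaching a purchase order. The per-step version marks the skipped approvals at the exact transition where they happen.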
But this requires the organisation to know its own process graph. Many do not. They have SOP documents, exception habits, Slack folklore, and three senior employees who know why the “simple” route actually violates finance policy in Q4. Turning that into reward structure is hard. It is also where much of the value sits.
The paper therefore supports a sober product thesis: better agent training depends less on sprinkling RL over a model and more on making workflows machine-checkable. The model needs feedback at the level where mistakes occur. If the only reward is final success, the training loop may not learn the structure that made success possible.
Where the theory stops
The paper is careful, and the business interpretation should be careful as well.
First, the main theory uses a graph abstraction and simplified Transformer settings. That abstraction is valuable because it exposes learning dynamics, not because it captures every detail of enterprise software. Real agents operate with ambiguous natural language states, partial observability, changing APIs, user preferences, retrieval noise, and policies that were apparently written during a committee’s lunch break.
Second, the empirical backbone is intentionally small: one-layer, one-head Transformers in the main setup, plus additional validation including two-layer attention analysis and Blocksworld-derived graphs. This supports mechanism claims. It does not by itself settle scaling behaviour in frontier LLMs.
Third, Q-learning’s advantage depends on process rewards and persistent exploration. Outcome-only Q-learning fails in the paper. Poorly designed process rewards can also fail in practice. The phrase “process reward” should not be allowed to become another decorative label for “we added some heuristics and hoped legal would not ask”.
Fourth, diversity is not always free or always desirable. In regulated workflows, alternative paths must be valid, auditable, and policy-compliant. Diversity means preserving multiple correct options, not improvising creatively in a loan approval system. Creativity is charming in fiction. In compliance automation, it is an incident report warming up.
The paper’s real message: RL changes the training graph
The cleanest way to read the paper is mechanism-first:
SFT learns the path frequencies it sees.
Policy-gradient creates new successful path data through exploration, then risks collapsing onto a narrow subset of those paths.
KL regularisation preserves some of the base model’s breadth, but also restrains learning.
Q-learning with process rewards can learn local transition structure, preserve multiple valid next actions, and tolerate off-policy data.
That causal chain is more useful than the headline “RL improves planning”. It tells builders what to instrument. It tells executives where risk hides. It tells product teams why an agent that passes a benchmark may still be one outage away from looking less like an assistant and more like a very expensive macro.
For Cognaptus, the practical stance is simple: do not judge agent planning by the arrival alone. Inspect the routes. Preserve valid alternatives. Make the process graph explicit where possible. And when using reinforcement learning, remember that the reward is not a motivational poster. It is the curriculum.
Agents do not become planners because they receive applause at the finish line. They become planners when training teaches them which steps keep the future open.
[^1]: Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng, Li Dong, Yaru Hao, and Wei Chen, “Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective,” arXiv:2509.22613v2, published as a conference paper at ICLR 2026. https://arxiv.org/abs/2509.22613