TL;DR
- SFT memorizes co-occurrences; RL explores. That’s why RL generalizes better on planning tasks.
- Policy-gradient (PG) can hit 100% training accuracy while silently killing output diversity. KL helps—but caps gains.
- Q-learning with process rewards preserves diversity and works off‑policy. With outcome‑only rewards, it reward-hacks and collapses.
Why this paper matters to builders
If you’re shipping agentic features—tool use chains, workflow orchestration, or multi-step retrieval—you’re already relying on planning. The paper models planning as path-finding on a graph and derives learning dynamics for SFT vs RL variants. The results give a crisp blueprint for product choices: which objective to use, when to add KL, and how to avoid brittle one-path agents.
Core ideas, in plain business terms
1) SFT looks smart but memorizes patterns
What happens: SFT converges to the frequency of (target, current → next) triplets it saw. It lacks a mechanism to infer transitive reachability. In business apps, that’s the intern who only remembers common playbooks and blanks on novel compositions.
Implication: SFT-heavy agents overfit to popular tool paths and miss rare but critical routes (e.g., a seldom-used compliance checker in a loan pipeline). They’ll appear competent on demos yet break on edge cases.
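To make "frequency matching" concrete, here is a minimal counter-based sketch (the workflow states and demo data are hypothetical, not from the paper): an SFT-style predictor simply mirrors the conditional frequencies it saw and has nothing to say about unseen (goal, state) pairs.

```python
from collections import Counter, defaultdict

# Toy demonstrations an SFT model would memorize: (target, current, next) triplets.
demos = [
    ("close_case", "verify_identity", "fetch_crm"),
    ("close_case", "fetch_crm", "issue_refund"),
    ("close_case", "issue_refund", "email_customer"),
    ("close_case", "issue_refund", "email_customer"),    # popular branch, seen twice
    ("close_case", "issue_refund", "notify_logistics"),  # rare branch, seen once
]

# At convergence, SFT ~ conditional frequency of next given (target, current).
counts = defaultdict(Counter)
for target, current, nxt in demos:
    counts[(target, current)][nxt] += 1

def sft_policy(target, current):
    """Frequency-matching P(next | target, current); empty for unseen pairs."""
    c = counts.get((target, current))
    if not c:
        return {}  # no mechanism to infer reachability it never observed
    total = sum(c.values())
    return {nxt: n / total for nxt, n in c.items()}

print(sft_policy("close_case", "issue_refund"))
# -> roughly {'email_customer': 0.67, 'notify_logistics': 0.33}: mirrors the data
print(sft_policy("stock_audit", "issue_refund"))
# -> {}: a novel goal composition it never saw paired with this state
```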
2) Policy-gradient wins by exploration—then slowly forgets how to be diverse
What happens: Basic PG with outcome rewards is equivalent to running SFT on only the successful rollouts it explores. Exploration expands training coverage beyond the demonstrations; that's the win. But updates keep concentrating probability mass even after train accuracy hits 100%, so the policy drifts toward a single canonical path per goal. That's the loss: diversity collapse.
Implication: Your agent nails the KPI on the training suite, then in production struggles when the canonical path breaks (API outage, tool latency, permission error). Without path variety, resiliency drops.
Mitigation: Add KL-regularization to keep the policy close to a more diverse base model. Trade‑off: stronger KL preserves diversity but caps accuracy gains.
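A minimal PyTorch sketch of why "PG with outcome rewards ≈ SFT on successful rollouts": with a binary reward and no baseline, failed rollouts contribute zero gradient, so the loss below is just cross-entropy on whatever the policy happened to explore successfully. Tensor shapes and the toy example are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pg_outcome_loss(logits, actions, rewards):
    """REINFORCE with a binary outcome reward and no baseline.

    logits:  [batch, steps, vocab]  per-step next-node logits from the policy
    actions: [batch, steps]         steps actually taken in each rollout
    rewards: [batch]                1.0 if the rollout reached the goal, else 0.0
    """
    logp = F.log_softmax(logits, dim=-1)
    step_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # [batch, steps]
    rollout_logp = step_logp.sum(dim=-1)                            # [batch]
    # With r in {0, 1}, failed rollouts drop out entirely: this is the SFT
    # cross-entropy loss evaluated only on the successful rollouts.
    return -(rewards * rollout_logp).mean()

# Illustrative shapes: 4 rollouts, 3 steps, 5 candidate next nodes.
logits = torch.randn(4, 3, 5, requires_grad=True)
actions = torch.randint(0, 5, (4, 3))
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
pg_outcome_loss(logits, actions, rewards).backward()
```

The gain and the failure mode are the same mechanism: exploration keeps feeding new successful paths into this loss, but once one path dominates the successful set, every update sharpens it further.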
3) Q-learning with process rewards keeps both accuracy and breadth
What happens: Using process rewards (reward adjacency-valid steps and hitting the target; penalize illegal moves), Q-learning learns the graph structure itself (adjacency + reachability). It naturally supports off-policy learning (e.g., rollouts from a quantized policy, or large replayed batches) and preserves multiple valid next steps at convergence.
Gotcha: With outcome-only rewards, Q-learning collapses (reward hacking). So process shaping isn’t optional—it’s the point.
Implication: For enterprise agents that must route around failures, Q-learning + process rewards is the safer default: robust, multi-path, off‑policy-friendly.
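Here is a toy tabular Q-learning sketch on a hypothetical task graph, trained entirely off-policy from random-walk experience with process rewards (+1 for reaching the goal, −1 for illegal moves). Note that both valid next steps out of issue_refund keep high values at convergence; the graph and hyperparameters are illustrative assumptions.

```python
import random
from collections import defaultdict

# Hypothetical task graph: node -> legal next nodes.
GRAPH = {
    "verify_identity": ["fetch_crm"],
    "fetch_crm": ["issue_refund"],
    "issue_refund": ["email_customer", "notify_logistics"],
    "notify_logistics": ["stock_audit", "close_case"],
    "email_customer": ["close_case"],
    "stock_audit": ["close_case"],
    "close_case": [],
}
NODES, GOAL = list(GRAPH), "close_case"
ALPHA, GAMMA = 0.5, 0.9

def process_reward(state, action):
    if action not in GRAPH[state]:
        return -1.0, True   # illegal transition: penalize and terminate
    if action == GOAL:
        return 1.0, True    # reached the target
    return 0.0, False       # legal intermediate step

Q = defaultdict(float)      # Q[(state, action)]

# Off-policy: experience comes from a uniform-random behaviour policy,
# not from the greedy policy implied by Q.
for _ in range(20_000):
    s = random.choice([n for n in NODES if n != GOAL])
    a = random.choice(NODES)            # may be illegal; the reward handles it
    r, done = process_reward(s, a)
    target = r if done else r + GAMMA * max(Q[(a, a2)] for a2 in NODES)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# Both legal next steps out of "issue_refund" keep high value at convergence.
for a in GRAPH["issue_refund"]:
    print(a, round(Q[("issue_refund", a)], 2))
```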
A concrete metaphor: call center triage as a graph
- Nodes: states in a customer-resolution journey (verify identity → fetch CRM → issue refund → notify logistics → close case).
- Edges: legal transitions.
- SFT: learns frequent macro-routes (refund → email) but may miss a rare branch (refund → notify logistics → stock audit) because it never needed transitive reasoning.
- PG: discovers the stock‑audit route via exploration—but later collapses to only one preferred path.
- Q-learning + process rewards: keeps several high‑quality next steps (notify logistics or email customer), so when one service degrades, the policy naturally takes the alternative.
Builder’s table: choosing your post-training recipe
Dimension | SFT | Policy-Gradient (PG) | Q-Learning (Outcome‑only) | Q-Learning (Process rewards) |
---|---|---|---|---|
Where gains come from | Memorizes frequent co-occurrences | Exploration augments data | None (tends to hack) | Learns adjacency + reachability |
Output diversity at convergence | Medium (mirrors data) | Low (diversity collapse) | Trivial/degenerate | High (multiple valid next-steps retained) |
Off-policy compatibility | N/A | Weak | OK | Strong |
Reward shaping need | N/A | Useful via KL | Critical (still fails) | Critical & beneficial |
Failure mode | Spurious patterns, poor generalization | Overfit canonical path | Reward hacking | Requires well-designed step rewards |
Best for | Static flows, short horizons | Rapid wins on known tasks with careful KL | (Avoid) | Long-horizon, tool-rich, failure-tolerant agents |
Implementation playbook
- Model your workflow as a graph. Enumerate legal next steps (edges) and terminal targets; this is the substrate for process rewards.
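A minimal sketch of that substrate, with hypothetical states: an adjacency map, terminal targets, and a reachability check so you catch dead ends before you train anything on top of it.

```python
from collections import deque

# Hypothetical workflow spec: state -> legal next states, plus terminal targets.
EDGES = {
    "verify_identity": {"fetch_crm"},
    "fetch_crm": {"issue_refund", "escalate"},
    "issue_refund": {"email_customer", "notify_logistics"},
    "notify_logistics": {"stock_audit", "close_case"},
    "email_customer": {"close_case"},
    "stock_audit": {"close_case"},
    "escalate": {"close_case"},
    "close_case": set(),
}
TARGETS = {"close_case"}

def reachable_from(start):
    """BFS over legal edges -- the transitive reachability SFT never learns."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in EDGES[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Sanity-check the substrate before wiring process rewards on top of it.
for state in EDGES:
    assert TARGETS & reachable_from(state), f"dead end: no target reachable from {state}"
```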
- Start with a capable base. Pretrain/SFT on curated demonstrations for coverage; keep it reasonably diverse (temperature > 0 at data-gen time).
- If you use PG (a sketch follows below):
  - Add KL against the base with a small coefficient (tune on held-out goals, not just steps).
  - Monitor diversity metrics (e.g., distinct valid paths per goal under temperature sampling). Alert on collapse.
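A hedged PyTorch sketch of the KL-regularized PG loss plus a crude diversity proxy; the shapes and the kl_coef default are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def pg_loss_with_kl(policy_logits, ref_logits, actions, rewards, kl_coef=0.05):
    """Outcome-reward PG plus a KL penalty toward a frozen base model.

    Shapes: logits [batch, steps, vocab], actions [batch, steps], rewards [batch].
    kl_coef=0.05 is a placeholder; tune it on held-out goals.
    """
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    step_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_term = -(rewards * step_logp.sum(-1)).mean()
    # KL(policy || base), averaged over the batch; a larger kl_coef keeps the
    # policy closer to the (more diverse) base but caps accuracy gains.
    kl_term = F.kl_div(ref_logp, logp, log_target=True, reduction="batchmean")
    return pg_term + kl_coef * kl_term

def distinct_valid_paths(sampled_paths, is_valid):
    """Diversity proxy: distinct valid paths sampled for one goal (alert if this hits 1)."""
    return len({tuple(p) for p in sampled_paths if is_valid(p)})
```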
- Prefer Q-learning with process rewards for production agents (a sketch follows below):
  - Target check: +1 when reaching the goal.
  - Adjacency check: −1 for illegal transitions.
  - Optional small shaping for progress (e.g., heuristic distance-to-goal) if your graph supports it.
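One way to implement that reward spec, including an optional potential-based shaping term derived from BFS distance-to-goal. The helper names and coefficients are assumptions for illustration, not the paper's exact construction.

```python
from collections import deque

def distance_to_goal(edges, goal):
    """Reverse BFS: hops from every node to the goal (used only for shaping)."""
    rev = {n: set() for n in edges}
    for s, nexts in edges.items():
        for n in nexts:
            rev[n].add(s)
    dist, queue = {goal: 0}, deque([goal])
    while queue:
        cur = queue.popleft()
        for prev in rev[cur]:
            if prev not in dist:
                dist[prev] = dist[cur] + 1
                queue.append(prev)
    return dist

def process_reward(edges, dist, state, action, goal,
                   gamma=0.9, shaping_coef=0.1, shaping=True):
    if action not in edges[state]:
        return -1.0, True                 # adjacency check: illegal transition
    if action == goal:
        return 1.0, True                  # target check: goal reached
    r = 0.0
    if shaping and state in dist and action in dist:
        # Potential-based shaping with phi(s) = -distance_to_goal(s), scaled by a
        # small coefficient so progress bonuses stay well below the goal reward.
        r += shaping_coef * (gamma * (-dist[action]) - (-dist[state]))
    return r, False

# Tiny usage on a hypothetical three-node chain:
edges = {"a": {"b"}, "b": {"goal"}, "goal": set()}
dist = distance_to_goal(edges, "goal")
print(process_reward(edges, dist, "a", "b", "goal"))   # small positive progress bonus
```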
- Train off-policy safely. You can generate experience with a cheaper or quantized actor; Q-learning tolerates this.
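A minimal replay-buffer pattern for decoupling the cheap behaviour actor from the Q-learner; illustrative only, with hypothetical states.

```python
import random
from collections import deque

class ReplayBuffer:
    """A cheap/quantized behaviour actor fills the buffer; the Q-learner samples
    large batches from it. The max-over-actions Q target does not require the
    data to come from the current policy."""

    def __init__(self, capacity=100_000):
        self.buf = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buf.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(list(self.buf), min(batch_size, len(self.buf)))

buffer = ReplayBuffer()
buffer.add("issue_refund", "email_customer", 0.0, "email_customer", False)
batch = buffer.sample(256)   # the learner can train on stale, off-policy batches
```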
- Guard against reward hacking. Unit-test the reward function; add adversarial rollouts that try to exploit shortcuts.
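A pytest-style sketch of those reward unit tests, assuming the process_reward sketch above lives in a hypothetical local rewards.py; the edge map and test names are illustrative.

```python
# Hypothetical module: process_reward refers to the shaping sketch above.
from rewards import process_reward

EDGES = {"start": {"mid"}, "mid": {"goal", "shortcut"}, "shortcut": {"goal"}, "goal": set()}
GOAL = "goal"

def test_goal_pays_out():
    r, done = process_reward(EDGES, {}, "mid", "goal", GOAL, shaping=False)
    assert r == 1.0 and done

def test_illegal_transition_is_penalised():
    r, done = process_reward(EDGES, {}, "start", "goal", GOAL, shaping=False)
    assert r == -1.0 and done

def test_legal_detour_cannot_outearn_the_goal():
    # Adversarial shortcut: a sideways legal step must not pay more than finishing.
    r, done = process_reward(EDGES, {}, "mid", "shortcut", GOAL, shaping=False)
    assert r < 1.0 and not done
```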
- Evaluate beyond accuracy. Track: (a) test accuracy on unseen goal pairs, (b) path diversity, (c) recovery rate under injected tool failures, (d) latency/compute with and without exploration.
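A small harness for metric (c), recovery under injected tool failures: knock out the agent's preferred first step and measure how often it still reaches the goal. The random policy below is a stand-in for your trained agent, and the graph is hypothetical.

```python
import random

def rollout(policy, edges, start, goal, blocked=frozenset(), max_steps=20):
    """Follow the policy, skipping edges that are 'down'; True if the goal is reached."""
    state = start
    for _ in range(max_steps):
        choices = [n for n in edges[state] if (state, n) not in blocked]
        if not choices:
            return False
        state = policy(state, choices)
        if state == goal:
            return True
    return False

def recovery_rate(policy, edges, start, goal, trials=100):
    """Block the policy's preferred first step, then check whether it still succeeds."""
    hits = 0
    for _ in range(trials):
        preferred = policy(start, list(edges[start]))
        hits += rollout(policy, edges, start, goal, blocked=frozenset({(start, preferred)}))
    return hits / trials

EDGES = {"issue_refund": ["email_customer", "notify_logistics"],
         "email_customer": ["close_case"], "notify_logistics": ["close_case"],
         "close_case": []}
policy = lambda state, choices: random.choice(choices)
print(recovery_rate(policy, EDGES, "issue_refund", "close_case"))  # 1.0: an alternative path exists
```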
How this updates our stance at Cognaptus
We’ve argued that “agents fail silently when the graph is implicit.” This paper backs that intuition with math: explicit structure + process rewards is not nice-to-have; it’s the lever that turns brittle scripts into resilient planners. Our default recommendation shifts toward Q-learning with process rewards for multi-step business automations, reserving PG + light KL for simpler flows or fast iterations.
Open questions we’re watching
- Scaling the abstraction: How faithful is the graph view for messy real stacks where ‘states’ are vector DB snapshots, tool rate limits, and user context?
- Reward design at scale: Can we auto-derive step rewards from API schemas and guardrail validators?
- Diversity as a product metric: What’s the right threshold (and cost) for “enough alternative paths” in SLAs?
Bottom line for executives
If your automation relies on a single “happy path,” you’re one outage away from fire drills. Encode the graph. Shape the reward. Preserve options. That’s the pragmatic recipe this theory vindicates.
Cognaptus: Automate the Present, Incubate the Future