Teaching Reinforcement Learning to Think Before It Acts

Agents are easy to impress and hard to trust.

Give a reinforcement learning agent a game, a reward signal, and enough time, and it may discover something brilliant. Or it may discover the dumbest possible way to look successful. In Seaquest, that can mean shooting enemies while ignoring oxygen. In Kangaroo, it can mean punching enemies in a corner instead of climbing toward the joey. Technically, points go up. Strategically, the agent has learned the machine-learning equivalent of optimizing a dashboard while the business burns quietly in the background.

That is the problem behind Boosting Deep Reinforcement Learning using Pretraining with Logical Options, by Ye and colleagues.¹ The paper proposes Hybrid Hierarchical Reinforcement Learning, or H2RL, a two-stage neuro-symbolic reinforcement learning framework. Its central idea is not “make the agent reason symbolically forever.” That would be expensive, brittle, and rather unfashionable in the way only latency can be. The sharper idea is this: use logic and pretrained options during pretraining, let the neural policy internalize the useful structure, then remove the symbolic machinery at deployment.

In other words, H2RL teaches the agent to think before it acts, but does not force it to carry the textbook into the exam room.

The real failure is not low reward but wrong reward-following

A standard reinforcement learning failure story usually starts with sparse rewards: the agent cannot find the good behavior because the useful reward appears too late. This paper focuses on a more irritating failure mode: deceptive dense rewards.

Dense reward is supposed to help. It gives the agent more frequent feedback. But when early feedback is easier to exploit than the real objective, the agent learns the shortcut. The policy is not confused in a random way. It is locally competent and globally stupid.

The paper’s examples are deliberately simple enough to be interpretable. In Seaquest, a policy can earn immediate points by attacking enemies, but the task also requires collecting divers and refilling oxygen. In Kangaroo, the agent can score by attacking enemies, but the intended progression requires climbing through the level. The baseline agent may look productive in the reward curve while failing the actual task. A familiar corporate pattern, sadly.

For business AI, this is the part worth noticing. Many automation systems already optimize measurable proxies: tickets closed, clicks increased, trades executed, inventory moved, time saved. A policy that learns to exploit the easiest measurable signal is not “misaligned” in a science-fiction sense. It is doing exactly what the training setup invited it to do.

H2RL attacks this by changing the early learning process, not by merely adding more reward terms.

H2RL uses logic as scaffolding, not as permanent bureaucracy

The mechanism has four moving parts:

Component	What it does	Why it matters
Differentiable logic manager	Reads symbolic state and selects among high-level options	Provides structured task guidance during pretraining
Pretrained option workers	Execute subskills such as climbing, getting air, or using a hammer	Converts abstract logic into usable low-level behavior
Neural RL policy	Learns from visual input and eventually acts alone	Preserves flexibility and deployment speed
Mixture-of-experts gate	Blends logic-guided behavior and neural behavior during pretraining	Decides when symbolic guidance should influence action

The design is hierarchical. The logic manager does not directly choose primitive actions in the final deployed policy. It selects from option workers during pretraining. Those workers are separately trained sub-policies, built for subtasks such as “get air,” “get diver,” “ascend,” “climb,” “jump barrel,” or “use hammer.” The neural agent learns while exposed to this structured behavior.

The paper distinguishes three important policy states:

Policy	Meaning
H2RL	Full hybrid policy during pretraining, using logic manager, options, neural policy, and gate
H2RL+	The neural component after pretraining
H2RL++	The neural policy after additional post-training through normal environment interaction

That last version is the punchline. H2RL++ no longer needs symbolic reasoning at inference time. The symbolic structure has done its job by shaping the neural policy’s early learning trajectory. The final policy keeps neural execution speed while carrying some of the behavioral prior learned from the logic-guided phase.

This is the paper’s most business-relevant design pattern: logic can be used as temporary training infrastructure.

The obvious misconception is that symbolic AI must remain in the runtime loop to be useful. H2RL says something more practical. Use rules when the model is young and gullible. Let the trained policy grow out of them. Nobody wants a production agent pausing every second to consult a tiny logic committee unless the task truly requires it.

The gate is where the curriculum becomes operational

The most interesting engineering choice is the gate.

The framework runs the neural policy and the logic-induced policy in parallel during pretraining. The mixture-of-experts gate then blends their action distributions. In plain language, the system decides how much to trust the neural learner and how much to trust the logic-guided option system at a given moment.

That matters because pure symbolic control is too rigid, while pure neural learning can chase shortcuts. A gate gives the system a way to expose the neural policy to structured behavior without reducing the whole problem to hand-written control.

The paper implements symbolic reasoning in a differentiable way. Rules are encoded as tensors; soft rule weights select among candidate rules; forward reasoning uses differentiable approximations of logical AND and OR. This allows the logic manager and gate to participate in gradient-based learning rather than sitting outside the learning system as a separate, brittle planner.

For readers who do not live inside reinforcement learning diagrams, the mechanism can be reduced to a short sequence:

Domain experts define useful subtasks and logic rules.
Option workers are pretrained on those subtasks.
During pretraining, the logic manager selects options based on symbolic state.
A gate blends the logic-guided policy with the neural policy.
The neural policy absorbs useful behavioral structure.
Post-training continues with standard environment interaction.
Deployment can use the neural policy without symbolic inference overhead.

The important word is “absorbs.” H2RL is not just adding a rule engine beside a policy. It is trying to push a structured behavioral prior into the policy parameters.

The main results are large, but the interpretation is narrower than the scoreboard

The headline numbers are dramatic. On the classical Atari tasks, the appendix reports H2RL++ reaching 131,842 ± 1,221 in Kangaroo and 216,793 ± 125,655 in DonkeyKong. The baseline PPO scores are 14,592 ± 491 and 4,536 ± 296 respectively. H2RL+ also performs strongly in DonkeyKong, reaching 87,780 ± 32,786 before post-training.

That is not a small improvement. It is a different regime.

But the paper itself gives a reason not to read the scoreboard naively. In Kangaroo, PPO and DQN can achieve high returns while still learning misaligned behavior. The issue is not whether the reward number is high. The issue is whether the agent reaches the intended task states.

The authors therefore include success rates for reaching floors in Kangaroo. This is the better evidence for the paper’s alignment claim.

Test	Likely purpose	What it supports	What it does not prove
Classical Atari returns	Main evidence	H2RL variants can achieve much higher returns on long-horizon deceptive tasks	High score alone does not prove intended behavior
Kangaroo floor success rates	Main alignment evidence	H2RL-pretrained agents climb rather than farm shallow rewards	Does not establish general safety in open-ended environments
Ablation against PPO, hPPO, hReason, exPPO	Ablation	Logic-informed pretraining matters more than symbolic inputs or hierarchy alone	Does not remove dependence on rule and option design
Continuous Atari results	Extension / robustness-style test	The approach can work beyond discrete action settings	Does not yet prove robotics-scale transfer
Appendix option-training details	Implementation detail	The options are domain-engineered and separately trained	The framework is not plug-and-play automation magic

The floor-reaching results are especially revealing. PPO, DQN, and C51 all show 0% success in reaching floors 2, 3, and 4 in the reported Kangaroo test. H2RL-pretrained variants reach floor 2 at 100%. H2DQN+ and H2C51+ reach floors 3 and 4 at 100%. H2PPO reaches floor 3 at 60% ± 10% and floor 4 at 50% ± 10%.

That is the evidence that changes the interpretation. The point is not merely that the agent scored more. The point is that it pursued the intended progression path instead of exploiting a local reward loop.

The ablation says “not just symbols, not just hierarchy”

The ablation study is the most useful part of the paper for practitioners, because it separates mechanisms that sound similar in a meeting but behave differently in training.

The authors compare H2PPO with:

PPO, the neural-only baseline;
hPPO, a hierarchical neural manager;
hReason, a pure logic manager;
exPPO, PPO with both pixel and symbolic inputs.

The result is fairly blunt. Giving PPO symbolic state is not enough. Adding hierarchy is not enough. A pure logic manager is not enough. H2PPO’s advantage comes from the combination: logic-guided options shaping a neural policy through pretraining.

In DonkeyKong, H2PPO reaches 33,657 ± 14,578, compared with 4,536 ± 296 for PPO, 418 ± 139 for hPPO, 905 ± 1,335 for hReason, and 4,268 ± 249 for exPPO. In Kangaroo, exPPO achieves a high score, 14,247 ± 1,085, but its reported success rate for reaching the third floor is 0. H2PPO scores lower in raw return, 5,351 ± 4,132, but has a third-floor success rate of 0.6 ± 0.1.

That is the delightful annoyance of alignment evaluation: the lower score can be the better behavior.

For business systems, this is the difference between adding more features to an agent and changing how it learns. A customer-support agent with access to policy documents may still learn to optimize fast closure. A trading agent with access to risk rules may still learn to overtrade in regimes where the backtest rewards activity. An operations agent with access to workflow constraints may still learn to push work downstream if the metric rewards local throughput.

Symbolic access is not the same as symbolic training influence.

Continuous action tests make the idea more interesting, not universal

The paper also evaluates H2RL in the Continuous Atari Learning Environment, using continuous-action versions of Kangaroo and DonkeyKong. This matters because one criticism of symbolic approaches is that they fit neat discrete choices better than messy continuous control.

The results support the authors’ claim that the framework is not confined to discrete action spaces. In continuous Kangaroo, H2RL achieves 84,665 ± 49,767, compared with 1,785 ± 72 for PPO, 19,854 ± 18,586 for hPPO, and 557 ± 167 for hReason. In continuous DonkeyKong, H2RL reaches 10,818 ± 7,431, compared with 3,836 ± 530 for PPO, 991.0 ± 446 for hPPO, and 542 ± 975 for hReason.

This is promising because many business-relevant control problems are not neat button-press games. Robotics, inventory control, routing, scheduling, and resource allocation often have continuous or mixed action spaces.

Still, this should be read as an extension test, not proof of real-world readiness. Atari-like environments are controlled. Symbol extraction is available through OCAtari. Options are engineered using modified environments and custom reward functions. The paper’s results justify attention, not procurement.

A sentence many AI vendors tragically forget.

The business lesson is curriculum design for agents

The practical lesson is not “use H2RL tomorrow.” It is more general: agent training should separate early guidance from final execution.

Many business agents are trained or configured as if the final deployed behavior should emerge from direct optimization over operational metrics. That is often a bad bargain. Early learning is where shortcuts become habits. Once an agent has discovered a cheap proxy, later refinement may not reliably erase it.

H2RL suggests a different pattern:

Business design question	H2RL-inspired answer
How do we prevent shortcut behavior?	Shape early learning with explicit domain logic and task-relevant subskills
How do we avoid slow symbolic inference in production?	Use logic during pretraining, not necessarily at inference
How do we make rules useful without hard-coding the whole policy?	Let rules select options and guide behavior, while the neural policy learns from interaction
How do we evaluate success?	Measure task progression, not only proxy reward
Where does the risk remain?	In badly designed rules, weak symbolic state extraction, and poor option definitions

For Cognaptus-style business automation, the natural analogy is process onboarding. A human employee does not learn procurement, compliance, or treasury operations by randomly clicking through enterprise software until a KPI improves. They start with rules, examples, checklists, escalation paths, and supervised subroutines. Later, they gain flexibility.

H2RL turns that common-sense training pattern into an RL architecture.

This is especially relevant for enterprise agents that must operate under procedural constraints: claims handling, invoice review, warehouse replenishment, financial monitoring, compliance triage, and trading operations. In these settings, a policy that learns “what usually increases the metric” is not enough. It must learn what sequence of intermediate states counts as real progress.

What the paper directly shows, and what we should infer carefully

The paper directly shows that logic-informed pretraining improves performance and task progression in selected Atari and Continuous Atari environments with deceptive rewards. It also shows, through ablations, that the improvement is not replicated by simply adding symbolic inputs, using a hierarchical neural manager, or relying on a pure logic manager.

Cognaptus can infer a practical architectural principle: symbolic rules can be valuable as training scaffolds even when they are too slow, too rigid, or too incomplete for deployment-time control. This matters for agentic systems where we want domain knowledge to influence behavior without turning every runtime decision into a symbolic planning exercise.

What remains uncertain is the cost and reliability of transferring this pattern to messy business domains. H2RL depends on several assets that are not free:

symbolic state representations;
useful logic rules;
pretrained option workers;
modified training environments;
evaluation metrics that detect real task progress rather than proxy success.

In Atari, those assets can be engineered. In a bank, hospital, factory, or trading desk, they require domain work. The model will not politely discover your compliance ontology because someone wrote “be safe” in the system prompt.

The boundary: bad rules can create beautifully wrong agents

The paper’s broader impact statement makes an important point: badly designed or biased logical rules may make an agent rigidly follow flawed paths. That is not a footnote-level concern. It is the main operational boundary of the approach.

Logic scaffolding is powerful because it shapes early learning. That also makes it dangerous when the scaffolding is wrong. A biased rule can become a behavioral prior. A poorly defined option can teach the agent an attractive but incomplete maneuver. A symbolic state extractor can omit the variable that actually matters.

This is different from ordinary prompt brittleness. Pretraining influence can become embedded in the learned policy. Once absorbed, the mistake is harder to inspect than a visible rule firing at runtime.

For business use, the governance implication is clear: if logic is used as training infrastructure, the logic layer must be treated as model-critical data. It needs versioning, review, testing, and adversarial evaluation. The fact that it disappears from inference does not mean it disappears from accountability. Very convenient, but no.

The better evaluation question is “where did the agent go?”

The strongest idea in the paper is not the largest number in the table. It is the shift from reward-only evaluation to path-aware evaluation.

In Kangaroo, the authors ask whether agents reach higher floors. That metric checks whether the policy is moving through the intended task structure. For business agents, the equivalent question is not merely “did the KPI improve?” It is:

Did the agent pass the required intermediate checks?
Did it solve the customer’s issue or merely close the ticket?
Did it reduce risk or merely reduce reported risk?
Did it follow the operational sequence that makes the outcome auditable?
Did it preserve optionality for later decisions?

H2RL’s mechanism and evaluation point in the same direction. If the task has structure, train with structure and evaluate structure. Reward curves alone are too easily flattered.

Conclusion: rules are more useful as teachers than as handcuffs

H2RL is a useful paper because it avoids the stale argument between “neural networks learn everything” and “symbolic reasoning will save us.” It gives logic a more modest and more useful job: teach early, then step aside.

That is a mature framing. Symbolic systems are good at expressing domain intent, sequencing, and constraints. Neural policies are good at fast perception-action mapping and refinement through interaction. H2RL combines them by using symbolic logic and options as scaffolding during pretraining, then letting the neural policy continue alone.

For business AI agents, the lesson is not that every workflow needs a differentiable logic manager tomorrow morning. The lesson is that the early training environment matters. If agents are trained only against shallow measurable rewards, we should not be surprised when they become excellent at shallow measurable behavior.

The future of agent design may therefore look less like giving models bigger reward functions and more like giving them better apprenticeships.

Cognaptus: Automate the Present, Incubate the Future.

Zihan Ye, Phil Chau, Raban Emunds, Jannis Blüml, Cedric Derstroff, Quentin Delfosse, Oleg Arenz, and Kristian Kersting, “Boosting Deep Reinforcement Learning using Pretraining with Logical Options,” arXiv:2603.06565v1, 2026. ↩︎

The real failure is not low reward but wrong reward-following#

H2RL uses logic as scaffolding, not as permanent bureaucracy#

The gate is where the curriculum becomes operational#

The main results are large, but the interpretation is narrower than the scoreboard#

The ablation says “not just symbols, not just hierarchy”#

Continuous action tests make the idea more interesting, not universal#

The business lesson is curriculum design for agents#

What the paper directly shows, and what we should infer carefully#

The boundary: bad rules can create beautifully wrong agents#

The better evaluation question is “where did the agent go?”#

Conclusion: rules are more useful as teachers than as handcuffs#