Proof, Policy, and Probability: How DeepProofLog Rewrites the Rules of Reasoning

Proofs are supposed to be the respectable part of AI: tidy, inspectable, and resistant to the usual neural-network fog machine. Then reality turns up, as it so often does, carrying a bill.

In neurosymbolic AI, the bill is search. A system may know the rules. It may even combine them with neural perception. But if answering a query requires enumerating a vast space of possible proofs, the promise of “interpretable reasoning” quickly becomes a very elegant way to run out of time.

That is the problem taken up by DeepProofLog: Efficient Proving in Deep Stochastic Logic Programs.¹ The paper introduces DeepProofLog, or DPrL, a neurosymbolic framework that does not merely attach neural networks to a logic program. It changes the object being learned. Instead of asking a model to assign probabilities to worlds, or to grind through all possible derivations after the fact, DPrL learns how a prover should move through proof space.

That sounds like a subtle distinction. It is not. It is the paper’s central move.

DPrL treats proof search as behaviour. A derivation becomes a trajectory. A resolution step becomes an action. The prover becomes a policy. Once that translation is made, the machinery of Markov Decision Processes, dynamic programming, and reinforcement learning can be applied to symbolic reasoning without losing the thing symbolic reasoning is useful for in the first place: an actual proof trace.

For business readers, the practical implication is not “AI can now reason”, which is the sort of sentence that should be fined for public nuisance. The implication is narrower and more interesting: systems that already have useful rules, schemas, ontologies, or relational structure may be able to search through them more efficiently while still producing inspectable reasoning paths.

That is a much better deal.

The expensive part is not always probability; it is finding the proofs

Many probabilistic logic systems are built around a two-stage pattern. First, use symbolic proving to find proofs for a query. Second, perform probabilistic inference over those proofs to compute how likely the query is.

The paper argues that much of the field has focused on scaling the second stage. That is understandable. Probabilistic inference over possible worlds is famously hard; the paper notes that this can correspond to weighted model counting, which is #P-hard. But this leaves a rather inconvenient assumption untouched: that the system can find the relevant proofs cheaply enough to begin with.

In small toy domains, maybe. In large relational settings, no. There the number of possible derivations can explode, and the system can drown before probability calculation even has its moment of theatrical suffering.

DeepProofLog’s first correction is semantic. It builds on stochastic logic programs, which define probability distributions over derivations rather than over possible worlds. In other words, the probability attached to a proof step is not a claim that a fact is true in some world. It is a claim about the behaviour of a prover: given the current goal, which derivation step is it likely to take?

That distinction matters because it prevents a common misreading of the paper. DPrL is not primarily modelling uncertainty about reality. It is modelling uncertainty over proof behaviour. The world is represented by the logic program and subsymbolic evidence; the learned distribution guides how the prover navigates that structure.

The business version is simple: DPrL is less like a probabilistic database saying “this fact has a 0.74 chance of being true” and more like a trained analyst saying “given this question and these rules, this is the next line of reasoning worth pursuing”.

One is world uncertainty. The other is reasoning strategy. Confuse them and the whole paper becomes fuzzier than it needs to be.

DeepProofLog turns resolution into a policy

The core mechanism begins with SLD resolution, the standard proof procedure behind many logic programming systems. A query is represented as a goal. At each step, the prover selects a clause and substitution that transforms the current goal into a new goal. Repeat until the goal is proven, fails, or loops into the usual swamp.

DPrL parameterises that transition:

$$ p(G_{i+1} \mid G_i) $$

Here, $G_i$ is the current goal, and $G_{i+1}$ is a candidate next goal after one resolution step. Rather than assigning fixed clause probabilities, DPrL embeds the current goal and candidate next goals, scores their compatibility, and uses those scores to define a probability distribution over the next proof step.

The important part is goal-level conditioning. Classical stochastic logic programs assign scalar probabilities to clauses. DeepStochLog, a prior neural stochastic logic system, conditions neural predictions on subsymbolic terms and tends to require a fixed output domain. DPrL instead conditions each decision on the full current goal, letting the model adapt to context and to variable output spaces.

That gives the prover something closer to situational awareness. It is not simply asking, “How likely is this clause in general?” It is asking, “Given the proof state I am currently in, which next state looks compatible with success?”

A compact comparison helps:

System type	What is parameterised	What it misses	Operational consequence
Stochastic Logic Programs	Fixed clause probabilities	Query-specific context and subsymbolic variation	Simple, but blunt
DeepStochLog	Neural probabilities tied to subsymbolic terms and fixed output domains	Flexible substitution spaces and full-goal context	More expressive, but still constrained
DeepProofLog	Neural transition probabilities between current and next goals	Still uses leftmost-atom selection in this implementation	Context-aware proof guidance with broader scalability options

This is the mechanism-first reading of the paper: DPrL’s contribution is not merely “a new model does well on benchmarks”. The contribution is that proof search is recast as sequential decision-making.

Once that happens, the next step is almost inevitable.

The MDP mapping is the paper’s real hinge

The authors formalise DPrL’s proof process as a Markov Decision Process. In that mapping:

states are goals;
actions are possible resolution steps, including rule applications and substitutions;
transitions are deterministic, because applying a selected resolution step produces a specific next goal;
rewards are assigned at terminal outcomes, depending on whether a positive or negative query is successfully proven;
the policy is the learned distribution over proof actions.

The reward design is deliberately aligned with neurosymbolic learning. For positive queries, successful proofs should receive high probability. For negative queries, successful proofs should receive low probability. The paper then shows that maximising expected return in the MDP corresponds to minimising the learning objective used for DPrL, and aligns with the direction of cross-entropy optimisation used in comparable neurosymbolic systems.

That alignment is not just mathematical housekeeping. It is what makes the framework practical. It means the logic-programming objective and the reinforcement-learning objective are not pulling in different directions. The policy is not learning a game that merely resembles proof search. It is learning the proof-search behaviour that the probabilistic logic objective wants.

From there, DPrL has two routes.

When the induced goal space is manageable, dynamic programming can compute exact inference efficiently. When the goal space is too large, reinforcement learning can approximate a useful policy from sampled derivations. The paper uses PPO for knowledge graph reasoning and an off-policy REINFORCE variant for the MNIST addition experiment.

This gives DPrL a useful split personality: exact where structure permits, approximate where structure explodes. Finally, a split personality with a technical reason.

The MNIST addition test is about scalable latent reasoning, not digit recognition

The first experiment uses MNIST addition, a standard neurosymbolic benchmark. The model sees two numbers represented as sequences of MNIST digit images and receives only the final sum as supervision. It does not receive explicit labels for individual digits.

That makes the task a clean test of latent reasoning. The system must learn perception through a symbolic constraint: the images must correspond to digits that add up to the observed sum.

The experiment’s likely purpose is main evidence for scalability under increasing proof complexity. As the sequence length $N$ grows, the space of possible digit assignments grows rapidly. DPrL is tested in two variants:

DPrL (DP) uses dynamic programming for exact inference over a structured digit-wise addition formulation with carry propagation.
DPrL (PG) uses a policy-gradient approach for approximate inference.

The results are not subtle.

MNIST sequence length	DPrL (DP) accuracy	DPrL (DP) time	Key interpretation
4	94.0 ± 0.3	25.4 ± 6.3 min	Comparable or better accuracy with much faster training than exact baselines
15	80.8 ± 1.1	41.8 ± 3.7 min	Scales where several baselines already time out or become slow
100	37.8 ± 2.9	105.8 ± 21.8 min	Reaches a regime where previous methods time out
200	23.2 ± 3.0	128.0 ± 33.3 min	Maintains non-trivial performance at larger proof depth
500	6.0 ± 4.9	469.0 ± 51.6 min	Still runs, though accuracy is naturally much lower

The paper applies a 24-hour timeout. It reports that no previous method successfully scales to $N=100$ without timing out. DPrL (DP), by contrast, continues through $N=500$.

The mechanism explains why. The authors formulate addition recursively with carry propagation. In their appendix, they note that the goal space grows linearly with $N$ under this encoding. That is the key: dynamic programming is not magic dust sprinkled on an exponential problem. It works because the symbolic formulation exposes a tractable recursive structure.

That is also the business lesson. If a domain can be reformulated so that intermediate reasoning states are reusable, exact inference may become feasible. If the domain cannot be so structured, dynamic programming will not politely ignore combinatorics just because a slide deck says “neurosymbolic”.

The policy-gradient version tells a different story. It trains efficiently and reaches strong results at small $N$: 93.5 ± 0.4 at $N=4$ and 76.9 ± 1.9 at $N=15$. But at $N=100$, $200$, and $500$, the reported accuracy is 0.0 despite finite training times. The paper explicitly states that this variant scales computationally beyond $N=100$ but converges to a suboptimal policy because the combinatorial search space is vast.

That is not a failure of the paper. It is useful evidence. It says approximate proof-policy learning can keep running when exact enumeration is impossible, but running is not the same as solving. An RL-guided prover still needs a learnable search landscape. We should not have needed a benchmark to learn that, but apparently civilisation enjoys repetition.

The knowledge graph experiment tests whether proof policies survive real relational scale

The second experiment moves from digit sequences to knowledge graph completion. This is closer to where business applications start to look plausible: enterprise knowledge bases, ontologies, relationship-heavy databases, compliance graphs, product hierarchies, biomedical relations, and the hundred other places where organisations hide logic inside spreadsheets and then act surprised when nobody can query it.

The paper evaluates DPrL on two benchmarks: Family and WN18RR. Family contains 3,007 entities, 12 relations, 19,845 facts, and 48 rules. WN18RR contains 40,559 entities, 11 relations, 86,835 facts, and 28 rules. The experiment’s likely purpose is both main evidence and comparison with prior work: can DPrL produce proof-based predictions at knowledge-graph scale without relying on exhaustive grounding?

The answer is: broadly yes, with nuance.

Dataset	Best relevant DPrL result	What the evidence supports
Family	DPrL reaches MRR 0.986, the strongest MRR in the table	DPrL can match or exceed strong proof-based methods on a structured relational benchmark
WN18RR	DPrL reaches MRR 0.834, close to RotatE at 0.833 and below SBRSG at 0.840	DPrL is competitive, but not a clean winner across every metric
Family grounding comparison	DPrL uses 12,334 ground evaluations versus smart grounding methods instantiating 18,071, 45,466, and 181,095 groundings at depths 1, 2, and 3	Learned proof navigation can avoid large-scale exhaustive grounding
Proof examples	DPrL generates multi-step proof trees for WN18RR and Family	Predictions remain inspectable at the proof level

The strongest evidence is not that DPrL crushes every baseline on every metric. It does not. On WN18RR, SBRSG slightly outperforms DPrL on MRR and Hits@10; RotatE is extremely competitive; DPrL’s Hits@1 is slightly below RotatE. The table is more nuanced than the usual abstract-friendly phrasing.

The stronger claim is operational: DPrL reaches the top performance cluster while preserving proof-level interpretability and reducing the need for broad grounding. In the Family test set, the comparison against smart grounding methods is particularly important. As proof depth increases, grounding-based systems instantiate rapidly more formulas. DPrL instead learns a policy for navigating proof space and needs far fewer ground evaluations in the reported comparison.

That is the practical signal. Enterprises rarely care whether a model wins a benchmark by 0.001 MRR. They care whether the system can avoid blowing up when rules deepen, whether it can explain a decision, and whether adaptation requires hand-tuned grounding tricks each time a domain changes.

DPrL does not eliminate all engineering burden. But it shifts part of the burden from manual grounding design to learned exploration strategy.

The proof trace is the product feature

Interpretability in AI is often sold as a dashboard afterthought: attach a heatmap, call it transparency, leave before anyone asks what it means. DPrL’s interpretability is different because the model’s decision is expressed through a derivation.

In the knowledge graph experiment, the paper shows proof trees for a WN18RR query and a Family query. The Family example includes an uncle relation derived through intermediate family relations. The point is not that these specific relations are commercially dramatic. They are not. The point is that the answer is tied to a symbolic path that can be inspected.

For enterprise systems, that matters in domains where answers need more than confidence scores:

Business domain	Why proof-level reasoning matters	What DPrL suggests
Compliance and policy reasoning	Decisions must be traceable to rules and exceptions	A learned prover could prioritise likely rule paths without hiding the derivation
Enterprise knowledge graphs	Entity relations are sparse, noisy, and rule-rich	Proof policies may reduce grounding costs while retaining explanation paths
Scientific databases	Queries often combine observed data with formal domain constraints	Symbolic proofs can expose why a conclusion follows
Financial and operational controls	Rules interact across entities, thresholds, and temporal conditions	Learned proof search could help navigate complex rule interactions
Legal and contract analytics	Clauses and obligations form relational structures	Inspectable proof paths are more useful than opaque classification alone

The careful wording is intentional: “suggests”, not “proves”. The paper directly shows benchmark performance on MNIST addition and two KG datasets. It does not prove enterprise deployment readiness, nor does it test messy production ontologies, changing rules, adversarial data quality, or human audit workflows.

Still, the pathway is clear. If a business already has formal rules or can encode useful ones, DPrL points toward a way of making reasoning systems more scalable without surrendering proof traces. That is a real pathway, even if the road still contains potholes, procurement, and at least one committee.

The misconception: DPrL is not possible-world probability with better branding

The most important conceptual boundary is the probability semantics.

Possible-world probabilistic logic asks, roughly, which worlds are likely and whether the query holds in them. DPrL’s derivation-based view asks which proof paths a stochastic prover is likely to take. The clause probability is about use in a derivation, not truth in the world.

This matters when interpreting outputs. A high-probability proof path is not automatically a calibrated belief about reality. It is a learned reasoning route under the program, data, and policy. That can still be extremely useful. Search engines rank pages without claiming metaphysical certainty about the universe. But the distinction must be maintained, especially in regulated settings.

The model is best understood as a trainable proof navigator. It can help find and weight derivations. It does not absolve the organisation from validating the rules, checking data quality, calibrating probabilities, or deciding what kind of uncertainty matters.

A useful internal checklist would separate four questions:

Question	DPrL helps with?	What remains external
Can the system find plausible proof paths efficiently?	Yes, that is the central contribution	The logic program must encode useful structure
Can the system produce inspectable derivations?	Yes, especially in proof-based KG settings	Human-facing explanation design still matters
Does the probability mean the fact is true in the real world?	Not directly	Calibration and semantic interpretation remain separate
Can it scale to any arbitrary rule base?	No	State-space design, program structure, and RL stability still matter

That last row is where many product pitches go to die, as they should.

The appendix tells us what kind of scalability is being claimed

The appendix is not decorative. It clarifies the nature of the scalability claims.

For MNIST addition, the dynamic programming result depends on a structured factorisation: digit-wise addition with carry propagation. The paper states a broader design lesson: if a neurosymbolic problem can be encoded recursively such that the only non-deterministic atom is the next goal and the substitution space is tractable, then DP can solve it efficiently.

That is an implementation detail, but it also functions as a boundary condition. DPrL’s exact-inference advantage is strongest when symbolic structure can be arranged into a manageable state space. This is not automatic. It depends on representation design.

For reinforcement learning, the appendix describes a memory-enhanced environment that prevents loops by tracking visited goals within a derivation episode. It also adds a special False action, allowing the agent to abandon a proof early and preserving stochastic behaviour even when only one clause is unifiable. These are not headline contributions, but they are important engineering details. Proof search without loop control is how one discovers that infinity has poor user experience.

For knowledge graphs, the appendix notes that the KG experiments use a fixed seed due to cost. It also specifies that all NeSy methods, including DPrL, use RotatE output as a prior and learn a logic-based adjustment on top. That matters because the KG results are not pure proof search from scratch. They are proof-guided adjustment over a strong embedding baseline.

This does not invalidate the experiment. It makes the interpretation sharper.

What the paper directly shows, and what business should infer

The article-level business interpretation should stay disciplined.

Layer	Statement
What the paper directly shows	DPrL can parameterise proof steps with neural goal-conditioned policies, map stochastic SLD resolution to an MDP, use DP or policy gradients for inference and learning, scale MNIST addition farther than compared NeSy baselines under the paper’s setup, and produce competitive proof-based KG completion results with inspectable proof trees.
What Cognaptus infers	For rule-heavy enterprise systems, learned proof navigation may be a practical middle path between exhaustive symbolic grounding and opaque neural prediction.
What remains uncertain	Robustness across messy real-world logic programs, calibration of proof probabilities, multi-seed KG stability, integration with changing enterprise schemas, and whether stronger RL methods improve hard proof spaces.

The central ROI story is not that DPrL makes reasoning cheaper in every setting. It is that it may reduce the cost of search and diagnosis in settings where rules already matter.

A compliance team does not only want an answer. It wants to know which rule chain led to the answer. A knowledge-graph team does not only want another embedding score. It wants to know whether a relation can be supported by a meaningful path. A scientific database team does not only want a classifier. It wants a reasoned query trace that can be checked.

DPrL is relevant because it keeps that trace inside the model’s operating logic rather than bolting it on afterwards.

The boundaries are practical, not ceremonial

The paper’s limitations are concrete.

First, DPrL still depends on a useful logic program. If the rules are poor, incomplete, contradictory, or politically negotiated into mush, the model will not rescue them. It will navigate the structure it is given. A bad map plus a smart driver still ends at the wrong building.

Second, exact dynamic programming only helps when the induced goal space is manageable. The MNIST addition result works because the task admits a recursive formulation with carry propagation. That is a powerful demonstration, but not a universal guarantee.

Third, approximate policy-gradient inference can fail in large combinatorial spaces. The MNIST PG results make this visible: the method continues to train at large $N$, but accuracy collapses to 0.0. Computational scalability without useful convergence is a very polite kind of failure.

Fourth, the current method restricts clause selection to the leftmost atom. The authors note that this preserves expressivity and completeness, but more flexible selection could prune failing derivations more efficiently. This is a real design constraint because atom-selection strategy can strongly affect search cost.

Fifth, the KG experiments rely on RotatE as a prior and use a fixed seed due to computational expense. The reported results are encouraging, but they should not be read as full robustness evidence across random initialisations, rule sets, or production-scale knowledge graphs.

These boundaries do not make the paper weaker. They make it usable. A claim with edges is preferable to a claim floating around the room pretending to be a strategy.

DeepProofLog’s broader signal: reasoning is becoming trainable infrastructure

DeepProofLog is part of a broader shift in AI systems design: reasoning is no longer treated as a separate symbolic module that either runs exhaustively or not at all. It becomes a trainable process. The symbolic structure remains, but the path through it is learned.

That is the interesting synthesis. Neural networks contribute prioritisation, adaptation, and subsymbolic compatibility. Logic contributes compositional structure, proof semantics, and inspectable derivations. Reinforcement learning contributes the language of policy optimisation over sequential decisions.

None of these components is new by itself. The paper’s value is in aligning them cleanly around stochastic SLD resolution.

For businesses, the next question is not whether to “adopt DeepProofLog” tomorrow morning. The more useful question is where proof search is currently the bottleneck. Look for systems where:

rules already exist;
queries involve multi-step relational reasoning;
exhaustive grounding is expensive;
explanations matter;
and approximate black-box answers are not enough.

That is where DPrL’s framing becomes strategically interesting.

It does not promise an omniscient reasoner. It promises something less glamorous and more valuable: a way to train the search behaviour of a symbolic prover.

In enterprise AI, that may be exactly the unsexy missing piece.

Cognaptus: Automate the Present, Incubate the Future.

Ying Jiao, Rodrigo Castellano Ontiveros, Luc De Raedt, Marco Gori, Francesco Giannini, Michelangelo Diligenti, and Giuseppe Marra, “DeepProofLog: Efficient Proving in Deep Stochastic Logic Programs,” arXiv:2511.08581, 2026. https://arxiv.org/abs/2511.08581 ↩︎

The expensive part is not always probability; it is finding the proofs#

DeepProofLog turns resolution into a policy#

The MDP mapping is the paper’s real hinge#

The MNIST addition test is about scalable latent reasoning, not digit recognition#

The knowledge graph experiment tests whether proof policies survive real relational scale#

The proof trace is the product feature#

The misconception: DPrL is not possible-world probability with better branding#

The appendix tells us what kind of scalability is being claimed#

What the paper directly shows, and what business should infer#

The boundaries are practical, not ceremonial#

DeepProofLog’s broader signal: reasoning is becoming trainable infrastructure#