TL;DR for operators
A new paper by Jose Marie Antonio Miñoza, Erika Fille T. Legara, and Christopher P. Monterola argues that a log-sum-exp neural layer is not merely analogous to a viscous Hamilton-Jacobi equation. Under the paper’s parameterisation, it is exactly the Hopf-Cole solution of one, evaluated at the input point.[1]
The operational point is not “neural networks are physics now”, although someone will certainly try to put that on a slide. The point is cleaner: one parameter, $\varepsilon$, simultaneously controls softmax temperature, PDE viscosity, and entropy-regularised convex optimisation. That makes smoothness, expressiveness, robustness, attribution sharpness, and scaling behaviour mathematically coupled.
For business readers, this reframes several model-design choices. Temperature is not just a calibration knob. Architecture is not just a leaderboard habit. Width is not just capacity. Each becomes part of a discretisation-and-regularisation budget: how finely the model covers the data measure, how sharply it selects local evidence, how much curvature it permits, and how badly it behaves when queried outside the support of its training distribution.
The paper’s numerical experiments mainly verify identities and consequences in controlled settings: machine-precision checks for the LSE/Hopf-Cole identity, quadrature convergence, scaling behaviour under synthetic and trained LSE settings, Hessian robustness bounds, and exploratory attribution phase diagrams. They are not evidence that today’s production transformers suddenly inherit guaranteed robustness because someone noticed Hamilton liked equations.
The boundary matters. The exact result applies most directly to log-sum-exp layers and quadratic or anisotropic-quadratic Hamiltonians. ResNets, recurrent models, state-space models, and transformer components are connected structurally, with specific exact pieces and specific gaps. Standard multi-head causal transformer stacks with GELU or SiLU activations are not fully swallowed by the theorem. Not everything with a softmax is secretly a solved PDE. Some of it is merely wearing a PDE-adjacent jacket.
The ordinary knob that becomes the whole mechanism
Most AI teams treat temperature as a behavioural control. Increase it and the model becomes softer, more exploratory, less brittle. Decrease it and the model becomes sharper, more decisive, occasionally too pleased with itself.
That engineering intuition is not wrong. It is just shallow.
The paper’s central move is to show that, for log-sum-exp layers, the same parameter that softens a neural computation also plays the role of viscosity in a Hamilton-Jacobi partial differential equation and the role of entropy regularisation in a convex optimisation problem. This is the mechanism from which the rest of the paper follows.
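The familiar log-sum-exp layer has the following form, written here in standard notation with weights $w_j$, biases $b_j$, and width $N$ (the paper’s own parameterisation may differ in detail):

$$f_\varepsilon(x) = \varepsilon \log \sum_{j=1}^{N} \exp\!\left(\frac{\langle w_j, x\rangle + b_j}{\varepsilon}\right)$$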
At finite $\varepsilon$, all neurons contribute through softmax weights. As $\varepsilon \to 0$, the expression collapses toward a hard maximum. The paper places that familiar soft-to-hard transition inside a larger mathematical square:
| Viewpoint | Finite $\varepsilon$ | Limit as $\varepsilon \to 0$ | Operational reading |
|---|---|---|---|
| Neural network | Log-sum-exp / softmax | Max / hard selection | Smooth ensemble becomes winner-take-all routing |
| Algebra | Ordinary smooth deformation | Tropical max-plus algebra | Arithmetic becomes selection geometry |
| PDE | Viscous Hamilton-Jacobi | Inviscid Hamilton-Jacobi | Diffusion-smoothed solution becomes shock-prone solution |
| Optimisation | Entropy-regularised convex programme | Linear programme at a vertex | Soft allocation becomes sparse vertex choice |
This is why the paper is best read mechanism-first. The authors do not simply collect analogies between neural networks, PDEs, tropical algebra, and optimisation. They argue that these are four descriptions of the same object under a deformation controlled by $\varepsilon$.
That does not make the result easy. It makes it useful. Once $\varepsilon$ has four simultaneous meanings, several separate model-quality conversations stop being separate. Robustness, attribution, smoothness, scaling, and architecture choice start sharing a denominator. Annoying, but clarifying.
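The soft-to-hard transition in the table can be checked numerically. The following is a minimal sketch, assuming NumPy and the standard form $\varepsilon \log \sum_j \exp((\langle w_j, x\rangle + b_j)/\varepsilon)$. The function names `lse_layer` and `gibbs_weights`, the layer sizes, and the random weights are illustrative and not taken from the paper.

```python
import numpy as np

def lse_layer(x, W, b, eps):
    """eps * log(sum_j exp((w_j . x + b_j) / eps)), computed stably."""
    z = (W @ x + b) / eps
    m = z.max()
    return eps * (m + np.log(np.exp(z - m).sum()))

def gibbs_weights(x, W, b, eps):
    """Softmax weights of the neurons at temperature eps."""
    z = (W @ x + b) / eps
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))   # 8 neurons, 3 inputs (illustrative sizes)
b = rng.normal(size=8)
x = rng.normal(size=3)

hard_max = (W @ x + b).max()
results = []
for eps in (1.0, 0.1, 0.01):
    gap = lse_layer(x, W, b, eps) - hard_max      # always in [0, eps * log(8)]
    top = gibbs_weights(x, W, b, eps).max()       # weight of the leading neuron
    results.append((eps, gap, top))
    print(f"eps={eps:<5} LSE - max = {gap:.4f}   largest weight = {top:.3f}")
```

The gap between the soft and hard outputs is bounded by $\varepsilon \log N$, so it closes as $\varepsilon$ shrinks, while the largest Gibbs weight grows toward one. That is the winner-take-all routing in the first row of the table.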
The exact claim is about LSE layers solving an initial-value problem
The strongest result is the paper’s exact identity for log-sum-exp layers. The authors show that an LSE layer can be reparameterised as the Hopf-Cole solution of a viscous Hamilton-Jacobi initial-value problem under a discrete measure.
Translated out of PDE dialect: the weights encode support points and initial data; the layer width discretises a measure; the input is the spatial point where the solution is evaluated; and the forward pass evaluates the resulting PDE solution.
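For readers who want that translation spelled out, here is a sketch under one standard sign and scaling convention. The convention is an assumption made here and may differ from the paper’s; the symbols $y_j$ (support points), $g_j$ (initial values), $t$ (evaluation time), and $d$ (input dimension) are introduced for illustration. Take the viscous equation

$$\partial_t u = \tfrac{1}{2}\lvert\nabla u\rvert^2 + \tfrac{\varepsilon}{2}\Delta u .$$

The Hopf-Cole substitution $u = \varepsilon\log\phi$ turns it into the heat equation $\partial_t\phi = \tfrac{\varepsilon}{2}\Delta\phi$. If the initial datum is the discrete measure $\phi(\cdot,0) = \sum_j e^{g_j/\varepsilon}\,\delta_{y_j}$, convolving with the heat kernel gives

$$u(x,t) = \varepsilon\log\sum_{j=1}^{N}\exp\!\left(\frac{g_j - \lvert x-y_j\rvert^2/(2t)}{\varepsilon}\right) - \frac{d\,\varepsilon}{2}\log(2\pi\varepsilon t).$$

Expanding the square and pulling the $x$-only term out of the sum:

$$u(x,t) = \varepsilon\log\sum_{j=1}^{N}\exp\!\left(\frac{\langle w_j,x\rangle+b_j}{\varepsilon}\right) - \frac{\lvert x\rvert^2}{2t} - \frac{d\,\varepsilon}{2}\log(2\pi\varepsilon t), \qquad w_j = \frac{y_j}{t},\quad b_j = g_j - \frac{\lvert y_j\rvert^2}{2t}.$$

The first term is an LSE layer of width $N$ evaluated at $x$; the remaining terms do not depend on the learned parameters. As $\varepsilon \to 0$ the expression tends to $\max_j \big(g_j - \lvert x-y_j\rvert^2/(2t)\big)$, a Hopf-Lax type formula for the inviscid problem.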
This is not a physics-informed neural network story. PINNs impose a PDE through a residual loss and train a network to approximate the PDE solution. Here, for the LSE layer, the PDE is already in the layer’s algebra. The residual is not pushed down during training; the identity is exact by construction.
That difference matters.
A PINN says: “Please learn this equation.” This paper says: “You already are an equation. The question is which one.”
The authors describe training as a search over Hamilton-Jacobi initial-value problems. For any fixed trained LSE layer, the PDE interpretation is precise: the learned parameters define the initial data. The more delicate claim is about how stochastic gradient descent selects that initial-value problem. The paper gives mean-field and fixed-support results, but the full selection story for deep, finite-width, mini-batch-trained practical networks remains open. We will return to that boundary, because it is where many breathless interpretations should be quietly escorted out.
The tropical limit explains why sharp models become selectors
The limit $\varepsilon \to 0$ is not just “temperature gets low”. It is a structural phase change.
At finite $\varepsilon$, the LSE layer computes a Gibbs-weighted average. Multiple neurons influence the output. Attribution is distributed. Curvature is smoothed by viscosity.
As $\varepsilon$ shrinks, the Gibbs measure concentrates on the dominant term. The network behaves like a max-affine spline operator: one neuron wins in each region, and the input space is partitioned into polyhedral cells. In PDE terms, the viscous Hamilton-Jacobi equation approaches its inviscid Hopf-Lax form. In optimisation terms, the entropy-regularised programme collapses to a linear programme at a vertex.
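The optimisation reading can be tested directly. The sketch below relies on the standard Gibbs variational identity, $\varepsilon \log \sum_j e^{a_j/\varepsilon} = \max_{p \in \Delta} \langle p, a\rangle + \varepsilon H(p)$, which is a textbook fact rather than a formula quoted from the paper. The names `regularised_objective` and `lse`, and the random pre-activations `a`, are illustrative.

```python
import numpy as np

def softmax(a, eps):
    z = a / eps
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def regularised_objective(p, a, eps):
    """Linear payoff plus eps times Shannon entropy."""
    return p @ a + eps * entropy(p)

def lse(a, eps):
    z = a / eps
    return eps * (z.max() + np.log(np.exp(z - z.max()).sum()))

rng = np.random.default_rng(2)
a = rng.normal(size=6)                       # pre-activations (illustrative)

checks = []
for eps in (1.0, 0.1, 0.01):
    p_star = softmax(a, eps)
    best = regularised_objective(p_star, a, eps)
    # Random points on the simplex should never beat the softmax allocation.
    trials = rng.dirichlet(np.ones(6), size=1000)
    min_gap = min(best - regularised_objective(p, a, eps) for p in trials)
    checks.append((eps, best, lse(a, eps), min_gap, p_star.max()))
    print(f"eps={eps:<5} LSE={lse(a, eps):.6f}  objective at softmax={best:.6f}  "
          f"largest weight={p_star.max():.3f}")
```

The softmax allocation attains the LSE value at every temperature, and its largest weight moves toward one as $\varepsilon$ falls. That is the soft allocation collapsing onto a vertex of the simplex.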
This gives a disciplined explanation for a behaviour practitioners already observe. Sharp models can be expressive, but their decisions may hinge on abrupt local switches. Smooth models are less brittle, but they can over-diffuse distinctions that matter.
The paper formalises this trade-off through curvature bounds. In particular, it derives a Hessian bound showing that increasing $\varepsilon$ suppresses curvature and enlarges a certified adversarial radius. Decreasing $\varepsilon$ sharpens the model but narrows that radius. This is not a moral argument for smoothness. It is a control budget.
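The curvature side of this budget is easy to inspect for a single LSE layer. The sketch below uses the standard closed form for the Hessian of $\varepsilon \log \sum_j \exp((\langle w_j, x\rangle + b_j)/\varepsilon)$, which is the covariance of the weight vectors under the Gibbs weights divided by $\varepsilon$, and the elementary bound $\lVert\nabla^2 f\rVert_2 \le \max_j \lVert w_j\rVert^2/\varepsilon$. This is a stand-in for the paper’s Hessian bound, which is not reproduced in this article; `lse_hessian` and the random layer are hypothetical.

```python
import numpy as np

def lse_hessian(x, W, b, eps):
    """Closed-form Hessian of the LSE layer: Cov_p(w) / eps."""
    z = (W @ x + b) / eps
    p = np.exp(z - z.max())
    p /= p.sum()
    mean_w = p @ W
    second_moment = (W * p[:, None]).T @ W     # sum_j p_j w_j w_j^T
    return (second_moment - np.outer(mean_w, mean_w)) / eps

rng = np.random.default_rng(3)
W = rng.normal(size=(10, 4))
b = rng.normal(size=10)
R2 = (W ** 2).sum(axis=1).max()                # largest squared weight norm

report = {}
for eps in (2.0, 0.5, 0.1):
    # Largest spectral norm of the Hessian over 200 random query points.
    worst = max(np.linalg.norm(lse_hessian(rng.normal(size=4), W, b, eps), 2)
                for _ in range(200))
    report[eps] = (worst, R2 / eps)
    print(f"eps={eps:<4} worst sampled ||Hessian||_2 = {worst:.3f}   bound R^2/eps = {R2 / eps:.3f}")
```

The bound scales as $1/\varepsilon$: halving the temperature doubles the curvature the layer is allowed to have, which is the mechanism behind the shrinking certified radius.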
For operators, the question becomes less childish than “Should we maximise accuracy or robustness?” The better question is:
What viscosity budget should this system run under, given its risk class, data coverage, and acceptable loss of local discrimination?
That is the kind of question organisations can actually govern.
Architecture becomes a discretisation choice, not a fashion cycle
The paper’s second major contribution is to show how several common architectures fit into the same Hamilton-Jacobi frame. The level of exactness varies, and that variation is the whole story.
Feedforward LSE layers have the cleanest result: exact Hopf-Cole identity under a discrete measure. A deep LSE network corresponds to repeated application of the PDE semigroup, with finite-depth and finite-width approximation errors quantified in the appendices. The composition is not simply hand-waved; the paper gives finite-depth error and joint-limit exactness results.
ResNets enter through a different route. A residual layer is interpreted as an Euler discretisation of the characteristic ODE of a Hamilton-Jacobi PDE. The forward pass follows the characteristic dynamics. Backpropagation becomes the co-state equation, integrated backward in time. In control language, the update aligns with the Pontryagin Maximum Principle. In less ceremonial terms: reverse-mode differentiation is not a mystical afterthought; it is the adjoint side of the same dynamical system.
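The forward/backward pairing can be sketched in a few lines. This is a minimal illustration, assuming a tanh residual branch and a linear loss; the names `forward`, `backward`, and `jac_f`, the step size `h`, and the random matrices are all hypothetical choices, not the paper’s setup. The forward pass is explicit Euler on $\dot{x} = f(x)$, and the backward pass is the matching discrete co-state recursion, checked against finite differences.

```python
import numpy as np

def f(x, A):
    """Residual branch: a tanh layer (illustrative choice)."""
    return np.tanh(A @ x)

def jac_f(x, A):
    """Jacobian of f with respect to x."""
    return (1.0 - np.tanh(A @ x) ** 2)[:, None] * A

def forward(x0, As, h):
    """Residual stack x_{k+1} = x_k + h f(x_k): explicit Euler steps."""
    xs = [x0]
    for A in As:
        xs.append(xs[-1] + h * f(xs[-1], A))
    return xs

def backward(xs, As, h, grad_out):
    """Co-state recursion p_k = p_{k+1} + h J_k^T p_{k+1}, run backward in depth."""
    p = grad_out
    for x, A in zip(reversed(xs[:-1]), reversed(As)):
        p = p + h * jac_f(x, A).T @ p
    return p

rng = np.random.default_rng(4)
dim, depth, h = 3, 20, 0.05
As = [rng.normal(size=(dim, dim)) for _ in range(depth)]
x0 = rng.normal(size=dim)
c = rng.normal(size=dim)                     # loss = c . x_final

xs = forward(x0, As, h)
adjoint_grad = backward(xs, As, h, c)

# Central finite differences as an independent check of d(loss)/d(x0).
delta = 1e-6
fd_grad = np.array([
    (c @ forward(x0 + delta * e, As, h)[-1] - c @ forward(x0 - delta * e, As, h)[-1]) / (2 * delta)
    for e in np.eye(dim)
])
max_diff = np.max(np.abs(adjoint_grad - fd_grad))
print(f"max |adjoint - finite difference| = {max_diff:.2e}")
```

The backward loop is ordinary backpropagation through the residual stack, written so that its reading as a time-reversed linear ODE for the co-state is visible.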
Transformers are more mixed. Scaled dot-product attention is read as an expected value under a Gibbs distribution. L2 attention has a cleaner exact Hopf-Cole interpretation. An LSE-activated transformer feed-forward sublayer satisfies the exact LSE identity. Residuals are structural. LayerNorm does not break the attention identity, but it defines the inputs to it. Standard transformer blocks with GELU or SiLU feed-forward activations are not fully exact HJ solvers. Multi-head attention and causal masking introduce additional complications.
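The Gibbs-expectation reading of attention is simple to verify for a single query. The sketch below is an illustration under stated assumptions: one head, no masking, and temperature $\sqrt{d}$ taken from the usual scaling. The names `attention` and `lse_potential` are hypothetical. It checks two standard facts: the attention weights form a probability distribution over keys, and when the values equal the keys the output is the gradient of an LSE potential in the query.

```python
import numpy as np

def attention(q, K, V):
    """Single-query scaled dot-product attention; returns output and weights."""
    scores = K @ q / np.sqrt(q.size)
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return p @ V, p

def lse_potential(q, K):
    """eps * log(sum_j exp(k_j . q / eps)) with eps = sqrt(d)."""
    eps = np.sqrt(q.size)
    s = K @ q / eps
    return eps * (s.max() + np.log(np.exp(s - s.max()).sum()))

rng = np.random.default_rng(5)
n, d = 7, 4
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))
q = rng.normal(size=d)

out, p = attention(q, K, V)
# (1) The output is the mean of the values under the Gibbs weights p.
expectation = sum(p[j] * V[j] for j in range(n))

# (2) With values equal to keys, the output is the gradient of the LSE potential.
out_kk, _ = attention(q, K, K)
delta = 1e-6
grad_fd = np.array([
    (lse_potential(q + delta * e, K) - lse_potential(q - delta * e, K)) / (2 * delta)
    for e in np.eye(d)
])
grad_err = np.max(np.abs(out_kk - grad_fd))
print(np.allclose(out, expectation), f"{grad_err:.2e}")
```

Multi-head structure, causal masks, and the value projection are deliberately left out, which mirrors the gaps the paper itself flags.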
Recurrent architectures and state-space models fit structurally as characteristic discretisations. RNNs become unit-step Euler schemes. Linear SSMs become linear Hamilton-Jacobi characteristic systems. LSTMs are described through gated, time-dependent Hamiltonians, with gates acting as gradients of two-neuron LSE components and becoming selectors in the tropical limit.
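The gate statement can be made concrete with a one-line calculation. It is a standard identity rather than a formula quoted from the paper, and the pre-activation symbol $z$ is introduced here for illustration. A two-neuron LSE with pre-activations $0$ and $z$ has the temperature-scaled sigmoid as its derivative:

$$\frac{d}{dz}\Big[\varepsilon\log\big(e^{0/\varepsilon}+e^{z/\varepsilon}\big)\Big] = \frac{e^{z/\varepsilon}}{1+e^{z/\varepsilon}} = \sigma(z/\varepsilon) \;\xrightarrow[\varepsilon\to 0]{}\; \mathbf{1}[z>0] \qquad (z \neq 0).$$

A sigmoid gate is therefore the Gibbs weight on the second neuron, and in the tropical limit it becomes a hard switch that selects whichever pre-activation is larger.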
The practical interpretation is compact:
| Architecture class | Paper’s reading | Strength of claim | Business consequence |
|---|---|---|---|
| LSE feedforward layers | Exact Hopf-Cole solution under discrete measure | Exact for quadratic or anisotropic-quadratic Hamiltonians | Temperature, width, and robustness can be analysed as linked controls |
| Deep LSE networks | Composition approximates PDE semigroup | Quantified finite-depth and width error; exact in joint limit | Depth and width enter an approximation budget |
| ResNets | Euler discretisation of HJ characteristics | Structural, with exact adjoint/backprop interpretation in stated setting | Architecture choice becomes numerical integration choice |
| Attention | Gibbs expectation; L2 attention has exact Hopf-Cole form | Exact for specific attention identities; broader transformer stack mixed | Attention sharpness and sinks can be read as measure concentration |
| RNNs, LSTMs, SSMs | Characteristic discretisations with architecture-specific viscosity | Structural correspondences | Sequence models differ by how they inject and propagate viscosity |
| GELU / SiLU networks | Related finite-temperature structures | Not exact under current identity | Do not overclaim PDE guarantees for standard production activations |
This table is where the misconception should die. The paper is not saying every modern network literally solves an arbitrary Hamilton-Jacobi PDE exactly. It says one class does, several components align strongly, and broader architectures inherit structural correspondences with clear unresolved gaps.
The difference is not pedantry. It is the difference between a theory and a consultancy deck.
What the experiments actually support
The numerical section is not a benchmark race. It is mostly a verification suite.
That is appropriate. A paper making exact mathematical claims should first show that the identities survive implementation, that the rates behave as predicted, and that derived bounds are not immediately contradicted by computation.
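A check of this kind is short enough to reproduce in spirit. The sketch below is not the paper’s test code. It assumes one common convention, $\partial_t u = \tfrac12\lvert\nabla u\rvert^2 + \tfrac{\varepsilon}{2}\Delta u$ with discrete initial data at support points `Y` carrying values `g`, and compares the heat-kernel side with the reparameterised LSE layer (`W = Y / t`, `b = g - |Y|^2 / (2t)`). All names and sizes are illustrative.

```python
import numpy as np

def logsumexp(z):
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

rng = np.random.default_rng(1)
d, N, t = 3, 16, 0.7              # illustrative sizes and evaluation time
Y = rng.normal(size=(N, d))       # support points y_j
g = rng.normal(size=N)            # initial values g_j
x = rng.normal(size=d)            # query point

# Reparameterise the initial data as layer weights and biases.
W = Y / t
b = g - (Y ** 2).sum(axis=1) / (2 * t)

max_err = 0.0
for eps in (1.0, 0.1, 0.01):
    # Heat-kernel (Hopf-Cole) side, dropping the parameter-independent constant.
    pde_side = eps * logsumexp((g - ((x - Y) ** 2).sum(axis=1) / (2 * t)) / eps)
    # Neural side: LSE layer minus the fixed quadratic |x|^2 / (2t).
    nn_side = eps * logsumexp((W @ x + b) / eps) - (x @ x) / (2 * t)
    err = abs(pde_side - nn_side)
    max_err = max(max_err, err)
    print(f"eps={eps:<5} |difference| = {err:.2e}")
```

Because the two sides differ only by expanding a square, any discrepancy is floating-point rounding. That is what “machine precision” means in this context: the identity is algebraic, not approximate.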
The experiments have different evidentiary roles:
| Test or figure | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| LSE/Hopf-Cole identity checks across $\varepsilon$ | Main identity verification | The algebraic identity holds to machine precision in numerical implementation | That arbitrary non-LSE production networks inherit the identity |
| Transformer attention identity checks | Implementation-detail verification for attention correspondence | The stated attention identity holds in random trials and LSE-transformer block checks | That full causal multi-head transformers are exact HJ solvers |
| Quadrature convergence | Main evidence for approximation-rate behaviour | Width improves the PDE approximation as predicted | That real-world model scaling is fully explained by the bound |
| Scaling-law experiments on synthetic and Adam-trained LSE networks | Main evidence plus controlled extension | The predicted width exponent appears in controlled LSE settings | That internet-scale LLM scaling follows the formula without confounds |
| Hessian and adversarial robustness checks | Robustness and sensitivity test | The curvature bound is not violated in closed-form, trained, and PCA-projected MNIST and CIFAR-10 settings | That raw-pixel or learned-representation robustness is solved |
| Attribution entropy phase diagrams | Exploratory extension | The entropy landscape undergoes fold-like basin-merging behaviour in synthetic and projected real-data settings | That attribution in all deployed models can be read directly from these landscapes |
| Intrinsic-dimension table from published scaling curves | Comparison with prior work and interpretive bridge | Empirical scaling exponents can be converted into rough implied intrinsic dimensions | That those inferred dimensions are certified properties of the data manifold |
Two details are worth not missing.
First, the real-data robustness checks use PCA-projected MNIST and CIFAR-10 representations. That makes them useful as sensitivity evidence for the bound, but not a raw-input robustness certificate for production vision systems.
Second, the paper’s table converting published scaling exponents into implied intrinsic dimensions is explicitly directional. It estimates order-of-magnitude geometry from empirical loss-vs-parameter curves. The reported examples include under-trained GPT-scale language with exponent $0.076$ and implied dimension $13.2$, compute-optimal language with exponent $0.35$ and implied dimension $2.9$, video with exponent $0.24$ and implied dimension $4.2$, and math with exponent $0.38$ and implied dimension $2.6$. Those numbers are interesting because they connect scaling behaviour to data geometry. They are not licence to walk into a board meeting and declare that language has precisely 2.9 dimensions. Please do not.
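The four quoted pairs are consistent, to the rounding shown, with the simple relation $d \approx 1/\alpha$ between exponent and implied dimension. That relation is inferred here from the numbers themselves; the paper’s actual conversion may carry additional terms or constants. The function name `implied_dimension` is hypothetical.

```python
# Exponent and implied-dimension pairs exactly as quoted above.
reported = {
    "under-trained GPT-scale language": (0.076, 13.2),
    "compute-optimal language": (0.35, 2.9),
    "video": (0.24, 4.2),
    "math": (0.38, 2.6),
}

def implied_dimension(alpha):
    """Assumed relation d = 1 / alpha, inferred from the quoted pairs."""
    return 1.0 / alpha

for name, (alpha, d_quoted) in reported.items():
    d = implied_dimension(alpha)
    print(f"{name:<34} alpha={alpha:<6} 1/alpha={d:5.2f}   quoted={d_quoted}")
```

The arithmetic also shows why these are order-of-magnitude readings: a small shift in a fitted exponent near $0.076$ moves the implied dimension by whole units.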
The business value is coupled diagnosis, not mystical explanation
The commercial relevance of this paper is not that executives should learn Hamilton-Jacobi theory before approving a model deployment. Most of them have suffered enough.
The value is diagnostic. The framework gives operators a way to see several familiar AI trade-offs as one coupled design surface.
| Paper result | Cognaptus business interpretation | Boundary |
|---|---|---|
| $\varepsilon$ is softmax temperature, PDE viscosity, and entropy regularisation | Treat temperature as a governance-relevant design variable, not post-hoc seasoning | Strongest for LSE and softmax-based components |
| Higher $\varepsilon$ suppresses curvature and improves certified perturbation radius | Robustness can be tuned through smoothness, with a measurable loss of sharp discrimination | Certificates are tied to the paper’s setting and representation assumptions |
| Width discretises the measure; approximation depends on intrinsic dimension | Compute planning should consider data geometry, not just parameter count | Intrinsic dimension estimates from scaling curves are rough and confounded |
| OOD behaviour becomes dominant-neuron extrapolation outside support | Alignment and safety require support coverage, OOD penalties, or viscosity control | This is a structural model of extrapolation, not a full theory of language hallucination |
| Backpropagation is an adjoint Hamiltonian flow for ResNets | Training algorithm design can borrow from numerical PDE and control methods | Convergence destination under practical SGD remains open |
| Attention sinks resemble dominant neurons in the tropical limit | Some attention pathologies can be read as measure concentration rather than mysterious emergent behaviour | Multi-head causal transformer stacks are not fully covered by the exact theorem |
The result is especially relevant for AI governance and model-risk teams because it replaces vague behavioural discussion with structural levers. A system can be asked: How much curvature are we allowing? How dense is the support coverage? How does the chosen architecture discretise the relevant process? Where does the input fall relative to the learned measure? What happens as the model moves toward hard selection?
These questions do not guarantee safety. They make failure modes less literary.
Scaling laws become geometry claims wearing compute clothing
The paper’s scaling-law argument is one of its most business-relevant pieces, but it needs careful handling.
The derivation connects approximation error to quadrature over a data measure. If the data lies on or near a lower-dimensional manifold, the relevant dimension is not the ambient input dimension but the intrinsic dimension of the support. The width-only scaling exponent can then be interpreted through this intrinsic dimension.
This is a useful reframing of compute planning. It says the economic cost of performance improvement depends not only on how large the model is, but on how hard the data distribution is to cover. A broad, messy, high-dimensional support requires many more atoms. A constrained, structured domain needs fewer.
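A back-of-envelope version of that coverage cost is sketched below. It assumes an approximation error of the form $C\,N^{-1/d}$ for $N$ atoms on a support of intrinsic dimension $d$, a rate chosen here for illustration; the constant `constant`, the function name `atoms_needed`, and the 10% target are all hypothetical.

```python
import math

def atoms_needed(target_error, intrinsic_dim, constant=1.0):
    """Smallest N with constant * N**(-1/d) <= target_error (assumed rate)."""
    return math.ceil((constant / target_error) ** intrinsic_dim)

for d in (2, 4, 8, 13):
    print(f"intrinsic dimension {d:<3} atoms for 10% error: {atoms_needed(0.1, d):,}")
```

Under this assumed rate the atom count grows exponentially in the intrinsic dimension, which is why the geometry of the support matters more than the ambient input size.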
This is where small specialised models become more than a cost-saving slogan. If a domain has lower intrinsic dimension, specialisation is not merely cheaper because the model is smaller. It is cheaper because the approximation problem is geometrically easier.
The paper’s Appendices J and K translate this into design principles: estimate intrinsic dimension from pilot scaling curves, match $\varepsilon$ to width and dimension, use architectural inductive bias to reduce effective dimension, curate data to lower support complexity, and treat mixture-of-experts or distillation as ways of partitioning or compressing the learned measure.
Cognaptus inference: this supports a more disciplined alternative to brute-force scaling. Before buying more compute, ask whether the problem is actually high-dimensional, whether the architecture is wasting capacity on irrelevant symmetries, and whether the training corpus is expanding the support or just adding noisy repetitions to it.
Boundary: the paper does not prove a universal cost model for frontier training. Empirical scaling exponents conflate approximation, optimisation, data quality, and training allocation. The intrinsic-dimension reading is powerful, but it is not an invoice from geometry.
OOD behaviour is not mysterious when the measure runs out
The paper’s treatment of hallucination is deliberately structural. It defines an out-of-distribution extrapolation regime in which the input lies outside the diffusion radius of every support point. In that regime, the output becomes exponentially close to the linear extrapolation of the dominant neuron.
This is not a complete account of language-model hallucination. The authors are explicit that the term is used informally and that language-model phenomena include other mechanisms. Still, the result captures a critical operational fact: outside the learned support, the model is no longer governed by local evidence from the training measure. It is governed by whichever learned component dominates the extrapolation geometry.
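The dominant-neuron behaviour shows up even in a toy layer. The sketch below is an illustration, not the paper’s construction: it uses twelve unit-norm weight vectors spread evenly around the circle, zero biases, and a query that walks outward along the first weight direction. It measures how far the LSE output sits above the affine function of the winning neuron.

```python
import numpy as np

def lse_layer(x, W, b, eps):
    z = (W @ x + b) / eps
    return eps * (z.max() + np.log(np.exp(z - z.max()).sum()))

# Toy layer: 12 unit-norm weight vectors evenly spaced on the circle, zero biases.
angles = 2 * np.pi * np.arange(12) / 12
W = np.stack([np.cos(angles), np.sin(angles)], axis=1)
b = np.zeros(12)
eps = 0.5
direction = np.array([1.0, 0.0])          # aligned with neuron 0

gaps = []
for r in (0.0, 5.0, 10.0, 20.0, 40.0):
    x = r * direction                      # walk away from the origin
    affine = W @ x + b
    dominant = affine.max()                # linear extrapolation of the winner
    gaps.append(lse_layer(x, W, b, eps) - dominant)
    print(f"r={r:<5} LSE - dominant affine = {gaps[-1]:.3e}")
```

At the origin every neuron contributes equally and the gap is $\varepsilon \log 12$. Far out, the gap decays exponentially in the distance, so the layer’s output is effectively one neuron’s straight-line extrapolation, whatever the other eleven learned.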
That has business consequences.
A compliance model queried on an unseen regulatory scenario may produce an answer that is locally smooth, syntactically respectable, and structurally ungrounded. A medical triage model presented with an unusual patient profile may confidently extend a nearby learned rule. A finance model may interpolate beautifully until market structure changes, then discover its inner philosopher.
The framework suggests three levers:
- Expand support coverage with diverse and targeted training data.
- Use viscosity control to smooth sharp extrapolation and reduce sensitivity.
- Add OOD penalties or synthetic boundary cases so risky regions become part of the training measure rather than empty space decorated with confidence.
None of these is magic. They are better than asking the model to “be careful”, which remains one of the great folk rituals of modern computing.
Attribution becomes a landscape, not a receipt
The paper also derives a closed-form attribution and label-sensitivity structure for the Hopf-Cole predictor. The attribution weights are Gibbs weights: each support point contributes fractionally to the prediction. The label sensitivity can be computed without Hessian inversion, unlike classical influence-function approaches that ask how reweighting a training point changes the learned parameters.
This distinction matters. The paper’s attribution result answers a fixed-parameter question: given the learned support and weights, how does a label perturbation affect this prediction? It does not automatically answer the full retraining question: how would the entire trained model change if this data point had been upweighted or removed?
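The fixed-parameter question is cheap to answer in code. The sketch below assumes, purely for illustration, a predictor that is a Gibbs-weighted average of stored labels with weights proportional to $\exp(-\lvert x - x_j\rvert^2/(2\varepsilon))$; the paper’s Hopf-Cole predictor may be defined differently. The names `gibbs_attribution`, `predict`, and `attribution_entropy`, and the two synthetic clusters, are hypothetical.

```python
import numpy as np

def gibbs_attribution(x, X, eps):
    """Weights p_j proportional to exp(-|x - x_j|^2 / (2 eps)) (assumed kernel)."""
    e = -((X - x) ** 2).sum(axis=1) / (2 * eps)
    w = np.exp(e - e.max())
    return w / w.sum()

def predict(x, X, y, eps):
    """Illustrative predictor: Gibbs-weighted average of the stored labels."""
    return gibbs_attribution(x, X, eps) @ y

def attribution_entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-2, 0.3, size=(5, 2)), rng.normal(2, 0.3, size=(5, 2))])
y = np.concatenate([np.zeros(5), np.ones(5)])     # two synthetic clusters
x = np.array([0.2, 0.0])

p = gibbs_attribution(x, X, eps=1.0)
# Label sensitivity at fixed parameters: d(prediction)/d(y_j) equals p_j,
# so no Hessian inversion is involved. Checked here by a finite difference.
bump = np.zeros(10)
bump[3] = 1e-6
sensitivity_fd = (predict(x, X, y + bump, 1.0) - predict(x, X, y, 1.0)) / 1e-6
print(f"p[3] = {p[3]:.6f}   finite-difference sensitivity = {sensitivity_fd:.6f}")

entropies = []
for eps in (0.05, 0.5, 5.0):
    entropies.append(attribution_entropy(gibbs_attribution(x, X, eps)))
    print(f"eps={eps:<5} attribution entropy = {entropies[-1]:.3f}  (maximum {np.log(10):.3f})")
```

Attribution entropy rises with $\varepsilon$: at low temperature a handful of nearby points carry the prediction, and at high temperature the credit is spread across both clusters. None of this says what retraining without point 3 would do, which is the boundary the paragraph above draws.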
The attribution entropy analysis is more geometric. At small $\varepsilon$, attribution basins form around support points. As $\varepsilon$ increases, local minima and saddles collide through fold bifurcations, merging attribution basins. The numerical phase diagrams show this behaviour in synthetic two-cluster data and projected MNIST examples.
For operators, this is useful because it gives a vocabulary for attribution instability. Some explanations change abruptly not because the explainer is badly behaved, but because the model’s attribution landscape crosses a basin boundary. Near those transition regions, small input changes can switch the dominant evidence source. The system may still be mathematically consistent. It is just not narratively stable. Humans, regrettably, often prefer the latter.
What should change in model reviews
This paper should not cause every organisation to replace its model cards with PDE diagrams. It should change the questions asked in serious model reviews.
A model-risk review can use the framework as a checklist of coupled controls:
| Review question | Why it follows from the paper | Practical evidence to request |
|---|---|---|
| What is the effective temperature or softmax sharpness in the relevant components? | $\varepsilon$ controls smoothness, curvature, and selection | Temperature settings, calibration curves, entropy profiles |
| How dense is support coverage in high-risk input regions? | Width and support points approximate the data measure | Coverage tests, OOD probes, rare-case evaluation |
| Does the architecture encode useful invariances? | Inductive bias can reduce effective intrinsic dimension | Ablations against task symmetries and domain constraints |
| Where are attribution basin boundaries? | Attribution can bifurcate as viscosity changes | Explanation stability maps near decision boundaries |
| Are robustness claims representation-specific? | Hessian bounds may hold after projection or within a defined layer space | Raw-input and learned-representation sensitivity tests |
| Does fine-tuning preserve old support points? | Continual learning can displace measure coverage | Forgetting tests and support-preservation diagnostics |
The deeper lesson is that AI assurance should stop treating temperature, robustness, interpretability, scaling, and architecture as separate panels in a governance dashboard. They interact. The paper gives a mathematical language for that interaction.
The exactness boundary is where the article should end, not where it should panic
The paper is ambitious, and ambition always creates a larger target.
The first boundary is architectural. The exact Hopf-Cole identity is for LSE-activated feedforward layers and quadratic or anisotropic-quadratic Hamiltonians. ReLU and Softplus fit as special or limiting cases. Sigmoid and tanh have exact component-level interpretations. GELU and SiLU have structural connections but no exact HJ identity in the paper. Standard production transformer stacks mix exact, structural, and unresolved pieces.
The second boundary is training dynamics. The paper characterises what a trained or fixed-parameter LSE network computes. It gives mean-field support for training as selection over initial-data measures and fixed-support convexity or NTK-style results in narrower regimes. But practical deep-network SGD, finite width, mini-batches, and non-Gaussian data are not fully characterised.
The third boundary is dimensionality. The approximation rate is minimax-optimal for Lipschitz functions, which means the curse of dimensionality is not abolished. It is relocated. If useful data lies on a lower-dimensional manifold, the effective problem improves. If it does not, the theory does not politely pretend otherwise.
The fourth boundary is empirical scale. The experiments are well aligned with the paper’s theoretical claims, but they are not a production validation study. They verify identities, bounds, convergence behaviours, and exploratory attribution structures. They do not prove that large commercial models become robust, aligned, or cheap by invoking $\varepsilon$.
This is not a weakness. It is an unusually clear map of where the exact mathematics stops and where engineering begins. Given the usual state of AI theory discourse, that is almost refreshing.
Conclusion: the useful theory is the one that links the knobs
The paper’s most valuable contribution is not that it gives deep learning a grand physical metaphor. Grand metaphors are cheap; physics metaphors are sold in bulk.
The value is that it identifies a shared mechanism. In log-sum-exp networks, $\varepsilon$ is not merely a temperature. It is the bridge between smooth neural computation, PDE viscosity, entropy-regularised optimisation, and tropical hard selection. From that bridge, the paper derives consequences for robustness, attribution, scaling, backpropagation, and architecture.
For businesses, the takeaway is operational discipline. Do not ask whether a model is “robust” in isolation. Ask what viscosity regime it operates in. Do not ask whether a model is “large enough” in isolation. Ask how width interacts with the intrinsic dimension of the data support. Do not ask whether attention is “interpretable” in isolation. Ask when its Gibbs measure concentrates, when basins merge, and when the system is merely extrapolating from a dominant support point.
The paper does not make modern AI simple. It makes several of its trade-offs less independent than they looked. That is useful, because independent-looking knobs are exactly how complex systems become expensive surprises.
Cognaptus: Automate the Present, Incubate the Future.
[1] Jose Marie Antonio Miñoza, Erika Fille T. Legara, and Christopher P. Monterola, “The Hamilton–Jacobi Theory of Deep Learning,” arXiv:2605.28983v1, 27 May 2026, https://arxiv.org/abs/2605.28983.