The Viscosity Budget: Why Softmax Is Not Just a Knob

TL;DR for operators

A new paper by Jose Marie Antonio Miñoza, Erika Fille T. Legara, and Christopher P. Monterola argues that a log-sum-exp neural layer is not merely analogous to a viscous Hamilton-Jacobi equation. Under the paper’s parameterisation, it is exactly the Hopf-Cole solution of one, evaluated at the input point.¹

The operational point is not “neural networks are physics now”, although someone will certainly try to put that on a slide. The point is cleaner: one parameter, $\varepsilon$, simultaneously sets the softmax temperature, the PDE viscosity, and the strength of entropy regularisation in a convex optimisation problem. That makes smoothness, expressiveness, robustness, attribution sharpness, and scaling behaviour mathematically coupled.

For business readers, this reframes several model-design choices. Temperature is not just a calibration knob. Architecture is not just a leaderboard habit. Width is not just capacity. Each becomes part of a discretisation-and-regularisation budget: how finely the model covers the data measure, how sharply it selects local evidence, how much curvature it permits, and how badly it behaves when queried outside the support of its training distribution.

The paper’s numerical experiments mainly verify identities and consequences in controlled settings: machine-precision checks for the LSE/Hopf-Cole identity, quadrature convergence, scaling behaviour under synthetic and trained LSE settings, Hessian robustness bounds, and exploratory attribution phase diagrams. They are not evidence that today’s production transformers suddenly inherit guaranteed robustness because someone noticed Hamilton liked equations.

The boundary matters. The exact result applies most directly to log-sum-exp layers and quadratic or anisotropic-quadratic Hamiltonians. ResNets, recurrent models, state-space models, and transformer components are connected structurally, with specific exact pieces and specific gaps. Standard multi-head causal transformer stacks with GELU or SiLU activations are not fully swallowed by the theorem. Not everything with a softmax is secretly a solved PDE. Some of it is merely wearing a PDE-adjacent jacket.

The ordinary knob that becomes the whole mechanism

Most AI teams treat temperature as a behavioural control. Increase it and the model becomes softer, more exploratory, less brittle. Decrease it and the model becomes sharper, more decisive, occasionally too pleased with itself.

That engineering intuition is not wrong. It is just shallow.

The paper’s central move is to show that, for log-sum-exp layers, the same parameter that softens a neural computation also plays the role of viscosity in a Hamilton-Jacobi partial differential equation and the role of entropy regularisation in a convex optimisation problem. This is the mechanism from which the rest of the paper follows.

The familiar log-sum-exp layer has the form:

$$ f_\varepsilon(x) = \varepsilon \log \sum_i \exp\left(\frac{\langle w_i,x\rangle+b_i}{\varepsilon}\right). $$
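A minimal numerical sketch of this formula and its small-$\varepsilon$ behaviour, assuming a tiny scalar-input layer with made-up weights `W` and biases `B` (illustrative values, not taken from the paper). It uses the usual max-shift trick for numerical stability and prints the gap to the hard maximum, which always lies between $0$ and $\varepsilon \log n$ for $n$ neurons.

```python
import math

# Hypothetical 1-D layer: three neurons with illustrative weights and biases.
W = [1.0, -0.5, 2.0]
B = [0.0, 1.0, -1.5]

def affine_scores(x):
    """Pre-activations w_i * x + b_i for a scalar input x."""
    return [w * x + b for w, b in zip(W, B)]

def lse_layer(x, eps):
    """f_eps(x) = eps * log(sum_i exp(score_i / eps)), computed stably."""
    s = affine_scores(x)
    m = max(s)  # subtracting the max avoids overflow when eps is small
    return m + eps * math.log(sum(math.exp((si - m) / eps) for si in s))

x = 0.7
hard_max = max(affine_scores(x))
for eps in (1.0, 0.1, 0.01, 0.001):
    gap = lse_layer(x, eps) - hard_max
    # The gap lies in [0, eps * log(n)], so it vanishes as eps -> 0.
    print(f"eps={eps:<6} f_eps={lse_layer(x, eps):.6f} gap_to_max={gap:.6f}")
```

The stable form matters in practice: evaluating `exp(score / eps)` directly overflows long before $\varepsilon$ reaches the regime where the layer behaves like a hard maximum.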

At finite $\varepsilon$, all neurons contribute through softmax weights. As $\varepsilon \to 0$, the expression collapses toward a hard maximum. The paper places that familiar soft-to-hard transition inside a larger mathematical square:

Viewpoint	Finite $\varepsilon$	Limit as $\varepsilon \to 0$	Operational reading
Neural network	Log-sum-exp / softmax	Max / hard selection	Smooth ensemble becomes winner-take-all routing
Algebra	Ordinary smooth deformation	Tropical max-plus algebra	Arithmetic becomes selection geometry
PDE	Viscous Hamilton-Jacobi	Inviscid Hamilton-Jacobi	Diffusion-smoothed solution becomes shock-prone solution
Optimisation	Entropy-regularised convex programme	Linear programme at a vertex	Soft allocation becomes sparse vertex choice
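The optimisation row can be made concrete with the standard Gibbs variational identity, given here as background in the notation of the LSE formula above rather than as a quotation from the paper. Writing $s_i = \langle w_i, x\rangle + b_i$ and letting $\Delta$ denote the probability simplex:

$$ \varepsilon \log \sum_i \exp\left(\frac{s_i}{\varepsilon}\right) = \max_{p \in \Delta} \left\{ \sum_i p_i s_i - \varepsilon \sum_i p_i \log p_i \right\}, \qquad p_i^\star = \frac{\exp(s_i/\varepsilon)}{\sum_j \exp(s_j/\varepsilon)}. $$

The maximiser $p^\star$ is the softmax. As $\varepsilon \to 0$ the entropy term vanishes, the objective becomes linear in $p$, and the maximiser moves to a vertex of the simplex.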

This is why the paper is best read mechanism-first. The authors do not simply collect analogies between neural networks, PDEs, tropical algebra, and optimisation. They argue that these are four descriptions of the same object under a deformation controlled by $\varepsilon$.

That does not make the result easy. It makes it useful. Once $\varepsilon$ has four simultaneous meanings, several separate model-quality conversations stop being separate. Robustness, attribution, smoothness, scaling, and architecture choice start sharing a denominator. Annoying, but clarifying.

The exact claim is about LSE layers solving an initial-value problem

The strongest result is the paper’s exact identity for log-sum-exp layers. The authors show that an LSE layer can be reparameterised as the Hopf-Cole solution of a viscous Hamilton-Jacobi initial-value problem under a discrete measure.

Translated out of PDE dialect: the weights encode support points and initial data; the layer width discretises a measure; the input is the spatial point where the solution is evaluated; and the forward pass evaluates the resulting PDE solution.
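A sketch of how that dictionary arises, using one common normalisation of the viscous Hamilton-Jacobi equation with a quadratic Hamiltonian. The symbols $y_i$ (support points), $g_i$ (initial data), $t$ (time), and $d$ (input dimension) are assumptions introduced for illustration; the paper’s own parameterisation and sign conventions may differ. The Hopf-Cole substitution linearises the equation:

$$ \partial_t u + \tfrac{1}{2}|\nabla u|^2 = \tfrac{\varepsilon}{2}\Delta u, \qquad u = -\varepsilon \log \varphi \;\Longrightarrow\; \partial_t \varphi = \tfrac{\varepsilon}{2}\Delta \varphi. $$

With heat data concentrated on a discrete measure, $\varphi(\cdot,0) = \sum_i e^{-g_i/\varepsilon}\,\delta_{y_i}$, the heat kernel gives

$$ u(x,t) = -\varepsilon \log \sum_i \exp\left(-\frac{g_i + |x-y_i|^2/(2t)}{\varepsilon}\right) + \frac{\varepsilon d}{2}\log(2\pi\varepsilon t). $$

Expanding the square turns the sum into the LSE layer above:

$$ \frac{|x|^2}{2t} - u(x,t) = \varepsilon \log \sum_i \exp\left(\frac{\langle w_i, x\rangle + b_i}{\varepsilon}\right) - \frac{\varepsilon d}{2}\log(2\pi\varepsilon t), \qquad w_i = \frac{y_i}{t}, \quad b_i = -\frac{|y_i|^2}{2t} - g_i. $$

In this convention the weights carry the support points, the biases carry the initial data, the number of terms is the number of atoms in the measure, and $x$ is the evaluation point.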

This is not a physics-informed neural network story. PINNs impose a PDE through a residual loss and train a network to approximate the PDE solution. Here, for the LSE layer, the PDE is already in the layer’s algebra. The residual is not pushed down during training; the identity is exact by construction.

That difference matters.

A PINN says: “Please learn this equation.” This paper says: “You already are an equation. The question is which one.”
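The “exact by construction” claim can be checked numerically in a toy case. The sketch below assumes one common normalisation, $\partial_t u + \tfrac{1}{2}u_x^2 = \tfrac{\varepsilon}{2}u_{xx}$ in one dimension, with $u = -\varepsilon\log\varphi$ and $\varphi$ a sum of heat kernels centred at hypothetical support points `Y` with initial data `G`. These names and values are illustrative, and the paper’s convention may differ. No training happens here: the residual is small only because of the algebra, up to finite-difference error.

```python
import math

# Hypothetical 1-D example: support points y_i and initial data g_i are
# illustrative assumptions, not values from the paper.
Y = [-1.0, 0.5, 2.0]
G = [0.0, 0.3, -0.2]
EPS = 0.5

def u(x, t):
    """Hopf-Cole solution u = -EPS * log(phi), phi a sum of heat kernels."""
    # Exponents -(g_i + (x - y_i)^2 / (2 t)) / EPS, shifted by their max for stability.
    a = [-(g + (x - y) ** 2 / (2.0 * t)) / EPS for y, g in zip(Y, G)]
    m = max(a)
    log_sum = m + math.log(sum(math.exp(ai - m) for ai in a))
    # The heat-kernel normalisation (2*pi*EPS*t)^(-1/2) contributes the last term.
    return -EPS * log_sum + 0.5 * EPS * math.log(2.0 * math.pi * EPS * t)

def pde_residual(x, t, h=1e-3):
    """Central-difference estimate of u_t + 0.5*u_x^2 - 0.5*EPS*u_xx."""
    u_t = (u(x, t + h) - u(x, t - h)) / (2.0 * h)
    u_x = (u(x + h, t) - u(x - h, t)) / (2.0 * h)
    u_xx = (u(x + h, t) - 2.0 * u(x, t) + u(x - h, t)) / (h * h)
    return u_t + 0.5 * u_x ** 2 - 0.5 * EPS * u_xx

for x in (-1.5, 0.0, 0.8, 2.5):
    print(f"x={x:+.1f}  residual={pde_residual(x, t=1.0):.2e}")
```

A finite-difference check like this can only confirm the identity to discretisation accuracy. The machine-precision checks described in the paper presumably compare closed forms directly.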

The authors describe training as a search over Hamilton-Jacobi initial-value problems. For any fixed trained LSE layer, the PDE interpretation is precise: the learned parameters define the initial data. The more delicate claim is about how stochastic gradient descent selects that initial-value problem. The paper gives mean-field and fixed-support results, but the full selection story for deep, finite-width, mini-batch-trained practical networks remains open. We will return to that boundary, because it is where many breathless interpretations should be quietly escorted out.

The tropical limit explains why sharp models become selectors

The limit $\varepsilon \to 0$ is not just “temperature gets low”. It is a structural phase change.

At finite $\varepsilon$, the LSE layer computes a Gibbs-weighted average. Multiple neurons influence the output. Attribution is distributed. Curvature is smoothed by viscosity.

As $\varepsilon$ shrinks, the Gibbs measure concentrates on the dominant term. The network behaves like a max-affine spline operator: one neuron wins in each region, and the input space is partitioned into polyhedral cells. In PDE terms, the viscous Hamilton-Jacobi equation approaches its inviscid Hopf-Lax form. In optimisation terms, the entropy-regularised programme collapses to a linear programme at a vertex.
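The concentration can be seen in a small example. This is a sketch with hypothetical weights and biases on a scalar input, where the hard-max cells are intervals rather than general polyhedra. It reports which neuron wins at each query and how much Gibbs weight that winner holds as $\varepsilon$ shrinks.

```python
import math

# Hypothetical max-affine example: three neurons on a scalar input.
W = [-1.0, 0.0, 1.0]
B = [0.0, 0.5, 0.0]

def gibbs_weights(x, eps):
    """Softmax weights p_i proportional to exp((w_i*x + b_i) / eps)."""
    s = [w * x + b for w, b in zip(W, B)]
    m = max(s)
    e = [math.exp((si - m) / eps) for si in s]
    z = sum(e)
    return [ei / z for ei in e]

def winner(x):
    """Index of the affine piece attaining the hard maximum at x."""
    s = [w * x + b for w, b in zip(W, B)]
    return s.index(max(s))

# Hard-max cells here are x < -0.5 (neuron 0), -0.5 < x < 0.5 (neuron 1),
# and x > 0.5 (neuron 2).
for x in (-2.0, 0.1, 2.0):
    for eps in (1.0, 0.1, 0.01):
        p = gibbs_weights(x, eps)
        print(f"x={x:+.1f} eps={eps:<5} winner={winner(x)} p_winner={p[winner(x)]:.4f}")
```

At $\varepsilon = 1$ the winner shares the output with its neighbours. At $\varepsilon = 0.01$ it holds essentially all of the weight, and the layer behaves like the max-affine spline described above.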

This gives a disciplined explanation for a behaviour practitioners already observe. Sharp models can be expressive, but their decisions may hinge on abrupt local switches. Smooth models are less brittle, but they can over-diffuse distinctions that matter.

The paper formalises this trade-off through curvature bounds. In particular, it derives a Hessian bound showing that increasing $\varepsilon$ suppresses curvature and enlarges a certified adversarial radius. Decreasing $\varepsilon$ sharpens the model but narrows that radius. This is not a moral argument for smoothness. It is a control budget.
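The mechanism behind a bound of this kind can be illustrated with a standard fact about log-sum-exp, given here as background rather than as the paper’s theorem. For a scalar input, the second derivative of the layer is the variance of the weights under the Gibbs distribution, divided by $\varepsilon$. Popoviciu’s inequality then caps it at $(w_{\max} - w_{\min})^2 / (4\varepsilon)$. The weights below are hypothetical, and the paper’s actual bound and certified radius may take a different form.

```python
import math

# Hypothetical scalar-input layer; weights and biases are illustrative.
W = [-1.5, 0.2, 1.0, 2.0]
B = [0.3, 0.0, -0.4, -1.0]

def gibbs_weights(x, eps):
    """Softmax weights over neurons at input x and temperature eps."""
    s = [w * x + b for w, b in zip(W, B)]
    m = max(s)
    e = [math.exp((si - m) / eps) for si in s]
    z = sum(e)
    return [ei / z for ei in e]

def lse_curvature(x, eps):
    """Second derivative of the LSE layer: Var_p(w) / eps."""
    p = gibbs_weights(x, eps)
    mean = sum(pi * wi for pi, wi in zip(p, W))
    var = sum(pi * (wi - mean) ** 2 for pi, wi in zip(p, W))
    return var / eps

def curvature_bound(eps):
    """Popoviciu: Var <= (max - min)^2 / 4, so curvature <= range^2 / (4 eps)."""
    return (max(W) - min(W)) ** 2 / (4.0 * eps)

grid = [i / 50.0 - 4.0 for i in range(401)]  # x from -4 to 4
for eps in (0.1, 0.5, 2.0):
    worst = max(lse_curvature(x, eps) for x in grid)
    print(f"eps={eps:<4} max curvature on grid={worst:.3f}  bound={curvature_bound(eps):.3f}")
```

The $1/\varepsilon$ factor is the whole story in this toy setting: a larger $\varepsilon$ lowers the curvature ceiling, which is what leaves room for a larger certified radius.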

For operators, the question becomes less childish than “Should we maximise accuracy or robustness?” The better question is:

What viscosity budget should this system run under, given its risk class, data coverage, and acceptable loss of local discrimination?

That is the kind of question organisations can actually govern.

Architecture becomes a discretisation choice, not a fashion cycle

The paper’s second major contribution is to show how several common architectures fit into the same Hamilton-Jacobi frame. The level of exactness varies, and that variation is the whole story.

Feedforward LSE layers have the cleanest result: exact Hopf-Cole identity under a discrete measure. A deep LSE network corresponds to repeated application of the PDE semigroup, with finite-depth and finite-width approximation errors quantified in the appendices. The composition is not simply hand-waved; the paper gives finite-depth error and joint-limit exactness results.

ResNets enter through a different route. A residual layer is interpreted as an Euler discretisation of the characteristic ODE of a Hamilton-Jacobi PDE. The forward pass follows the characteristic dynamics. Backpropagation becomes the co-state equation, integrated backward in time. In control language, the update aligns with the Pontryagin Maximum Principle. In less ceremonial terms: reverse-mode differentiation is not a mystical afterthought; it is the adjoint side of the same dynamical system.
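The forward/backward pairing can be sketched in a few lines. The example assumes a scalar state, a shared parameter `THETA`, a `tanh` vector field, and a quadratic terminal loss. All of these are hypothetical choices for illustration, not the paper’s Hamiltonian. It shows only the discrete structure: residual blocks as Euler steps forward, and the gradient as a co-state recursion run backward, checked against a finite difference.

```python
import math

THETA = 0.8   # hypothetical shared layer parameter
H = 0.1       # step size: each residual block is one Euler step
DEPTH = 20

def velocity(x):
    """Vector field f(x) = tanh(THETA * x) driving the characteristic ODE."""
    return math.tanh(THETA * x)

def velocity_dx(x):
    """Derivative of the vector field with respect to the state."""
    return THETA * (1.0 - math.tanh(THETA * x) ** 2)

def forward(x0):
    """Residual forward pass x_{k+1} = x_k + H * f(x_k); returns all states."""
    xs = [x0]
    for _ in range(DEPTH):
        xs.append(xs[-1] + H * velocity(xs[-1]))
    return xs

def loss(x_final):
    """Hypothetical terminal cost."""
    return 0.5 * (x_final - 1.0) ** 2

def grad_wrt_input(x0):
    """Backpropagation written as a discrete co-state recursion, run backward."""
    xs = forward(x0)
    lam = xs[-1] - 1.0  # terminal co-state: dLoss/dx_final
    for k in range(DEPTH - 1, -1, -1):
        lam = lam + H * velocity_dx(xs[k]) * lam  # lam_k = (1 + H f'(x_k)) lam_{k+1}
    return lam

x0 = 0.3
eps_fd = 1e-6
fd = (loss(forward(x0 + eps_fd)[-1]) - loss(forward(x0 - eps_fd)[-1])) / (2 * eps_fd)
print(f"adjoint gradient = {grad_wrt_input(x0):.8f}, finite difference = {fd:.8f}")
```

The backward loop is ordinary chain-rule backpropagation. Reading it as a co-state equation changes nothing numerically, but it makes the step size `H` and the integration scheme explicit design choices.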

Transformers are more mixed. Scaled dot-product attention is read as an expected value under a Gibbs distribution. L2 attention has a cleaner exact Hopf-Cole interpretation. An LSE-activated transformer feed-forward sublayer satisfies the exact LSE identity. Residuals are structural. LayerNorm does not break the attention identity, but it defines the inputs to it. Standard transformer blocks with GELU or SiLU feed-forward activations are not fully exact HJ solvers. Multi-head attention and causal masking introduce additional complications.
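The Gibbs-expectation reading of dot-product attention can be written out directly. This is a minimal single-query, single-head, unmasked sketch with hypothetical toy inputs. Treating the usual $\sqrt{d_k}$ scaling as the temperature is an interpretive assumption here, and the sketch does not cover the L2 variant, multi-head attention, or causal masking.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [ei / z for ei in e]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def attention(query, keys, values):
    """Single-query scaled dot-product attention as an expectation E_p[v]."""
    temperature = math.sqrt(len(query))  # sqrt(d_k) plays the role of eps here
    p = softmax([dot(query, k) / temperature for k in keys])
    dim_v = len(values[0])
    out = [sum(pi * v[j] for pi, v in zip(p, values)) for j in range(dim_v)]
    return p, out

# Hypothetical toy inputs.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

p, out = attention(query, keys, values)
print("Gibbs weights:", [round(pi, 4) for pi in p])
print("attention output:", [round(o, 4) for o in out])
```

The output is a convex combination of the value vectors, weighted by a Gibbs distribution over keys. Lowering the temperature concentrates that distribution on the best-matching key.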

Recurrent architectures and state-space models fit structurally as characteristic discretisations. RNNs become unit-step Euler schemes. Linear SSMs become linear Hamilton-Jacobi characteristic systems. LSTMs are described through gated, time-dependent Hamiltonians, with gates acting as gradients of two-neuron LSE components and becoming selectors in the tropical limit.

The practical interpretation is compact:

Architecture class	Paper’s reading	Strength of claim	Business consequence
LSE feedforward layers	Exact Hopf-Cole solution under discrete measure	Exact for quadratic or anisotropic-quadratic Hamiltonians	Temperature, width, and robustness can be analysed as linked controls
Deep LSE networks	Composition approximates PDE semigroup	Quantified finite-depth and width error; exact in joint limit	Depth and width enter an approximation budget
ResNets	Euler discretisation of HJ characteristics	Structural, with exact adjoint/backprop interpretation in stated setting	Architecture choice becomes numerical integration choice
Attention	Gibbs expectation; L2 attention has exact Hopf-Cole form	Exact for specific attention identities; broader transformer stack mixed	Attention sharpness and sinks can be read as measure concentration
RNNs, LSTMs, SSMs	Characteristic discretisations with architecture-specific viscosity	Structural correspondences	Sequence models differ by how they inject and propagate viscosity
GELU / SiLU networks	Related finite-temperature structures	Not exact under current identity	Do not overclaim PDE guarantees for standard production activations

This table is where the misconception should die. The paper is not saying every modern network literally solves an arbitrary Hamilton-Jacobi PDE exactly. It says one class does, several components align strongly, and broader architectures inherit structural correspondences with clear unresolved gaps.

The difference is not pedantry. It is the difference between a theory and a consultancy deck.

What the experiments actually support

The numerical section is not a benchmark race. It is mostly a verification suite.

That is appropriate. A paper making exact mathematical claims should first show that the identities survive implementation, that the rates behave as predicted, and that derived bounds are not immediately contradicted by computation.

The experiments have different evidentiary roles:

Test or figure	Likely purpose	What it supports	What it does not prove
LSE/Hopf-Cole identity checks across $\varepsilon$	Main identity verification	The algebraic identity holds to machine precision in numerical implementation	That arbitrary non-LSE production networks inherit the identity
Transformer attention identity checks	Implementation-detail verification for attention correspondence	The stated attention identity holds in random trials and LSE-transformer block checks	That full causal multi-head transformers are exact HJ solvers
Quadrature convergence	Main evidence for approximation-rate behaviour	Width improves the PDE approximation as predicted	That real-world model scaling is fully explained by the bound
Scaling-law experiments on synthetic and Adam-trained LSE networks	Main evidence plus controlled extension	The predicted width exponent appears in controlled LSE settings	That internet-scale LLM scaling follows the formula without confounds
Hessian and adversarial robustness checks	Robustness and sensitivity test	The curvature bound is not violated in closed-form, trained, and PCA-projected MNIST and CIFAR-10 settings	That raw-pixel or learned-representation robustness is solved
Attribution entropy phase diagrams	Exploratory extension	The entropy landscape undergoes fold-like basin-merging behaviour in synthetic and projected real-data settings	That attribution in all deployed models can be read directly from these landscapes
Intrinsic-dimension table from published scaling curves	Comparison with prior work and interpretive bridge	Empirical scaling exponents can be converted into rough implied intrinsic dimensions	That those inferred dimensions are certified properties of the data manifold

Two details are worth not missing.

First, the real-data robustness checks use PCA-projected MNIST and CIFAR-10 representations. That makes them useful as sensitivity evidence for the bound, but not a raw-input robustness certificate for production vision systems.

Second, the paper’s table converting published scaling exponents into implied intrinsic dimensions is explicitly directional. It estimates order-of-magnitude geometry from empirical loss-vs-parameter curves. The reported examples include under-trained GPT-scale language with exponent $0.076$ and implied dimension $13.2$, compute-optimal language with exponent $0.35$ and implied dimension $2.9$, video with exponent $0.24$ and implied dimension $4.2$, and math with exponent $0.38$ and implied dimension $2.6$. Those numbers are interesting because they connect scaling behaviour to data geometry. They are not licence to walk into a board meeting and declare that language has precisely 2.9 dimensions. Please do not.
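The quoted pairs are all consistent, to within rounding, with the simple reciprocal $d \approx 1/\alpha$, where $\alpha$ is the scaling exponent. That relation is inferred here from the numbers themselves; the text above does not state the paper’s conversion formula, so treat it as an assumption. A short arithmetic check:

```python
# Exponent / implied-dimension pairs as quoted in the text above.
REPORTED = {
    "under-trained GPT-scale language": (0.076, 13.2),
    "compute-optimal language": (0.35, 2.9),
    "video": (0.24, 4.2),
    "math": (0.38, 2.6),
}

def implied_dimension(alpha):
    """Assumed conversion d = 1 / alpha (inferred from the quoted pairs)."""
    return 1.0 / alpha

for name, (alpha, quoted_d) in REPORTED.items():
    d = implied_dimension(alpha)
    print(f"{name:<34} alpha={alpha:<6} 1/alpha={d:5.2f} quoted={quoted_d}")
```

The reciprocal also shows why these estimates are fragile: a small error in a shallow exponent such as $0.076$ moves the implied dimension by whole units.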

The business value is coupled diagnosis, not mystical explanation

The commercial relevance of this paper is not that executives should learn Hamilton-Jacobi theory before approving a model deployment. Most of them have suffered enough.

The value is diagnostic. The framework gives operators a way to see several familiar AI trade-offs as one coupled design surface.

Paper result	Cognaptus business interpretation	Boundary
$\varepsilon$ is softmax temperature, PDE viscosity, and entropy regularisation	Treat temperature as a governance-relevant design variable, not post-hoc seasoning	Strongest for LSE and softmax-based components
Higher $\varepsilon$ suppresses curvature and improves certified perturbation radius	Robustness can be tuned through smoothness, with a measurable loss of sharp discrimination	Certificates are tied to the paper’s setting and representation assumptions
Width discretises the measure; approximation depends on intrinsic dimension	Compute planning should consider data geometry, not just parameter count	Intrinsic dimension estimates from scaling curves are rough and confounded
OOD behaviour becomes dominant-neuron extrapolation outside support	Alignment and safety require support coverage, OOD penalties, or viscosity control	This is a structural model of extrapolation, not a full theory of language hallucination
Backpropagation is an adjoint Hamiltonian flow for ResNets	Training algorithm design can borrow from numerical PDE and control methods	Convergence destination under practical SGD remains open
Attention sinks resemble dominant neurons in the tropical limit	Some attention pathologies can be read as measure concentration rather than mysterious emergent behaviour	Multi-head causal transformer stacks are not fully covered by the exact theorem

The result is especially relevant for AI governance and model-risk teams because it replaces vague behavioural discussion with structural levers. A system can be asked: How much curvature are we allowing? How dense is the support coverage? How does the chosen architecture discretise the relevant process? Where does the input fall relative to the learned measure? What happens as the model moves toward hard selection?

These questions do not guarantee safety. They make failure modes less literary.

Scaling laws become geometry claims wearing compute clothing

The paper’s scaling-law argument is one of its most business-relevant pieces, but it needs careful handling.

The derivation connects approximation error to quadrature over a data measure. If the data lies on or near a lower-dimensional manifold, the relevant dimension is not the ambient input dimension but the intrinsic dimension of the support. The width-only scaling exponent can then be interpreted through this intrinsic dimension.

This is a useful reframing of compute planning. It says the economic cost of performance improvement depends not only on how large the model is, but on how hard the data distribution is to cover. A broad, messy, high-dimensional support requires many more atoms. A constrained, structured domain needs fewer.

This is where small specialised models become more than a cost-saving slogan. If a domain has lower intrinsic dimension, specialisation is not merely cheaper because the model is smaller. It is cheaper because the approximation problem is geometrically easier.

The paper’s Appendices J and K translate this into design principles: estimate intrinsic dimension from pilot scaling curves, match $\varepsilon$ to width and dimension, use architectural inductive bias to reduce effective dimension, curate data to lower support complexity, and treat mixture-of-experts or distillation as ways of partitioning or compressing the learned measure.

Cognaptus inference: this supports a more disciplined alternative to brute-force scaling. Before buying more compute, ask whether the problem is actually high-dimensional, whether the architecture is wasting capacity on irrelevant symmetries, and whether the training corpus is expanding the support or just adding noisy repetitions to it.

Boundary: the paper does not prove a universal cost model for frontier training. Empirical scaling exponents conflate approximation, optimisation, data quality, and training allocation. The intrinsic-dimension reading is powerful, but it is not an invoice from geometry.

OOD behaviour is not mysterious when the measure runs out

The paper’s treatment of hallucination is deliberately structural. It defines an out-of-distribution extrapolation regime in which the input lies outside the diffusion radius of every support point. In that regime, the output becomes exponentially close to the linear extrapolation of the dominant neuron.
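An elementary version of that statement can be checked on a toy layer. The sketch assumes a hypothetical scalar-input LSE layer and uses the score margin between the winning neuron and the runner-up as a stand-in for distance from the support. The paper’s diffusion-radius definition is more specific. The gap between the layer output and the winner’s affine extrapolation is bounded by $\varepsilon (n-1) e^{-\text{margin}/\varepsilon}$.

```python
import math

# Hypothetical scalar-input layer; values are illustrative only.
W = [-1.0, 0.5, 2.0]
B = [0.0, 1.0, -1.0]
EPS = 0.5

def scores(x):
    return [w * x + b for w, b in zip(W, B)]

def lse_layer(x):
    """Stable evaluation of EPS * log(sum_i exp(score_i / EPS))."""
    s = scores(x)
    m = max(s)
    return m + EPS * math.log(sum(math.exp((si - m) / EPS) for si in s))

def dominant_extrapolation(x):
    """Affine output of the single neuron that wins at x."""
    return max(scores(x))

for x in (1.0, 3.0, 6.0, 12.0):
    s = sorted(scores(x), reverse=True)
    margin = s[0] - s[1]  # lead of the dominant neuron over the runner-up
    gap = lse_layer(x) - dominant_extrapolation(x)
    # gap <= EPS * (n - 1) * exp(-margin / EPS): exponentially small in the margin
    print(f"x={x:5.1f} margin={margin:6.2f} gap={gap:.3e}")
```

Far from the region where the neurons compete, the layer is numerically indistinguishable from a single straight line, whatever the data looked like near the support.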

This is not a complete account of language-model hallucination. The authors are explicit that the term is used informally and that language-model phenomena include other mechanisms. Still, the result captures a critical operational fact: outside the learned support, the model is no longer governed by local evidence from the training measure. It is governed by whichever learned component dominates the extrapolation geometry.

That has business consequences.

A compliance model queried on an unseen regulatory scenario may produce an answer that is locally smooth, syntactically respectable, and structurally ungrounded. A medical triage model presented with an unusual patient profile may confidently extend a nearby learned rule. A finance model may interpolate beautifully until market structure changes, then discover its inner philosopher.

The framework suggests three levers:

1. Expand support coverage with diverse and targeted training data.
2. Use viscosity control to smooth sharp extrapolation and reduce sensitivity.
3. Add OOD penalties or synthetic boundary cases so risky regions become part of the training measure rather than empty space decorated with confidence.

None of these is magic. They are better than asking the model to “be careful”, which remains one of the great folk rituals of modern computing.

Attribution becomes a landscape, not a receipt

The paper also derives a closed-form attribution and label-sensitivity structure for the Hopf-Cole predictor. The attribution weights are Gibbs weights: each support point contributes fractionally to the prediction. The label sensitivity can be computed without Hessian inversion, unlike classical influence-function approaches that ask how reweighting a training point changes the learned parameters.

This distinction matters. The paper’s attribution result answers a fixed-parameter question: given the learned support and weights, how does a label perturbation affect this prediction? It does not automatically answer the full retraining question: how would the entire trained model change if this data point had been upweighted or removed?
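The fixed-parameter question is easy to illustrate under a simplifying assumption: take the predictor to be a Gibbs-weighted average of labels, $\hat{y}(x) = \sum_i p_i(x)\, y_i$ with $p_i(x) \propto \exp(-|x - x_i|^2 / (2\varepsilon))$. This form, the support points, and the labels below are hypothetical stand-ins, not the paper’s definition. In that setting the sensitivity of the prediction to label $y_i$ is just the weight $p_i(x)$, with no Hessian to invert.

```python
import math

# Hypothetical support points (two loose clusters) and labels.
SUPPORT = [-2.0, -1.6, 1.6, 2.0]
LABELS = [0.0, 0.2, 0.9, 1.0]

def gibbs_weights(x, eps):
    """Assumed attribution weights p_i proportional to exp(-(x - x_i)^2 / (2 eps))."""
    a = [-((x - xi) ** 2) / (2.0 * eps) for xi in SUPPORT]
    m = max(a)
    e = [math.exp(ai - m) for ai in a]
    z = sum(e)
    return [ei / z for ei in e]

def predict(x, eps, labels=LABELS):
    """Gibbs-weighted prediction with the support and weights held fixed."""
    return sum(p * y for p, y in zip(gibbs_weights(x, eps), labels))

def label_sensitivity(x, eps):
    """d(prediction)/d(y_i) at fixed parameters: just the Gibbs weight p_i(x)."""
    return gibbs_weights(x, eps)

def attribution_entropy(x, eps):
    """Shannon entropy of the attribution weights at x."""
    return -sum(p * math.log(p) for p in gibbs_weights(x, eps) if p > 0.0)

x = 1.0
for eps in (0.05, 0.5, 5.0):
    sens = [round(s, 3) for s in label_sensitivity(x, eps)]
    print(f"eps={eps:<5} sensitivities={sens} entropy={attribution_entropy(x, eps):.3f}")
```

At small $\varepsilon$ one support point carries nearly all the attribution. As $\varepsilon$ grows the weights spread and the entropy rises. Nothing here says how the prediction would change after retraining without a point.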

The attribution entropy analysis is more geometric. At small $\varepsilon$, attribution basins form around support points. As $\varepsilon$ increases, local minima and saddles collide through fold bifurcations, merging attribution basins. The numerical phase diagrams show this behaviour in synthetic two-cluster data and projected MNIST examples.

For operators, this is useful because it gives a vocabulary for attribution instability. Some explanations change abruptly not because the explainer is badly behaved, but because the model’s attribution landscape crosses a basin boundary. Near those transition regions, small input changes can switch the dominant evidence source. The system may still be mathematically consistent. It is just not narratively stable. Humans, regrettably, often prefer narrative stability to consistency.

What should change in model reviews

This paper should not cause every organisation to replace its model cards with PDE diagrams. It should change the questions asked in serious model reviews.

A model-risk review can use the framework as a checklist of coupled controls:

Review question	Why it follows from the paper	Practical evidence to request
What is the effective temperature or softmax sharpness in the relevant components?	$\varepsilon$ controls smoothness, curvature, and selection	Temperature settings, calibration curves, entropy profiles
How dense is support coverage in high-risk input regions?	Width and support points approximate the data measure	Coverage tests, OOD probes, rare-case evaluation
Does the architecture encode useful invariances?	Inductive bias can reduce effective intrinsic dimension	Ablations against task symmetries and domain constraints
Where are attribution basin boundaries?	Attribution can bifurcate as viscosity changes	Explanation stability maps near decision boundaries
Are robustness claims representation-specific?	Hessian bounds may hold after projection or within a defined layer space	Raw-input and learned-representation sensitivity tests
Does fine-tuning preserve old support points?	Continual learning can displace measure coverage	Forgetting tests and support-preservation diagnostics

The deeper lesson is that AI assurance should stop treating temperature, robustness, interpretability, scaling, and architecture as separate panels in a governance dashboard. They interact. The paper gives a mathematical language for that interaction.

The exactness boundary is where the article should end, not where it should panic

The paper is ambitious, and ambition always creates a larger target.

The first boundary is architectural. The exact Hopf-Cole identity is for LSE-activated feedforward layers and quadratic or anisotropic-quadratic Hamiltonians. ReLU and Softplus fit as special or limiting cases. Sigmoid and tanh have exact component-level interpretations. GELU and SiLU have structural connections but no exact HJ identity in the paper. Standard production transformer stacks mix exact, structural, and unresolved pieces.

The second boundary is training dynamics. The paper characterises what a trained or fixed-parameter LSE network computes. It gives mean-field support for training as selection over initial-data measures and fixed-support convexity or NTK-style results in narrower regimes. But practical deep-network SGD, finite width, mini-batches, and non-Gaussian data are not fully characterised.

The third boundary is dimensionality. The approximation rate is minimax-optimal for Lipschitz functions: no method can do better in the worst case over that class, so the curse of dimensionality is not abolished. It is relocated. If useful data lies on a lower-dimensional manifold, the effective problem improves. If it does not, the theory does not politely pretend otherwise.

The fourth boundary is empirical scale. The experiments are well aligned with the paper’s theoretical claims, but they are not a production validation study. They verify identities, bounds, convergence behaviours, and exploratory attribution structures. They do not prove that large commercial models become robust, aligned, or cheap by invoking $\varepsilon$.

This is not a weakness. It is an unusually clear map of where the exact mathematics stops and where engineering begins. Given the usual state of AI theory discourse, that is almost refreshing.

Conclusion: the useful theory is the one that links the knobs

The paper’s most valuable contribution is not that it gives deep learning a grand physical metaphor. Grand metaphors are cheap; physics metaphors are sold in bulk.

The value is that it identifies a shared mechanism. In log-sum-exp networks, $\varepsilon$ is not merely a temperature. It is the bridge between smooth neural computation, PDE viscosity, entropy-regularised optimisation, and tropical hard selection. From that bridge, the paper derives consequences for robustness, attribution, scaling, backpropagation, and architecture.

For businesses, the takeaway is operational discipline. Do not ask whether a model is “robust” in isolation. Ask what viscosity regime it operates in. Do not ask whether a model is “large enough” in isolation. Ask how width interacts with the intrinsic dimension of the data support. Do not ask whether attention is “interpretable” in isolation. Ask when its Gibbs measure concentrates, when basins merge, and when the system is merely extrapolating from a dominant support point.

The paper does not make modern AI simple. It makes several of its trade-offs less independent than they looked. That is useful, because independent-looking knobs are exactly how complex systems become expensive surprises.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jose Marie Antonio Miñoza, Erika Fille T. Legara, and Christopher P. Monterola, “The Hamilton–Jacobi Theory of Deep Learning,” arXiv:2605.28983v1, 27 May 2026, https://arxiv.org/abs/2605.28983.
