The Likelihood Illusion: When Gaussian Comfort Meets Reality

Confidence is cheap. Calibration is expensive.

That is the uncomfortable lesson behind a new arXiv paper on earthquake source inversion, a domain that sounds safely remote until one notices the pattern: a complex physical simulator, uncertain model inputs, high-dimensional observations, and a decision-maker who wants a probability distribution rather than a shrug.¹ Replace “earthquake waveform” with “financial stress scenario,” “robot sensor stream,” “industrial digital twin,” or “clinical simulator,” and the problem becomes less geological and more familiar.

The paper studies full-waveform moment tensor inversion. In plain terms, seismologists observe seismic waves and infer what kind of earthquake source produced them: its mechanism, orientation, magnitude, and whether the source looks like a standard double-couple tectonic event or has non-double-couple components such as isotropic or CLVD terms. The inference is hard because the waveforms depend not only on the source but also on the Earth structure through which the waves travelled. If the Earth model is slightly wrong, the source estimate can become wrong too.

The familiar solution is Bayesian: include Earth-structure uncertainty in the likelihood, estimate a covariance matrix for the theory error, and produce a posterior. Very civilized. Very Gaussian. Very comforting.

The paper’s point is that this comfort is partly theatrical. Even minor 1-D velocity-model uncertainty, around the 1–3% range stated in the abstract, can break the Gaussian approximation used to represent theory errors. The result is not just a messy residual plot. It is biased and overconfident posterior inference. The posterior contours look precise, but precision is not the same as truth. A narrow wrong answer is still wrong; it merely has better typography.

The useful way to read the paper is not as “SBI beats Gaussian likelihood.” That is too blunt. The paper compares three inference designs, each with a different operational bargain:

Inference design	What it tries to buy	What it quietly risks
Gaussian likelihood with theory covariance	Tractability, interpretability, compatibility with established Bayesian inversion	Misspecified uncertainty, biased estimates, under-reported posterior width
Score-compression SBI	A lightweight bridge from physics-based sensitivities to simulation-based posterior learning	Dependence on a reasonably accurate local Gaussian estimate
Deep-learning SBI	Flexible compression of full waveform data and implicit marginalization over nuisance uncertainty	Higher training cost, architecture dependence, out-of-distribution risk

That comparison is where the business lesson lives. The paper is less about earthquakes than about uncertainty engineering: when should an organization trust a handcrafted likelihood, when should it add a simulation-trained correction layer, and when should it pay the upfront cost of a reusable probabilistic model?

The Gaussian baseline is elegant because it throws away the difficult part

The standard likelihood-based workflow begins from a reasonable objective. We want the posterior over source parameters while accounting for uncertain Earth structure. In the full problem, this means integrating over Earth-model parameters. That is conceptually clean but computationally painful, because every proposed Earth model may require new Green’s functions or equivalent forward modelling. The exact Bayesian version is not impossible in principle; it is simply unpleasant enough that practitioners search for approximations. This is how many legacy modelling pipelines are born: not from stupidity, but from bills.
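In symbols, using notation chosen here for illustration rather than taken from the paper (m for source parameters, Ω for Earth-model parameters, d for the observed waveforms), the marginalized posterior reads:

```latex
p(\mathbf{m} \mid \mathbf{d}) \;\propto\; p(\mathbf{m}) \int p(\mathbf{d} \mid \mathbf{m}, \boldsymbol{\Omega})\, p(\boldsymbol{\Omega})\, \mathrm{d}\boldsymbol{\Omega}
```

Each evaluation of the integrand needs a forward simulation through a different Earth model, which is where the bill comes from.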

The common approximation treats theory error as an additive Gaussian contribution to the data covariance. In the paper’s setup, the total covariance is split into data noise and theory error. The theory covariance can be estimated by Monte Carlo sampling of Earth-model perturbations and observing how the synthetic waveforms vary.
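The recipe can be sketched in a few lines. This is a toy under loud assumptions, not the paper's code: `forward` is a hypothetical one-trace simulator (a windowed sinusoid whose arrival time depends on a single velocity), and the 2% velocity spread, the 100 km distance, and the noise level are illustrative values.

```python
import numpy as np

def forward(amplitude, velocity, t, distance=100.0, period=20.0):
    """Hypothetical toy waveform: windowed sinusoid arriving at distance / velocity."""
    arrival = distance / velocity
    window = np.exp(-0.5 * ((t - arrival) / period) ** 2)
    return amplitude * window * np.sin(2.0 * np.pi * (t - arrival) / period)

def theory_covariance(amplitude, t, v0=3.5, rel_sigma=0.02, n_draws=500, seed=0):
    """Monte Carlo estimate: perturb the velocity, simulate, take the sample covariance."""
    rng = np.random.default_rng(seed)
    velocities = v0 * (1.0 + rel_sigma * rng.standard_normal(n_draws))
    sims = np.stack([forward(amplitude, v, t) for v in velocities])
    return np.cov(sims, rowvar=False)

def gaussian_loglike(data, prediction, cov):
    """Gaussian log-likelihood with a full covariance matrix."""
    resid = data - prediction
    _sign, logdet = np.linalg.slogdet(cov)
    quad = resid @ np.linalg.solve(cov, resid)
    return float(-0.5 * (quad + logdet + resid.size * np.log(2.0 * np.pi)))

t = np.linspace(0.0, 80.0, 41)
c_theory = theory_covariance(1.0, t)
c_data = 0.05 ** 2 * np.eye(t.size)      # assumed white data noise
c_total = c_data + c_theory              # additive, independent error terms

pred = forward(1.0, 3.5, t)
ll_exact = gaussian_loglike(pred, pred, c_total)
ll_off = gaussian_loglike(pred + 0.5, pred, c_total)
print(ll_exact > ll_off)
```

Note what the sketch quietly does: the covariance is computed at one fixed source (`amplitude=1.0`) and then reused, which is the "local" assumption discussed next.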

This sounds disciplined. It is disciplined. It is also restrictive.

To reach the Gaussian likelihood, several things are being assumed or softened into convenience. Perturbations to Earth structure are treated as small and locally linear effects on the forward model. Priors and errors are pushed toward Gaussianity. The covariance is local to a source-parameter estimate. Data noise and theory error are assumed independent and additive. For computational tractability, the Gaussian implementation also models per-station-component covariance blocks and ignores cross-component and cross-station covariance, although velocity-model perturbations can generate exactly those correlations.
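A toy illustration of what the block-diagonal shortcut discards. Everything here is hypothetical: two stations, five time samples each, and one shared perturbation that shifts the second station twice as much as the first (as if it were farther away). None of it comes from the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_draws, n_t = 2000, 5
t = np.linspace(0.0, 1.0, n_t)

# One shared perturbation drives the time shift at both stations.
shift = 0.05 * rng.standard_normal(n_draws)
sta_a = np.sin(2.0 * np.pi * (t[None, :] - shift[:, None]))
sta_b = np.sin(2.0 * np.pi * (t[None, :] - 2.0 * shift[:, None]))

sims = np.hstack([sta_a, sta_b])          # shape (n_draws, 2 * n_t)
full = np.cov(sims, rowvar=False)

# Block-diagonal approximation: keep per-station blocks, zero the rest.
block = np.zeros_like(full)
block[:n_t, :n_t] = full[:n_t, :n_t]
block[n_t:, n_t:] = full[n_t:, n_t:]

dropped = float(np.abs(full - block).max())
print(f"largest discarded cross-station covariance: {dropped:.3f}")
```

The discarded entries are not small relative to the diagonal in this toy, because both stations respond to the same underlying perturbation.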

This is the first important correction to the usual “Bayesian means honest uncertainty” story. Bayesian machinery does not rescue a bad likelihood. It can make the bad likelihood extremely coherent.

The authors are careful here. They do not claim that Gaussian likelihood methods are useless. They use the Gaussian approach as a benchmark, and in some synthetic cases it gives a reasonable focal mechanism. The problem is subtler: the method can appear precise exactly when its error model is least entitled to that precision.

The first diagnostic asks whether the covariance deserves its own confidence

Before comparing full inversions, the paper asks a simpler question: does the Gaussian covariance actually describe the variability produced by Earth-structure perturbations?

The authors use a goodness-of-fit diagnostic based on a statistic that should follow a chi-square distribution if the Gaussian covariance approximation is valid. They compare the empirical distribution of that statistic across simulated observations against the theoretical expectation, using both visual checks and the Kolmogorov–Smirnov statistic.
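A minimal sketch of this kind of check, assuming the statistic is the squared Mahalanobis distance (the standard choice, though the paper's exact construction may differ). The "waveform" is a hypothetical two-sample toy whose only error source is a random phase shift, deliberately exaggerated to 1 radian so the effect is visible. With two dimensions the chi-square CDF has the closed form 1 - exp(-x/2), so no extra libraries are needed.

```python
import numpy as np

def mahalanobis_sq(samples, mean, cov):
    """Squared Mahalanobis distance of each row of `samples`."""
    diff = samples - mean
    solved = np.linalg.solve(cov, diff.T).T
    return np.einsum("ij,ij->i", diff, solved)

def ks_vs_chi2_2dof(d2):
    """Kolmogorov-Smirnov distance between d2 and a chi-square with 2 dof."""
    x = np.sort(d2)
    n = x.size
    cdf = 1.0 - np.exp(-x / 2.0)
    upper = np.arange(1, n + 1) / n - cdf
    lower = cdf - np.arange(0, n) / n
    return float(max(upper.max(), lower.max()))

rng = np.random.default_rng(0)
n = 4000

# Phase-shifted "waveforms": two samples a quarter period apart, plus small noise.
phase = rng.normal(0.0, 1.0, size=n)
noise = rng.normal(0.0, 0.05, size=(n, 2))
shifted = np.column_stack([np.cos(phase), np.sin(phase)]) + noise

mean = shifted.mean(axis=0)
cov = np.cov(shifted, rowvar=False)

# Control: truly Gaussian samples with the same mean and covariance.
gauss = rng.multivariate_normal(mean, cov, size=n)

ks_shift = ks_vs_chi2_2dof(mahalanobis_sq(shifted, mean, cov))
ks_gauss = ks_vs_chi2_2dof(mahalanobis_sq(gauss, mean, cov))
print(f"KS, phase-shift errors: {ks_shift:.3f}; KS, Gaussian control: {ks_gauss:.3f}")
```

Both sample sets share the same mean and covariance. Only one of them is described by that covariance.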

This test is not the headline result, but it matters because it separates two arguments that are often lazily merged:

Test	Likely purpose	What it supports	What it does not prove
Covariance goodness-of-fit	Diagnostic of the Gaussian theory-error model	Minor Earth-model uncertainty can induce non-Gaussian waveform variability	That every Gaussian inversion will fail in every practical case
Synthetic posterior coverage	Main evidence for inference reliability	Gaussian posteriors become overconfident and biased; SBI is better calibrated	That SBI is automatically safe under arbitrary real-world misspecification
Short-period and balanced-network variants	Robustness and sensitivity tests	Higher-frequency data worsens mismatch; better station coverage does not fix a bad likelihood	That station coverage is unimportant for precision
Shallow isotropic source tests	Challenging edge-case extension	ML-based SBI better preserves isotropic components in tested synthetic cases	That the method is fully validated for operational non-proliferation use
Two real earthquakes	Real-data demonstration	SBI can be applied to established moderate-magnitude events	That all unmodelled 3-D and site effects are solved

The diagnostic shows a familiar failure pattern: the covariance captures some central tendency but struggles with rare traces, phase shifts, amplitude anomalies, heavy tails, and higher-order structure. A covariance matrix is a second-order summary. Waveform errors caused by structural perturbations are not obligated to behave like polite second-order citizens.

This is the moment where the paper becomes relevant to AI systems beyond seismology. Many enterprise models contain an equivalent move: compress a complex uncertainty process into a covariance, a normal residual, a confidence interval, or a scenario band. The approximation may be defensible. But it should be tested as an object of suspicion, not inherited as office furniture.

Score-compression SBI is the careful compromise

The first SBI alternative is not a giant neural network dropped onto waveforms like a meteor. It is a more conservative bridge.

The score-compression approach uses physics-based sensitivity information to compress high-dimensional waveform observations into lower-dimensional summaries. The intuition is sensible: rather than feed every waveform sample into a density estimator, project the residuals onto directions that are informative about the source parameters. Then train a neural density estimator, in this case a normalizing-flow-based model, to estimate the posterior from those compressed summaries.
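The compression step can be sketched for the simplest case, a forward model that is linear in the six independent moment tensor components. This is generic linear-Gaussian score compression, not the paper's implementation: `G` is a random stand-in for real waveform sensitivities, and the covariance and reference point are illustrative assumptions.

```python
import numpy as np

def score_compress(data, G, m_ref, cov):
    """Project residuals onto sensitivity directions: G^T C^{-1} (d - G m_ref)."""
    resid = data - G @ m_ref
    return G.T @ np.linalg.solve(cov, resid)

def fisher(G, cov):
    """Fisher information of the linear-Gaussian model: G^T C^{-1} G."""
    return G.T @ np.linalg.solve(cov, G)

rng = np.random.default_rng(1)
n_data, n_par = 200, 6
G = rng.standard_normal((n_data, n_par))   # hypothetical sensitivities
cov = 0.1 ** 2 * np.eye(n_data)            # assumed local covariance
m_true = rng.standard_normal(n_par)
m_ref = np.zeros(n_par)                    # local expansion point

data = G @ m_true                          # noise-free for clarity
summary = score_compress(data, G, m_ref, cov)

# In the linear, noise-free case the summary recovers the source exactly.
m_hat = m_ref + np.linalg.solve(fisher(G, cov), summary)
print(summary.shape, np.allclose(m_hat, m_true))
```

Two hundred waveform samples become six numbers. In the SBI setting those summaries, computed for many simulations, are what the density estimator is trained on. The dependence on `cov` and `m_ref` is also visible here: both come from a local Gaussian estimate.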

Operationally, this is attractive because it keeps much of the classical modelling logic. It is lightweight, comparatively interpretable, and cheaper than training a deep waveform encoder. It also performed much better than the Gaussian likelihood in calibration tests.

But it is not magic. The paper’s score-compression implementation still relies on a reasonably accurate local Gaussian estimate, including for prior truncation and sensitivity-based compression. In one synthetic failure case, both the Gaussian likelihood and score-compression SBI produce solutions inconsistent with the true source because the local estimate is badly biased. The bridge is useful, but one end of the bridge still rests on Gaussian ground. When that ground moves, the bridge may develop a personality.

This makes score-compression SBI an important middle category for business interpretation. Not every organization should jump directly from handcrafted likelihoods to expensive end-to-end deep probabilistic systems. A physics-informed or domain-informed compression layer can be the practical first step. It can improve calibration without demanding that the company rebuild its modelling stack around a new architecture.

The boundary is equally important: if the dominant errors are strongly nonlinear, nonlocal, or tied to nuisance variables that the old likelihood cannot represent, a lightweight bridge may inherit too much of the old failure mode.

Deep-learning SBI buys flexibility by moving cost upfront

The second SBI framework is more ambitious. It learns the compression directly from full-waveform, multi-station observations.

The architecture combines a shared per-station CNN for temporal waveform features, station and time embeddings, an axial transformer for aggregation across stations and time, learned query tokens for pooling, and a neural spline flow for posterior density estimation. The model is trained end-to-end under the neural posterior estimation objective. In less ceremonial language: the network learns which waveform features matter for source inference and then maps them to a posterior distribution.
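For reference, the standard neural posterior estimation objective can be written as follows. The notation is assumed here, not the paper's: θ for source parameters, x for simulated waveforms, q_φ for the flow-based posterior model.

```latex
\mathcal{L}(\phi) \;=\; -\,\mathbb{E}_{\theta \sim p(\theta),\; \mathbf{x} \sim p(\mathbf{x} \mid \theta)} \left[ \log q_\phi(\theta \mid \mathbf{x}) \right]
```

If each training waveform is simulated with a freshly drawn Earth model, the nuisance uncertainty is averaged over inside p(x | θ), which is what makes the marginalization implicit rather than an explicit integral.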

The business trade-off is straightforward. This path is expensive to build and train, but cheap to reuse.

In the paper’s implementation, dataset generation for the deep-learning approach took about 10 minutes, and training took about 12 hours on a single NVIDIA RTX A6000 GPU. Once trained, the model was globally applicable across moment tensor parameters for the source-receiver configuration. By contrast, the Gaussian and score-compression workflows require local covariance estimation and inference to be repeated for each event. For the 300-inversion coverage tests, those classical-style workflows took 3–4 days on 30 CPUs per experimental configuration and required far more forward model evaluations. The trained deep model could perform posterior inference on hundreds of events within seconds.
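The per-event arithmetic implied by those figures is worth making explicit. The sketch below uses only the numbers quoted above; the helper name is hypothetical, and setting a GPU-hour next to a CPU-hour is a crude comparison, not an accounting standard.

```python
def cpu_hours_per_event(days, cpus, n_events):
    """Total CPU-hours of a batch divided by the number of inversions."""
    return days * 24.0 * cpus / n_events

low = cpu_hours_per_event(3, 30, 300)    # 3 days on 30 CPUs, 300 inversions
high = cpu_hours_per_event(4, 30, 300)   # 4 days on 30 CPUs, 300 inversions
gpu_training_hours = 12.0                # one-off training cost quoted above

print(f"classical workflows: {low:.1f} to {high:.1f} CPU-hours per event")
print(f"deep SBI: {gpu_training_hours:.0f} GPU-hours once, then seconds per batch")
```

This gives 7.2 to 9.6 CPU-hours for every classical inversion, against a one-off training cost that does not grow with the number of events.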

That is not merely a computational footnote. It changes the economics of inference.

Cost type	Gaussian likelihood	Score-compression SBI	Deep-learning SBI
Upfront modelling cost	Moderate	Moderate	High
Per-event recomputation	High	High to moderate	Low once trained
Treatment of nonlinear waveform effects	Limited by covariance approximation	Better, but still partly tied to local Gaussian estimates	More flexible
Calibration in the main synthetic tests	Overconfident and biased	Much better, with occasional bias	Reliable, generally conservative in tested settings
Deployment concern	False precision	Inherited local-likelihood bias	Out-of-distribution behaviour and validation burden

This is the pattern Cognaptus sees repeatedly in AI automation: the strategic question is not “Which model is cheaper?” It is “Where should cost live?” Per-decision recomputation feels safer because the pipeline is familiar. Pre-trained probabilistic inference feels expensive because the cost is visible. But if decisions repeat at catalogue scale, fleet scale, transaction scale, or monitoring scale, upfront training can become the cheaper form of discipline.

Still, deep-learning SBI is not a free lunch. It is a more elaborate lunch with a GPU invoice and a validation appendix.

The synthetic inversions show the difference between tight and calibrated

The main evidence comes from repeated synthetic inversions where the true source parameters are known. This is the cleanest place to judge whether a posterior deserves trust.

The authors compare three methods: Gaussian-likelihood MCMC, score-compression SBI, and deep-learning compression SBI. In an illustrative benign case, the Gaussian likelihood gives posterior contours consistent with the artificial truth and has the narrowest uncertainty. A lazy reading would score that as a win.

The repeated tests say otherwise.

Across 300 artificial moment tensor inversions under moderate Earth-structure uncertainty, the authors use TARP coverage testing to check calibration. The result: only the SBI approaches produce posteriors broadly consistent with the true solutions. The Gaussian likelihood produces posteriors that are severely overconfident and mildly biased. In Table 1, the Gaussian approach often reports smaller posterior standard deviations than the SBI methods, but that tightness is precisely the problem when coverage fails. A small uncertainty number is not a trophy. Sometimes it is just a polished liability.
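The logic of a coverage test fits in a short sketch. This is a simplified check in the spirit of TARP, not the paper's code. It uses a hypothetical two-parameter conjugate Gaussian problem where the exact posterior is known, and compares it with a copy whose standard deviation has been halved. For a calibrated posterior, the fraction of posterior draws that lie closer to a random reference point than the truth does should be uniformly distributed.

```python
import numpy as np

def tarp_fractions(post_samples, truths, refs):
    """Fraction of posterior draws closer to the reference point than the truth.

    post_samples: (n_sims, n_draws, dim); truths, refs: (n_sims, dim).
    """
    d_samples = np.linalg.norm(post_samples - refs[:, None, :], axis=-1)
    d_truth = np.linalg.norm(truths - refs, axis=-1)
    return (d_samples < d_truth[:, None]).mean(axis=1)

def coverage_gap(fractions, levels=np.linspace(0.05, 0.95, 19)):
    """Largest gap between empirical and nominal coverage over a grid of levels."""
    ecp = np.array([(fractions <= c).mean() for c in levels])
    return float(np.abs(ecp - levels).max())

rng = np.random.default_rng(2)
n_sims, n_draws, dim, noise_sd = 2000, 200, 2, 0.5

truths = rng.standard_normal((n_sims, dim))                    # prior N(0, I)
obs = truths + noise_sd * rng.standard_normal((n_sims, dim))   # noisy data
post_mean = obs / (1.0 + noise_sd ** 2)                        # exact posterior mean
post_sd = np.sqrt(noise_sd ** 2 / (1.0 + noise_sd ** 2))       # exact posterior sd
refs = rng.uniform(-3.0, 3.0, size=(n_sims, dim))              # random reference points

def draw(scale):
    """Posterior draws with the standard deviation multiplied by `scale`."""
    eps = rng.standard_normal((n_sims, n_draws, dim))
    return post_mean[:, None, :] + scale * post_sd * eps

gap_calibrated = coverage_gap(tarp_fractions(draw(1.0), truths, refs))
gap_overconfident = coverage_gap(tarp_fractions(draw(0.5), truths, refs))
print(f"calibrated gap: {gap_calibrated:.3f}; overconfident gap: {gap_overconfident:.3f}")
```

Both posteriors have the same mean, so point estimates alone cannot tell them apart. Only the repeated-trial coverage check does.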

The deep-learning approach extracts more information than score compression, producing tighter contours than the score-compression SBI while still remaining reliable in the tested setting. The score-compression approach is much better calibrated than the Gaussian baseline but retains occasional bias, plausibly because of its dependence on local Gaussian estimates and prior truncation.

The distinction matters. The paper is not saying “wider posteriors are better.” It is saying “posterior width must be earned by calibration.” In a business model, the equivalent is not whether a risk engine produces a narrow confidence band. The question is whether that band contains reality at the promised frequency.

The sensitivity tests identify when comfort becomes dangerous

The paper then varies the experimental conditions to see where the Gaussian approximation degrades.

First, the authors vary Earth-structure uncertainty. Even under mild uncertainty, the Gaussian likelihood underestimates moment tensor component uncertainty by around 30%. As velocity-structure uncertainty increases, the pathology worsens and bias persists. The lesson is not that 30% is a universal number. It is that small simulator misspecification can become large posterior miscalibration when the likelihood form is wrong.
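To get a feel for what a 30% shortfall means, here is a back-of-envelope calculation under strong simplifying assumptions: a one-dimensional, unbiased Gaussian posterior whose reported standard deviation is 0.7 times what it should be. That is one reading of "around 30%", not a result from the paper.

```python
from statistics import NormalDist

def actual_coverage(nominal, width_ratio):
    """True coverage of a nominal central interval when the reported
    standard deviation is `width_ratio` times the correct one."""
    z = NormalDist().inv_cdf(0.5 + nominal / 2.0)
    return 2.0 * NormalDist().cdf(z * width_ratio) - 1.0

cov95 = actual_coverage(0.95, 0.7)
cov68 = actual_coverage(0.68, 0.7)
print(f"nominal 95% interval covers about {cov95:.2f}")   # about 0.83
print(f"nominal 68% interval covers about {cov68:.2f}")   # about 0.51
```

Under these assumptions, an interval advertised as 95% contains the truth roughly 83% of the time, before any bias is added.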

Second, the authors test shorter-period waveform data, filtering between 6 and 50 seconds rather than the baseline 20 to 50 seconds. Shorter periods are more sensitive to smaller-scale structure, so the theory errors become more nonlinear and less Gaussian. The Gaussian likelihood degrades further. Score-compression SBI also suffers more bias under this condition, while the ML-based compression remains more reliable in most tested cases, though its performance also degrades. This is a useful result because it refuses the cheap fairy tale that SBI removes all modelling pain. It reduces one class of pain and exposes the next.

Third, the authors test a balanced receiver network with better azimuthal coverage. Better station coverage improves precision, as expected. But it does not fix a poorly specified likelihood. The Gaussian method remains overconfident by a similar amount to the baseline. This is a subtle but important operational result: better data collection can improve signal, but it cannot automatically repair a broken uncertainty model. More sensors do not necessarily cure a bad likelihood. They may simply feed it more material with which to be confidently wrong.

The shallow isotropic tests are an edge case with business-style consequences

The shallow isotropic source tests are not the paper’s broadest validation claim. They are a targeted edge case. That makes them more useful, not less.

Shallow isotropic sources are difficult because different physical mechanisms can produce similar regional surface waves. The paper tests 10 randomly sampled highly isotropic synthetic sources at 500 m depth, using the same station configuration as the Long Valley Caldera event and Earth-structure uncertainty. In several cases, both the Gaussian likelihood and score-compression SBI substantially underestimate the isotropic component. The ML-based compression approach more faithfully recovers the isotropic component in the tested examples.

This matters because edge cases are where uncertainty systems reveal their true personality. In many applied settings, the average case is not the expensive case. The expensive case is the rare diagnosis, the unusual transaction, the anomalous sensor reading, the tail event, the source type with ambiguous signatures. A model that is “usually fine” but systematically overconfident on the very cases that trigger intervention is not operationally fine. It is a governance meeting with charts.

The authors are appropriately cautious. The isotropic tests are preliminary and synthetic. Prior-volume effects may contribute to the tendency to underestimate isotropic components when uncertainty is poorly modelled. The business translation should therefore be precise: the result supports the value of flexible uncertainty modelling in ambiguous-source settings; it does not certify deep SBI as ready for every high-stakes detection workflow.

The real earthquakes show applicability, not final victory

The paper applies the methods to two real events: the 1997 Long Valley Caldera volcanic earthquake and the 2020 Zagreb earthquake.

For the Long Valley event, the Gaussian likelihood and score-compression SBI agree closely on the maximum a posteriori focal mechanism. The important difference is uncertainty: the score-compression SBI approach gives substantially wider posterior uncertainty, which the synthetic results suggest is a more realistic characterization. The ML-based approach gives a slightly different solution with a fractionally higher isotropic component. The authors report that this higher isotropic component persists across ML training runs and even when time shifts are manually corrected, which makes it worth noting but not over-interpreting.

For the Zagreb event, the Gaussian likelihood and score-compression SBI broadly agree with prior work on a double-couple mechanism, though there are minor differences in orientation and magnitude. The ML-based compression approach remains broadly compatible but shows evidence of a minor CLVD component and different strike and rake orientations. The authors explicitly caution that unmodelled effects may introduce bias, and the posterior predictive checks appear under-dispersive, meaning the model’s simulated posterior predictions do not fully cover the observations. Translation: the model is doing something interesting, but reality is still larger than the simulator. Annoying, as reality tends to be.
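One simple way to quantify under-dispersion is to ask how much of the observed trace falls inside a central posterior predictive band. The sketch below is hypothetical throughout: the function name is invented, the data are synthetic, and this is not necessarily the check the authors ran.

```python
import numpy as np

def band_coverage(pred_draws, observed, level=0.95):
    """Fraction of observed samples inside the central predictive band."""
    lo = np.quantile(pred_draws, (1.0 - level) / 2.0, axis=0)
    hi = np.quantile(pred_draws, 1.0 - (1.0 - level) / 2.0, axis=0)
    return float(((observed >= lo) & (observed <= hi)).mean())

rng = np.random.default_rng(4)
n_t = 400
observed = rng.standard_normal(n_t)                 # "reality", unit variability

well_dispersed = rng.standard_normal((1000, n_t))   # predictions match reality
under_dispersed = 0.4 * rng.standard_normal((1000, n_t))  # predictions too tight

cov_well = band_coverage(well_dispersed, observed)
cov_under = band_coverage(under_dispersed, observed)
print(f"well-dispersed: {cov_well:.2f}; under-dispersed: {cov_under:.2f}")
```

A band that claims 95% but catches far fewer observed samples is the numerical form of "reality is larger than the simulator".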

These real-data results should be read as application demonstrations, not as deployment closure. They show that SBI can be applied to well-studied moderate-magnitude earthquakes and can produce plausible posterior predictive checks. They also show the central remaining issue: if the simulator fails to capture important 3-D structure, site response, or other contaminants, the learned posterior will inherit that blind spot.

Simulation-based inference is only as honest as the simulation campaign that trains it.

The business lesson is not “use deep learning”; it is “audit the likelihood”

The easiest misuse of this paper would be to turn it into an AI sales slide: old statistics bad, neural methods good. Convenient, punchy, and not quite true.

The better interpretation is a diagnostic framework for uncertainty-bearing systems.

Business system pattern	Paper analogue	Practical question
Digital twin with uncertain physical parameters	Earth-structure uncertainty	Are uncertainty inputs being marginalized or merely patched into a covariance?
Risk engine with scenario simulations	Synthetic waveform generation	Do simulated residuals match the assumed likelihood family?
AI agent making sequential decisions	Moment tensor posterior decisions	Are confidence estimates calibrated under repeated trials?
Monitoring system with many repeated events	Earthquake catalogue inversion	Is cost better paid per event or upfront through amortized inference?
Rare but high-impact edge cases	Shallow isotropic synthetic sources	Does the uncertainty model fail exactly where interpretation is hardest?

Cognaptus inference: the paper points toward a more mature uncertainty workflow for scientific AI and business automation.

First, treat likelihood design as a product component, not a mathematical afterthought. The likelihood defines what the system is allowed to believe about errors. If that object is misspecified, every downstream posterior inherits the distortion.

Second, stress-test uncertainty models through simulation before trusting their confidence. It is not enough to inspect point estimates. Repeated simulation with known truth allows calibration testing: does the posterior cover reality as often as it claims?

Third, separate precision from calibration. A narrow confidence interval, a tight posterior, or a decisive AI recommendation may be evidence of good information. It may also be evidence of an error model with excellent self-esteem.

Fourth, choose the inference architecture according to the operating regime. For one-off analyses, a classical likelihood with careful diagnostics may be enough. For repeated decisions under complex uncertainty, amortized SBI can move cost from repeated inference to training. For edge cases with nonlinear observational signatures, a learned compression architecture may be worth the validation burden.

Where this result stops

The paper’s boundary is clear enough if one does not try to make it more glamorous than it is.

The strongest evidence is synthetic and focused on moment tensor inversion under mostly 1-D Earth-structure uncertainty. The authors simplify azimuthal variations through station-specific time shifts in the real-data setting. They explicitly identify 3-D Earth structure, anisotropy, site response, source time functions, source-location uncertainty, receiver uncertainty, and forward-modelling error as important future directions.

The score-compression SBI method is attractive because it is lightweight and physically informed, but it remains partly tied to the Gaussian likelihood machinery. When local Gaussian estimates are biased, the method can inherit that bias.

The deep-learning SBI method is more flexible and efficient once trained, but its reliability under unmodelled real-world effects remains an open validation problem. A neural posterior estimator trained on an incomplete simulator does not become omniscient. It becomes very good at the world it was shown. We have a name for systems that are confident outside their training world: expensive surprises.

Finally, neural posterior estimation bakes in the prior used during dataset generation. The authors note that neural likelihood estimation could offer a different route when one wants to probe alternative priors or combine independent measurements. That point matters for business systems too: the more reusable the learned inference layer becomes, the more carefully one must document the assumptions embedded in its training distribution.
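The dependence on the training prior can be made explicit. With notation assumed here (p for the prior used to generate training data, p̃ for an alternative prior, q_φ for the learned posterior), the posterior under the alternative prior is obtained by reweighting:

```latex
\tilde{p}(\theta \mid \mathbf{x}) \;\propto\; q_\phi(\theta \mid \mathbf{x})\, \frac{\tilde{p}(\theta)}{p(\theta)}
```

The correction only works where the training prior has support and the learned posterior is accurate. An alternative prior that reaches outside the training distribution asks the network about a world it was never shown.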

From Gaussian comfort to calibrated discomfort

The paper’s contribution is not merely a new seismic inversion method. It is a compact lesson in probabilistic humility.

The Gaussian likelihood is attractive because it turns a hard marginalization problem into a tractable covariance problem. Sometimes that is enough. But the paper shows that even minor Earth-model uncertainty can create non-Gaussian theory errors, and that those errors can translate into biased, overconfident moment tensor posteriors. The posterior looks official. The uncertainty is printed. The calibration, however, has left the building.

Score-compression SBI offers a pragmatic compromise: use physics to compress the waveform information, then learn the posterior empirically. It improves calibration substantially but can still inherit failure from local Gaussian estimates.

Deep-learning SBI offers the more flexible path: learn the waveform representation and posterior together, pay the training cost upfront, and amortize inference across many events. It performs well in the paper’s synthetic tests and is promising for catalogue-scale Bayesian inversion, but it requires serious validation against richer real-world uncertainty.

For business readers, the lesson is simple but not comfortable: uncertainty is not something to decorate a prediction with after the model is built. It is part of the model. If the uncertainty layer is wrong, the system’s confidence becomes a liability.

The future of scientific AI, digital twins, and agentic decision systems will not be won by models that merely predict. It will be won by systems that know how wrong they might be, and can prove it more convincingly than a bell curve drawn out of habit.

Cognaptus: Automate the Present, Incubate the Future.

¹ A. A. Saoulis, T.-S. Phạm, and A. M. G. Ferreira, “Improving moment tensor solutions under Earth structure uncertainty with simulation-based inference,” arXiv:2603.18925, submitted March 19, 2026. https://arxiv.org/abs/2603.18925
