Uncertainty Quantification

Coverage Is Not a Range: RPS for Ordinal Conformal Prediction

TL;DR for operators: A risk system for ordered outcomes should return a coherent low-to-high range, not merely a collection of plausible labels. Coverage matters, but it does not distinguish an adjacent miss from one several severity levels away. A narrow range can therefore look efficient while hiding a rare, costly outcome. The method examined here first attaches a calibrated set of plausible labels to each prediction, then evaluates ordered errors through cumulative probability across the label scale. Its ranked probability score produces candidate-label scores that fall toward a predictive median and rise afterward. Thresholding this V-shaped sequence yields nested, median-centered contiguous ranges, even when the classifier’s class probabilities are not unimodal. ...
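To make the V-shape concrete, here is a minimal sketch of scoring every candidate label with the ranked probability score and thresholding the result. It is an illustration under assumptions, not the paper's implementation: the function names `rps_scores` and `rps_range`, the example probabilities, and the threshold value are all hypothetical, and in a real conformal pipeline the threshold would come from a calibration quantile rather than being hand-picked.

```python
from itertools import accumulate


def rps_scores(probs):
    """RPS of the forecast against each candidate label (0-indexed, ordered).

    For candidate y the score is sum_k (F_k - 1[y <= k])^2, where F_k is the
    cumulative probability up to label k. The last term is always zero
    (F = 1 and the indicator = 1), so it is dropped.
    """
    cdf = list(accumulate(probs))[:-1]
    return [
        sum((f - (1.0 if y <= k else 0.0)) ** 2 for k, f in enumerate(cdf))
        for y in range(len(probs))
    ]


def rps_range(probs, threshold):
    """Labels whose RPS score is at or below the threshold.

    The threshold is a stand-in for a calibrated quantile (assumption).
    """
    scores = rps_scores(probs)
    return [y for y, s in enumerate(scores) if s <= threshold]


# Bimodal forecast over five severity levels: the modes sit at both ends.
probs = [0.35, 0.05, 0.15, 0.05, 0.40]
print(rps_range(probs, threshold=0.9))  # → [1, 2, 3]
```

Moving the candidate from label y to y+1 changes the score by 2·F_y − 1, so scores fall while the cumulative probability is below one half and rise once it passes it. That is why the selected labels form one contiguous block around the median, even though a plain probability threshold on this forecast would pick the two extreme labels.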

The Model Got Smaller. The Risk Got Wider.

TL;DR for operators: Compression is usually sold as a clean engineering bargain: smaller model, lower memory, cheaper inference, acceptable accuracy loss. This paper asks the more operationally annoying question: after compression, does the model still know when it should hedge? The answer is: not reliably. Tong et al. benchmark compressed LLMs using conformal prediction, a framework that converts model probabilities into prediction sets with target coverage.1 In this setup, the important uncertainty metric is prediction set size: if the model needs to include more answer options to maintain coverage, it is less certain, even if its top-1 accuracy still looks respectable. ...
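For readers who have not met conformal prediction, the sketch below shows one common variant, split conformal prediction with the score 1 − p(true label). Whether Tong et al. use this exact score is not stated in the excerpt, so treat the score choice, the function names, and the toy numbers as assumptions made for illustration.

```python
import math


def conformal_threshold(cal_true_probs, alpha):
    """Calibrated score threshold for target coverage 1 - alpha.

    cal_true_probs: probability the model gave the correct label on each
    held-out calibration example. The nonconformity score is 1 - p_true
    (an assumed, commonly used choice).
    """
    scores = sorted(1.0 - p for p in cal_true_probs)
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        return float("inf")  # too few calibration points: include everything
    return scores[rank - 1]


def prediction_set(probs, qhat):
    """All labels whose score 1 - p is within the calibrated threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= qhat]


# Toy calibration data (hypothetical), target coverage 75%.
qhat = conformal_threshold([0.9, 0.8, 0.7, 0.5, 0.25, 0.125, 0.95], alpha=0.25)

print(prediction_set([0.80, 0.15, 0.05], qhat))  # confident → [0]
print(prediction_set([0.40, 0.35, 0.25], qhat))  # hedging  → [0, 1, 2]
```

Both example inputs have the same top-1 label, but the second needs three options to stay within the calibrated threshold. Averaged over a test set, that set size is the uncertainty signal the excerpt refers to.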

The Model Agreed With Itself. That Was the Problem.

TL;DR for operators: A model giving the same answer five times is comforting in the same way that five interns copying the same spreadsheet error is comforting: technically consistent, operationally useless. The paper behind this article proposes structural uncertainty, a black-box method for evaluating whether an LLM can stably rank its own reasoning paths, not merely whether its final answers agree.1 The method samples multiple candidate solutions, asks the same model to compare pairs of its own outputs, turns those comparisons into ranking distributions using Bradley-Terry or TrueSkill plus PageRank, then measures two things: whether rankings fluctuate across comparison trials, and whether each trial remains ambiguous among candidates. ...
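As a rough illustration of the step that turns pairwise comparisons into a ranking distribution, the sketch below fits a Bradley-Terry model with the standard iterative (minorization-maximization) update. It covers only that one step for a single comparison trial. The function name `bradley_terry`, the win matrix, and the iteration count are hypothetical, and the paper's TrueSkill and PageRank variants and its two uncertainty measures are not reproduced here.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is how many times candidate i was preferred over candidate j.
    Returns strengths normalized to sum to 1, which can be read as a
    distribution over candidates.
    """
    n = len(wins)
    strength = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each pair (i, j) contributes its number of comparisons divided
            # by the current combined strength of the two candidates.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strength[i])
        norm = sum(updated)
        strength = [s / norm for s in updated]
    return strength


# Hypothetical trial: three candidate solutions, four comparisons per pair.
wins = [
    [0, 3, 4],
    [1, 0, 3],
    [0, 1, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
print(ranking)  # → [0, 1, 2]
```

Repeating this fit over several independent comparison trials would give one strength distribution per trial. Comparing those distributions is where a measure of ranking fluctuation could be computed, while the spread within a single distribution speaks to ambiguity among candidates.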

Raw Is Not Ready: Why Reliable AI Needs Evidence Architecture

Production AI has entered its awkward teenage phase. It can speak fluently, see impressively, forecast usefully, and still fail in ways that make operators quietly reach for the manual override. The problem is not simply that models are too small, not enough tokens have been burned, or someone forgot to add “think step by step” to a prompt. The deeper problem is that many AI systems are being asked to reason directly from raw inputs that have not yet been converted into the right operational form. ...

The Price of Explanation: When AI Should Stay Silent

Explanation is not free. That sounds obvious until one watches an AI system in production. A model predicts. A user asks why. The platform dutifully runs SHAP, LIME, saliency maps, or some carefully branded interpretability module, then presents a ranked list of “important” features with the solemn confidence of a consultant who has just discovered a bar chart. ...

Calibrated Confidence: When AI Learns to Doubt Itself (Just Enough)

A doctor does not need an assistant that sounds certain all the time. That is just an intern with better typography. What the doctor needs is narrower and more useful: an assistant that knows when its answer deserves a second look. In high-stakes work, the confidence attached to an answer is not decoration. It is workflow metadata. It tells the system whether to proceed, pause, escalate, or ask someone with a license and malpractice insurance. ...

The Likelihood Illusion: When Gaussian Comfort Meets Reality

Confidence is cheap. Calibration is expensive. That is the uncomfortable lesson behind a new arXiv paper on earthquake source inversion, a domain that sounds safely remote until one notices the pattern: a complex physical simulator, uncertain model inputs, high-dimensional observations, and a decision-maker who wants a probability distribution rather than a shrug.1 Replace “earthquake waveform” with “financial stress scenario,” “robot sensor stream,” “industrial digital twin,” or “clinical simulator,” and the problem becomes less geological and more familiar. ...

Confidence Gates: When AI Should Know Enough to Say 'I Don't Know'

Traffic. That is the easiest way to understand confidence gates. A recommender system ranks products. An ad system ranks bids. A clinical triage system ranks cases. A fraud model ranks transactions. Somewhere inside the pipeline, someone asks the apparently sensible question: Should the system act on this prediction, or should it step back? ...

Attention with Doubt: Teaching Transformers When Not to Trust Themselves

Confidence is cheap. A classifier can always give you a probability. The awkward question is whether that probability deserves to be believed. This is not a philosophical problem when the model is recommending a movie. It becomes expensive when the model is screening documents, triaging support tickets, flagging fraud, routing legal clauses, or deciding whether a case should be escalated to a human. In those settings, “92% confident” is not decoration. It is an operating instruction. ...

Digging Deeper with Bayes: Why AI May Finally Fix Mineral Exploration

Drilling is where optimism receives an invoice. In mineral exploration, maps can look promising, models can look elegant, and geophysical anomalies can glow like destiny on a consultant’s slide deck. Then the drill rig arrives. A few expensive holes later, the anomaly turns out not to be an economic mineral system, the team moves to the next target, and everyone quietly files the failed interpretation under “learning.” Very scientific. Very costly. ...