When RMSE Lies: Why Your AI Model Might Be Quietly Mispricing Risk

A forecast can be wrong in many ways.

It can miss by a little. It can miss by a lot. It can be accurate on average while quietly underestimating rare but expensive outcomes. It can give a beautifully low RMSE while assigning laughably thin probability to the event that later eats the budget. This is the sort of mistake that looks harmless in a dashboard and expensive in a board meeting.

That is the practical problem behind “ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules,” a new benchmark for evaluating tabular foundation models on distributional regression, not just point prediction.¹ The paper’s immediate contribution is technical: ScoringBench evaluates models across 97 regression datasets using proper scoring rules such as CRPS (the continuous ranked probability score), CRLS, interval score, energy score, weighted CRPS, and log-score-like metrics, alongside familiar point metrics. But the business lesson is simpler and more irritating: the metric is not a reporting choice. It is a model-selection policy.

The usual workflow says: train models, compute RMSE or $R^2$, pick the winner, deploy, add a confidence interval if someone in governance asks for one. Very tidy. Also often wrong.

ScoringBench is useful because it shows why this workflow breaks down when models produce full predictive distributions. TabPFN and TabICL-style models are not merely producing a single number. They can output probability mass functions or quantile-based predictive distributions. Evaluating them only by RMSE is like buying a weather forecast system and grading it only on whether the afternoon temperature was close. The umbrella question has mysteriously disappeared.
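A toy calculation makes the disappearing umbrella visible. The sketch below is not from the paper: the outcomes, the shared point forecast, and the two spreads are hypothetical, and both forecasters are assumed to issue Gaussian predictive distributions. They predict identical means, so RMSE cannot tell them apart, while a log-score-style metric can.

```python
import math

def rmse(outcomes, means):
    """Root mean squared error of the point forecasts."""
    return math.sqrt(sum((y - m) ** 2 for y, m in zip(outcomes, means)) / len(outcomes))

def mean_gaussian_nll(outcomes, means, sigma):
    """Average negative log-likelihood of Gaussian forecasts N(mean, sigma^2)."""
    const = 0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const + (y - m) ** 2 / (2 * sigma ** 2)
               for y, m in zip(outcomes, means)) / len(outcomes)

# Hypothetical data: four routine outcomes and one expensive surprise.
outcomes = [10.5, 9.7, 10.2, 9.6, 16.0]
means = [10.0] * len(outcomes)  # both forecasters predict the same means

shared_rmse = rmse(outcomes, means)
nll_confident = mean_gaussian_nll(outcomes, means, sigma=1.0)  # thin tails
nll_cautious = mean_gaussian_nll(outcomes, means, sigma=3.0)   # wider tails

print(f"RMSE (identical for both):      {shared_rmse:.3f}")
print(f"Mean NLL, confident forecaster: {nll_confident:.3f}")
print(f"Mean NLL, cautious forecaster:  {nll_cautious:.3f}")
```

On these made-up numbers the RMSE is about 2.70 for both forecasters, while the mean negative log-likelihood is roughly 4.57 for the confident one and 2.42 for the cautious one. The point metric sees no difference because it never looks at the spread.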

The metric is the contract, not the decoration

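A point metric answers a point question. RMSE asks whether the conditional mean is close to the observed outcome, with larger errors punished quadratically. MAE asks a slightly different question: it punishes errors linearly and is minimized by the conditional median rather than the mean. $R^2$ rescales squared error against the variance of the target, so on a fixed dataset it inherits the preferences of RMSE. These are not neutral summaries of “model quality.” They are utility functions wearing spreadsheet clothes.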
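To see how different those questions are, here is a minimal sketch using a small, deliberately skewed sample (the numbers are invented for illustration). It compares the two most natural constant predictions: squared error prefers the mean, absolute error prefers the median.

```python
import statistics

def mean_squared_error(outcomes, guess):
    """Average squared miss of a single constant prediction."""
    return sum((y - guess) ** 2 for y in outcomes) / len(outcomes)

def mean_absolute_error(outcomes, guess):
    """Average absolute miss of a single constant prediction."""
    return sum(abs(y - guess) for y in outcomes) / len(outcomes)

# Hypothetical skewed outcomes: mostly small values, one large one.
outcomes = [1, 2, 2, 3, 50]

mean_guess = statistics.mean(outcomes)      # pulled upward by the outlier
median_guess = statistics.median(outcomes)  # stays with the bulk of the data

for label, guess in [("mean", mean_guess), ("median", median_guess)]:
    mse = mean_squared_error(outcomes, guess)
    mae = mean_absolute_error(outcomes, guess)
    print(f"{label:>6} = {guess:5.1f}  MSE = {mse:7.2f}  MAE = {mae:5.2f}")
```

Neither prediction is “more accurate” in the abstract. Each is the better answer to the question its own metric happens to be asking.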

Proper scoring rules make this explicit for probabilistic forecasts. A scoring rule evaluates a predicted distribution after the real outcome is observed. A strictly proper scoring rule has a desirable theoretical property: in expectation, the best thing for the model to report is the true distribution. That sounds reassuring, but it does not mean all proper scoring rules behave identically in practice. In finite samples, under limited training budgets, with imperfect model classes and noisy datasets, different scoring rules induce different practical preferences.
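A small simulation can illustrate both halves of that claim. The sketch below is an illustration under stated assumptions, not the paper's code: outcomes are drawn from a standard normal, three Gaussian forecasts with hypothetical spreads are scored, and the CRPS uses the well-known closed form for Gaussian forecasts. Both CRPS and the log score should prefer the honest forecast, yet they need not agree on which wrong forecast is worse.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def norm_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a N(mu, sigma^2) forecast at outcome y (lower is better)."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm_cdf(z) - 1) + 2 * norm_pdf(z) - 1 / math.sqrt(math.pi))

def nll_gaussian(mu, sigma, y):
    """Negative log-likelihood (log score) of the same forecast (lower is better)."""
    z = (y - mu) / sigma
    return 0.5 * math.log(2 * math.pi) + math.log(sigma) + 0.5 * z * z

def mean_score(score, mu, sigma, outcomes):
    return sum(score(mu, sigma, y) for y in outcomes) / len(outcomes)

# Assumption for this toy: the true distribution is N(0, 1).
rng = random.Random(0)
outcomes = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Hypothetical forecasters who differ only in the spread they report.
spreads = {"honest": 1.0, "overconfident": 0.5, "diffuse": 2.0}
crps = {name: mean_score(crps_gaussian, 0.0, s, outcomes) for name, s in spreads.items()}
nll = {name: mean_score(nll_gaussian, 0.0, s, outcomes) for name, s in spreads.items()}

for name in spreads:
    print(f"{name:>13}: mean CRPS = {crps[name]:.3f}, mean NLL = {nll[name]:.3f}")
```

In expectation the honest forecast scores about 0.56 on CRPS and 1.42 on the log score, the best of the three under both rules. The two misspecified forecasts swap places: CRPS is kinder to the overconfident one (roughly 0.61 versus 0.66), while the log score punishes it far more than the diffuse one (roughly 2.23 versus 1.74). Both rules are proper; they simply price the same mistakes differently.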

This is the mechanism the paper asks readers to notice.

If a scoring rule emphasizes the center of the predictive distribution, it will reward a different kind of model than a rule that penalizes tail behavior more heavily. If an interval score jointly penalizes wide intervals and missed coverage, it is not asking the same question as a coverage metric that only checks whether the realized target falls inside a nominal 90% band. If weighted CRPS emphasizes a region of the distribution, it is no longer pretending that all quantile errors matter equally.
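The weighted-CRPS case can be sketched in a few lines. This is a toy under stated assumptions, not ScoringBench's implementation: forecasts are small, equally weighted ensembles, the weight is an indicator for values above a hypothetical threshold of 15, and the threshold-weighted score is computed by applying the transform max(z, threshold) before the usual ensemble CRPS formula. All ensembles and outcomes are invented.

```python
def ensemble_crps(members, y, transform=lambda z: z):
    """CRPS of an equally weighted ensemble at outcome y, after an optional transform."""
    v = [transform(x) for x in members]
    vy = transform(y)
    m = len(v)
    miss = sum(abs(x - vy) for x in v) / m                     # E|X - y|
    spread = sum(abs(a - b) for a in v for b in v) / (2 * m * m)  # 0.5 * E|X - X'|
    return miss - spread

def upper_tail(threshold):
    """Transform giving a threshold-weighted CRPS with weight 1 above the threshold."""
    return lambda z: max(z, threshold)

def average_score(members, outcomes, transform=lambda z: z):
    return sum(ensemble_crps(members, y, transform) for y in outcomes) / len(outcomes)

# Hypothetical ensembles: one tight around 10, one that keeps some mass in the tail.
thin = [8, 9, 10, 11, 12]
wide = [4, 7, 10, 13, 20]

# Hypothetical outcomes: eight routine periods and two tail events.
outcomes = [10] * 8 + [20] * 2

plain = {"thin": average_score(thin, outcomes), "wide": average_score(wide, outcomes)}
tail = {"thin": average_score(thin, outcomes, upper_tail(15)),
        "wide": average_score(wide, outcomes, upper_tail(15))}

print("plain CRPS:      ", plain)
print("tail-weighted:   ", tail)
```

On these invented numbers plain CRPS prefers the tight ensemble (about 2.16 versus 2.32), while the tail-weighted version prefers the one that kept some probability above the threshold (about 0.8 versus 1.0). Same forecasts, same outcomes, different winner.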

That distinction matters because many business problems are not symmetric-error games.

Business setting	The cheap error	The expensive error	Metric implication
Credit risk	Slightly overestimating default risk	Underestimating default risk before losses cluster	Tail-sensitive distributional scoring matters
Inventory planning	Carrying modest extra stock	Stockout during demand spike	Upper-tail quantiles may matter more than average error
Insurance pricing	Slightly high premium estimate	Thin-tailed loss forecast	Calibration and tail behavior matter more than RMSE
Maintenance prediction	Early inspection	Missed failure window	Interval quality and lower-tail/upper-tail asymmetry matter
Revenue forecasting	Conservative forecast	Overconfident target that drives bad hiring or cash planning	Sharpness must be judged together with calibration

The table is not in the paper; it is the business translation. ScoringBench supplies the technical evidence that the choice of scoring rule changes which models look good. Cognaptus’ inference is that companies should stop treating evaluation metrics as after-the-fact dashboard labels. They are closer to procurement requirements.

What ScoringBench actually builds

The benchmark has three practical pieces.

First, it curates a broad tabular regression testbed. The authors draw from OpenML regression suites, PMLB, KEEL, and TALENT-linked datasets, then deduplicate and filter them. The final benchmark covers 97 regression datasets across domains such as finance, physics, engineering, environmental science, real estate, energy, sensor data, and other standard tabular settings. The dataset appendix is not glamorous, which is usually a sign that it is doing real work.

Second, it evaluates a mixture of modern tabular foundation models and conventional baselines. The model set includes TabPFN variants, TabICLv2, fine-tuned versions of realTabPFNv2.5 trained under different scoring-rule objectives, XGBoost-based distributional variants, XGBoostLSS, CatBoost quantile models, CREPES conformal wrappers, TabM, and RealMLP variants.

Third, it uses two ranking protocols. One is an ordinal Demšar/autorank approach: rank models within each dataset, average ranks across datasets, then use non-parametric tests and critical-difference diagrams. The other is a magnitude-preserving z-score approach: standardize model performance within each dataset so large and small-scale datasets can be aggregated without letting raw score magnitudes dominate. The paper reports that these two ranking methods are generally highly correlated across metrics, though not perfectly; for example, the correlation is very high for CRPS and coverage metrics, but lower for CDE loss and log-score-like metrics.
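The difference between the two aggregation styles is easy to sketch. The code below is a simplified illustration, not the benchmark's implementation: it skips the statistical tests and tie handling that autorank provides, and the model names and scores are hypothetical (lower is better).

```python
import statistics

# Hypothetical scores for three models on three datasets with very different scales.
scores = {
    "dataset_1": {"A": 0.10, "B": 0.11, "C": 0.30},
    "dataset_2": {"A": 5.0, "B": 4.9, "C": 9.0},
    "dataset_3": {"A": 0.52, "B": 0.50, "C": 0.51},
}

def mean_ranks(scores):
    """Ordinal view: rank models within each dataset, then average the ranks."""
    totals = {}
    for per_model in scores.values():
        ordered = sorted(per_model, key=per_model.get)  # assumes no ties in this toy
        for rank, model in enumerate(ordered, start=1):
            totals[model] = totals.get(model, 0) + rank
    return {model: total / len(scores) for model, total in totals.items()}

def mean_z_scores(scores):
    """Magnitude-preserving view: standardize within each dataset, then average."""
    totals = {}
    for per_model in scores.values():
        values = list(per_model.values())
        center = statistics.mean(values)
        spread = statistics.pstdev(values)
        for model, value in per_model.items():
            z = (value - center) / spread if spread > 0 else 0.0
            totals[model] = totals.get(model, 0.0) + z
    return {model: total / len(scores) for model, total in totals.items()}

ranks = mean_ranks(scores)
z_scores = mean_z_scores(scores)
print("mean rank:   ", {m: round(v, 2) for m, v in ranks.items()})
print("mean z-score:", {m: round(v, 2) for m, v in z_scores.items()})
```

Both views order the toy models B, A, C, but they disagree about distances. Ranks place A roughly midway between B and C, while z-scores show C trailing far behind because its losses on the first two datasets were large, not merely last.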

This ranking detail matters. It tells us that the headline result is not merely an artifact of one aggregation trick. The authors also run a dataset-size ablation, re-running the benchmark under smaller capped dataset sizes and comparing rank correlations against the original benchmark. Their interpretation is that rankings are mostly robust to dataset size, with some variation across metrics and models. That appendix test is best read as a robustness check, not a second thesis.
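A rank-correlation check of this kind is simple to reproduce in spirit. The sketch below uses Spearman's rho in its no-ties form as one plausible choice of rank correlation; the paper's exact statistic is not restated here, and the two rankings are hypothetical.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation for two rankings of the same models (no ties)."""
    models = list(ranking_a)
    n = len(models)
    squared_gaps = sum((ranking_a[m] - ranking_b[m]) ** 2 for m in models)
    return 1 - 6 * squared_gaps / (n * (n * n - 1))

# Hypothetical rankings: full benchmark versus a run with capped dataset sizes.
full_run = {"m1": 1, "m2": 2, "m3": 3, "m4": 4, "m5": 5}
capped_run = {"m1": 1, "m2": 3, "m3": 2, "m4": 4, "m5": 5}

rho = spearman_rho(full_run, capped_run)
print(f"Spearman rho between full and capped rankings: {rho:.2f}")
```

Here two adjacent models swap places and the correlation stays at 0.9. A value near 1 says the ordering survived the perturbation; it says nothing about whether the ordering answers the right question.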

The leaderboard is not one leaderboard

The most important result is not “Model X wins.” That would be the comfortable interpretation, and therefore naturally the least useful one.

The important result is that there are multiple plausible leaderboards, because there are multiple plausible definitions of predictive quality.

On CRPS, the leading cluster is dominated by fine-tuned TabPFN variants and fine-tuned TabICLv2. In the paper’s CRPS leaderboard, finetune_tabpfn_realv2_5_crls has the best mean rank, followed closely by finetune_tabiclv2 and other fine-tuned TabPFN variants. The base tabiclv2, without fine-tuning, ranks lower, and the older tabpfn_realv2_5 lower still. Conventional baselines such as XGBoost distributional variants and XGBoostLSS are farther down the CRPS ranking.

That is already interesting, but not because one should now declare a universal winner. Please resist the urge; the leaderboard is not a crown ceremony.

The paper gives a sharper example: fine-tuned TabICLv2 improves its mean CRPS rank from 12.00 to 9.44, moving into the leading cluster. That makes sense because it was fine-tuned with a CRPS objective. But the same fine-tuning does not make it the universal winner across all scoring rules, and the authors explicitly note that it does not improve on log score in the same way. This is the mechanism showing up in the results: fine-tuning moves the model toward the metric it is trained to satisfy.

CatBoost provides a useful counterweight. The paper observes that CatBoost improves substantially under CRLS relative to CRPS. Again, the point is not that CatBoost is secretly superior. The point is that changing the scoring rule changes the operational question, and different models answer different operational questions well.

A cleaner way to read the evidence is this:

Result in the paper	Likely purpose	Business meaning	Boundary
ScoringBench evaluates 97 regression datasets with multiple proper scoring rules	Main benchmark contribution	Evaluation can cover distribution quality, not only point accuracy	Public benchmark datasets are still proxies for a firm’s own decision environment
Fine-tuned TabPFN/TabICL variants lead many probabilistic metrics	Main evidence	Fine-tuning objective can shift deployment-relevant behavior	Effect sizes are often small or negligible, though rankings are directionally consistent
Model rankings change across CRPS, CRLS, interval score, coverage, and point metrics	Main evidence	“Best model” depends on the risk question being asked	The paper does not provide a universal business utility function
Autorank and z-score rankings are usually highly correlated	Robustness / methodological support	The findings are not only a ranking-artifact story	Some metrics, such as CDE loss, show more ranking-method sensitivity
Dataset-size ablation mostly preserves rankings	Robustness / sensitivity test	Results are not obviously driven by one arbitrary sample-size cap	Smaller sample regimes still create metric- and model-specific variation
CREPES improves nominal coverage but can degrade other proper scoring rules	Comparison / tradeoff evidence	Coverage compliance is not the same as high-quality uncertainty estimation	Conformal wrappers may still be appropriate when coverage is the primary requirement

This is the part many business readers should care about: the evidence does not say “throw away RMSE.” It says RMSE answers a narrower question than many deployment decisions require.

Coverage can be comforting and still incomplete

The paper’s conformal-prediction result deserves its own discussion because it touches a common governance instinct.

When a model produces uncertainty intervals, many teams ask whether the 90% interval covers the realized value about 90% of the time. That is a reasonable question. It is also not the whole question.

ScoringBench reports that CREPES-based conformal methods can achieve excellent nominal coverage. In the coverage-90 leaderboard, the top methods include CREPES variants coupled with XGBoost, CatBoost, and TabICLv2. If your governance checklist asks, “Does the 90% interval cover roughly 90% of realized values?”, these methods look attractive.

But the paper also notes a tradeoff. Applying CREPES to TabICLv2 improves coverage while degrading performance on scoring rules that TabICLv2 was optimized for. In the interval-score leaderboard, crepes_tabiclv2 ranks highly, but the broader point is more subtle: conformal methods optimize coverage agreement, while proper scoring rules reward a combination of calibration and sharpness.

Coverage alone can be gamed by making intervals wider. A model that says “between zero and infinity” will have heroic coverage and the practical usefulness of a locked filing cabinet. Interval score exists because businesses usually need both: the interval should contain the outcome often enough, and it should not be so wide that it becomes operationally useless.
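The locked filing cabinet is easy to reproduce. The sketch below uses the standard interval score for a central (1 - alpha) interval, which adds the interval width to a penalty of 2/alpha times the distance by which the outcome falls outside. The simulated outcomes and both intervals are hypothetical.

```python
import random

def interval_score(lower, upper, y, alpha):
    """Interval score for a central (1 - alpha) interval; lower is better."""
    width = upper - lower
    below = (2 / alpha) * (lower - y) if y < lower else 0.0
    above = (2 / alpha) * (y - upper) if y > upper else 0.0
    return width + below + above

def evaluate(lower, upper, outcomes, alpha=0.10):
    """Return (empirical coverage, mean interval score) for a fixed interval."""
    coverage = sum(lower <= y <= upper for y in outcomes) / len(outcomes)
    score = sum(interval_score(lower, upper, y, alpha) for y in outcomes) / len(outcomes)
    return coverage, score

# Assumption for this toy: outcomes follow a standard normal distribution.
rng = random.Random(42)
outcomes = [rng.gauss(0.0, 1.0) for _ in range(5000)]

tight = evaluate(-1.645, 1.645, outcomes)  # a sensible 90% interval
lazy = evaluate(-100, 100, outcomes)       # "never wrong", never useful

print(f"tight interval: coverage = {tight[0]:.3f}, interval score = {tight[1]:.2f}")
print(f"lazy interval:  coverage = {lazy[0]:.3f}, interval score = {lazy[1]:.2f}")
```

The lazy interval covers every outcome and pays for it with an interval score of 200, its full width. The tight interval misses about one time in ten and still scores far better, because the rule charges for width as well as for misses.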

This is where the quietly annoying statistical vocabulary becomes operationally helpful:

Calibration asks whether predicted probabilities correspond to observed frequencies.
Sharpness asks whether predictions are concentrated enough to be useful.
Coverage asks whether a nominal interval contains the realized value at the promised rate.
Interval score penalizes both missed coverage and excessive width.
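These definitions can come apart in a very ordinary setting. In the hypothetical simulation below, the outcome is a feature plus noise. One forecaster uses the feature, the other ignores it and reports the overall spread. Both are assumed Gaussian, and both should hit their nominal 90% coverage; only one of them is sharp.

```python
import math
import random

Z90 = 1.6449  # standard normal quantile for a central 90% interval

def simulate(n=10000, seed=7):
    """Compare an informed and an uninformed forecaster on y = x + noise."""
    rng = random.Random(seed)
    informed_sd = 1.0                # forecast N(x, 1): conditions on the feature
    uninformed_sd = math.sqrt(2.0)   # forecast N(0, 2): ignores the feature
    informed_hits = 0
    uninformed_hits = 0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)
        informed_hits += abs(y - x) <= Z90 * informed_sd
        uninformed_hits += abs(y) <= Z90 * uninformed_sd
    return {
        "informed": {"coverage": informed_hits / n, "width": 2 * Z90 * informed_sd},
        "uninformed": {"coverage": uninformed_hits / n, "width": 2 * Z90 * uninformed_sd},
    }

results = simulate()
for name, stats in results.items():
    print(f"{name:>10}: coverage = {stats['coverage']:.3f}, width = {stats['width']:.2f}")
```

Both forecasters pass a coverage audit at close to 90%. The uninformed one needs intervals about 40% wider to do it, which is exactly the difference a coverage check alone cannot see.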

A bank, insurer, logistics company, or manufacturer rarely wants “coverage” in isolation. It wants a decision-ready uncertainty estimate. That is a stricter demand.

The small-effect-size result is not a weak result

One easy misreading of the paper is to see many negligible Akinshin effect sizes and conclude that nothing important happened. That would be too quick.

The paper is careful here. Fine-tuning under alternative scoring rules often produces small effect sizes, sometimes negligible under the effect-size categories used by the autorank framework. The authors also note that large gains are not expected because the fine-tuning budget is limited, at most 80 epochs with early stopping.

Yet the rank shifts are consistent enough to matter. Fine-tuned variants cluster near the top under several probabilistic metrics. TabICLv2’s CRPS-oriented fine-tuning improves CRPS ranking. Some models move materially when the metric changes. The dataset-size ablation suggests the broad ranking patterns are not purely sample-cap noise.

For business readers, the lesson is not “fine-tuning magically fixes risk.” It is more restrained and more useful: even modest fine-tuning can redirect model behavior toward the scoring rule being optimized. If the scoring rule encodes the wrong business penalty, the model can become more efficiently wrong.

That sentence is worth keeping.

Fine-tuning is not merely a performance booster. It is a preference amplifier.

RMSE is fine when the business problem is actually an RMSE problem

There is no need to perform ritual violence against RMSE. It remains useful when the target decision really is about average point accuracy and when error costs are reasonably symmetric. Many internal forecasting tasks fit this description well enough.

The problem begins when RMSE is used because it is familiar, not because it matches the decision.

A revenue forecast used for broad planning may tolerate RMSE-style evaluation. A capital adequacy model, fraud thresholding system, inventory allocation model, or insurance pricing engine probably should not. In those contexts, the shape of the predictive distribution affects the action. The user may care about the 95th percentile, lower-tail failure probability, interval width, or the probability mass around a regulatory threshold.
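For the 95th-percentile case, the matching loss is the quantile (pinball) loss rather than squared error. The sketch below is a generic illustration with invented numbers, not a recipe taken from the paper; it shows the built-in asymmetry and checks that minimizing the average loss lands at the intended quantile.

```python
def pinball_loss(y, q, tau):
    """Quantile loss for predicted quantile q at level tau, given outcome y."""
    diff = y - q
    return max(tau * diff, (tau - 1) * diff)

def mean_pinball(outcomes, q, tau):
    return sum(pinball_loss(y, q, tau) for y in outcomes) / len(outcomes)

# Same size of miss, opposite directions, at the 95% level.
under = pinball_loss(y=10.0, q=8.0, tau=0.95)   # predicted quantile too low
over = pinball_loss(y=10.0, q=12.0, tau=0.95)   # predicted quantile too high
print(f"under-prediction costs {under:.2f}, over-prediction costs {over:.2f}")

# Hypothetical outcomes 1..100: the loss should be smallest near the 95th percentile.
outcomes = list(range(1, 101))
best_q = min(outcomes, key=lambda q: mean_pinball(outcomes, q, tau=0.95))
print(f"constant prediction with the lowest mean pinball loss: {best_q}")
```

At the 95% level a miss on the low side costs nineteen times as much as the same miss on the high side. That ratio is a business statement about which error hurts, written in the only language the optimizer reads.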

The replacement habit is straightforward:

Current habit	Better question
“Which model has the lowest RMSE?”	“Which model best supports the decision loss we actually face?”
“Does the confidence interval cover 90%?”	“Is the interval calibrated and sharp enough to act on?”
“Which foundation model is best?”	“Best under which scoring rule and deployment cost?”
“Can we add uncertainty after training?”	“Does the uncertainty estimate preserve the distributional quality the model was trained to produce?”
“Can we use the benchmark winner?”	“Does the benchmark metric match our risk asymmetry?”

This is not a plea for methodological sophistication for its own sake. That would be charming in an academic seminar and expensive in a company. The practical point is that evaluation metrics determine what kind of failure the model is allowed to hide.

A practical evaluation workflow for business teams

A business team does not need to implement every scoring rule in ScoringBench tomorrow morning. It does need to stop pretending that model evaluation starts after training.

A better workflow would look like this.

First, define the costly error. Is the expensive mistake overestimation, underestimation, tail undercoverage, interval width, threshold misclassification, or volatility underestimation? Do not answer “all of them.” That is not a risk policy; that is a meeting trying to end.

Second, choose candidate scoring rules before model selection. If the problem is general distributional accuracy, CRPS may be a sensible baseline. If the target is interval decision-making, interval score and coverage diagnostics should be reported together. If tails matter, weighted CRPS or custom scoring rules may be more appropriate. If density quality near observed outcomes matters, log-score-like metrics may become more important.

Third, report point metrics as secondary diagnostics, not as the whole evaluation. RMSE and MAE still help diagnose average predictive accuracy. They should not silently overrule distributional evidence when the deployment decision depends on uncertainty.

Fourth, separate calibration from usefulness. A model can be calibrated but too diffuse. It can be sharp but miscalibrated. It can cover the target while producing intervals too wide for operational action. The paper’s conformal-method result is a neat reminder that passing one uncertainty test may worsen another.

Fifth, run sensitivity checks. ScoringBench uses two ranking methods and dataset-size ablations. A company can use a lighter version: compare rankings across two or three business-relevant metrics, test stability over time periods or customer segments, and inspect whether the “winner” changes when the loss function changes. If it does, that is not a nuisance. That is information.
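The lighter version of that sensitivity check can be as small as the sketch below. The metric names, model names, and scores are hypothetical, and every metric is assumed to be lower-is-better; the same helper works if the keys are time periods or customer segments instead of metrics.

```python
def winners_by_metric(table):
    """Return the best (lowest-scoring) model under each metric."""
    return {metric: min(scores, key=scores.get) for metric, scores in table.items()}

def winner_is_stable(table):
    """True only if the same model wins under every metric in the table."""
    return len(set(winners_by_metric(table).values())) == 1

# Hypothetical validation results; all metrics are lower-is-better.
table = {
    "rmse": {"model_a": 1.20, "model_b": 1.25, "model_c": 1.40},
    "crps": {"model_a": 0.70, "model_b": 0.66, "model_c": 0.75},
    "interval_score_90": {"model_a": 5.1, "model_b": 4.6, "model_c": 4.4},
}

winners = winners_by_metric(table)
for metric, model in winners.items():
    print(f"{metric:>18}: {model}")
print("same winner everywhere:", winner_is_stable(table))
```

In this invented table each metric crowns a different model. The useful output is not the list of winners but the disagreement itself, which tells the team that the choice of loss has to be settled before the choice of model.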

What the paper directly shows, and what it does not

ScoringBench directly shows that model rankings in tabular distributional regression depend on the chosen scoring rule. It shows that fine-tuned TabPFN and TabICL variants often rank strongly on probabilistic metrics. It shows that conformal coverage gains can come with tradeoffs against other distributional scoring objectives. It also provides an open, extensible benchmark with a git-based leaderboard and a reproducible contribution workflow.

Cognaptus’ business inference is that model evaluation should be designed as a risk-pricing exercise. The metric should encode the cost structure of the decision. Where errors are asymmetric, tail-heavy, or tied to operational thresholds, point metrics should not be the final judge.

The uncertainty boundary is equally important. ScoringBench does not tell a bank, retailer, insurer, or manufacturer which exact scoring rule to use. It does not convert benchmark performance into ROI. It does not prove that a public benchmark winner will dominate on a private production dataset. And it does not remove the need for application-level validation, monitoring, and governance.

In other words, the paper gives teams a better evaluation language. It does not do their risk management homework for them. Tragic, but fair.

The useful question is no longer “Which model wins?”

The older tabular-AI question was simple: can deep learning or foundation models beat tree-based methods on tabular data?

ScoringBench pushes the question into a more useful phase. If a model can produce a predictive distribution, the evaluation should ask whether that distribution supports the decision being made. That means the metric is not an afterthought. It is the formal version of the business preference.

For low-risk point prediction, RMSE may be enough. For risk-sensitive deployment, RMSE can lie by omission. It can hide bad tails, useless intervals, poor calibration, or a model that is accurate on average while mispricing the exact event that matters.

The uncomfortable lesson is that “best model” is not a model property. It is a relationship between model, metric, data, and decision cost.

ScoringBench is valuable because it makes that relationship harder to ignore. And in AI evaluation, as in finance, ignored risk has a habit of becoming very visible later.

Cognaptus: Automate the Present, Incubate the Future.

¹ Jonas Landsgesell, Pascal Knoll, and Tizian Wenzel, “ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules,” arXiv:2603.29928v3, May 4, 2026, https://arxiv.org/abs/2603.29928.
