Opening — Why this matters now
Most AI models today don’t just predict outcomes — they predict uncertainty. And yet, oddly enough, we still judge them as if they don’t.
In finance, healthcare, and infrastructure, the difference between “slightly wrong” and “catastrophically wrong” is rarely symmetric. But the metrics we use — RMSE, $R^2$ — behave as if all errors are created equal. This is not just a technical oversight. It’s a structural blind spot.
The paper *ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules* makes a rather uncomfortable point: modern tabular foundation models already output full probability distributions, but we evaluate them as if they were producing single numbers.
It’s a bit like judging a weather forecast by asking only, “Was it sunny?”
Background — Context and prior art
Tabular machine learning has quietly undergone a shift. Models like TabPFN and TabICL are no longer just regression engines — they are probabilistic systems that output full predictive distributions.
This matters because:
- A point prediction gives you a single best guess
- A distribution tells you how confident that guess should be
Yet existing benchmarks — notably TabArena and TALENT — reduce this richness to point metrics such as RMSE and $R^2$.
From a statistical standpoint, this is almost heretical. Proper scoring rules, formalized and surveyed by Gneiting and Raftery, provide a principled way to evaluate probabilistic forecasts. A strictly proper rule ensures that the best expected score is achieved only when the predicted distribution matches reality.
But here’s the twist: there isn’t just one proper scoring rule. There are many — and each encodes a different philosophy of error.
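To make this concrete, here is a small illustration (my own, not from the paper): the CRPS, one of the most widely used proper scoring rules, has a closed form for Gaussian forecasts. It rewards both getting the location right and reporting honest uncertainty.

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS for a Gaussian forecast N(mu, sigma^2)
    evaluated at observation y. Lower is better."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)  # phi(z)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))           # Phi(z)
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))
```

Note that a sharper forecast centered on the truth, `crps_gaussian(0, 0.5, 0)`, scores better than a vaguer one, `crps_gaussian(0, 1, 0)`: the rule rewards sharpness, but only subject to calibration.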
Analysis — What the paper actually does
The authors introduce ScoringBench, a benchmarking framework that evaluates models using a suite of proper scoring rules instead of a single metric.
Key idea: Metrics are not neutral
Different scoring rules emphasize different parts of the prediction:
| Metric | What it emphasizes | Business implication |
|---|---|---|
| RMSE / $R^2$ | Mean accuracy | Stable environments |
| CRPS | Overall distribution quality | General-purpose forecasting |
| CRLS | Rare events / tail penalties | Risk-sensitive domains |
| Interval Score | Confidence interval reliability | Operational planning |
| Weighted CRPS | Custom emphasis (e.g. tails) | Domain-specific risk |
| Brier Score | Probability calibration | Classification-style risk |
The implication is simple but uncomfortable:
A model that looks “best” under one metric may be mediocre — or dangerous — under another.
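A toy example of this divergence (mine, not the paper's data): two forecasters with identical means have identical RMSE, yet very different CRPS once one of them is badly overconfident.

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS for a Gaussian forecast N(mu, sigma^2)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

ys = [-3.0, -1.0, 0.0, 1.0, 3.0]  # toy observations, roughly spread like N(0, 2)

# Both models predict mean 0, so RMSE cannot distinguish them...
rmse = math.sqrt(sum(y ** 2 for y in ys) / len(ys))

# ...but model A reports honest spread (sigma = 2) while
# model B is wildly overconfident (sigma = 0.1).
crps_a = sum(crps_gaussian(0.0, 2.0, y) for y in ys) / len(ys)
crps_b = sum(crps_gaussian(0.0, 0.1, y) for y in ys) / len(ys)
```

Under RMSE the two models tie; under CRPS the overconfident one is clearly worse.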
Inductive bias through evaluation
One of the more subtle findings is that training objectives change model behavior.
Even though proper scoring rules theoretically share the same optimal solution (the true distribution), in practice:
- Finite data
- Model constraints
- Optimization dynamics
…mean that each scoring rule nudges the model differently.
The paper shows that fine-tuning the same model (TabPFN) with different scoring rules leads to different rankings across datasets.
In other words:
Your evaluation metric is not just measuring performance — it is shaping it.
Findings — What actually changes (with evidence)
The results confirm that rankings are unstable across metrics.
Example: Model rankings shift by metric
| Metric | Top Model | Observation |
|---|---|---|
| CRLS | TabICL | Strong tail awareness |
| CRPS | TabICL | Best overall distribution fit |
| $R^2$ | Fine-tuned TabPFN (CRLS) | Better mean prediction |
This divergence is not noise — it is structural.
Why this happens
Different scoring rules penalize errors asymmetrically:
- Underestimation and overestimation are penalized differently
- Tail events receive varying importance
- Calibration vs. sharpness trade-offs emerge
A useful way to think about it:
| Scenario | Failure mode under a mean-focused metric |
|---|---|
| Financial risk | Underestimates crash probability |
| Healthcare | Underestimates adverse effects |
| Infrastructure | Underestimates stress/load |
In all cases, the model may still score well under RMSE.
Which is… not reassuring.
Implications — What this means for real systems
1. Evaluation is a business decision, not a technical detail
Choosing a metric is equivalent to choosing:
- What errors you tolerate
- What risks you prioritize
- What failures you accept
Most organizations delegate this choice to data scientists. That’s convenient — and usually wrong.
2. One benchmark is not enough
A single leaderboard hides more than it reveals. ScoringBench’s multi-metric approach shows that:
- Model superiority is conditional
- Rankings are context-dependent
- Trade-offs are unavoidable
This challenges the entire notion of “the best model.”
3. Tail risk must be explicitly modeled
For high-stakes domains, the paper suggests:
- Using weighted scoring rules
- Designing custom loss functions
- Aligning training objectives with real-world costs
This is particularly relevant for:
- Liquidity forecasting
- Value-at-Risk estimation
- Energy grid balancing
In each of these, the cost of being wrong is highly asymmetric.
4. Data without “error structure” is incomplete
One of the more quietly devastating observations:
Most datasets don’t specify what kind of error matters.
Which means models are trained in a vacuum — optimizing abstract metrics rather than real-world outcomes.
Conclusion — A quiet but critical shift
ScoringBench doesn’t introduce a new model. It introduces something more disruptive: a new way of judging them.
And that changes everything.
Because once you accept that:
- Metrics encode values
- Values shape models
- Models drive decisions
…you realize that evaluation is not the end of the pipeline.
It’s the beginning.
Cognaptus: Automate the Present, Incubate the Future.