Opening — Why this matters now

Most AI models today don’t just predict outcomes — they predict uncertainty. And yet, oddly enough, we still judge them as if they don’t.

In finance, healthcare, and infrastructure, the difference between “slightly wrong” and “catastrophically wrong” is rarely symmetric. But the metrics we use — RMSE, $R^2$ — behave as if all errors are created equal. This is not just a technical oversight. It’s a structural blind spot.

The paper "ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules" makes a rather uncomfortable point: modern tabular foundation models already output full probability distributions, but we evaluate them as if they were producing single numbers.

It’s a bit like judging a weather forecast by asking only, “Was it sunny?”

Background — Context and prior art

Tabular machine learning has quietly undergone a shift. Models like TabPFN and TabICL are no longer just regression engines — they are probabilistic systems that output full predictive distributions.

This matters because:

  • A point prediction gives you what might happen
  • A distribution tells you how confident you should be

Yet existing benchmarks — notably TabArena and TALENT — reduce this richness to point metrics such as RMSE and $R^2$.

From a statistical standpoint, this is almost heretical. Proper scoring rules — introduced by Gneiting and Raftery — provide a principled way to evaluate probabilistic forecasts. They ensure that the best score is achieved only when the predicted distribution matches reality.

But here’s the twist: there isn’t just one proper scoring rule. There are many — and each encodes a different philosophy of error.
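The defining property of a proper scoring rule can be checked numerically. A minimal sketch (my own illustration, not code from the paper): for the Brier score, the expected score under the true event probability is minimized exactly when the forecast is honest.

```python
import numpy as np

# Brier score propriety, sketched numerically (illustration, not the
# paper's code). For a binary outcome Y ~ Bernoulli(p_true), the expected
# Brier score E[(q - Y)^2] of a forecast q is minimized only at q = p_true.
def expected_brier(q, p_true):
    return p_true * (1 - q) ** 2 + (1 - p_true) * q ** 2

p_true = 0.3
grid = np.linspace(0.0, 1.0, 101)
best_q = grid[np.argmin(expected_brier(grid, p_true))]
print(best_q)  # close to p_true = 0.3: the honest forecast wins
```

No amount of hedging or exaggeration improves the expected score, which is exactly what "proper" means.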

Analysis — What the paper actually does

The authors introduce ScoringBench, a benchmarking framework that evaluates models using a suite of proper scoring rules instead of a single metric.

Key idea: Metrics are not neutral

Different scoring rules emphasize different parts of the prediction:

| Metric | What it emphasizes | Business implication |
| --- | --- | --- |
| RMSE / $R^2$ | Mean accuracy | Stable environments |
| CRPS | Overall distribution quality | General-purpose forecasting |
| CRLS | Rare events / tail penalties | Risk-sensitive domains |
| Interval Score | Confidence interval reliability | Operational planning |
| Weighted CRPS | Custom emphasis (e.g. tails) | Domain-specific risk |
| Brier Score | Probability calibration | Classification-style risk |

The implication is simple but uncomfortable:

A model that looks “best” under one metric may be mediocre — or dangerous — under another.
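Two of the rules in the table above can be estimated directly from predictive samples. A minimal sketch, assuming the model emits an ensemble of draws from its predictive distribution (illustration only, not ScoringBench's implementation):

```python
import numpy as np

# Sample-based estimators for two proper scoring rules (sketch).
def crps(samples, y):
    # CRPS in its energy form: E|X - y| - 0.5 * E|X - X'|
    s = np.asarray(samples, dtype=float)
    return np.mean(np.abs(s - y)) - 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))

def interval_score(lower, upper, y, alpha=0.1):
    # Interval width plus asymmetric charges for landing outside
    # the central (1 - alpha) prediction interval.
    return (upper - lower) \
        + (2 / alpha) * max(lower - y, 0.0) \
        + (2 / alpha) * max(y - upper, 0.0)

rng = np.random.default_rng(0)
draws = rng.normal(size=2000)            # predictive ensemble for one row
print(crps(draws, 0.0))                  # small: forecast matches outcome
print(interval_score(-1.64, 1.64, 2.5))  # large: 90% interval badly missed
```

The same forecast can look fine under one of these numbers and terrible under the other, which is the instability the paper documents.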

Inductive bias through evaluation

One of the more subtle findings is that training objectives change model behavior.

Even though proper scoring rules theoretically share the same optimal solution (the true distribution), in practice:

  • Finite data
  • Model constraints
  • Optimization dynamics

…mean that each scoring rule nudges the model differently.

The paper shows that fine-tuning the same model (TabPFN) with different scoring rules leads to different rankings across datasets.

In other words:

Your evaluation metric is not just measuring performance — it is shaping it.
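To make that concrete: for a model with a Gaussian output head, CRPS has a closed form (due to Gneiting and Raftery) and can be minimized directly as a training loss. A hedged sketch of the standard formula, not the paper's training code; swapping this objective for squared error would steer the same architecture toward different behavior, which is the effect the paper measures on TabPFN.

```python
import math

# Closed-form CRPS for a Gaussian predictive distribution N(mu, sigma),
# usable directly as a training objective for a probabilistic output head.
def crps_gaussian(mu, sigma, y):
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

# A sharp, honest forecast beats a diffuse one at the same mean:
print(crps_gaussian(0.0, 0.5, 0.1))  # sharp
print(crps_gaussian(0.0, 2.0, 0.1))  # diffuse: higher (worse) score
```

Unlike squared error, this loss rewards sharpness as well as accuracy, so a model trained on it learns to narrow its uncertainty where it can.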

Findings — What actually changes (with evidence)

The results confirm that rankings are unstable across metrics.

Example: Model rankings shift by metric

| Metric | Top Model | Observation |
| --- | --- | --- |
| CRLS | TabICL | Strong tail awareness |
| CRPS | TabICL | Best overall distribution fit |
| $R^2$ | Fine-tuned TabPFN (CRLS) | Better mean prediction |

This divergence is not noise — it is structural.

Why this happens

Different scoring rules penalize errors asymmetrically:

  • Underestimation and overestimation are penalized differently
  • Tail events receive varying importance
  • Calibration vs sharpness trade-offs emerge
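The quantile (pinball) loss, a building block behind interval scores, shows this asymmetry in its simplest form. An illustrative sketch with made-up numbers, not figures from the paper:

```python
# Pinball (quantile) loss at level tau: underestimating a tau = 0.9
# quantile costs 0.9 per unit of error, overestimating only 0.1.
def pinball(pred, y, tau=0.9):
    err = y - pred
    return tau * err if err >= 0 else (tau - 1.0) * err

print(pinball(1.0, 2.0))  # underestimate by 1 -> cost 0.9
print(pinball(3.0, 2.0))  # overestimate by 1 -> cost ~0.1
```

A model evaluated this way learns that missing low is nine times worse than missing high; a model evaluated on RMSE learns no such thing.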

A useful way to think about it:

| Scenario | Wrong-metric outcome |
| --- | --- |
| Financial risk | Underestimates crash probability |
| Healthcare | Underestimates adverse effects |
| Infrastructure | Underestimates stress/load |

In all cases, the model may still score well under RMSE.

Which is… not reassuring.

Implications — What this means for real systems

1. Evaluation is a business decision, not a technical detail

Choosing a metric is equivalent to choosing:

  • What errors you tolerate
  • What risks you prioritize
  • What failures you accept

Most organizations delegate this choice to data scientists. That’s convenient — and usually wrong.

2. One benchmark is not enough

A single leaderboard hides more than it reveals. ScoringBench’s multi-metric approach shows that:

  • Model superiority is conditional
  • Rankings are context-dependent
  • Trade-offs are unavoidable

This challenges the entire notion of “the best model.”

3. Tail risk must be explicitly modeled

For high-stakes domains, the paper suggests:

  • Using weighted scoring rules
  • Designing custom loss functions
  • Aligning training objectives with real-world costs

This is particularly relevant for:

  • Liquidity forecasting
  • Value-at-Risk estimation
  • Energy grid balancing

In all three, the cost of being wrong is highly asymmetric.
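One way to build such a rule, sketched below, is a threshold-weighted CRPS: integrate the squared CDF error over thresholds, upweighting the region that matters. The weight function and grid here are illustrative assumptions, not choices from the paper.

```python
import numpy as np

# Threshold-weighted CRPS (sketch): integrate (F(t) - 1{y <= t})^2 over a
# grid of thresholds t, with extra weight on the upper tail.
def weighted_crps(samples, y, thresholds, weight):
    s = np.asarray(samples, dtype=float)
    F = (s[:, None] <= thresholds[None, :]).mean(axis=0)  # empirical CDF
    indicator = (y <= thresholds).astype(float)
    dt = thresholds[1] - thresholds[0]
    return float(np.sum(weight(thresholds) * (F - indicator) ** 2) * dt)

thresholds = np.linspace(-6.0, 12.0, 721)
tail_weight = lambda t: np.where(t > 2.0, 5.0, 1.0)  # stress the upper tail

rng = np.random.default_rng(1)
draws = rng.normal(size=5000)              # forecast: standard normal
score_tail_event = weighted_crps(draws, 4.0, thresholds, tail_weight)
score_typical = weighted_crps(draws, 0.0, thresholds, tail_weight)
print(score_tail_event, score_typical)  # missing the tail costs far more
```

With the tail upweighted, a forecast that never anticipates extreme outcomes is punished heavily even if its central predictions are excellent.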

4. Data without “error structure” is incomplete

One of the more quietly devastating observations:

Most datasets don’t specify what kind of error matters.

Which means models are trained in a vacuum — optimizing abstract metrics rather than real-world outcomes.

Conclusion — A quiet but critical shift

ScoringBench doesn’t introduce a new model. It introduces something more disruptive: a new way of judging them.

And that changes everything.

Because once you accept that:

  • Metrics encode values
  • Values shape models
  • Models drive decisions

…you realize that evaluation is not the end of the pipeline.

It’s the beginning.

Cognaptus: Automate the Present, Incubate the Future.