Opening — Why this matters now
Most AI models today don’t just predict outcomes — they predict uncertainty. And yet, oddly enough, we still judge them as if they don’t.
In finance, healthcare, and infrastructure, the difference between “slightly wrong” and “catastrophically wrong” is rarely symmetric. But the metrics we use — RMSE, $R^2$ — behave as if all errors are created equal. This is not just a technical oversight. It’s a structural blind spot.
The paper *ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules* makes a rather uncomfortable point: modern tabular foundation models already output full probability distributions, but we evaluate them as if they were producing single numbers.
It’s a bit like judging a weather forecast by asking only, “Was it sunny?”
Background — Context and prior art
Tabular machine learning has quietly undergone a shift. Models like TabPFN and TabICL are no longer just regression engines — they are probabilistic systems that output full predictive distributions.
This matters because:
- A point prediction gives you a single best guess
- A distribution tells you how confident that guess should be
Yet existing benchmarks — notably TabArena and TALENT — reduce this richness to point metrics such as RMSE and $R^2$.
From a statistical standpoint, this is almost heretical. Proper scoring rules, formalized and surveyed by Gneiting and Raftery, provide a principled way to evaluate probabilistic forecasts. A strictly proper rule ensures that the best expected score is achieved only when the predicted distribution matches reality.
But here’s the twist: there isn’t just one proper scoring rule. There are many — and each encodes a different philosophy of error.
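To make this concrete, here is a small illustration (my own, not from the paper): the CRPS, one of the most widely used proper scoring rules, has a closed form for Gaussian forecasts. It rewards both getting the location right and reporting honest uncertainty.

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS for a Gaussian forecast N(mu, sigma^2)
    evaluated at observation y. Lower is better."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)  # phi(z)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))           # Phi(z)
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))
```

Note that a sharper forecast centered on the truth, `crps_gaussian(0, 0.5, 0)`, scores better than a vaguer one, `crps_gaussian(0, 1, 0)`: the rule rewards sharpness, but only subject to calibration.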
Analysis — What the paper actually does
The authors introduce ScoringBench, a benchmarking framework that evaluates models using a suite of proper scoring rules instead of a single metric.
Key idea: Metrics are not neutral
Different scoring rules emphasize different parts of the prediction:
| Metric | What it emphasizes | Business implication |
|---|---|---|
| RMSE / $R^2$ | Mean accuracy | Stable environments |
| CRPS | Overall distribution quality | General-purpose forecasting |
| CRLS | Rare events / tail penalties | Risk-sensitive domains |
| Interval Score | Confidence interval reliability | Operational planning |
| Weighted CRPS | Custom emphasis (e.g. tails) | Domain-specific risk |
| Brier Score | Probability calibration | Classification-style risk |
The implication is simple but uncomfortable:
A model that looks “best” under one metric may be mediocre — or dangerous — under another.
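A toy example of this divergence (mine, not the paper's data): two forecasters with identical means have identical RMSE, yet very different CRPS once one of them is badly overconfident.

```python
import math

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS for a Gaussian forecast N(mu, sigma^2)."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return sigma * (z * (2 * cdf - 1) + 2 * pdf - 1 / math.sqrt(math.pi))

ys = [-3.0, -1.0, 0.0, 1.0, 3.0]  # toy observations, roughly spread like N(0, 2)

# Both models predict mean 0, so RMSE cannot distinguish them...
rmse = math.sqrt(sum(y ** 2 for y in ys) / len(ys))

# ...but model A reports honest spread (sigma = 2) while
# model B is wildly overconfident (sigma = 0.1).
crps_a = sum(crps_gaussian(0.0, 2.0, y) for y in ys) / len(ys)
crps_b = sum(crps_gaussian(0.0, 0.1, y) for y in ys) / len(ys)
```

Under RMSE the two models tie; under CRPS the overconfident one is clearly worse.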
Inductive bias through evaluation
One of the more subtle findings is that training objectives change model behavior.
Even though proper scoring rules theoretically share the same optimal solution (the true distribution), in practice:
- Finite data
- Model constraints
- Optimization dynamics
…mean that each scoring rule nudges the model differently.
The paper shows that fine-tuning the same model (TabPFN) with different scoring rules leads to different rankings across datasets.
In other words:
Your evaluation metric is not just measuring performance — it is shaping it.
Findings — What actually changes (with evidence)
The results confirm that rankings are unstable across metrics.
Example: Model rankings shift by metric
| Metric | Top Model | Observation |
|---|---|---|
| CRLS | TabICL | Strong tail awareness |
| CRPS | TabICL | Best overall distribution fit |
| $R^2$ | Fine-tuned TabPFN (CRLS) | Better mean prediction |
This divergence is not noise — it is structural.
Why this happens
Different scoring rules penalize errors asymmetrically:
- Underestimation and overestimation are penalized differently
- Tail events receive varying importance
- Calibration vs. sharpness trade-offs emerge
A useful way to think about it:
| Scenario | Failure mode under a mean-focused metric |
|---|---|
| Financial risk | Underestimates crash probability |
| Healthcare | Underestimates adverse effects |
| Infrastructure | Underestimates stress/load |
In all cases, the model may still score well under RMSE.
Which is… not reassuring.
Implications — What this means for real systems
1. Evaluation is a business decision, not a technical detail
Choosing a metric is equivalent to choosing:
- What errors you tolerate
- What risks you prioritize
- What failures you accept
Most organizations delegate this choice to data scientists. That’s convenient — and usually wrong.
2. One benchmark is not enough
A single leaderboard hides more than it reveals. ScoringBench’s multi-metric approach shows that:
- Model superiority is conditional
- Rankings are context-dependent
- Trade-offs are unavoidable
This challenges the entire notion of “the best model.”
3. Tail risk must be explicitly modeled
For high-stakes domains, the paper suggests:
- Using weighted scoring rules
- Designing custom loss functions
- Aligning training objectives with real-world costs
This is particularly relevant for:
- Liquidity forecasting
- Value-at-Risk estimation
- Energy grid balancing
In each of these, the cost of being wrong is highly asymmetric.
4. Data without “error structure” is incomplete
One of the more quietly devastating observations:
Most datasets don’t specify what kind of error matters.
Which means models are trained in a vacuum — optimizing abstract metrics rather than real-world outcomes.
Conclusion — A quiet but critical shift
ScoringBench doesn’t introduce a new model. It introduces something more disruptive: a new way of judging them.
And that changes everything.
Because once you accept that:
- Metrics encode values
- Values shape models
- Models drive decisions
…you realize that evaluation is not the end of the pipeline.
It’s the beginning.
Cognaptus: Automate the Present, Incubate the Future.