Mind the Flux: Why Average Accuracy Fails Where the Towers Aren’t

TL;DR for operators

Models are often sold as if accuracy were a passport: one clean number, stamped at the border, cleared for deployment. FLUXtrapolation is a useful reminder that the border is usually where the problem begins.

The paper introduces a benchmark for predicting hourly ecosystem fluxes — carbon, water, and energy exchanges between ecosystems and the atmosphere — when direct measurements exist only at sparse flux-tower sites.¹ The mechanism is simple and unpleasant: train models where towers exist, then test them in progressively less comfortable situations where the future, the geography, or the temperature regime has shifted.

The operational lesson is not “use climate AI”. That would be too easy, and therefore suspicious. The lesson is that extrapolation risk must be engineered into the test itself. A model that looks competent under median hourly error can behave quite differently on high-error sites, warmer sites, interannual variability, or coarser temporal summaries. That matters for carbon analytics, climate-risk products, environmental monitoring, insurance exposure models, and any business system trained on convenient observations but deployed into inconvenient reality.

The paper’s pilot study does not crown a universal winner. XGBoost, MLP, and CORAL are strong baselines for evapotranspiration in the benchmark; standard domain-generalisation methods do not reliably turn explicit shift information into better performance. The sharper result is about evaluation design. Median hourly RMSE is a polite metric. FLUXtrapolation asks the impolite questions.

Towers are sparse; models are not

A flux tower is a very expensive way of learning what a landscape is doing. It measures exchanges between an ecosystem and the atmosphere: net ecosystem exchange of carbon dioxide, gross primary productivity, evapotranspiration, and related quantities. These signals matter because they sit directly inside the carbon, water, and energy cycles. They also matter because climate-policy monitoring and carbon-cycle science increasingly depend on products that estimate these quantities across space.

There is one obvious problem. The towers are not everywhere.

They are concentrated where scientists, funding, infrastructure, and data-sharing arrangements have made them possible. The paper uses 207 FLUXNET sites from 2015 to 2022, with coverage concentrated in North America and Europe. That is a serious scientific resource, not a global measurement blanket. So the practical task becomes upscaling: use tower measurements plus globally available covariates — meteorology, satellite-derived indices, site characteristics — to predict fluxes at places without towers.

For machine learning people, this is a domain-generalisation problem. For operators, it is the more familiar problem of making decisions in places where the data were not born.

That distinction matters. A benchmark built from random splits can give a model credit for interpolation dressed up as deployment. FLUXtrapolation instead asks a harder question: what happens when the model must operate beyond the comfortable footprint of its observations?

FLUXtrapolation makes deployment failure observable

The paper’s main contribution is not merely a dataset. It is a benchmark design that turns a vague complaint — “the test distribution may differ” — into specific, inspectable failure modes.

The benchmark uses hourly FLUXNET data and treats three target variables as separate prediction tasks:

Flux	What it represents	Why it behaves differently
ET	Evapotranspiration: water loss through plant transpiration and evaporation	Strongly influenced by observed drivers such as radiation and vapour pressure deficit
GPP	Gross primary productivity: carbon uptake through photosynthesis	Often more directly connected to observed vegetation and meteorological signals
NEE	Net ecosystem exchange: net carbon gain or loss	More affected by biological and ecosystem processes that may be hidden from globally available covariates

This last column is not decorative. It explains why a benchmark cannot assume one flux behaves like another. The paper’s pilot results for gross primary productivity are broadly similar to those for evapotranspiration, while net ecosystem exchange shows less separation between baselines. That is not an embarrassing inconsistency. It is the point. Some targets are harder because the relevant causes are less visible.

FLUXtrapolation defines domains at either the site level or the site-year level, depending on the scenario. Models receive globally available inputs and must predict fluxes for held-out sites or held-out years. The benchmark then evaluates not only average accuracy, but accuracy across domains, temporal scales, and error tails.

A conventional benchmark asks, “Which model has the lowest error?”

FLUXtrapolation asks, “Lowest error where, under which kind of shift, at which timescale, and for whose worst sites?”

Slightly less convenient. Much more useful.

The benchmark has three doors, not one leaderboard

The mechanism-first view is the cleanest way to read this paper. FLUXtrapolation is built around three extrapolation scenarios, each designed to represent a different kind of deployment discomfort.

Scenario	Train/test logic	Likely purpose	What it stresses
Temporal extrapolation	Train on earlier years, test on later years at already observed sites	Main benchmark scenario	Future-year prediction when site identity is familiar
Random spatial extrapolation	Hold out roughly 20% of sites randomly	Main benchmark scenario	Prediction at unseen sites with moderate covariate overlap
Temperature spatial extrapolation	Hold out the 40 warmest sites	Main benchmark scenario	Prediction at unseen, warmer sites where covariate shift is deliberately amplified

Temporal extrapolation is the mildest in one sense. The model has already seen the sites, so the geography is familiar. But future years are not duplicates of past years, and ecosystem time series have strong temporal dependence. Treating hourly observations as if they were independent spreadsheet rows would be a charming little act of statistical vandalism.

Random spatial extrapolation is harder. The model must predict at sites it has never seen. But because the sites are randomly selected, the held-out set remains broadly similar to the training set.

Temperature spatial extrapolation is the most pointed design choice. It holds out the warmest sites, combining unseen geography with a shift toward warmer conditions. This is not a perfect simulation of global warming. It is a controlled stress test aimed at a scientifically plausible failure mode: models trained on the existing tower network may be weakest where the target regime is underrepresented.
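The three doors can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual split code: the function name `make_splits`, the cut-off year `last_train_year`, the choice to rank sites by a single `mean_temp` value, and the synthetic site table are all assumptions made for the example. Only the 207-site count, the 2015 to 2022 range, the roughly 20% random hold-out, and the 40 warmest sites come from the description above.

```python
import random

def make_splits(sites, test_fraction=0.2, n_warmest=40, last_train_year=2019, seed=0):
    """Build the three hold-out designs from a site table.

    sites: dict of site_id -> {"mean_temp": float, "years": list of int}.
    last_train_year and ranking by mean temperature are illustrative assumptions.
    """
    site_ids = sorted(sites)

    # Temporal: every site stays in training, only later years are held out.
    temporal = {"train": [], "test": []}
    for s in site_ids:
        for y in sites[s]["years"]:
            temporal["train" if y <= last_train_year else "test"].append((s, y))

    # Random spatial: hold out whole sites, chosen at random.
    shuffled = site_ids[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    random_spatial = {"train": sorted(shuffled[n_test:]), "test": sorted(shuffled[:n_test])}

    # Temperature spatial: hold out whole sites, chosen as the warmest ones.
    by_temp = sorted(site_ids, key=lambda s: sites[s]["mean_temp"])
    cut = len(by_temp) - n_warmest
    temperature = {"train": sorted(by_temp[:cut]), "test": sorted(by_temp[cut:])}
    return temporal, random_spatial, temperature


# Synthetic stand-in for the site table: 207 sites, 2015 to 2022, made-up temperatures.
_rng = random.Random(1)
demo_sites = {
    f"site_{i:03d}": {"mean_temp": _rng.uniform(-5.0, 28.0), "years": list(range(2015, 2023))}
    for i in range(207)
}
temporal, random_spatial, temperature = make_splits(demo_sites)
print(len(random_spatial["test"]), len(temperature["test"]), len(temporal["test"]))
```

The design point is visible in the code: the temporal split never removes a site, while both spatial splits remove whole sites and differ only in how the held-out set is chosen.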

The important move is that these scenarios are not just harder in a single scalar sense. They produce different kinds of distribution shift. That is where the paper becomes more interesting than the usual “new benchmark, please leaderboard responsibly” exercise.

Shift diagnosis prevents lazy explanations

The paper distinguishes two forms of shift.

The first is covariate shift: the input distribution changes. The model sees one distribution of temperatures, vegetation indices, water indices, radiation, or site characteristics during training, then faces another during testing.

The second is conditional shift: the relationship between inputs and target changes. This can happen when important drivers are not available as model inputs. Soil properties, understory structure, ecosystem history, instrumentation differences, and biological processes may alter the input–flux relationship in ways the model cannot fully observe.

This distinction is not academic fussiness. It changes what failure means.

If a model fails mainly under covariate shift, the issue may be that the target domain lies outside the input support learned during training. More representative data, better reweighting, domain adaptation, or uncertainty-aware rejection may help.

If a model fails under conditional shift, the deeper problem is that the same observed inputs no longer imply the same target behaviour. Adding another algorithmic wrapper may not repair missing causal information. It may merely make the wrong answer look more modern.

FLUXtrapolation estimates covariate shift using a domain classifier: if a classifier can distinguish training inputs from test inputs with high balanced accuracy, the two input distributions are separable. It estimates conditional shift by comparing held-out training RMSE with importance-weighted test RMSE after adjusting for differences in the input distribution on regions of common support.
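The domain-classifier idea is easy to try on toy data. The sketch below is an assumption-laden stand-in: it uses a nearest-centroid classifier because it fits in a few lines of standard-library Python, not because the paper uses one, and the two-feature Gaussian inputs are synthetic. What it does share with the description above is the logic: label inputs as "train" or "test", fit a classifier, and read held-out balanced accuracy as a separability score.

```python
import random

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def domain_balanced_accuracy(train_x, test_x, seed=0):
    """Balanced accuracy of a nearest-centroid 'train vs test' classifier.

    Half of each set fits the centroids, the other half scores them, so the
    number reflects held-out separability. Roughly 0.5 means indistinguishable.
    """
    rng = random.Random(seed)
    tr, te = list(train_x), list(test_x)
    rng.shuffle(tr)
    rng.shuffle(te)
    tr_fit, tr_eval = tr[: len(tr) // 2], tr[len(tr) // 2 :]
    te_fit, te_eval = te[: len(te) // 2], te[len(te) // 2 :]
    c_train, c_test = centroid(tr_fit), centroid(te_fit)

    # Recall per class, then the unweighted mean: that is balanced accuracy.
    recall_train = sum(sq_dist(x, c_train) <= sq_dist(x, c_test) for x in tr_eval) / len(tr_eval)
    recall_test = sum(sq_dist(x, c_test) < sq_dist(x, c_train) for x in te_eval) / len(te_eval)
    return 0.5 * (recall_train + recall_test)

def sample(n, temp_shift, rng):
    """Synthetic inputs: (temperature-like, radiation-like), unit variance."""
    return [[rng.gauss(temp_shift, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]

rng = random.Random(42)
train_inputs = sample(1000, 0.0, rng)
same_regime = sample(1000, 0.0, rng)   # no covariate shift
warm_regime = sample(1000, 3.0, rng)   # temperature-like feature shifted by 3 sd

acc_same = domain_balanced_accuracy(train_inputs, same_regime)
acc_warm = domain_balanced_accuracy(train_inputs, warm_regime)
print(f"same regime: {acc_same:.2f}, warm regime: {acc_warm:.2f}")
```

A score near 0.5 says the classifier cannot tell training inputs from test inputs; a score near 1.0 says the test set sits in a visibly different part of input space. Note what this does not measure: it says nothing about whether the input-to-flux relationship has changed, which is the conditional-shift question.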

The paper’s crucial observation is that the two shift types do not line up neatly. Covariate shift increases from temporal to random spatial to temperature-based extrapolation. Conditional shift is present in all scenarios but does not follow the same clean ordering. For evapotranspiration, the conditional-shift diagnostic is pronounced for both temporal and temperature-based extrapolation, but smaller for random spatial extrapolation.

This breaks a lazy interpretation: “the warmer-site test is simply harder, therefore all error comes from bigger input shift.” No. Some failures come from where the model is asked to predict. Others come from whether the available inputs still describe the same physical relationship.

Business translation: do not diagnose all deployment error as “more data drift”. Sometimes the dashboard is missing the variable that actually matters. Always inconvenient when reality declines to be tabular.

Evaluation moves from average accuracy to operational risk

The second major mechanism is evaluation design. FLUXtrapolation evaluates error at the domain level — sites or site-years — rather than only pooling observations into one aggregate score. It then summarises the distribution of domain errors using both the median and the 90th percentile.

That is the difference between asking how the typical site performs and asking how the difficult sites behave. For environmental products, the difficult sites are often the point. Extreme ecosystems, underrepresented regions, warmer conditions, wet and dry edges, and poorly covered geographies are exactly where decision-makers may care most about robust estimates.
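A small sketch shows why the two summaries can disagree. Everything here is hypothetical: the two "models", the ten sites, and their errors are invented to make the arithmetic visible, and the linear-interpolation percentile is one common convention rather than a claim about the paper's implementation.

```python
import math
from collections import defaultdict

def domain_rmse(records):
    """records: iterable of (domain_id, y_true, y_pred) -> {domain_id: RMSE}."""
    squared = defaultdict(list)
    for domain, y_true, y_pred in records:
        squared[domain].append((y_true - y_pred) ** 2)
    return {d: math.sqrt(sum(v) / len(v)) for d, v in squared.items()}

def percentile(values, q):
    """Linear-interpolation percentile, q in [0, 100]."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def summarise(records):
    """Median and 90th percentile of the per-domain RMSE distribution."""
    per_domain = list(domain_rmse(records).values())
    return {"median": percentile(per_domain, 50), "p90": percentile(per_domain, 90)}

def fake_records(site_errors):
    """Two observations per site with residuals +e and -e, so site RMSE equals e."""
    out = []
    for site, e in enumerate(site_errors):
        out += [(site, 0.0, e), (site, 0.0, -e)]
    return out

# Hypothetical model A: uniformly mediocre. Hypothetical model B: better on
# eight sites, much worse on two hard ones.
summary_a = summarise(fake_records([1.0] * 10))
summary_b = summarise(fake_records([0.9] * 8 + [5.0] * 2))
print("A:", summary_a)
print("B:", summary_b)
```

Model B wins on the median and loses badly at the 90th percentile. Pooling every observation into one score would blur exactly that distinction.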

The benchmark also evaluates predictions across temporal aggregations:

Evaluation scale	What it checks	Operational interpretation
Hourly	Native prediction accuracy	Does the model track fine-grained flux behaviour?
Weekly	Short-term aggregated behaviour	Does noise cancel into useful summaries?
Seasonal / mean seasonal cycle	Recurring within-year pattern	Does the model capture typical seasonal structure?
Anomalies	Deviations from seasonal baseline	Does the model detect unusual behaviour beyond the expected cycle?
Interannual variability	Year-to-year fluctuations	Does the model capture longer-term dynamics?
Site mean	Persistent differences across sites	Does the model capture stable spatial differences?

This is not metric inflation for its own sake. Scientific and operational uses rarely consume raw hourly predictions in isolation. A carbon-monitoring product may care about annual totals. A drought-risk model may care about anomalies. A climate-risk workflow may care about whether the model behaves sensibly in warm or underrepresented regions.
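To make the scales concrete, here is a rough sketch of how one hourly series could be rolled up and scored at several of them. The definitions are simplified assumptions (168-hour weeks, a seasonal cycle taken as the across-year mean of each week, anomalies as weekly values minus that cycle, interannual variability as yearly means minus the site mean); the paper's own aggregation definitions live in its Appendix C and may differ. The synthetic "truth" and "prediction" are invented for the demonstration.

```python
import math
from collections import defaultdict
from statistics import fmean

HOURS_PER_WEEK = 168

def aggregate(hourly):
    """hourly: iterable of (year, hour_of_year, value) for a single site."""
    buckets = defaultdict(list)
    for year, hour, value in hourly:
        buckets[(year, hour // HOURS_PER_WEEK)].append(value)
    weekly = {key: fmean(vals) for key, vals in buckets.items()}

    by_week, by_year = defaultdict(list), defaultdict(list)
    for (year, week), value in weekly.items():
        by_week[week].append(value)
        by_year[year].append(value)

    seasonal = {w: fmean(v) for w, v in by_week.items()}         # mean seasonal cycle
    anomalies = {(y, w): v - seasonal[w] for (y, w), v in weekly.items()}
    site_mean = fmean(weekly.values())
    iav = {y: fmean(v) - site_mean for y, v in by_year.items()}  # year-to-year deviations
    return {"weekly": weekly, "seasonal": seasonal, "anomalies": anomalies,
            "iav": iav, "site_mean": {"site": site_mean}}

def rmse(a, b):
    """RMSE between two dicts that share the same keys."""
    return math.sqrt(fmean((a[k] - b[k]) ** 2 for k in a))

# Synthetic site: a clean seasonal cycle plus a per-year offset. The
# hypothetical model reproduces the cycle perfectly and ignores the offset.
HOURS = 52 * HOURS_PER_WEEK
year_offset = {2015: -0.5, 2016: 0.5}

def cycle(hour):
    return math.sin(2 * math.pi * hour / HOURS)

truth = [(y, h, cycle(h) + year_offset[y]) for y in year_offset for h in range(HOURS)]
pred = [(y, h, cycle(h)) for y in year_offset for h in range(HOURS)]

agg_true, agg_pred = aggregate(truth), aggregate(pred)
scores = {scale: rmse(agg_true[scale], agg_pred[scale]) for scale in agg_true}
for scale, value in scores.items():
    print(f"{scale:>10}: {value:.3f}")
```

The toy model is flawless on the seasonal cycle and the site mean, and entirely blind to anomalies and year-to-year variation. A single hourly score would average those two facts into something uninformative.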

A model can win the polite contest and lose the useful one.

Pilot evidence: the comfortable leaderboard breaks in the tails

The paper’s pilot study evaluates constant prediction, linear regression, XGBoost, MLP, and three domain-generalisation baselines: CORAL, MMD regularisation, and Group DRO. The pilot focuses its main-text discussion on evapotranspiration, with ET values scaled by 100 for readability.

Under a simple evaluation (median hourly RMSE for temporal and random spatial extrapolation), the strong baselines are close. In the appendix table for this simple setup, temporal ET median hourly RMSE is 3.9 for MLP and Group DRO, and 4.0 for XGBoost and CORAL. Under random spatial extrapolation, XGBoost and Group DRO are at 4.7, MLP at 4.9, and CORAL at 5.0. That is a narrow contest. Procurement departments love narrow contests because they can then choose the model with the nicest slide deck.

FLUXtrapolation makes that harder.

When the benchmark adds the temperature-based scenario, the median hourly RMSE for the best baseline increases from 3.9 in temporal extrapolation to 4.7 in random spatial extrapolation and 5.6 in temperature-based spatial extrapolation. The order of difficulty becomes visible.

When the benchmark moves to the 90th percentile of domain-level RMSE, separation becomes clearer. In the ET 90th-percentile summary score, CORAL and MLP score 0.17 relative to linear regression, XGBoost scores 0.16, Group DRO falls to 0.02, MMD drops to -0.25, and the constant baseline sits at -0.37. This is not a microscopic stylistic preference. It changes what counts as robust.

A particularly useful example comes from the temperature-based spatial test at the hourly 90th percentile. XGBoost records 8.4; CORAL and MLP are at 9.0; linear regression and Group DRO are at 10; MMD reaches 14; the constant predictor reaches 16. Under the benchmark’s tail view, “modern method” is no longer a magic phrase. It is just a method, standing there under fluorescent evaluation lighting, hoping nobody asks about the warmest sites.

The paper also finds failure modes that are not solved by the strong baselines. For interannual variability (IAV) under random spatial extrapolation, all baselines perform similarly to the constant predictor. For anomalies and IAV under temporal extrapolation, baselines also approach constant prediction or linear regression, with the caveat that the temporal IAV comparison has limited statistical power because there are only a few test years.

That is a valuable result because it identifies where the benchmark is not merely ranking existing methods. It is exposing unsolved structure.

Domain-generalisation methods do not get extra credit for the label

A tempting misconception is that once a benchmark labels shift explicitly, specialised domain-generalisation methods should automatically outperform ordinary empirical risk minimisation. The paper does not support that comforting story.

The pilot result is more severe and more useful: explicit shift information is diagnostically valuable, but current baseline domain-generalisation methods do not reliably exploit it.

For temporal extrapolation, where covariate shift is minimal and conditional shift is moderate, the domain-generalisation baselines perform similarly to ERM baselines. Group DRO is designed to improve worst-group robustness on training domains, but that does not translate into better test-tail RMSE.

For random spatial extrapolation, which is more dominated by covariate shift, MMD performs poorly in the pilot. CORAL is the domain-generalisation method that consistently stays competitive with XGBoost and MLP.

The right inference is not “domain generalisation is useless”. That would be the cheap take, and cheap takes have high gross margins but low nutritional value. The better inference is that shift awareness must be connected to the physical and statistical structure of the task. Knowing that a deployment setting has covariate shift, conditional shift, temporal dependence, and site heterogeneity does not mean a generic regulariser will know what to do with them.

For businesses, this matters because “we use a domain adaptation method” is not an assurance. It is a line item. The assurance comes from demonstrating performance under the deployment-shaped test.

The appendix is mostly test hygiene, not a second thesis

The paper’s appendix is not a bag of unrelated extras. Its components mostly clarify robustness, implementation, and evaluation mechanics.

Paper component	Likely purpose	What it supports	What it does not prove
Figure 2 shift diagnostics	Main evidence	The three extrapolation scenarios induce different covariate and conditional shifts	That the diagnostic fully captures every deployment risk
Table 1 ET 90th-percentile summary	Main evidence	Tail and multi-scale evaluation separate baselines more clearly	That CORAL, MLP, or XGBoost will dominate future methods
Figure 3 cumulative weekly error distributions	Main evidence / diagnostic visualisation	Baselines that look similar in median can diverge at higher quantiles	That one quantile is sufficient for all decisions
Table 3 alternative conditional-shift diagnostic	Robustness / sensitivity test	Conditional-shift patterns are broadly preserved under a shared-support reference marginal	That conditional-shift estimates are invariant to all weighting choices
Appendix C aggregation definitions	Implementation detail	Hourly, weekly, seasonal, anomaly, IAV, and site-mean metrics are constructed systematically	That these are the only useful temporal scales
Appendix D training details	Implementation detail	Baselines use reproducible tuning and fixed random seeds	That the pilot exhausts the model-design space
Appendix E full flux results	Exploratory extension / complementary evidence	No method dominates across ET, GPP, and NEE; NEE shows weaker separation	That the ET-focused main-text interpretation transfers unchanged to every flux

The robustness check for conditional-shift diagnostics deserves particular attention. The main diagnostic evaluates errors under the training marginal distribution using importance weighting. The appendix also evaluates under a shared-support reference marginal. The qualitative patterns are broadly similar, though magnitudes change. For GPP and NEE, the shared-marginal diagnostic assigns relatively more conditional shift to temperature extrapolation than the training-marginal version.

That is exactly how a serious diagnostic should behave: stable enough to support the broad interpretation, sensitive enough to remind readers that weighting choices are not theology.

What the paper directly shows, and what Cognaptus infers

The paper directly shows three things.

First, FLUXtrapolation provides a fixed benchmark for hourly ecosystem-flux extrapolation using data derived from a FLUXCOM-X-style pipeline, with reproducible splits and evaluation protocols. This matters because existing flux-upscaling work is scientifically mature, but method comparison benefits from a portable benchmark.

Second, the benchmark’s three scenarios generate distinct shift structures. Covariate shift grows from temporal to random spatial to temperature-based spatial extrapolation. Conditional shift appears in all scenarios, but does not obey the same ordering. This makes the benchmark more than a difficulty ladder; it is a diagnostic device.

Third, the pilot study shows that median hourly RMSE can hide meaningful differences. Tail-focused, domain-level, and multi-scale evaluation separates baselines and exposes failures in anomalies and interannual variability. XGBoost, MLP, and CORAL are strong baselines for the ET-focused analysis; standard domain-generalisation methods do not reliably convert shift information into improved performance.

Cognaptus infers the following for business use.

For carbon analytics, climate-risk modelling, environmental monitoring, and ESG-adjacent measurement products, the model-selection question should not start with architecture. It should start with the deployment map. Where are observations dense? Where will predictions be sold, used, audited, insured, financed, or reported? Which domains are high-risk because they are underrepresented, warmer, drier, wetter, operationally remote, or scientifically unusual?

The evaluation protocol should then imitate those failure modes. Not perfectly. Perfect realism is usually how evaluation projects die in committee. But enough to make the model face the kinds of shifts it will meet outside the training data.

For non-climate AI, the same design logic transfers cleanly. Retail demand models trained in mature stores and deployed to new neighbourhoods. Credit models trained in banked populations and deployed to thinner-file customers. Maintenance models trained on heavily instrumented assets and deployed to older equipment. Medical triage tools trained in large hospitals and deployed in rural clinics. In each case, the dangerous question is not “what is the average test error?” It is “which domains did we make invisible by averaging?”

Business value is cheaper diagnosis, not just better prediction

The practical value of FLUXtrapolation is not that it offers a ready-made commercial climate-risk engine. It does not. Its value is that it makes diagnosis cheaper before deployment damage becomes expensive.

A benchmark like this can support three operator behaviours.

First, it forces explicit split design. Instead of claiming that the train/test split is representative because it was random, the evaluation must state what kind of deployment it approximates: future years, unseen locations, warmer locations, or some other stress condition.

Second, it makes tail performance reportable. The 90th percentile of domain-level error is not a niche statistic when your product may be judged by performance in underrepresented regions. The median tells you how the comfortable middle behaves. The tail tells you where angry emails are likely to come from.

Third, it connects metrics to decision horizons. Hourly performance, weekly aggregation, seasonal cycles, anomalies, interannual variability, and site means are not interchangeable. A model used for operational water stress, annual carbon accounting, and anomaly detection may need to pass different tests.

This is where AI governance becomes less theatrical. The point is not to add a “model risk” checkbox after training. The point is to design the benchmark so the model has fewer places to hide.

Boundaries: this is a stress test, not the planet in miniature

The paper is careful about its boundaries, and the business interpretation should be equally disciplined.

FLUXtrapolation is not the full global upscaling problem. It holds out known tower sites or site-years to approximate towerless prediction. That is useful, but it is still an approximation. The true target regions may differ more severely from FLUXNET sites than held-out towers do.

The data cover available sites from 2015 to 2022 with associated VIIRS-based remote sensing. This makes the benchmark contemporary and useful, but also geographically uneven. North America and Europe remain overrepresented. Regions where upscaling may matter greatly can remain sparse.

The shift diagnostics depend on observed covariates and common-support assumptions. If the decisive hidden variable is not captured in the input data, the diagnostic can point to conditional shift but cannot magically measure the missing cause. Importance weighting also depends on estimated density ratios and clipping choices. The appendix robustness check helps, but it does not abolish the usual fragility of weighting under limited support.
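The sensitivity to clipping is easy to demonstrate. The sketch below uses the standard density-ratio trick (turning a domain classifier's probability of "train" into a weight) and a self-normalised weighted RMSE; both are generic techniques assumed here for illustration, not a reproduction of the paper's estimator. The function names, the three residuals, the classifier probabilities, and the clip values are all hypothetical.

```python
import math

def density_ratio_weights(p_train_given_x, n_train, n_test, clip):
    """w(x) ~ p_train(x) / p_test(x), from a domain classifier's P(train | x)."""
    eps = 1e-6
    weights = []
    for p in p_train_given_x:
        p = min(max(p, eps), 1.0 - eps)               # avoid division by zero
        ratio = (p / (1.0 - p)) * (n_test / n_train)  # undo the class prior
        weights.append(min(ratio, clip))              # clipping tames huge weights
    return weights

def weighted_rmse(residuals, weights):
    """Self-normalised importance-weighted RMSE."""
    num = sum(w * r * r for r, w in zip(residuals, weights))
    return math.sqrt(num / sum(weights))

# Hypothetical test points: the first looks very much like training data and
# has a small residual; the last looks unlike training data and has a large one.
test_residuals = [0.5, 1.0, 3.0]
p_train = [0.95, 0.50, 0.10]

rmse_by_clip = {}
for clip in (5.0, 50.0):
    w = density_ratio_weights(p_train, n_train=1000, n_test=1000, clip=clip)
    rmse_by_clip[clip] = weighted_rmse(test_residuals, w)
    print(f"clip={clip:>4}: weights={[round(x, 2) for x in w]}, "
          f"weighted RMSE={rmse_by_clip[clip]:.3f}")
```

Same residuals, same classifier output, two different answers, and the only thing that changed was the clip. A conditional-shift estimate built on this kind of weighting inherits that dependence, which is why a robustness check under an alternative reference marginal is worth having.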

The pilot study is a pilot study. It demonstrates benchmark separability and reveals useful failure modes. It does not prove that XGBoost, MLP, or CORAL are the final answer, nor that all future domain-generalisation methods will fail. It simply raises the entry price for claims of robustness.

That is enough.

Trust the test before you trust the model

FLUXtrapolation is about ecosystem fluxes, but its broader message is about institutional self-defence.

The model may be sophisticated. The benchmark decides whether that sophistication meets reality. A median hourly score can make several models look comfortably similar. A tail-focused, deployment-shaped, multi-scale evaluation can reveal that the comfort was mostly accounting.

For operators, the lesson is blunt: when observations are sparse and deployment domains are uneven, do not buy model confidence by averaging away the hard places. Build the hard places into the test.

Climate systems have enough hidden variables already. Business systems, regrettably, keep trying to add more.

Cognaptus: Automate the Present, Incubate the Future.

¹ Anya Fries, Jacob A. Nelson, Martin Jung, Markus Reichstein, and Jonas Peters, “FLUXtrapolation: A benchmark on extrapolating ecosystem fluxes,” arXiv:2605.19812, 2026. https://arxiv.org/abs/2605.19812
