Forecasting is where machine learning often learns humility.

A language model can sound clever while being wrong. A forecasting model has fewer hiding places. Revenue arrives or it does not. CPU saturation happens or it does not. Demand spikes, latency drifts, inventories rot, turbines fail, and the spreadsheet smiles politely before punishing everyone involved. This is why time-series foundation models have been treated with a particular kind of suspicion: useful, interesting, sometimes impressive, but not yet comfortably scalable in the way large language models became scalable.

Datadog’s Toto 2.0 paper tries to move that line.[1] The paper’s headline claim is simple: a single training recipe produces reliable forecast-quality gains from 4 million to 2.5 billion parameters. The more important part is less headline-friendly: the scaling result is not a magic property of parameter count. It is produced by a linked recipe: contiguous patch masking, a quantile output head, NorMuon optimization, synthetic and observability-only pretraining, and a u-µP hyperparameter transfer pipeline.

That distinction matters. The lazy reading is “bigger forecasting models now work.” The useful reading is “scaling forecasting models becomes operationally plausible when the training and inference recipe removes several bottlenecks at once.” The first reading is a slogan. The second is a system design lesson.

The result is a scaling claim, but the story is a recipe

Toto 2.0 is a family of five open-weight forecasting models: 4m, 22m, 313m, 1B, and 2.5B parameters. The paper reports state-of-the-art results on three benchmarks: BOOM, an observability benchmark; GIFT-Eval, a broad general forecasting benchmark; and TIME, a newer contamination-resistant benchmark built from fresh datasets.

The result is unusually clean for the time-series foundation model field. On BOOM, all Toto 2.0 sizes sit on the Pareto frontier, and the largest three lead with CRPS ranks of 3.88, 3.96, and 4.26. The 22m model beats Toto 1.0’s CRPS rank of 6.94 with roughly seven times fewer parameters; the 4m model is competitive with Toto 1.0 and Chronos-2 despite being much smaller. On GIFT-Eval foundation-model comparisons, the three largest Toto 2.0 models take the top three CRPS-rank positions: 20.3, 21.1, and 21.4. On TIME, the three largest models again take the top three slots across metrics, although the 313m model edges out the 1B model on MASE and rank metrics. That small inversion is useful. It keeps everyone from putting “bigger is always better” on a procurement slide, where good ideas go to become budget errors.

But the paper is not just another leaderboard report. Its real contribution is the recipe that makes the leaderboard credible.

| Design component | What it changes | Operational consequence | Evidence role in the paper |
| --- | --- | --- | --- |
| Contiguous patch masking | Replaces step-by-step autoregressive decoding with single-pass horizon prediction | Forecast latency becomes much flatter across horizon length; error compounding is reduced | Main mechanism plus latency evidence |
| Quantile output head | Replaces Toto 1.0’s Student-T mixture with nine quantile predictions | Probabilistic forecasting becomes more stable at scale | Main architectural change |
| NorMuon | Matches optimizer behavior to pinball-loss gradients | Training large quantile models becomes more stable and efficient | Mechanism argument plus tuned recipe |
| Observability + synthetic pretraining | Removes public time-series pretraining from base models | Public benchmark performance becomes a stronger test of cross-domain generalization | Data-mixture evidence and generalization argument |
| u-µP transfer | Tunes a 10m proxy once and transfers settings to larger models | A model family becomes trainable without five separate hyperparameter hunts | Scaling infrastructure contribution |

A mechanism-first reading is therefore the right one. The benchmark scores matter, but they are downstream of a more interesting claim: time-series model scaling starts to look like infrastructure only when architecture, loss, optimizer, data, and hyperparameter transfer are made mutually compatible.

CPM changes the economics of forecasting, not just the architecture diagram

Toto 1.0 forecasted autoregressively. For a long horizon, it extended the series one patch at a time. If the forecast horizon is $H$ and the patch size is $P$, the model needs up to $K = \lceil H/P \rceil$ sequential calls. That is not just slower. It also invites compounding errors: early mistakes become context for later predictions, and the model’s confidence gets to drift elegantly into nonsense.

Toto 2.0 replaces this with contiguous patch masking. During training, the model sees variable-length masked spans and learns to fill them. At inference, the entire forecast horizon is represented as masked future patches, and the transformer predicts them in a single forward pass. The model is trained to do the thing it will be asked to do later: recover a contiguous missing future span.
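The input layout this implies can be sketched in a few lines. This is an illustration under assumptions, not Toto 2.0's code: the `MASK` placeholder, the function names, and the toy sizes are all hypothetical, and a real model would use a learned mask embedding rather than `None`.

```python
import math

MASK = None  # hypothetical stand-in for a learned mask embedding


def to_patches(series, patch_size):
    """Split a series into consecutive, non-overlapping patches."""
    if len(series) % patch_size != 0:
        raise ValueError("series length must be a multiple of patch_size")
    return [series[i:i + patch_size] for i in range(0, len(series), patch_size)]


def contiguous_mask(n_patches, start, length):
    """Training-style mask: True marks one contiguous hidden span of patches."""
    if start < 0 or length < 1 or start + length > n_patches:
        raise ValueError("masked span must fit inside the patch sequence")
    return [start <= i < start + length for i in range(n_patches)]


def build_inference_input(context, horizon, patch_size):
    """Context patches followed by masked patches that cover the horizon."""
    n_future = math.ceil(horizon / patch_size)
    return to_patches(context, patch_size) + [MASK] * n_future


# Toy sizes chosen for readability, not taken from the paper.
context = [float(t) for t in range(128)]  # 4 patches of 32 steps
model_input = build_inference_input(context, horizon=96, patch_size=32)
print(len(model_input), model_input[-3:])
print(contiguous_mask(n_patches=8, start=5, length=3))
```

The point of the sketch is the symmetry: the training mask and the inference input have the same shape, a run of visible patches followed by a contiguous hidden span.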

This is not decorative alignment between pretraining and inference; it changes the cost curve. In the latency study, a 1,024-step forecast requires up to 16 autoregressive steps for Toto 1.0 but one forward pass for single-pass Toto 2.0. Every Toto 2.0 size is significantly faster than Toto 1.0 at this horizon. The paper also reports that the 313m model runs at roughly the same latency as Chronos-2, even though Chronos-2 has only 120m parameters. At a 4,096-step horizon, the 2.5B model in single-pass mode remains faster than Chronos-2.
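The call-count arithmetic behind that cost curve is simple enough to write down. A minimal sketch: the function names are hypothetical, and the patch size of 64 is an assumption inferred from the "16 steps for 1,024" figure rather than something stated above. Call count is also not latency; it only shows why one curve grows with the horizon and the other does not.

```python
import math


def autoregressive_calls(horizon: int, patch_size: int) -> int:
    """Sequential forward passes when decoding one patch per call."""
    if horizon <= 0 or patch_size <= 0:
        raise ValueError("horizon and patch_size must be positive")
    return math.ceil(horizon / patch_size)


def single_pass_calls(horizon: int) -> int:
    """Single-pass decoding fills every masked future patch in one call."""
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    return 1


ASSUMED_PATCH_SIZE = 64  # assumption, inferred from 1,024 / 16

for horizon in (256, 1024, 4096):
    print(horizon,
          autoregressive_calls(horizon, ASSUMED_PATCH_SIZE),
          single_pass_calls(horizon))
```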

For business forecasting, this is the part that is easy to underread. Accuracy improvements are nice; latency improvements change where a model can be used. A model that can issue long probabilistic forecasts quickly is not only a batch-planning tool. It can become part of monitoring, simulation, anomaly triage, scenario planning, or agentic operations where forecasts must be refreshed repeatedly.

There is a boundary. The paper distinguishes single-pass decoding from block decoding. Single-pass is fast and used for leaderboard submissions. Block decoding forecasts in segments, conditioning each block on the previous segment’s median, reusing KV cache, and trading speed for long-horizon stability. The authors find single-pass generally stable up to roughly a 768-step horizon on synthetic multi-scale signals. For the long-horizon stability experiment, they use block decoding. So the mechanism is not “one pass solves all horizons.” It is more precise: CPM gives the system a fast default path and a more conservative block path when horizon length threatens coherence.
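The difference between the two paths can be sketched with a toy forecaster. Everything here is a hypothetical stand-in: `toy_model` just extends the last observed slope and returns three quantiles per step, and the sketch does not model KV-cache reuse. Only the control flow, feeding each block's median back in as context, follows the description above.

```python
def toy_model(history, steps):
    """Hypothetical forecaster: extends the last observed slope.

    Returns one (q10, q50, q90) tuple per forecast step.
    """
    slope = history[-1] - history[-2]
    out = []
    for i in range(1, steps + 1):
        mid = history[-1] + slope * i
        out.append((mid - 1.0, mid, mid + 1.0))
    return out


def single_pass_decode(model, context, horizon):
    """One call covers the whole horizon."""
    return model(list(context), horizon)


def block_decode(model, context, horizon, block_size):
    """Forecast in blocks, feeding each block's median back as context."""
    history = list(context)
    forecast = []
    while len(forecast) < horizon:
        steps = min(block_size, horizon - len(forecast))
        block = model(history, steps)
        forecast.extend(block)
        history.extend(q50 for _, q50, _ in block)
    return forecast


context = [0.0, 1.0, 2.0]
blocks = block_decode(toy_model, context, horizon=5, block_size=2)
print([q50 for _, q50, _ in blocks])
```

With a real model the two paths would generally disagree, which is the whole reason to offer both; the toy forecaster is linear, so here they coincide.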

That is less magical. Also more useful.

The quantile head makes probability practical, then forces an optimizer rethink

Toto 1.0 used a Student-T mixture model to produce probabilistic forecasts. Toto 2.0 switches to a quantile output head, predicting nine quantiles from 0.1 to 0.9 and training with pinball loss.

For target $y$ and predicted quantile $\hat{q}_\tau$, the pinball loss at quantile level $\tau$ is:

$$ \rho_\tau(y - \hat{q}_\tau) = (y - \hat{q}_\tau)(\tau - \mathbf{1}[y < \hat{q}_\tau]) $$

The model averages this loss over the nine quantiles. This is a sensible choice for production forecasting because many business decisions need distributional information, not just a single expected value. A warehouse manager cares about upper-tail demand. An SRE team cares about high-percentile latency risk. A finance team cares about downside intervals, unless it is trying very hard not to.
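For readers who prefer code to notation, here is a minimal sketch of the pinball loss and its nine-quantile average. It assumes the nine levels are evenly spaced at 0.1, 0.2, ..., 0.9, which the text implies but does not spell out; the function names are illustrative, not from the Toto codebase.

```python
# Assumed levels: nine evenly spaced quantiles from 0.1 to 0.9.
QUANTILE_LEVELS = [i / 10 for i in range(1, 10)]


def pinball_loss(y, q_hat, tau):
    """Pinball loss for one target y and one predicted quantile q_hat."""
    indicator = 1.0 if y < q_hat else 0.0
    return (y - q_hat) * (tau - indicator)


def mean_pinball_loss(y, q_hats, taus=QUANTILE_LEVELS):
    """Average the pinball loss over all predicted quantile levels."""
    if len(q_hats) != len(taus):
        raise ValueError("need one prediction per quantile level")
    return sum(pinball_loss(y, q, t) for q, t in zip(q_hats, taus)) / len(taus)


# Under-predicting the 0.9 quantile costs more than over-predicting it.
print(pinball_loss(10.0, 8.0, 0.9))   # prediction 2 below the target
print(pinball_loss(10.0, 12.0, 0.9))  # prediction 2 above the target
```

The asymmetry is the useful part: at the 0.9 level, missing low is penalized nine times as heavily as missing high by the same amount.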

The switch also avoids practical issues the authors encountered with the Student-T mixture at scale: numerical instability at large activations and divergence when predictions approach zero because of the variance term in its normalization. Quantile heads are not new in forecasting, but in Toto 2.0 they are part of the scaling recipe rather than a cosmetic output choice.

The optimizer section is where the paper becomes more interesting than the usual “we used optimizer X” footnote. Pinball loss has sign-valued gradients. The derivative with respect to the predicted quantile takes only three values: $-\tau$, $0$, or $1-\tau$, depending on whether the prediction is below, equal to, or above the target. Unlike mean squared error, the gradient magnitude does not grow with the size of the error.
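A small sketch makes the contrast concrete. The function names are illustrative; the gradients themselves follow directly from the loss definitions, with the subgradient at equality taken as zero.

```python
def pinball_grad(y, q_hat, tau):
    """(Sub)gradient of the pinball loss with respect to q_hat."""
    if q_hat < y:
        return -tau
    if q_hat > y:
        return 1.0 - tau
    return 0.0


def mse_grad(y, q_hat):
    """Gradient of the squared error (q_hat - y)**2 with respect to q_hat."""
    return 2.0 * (q_hat - y)


y, tau = 100.0, 0.9
for q_hat in (99.0, 50.0, 0.0):
    # The pinball gradient is identical for a miss of 1 and a miss of 100;
    # the squared-error gradient scales with the size of the miss.
    print(q_hat, pinball_grad(y, q_hat, tau), mse_grad(y, q_hat))
```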

That matters because AdamW’s adaptive behavior depends heavily on gradient variance. AdamW can still train the model, but the paper argues that its useful dynamic range becomes limited under sign-valued pinball gradients. Muon, meanwhile, removes Adam-style second-moment adaptation and orthogonalizes momentum updates. That can help on smooth losses, but for pinball loss it may discard one of the few remaining mechanisms for step-size adaptation.

NorMuon is positioned as the compromise. It keeps the matrix-level orientation benefits of Muon while normalizing rows against an exponential moving average of squared row magnitudes. In plainer language: it tries to stop a few neurons from dominating updates while restoring a variance-like adaptation mechanism, now per neuron rather than per parameter.
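The row-normalization idea can be sketched on its own. This is deliberately not NorMuon: it omits momentum, the orthogonalization step, and any learning-rate rescaling, and takes an already-computed update matrix as input. The `beta` and `eps` values are illustrative assumptions, and the function name is hypothetical.

```python
import math


def row_normalize_update(update, row_ema, beta=0.95, eps=1e-8):
    """Scale each row of an update by a running estimate of its magnitude.

    update:  list of rows (one row per output neuron)
    row_ema: running mean-square magnitude per row
    Returns (normalized_update, new_row_ema).
    """
    normalized, new_ema = [], []
    for row, ema in zip(update, row_ema):
        mean_sq = sum(v * v for v in row) / len(row)
        ema = beta * ema + (1.0 - beta) * mean_sq
        scale = 1.0 / (math.sqrt(ema) + eps)
        normalized.append([v * scale for v in row])
        new_ema.append(ema)
    return normalized, new_ema


# One row is 100x larger than the other before normalization.
update = [[10.0, 10.0], [0.1, 0.1]]
normalized, ema = row_normalize_update(update, row_ema=[0.0, 0.0], beta=0.0)
print(normalized)
```

With `beta=0.0` the running estimate equals the current mean square, so both rows come out at roughly unit scale. With a nonzero `beta`, the scale depends on each row's history, which is the variance-like adaptation the paragraph describes, applied per neuron rather than per parameter.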

This is a good example of mechanism-level evidence even without a clean ablation table in the article body. The optimizer choice is not presented as fashion. It is tied to the loss geometry. When the loss no longer tells the optimizer “how wrong” a prediction is, the optimizer must recover useful scale information elsewhere. Toto 2.0’s recipe makes that design dependency explicit.

The data result quietly attacks a common forecasting assumption

Most readers will expect a general forecasting model to need broad public time-series data during pretraining. Toto 2.0’s base models do not use public forecasting data in pretraining at all. They train on Datadog internal observability metrics and synthetic data.

The larger models see 5.04 trillion data points; the smaller 4m and 22m models see 3.40 trillion. The full mix for the base models is 42.5% observability data and 57.5% synthetic data. Within the observability portion, Toto 2.0 rebalances away from Toto 1.0’s heavy skew toward 10-second metrics. The share of 5-minute-plus data rises from 5.0% to 35.3%, while 10-second data falls from 78.5% to 47.1%.

The synthetic component is also not just generic noise. The paper uses the TempoPFN generation method, which produces nonstationary trends, abrupt changepoints, and long-range dependencies. The business translation is straightforward: if the model only sees clean, polite sequences, it will be startled by reality, which is not known for its good manners.

The important correction is this: “no public pretraining” does not mean “domain-free.” The real-world data comes from observability metrics: CPU utilization, memory, request latency, error rates, and similar infrastructure signals. That is a strong and valuable domain, but still a domain. The paper’s public-benchmark results are impressive partly because the base models generalize from observability plus synthetic data into energy, retail, weather, finance, and other GIFT-Eval domains. But we should not turn this into the stronger claim that public data is unnecessary for all forecasting work.

The paper itself gives the right nuance. Public data is excluded from base pretraining because the proxy-scale hyperparameter sweep found it suboptimal. But public data re-enters during finetuning: the Toto 2.0 2.5B-FT mix includes 45% GIFT-Eval Pretrain, 15% GIFT-Eval train, plus Datadog and synthetic sources. So public data is not useless. It moves from foundation pretraining to downstream adaptation.

That is an important business design pattern: keep the base model broad and controlled; use task-relevant public or proprietary data when adapting. For companies, this is less romantic than “one model to forecast everything,” but cheaper to govern.

u-µP turns five models into a product family

Scaling only matters commercially if it gives users choices. A 2.5B model may be better, but not every use case wants the same latency, memory footprint, or deployment environment. Forecasting systems often need tiers: lightweight models near the edge, medium models for routine decision support, larger models for high-value planning or probabilistic scenario analysis.

The paper’s u-µP pipeline is what makes that tiering credible. Instead of independently tuning each model size, the authors tune a 10m proxy model and transfer the configuration across the five target sizes. The proxy has 12 layers, $d_{model}=256$, and 4 heads. Each sweep trial trains for 30,000 steps. The search space is still large: 17 continuous dimensions plus categorical choices, split into four sequential rounds.

The four rounds follow a useful dependency order:

| Search round | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Architecture | Find a stable backbone and masking setup | PerDimScale, variate-attention placement, and CPM settings are part of the selected recipe | Not a universal proof that these choices dominate in every time-series domain |
| Data mixture | Select source proportions empirically | Public pretraining data was removable at proxy scale; synthetic and observability mix worked best | Not a general theorem that public data harms pretraining |
| Optimizer | Match training dynamics to quantile loss | NorMuon and AdamW parameter groups can be tuned into a stable large-model recipe | Not a standalone optimizer benchmark |
| Decay schedule | Tune the end of training | Linear decay over 10,500 steps worked for the family | Not evidence that this schedule is optimal outside this setup |

The selected model family then scales embedding dimension, depth, and head count while keeping the head dimension fixed at 64. The 4m model uses $d_{model}=256$, 4 heads, and 4 layers. The 2.5B model uses $d_{model}=2048$, 32 heads, and 48 layers. All train on 4,096-timestep contexts with patch size 32 and 32 variates per sample. The small models converge at 400,000 steps; the larger ones continue improving and train for 600,500 steps.
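The fixed head dimension is easy to check against the numbers quoted in this article. A small sketch: the config labels are hypothetical names, only the proxy and the two endpoint sizes are listed because those are the only shapes given here, and the intermediate sizes are left out rather than guessed.

```python
from dataclasses import dataclass

HEAD_DIM = 64        # fixed across the family, per the text
CONTEXT_LEN = 4096   # timesteps per training context
PATCH_SIZE = 32


@dataclass(frozen=True)
class ModelConfig:
    name: str        # hypothetical label, not an official checkpoint name
    d_model: int
    n_heads: int
    n_layers: int

    def head_dim(self) -> int:
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads


CONFIGS = [
    ModelConfig("proxy-10m", d_model=256, n_heads=4, n_layers=12),
    ModelConfig("toto2-4m", d_model=256, n_heads=4, n_layers=4),
    ModelConfig("toto2-2.5b", d_model=2048, n_heads=32, n_layers=48),
]

PATCHES_PER_CONTEXT = CONTEXT_LEN // PATCH_SIZE

for cfg in CONFIGS:
    print(cfg.name, cfg.head_dim(), cfg.head_dim() == HEAD_DIM)
print("patches per context:", PATCHES_PER_CONTEXT)
```

Note that the proxy and the 4m model share width and head count and differ only in depth, which is part of what makes the proxy a reasonable stand-in for the family.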

This is where the paper’s infrastructure contribution matters. u-µP is not merely a training trick. The authors had to make it compatible with production training systems: torch.compile, FSDP2 sharding, data and tensor parallelism, and sequence-length behavior that interacts with KV caching. The upstream unit-scaling library was not enough for their large-scale setting, so Datadog released dd_unit_scaling as a distributed training wrapper.

For business readers, this is a sign that the paper is closer to deployable engineering than academic leaderboard theater. The team had to solve the ugly part: making the scaling recipe survive actual training infrastructure. It is always touching when research meets distributed systems and discovers that tensors have administrative needs.

The benchmark evidence supports scaling, but not every figure has the same job

The paper’s evidence is strong, but the figures play different roles. Mixing them together would overstate some claims and understate others.

| Evidence item | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| BOOM results | Main evidence in observability forecasting | Toto 2.0 dominates external foundation models on Datadog-relevant telemetry-style forecasting | General superiority on every business time series |
| GIFT-Eval foundation-model results | Main cross-domain evidence | The base family generalizes beyond observability and synthetic pretraining | That public data is never useful |
| GIFT-Eval finetuned and FnF ensemble results | Adaptation and comparison with leaderboard systems | Toto 2.0 is a strong base for fine-tuning and ensembles; the ensemble assigns 39% average weight to Toto 2.0 family models | Zero-shot scaling claim, because these results use training splits or ensemble machinery |
| TIME benchmark results | Contamination-resistant robustness evidence | Toto 2.0 performs strongly on fresh datasets less likely to be in pretraining corpora | Perfect monotonic scaling; 313m edges 1B on some TIME metrics |
| Inference latency study | Implementation and deployment evidence | CPM improves long-horizon inference economics versus Toto 1.0 | Full end-to-end production latency in every deployment stack |
| Long-horizon sinusoidal study | Exploratory stability extension | Larger Toto 2.0 models retain multi-scale structure longer than smaller and prior models | Extrapolation to genuinely novel dynamics |

This table is not pedantry. It prevents the article from doing what AI commentary often does: take every experiment as equal proof of the broadest possible claim. The strongest claim is that Toto 2.0 shows reliable time-series foundation model scaling under one recipe. The adaptation results show the model is a strong base. The latency results show CPM is operationally meaningful. The long-horizon study shows encouraging stability beyond training context on synthetic multi-scale signals, not supernatural extrapolation.

The TIME result is especially important because it reduces the contamination concern that dogs older forecasting benchmarks. TIME uses fresh datasets and avoids legacy datasets that have circulated through time-series foundation model pretraining corpora. Toto 2.0’s 2.5B model leads on CRPS rank, MASE rank, and CRPS, and the top three models are all from the Toto 2.0 family. But the 313m model leads on MASE and edges the 1B model on both rank metrics. That does not break the scaling argument; it makes it more believable. Real scaling curves have small dents. Marketing curves are the ones that look suspiciously polished.

What this means for business forecasting infrastructure

The business value of Toto 2.0 is not that every company should immediately deploy a 2.5B forecasting model. That would be an impressively expensive way to misunderstand the paper.

The practical message is that time-series foundation models are becoming infrastructure candidates. They can be selected by deployment tier, adapted to domain data, evaluated probabilistically, and integrated into systems where latency matters. That is different from treating them as research demos or as one-off forecasting tools.

A useful business reading separates three layers.

| Layer | What the paper directly shows | Cognaptus inference for business use | Remaining uncertainty |
| --- | --- | --- | --- |
| Forecast quality | Larger Toto 2.0 models generally improve benchmark quality across BOOM, GIFT-Eval, and TIME | Model size can become a controllable quality-cost lever in forecasting platforms | The optimal size may vary by domain, horizon, and metric |
| Forecast economics | CPM reduces long-horizon forward-pass latency versus Toto 1.0 | Foundation forecasters become more plausible inside repeated monitoring and planning loops | End-to-end production latency depends on hardware, batching, integration, and post-processing |
| Adaptation | Finetuned and ensemble variants lead the full GIFT-Eval leaderboard | Base foundation models may serve as reusable forecasting substrates for specialized enterprise models | Fine-tuning recipes, data rights, evaluation discipline, and leakage controls remain critical |

Several use cases become more plausible.

First, observability and IT operations. Toto 2.0 is trained heavily on infrastructure metrics and evaluated on BOOM, so this is its most direct business pathway. Forecasting CPU, memory, latency, error rates, and request patterns can support capacity planning, anomaly anticipation, incident triage, and automated alert thresholding. The model does not replace root-cause analysis, but it can improve the forward-looking layer that tells teams where attention is likely to be needed.

Second, probabilistic planning. Quantile forecasts are useful when the cost of being wrong is asymmetric. Retail demand, cloud capacity, logistics, energy load, staffing, and financial liquidity all have different penalties for under- and over-forecasting. A point forecast says “maybe 100.” A quantile forecast says “the 90th percentile is 140.” Businesses tend to prefer the second when the first has previously embarrassed them in front of customers.
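One way to turn that asymmetry into a decision is the classical critical-fractile rule: plan to the quantile level equal to the cost of under-forecasting divided by the total of both costs. The sketch below applies it to a nine-quantile forecast. The forecast values and cost figures are made up for illustration, and the function names are hypothetical; nothing here comes from the paper.

```python
def critical_fractile(cost_under, cost_over):
    """Quantile level that balances under- and over-forecasting costs."""
    if cost_under <= 0 or cost_over <= 0:
        raise ValueError("costs must be positive")
    return cost_under / (cost_under + cost_over)


def pick_quantile(forecast, cost_under, cost_over):
    """Choose the available quantile level closest to the critical fractile."""
    target = critical_fractile(cost_under, cost_over)
    level = min(forecast, key=lambda tau: abs(tau - target))
    return level, forecast[level]


# Illustrative nine-quantile demand forecast (made-up numbers).
forecast = {
    0.1: 70, 0.2: 80, 0.3: 88, 0.4: 94, 0.5: 100,
    0.6: 106, 0.7: 114, 0.8: 125, 0.9: 140,
}

# A stock-out costs nine times as much as a unit of excess inventory.
print(pick_quantile(forecast, cost_under=9.0, cost_over=1.0))
# Symmetric costs fall back to the median.
print(pick_quantile(forecast, cost_under=1.0, cost_over=1.0))
```

A point forecast cannot support this choice at all, because it only ever offers one number to plan against.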

Third, model tiering. A 4m or 22m model can support low-cost, frequent, or edge-like deployments. A 313m or 1B model may be suitable for heavier batch planning or internal decision support. A 2.5B model can be reserved for high-stakes probabilistic forecasting or as a base for finetuning. The paper’s five-model family is therefore more business-relevant than a single giant checkpoint, because procurement and deployment teams need menus, not monuments.

Fourth, forecasting-as-a-component in agents. If an operational agent needs to simulate future load, estimate risk over time, prioritize incidents, or schedule interventions, it needs fast forecasts that can be refreshed repeatedly. CPM’s latency behavior is relevant here. Agents do not merely need better predictions; they need prediction calls cheap enough to appear inside loops without turning the workflow into a very expensive meditation exercise.

The boundary: scaling helps, but classical forecasting is not dead

The paper is unusually disciplined about its own boundary conditions. It explicitly says classical statistical methods still have properties foundation models often lack: clean extrapolation on simple signals, appropriate prediction-interval growth under well-specified models, and more predictable behavior on out-of-distribution samples.

The long-horizon synthetic experiment makes this visible. At horizons of 2,048, 4,096, and 8,192 steps on sinusoidal mixtures, the larger Toto 2.0 models retain coherent multi-scale structure much better than smaller models and prior-generation models. The 2.5B model reports Pearson correlations of 0.990, 0.979, and 0.818 across the three horizons. The 1B model reports 0.984, 0.945, and 0.643. Smaller models degrade more quickly. Toto 1.0 and Chronos-2 lose coherence earlier.
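The metric itself is easy to reproduce on a toy signal. This sketch is not the paper's experiment: the two periods, the horizon, and the exponentially damped "forecast" are illustrative assumptions, chosen only to show how Pearson correlation responds when a forecast keeps the right phase but gradually loses amplitude.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def sinusoid_mixture(n, periods=(24, 168)):
    """Sum of sinusoids at several time scales (periods are assumptions)."""
    return [sum(math.sin(2 * math.pi * t / p) for p in periods)
            for t in range(n)]


horizon = 2048
truth = sinusoid_mixture(horizon)
# Hypothetical forecast: correct phase, amplitude decaying over the horizon.
damped = [v * math.exp(-t / 1000.0) for t, v in enumerate(truth)]

print(round(pearson(truth, truth), 3))
print(round(pearson(truth, damped), 3))
```

Pearson correlation is insensitive to a constant rescaling, so it mostly rewards getting the shape and phase right. That is worth remembering when reading the numbers above: a high correlation says the structure is preserved, not that the intervals are calibrated.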

That is encouraging, but the authors are clear: this is an illustrative stability test, not proof of extrapolation to genuinely novel dynamics. Even the 2.5B model loses structure at 8,192 steps where a well-fitted seasonal model would extrapolate cleanly. This matters for businesses because many forecasting failures happen outside the comfortable distribution: regime shifts, holidays, outages, shocks, policy changes, product launches, and the usual collection of things that only become obvious after the dashboard is already on fire.

A sensible enterprise forecasting system should therefore treat foundation models as part of a forecasting stack, not as a total replacement for statistical baselines, domain rules, and scenario stress tests. Classical models remain useful for interpretable seasonal extrapolation, sanity checks, and stable baselines. Foundation models add transfer, multivariate context, and broader pattern recognition. The point is not to crown a monarch. It is to build a cabinet that occasionally reads the minutes.

The real strategic lesson is controllable scaling

The takeaway from Toto 2.0 should not be “time-series AI had its GPT moment, please buy GPUs.” It should be more specific.

Toto 2.0 shows that time-series foundation model scaling can be made more reliable when the whole system is designed for scaling: decode future spans in parallel, output quantiles instead of fragile mixture parameters, choose an optimizer that fits the loss geometry, train on a deliberately selected data mixture, and transfer hyperparameters from a proxy model rather than re-solving the training problem at every size.

For AI product teams, this is the strategic pattern to copy. Not necessarily Toto 2.0’s exact recipe. The exact recipe is tied to Datadog’s data, observability priorities, and training infrastructure. The transferable lesson is the dependency chain.

A forecasting model is not just a model. It is a stack of choices:

  1. What prediction protocol does the model train under?
  2. What uncertainty representation does the output layer produce?
  3. Does the optimizer match the loss geometry?
  4. Is the data mixture selected by habit, intuition, or empirical sweep?
  5. Can hyperparameters transfer across model sizes?
  6. Does inference latency allow the model to live inside real workflows?
  7. Are benchmark results separated from adaptation, robustness, and exploratory evidence?

Toto 2.0 answers these questions with enough engineering detail to make the scaling claim credible. That is why the paper matters. Not because bigger is newly magical, but because scaling has become structured enough to become a product decision.

Forecasting will still punish overconfidence. It has a hobby. But after Toto 2.0, the serious question changes. It is no longer whether time-series foundation models can scale at all. It is which organizations can turn scaling into governed, evaluated, latency-aware forecasting infrastructure before everyone else discovers that their “AI forecast” is just a larger spreadsheet with better branding.

Cognaptus: Automate the Present, Incubate the Future.


  1. Emaad Khwaja et al., “Toto 2.0: Time Series Forecasting Enters the Scaling Era,” arXiv:2605.20119, 2026. https://arxiv.org/abs/2605.20119