Squeezing Time: How Dynamic Tokenization Could Reshape Time‑Series Foundation Models

Forecasting systems have a bad habit: they treat every moment in the past as if it deserves the same amount of attention.

A quiet hour in an electricity-load curve. A sudden machine vibration spike. A slowly drifting weather signal. A crypto candle that does nothing for three hours and then ruins someone’s afternoon. To a standard point-wise time-series model, each timestamp is a token. To a fixed-patch model, every group of timestamps is compressed with the same ruler. Both choices are defensible. Both are also slightly lazy.

The paper behind TimeSqueeze asks a simple question with expensive consequences: why should smooth, repetitive regions and volatile, information-dense regions consume the same Transformer budget?¹

That question matters because time-series foundation models are increasingly asked to do what earlier forecasting systems did not: generalize across domains, ingest long historical contexts, and support many forecast horizons. The more history they see, the more memory and compute they burn. The usual response is patching: compress consecutive timestamps into fewer tokens before sending them to the Transformer. Effective, yes. But fixed patching assumes the same compression rate should apply everywhere inside the sequence. History is apparently made of equal-sized bricks. Convenient. Not always true.

TimeSqueeze proposes a more selective tokenizer. It first reads the raw sequence at full resolution using a lightweight state-space encoder, then dynamically chooses patch boundaries according to local signal change. Short patches are used where the signal changes quickly; longer patches are used where the signal is smooth or redundant. The result is not merely “fewer tokens.” It is a different allocation of model attention.

That is the core business relevance: long-context forecasting does not only need bigger models. It needs a smarter token budget.

The real bottleneck is not forecasting; it is how much past you can afford to read

Time-series forecasting has always had a practical tension between context and cost. More context can improve forecasting because many systems are seasonal, delayed, path-dependent, or regime-sensitive. But in Transformer-style architectures, longer input sequences become expensive. Feed the model more history, and the bill arrives in GPU memory, latency, and training time.
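A back-of-envelope sketch makes the cost side concrete. It assumes standard full self-attention, where the score matrix grows with the square of the token count, and it ignores everything else in the model (feed-forward layers, and any encoder or decoder that still runs at full resolution). The context length and compression rate below are illustrative values, and `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(num_tokens: int) -> int:
    """Entries in one full self-attention score matrix (queries x keys)."""
    return num_tokens * num_tokens


context_length = 4096  # timestamps of history (illustrative)
compression = 4        # average timestamps per token (illustrative)

pointwise = attention_pairs(context_length)
patched = attention_pairs(context_length // compression)

print(f"point-wise score entries: {pointwise:,}")
print(f"patched score entries:    {patched:,}")
print(f"reduction factor:         {pointwise / patched:.0f}x")
```

The quadratic term shrinks by the square of the compression rate, but the parts of a model that scale linearly with tokens shrink only linearly, so measured end-to-end savings will usually be smaller than this arithmetic suggests.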

The paper frames the tokenizer as the place where this trade-off is negotiated. A tokenizer decides how raw observations become model inputs. In time-series forecasting, two broad families dominate:

Tokenization choice	What it preserves	What it saves	What it risks
Point-wise tokens	Fine-grained temporal detail	Little	High cost for long contexts
Fixed-size patches	Compute and memory	A lot	Blurring local dynamics and imposing arbitrary boundaries
Dynamic patches	Detail where the signal changes, compression where it does not	Potentially a lot	Requires good boundary selection

Point-wise tokenization is the cautious accountant: keep every record, just in case. Fixed patching is the cost-cutter: group everything into uniform blocks and hope no important transition was chopped in half. TimeSqueeze tries to be the operations manager: spend attention where the process is actually moving.

This distinction is easy to miss. Many readers will see “patching” and think of it as a cheaper downsampling trick. The paper’s actual claim is narrower and more interesting: compression should adapt within a sequence, because information density is not constant across time.

A retail demand series may sit flat overnight, jump during a promotion, and drift after a competitor changes price. A logistics delay signal may be stable until a port disruption. A device sensor may carry little useful novelty until vibration changes sharply. In each case, the model should not spend the same representation budget on the quiet region and the transition region.

TimeSqueeze keeps full-resolution reading before it compresses

The key mechanism is a hybrid architecture. TimeSqueeze does not immediately squash the raw series into patches. First, it passes the input through a Mamba-based state-space encoder at native resolution. This matters because compression before feature extraction can throw away details that later layers cannot recover. The encoder produces fine-grained embeddings for every timestamp, using an architecture designed to handle long sequences more efficiently than full attention.

Only after this full-resolution pass does TimeSqueeze decide which embeddings to keep as patch boundary tokens.

A simplified flow looks like this:

Raw time series
   ↓
Full-resolution Mamba / SSM encoder
   ↓
Relative-deviation boundary detector
   ↓
Variable-length patches: short in volatile regions, long in smooth regions
   ↓
Compressed boundary tokens sent to Transformer / MoE backbone
   ↓
Unpatching by repeating boundary representations across their patch spans
   ↓
Mamba / SSM decoder combines compressed context with fine-grained features
   ↓
Multi-horizon forecasting heads

The boundary rule is intentionally simple. TimeSqueeze tracks the absolute difference between consecutive samples and compares it with local average signal power in a lookback window. A new patch boundary is declared when the local change exceeds a threshold scaled by local power. In less formal language: if the series suddenly changes relative to its recent magnitude, start a new patch.

That scaling is important. A $10$-unit move may be huge in one series and irrelevant in another. By using relative deviation rather than raw difference alone, the tokenizer can adapt across signal amplitudes. The paper sets a threshold parameter, $\tau$, and a maximum patch size. Its main configuration targets an average compression rate of $4\times$: with $\tau = 0.3$ and a maximum patch size of $8$, it reports hitting that $4\times$ average on the pretraining dataset.
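A minimal NumPy sketch of such a rule follows. The exact formula is an assumption of this sketch, not the paper's code: local power is taken as the root-mean-square of the lookback window so that it shares units with the sample difference, and `lookback=16` is an arbitrary illustrative value. Only `tau=0.3` and the maximum patch size of 8 come from the configuration described above. The function name `detect_boundaries` is hypothetical.

```python
import numpy as np


def detect_boundaries(x, tau=0.3, lookback=16, max_patch=8, eps=1e-8):
    """Return a boolean mask marking the first timestamp of each patch.

    Assumed rule: open a new patch at t when |x[t] - x[t-1]| exceeds
    tau times the RMS of the previous `lookback` samples, or when the
    current patch has already reached `max_patch` samples.
    """
    x = np.asarray(x, dtype=float)
    is_boundary = np.zeros(len(x), dtype=bool)
    if len(x) == 0:
        return is_boundary
    is_boundary[0] = True  # the first sample always opens a patch
    last = 0
    for t in range(1, len(x)):
        window = x[max(0, t - lookback):t]  # past samples only (causal)
        local_rms = np.sqrt(np.mean(window ** 2)) + eps
        jump = abs(x[t] - x[t - 1])
        if jump > tau * local_rms or t - last >= max_patch:
            is_boundary[t] = True
            last = t
    return is_boundary


# Toy signal: 32 flat samples followed by 16 samples alternating 1, 3.
quiet = np.ones(32)
volatile = np.tile([1.0, 3.0], 8)
signal = np.concatenate([quiet, volatile])
mask = detect_boundaries(signal)

print("tokens in quiet region:   ", int(mask[:32].sum()))
print("tokens in volatile region:", int(mask[32:].sum()))
```

On this toy signal the 32 quiet samples collapse into 4 tokens, all forced by the patch-size cap, while each of the 16 volatile samples keeps its own token.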

One design choice is especially practical: TimeSqueeze keeps boundary embeddings rather than averaging all embeddings inside a patch. Then, after the Transformer processes the compressed sequence, it “unpatches” by repeating the updated boundary representation across the patch span and sends it through the decoder. The paper argues that this preserves causal consistency because the boundary represents the start of the patch rather than leaking future within-patch information backward.
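The select-then-repeat idea can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names `compress` and `unpatch` are hypothetical, the first timestamp is assumed to be a boundary, and the Transformer stage is replaced by an identity stand-in so that only the indexing logic is shown.

```python
import numpy as np


def compress(embeddings, is_boundary):
    """Keep only the embedding at the start of each patch."""
    boundary_idx = np.flatnonzero(is_boundary)
    return embeddings[boundary_idx], boundary_idx


def unpatch(tokens, boundary_idx, seq_len):
    """Repeat each boundary token across its patch span.

    Patch k covers [boundary_idx[k], boundary_idx[k + 1]); the last patch
    runs to seq_len. Assumes boundary_idx[0] == 0.
    """
    ends = np.append(boundary_idx[1:], seq_len)
    spans = ends - boundary_idx
    return np.repeat(tokens, spans, axis=0)


# Six timestamps with 2-dimensional embeddings; patches start at 0, 3, 4.
embeddings = np.arange(12.0).reshape(6, 2)
is_boundary = np.array([True, False, False, True, True, False])

tokens, boundary_idx = compress(embeddings, is_boundary)
processed = tokens  # identity stand-in for the Transformer backbone
restored = unpatch(processed, boundary_idx, seq_len=len(embeddings))

print(restored)
```

Every position receives the representation of a boundary at or before it, never one from a later timestamp, which is the causal property the paper emphasizes.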

So the mechanism is not simply: “compress the input and hope.” It is:

1. read the original sequence with an efficient sequential encoder;
2. identify where signal changes justify token boundaries;
3. send fewer tokens to the expensive Transformer backbone;
4. reconstruct full-resolution representations with a decoder and residual fine-grained features.

That fourth step is not decorative. Without it, the architecture would be closer to ordinary downsampling. With it, TimeSqueeze becomes a multi-resolution system: local detail is handled by the SSM encoder-decoder, while broader contextual modeling is handled by the compressed Transformer stage.

The evidence is strongest when read as a token-budget argument

The paper evaluates TimeSqueeze mainly by replacing Time-MoE’s point-wise tokenizer with the SSM-based dynamic patching module while keeping the Time-MoE-style backbone and training setup comparable. This is the right experimental framing: it isolates tokenization instead of comparing one entirely different forecasting stack against another and then asking readers to divine which part did the work. A small mercy, but a real one.

The main evidence falls into several categories:

Test	Likely purpose	What it supports	What it does not prove
Zero-shot forecasting on ETT and Weather benchmarks	Main evidence	TimeSqueeze can remain close to Time-MoE while using a compressed representation	Universal superiority across all domains
Full-shot fine-tuning	Main evidence	The compressed tokenizer does not collapse when adapted to benchmark training splits	Production adaptation under messy data pipelines
Efficiency comparison	Main evidence	Dynamic compression lowers memory, training time, and inference cost	Total deployment ROI after engineering overhead
Fixed-patching and component ablations	Ablation	Dynamic boundaries and the SSM encoder-decoder are central contributors	That the exact threshold rule is optimal
Generic Transformer and GiftEvalPretrain tests	Robustness / extension	Gains are not only a Time-MoE artifact	Full generality across all backbones and scales

On zero-shot forecasting, TimeSqueeze is competitive with Time-MoE and other foundation-model baselines. The average MSE across the reported zero-shot benchmark table is $0.346$ for TimeSqueeze base, $0.348$ for TimeSqueeze large, $0.347$ for Time-MoE base, and $0.352$ for Time-MoE large. That is not a cartoonish “new model crushes old model” result. It is more useful than that: TimeSqueeze roughly preserves forecasting quality while changing the compute profile.

The full-shot setting tells a similar but slightly less flattering story, which is exactly why it is worth reading carefully. After one epoch of fine-tuning, TimeSqueeze base reports an average MSE / MAE of $0.327 / 0.360$, while Time-MoE base reports $0.315 / 0.357$. So TimeSqueeze does not beat Time-MoE in average full-shot accuracy. It does, however, outperform the other listed non-Time-MoE baselines in average MSE and remains close enough to Time-MoE that the efficiency trade-off becomes the real question.
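For a sense of scale, the relative gaps implied by those averages work out as follows. This is simple arithmetic on the figures quoted above, not a number reported in the paper:

$$
\frac{0.327 - 0.315}{0.315} \approx 3.8\% \ \text{(MSE)}, \qquad \frac{0.360 - 0.357}{0.357} \approx 0.8\% \ \text{(MAE)}
$$

So the full-shot penalty is a few percent on MSE and under one percent on MAE.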

That is the correct interpretation. The paper’s business implication is not “accuracy magically improves because patching is dynamic.” It is: if a compressed tokenizer can stay near a point-token foundation model while cutting the token burden, the economic frontier changes.

The efficiency result is where the paper earns attention

The efficiency comparison is the most commercially legible part of the paper. The authors report that TimeSqueeze base, compared with Time-MoE base, achieves comparable performance while reducing memory usage by $3.4\times$ and training time by roughly $20\times$ in the smaller-budget training setting. In a like-for-like larger setting of batch size $1024$ and context length $4096$, the paper reports $2.6\times$ less memory and $2.4\times$ less compute. On inference throughput, TimeSqueeze reaches up to $10.5\times$ faster inference for longer prediction horizons.

This is the type of result that should interest business teams, though not in the breathless way vendor decks will inevitably present it. The immediate lesson is not “forecasting becomes free.” It is that tokenization can be treated as an infrastructure lever.

For an organization running thousands of forecasts per hour, model efficiency changes several operational constraints:

Operational constraint	How dynamic tokenization could help	Practical boundary
GPU memory during pretraining	Fewer Transformer tokens can reduce memory pressure	Still requires large-scale training infrastructure
Batch size and context length	Compression may allow longer histories under the same budget	More context is not always better; the paper finds performance can plateau or degrade beyond a range
Edge or on-device inference	Lower token load can improve throughput and latency	Device deployment also depends on model size, runtime stack, and quantization
Frequent retraining	Better data efficiency can reduce experimentation cost	Dataset quality and domain coverage still dominate many failures
Multi-horizon forecasting	Compressed context may make long-horizon inference more tractable	The paper evaluates point forecasts, not full operational decision systems

The phrase “data efficiency” also needs interpretation. The paper claims up to $8\times$ higher pretraining data efficiency, meaning TimeSqueeze can reach comparable behavior with less training data exposure than the point-token baseline. For businesses, that is not just a training-budget statistic. It suggests that representation design may reduce the amount of historical data and compute required to build useful long-context forecasting systems.

But “may” is doing real work there. Benchmarks are not warehouses, power grids, or trading desks. They are controlled tests. Production systems introduce missing data, calendar artifacts, changing measurement practices, revised ground truth, sensor drift, and business actions that alter the very series being forecast. The paper supports a model-design thesis. It does not replace deployment discipline. Very rude of reality, but consistent.

The ablations show that dynamic patching is not the only ingredient

A common bad summary of TimeSqueeze would be: “dynamic patching beats fixed patching.” True, but incomplete.

The ablation results show that the SSM encoder-decoder is also doing serious work. In the average zero-shot ablation for prediction horizon $96$, TimeSqueeze base reports MSE / MAE of $0.259 / 0.310$. Replacing dynamic patching with fixed patching at patch size $4$ worsens the average to $0.340 / 0.368$. Replacing the SSM mechanism with linear patching is worse still on MSE, at $0.353 / 0.363$.

That tells us two things.

First, dynamic boundaries matter. Fixed patching at the same average scale can discard informative samples because it does not know where the signal is changing. In the ablation, the fixed-patching variants are much worse than the dynamic version on average. This directly supports the paper’s correction of a common misconception: patching is not only about compression rate; it is also about boundary placement.

Second, the encoder matters. If the model only had a clever boundary detector but weak feature extraction, it would still struggle. The Mamba / SSM encoder creates point-wise representations before compression, and the decoder helps reconstruct richer full-resolution features after the Transformer stage. The tokenizer is therefore a small architecture, not just a preprocessing rule.

The residual fine-grained features and original position IDs also matter, but less dramatically in the reported average ablation. Removing fine-grained features worsens average MSE from $0.259$ to $0.268$; removing original position IDs worsens it to $0.274$. Those are not trivial changes, but they are smaller than replacing dynamic patching or the SSM encoder. The paper’s own reading is sensible: the strongest contribution comes from combining SSM inductive bias with dynamic, context-aware pruning.

Longer context helps, until it starts bringing gossip

The paper’s context-length tests are useful because they prevent a naive conclusion: “If long context is good, more context is always better.”

The authors test inference context lengths from $96$ to $1536$ tokens and report that forecasting improves up to roughly the $512$–$800$ range, then plateaus or slightly degrades. This is important. Time-series history is not a sacred archive. Old observations can be useful, redundant, or actively distracting depending on regime stability and seasonal structure.

TimeSqueeze also tests longer pretraining contexts under a fixed token budget of approximately $50$B tokens while keeping inference context at $512$. The paper reports that longer pretraining contexts improve inference performance even when deployment uses shorter contexts. That is a subtle but practical result: a model can learn broader temporal representations during pretraining without requiring every downstream inference call to carry a huge context window.

For business use, this suggests a two-stage strategy:

1. expose the model to long histories during pretraining or domain adaptation;
2. deploy with a shorter, cheaper inference context if performance remains strong.

That would be attractive for edge monitoring, retail demand systems, and financial dashboards where inference volume matters. It also comes with a warning: context length is a hyperparameter, not a religious commitment. More history can become noise with better branding.
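Treating context length as a hyperparameter can be as simple as a sweep on held-out data. The harness below is a hedged sketch: `sweep_context_lengths` and `mean_forecaster` are hypothetical names, the forecaster is a deliberately naive stand-in for a real model, and the synthetic series is a toy, not a benchmark from the paper.

```python
import numpy as np


def mean_forecaster(context, horizon):
    """Stand-in model: predict the context mean for every future step."""
    return np.full(horizon, context.mean())


def sweep_context_lengths(series, forecast_fn, context_lengths, horizon):
    """Return {context_length: MSE} on the final `horizon` points."""
    series = np.asarray(series, dtype=float)
    target = series[-horizon:]
    history = series[:-horizon]
    results = {}
    for length in context_lengths:
        if length > len(history):
            continue  # not enough history for this setting
        prediction = forecast_fn(history[-length:], horizon)
        results[length] = float(np.mean((prediction - target) ** 2))
    return results


# Toy series: a 24-step seasonal cycle plus a slow upward trend.
t = np.arange(2000)
series = np.sin(2 * np.pi * t / 24) + 0.001 * t

scores = sweep_context_lengths(
    series, mean_forecaster, [96, 192, 384, 768, 1536], horizon=96
)
best = min(scores, key=scores.get)
for length, mse in scores.items():
    print(f"context={length:5d}  mse={mse:.4f}")
print("best context length:", best)
```

With this toy trend and this naive forecaster, older history drags the prediction away from the current level, so the shortest context wins. A real model on real data may behave very differently, which is exactly why the sweep is worth running rather than assuming.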

The generalization tests are promising, but they are not a second thesis

TimeSqueeze’s main experiments use a Time-MoE-style decoder-only MoE Transformer backbone. The authors therefore add tests to see whether the dynamic tokenizer helps beyond that specific architecture.

One test replaces EntroPE’s entropy-based patching with TimeSqueeze patching in a generic encoder-only non-causal Transformer setup for multivariate forecasting. TimeSqueeze is competitive and beats several state-of-the-art in-distribution forecasting models, including EntroPE on several metrics, though the paper notes that hyperparameters were originally tuned for EntroPE and reused for TimeSqueeze. That caveat matters: the test is supportive, not definitive.

A second extension uses a generic $10$M-parameter decoder-only Transformer and compares dynamic patching with fixed patching. Dynamic patching again wins across the reported benchmark datasets. For example, on ETTh2 it reports $0.280 / 0.347$ MSE / MAE versus fixed patching’s $0.406 / 0.421$; on ETTm2, $0.201 / 0.292$ versus $0.362 / 0.406$.

A third extension uses GiftEvalPretrain with the same $10$M TimeSqueeze variant. Dynamic patching reports an average $0.327 / 0.368$, compared with $0.354 / 0.383$ for fixed-size patching and $0.456 / 0.442$ for point embedding.

These tests are best read as robustness and portability evidence. They support the idea that within-sequence dynamic patching is not merely exploiting a Time-MoE quirk. They do not prove that every future backbone should adopt this exact boundary rule, this threshold, or this encoder-decoder configuration. The distinction matters because research papers often contain “generalization” sections that readers inflate into universal laws before lunch.

What Cognaptus would take from this for applied forecasting systems

The business interpretation is not that every company should immediately rebuild its forecasting stack around TimeSqueeze. The useful takeaway is more architectural: forecasting systems should treat token allocation as part of model design, not as a boring preprocessing detail.

In practice, that means asking different questions during system design:

Design question	Old framing	Better framing suggested by TimeSqueeze
How much history should we feed the model?	Choose a fixed context length	Choose a context strategy and compression mechanism together
How should we patch the series?	Pick a patch size by validation	Let local signal complexity influence patch boundaries
Where should compute go?	Uniformly across all timestamps	Toward transitions, volatility, and information-dense regions
What is the model-efficiency lever?	Smaller model or shorter context	Smarter tokenization before expensive contextual modeling
What should we monitor in production?	Forecast error only	Forecast error plus compression behavior, patch distribution, and regime shifts

That last row is easy to overlook. If a dynamic tokenizer enters production, its behavior becomes part of the forecasting system. Engineers should monitor patch distributions, average compression rates, and whether boundary density changes when business regimes shift. A sudden increase in short patches may itself be a diagnostic signal: the world became noisier, the sensor changed, demand behavior shifted, or the preprocessing pipeline broke. The model may be forecasting, but the tokenizer is also watching.
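A sketch of what that monitoring could compute is shown below. It assumes the tokenizer exposes a boolean boundary mask per series with the first timestamp marked as a boundary; the function names, the dictionary keys, and the window size are all hypothetical choices for illustration.

```python
import numpy as np


def patch_stats(is_boundary):
    """Summarize tokenizer behavior for one series."""
    is_boundary = np.asarray(is_boundary, dtype=bool)
    boundary_idx = np.flatnonzero(is_boundary)
    if len(boundary_idx) == 0:
        raise ValueError("expected at least one patch boundary")
    # Patch lengths: gaps between boundaries, with the last patch
    # running to the end of the series.
    lengths = np.diff(np.append(boundary_idx, len(is_boundary)))
    return {
        "num_tokens": int(len(boundary_idx)),
        "compression_rate": len(is_boundary) / len(boundary_idx),
        "mean_patch_len": float(lengths.mean()),
        "share_len1": float(np.mean(lengths == 1)),
    }


def boundary_density(is_boundary, window):
    """Fraction of timestamps flagged as boundaries per non-overlapping window."""
    flags = np.asarray(is_boundary, dtype=float)
    usable = len(flags) - len(flags) % window  # drop the ragged tail
    return flags[:usable].reshape(-1, window).mean(axis=1)


# Toy mask: a calm first half (patches of 8), then every sample is a boundary.
mask = np.zeros(32, dtype=bool)
mask[[0, 8]] = True
mask[16:] = True

print(patch_stats(mask))
print(boundary_density(mask, window=16))
```

A density series like the one printed last can be tracked over time; a sustained jump relative to a historical baseline is the kind of shift worth alerting on, alongside forecast error.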

This creates a useful opportunity for business analytics. Instead of treating model internals as inaccessible machinery, dynamic patch patterns could become an operational lens. Where does the model choose to spend resolution? Which products, locations, machines, or markets generate dense patching? Do those areas correlate with forecast errors, incident reports, promotions, or supply disruptions?

That is an inference from the paper, not a result directly shown by it. The paper evaluates forecasting performance and efficiency. Cognaptus’s extension is that dynamic tokenization may also produce monitoring metadata worth analyzing.

Where the paper’s boundaries matter

The limitations are not fatal, but they affect how the result should be used.

First, the patching rule still has a threshold. The paper’s relative-deviation method is simple and scale-aware, but it requires tuning to achieve a target compression rate. The authors themselves point to learned patch boundaries and variable-rate compression as future directions. For production teams, this means the tokenizer introduces a new control surface. Set the threshold too aggressively, and you may erase useful dynamics. Set it too conservatively, and the efficiency gains fade.
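One plausible way to handle that control surface is to search for the threshold that hits a target compression rate on representative data. The sketch below uses plain bisection and rests on several assumptions: the compression rate is treated as roughly non-decreasing in the threshold, the boundary rule is an assumed RMS-scaled form inlined so the block runs on its own, and the names `compression_rate` and `tune_threshold` plus the random-walk series are illustrative only.

```python
import numpy as np


def compression_rate(x, tau, lookback=16, max_patch=8):
    """Timestamps per token under an assumed relative-deviation rule."""
    n_tokens, last = 1, 0  # the first sample always opens a patch
    for t in range(1, len(x)):
        rms = np.sqrt(np.mean(x[max(0, t - lookback):t] ** 2)) + 1e-8
        if abs(x[t] - x[t - 1]) > tau * rms or t - last >= max_patch:
            n_tokens += 1
            last = t
    return len(x) / n_tokens


def tune_threshold(rate_fn, target, lo=0.0, hi=2.0, iters=30):
    """Bisection for a threshold whose compression rate is near `target`.

    Assumes rate_fn is (roughly) non-decreasing in the threshold.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if rate_fn(mid) < target:
            lo = mid  # too many tokens: raise the threshold
        else:
            hi = mid  # at or above target: lower the threshold
    return 0.5 * (lo + hi)


# Synthetic random walk around a positive level (illustrative data only).
rng = np.random.default_rng(0)
series = 5.0 + np.cumsum(rng.normal(scale=0.5, size=1000))

tau = tune_threshold(lambda th: compression_rate(series, th), target=4.0)
print(f"tuned tau={tau:.3f}, achieved rate={compression_rate(series, tau):.2f}")
```

Because the patch-size cap bounds the achievable rate (here at 8 timestamps per token), targets above the cap are unreachable no matter how the threshold is set, and the achieved rate on new data can drift away from the tuned value.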

Second, the evaluation is benchmark-centered. The paper uses established long-horizon forecasting datasets, Weather, Time-300B pretraining, GiftEvalPretrain extensions, and controlled ablations. That is appropriate for research evidence. It is not the same as validating performance under enterprise data conditions: irregular reporting, missingness, calendar shocks, exogenous interventions, and changing business rules.

Third, the model focuses on point forecasting in the reported experiments. The authors note that probabilistic forecasting could be supported by replacing the linear projection head with a probabilistic head, but that is not the same as demonstrating calibrated uncertainty. Many business decisions need intervals, risk bands, service-level probabilities, or scenario distributions. A sharper point forecast is useful. A well-calibrated forecast distribution is often what the decision process actually needs.

Fourth, the data mixture matters. Time-300B is large, but its original distribution is heavily skewed toward the Nature domain, with the paper reporting Nature at $90.50\%$ of observations before downsampling. The authors downsample the largest Nature datasets to reduce bias, bringing the total from $309$B to $120$B samples. This is a reminder that foundation models inherit their pretraining diet. A model trained mostly on weather-like patterns may still need careful adaptation before being trusted in inventory, payments, or credit-risk workflows.

Finally, the larger TimeSqueeze model is not uniformly better in the reported zero-shot results. The authors suggest limited training budget as a likely reason. That should make readers cautious about assuming scale automatically fixes everything. Sometimes the larger model is just a more expensive way to learn that your training recipe was not ready.

The practical lesson: compress time by respecting its shape

TimeSqueeze is valuable because it sharpens a design principle: efficient forecasting is not only about reducing sequence length; it is about reducing sequence length without flattening the events that carry information.

Fixed patching asks the model to accept a uniform compression schedule. Point-wise tokenization asks the infrastructure team to pay for every timestamp. TimeSqueeze proposes a more selective compromise: read the sequence at full resolution, compress dynamically based on local change, and let the Transformer focus on a smaller but more meaningful set of tokens.

For businesses, the likely near-term impact is not a standalone product. It is a blueprint for cheaper long-context forecasting architectures. Energy demand, logistics, climate-sensitive operations, industrial monitoring, finance, and large-scale retail forecasting all share the same unpleasant arithmetic: useful history can be long, but compute budgets are not infinitely patient.

The best way to read this paper is therefore mechanism-first. The benchmark numbers matter, but they are evidence for a larger point: time-series foundation models should not treat time as a uniform grid of equal informational value. Some moments are quiet. Some moments are transitions. Some moments deserve a token.

The rest can sit down.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sravan Kumar Ankireddy, Nikita Seleznev, Nam H. Nguyen, Yulun Wu, Senthil Kumar, Furong Huang, and C. Bayan Bruss, “TimeSqueeze: Dynamic Patching for Efficient Time Series Forecasting,” arXiv:2603.11352v1, 2026. https://arxiv.org/abs/2603.11352
