Forecasting With a Spine: How Semantic Anchors Might Fix Time-Series LLMs
Forecasting looks simple until the spreadsheet starts moving.
A retailer wants next month’s demand. A grid operator wants tomorrow’s load. A finance team wants exchange-rate exposure. In each case, the raw material is not language. It is a jagged sequence of numbers: trend, seasonality, shocks, noise, reporting quirks, holiday distortions, and the occasional data pipeline accident wearing a fake moustache.
This is why “just use an LLM” remains a fragile answer for time-series forecasting. Large language models are good at abstraction, but raw numerical sequences do not arrive with the kind of semantic scaffolding that language models were trained to exploit. A sentence tells you its own grammar. A sales curve does not politely introduce its trend component before handing you the residual.
The STELLA paper attacks this mismatch directly.[1] Its central claim is not that an LLM can magically forecast time series because numbers can be chopped into tokens. The sharper claim is that LLM-based forecasting improves when the model is given a structured semantic interpretation of the series: what kind of dataset this is, what behavior the trend shows, what seasonality appears, and what the residual is doing after the obvious structure has been stripped away.
That is why the interesting part of the paper is not merely the benchmark table. The interesting part is the spine it builds before the LLM ever touches the forecast.
The real problem is not tokenization; it is information poverty
Most LLM-for-time-series methods have to solve a conversion problem. The model expects tokens; the data gives numbers. A common answer is patching: split the series into chunks, embed those chunks, and feed them into a language-model-like backbone. That is useful, but it is also crude. A patch of raw values may contain several overlapping signals at once: a local trend, a seasonal rhythm, an irregular jump, and measurement noise. The model receives a compressed mixture and is asked to infer the recipe.
Prompting methods try to help, but STELLA argues that many prior prompts are still too static. If a method retrieves a semantic anchor from a pre-defined pool, it may attach a useful label, but the label is still correlational. It says, roughly, “this series looks similar to something associated with this textual concept.” It does not necessarily explain the behavior of this particular instance.
That difference matters. A business forecast is rarely wrong because nobody used a big enough model. It is often wrong because the model was not told which structure should matter.
STELLA reframes the forecasting pipeline around two missing information types:
| Missing information | What it means in forecasting | STELLA’s answer |
|---|---|---|
| Supplementary information | Global context independent of the current window, such as dataset domain or sampling frequency | Corpus-level Semantic Prior (CSP) |
| Complementary information | A different interpretation of the same observed window, such as slope, periodicity, and component behavior | Fine-grained Behavioral Prompt (FBP) |
This is a useful conceptual split. The CSP tells the model what world it is in. The FBP tells the model what this particular time window is doing. One is context; the other is behavior. The LLM gets both, because apparently even artificial intelligence benefits from being briefed before making decisions. Shocking.
STELLA first decomposes the series before asking the LLM to reason
The mechanism begins with an old forecasting instinct: separate the series into trend, seasonality, and residual components.
In simplified form, STELLA works with the idea:

$$\tilde X = T + S + R$$

where $\tilde X$ is the normalized input series, $T$ is the trend component, $S$ is the seasonal component, and $R$ is the residual component.
The paper implements this through a Neural STL module rather than a hand-crafted moving-average filter. The design is important because the decomposition happens before attention. In other words, STELLA does not ask the Transformer layers to discover trend, seasonality, and residual structure while also learning cross-variable relationships and forecasting dynamics. It gives the model cleaner subproblems.
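For intuition about what "trend plus seasonality plus residual" means numerically, here is a minimal sketch of the classical hand-crafted version that the paper moves away from. It is not the Neural STL module: the moving-average trend, the phase-mean seasonal estimate, the function name `decompose_additive`, and the toy series are all assumptions made for illustration.

```python
import numpy as np

def decompose_additive(x, period):
    """Split x into trend + seasonal + residual using a classical additive scheme."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Trend: moving average over one full period, edge-padded to keep length n.
    pad_left = period // 2
    pad_right = period - 1 - pad_left
    padded = np.pad(x, (pad_left, pad_right), mode="edge")
    trend = np.convolve(padded, np.ones(period) / period, mode="valid")
    # Seasonal: average detrended value at each phase of the cycle, centered at zero.
    detrended = x - trend
    phase_means = np.array([detrended[p::period].mean() for p in range(period)])
    phase_means -= phase_means.mean()
    seasonal = phase_means[np.arange(n) % period]
    # Residual: whatever the first two components do not explain.
    residual = x - trend - seasonal
    return trend, seasonal, residual

# Toy series (an assumption, not paper data): slow drift plus a 24-step cycle.
t = np.arange(96)
x = 0.05 * t + np.sin(2 * np.pi * t / 24)
trend, seasonal, residual = decompose_additive(x, period=24)
```

The three outputs add back to the input exactly, which is the property the equation above expresses. A learned decomposition would replace the fixed filter with trainable parameters, but the contract between the pieces stays the same.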
This is the first reason the mechanism-first reading is better than a benchmark-first summary. The paper is not just adding a prompt to an existing time-series LLM. It changes the information geometry of the problem. Raw time series are not treated as one undifferentiated stream; they are factorized into behavioral layers.
After decomposition, each component goes through a Temporal Convolutional Patch Encoder, or TC-Patch. Patching still appears, but now it is applied after decomposition. That sequencing matters. A patch of a trend component is not the same object as a patch of raw signal. It is less burdened by seasonal and residual interference. Likewise, a seasonal patch can be processed as seasonal behavior rather than as one more anonymous strip of numbers.
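The slicing step behind patching can be sketched in a few lines. This shows only how one component is cut into overlapping windows; the convolutional encoder that TC-Patch applies afterwards is not reproduced, and the `patch_len` and `stride` values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def make_patches(component, patch_len, stride):
    """Cut a 1-D component into overlapping windows of length patch_len."""
    component = np.asarray(component, dtype=float)
    # Assumes len(component) >= patch_len.
    n_patches = (len(component) - patch_len) // stride + 1
    starts = np.arange(n_patches) * stride
    return np.stack([component[s:s + patch_len] for s in starts])

# Stand-in for a trend component: a simple ramp of 96 steps.
trend_component = np.linspace(0.0, 1.0, 96)
patches = make_patches(trend_component, patch_len=16, stride=8)
```

Running the same function separately on the trend, seasonal, and residual components is what makes the representation component-factored: each patch carries one kind of behavior instead of a mixture.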
The result is a component-factored numerical representation: trend embeddings, seasonal embeddings, and residual embeddings. These become the numerical side of the model’s input.
The semantic anchors turn behavior into guidance
The semantic side comes from the Semantic Anchor Module. This is where STELLA’s article-worthy idea lives.
For each decomposed component, the model extracts behavioral features across the input window. The paper gives examples such as slope and dominant autocorrelation lags. These numerical properties are translated into textual cues: not raw values dumped into a prompt, but compact descriptions of behavior. A continuous slope can become a category such as stable or increasing; periodicity can become a textual description of repeated structure.
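As a rough illustration of that translation step, the sketch below turns a slope and a dominant autocorrelation lag into a short textual cue. The threshold `tol`, the category names, the peak-picking rule, and the output wording are assumptions; the paper's actual feature set and templates may differ.

```python
import numpy as np

def slope_label(component, tol=0.01):
    """Map a fitted linear slope to a coarse category (tol is a hypothetical threshold)."""
    t = np.arange(len(component))
    slope = np.polyfit(t, component, 1)[0]
    if slope > tol:
        return "increasing"
    if slope < -tol:
        return "decreasing"
    return "stable"

def dominant_lag(component, max_lag):
    """Return the lag of the highest autocorrelation peak below max_lag, or None."""
    x = np.asarray(component, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0:
        return None
    n = len(x)
    acf = np.array([np.dot(x[:n - k], x[k:]) / denom for k in range(max_lag + 1)])
    # A peak is a lag whose autocorrelation exceeds both neighbors.
    peaks = [k for k in range(1, max_lag) if acf[k] > acf[k - 1] and acf[k] >= acf[k + 1]]
    return max(peaks, key=lambda k: acf[k]) if peaks else None

def describe_component(name, component, max_lag):
    """Render compact behavioral text instead of dumping raw values."""
    lag = dominant_lag(component, max_lag)
    periodicity = f"repeats about every {lag} steps" if lag else "no clear periodicity"
    return f"{name}: {slope_label(component)}; {periodicity}"

t = np.arange(96)
seasonal_component = np.sin(2 * np.pi * t / 24)
cue = describe_component("seasonal", seasonal_component, max_lag=48)
trend_cue = slope_label(0.05 * t)
```

The point of the design is the narrow vocabulary: a handful of controlled categories is easier to embed, compare, and audit than free-form prose about the window.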
Then the text is passed through the frozen tokenizer and embedding layers of the LLM. A cross-attention module distills it into learnable prompt vectors. This produces the Fine-grained Behavioral Prompt.
The CSP is different. It comes from global dataset metadata: domain, sampling frequency, and similar corpus-level context. It is not instance-specific. It acts as a stable prior.
So the final input sequence is not simply:
numerical patches → LLM → forecast
It is closer to:
global semantic prior → component-specific behavioral prompts → decomposed numerical patches → LLM → component forecasts → gated recomposition
A compact diagram helps:
Raw time series
↓
RevIN normalization
↓
Neural STL decomposition
↓
Trend / Seasonality / Residual
↓
TC-Patch numerical embeddings + Semantic Anchor Module
↓
CSP + FBP-guided LLM backbone
↓
Component-wise forecasts
↓
Gated fusion
↓
Final forecast
This is the “spine” in the title. The LLM is not being worshipped as a universal oracle. It is being constrained, briefed, and guided. Very unfashionable. Also useful.
The LLM is a guided processor, not a standalone forecaster
The paper uses LLaMA2-7B as the base LLM, but only the first six Transformer layers are activated. Fine-tuning is parameter-efficient, using LoRA on selected attention projections and related modules, while much of the model remains frozen. The paper also unfreezes some normalization components and applies LoRA to semantic-anchor and output modules.
That implementation detail changes the business interpretation. STELLA is not a story about sending time-series values into a chatbot. It is a story about building a forecasting architecture around a pre-trained language backbone.
The LLM receives a sequence where semantic anchors are placed before numerical component tokens. Because the backbone is decoder-only and causal, earlier prompt-like tokens can condition how later numerical tokens are processed. The model is therefore not simply reading values; it is reading values under semantic instructions.
After the LLM processes the sequence, STELLA discards the prompt tokens and keeps the output representations corresponding to time-series patches. It decodes forecasts for each component, then uses a gated fusion mechanism to recombine them.
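The ordering logic can be made concrete with placeholder arrays. The sketch below assembles one sequence for a single component, builds a standard causal mask, and slices off the prompt positions afterwards. The token counts, the embedding width, and the choice to show one component at a time are assumptions, and the "backbone output" is just a copy of the input so the shapes can be followed.

```python
import numpy as np

d_model = 8                      # hypothetical embedding width
csp = np.zeros((2, d_model))     # corpus-level prior vectors (placeholder values)
fbp = np.zeros((3, d_model))     # behavioral prompt vectors (placeholder values)
patches = np.ones((5, d_model))  # numerical patch embeddings for one component

# Semantic anchors first, numerical tokens last.
tokens = np.concatenate([csp, fbp, patches], axis=0)
n_prompt = len(csp) + len(fbp)
n_total = len(tokens)

# Causal mask: row i (query) may attend to column j (key) only if j <= i.
causal_mask = np.tril(np.ones((n_total, n_total), dtype=bool))
patch_sees_prompt = causal_mask[n_prompt:, :n_prompt].all()
prompt_sees_patch = causal_mask[:n_prompt, n_prompt:].any()

# Stand-in for the backbone output: same shape as the input sequence.
hidden = tokens.copy()
# Keep only the positions that correspond to time-series patches.
patch_hidden = hidden[n_prompt:]
```

Every patch position can see every prompt position, and no prompt position can see a patch. That asymmetry is why placing the anchors first matters in a decoder-only backbone.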
That last step also matters. A simple reconstruction would add the component forecasts back together. STELLA instead learns channel-aware, dynamic weights. The model can adapt how much it trusts trend, seasonality, or residual forecasts for each variable. Operationally, that is closer to how real forecasting teams think: not every product, asset, or sensor is driven by the same mixture of stable trend and irregular shock.
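A minimal sketch of per-channel gating follows, under loud assumptions: the article does not spell out the gate's functional form, so the softmax over three components, the scaling by 3 (chosen so that equal gates recover plain addition), and the function name `gated_fusion` are all illustrative. In a trained model the logits would come from a learned layer conditioned on the input rather than being passed in by hand.

```python
import numpy as np

def gated_fusion(trend_fc, seasonal_fc, residual_fc, gate_logits):
    """Recombine component forecasts with per-channel weights.

    Forecast arrays have shape (horizon, channels); gate_logits has shape (3, channels).
    """
    stacked = np.stack([trend_fc, seasonal_fc, residual_fc])    # (3, horizon, channels)
    logits = gate_logits - gate_logits.max(axis=0, keepdims=True)
    weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    weights = 3.0 * weights          # equal logits -> weight 1 each -> plain sum
    return (weights[:, None, :] * stacked).sum(axis=0)

horizon, channels = 4, 2
trend_fc = np.full((horizon, channels), 1.0)
seasonal_fc = np.full((horizon, channels), 0.5)
residual_fc = np.full((horizon, channels), -0.1)
equal_gates = np.zeros((3, channels))
fused = gated_fusion(trend_fc, seasonal_fc, residual_fc, equal_gates)
```

With equal gates the result is the simple additive reconstruction. Any departure from equal logits shifts trust between components, one channel at a time.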
The main evidence says the mechanism is doing useful work
The paper evaluates STELLA across long-term forecasting, short-term forecasting, few-shot learning, zero-shot learning, ablations, sensitivity tests, and embedding visualizations. These tests do not all prove the same thing. Treating them as one giant “it works” pile would be lazy, and we try not to do that before coffee.
| Evidence type | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Long-term forecasting benchmarks | Main accuracy evidence | STELLA performs strongly across common long-horizon datasets | It does not prove enterprise deployment readiness |
| M4 short-term forecasting | Main evidence on shorter horizons and varied frequencies | Gains are not limited to long-horizon ETT-style settings | It does not prove every retail or operations setting will benefit |
| Few-shot tests | Scarce-data generalization test | Semantic guidance appears helpful when training data is limited | It does not replace proper domain validation |
| Zero-shot tests | Cross-dataset transfer test | STELLA transfers well within the tested ETT setup | It is not full cross-industry transfer |
| Ablations | Mechanism attribution | N-STL, TC-Patch, FBP, and CSP each contribute | It does not isolate every possible alternative design |
| Prompt-length sensitivity | Hyperparameter robustness and failure-mode test | More semantic guidance helps only up to a point | It does not give universal prompt lengths |
| UMAP/PCA/t-SNE visualizations | Representation diagnostic | Anchors and components separate in latent space | Visualization is supportive, not causal proof |
In long-term forecasting, STELLA reports the top result in 60 evaluation settings, while the closest competing baseline reaches the best result in only 8 cases. Against PatchTST, the paper reports relative MSE/MAE reductions of 16.06% and 9.63%. Against Crossformer, the reported reductions are much larger: 55.91% and 38.44%. Against GPT4TS, the representative LLM baseline, the paper reports 20.25% lower MSE and 10.87% lower MAE.
The summary table also gives concrete averages. For example, on ETTh1, STELLA reports MSE/MAE of 0.416/0.425, compared with GPT4TS at 0.457/0.450 and PatchTST at 0.463/0.449. On Exchange, STELLA reports 0.339/0.394, while GPT4TS reports 0.371/0.409 and PatchTST 0.390/0.429. On Illness, the gap is more visible: STELLA reports 1.819/0.814 against GPT4TS at 2.623/1.060.
The short-term M4 benchmark is a different kind of check. STELLA reports the best result across all 15 evaluation categories, with average OWA of 0.843 compared with TimesNet at 0.851, N-BEATS at 0.855, TimeLLM at 0.859, and GPT4TS at 0.861. The improvement over TimesNet is about 1% on average OWA. That is not a moon landing. It is still meaningful because M4 is a widely used benchmark and the gains are consistent across categories.
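The "about 1%" figure is easy to check from the numbers quoted above, and the same arithmetic applies to any single-dataset comparison. The helper name `relative_reduction` is ours; the inputs are the values reported in the article.

```python
def relative_reduction(baseline, model):
    """Percentage by which `model` improves on `baseline` for a lower-is-better metric."""
    return 100.0 * (baseline - model) / baseline

# M4 average OWA: TimesNet 0.851 vs STELLA 0.843.
owa_gain = relative_reduction(0.851, 0.843)

# ETTh1 average MSE: GPT4TS 0.457 vs STELLA 0.416.
etth1_mse_gain = relative_reduction(0.457, 0.416)

print(f"OWA gain over TimesNet: {owa_gain:.2f}%")
print(f"ETTh1 MSE gain over GPT4TS: {etth1_mse_gain:.2f}%")
```

A single-dataset figure computed this way need not match the paper's headline percentages, which are presumably aggregated across datasets and horizons.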
The transfer tests are the business-relevant part, with a boundary
For enterprise readers, the most interesting evidence is not the ordinary full-data benchmark. It is the few-shot and zero-shot behavior.
In the few-shot setting, the model uses only the first 10% of training data on ETT datasets. STELLA reports the best result in 23 out of 40 evaluations and ranks second in 9 more. Compared with TiDE, the paper reports reductions of 7.58% in MSE and 4.85% in MAE. Compared with GPT4TS, it reports 14.40% lower MSE and 7.37% lower MAE.
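One detail worth making explicit is that "the first 10%" is a chronological slice, not a random sample. A minimal sketch, assuming the training split is already ordered in time and using a hypothetical helper name:

```python
import numpy as np

def few_shot_slice(train, fraction=0.10):
    """Keep the earliest `fraction` of a time-ordered training split."""
    n_keep = max(1, int(len(train) * fraction))
    return train[:n_keep]

train = np.arange(1000)          # stand-in for 1000 time-ordered training steps
subset = few_shot_slice(train)
```

Random subsampling would leak later regimes into the training data and make the scarce-data test easier than it looks.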
That is commercially relevant because many business forecasting problems are data-poor in practice. New product launches, new market entries, new warehouses, new pricing policies, and newly instrumented processes rarely come with years of clean history. A method that uses structured semantics to make better use of limited data suggests a practical design principle: add behavioral interpretation before adding model size.
The zero-shot results are even more striking. STELLA is trained on one ETT dataset and evaluated directly on another without target training samples. The paper reports best performance across all 40 evaluation settings, with average MSE/MAE reductions of 24.12% and 15.61% compared with all baselines. Against TimeLLM specifically, it reports 11.79% lower MSE and 4.49% lower MAE.
But here the boundary is important. This is zero-shot transfer within a family of electricity transformer temperature datasets. It is not the same as training on electricity data and deploying on luxury retail demand in Jakarta, freight capacity in Manila, or crypto liquidity at 2 a.m. on a Sunday. The paper supports transfer potential. It does not grant enterprise immunity from validation. Sadly, procurement departments will survive another day.
The ablation table says the spine is not decorative
Ablation studies are where many architecture papers quietly confess whether the proposed modules actually matter. STELLA’s ablation study removes four components one at a time: Neural STL, TC-Patch, FBP, and CSP.
The pattern is useful. Removing Neural STL causes the largest performance degradation, especially on Exchange. In the ablation table, STELLA’s Exchange average MSE/MAE is 0.339/0.394. Without Neural STL, it becomes 0.371/0.416. Without TC-Patch, it becomes 0.361/0.405. Without FBP, it becomes 0.343/0.398. Without CSP, it becomes 0.352/0.401.
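Converting those Exchange figures into relative MSE increases makes the ranking easier to see. The numbers are the ones quoted above; the dictionary labels are ours.

```python
full_mse = 0.339  # STELLA on Exchange, average MSE

ablation_mse = {
    "w/o Neural STL": 0.371,
    "w/o TC-Patch": 0.361,
    "w/o FBP": 0.343,
    "w/o CSP": 0.352,
}

# Percentage increase in MSE when each component is removed.
mse_increase = {
    name: 100.0 * (mse - full_mse) / full_mse for name, mse in ablation_mse.items()
}
ranked = sorted(mse_increase, key=mse_increase.get, reverse=True)

for name in ranked:
    print(f"{name}: +{mse_increase[name]:.2f}% MSE")
```

On this dataset the order is Neural STL, TC-Patch, CSP, then FBP, which fits the reading that the structural layer carries the most weight. It is one dataset, so the ordering should not be assumed to hold everywhere.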
This tells us something precise. The semantic prompts help, but they sit on top of a structural representation layer. If the model does not first separate the signal into more manageable components, the prompt machinery has less clean behavior to describe. The business version is simple: explanation layers are only useful when the underlying data representation is not a swamp.
The FBP and CSP ablations also separate two kinds of value. Removing FBP hurts because the model loses instance-specific behavior. Removing CSP hurts because the model loses global context. Neither is a magic wand. Together, they form a guided interpretation layer that appears to improve the LLM’s use of numerical embeddings.
Prompt length is a real design constraint, not a place to dump prose
The sensitivity analysis is one of the paper’s more useful practical sections. It tests the lengths of the Fine-grained Behavioral Prompt and Corpus-level Semantic Prior.
The result is non-monotonic. More semantic guidance helps only up to a point. The paper identifies an effective FBP range of 12–24 and a CSP range of 10–20. Beyond those ranges, performance declines, likely because longer prompts introduce redundancy, attentional noise, or overfitting to spurious textual details.
This is a welcome antidote to the enterprise habit of treating prompts like a landfill for every business rule anyone remembered from a meeting.
For production systems, the implication is clear: semantic conditioning should be engineered, measured, and version-controlled. A prompt layer for forecasting should not be a poetic essay about the company’s operating philosophy. It should be a compact representation of domain context and observed behavior. Forecasting models need signal. They do not need your annual report in miniature.
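One way to make "engineered, measured, and version-controlled" concrete is to treat prompt lengths as checked configuration. The sketch below is hypothetical: the class name, the version tag, and the idea of flagging out-of-range values are ours, and the ranges are the ones the paper reports for its own benchmarks, so they are starting points rather than universal limits.

```python
from dataclasses import dataclass

# Ranges reported as effective on the paper's benchmarks; treat as starting points only.
FBP_RANGE = (12, 24)
CSP_RANGE = (10, 20)

@dataclass(frozen=True)
class PromptConfig:
    version: str   # hypothetical tag for change tracking
    fbp_len: int   # length of the Fine-grained Behavioral Prompt
    csp_len: int   # length of the Corpus-level Semantic Prior

    def out_of_range(self):
        """Return the names of fields outside the reference ranges."""
        issues = []
        if not FBP_RANGE[0] <= self.fbp_len <= FBP_RANGE[1]:
            issues.append("fbp_len")
        if not CSP_RANGE[0] <= self.csp_len <= CSP_RANGE[1]:
            issues.append("csp_len")
        return issues

baseline = PromptConfig(version="v1", fbp_len=16, csp_len=12)
bloated = PromptConfig(version="v2", fbp_len=64, csp_len=12)
```

A flagged value is not automatically wrong. It is a prompt to rerun the sensitivity test on your own data before shipping the change.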
The visualizations support alignment, but they are not the main proof
The paper includes UMAP visualizations showing that final component representations form separated clusters, and that the hierarchical semantic anchors also form distinct clusters. Appendix visualizations using PCA and t-SNE further support this separation.
These figures are useful as representation diagnostics. They suggest that STELLA’s semantic anchors are not collapsing into one generic prompt vector. Trend, seasonality, residual behavior, CSP, and FBP appear to occupy meaningfully different regions in representation space.
But the visualizations should not be overread. Clustering does not by itself prove that the model forecasts better for the right reason. The stronger evidence comes from the combination of benchmark performance, ablations, and sensitivity tests. The visualizations make the mechanism plausible; the tables make it harder to dismiss.
What this means for enterprise forecasting systems
The business relevance of STELLA is not “replace your forecasting team with an LLM.” That conclusion is both unsupported and, frankly, a bit too LinkedIn.
The better lesson is architectural. STELLA suggests that foundation models become more useful for quantitative tasks when they are surrounded by a domain-specific interpretation layer. In forecasting, that layer has three jobs:
- Decompose the raw signal into components that map to recognizable temporal behavior.
- Convert those behaviors into compact semantic guidance.
- Condition a large pre-trained model on that guidance while preserving numerical structure.
For enterprise automation, this points to a broader pattern. LLMs are not best used as floating brains above the data stack. They are better used as reasoning components inside structured pipelines.
A demand forecasting implementation inspired by STELLA would not simply send sales history to an LLM. It would first extract trend, seasonality, promotion effects, stockout flags, holiday shifts, and abnormal residual events. It would convert those into controlled semantic anchors. Then it would let the model forecast with context.
A finance implementation would need different anchors: volatility regime, drift, liquidity condition, macro calendar, structural breaks, and residual shock behavior. An energy implementation would need weather-sensitive seasonality, load-cycle structure, grid events, and abnormal usage patterns.
The method does not remove domain engineering. It changes where domain engineering goes. Instead of hard-coding everything into a traditional model or pretending an LLM can infer everything from raw tokens, the semantic layer becomes the bridge.
What the paper directly shows, and what we should infer carefully
STELLA directly shows that its architecture performs strongly on the paper’s chosen public benchmarks, including long-term forecasting, M4 short-term forecasting, few-shot ETT tests, and zero-shot ETT transfer. It directly shows that removing Neural STL, TC-Patch, FBP, or CSP hurts performance in the tested ablations. It directly shows that prompt length has an optimal range rather than a simple “more is better” curve.
Cognaptus can reasonably infer that semantic abstraction layers may be valuable in business forecasting systems where data is scarce, domain context matters, and interpretability is useful. The paper’s design supports a practical pattern: convert numerical behavior into structured semantic cues before invoking foundation-model reasoning.
What remains uncertain is deployment behavior outside benchmark conditions. The paper does not prove that STELLA handles messy enterprise data pipelines, delayed reporting, causal interventions, promotion calendars, changing product taxonomies, or adversarial market regimes. It also uses substantial compute: the experiments are reported on clusters with multiple NVIDIA L20 and A800 GPUs. That does not make the method impractical, but it does mean the ROI discussion must include training and inference cost.
The zero-shot evidence is promising, but the tested zero-shot setting stays within ETT datasets. Real businesses often require transfer across domains with different variable semantics, different sampling logic, and different causal drivers. That is a harder problem.
Finally, STELLA improves forecasting accuracy, but it does not solve decision quality by itself. A forecast is not a business action. Inventory policy, capacity planning, pricing, hedging, and staffing still require objective functions, constraints, risk preferences, and accountability. The model gives better foresight; management still has to avoid doing something silly with it. A regrettably persistent bottleneck.
The takeaway: LLM forecasting needs structure before scale
STELLA’s most useful contribution is not that it makes another model score better on another benchmark table. It gives a cleaner answer to a deeper question: how should LLMs participate in time-series forecasting at all?
The answer is not raw numerical tokenization. It is not static prompt retrieval. It is not a chatbot with a calculator taped to its forehead.
The answer is structured semantic guidance. First, decompose the series. Then describe the behavior. Then give the LLM a compact global prior and instance-level prompt. Then let the model process numerical embeddings under that guidance. Then recombine the outputs with adaptive weights.
That architecture is more complicated than “ask the LLM.” It is also more credible.
For businesses, this is the practical lesson: foundation models will not automatically understand operational time. They need a spine made of decomposition, semantic abstraction, and domain-aware conditioning. Once they have it, they may become useful forecasting components rather than expensive autocomplete engines staring at a spreadsheet.
Cognaptus: Automate the Present, Incubate the Future.
[1] Junjie Fan, Hongye Zhao, Linduo Wei, Jiayu Rao, Guijia Li, Jiaxin Yuan, Wenqi Xu, and Yong Qi, “STELLA: Guiding Large Language Models for Time Series Forecasting with Semantic Abstractions,” arXiv:2512.04871, submitted December 4, 2025, https://arxiv.org/abs/2512.04871.