Forecast First, Ask Later: How DCATS Makes Time Series Smarter with LLMs

TL;DR for operators

Forecasting teams usually ask the same question first: which model should we use? DCATS suggests a more operationally useful question: which related histories should this model learn from?

The paper introduces DCATS, a Data-Centric Agent for Time Series, an LLM-agent framework that improves forecasting by selecting auxiliary time series for fine-tuning rather than by designing a new forecasting architecture.¹ In the authors’ traffic forecasting study, GPT-4 Turbo reads metadata about nearby or similar California traffic sensors, proposes candidate neighbour sets, lets lightweight forecasting models test those proposals, and then refines the next round using validation error.

The main result is not that an LLM has become a mystical traffic prophet. Calm down. The LLM is not directly forecasting the next rush hour. It is acting as a data curator: choosing which other time series should be included when adapting a forecasting model to a target sensor.

Across 60 randomly selected traffic-location queries, four forecasting models, three metrics, and a 12-step-ahead forecasting task, DCATS reports an average error reduction of about 6%, with individual improvements ranging from 1.48% to 8.26%. The strongest practical reading is this: in domains with many comparable entities and useful metadata (stores, sensors, routes, machines, accounts, hotels, warehouses, payment terminals), an agentic system may improve forecasts by automating the tedious work of selecting relevant peer histories.

The boundary is just as important. The evidence is preliminary, tested on one metadata-rich traffic dataset, with short-horizon forecasts, GPT-4 Turbo as the agent, and no broad ablation showing exactly how much improvement comes from LLM reasoning versus simpler neighbour-selection heuristics, anomaly filtering, or the fine-tuning setup. Useful signal, not a coronation ceremony.

Forecasting often breaks before the model gets a vote

A forecasting model does not begin with a clean philosophical problem. It begins with a messy operational question.

A traffic sensor in Campbell needs a forecast. A store in Makati needs next week’s demand. A hotel near Clark wants to estimate room bookings around a local event. A payments team needs transaction volume for a merchant with patchy history. In each case, the model is not merely asking, “What happened here before?” It is also asking, “What other histories are relevant enough to learn from?”

That second question is usually handled with a mixture of analyst judgement, similarity rules, inherited pipeline logic, and mild spreadsheet despair. DCATS puts an LLM-agent into that gap.

The cleverness of the paper is not that it proposes another large forecasting architecture. It does almost the opposite. It starts from the observation that recent lightweight time series models can be surprisingly competitive when the data is right. So the authors shift the AutoML problem away from “search the model space” and towards “construct a better training subset.”

That is a more interesting move than it first appears. Model search is glamorous. Data selection is where the bodies are buried.

DCATS treats metadata as an operating surface

DCATS begins with a collection of univariate time series and a metadata database. In the traffic setting, the time series are traffic-volume readings from sensors. The metadata includes location coordinates, county, freeway name, number of lanes, and augmented fields such as city name, city population, and historical traffic volume.

For a target location, the agent receives information about the target and a set of possible neighbours. These neighbours are not just “nearby” in one crude sense. The paper describes three ways neighbours are selected:

| Neighbour basis | What it captures | Why it may matter |
| --- | --- | --- |
| Road network distance | Structural traffic proximity | Sensors on related road segments may share congestion dynamics. |
| Local correlation / temporal pattern similarity | Behavioural similarity | Two locations can move similarly even when they are not geographically adjacent. |
| Geodetic distance | Physical proximity | Nearby sensors may share local demand, weather, commuting, or event effects. |

This is the paper’s central mechanism. The LLM is asked to reason over these metadata-rich neighbour sets and propose sub-datasets for training. The target may benefit from neighbours on the same freeway, from sensors in similar cities, from locations with similar historical volumes, or from locations whose patterns correlate strongly.

A conventional pipeline might hard-code one of those rules. DCATS asks an agent to combine them, explain the combination, and then revise the choice after seeing validation performance.
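To make the neighbour views concrete, here is a minimal sketch of how two of them (geodetic distance and temporal correlation) might be turned into candidate lists. The sensor records, field names, coordinates, and `k` are all hypothetical toy values, road-network distance is left out because it needs a road graph, and nothing here is taken from the paper's implementation.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def candidate_neighbours(target_id, sensors, k=2):
    """Return the k physically nearest and the k most correlated sensors."""
    target = sensors[target_id]
    others = [sid for sid in sensors if sid != target_id]
    by_distance = sorted(
        others,
        key=lambda sid: haversine_km(
            target["lat"], target["lon"], sensors[sid]["lat"], sensors[sid]["lon"]
        ),
    )
    by_corr = sorted(
        others,
        key=lambda sid: pearson(target["series"], sensors[sid]["series"]),
        reverse=True,
    )
    return {"geodetic": by_distance[:k], "correlation": by_corr[:k]}

# Hypothetical toy sensors: B is next door to A, C is far away but behaves like A.
sensors = {
    "A": {"lat": 37.29, "lon": -121.95, "series": [10, 20, 30, 20, 10, 5]},
    "B": {"lat": 37.30, "lon": -121.94, "series": [5, 5, 30, 5, 5, 30]},
    "C": {"lat": 34.05, "lon": -118.24, "series": [11, 21, 29, 19, 11, 6]},
    "D": {"lat": 37.77, "lon": -122.42, "series": [30, 20, 10, 20, 30, 35]},
}
candidates = candidate_neighbours("A", sensors, k=1)
```

In this toy data the nearest sensor and the most similar sensor are different ones, which is the situation where a single hard-coded rule leaves information on the table.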

The analogy to retrieval-augmented generation is useful, but only if we do not stretch it until it snaps. In RAG, retrieved documents provide context for generation. In DCATS, retrieved neighbour time series provide training context for fine-tuning. The “retrieval” target is not text; it is relevant historical behaviour.

The loop is propose, test, refine, stop

The DCATS workflow is deliberately simple. That is a compliment.

First, the user asks for a forecasting model for a target time series. The LLM-agent reads the target metadata and the available neighbour metadata. It then generates five proposals. Each proposal contains a list of neighbour location_ids and a short explanation of why those neighbours might help.

Second, a forecasting module evaluates each proposal. It trains or fine-tunes a forecasting model using the proposed sub-dataset, measures validation performance, and reports the result back to the agent.

Third, the agent sees the ranked proposal results and generates a new round. It is instructed to analyse the successful combinations, consider characteristics of the target and its neighbours, and minimise Mean Absolute Error. The loop continues until the current round no longer improves on the best-so-far proposal.

That gives DCATS a useful operational shape:

| Stage | Agent role | Forecasting module role | Operational meaning |
| --- | --- | --- | --- |
| Initial proposal | Converts metadata into candidate neighbour sets | Waits | Automates first-pass analyst judgement. |
| Evaluation | Waits | Tests each proposal on validation data | Keeps the agent honest with measured error. |
| Refinement | Uses validation results to propose better subsets | Waits | Turns metadata reasoning into an iterative search. |
| Stopping | Stops when proposals no longer improve | Confirms no validation gain | Prevents endless agentic tinkering, otherwise known as “innovation theatre.” |

The important design choice is that the LLM does not get to declare victory by sounding plausible. Its proposals must survive validation. That is the correct place for an LLM-agent in a forecasting workflow: not above the metric, not outside the pipeline, but inside a controlled evaluation loop.
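The propose, test, refine, stop loop can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: `propose` stands in for the LLM agent, `evaluate` stands in for the forecasting module, and `toy_propose`, `toy_evaluate`, and the sensor ids are hypothetical stubs so the loop can run without an LLM or a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Proposal:
    neighbour_ids: List[str]
    explanation: str

# (proposal, validation MAE) pairs from all rounds so far
History = List[Tuple[Proposal, float]]

def run_selection_loop(
    propose: Callable[[History], List[Proposal]],
    evaluate: Callable[[Proposal], float],
    max_rounds: int = 10,
) -> Tuple[Optional[Proposal], float, int]:
    """Iterate until a round fails to beat the best validation MAE so far."""
    history: History = []
    best: Optional[Proposal] = None
    best_mae = float("inf")
    rounds = 0
    for _ in range(max_rounds):
        rounds += 1
        # The proposer sees earlier results ranked by validation MAE.
        ranked = sorted(history, key=lambda t: t[1])
        scored = [(p, evaluate(p)) for p in propose(ranked)]
        history.extend(scored)
        round_best, round_mae = min(scored, key=lambda t: t[1])
        if round_mae >= best_mae:
            break  # no improvement over best-so-far: stop
        best, best_mae = round_best, round_mae
    return best, best_mae, rounds

# --- Hypothetical stubs for demonstration only ---
POOL = ["s1", "s2", "s3", "s4", "s5"]

def toy_evaluate(p: Proposal) -> float:
    """Pretend validation MAE: s1 and s3 help, every other neighbour hurts."""
    helpful = {"s1", "s3"}
    ids = set(p.neighbour_ids)
    return 10.0 - 2.0 * len(ids & helpful) + 1.0 * len(ids - helpful)

def toy_propose(ranked: History) -> List[Proposal]:
    """Round one tries single neighbours; later rounds extend the best set."""
    if not ranked:
        return [Proposal([s], "single neighbour") for s in POOL]
    best_ids = ranked[0][0].neighbour_ids
    return [Proposal(best_ids + [s], "extend best set") for s in POOL if s not in best_ids]

best, best_mae, rounds = run_selection_loop(toy_propose, toy_evaluate)
```

The design point is that the stopping decision depends only on measured validation error. The explanation field travels with each proposal for human review, but it never influences which proposal wins.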

The model is not the hero, which is precisely the point

The paper tests DCATS with four forecasting models: Linear, MLP, SparseTSF, and UltraSTF. These are not presented as new inventions of this paper. They are the forecasting engines through which the data-selection strategy is tested.

The forecasting module also includes two details that matter for interpretation.

First, the authors pretrain a foundation model on all available time series before any user query arrives. DCATS then fine-tunes that model separately on each agent-proposed sub-dataset. So the comparison is not “tiny model with five neighbours versus no historical knowledge.” The system already has broad pretraining; DCATS affects the adaptation stage.

Second, the module removes the 10% most anomalous data from the proposed sub-dataset using discord-based anomaly detection before fine-tuning. This is a sensible engineering choice, but it complicates the clean story. DCATS is not only “LLM chooses neighbours.” It is an agent-driven data-selection loop embedded in a forecasting module that also uses pretraining and anomaly filtering.
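For readers unfamiliar with discords: a discord is a subsequence whose nearest non-overlapping match elsewhere in the series is unusually far away. The sketch below is a brute-force illustration of that idea followed by a "drop the top 10%" filter. The window length `m`, the choice to drop whole windows, and the toy series are assumptions for illustration; the article does not specify how the paper applies the 10% cut, and production systems would use a faster matrix-profile implementation rather than this quadratic loop.

```python
import math

def znorm(window):
    """Z-normalise a window so that only its shape matters."""
    mean = sum(window) / len(window)
    sd = math.sqrt(sum((v - mean) ** 2 for v in window) / len(window))
    if sd == 0:
        return [0.0] * len(window)
    return [(v - mean) / sd for v in window]

def discord_scores(series, m):
    """Distance from each length-m window to its nearest non-overlapping window."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    scores = []
    for i, wi in enumerate(windows):
        best = float("inf")
        for j, wj in enumerate(windows):
            if abs(i - j) < m:  # exclusion zone: ignore trivial self-matches
                continue
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(wi, wj)))
            best = min(best, dist)
        scores.append(best)
    return scores

def keep_least_anomalous(series, m, drop_fraction=0.10):
    """Return the start indices of windows kept after dropping the top discords."""
    scores = discord_scores(series, m)
    n_drop = int(len(scores) * drop_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    to_drop = set(ranked[:n_drop])
    kept = [i for i in range(len(scores)) if i not in to_drop]
    return kept, scores

# Toy series: a clean repeating pattern with one injected spike at index 21.
series = [0, 1, 2, 1] * 10
series[21] = 9
kept, scores = keep_least_anomalous(series, m=4)
dropped = sorted(set(range(len(scores))) - set(kept))
```

In the toy series, every dropped window overlaps the injected spike, while the clean repeating windows all have an exact match elsewhere and score zero.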

That does not weaken the paper. It sharpens what should be credited. The contribution is a framework for data-centric AutoML in time series forecasting, not a proof that verbal reasoning alone creates forecast accuracy from thin air.

Thin air remains, as usual, a poor data source.

The main evidence is the performance table

The authors evaluate DCATS on LargeST, a large-scale traffic dataset with 8,600 California sensors. Sensor readings are aggregated into 15-minute intervals, giving 96 intervals per day and 35,040 time steps (96 × 365, or one year of readings). The data is split into training, validation, and test sets with a 6:2:2 ratio. The experiment uses 60 randomly selected target locations, forecasting the next 12 intervals for each sensor at each timestamp. Since each interval is 15 minutes, this is a three-hour forecasting horizon.
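The arithmetic of that setup is easy to check in code. The sketch below assumes the 6:2:2 split is chronological (the usual choice for forecasting, though not confirmed here) and uses a hypothetical lookback length, since the input window size is not stated above.

```python
def chronological_split(n_steps, ratios=(6, 2, 2)):
    """Return (start, end) index ranges for train, validation, and test."""
    total = sum(ratios)
    train_end = n_steps * ratios[0] // total
    val_end = n_steps * (ratios[0] + ratios[1]) // total
    return (0, train_end), (train_end, val_end), (val_end, n_steps)

def make_windows(series, lookback, horizon):
    """Pair each lookback-length input with the horizon steps that follow it."""
    pairs = []
    for t in range(lookback, len(series) - horizon + 1):
        pairs.append((series[t - lookback:t], series[t:t + horizon]))
    return pairs

MINUTES_PER_STEP = 15
STEPS_PER_DAY = 24 * 60 // MINUTES_PER_STEP   # 96
N_STEPS = STEPS_PER_DAY * 365                 # 35,040
HORIZON = 12

train, val, test = chronological_split(N_STEPS)
horizon_hours = HORIZON * MINUTES_PER_STEP / 60

# Tiny demonstration with a hypothetical lookback of 4 and horizon of 2.
pairs = make_windows(list(range(10)), lookback=4, horizon=2)
```

Under these assumptions the validation and test segments each cover 7,008 steps, or 73 days of 15-minute readings.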

The reported metrics are MAE, RMSE, and MAPE. The performance table is the paper’s main evidence.

| Method | MAE | RMSE | MAPE |
| --- | --- | --- | --- |
| Linear | 37.31 | 74.01 | 15.12% |
| Linear + DCATS | 35.91 | 72.91 | 14.20% |
| Improvement | 3.77% | 1.48% | 6.14% |
| MLP | 34.07 | 67.68 | 13.33% |
| MLP + DCATS | 31.26 | 63.34 | 12.34% |
| Improvement | 8.26% | 6.41% | 7.42% |
| SparseTSF | 37.92 | 74.19 | 16.83% |
| SparseTSF + DCATS | 34.88 | 69.20 | 15.46% |
| Improvement | 8.02% | 6.73% | 8.14% |
| UltraSTF | 29.77 | 60.92 | 10.26% |
| UltraSTF + DCATS | 28.61 | 57.78 | 9.79% |
| Improvement | 3.91% | 5.16% | 4.55% |
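For reference, the three metrics can be written out as plain functions. These are the standard definitions; the handling of zero actuals in MAPE (skipping them) is an assumption made for this sketch, and the example values are made up.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalises large misses more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent, skipping zero actuals."""
    terms = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred) if t != 0]
    if not terms:
        raise ValueError("MAPE is undefined when every actual value is zero")
    return 100.0 * sum(terms) / len(terms)

# Made-up example values.
actual = [100, 200, 400]
forecast = [110, 180, 400]
example_mae = mae(actual, forecast)
example_rmse = rmse(actual, forecast)
example_mape = mape(actual, forecast)
```

Because RMSE weights large errors more heavily and MAPE weights errors on low-volume readings more heavily, the three columns can move by different amounts for the same change in forecasts.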

Two interpretations matter.

The optimistic interpretation is straightforward: DCATS improves all four tested forecasting models across all three metrics. That consistency is more interesting than any single number. If the gain appeared on only one model or one metric, it would look fragile. Here, the reported effect is directionally stable.

The restrained interpretation is equally necessary: the average improvement is about 6%, not a revolution in forecasting. It is useful because forecasting errors compound into operational decisions, but it is not a licence to replace the forecasting team with a chatbot and a motivational poster.

UltraSTF remains the strongest tested model overall, both before and after DCATS. Its MAE drops from 29.77 to 28.61, and its MAPE drops from 10.26% to 9.79%. MLP and SparseTSF show larger percentage improvements, but from weaker starting points. So DCATS appears to help both stronger and weaker models, though the operational value depends on the baseline already in production.
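The "about 6%" figure can be sanity-checked from the table itself. The sketch below recomputes each relative improvement from the rounded table values and averages the twelve reported percentages. Small gaps between recomputed and reported values are expected if the reported percentages were derived from unrounded errors, which is an assumption here.

```python
# (baseline, with DCATS) pairs copied from the performance table
errors = {
    "Linear":    {"MAE": (37.31, 35.91), "RMSE": (74.01, 72.91), "MAPE": (15.12, 14.20)},
    "MLP":       {"MAE": (34.07, 31.26), "RMSE": (67.68, 63.34), "MAPE": (13.33, 12.34)},
    "SparseTSF": {"MAE": (37.92, 34.88), "RMSE": (74.19, 69.20), "MAPE": (16.83, 15.46)},
    "UltraSTF":  {"MAE": (29.77, 28.61), "RMSE": (60.92, 57.78), "MAPE": (10.26, 9.79)},
}
# Improvement rows copied from the same table, in percent
reported = {
    "Linear":    {"MAE": 3.77, "RMSE": 1.48, "MAPE": 6.14},
    "MLP":       {"MAE": 8.26, "RMSE": 6.41, "MAPE": 7.42},
    "SparseTSF": {"MAE": 8.02, "RMSE": 6.73, "MAPE": 8.14},
    "UltraSTF":  {"MAE": 3.91, "RMSE": 5.16, "MAPE": 4.55},
}

def relative_improvement(baseline, improved):
    """Percentage reduction in error relative to the baseline."""
    return 100.0 * (baseline - improved) / baseline

recomputed = {
    model: {metric: relative_improvement(*pair) for metric, pair in metrics.items()}
    for model, metrics in errors.items()
}
reported_values = [v for metrics in reported.values() for v in metrics.values()]
mean_reported = sum(reported_values) / len(reported_values)
max_gap = max(
    abs(recomputed[model][metric] - reported[model][metric])
    for model in reported
    for metric in reported[model]
)
```

The mean of the twelve reported improvements comes to roughly 5.8%, and every recomputed value lands within a tenth of a percentage point of the reported one.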

The experiment components serve different purposes

Not every result in the paper should be interpreted the same way. Some parts are main evidence. Some are implementation details. Some are interpretability illustrations. Mixing them together produces the usual AI-paper fog machine.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Four-model performance table | Main evidence | DCATS improves tested models on the LargeST setup. | That the same gain will hold across all forecasting domains. |
| 60 randomly selected target locations | Main evidence design | Results are not based on a single hand-picked sensor. | That performance is stable across every traffic regime or geography. |
| Three neighbour types | Method design | The agent has multiple metadata views to reason over. | That all three neighbour types are equally necessary. |
| Pretraining on all time series | Implementation detail | Speeds convergence and gives models broad context. | That DCATS alone creates the full forecasting capability. |
| Removing 10% anomalous data | Implementation detail | Adds data cleaning before fine-tuning. | That neighbour selection is the only driver of improvement. |
| Word cloud of best-proposal explanations | Exploratory interpretation | Best proposals mention road, pattern, and geodetic concepts. | That the LLM has causal understanding of traffic dynamics. |
| Three individual proposal examples | Interpretability illustration | Different queries benefit from different neighbour-selection rationales. | That explanation quality guarantees forecast quality. |

The word cloud deserves special caution. It shows that terms related to road networks, traffic patterns, and geodetic information appear prominently in the explanations associated with best proposals. That is useful because it suggests the agent is not blindly selecting neighbours from one criterion.

But a word cloud is not a reasoning audit. It tells us what language appeared in successful explanations. It does not prove the agent understood the causal structure of California traffic. It is an interpretability clue, not a philosophical breakthrough. The paper uses it appropriately as supporting evidence for the agent’s proposal behaviour, not as a second thesis.

The misconception: this is not a better forecasting architecture story

The natural lazy summary is: “LLMs improve time series forecasting.” That is true only in the least useful sense.

A better summary is: “An LLM-agent can improve time series forecasting by selecting more useful auxiliary training series from metadata-rich candidate neighbours, then validating and refining those choices.”

That distinction matters because it changes what a business should try to implement.

If the reader thinks the LLM is the forecaster, they will look for a conversational model that predicts demand, traffic, revenue, or transactions directly. That is usually a bad idea. Forecasting needs controlled data splits, evaluation metrics, leakage discipline, and repeatable baselines. A fluent answer is not a forecast.

If the reader understands DCATS correctly, the LLM is a planning layer around the forecasting system. It proposes candidate training subsets. The forecasting module judges them. The agent revises. The metric decides.

That architecture is much more plausible for enterprise use. It keeps the stochastic, language-based component away from the final numeric claim and uses it where language models are actually useful: interpreting metadata, forming hypotheses, and navigating a large combinatorial search space that humans would rather not inspect by hand.

The business value is automated peer selection

The obvious business translation is not “use this exact system for traffic.” It is broader and more conditional.

Many organisations forecast across a large number of related units:

- stores forecasting SKU demand;
- hotels forecasting occupancy by property and segment;
- logistics teams forecasting route volumes;
- energy firms forecasting load across substations;
- banks forecasting transactions by merchant or branch;
- manufacturers forecasting machine-level failures or throughput;
- platforms forecasting user activity across regions.

In those settings, each unit has its own history, but the history may be too short, noisy, sparse, or locally idiosyncratic. The practical question becomes: which other units are similar enough to help?

That is where DCATS points to a useful design pattern.

| What the paper directly shows | Cognaptus business inference | Boundary |
| --- | --- | --- |
| Metadata-guided neighbour selection improves tested traffic forecasts. | Businesses with many comparable entities can use agents to curate peer groups for forecasting. | Requires rich, reliable metadata and enough comparable entities. |
| Validation feedback refines the agent’s proposals. | Agentic forecasting workflows should be metric-gated, not explanation-gated. | Validation design must prevent leakage and overfitting to the validation set. |
| Gains appear across four lightweight models. | Data-selection layers may improve an existing modelling stack without replacing it. | Production baselines may already use strong domain-specific selection rules. |
| Best proposals draw on different neighbour criteria. | Different target units may require different similarity logic. | The paper does not prove which similarity criteria generalise beyond traffic. |
| Explanations accompany proposals. | Human teams can inspect why a peer group was chosen. | Explanations are useful for review, not proof of correctness. |

The ROI logic is also more subtle than “6% better forecasts.” In production, the value may come from reducing analyst workload, shortening model-customisation cycles, and making peer-selection logic more adaptive. Accuracy is only one part of the case. The other part is operational scale.

A human analyst can manually inspect related time series for a few high-value targets. They cannot do it gracefully for thousands of sensors, stores, merchants, or machines every week. They will produce a process, and then that process will ossify. DCATS is interesting because it keeps the peer-selection process alive.

Where DCATS would fit in a real forecasting stack

A production version of this idea would not sit at the end of the forecasting pipeline, writing numbers into dashboards like an overconfident intern. It would sit upstream, close to feature stores, metadata catalogues, validation infrastructure, and model-training workflows.

A practical architecture might look like this:

1. Maintain a metadata catalogue for all forecastable entities.
2. Generate candidate similarity sets using rule-based, statistical, spatial, behavioural, or embedding-based methods.
3. Let an agent propose training subsets using the target profile and candidate neighbours.
4. Evaluate each proposal through the standard forecasting module.
5. Record validation and test performance, selected neighbours, explanations, and data versions.
6. Promote only metric-validated configurations into production.
7. Monitor drift, and rerun selection when entity behaviour changes.
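The record-and-promote steps above can be sketched as a small data structure plus a gate. Everything here is hypothetical: the `SelectionRun` fields, the `min_rel_gain` threshold, and the example numbers are illustrative choices, not part of the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class SelectionRun:
    """One evaluated proposal, logged with enough context to reproduce it."""
    target_id: str
    neighbour_ids: Tuple[str, ...]
    explanation: str
    data_version: str
    val_mae: float
    test_mae: float  # recorded for reporting, never used for selection

def choose_for_promotion(
    runs: List[SelectionRun],
    production_val_mae: float,
    min_rel_gain: float = 0.01,
) -> Optional[SelectionRun]:
    """Return the best run by validation MAE if it beats production by the margin."""
    if not runs:
        return None
    best = min(runs, key=lambda r: r.val_mae)
    gain = (production_val_mae - best.val_mae) / production_val_mae
    return best if gain >= min_rel_gain else None

# Made-up example runs for a hypothetical target.
runs = [
    SelectionRun("store_17", ("s2", "s4"), "same city", "2025-01", 30.0, 30.4),
    SelectionRun("store_17", ("s1", "s3"), "same format and similar volume", "2025-01", 29.0, 29.6),
]
promoted = choose_for_promotion(runs, production_val_mae=30.5)
not_promoted = choose_for_promotion(runs, production_val_mae=29.1)
```

Selecting on validation error while only logging test error keeps the test set honest. The minimum-gain threshold is one simple way to avoid promoting changes that are within noise.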

The most important dependency is not the LLM. It is the metadata. DCATS works because the agent has something meaningful to reason over: freeway, lane count, city, county, population, historical volume, spatial proximity, network distance, and temporal similarity.

In business domains, the equivalent might be store format, customer segment, product category, local event calendars, fulfilment mode, merchant type, tenure, geography, pricing regime, machine class, or usage pattern. If the metadata is poor, stale, inconsistent, or politically curated — always a charming possibility — the agent will reason over nonsense with excellent grammar.

The limitation section should not be decorative

The paper is clear that this is a preliminary study. That word is doing real work.

First, the evaluation uses one dataset: LargeST traffic volume data from California. Traffic sensors are a favourable environment for DCATS because there are many related time series and rich spatial metadata. The method may transfer to other domains, but transfer is not shown.

Second, the forecast horizon is short: the next 12 intervals, or three hours. That is a different problem from weekly demand planning, quarterly revenue forecasting, or long-horizon capacity planning. Short-horizon traffic has strong local and temporal structure; other domains may be less forgiving.

Third, the agent is GPT-4 Turbo. The paper does not establish whether smaller, cheaper, open, or domain-specific models would produce similar proposal quality. For businesses, this matters because an agent that runs many proposal rounds across many targets can become a real cost and governance concern, not just a clever diagram.

Fourth, the paper does not provide a full ablation separating the contribution of LLM-based proposal generation from simpler alternatives. A useful next test would compare DCATS against strong non-LLM neighbour-selection rules: top-$k$ temporal correlation, top-$k$ geodetic proximity, road-network-only selection, random neighbour selection, and hybrid heuristics. Without that, the result supports the DCATS framework, but not a precise attribution of the 6% gain to “LLM reasoning” alone.
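A comparison harness for that kind of test need not be elaborate. The sketch below builds a few non-LLM selectors and scores them with the same evaluation function an agent's proposal would face. The scores, sensor ids, and `toy_evaluate` are invented for illustration, and a real study would plug in the actual forecasting module and the agent's own proposals.

```python
import random
from typing import Callable, Dict, List, Tuple

def top_k(scores: Dict[str, float], k: int, largest: bool) -> List[str]:
    """Pick the k ids with the largest (or smallest) score."""
    return sorted(scores, key=lambda sid: scores[sid], reverse=largest)[:k]

def compare_selectors(
    selections: Dict[str, List[str]],
    evaluate: Callable[[List[str]], float],
) -> List[Tuple[str, float]]:
    """Score every named neighbour set with one shared metric, best first."""
    results = [(name, evaluate(ids)) for name, ids in selections.items()]
    return sorted(results, key=lambda t: t[1])

# Invented toy metadata for four candidate neighbours.
correlation = {"s1": 0.9, "s2": 0.2, "s3": 0.8, "s4": -0.1}
distance_km = {"s1": 12.0, "s2": 1.5, "s3": 40.0, "s4": 3.0}

def toy_evaluate(ids: List[str]) -> float:
    """Pretend validation MAE: s1 and s3 help, other neighbours hurt."""
    helpful = {"s1", "s3"}
    chosen = set(ids)
    return 10.0 - 2.0 * len(chosen & helpful) + 1.0 * len(chosen - helpful)

rng = random.Random(0)  # fixed seed so the random baseline is repeatable
selections = {
    "top_k_correlation": top_k(correlation, k=2, largest=True),
    "top_k_geodetic": top_k(distance_km, k=2, largest=False),
    "random": rng.sample(sorted(correlation), 2),
}
results = compare_selectors(selections, toy_evaluate)
scores_by_name = dict(results)
```

An agent's proposal would be added to `selections` as one more entry, so that it is judged on exactly the same footing as the heuristics it claims to beat.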

Fifth, the forecasting module includes anomaly removal and pretraining. Both are reasonable. Both also mean that the system’s performance should be interpreted as the result of a combined pipeline.

None of these limitations make the paper uninteresting. They make the paper usable. The worst way to read preliminary research is to inflate it into a universal product claim and then act disappointed when reality declines the invitation.

What operators should do with the idea

The right response to DCATS is not to rebuild the paper tomorrow morning. It is to inspect whether your forecasting problem has the structure DCATS needs.

Ask four questions.

First: do we forecast many related entities, or only a few isolated series? DCATS is designed for settings where auxiliary histories can help a target forecast. If every series is unique and unrelated, neighbour selection has little room to matter.

Second: do we have metadata that actually explains similarity? Names and IDs are not enough. Useful metadata describes structure, behaviour, geography, category, capacity, customer type, operating context, or historical patterns.

Third: do we have validation infrastructure that can judge agent proposals automatically? Without fast evaluation, the agent becomes a brainstorming assistant, not an AutoML component.

Fourth: are our current peer-selection rules crude enough to improve? If the organisation already has mature, domain-specific similarity logic, DCATS must beat that, not a straw baseline. A decent heuristic is a more serious opponent than a bad neural network paper usually admits.

If those conditions are met, the DCATS pattern is worth prototyping. Start small. Pick one forecasting domain with many entities. Build candidate neighbour sets using existing similarity rules. Let an agent propose subsets. Evaluate against current production logic. Log everything. Make the metric the adult in the room.

The useful future is data-centric AutoML, not agent theatre

DCATS sits in a promising part of the agentic AI landscape because it gives the agent a constrained job.

It does not ask the LLM to become an oracle. It does not ask the LLM to hallucinate a forecast and call it insight. It asks the LLM to reason over structured metadata, propose a data selection plan, and then accept judgement from a validation metric. This is a healthier division of labour than most agent demos, where the agent is asked to do everything except apologise when the spreadsheet catches fire.

The paper’s deeper message is that AutoML for time series may need to move closer to the data. Model selection and hyperparameter tuning still matter, but in many operational environments, the larger opportunity may be deciding which histories belong together.

That is not as glamorous as inventing a new architecture. It is more useful.

Forecasting systems do not become smarter merely by adding larger models. They become smarter when they learn what context to trust, what history to borrow, and when a neighbouring signal is a helpful guide rather than statistical gossip.

DCATS is an early, narrow, and imperfect step in that direction. Which is exactly why it is worth paying attention to.

Cognaptus: Automate the Present, Incubate the Future.

¹ Chin-Chia Michael Yeh, Vivian Lai, Uday Singh Saini, Xiran Fan, Yujie Fan, Junpeng Wang, Xin Dai, and Yan Zheng, “Empowering Time Series Forecasting with LLM-Agents,” arXiv:2508.04231, 2025, https://arxiv.org/abs/2508.04231.
