When LLMs Meet Time: Why Time-Series Reasoning Is Still Hard

Dashboard numbers are seductive because they look obedient. Revenue goes up, traffic dips, latency spikes, inventory turns over, temperature drifts, volatility clusters. Put the sequence into a chart and the pattern seems almost polite.

Then someone asks an LLM what happened.

The model answers fluently. It may even sound like an analyst who has seen too many quarterly review decks and has developed a protective layer of confidence. But fluency is not temporal understanding. A model can describe a curve, name a trend, and still fail to understand which segment comes next, whether a transformation is correct, or whether a discontinuity is an error or a legitimate feature of the process.

That is the useful irritation behind TSAQA: Time Series Analysis Question And Answering Benchmark.¹ The paper does not ask whether LLMs can talk about time series. They obviously can. The question is harsher: when time-series analysis is converted into controlled question-answering tasks, can today’s LLMs reason over temporal structure rather than perform verbal decoration around numbers?

The answer is: sometimes, after tuning, on some tasks. Which is less glamorous than “LLMs understand time,” but considerably more useful.

The first result is not failure; it is uneven competence

The headline number is easy to misuse. In zero-shot evaluation, the best commercial model in the paper, Gemini-2.5-Flash, reaches an overall score of 65.08 on TSAQA. GPT-4.1 reaches 62.82, Claude-3.5-Sonnet 61.19, and GPT-4o 60.73. That is not catastrophic, but it is not the kind of score one wants from a system trusted to interpret production metrics, trading signals, medical telemetry, grid demand, or logistics patterns without supervision.

The more interesting result arrives after instruction tuning. The authors apply LoRA-based supervised fine-tuning to open-source models. LLaMA-3.1-8B rises to 85.26 overall; Qwen3-8B reaches 84.29. Even Gemma3-1B reaches 69.70, surpassing the best zero-shot commercial result.

So the paper is not saying, “LLMs are useless for time series.” That would be tidy, dramatic, and wrong. The evidence says something more operational: time-series QA can be taught, but the learning is uneven, and the hard parts are not where generic AI demos usually look.

Evidence from TSAQA	What it directly shows	Business reading	Boundary
Best zero-shot commercial score: Gemini-2.5-Flash at 65.08 overall	General LLM capability does not transfer cleanly to time-series QA	Do not treat a strong general model as a finished temporal analyst	Zero-shot results do not measure what task-specific tuning or tools can do
Instruction-tuned LLaMA-3.1-8B reaches 85.26 overall	Open-source models can improve substantially with targeted training	Smaller tuned models may be practical for constrained internal workflows	High overall score hides weak pockets, especially PZ and advanced reasoning
Best PZ score remains 67.68	Chronological reconstruction remains difficult	Ordering, process continuity, and event sequencing need extra checks	PZ is a specific probe, not the whole universe of temporal intelligence
Human validation shows 91.2% agreement for characterization and 87.4% for comparison on unambiguous cases	The generated QA labels are reasonably reliable but not perfect	The benchmark is useful, but label noise and ambiguity remain	Comparison tasks are inherently more judgment-sensitive

The uncomfortable lesson is not that models cannot learn temporal tasks. It is that averages are too friendly. They hide where the model is actually brittle.

TSAQA tests time-series reasoning as a menu, not a single dish

A common mistake in business discussions is to treat “time-series AI” as one capability. It is not. Forecasting next month’s sales, detecting an anomaly in a sensor stream, comparing two demand curves, and recognizing whether a Fourier transform matches an input series are different cognitive acts. Putting them under one label is convenient. So is putting all office snacks under “nutrition.”

TSAQA is useful because it breaks the problem into six tasks across conventional and advanced analysis:

Task group	Task	What the model must do
Conventional analysis	Anomaly detection	Decide whether the input contains irregular or unexpected behavior
Conventional analysis	Classification	Assign a time series to a semantic or pattern-based class
Advanced analysis	Characterization	Identify properties such as trend, seasonality, dispersion, noise, or stationarity
Advanced analysis	Comparison	Compare two series in shape, similarity, correlation, or related structure
Advanced analysis	Data transformation	Recognize whether a transformed sequence correctly corresponds to the input
Advanced analysis	Temporal relationship	Infer continuation, chronological order, and structural continuity among patches

The benchmark uses three question types: true-or-false, multiple-choice, and puzzling. True-or-false and multiple-choice formats provide controlled evaluation. The puzzling format is the sharper instrument: the model receives an initial patch and four shuffled successor patches, then must reconstruct the chronological order.

This design matters because it blocks the model from escaping into prose. No vague “the series appears to show cyclical behavior” answer. No five-paragraph meditation on seasonality. The model must pick the right answer.

A benchmark that forces choice is less flattering than a chatbot window. That is exactly why it is useful.
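To make the puzzling format concrete, here is a minimal sketch of how a PZ-style item could be built from a single series. The function names, the fixed patch length, and the exact-match scoring are illustrative assumptions, not the paper's actual construction pipeline.

```python
import random

def make_pz_item(series, patch_len, n_successors=4, seed=0):
    """Split a series into an initial patch plus shuffled successor patches.

    Hypothetical helper: the benchmark's real patching rules may differ.
    """
    needed = patch_len * (n_successors + 1)
    if len(series) < needed:
        raise ValueError("series too short for the requested patches")
    patches = [series[i * patch_len:(i + 1) * patch_len]
               for i in range(n_successors + 1)]
    initial, successors = patches[0], patches[1:]

    # Shuffle the successor indices reproducibly.
    order = list(range(n_successors))
    random.Random(seed).shuffle(order)
    shuffled = [successors[i] for i in order]

    # answer[k] is the position in `shuffled` of the k-th true successor.
    answer = [order.index(k) for k in range(n_successors)]
    return initial, shuffled, answer

def is_exact_match(predicted, answer):
    """Assumed scoring rule: the whole ordering must be right."""
    return list(predicted) == list(answer)

# Toy series, not benchmark data.
series = [float(t) for t in range(20)]
initial, shuffled, answer = make_pz_item(series, patch_len=4)
restored = [shuffled[pos] for pos in answer]  # true successors, in order
```

With four successor patches there are 4! = 24 possible orderings, so a uniform random guess would be exactly right only about 4% of the time under exact-match scoring. That is a very different chance floor from a true-or-false question.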

The misconception: language wrapping is not temporal reasoning

The easy story is that time-series QA solves the interface problem. Convert numeric data into a natural-language question, ask the LLM, and receive a usable answer.

TSAQA is a polite way of saying: not so fast.

The paper shows that the QA format is not the destination. It is the testing chamber. Translating a time series into a question does not automatically give the model the ability to reason about temporal structure. It merely gives us a cleaner way to discover whether that ability exists.

This distinction matters for AI products. A business dashboard agent may accept questions such as:

“Did this week’s traffic pattern break from the usual cycle?”
“Is this sensor reading an anomaly or a normal transition?”
“Which product category has the most similar seasonal pattern?”
“Does this transformed signal preserve the structure of the original?”
“What is the likely order of these operational events?”

All of these are natural-language questions. Only some are language problems. The rest are process problems wearing a language costume.

Puzzling is where the benchmark becomes interesting

The PZ task is the paper’s most revealing design choice. It asks the model to reconstruct order from shuffled time-series patches. That sounds almost childish until one remembers that many business decisions are also ordering problems.

Which came first: demand softening or inventory tightening? Did a latency spike precede conversion loss or follow it? Is a jump in warehouse throughput a real regime change or just a reporting discontinuity? In markets, operations, healthcare, and infrastructure, the order of events is not a formatting detail. It is the argument.

The paper reports that PZ questions are consistently harder than true-or-false and multiple-choice questions. The best PZ score is 67.68, achieved by instruction-tuned LLaMA-3.1-8B. That is far below its overall score of 85.26.

This gap is important. It tells us that instruction tuning can raise broad benchmark performance while leaving chronological reconstruction fragile. The model can become a better answer selector without becoming a fully reliable temporal reasoner. A very modern kind of promotion, really.

The paper’s input-length analysis adds a subtle twist. For most tasks, longer inputs reduce accuracy. That fits ordinary intuition: more tokens, more noise, more room for confusion. But for temporal relationship tasks, where the PZ questions live, accuracy improves with longer input. The authors interpret this as evidence that PZ rewards the use of global context: longer sequences may give the model more structural information for reconstructing order.

That interpretation should be handled carefully. It is a diagnostic correlation, not a proof that the model has learned causal reasoning in the human sense. Still, it supports the paper’s broader point: PZ is not just another multiple-choice variant. It probes whether the model can use broader temporal structure, not merely local similarity.

Data transformation exposes the difference between arithmetic comfort and structural understanding

One of the best sections of the paper is the task-specific analysis of data transformation. TSAQA tests whether models can recognize transformations such as Fourier transform, wavelet transform, and first-order differencing.

This is not a decorative technical detail. In real analytical workflows, transformations are how raw signals become interpretable. Differencing can remove trend. Fourier methods expose frequency structure. Wavelets capture localized frequency behavior. If an LLM agent claims to help with time-series analysis but cannot recognize how a transformation changes a signal, it is not an analyst. It is a narrator standing next to an analyst.

The paper’s numbers show a clear hierarchy. In zero-shot settings, models do much better on first-order differencing than on Fourier or wavelet transformations. Gemini-2.5-Flash, for example, reaches 100.00 on multiple-choice first-order differencing but only 27.97 on multiple-choice Fourier transform and 53.19 on wavelet transform. GPT-4.1 reaches 91.90 on multiple-choice first-order differencing, but only 26.32 on Fourier and 35.39 on wavelet.

Instruction tuning improves transformation results substantially. LLaMA-3.1-8B reaches 71.83 on multiple-choice Fourier, 88.79 on wavelet, and 99.70 on first-order differencing. But the remaining Fourier gap is telling. Simple local transformations are easier to learn; global frequency-domain structure is harder.

For businesses, this suggests a practical design rule: do not ask the LLM to silently perform signal-processing judgment when a deterministic library can do the operation directly. Let the tool compute the transform. Let the model explain, compare, or decide using verified outputs. The charming all-in-one agent fantasy may wait in the hallway.
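The design rule is easy to sketch. The snippet below computes first-order differencing and discrete Fourier magnitudes deterministically, then checks a candidate transform against the computed one. It uses only the Python standard library to stay self-contained; in practice a numerical library such as NumPy would do the same work faster. All function names here are hypothetical.

```python
import cmath
import math

def first_difference(x):
    """First-order differencing: x[t] - x[t-1]."""
    return [b - a for a, b in zip(x, x[1:])]

def dft_magnitudes(x):
    """Naive O(n^2) DFT magnitudes for frequency bins 0..n//2."""
    n = len(x)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(coeff))
    return mags

def dominant_frequency(x):
    """Index of the strongest non-zero frequency bin."""
    mags = dft_magnitudes(x)
    # Skip bin 0, which only reflects the series mean.
    return max(range(1, len(mags)), key=lambda k: mags[k])

def matches_first_difference(original, candidate, tol=1e-9):
    """Verify a claimed differenced series instead of trusting a guess."""
    expected = first_difference(original)
    return len(expected) == len(candidate) and all(
        abs(e - c) <= tol for e, c in zip(expected, candidate))

# Toy signal: three full cycles over 32 samples.
signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
peak = dominant_frequency(signal)
ok = matches_first_difference([1, 4, 9, 16], [3, 5, 7])
bad = matches_first_difference([1, 4, 9, 16], [3, 5, 8])
```

The model never has to "see" the Fourier structure in raw numbers. It receives the computed peak and the verification result, and its job becomes explanation rather than arithmetic.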

Domain difficulty is not solved by naming the domain

TSAQA spans 210,000 samples across 13 domains. The underlying datasets include energy, finance, healthcare, nature, sales, transport, web, manufacturing, robotics, biomedical, environment sensing, IT operations, and synthetic data. That breadth matters because temporal behavior is domain-shaped. A smooth transition in one system may be suspicious in another. A discontinuity in web traffic may be normal; a discontinuity in a medical signal may deserve attention; a discontinuity in sales may be a promotion, a stockout, or someone’s spreadsheet having a small emotional crisis.

The authors’ domain analysis shows that some domains remain harder than others. Under zero-shot evaluation, domains such as Synthetic, IT, Robotics, and Web are challenging. After instruction tuning, Sales and Web remain among the most difficult. In the PZ temporal relationship task, Web remains difficult across zero-shot and instruction-tuned settings, while Sales also remains hard after tuning.

This is not merely a “some domains are harder” observation. The paper digs into one likely mechanism: smoothness bias.

For incorrect PZ predictions, the authors analyze boundary consistency. They compare the boundary distance between adjacent patches in the ground-truth sequence and in the predicted sequence. In difficult domains such as Web and Sales, tuned models often produce predicted sequences with smoother boundaries than the true sequence. In plain terms, the model tries to repair legitimate discontinuities.

That is a beautiful failure mode because it is so plausible. The model prefers a cleaner story than reality provides.
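A rough sketch of this kind of boundary check follows. This article does not reproduce the paper's exact metric, so the distance used here (absolute jump between the last value of one patch and the first value of the next) is an assumption, and the numbers are toy values.

```python
def boundary_distance(initial, ordered_patches):
    """Sum of absolute jumps across consecutive patch boundaries.

    Assumed metric: |first value of next patch - last value of previous|.
    """
    sequence = [initial] + list(ordered_patches)
    return sum(abs(nxt[0] - prev[-1])
               for prev, nxt in zip(sequence, sequence[1:]))

def smoothness_gap(initial, patches, true_order, predicted_order):
    """Positive result: the predicted ordering is smoother than the truth."""
    true_d = boundary_distance(initial, [patches[i] for i in true_order])
    pred_d = boundary_distance(initial, [patches[i] for i in predicted_order])
    return true_d - pred_d

# Toy example: the true sequence contains legitimate jumps.
initial = [1, 2]
patches = [[3, 9], [4, 5], [10, 11], [12, 13]]
true_order = [0, 1, 2, 3]
predicted_order = [1, 0, 2, 3]   # wrong, but with gentler boundaries

true_d = boundary_distance(initial, [patches[i] for i in true_order])
gap = smoothness_gap(initial, patches, true_order, predicted_order)
```

In this toy case the wrong ordering halves the total boundary jump. A model that optimizes for gentle joins would pick it with confidence.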

In business settings, that bias is dangerous. Volatility is not always noise. It can be the signal. A promotion starts. A product goes viral. A warehouse system switches status. A competitor launches. A fraud pattern begins. A sensor is recalibrated. If the model’s prior is “smooth continuation,” it may erase exactly the event the analyst needs to see.

The appendix is not extra furniture; it validates where the benchmark can be trusted

Benchmark papers often hide the interesting parts in appendices, like a restaurant putting the actual food in the storage room. TSAQA is no exception. Several appendix analyses matter for interpreting the main claim.

Paper component	Likely purpose	What it supports	What it does not prove
Main results table	Main evidence	Baseline model performance and effect of instruction tuning	That high overall accuracy means robust temporal reasoning
Input length analysis	Diagnostic analysis	Longer inputs generally hurt, except temporal relationship/PZ where global context may help	Human-like causal reasoning
Topic/subtopic analysis	Bias and difficulty check	More topics do not directly reduce accuracy; seasonality, autocorrelation, dispersion, and noise are harder	That every topic taxonomy is complete
Domain analysis	Robustness/sensitivity across domains	Domain variation affects difficulty; Web and Sales remain hard	That domain labels alone solve the problem
Data transformation analysis	Task-specific diagnostic	Fourier and wavelet reasoning are harder than first-order differencing	That models can replace numerical transform tools
Smoothness gap analysis	Error analysis	Models may over-smooth legitimate discontinuities	That all errors come from smoothness bias
Human evaluation	Benchmark quality validation	Multi-LLM labels are mostly aligned with expert judgments	Perfect ground truth, especially for comparison tasks

The human evaluation is particularly important. The benchmark uses LLM-assisted generation and multi-model consensus for characterization and comparison labels. That could easily become a circular evaluation trap: models judging questions made by models for models. The authors mitigate this by having six Ph.D.-level experts annotate 600 questions. They report uncertainty rates of 5% for characterization and 7% for comparison. For unambiguous cases, benchmark answers align with human judgments on 91.2% of characterization questions and 87.4% of comparison questions.

This does not make the benchmark flawless. It does make it more credible. Comparison remains harder and more ambiguous, which is unsurprising. Comparing two time series can involve multiple valid lenses: shape, lag, correlation, volatility, seasonality, local breaks. Humans also argue about these things, preferably with charts and coffee.
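The "multiple valid lenses" point can be shown in a few lines. In this sketch, two toy series look unrelated under plain correlation and nearly identical once a lag is allowed. The helper names and the lag search are illustrative assumptions, not anything taken from the benchmark.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(va * vb)

def best_lag(a, b, max_lag):
    """Find the delay of b (in steps) that best aligns it with a."""
    scores = {}
    for lag in range(max_lag + 1):
        # Compare a[t] with b[t + lag] on the overlapping window.
        scores[lag] = pearson(a[:len(a) - lag], b[lag:])
    lag = max(scores, key=scores.get)
    return lag, scores[lag]

# Toy data: b is a copy of a delayed by a quarter of a 16-step cycle.
a = [math.sin(2 * math.pi * t / 16) for t in range(64)]
b = [math.sin(2 * math.pi * (t - 4) / 16) for t in range(64)]

same_time = pearson(a, b)          # close to zero
lag, lagged = best_lag(a, b, 8)    # strong match at a lag of 4 steps
```

Whether these two series count as "similar" depends entirely on which lens the question intends, which is one reason comparison labels attract more disagreement.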

What the paper directly shows

The paper directly supports four conclusions.

First, existing general LLMs are not automatically strong time-series analysts. The best zero-shot commercial model reaches 65.08 overall, and several advanced tasks remain difficult.

Second, instruction tuning is powerful. The tuned open-source models dramatically outperform their zero-shot versions, and the best tuned 8B model exceeds the zero-shot commercial models on the benchmark overall.

Third, advanced temporal reasoning remains uneven. Data transformation and temporal relationship tasks reveal weaknesses that overall accuracy can hide. PZ is especially useful because it stresses chronological reconstruction rather than answer-format familiarity.

Fourth, model errors are structured, not random. The smoothness bias analysis suggests that models may impose generic continuity even when the domain legitimately contains sharp transitions.

That last point is the one business readers should keep. Random errors can be averaged, monitored, or sampled. Structured errors become product risk.

What Cognaptus infers for business use

The paper does not evaluate enterprise agents, trading bots, hospital systems, or factory dashboards directly. So the business implications below are inferences, not direct experimental results.

Still, they are useful inferences.

1. Evaluate temporal agents by capability category, not by chat quality

A dashboard assistant that explains trends well may fail at ordering events. A model that classifies simple patterns may fail at transformation reasoning. A model that handles stable energy demand may stumble on web traffic or sales volatility.

A practical evaluation suite should separate at least five capabilities:

Capability	Example business question	Evaluation style
Description	“What pattern is visible in this metric?”	Characterization QA
Diagnosis	“Is this point anomalous?”	Anomaly detection with ground truth
Comparison	“Which region behaves most similarly?”	Pairwise comparison tasks
Transformation reasoning	“Does this differenced or frequency-domain signal match the original?”	Tool-verified transformation QA
Temporal ordering	“Which event sequence is most plausible?”	PZ-like ordering tasks

This is cheaper than discovering the weakness after deployment, which is the traditional enterprise approach to benchmarking: optimistic demo first, incident review later.
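A minimal sketch of such a suite report, assuming each evaluation record is simply a (capability, correct) pair. The record format, the threshold, and the results are hypothetical.

```python
from collections import defaultdict

def accuracy_by_capability(results):
    """results: iterable of (capability, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for capability, ok in results:
        totals[capability] += 1
        correct[capability] += int(ok)
    return {c: correct[c] / totals[c] for c in totals}

def weak_slices(per_capability, threshold):
    """Capabilities whose accuracy falls below the threshold."""
    return sorted(c for c, acc in per_capability.items() if acc < threshold)

# Made-up results: strong on description, weak on ordering.
results = (
    [("description", True)] * 4
    + [("ordering", True), ("ordering", False)] * 2
)
overall = sum(ok for _, ok in results) / len(results)
per_capability = accuracy_by_capability(results)
flagged = weak_slices(per_capability, threshold=0.7)
```

Here the overall accuracy of 0.75 looks respectable, while the ordering slice sits at 0.5. The report should lead with the second number whenever ordering is what the workflow depends on.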

2. Use LLMs as reasoning interfaces, not unchecked numerical engines

TSAQA suggests that natural-language QA is a valuable interface for time-series analysis. But the data transformation results argue against letting the model internally approximate mathematical operations when exact tools exist.

The sensible architecture is hybrid:

numerical libraries compute transformations, anomalies, features, and candidate continuations;
the LLM receives structured outputs and context;
the LLM explains, compares, asks follow-up questions, or ranks interpretations;
high-risk decisions keep deterministic checks and human review.

This is not less “AI-native.” It is more adult.
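The hybrid flow can be sketched without naming any particular model API. In the snippet below, deterministic code computes the features and the model only ever receives a structured, verifiable payload. The feature set and the prompt wording are assumptions chosen for illustration.

```python
import json
import statistics

def compute_features(series):
    """Deterministic statistics the model should not have to estimate."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    largest = max(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return {
        "n_points": len(series),
        "mean": statistics.fmean(series),
        "stdev": statistics.pstdev(series),
        # Index refers to the point where the jump lands.
        "largest_jump": {"index": largest + 1, "size": diffs[largest]},
    }

def build_prompt(question, features):
    """Hypothetical prompt template: the model sees verified numbers only."""
    return (
        "Answer using only the verified statistics below.\n"
        f"Question: {question}\n"
        f"Statistics: {json.dumps(features, sort_keys=True)}"
    )

# Toy metric with one obvious level shift.
series = [10, 11, 10, 12, 30, 31]
features = compute_features(series)
prompt = build_prompt("Did this metric change regime?", features)
```

The prompt string would then go to whichever model the team uses. The point of the split is auditability: every number in the answer can be traced to a computation, not to the model's impression of a list of digits.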

3. Tune smaller models for constrained tasks, but do not trust the average score alone

The instruction-tuned results are encouraging for companies that cannot or do not want to route every internal metric question through a large commercial model. An 8B model tuned on the right task distribution can become useful.

But the overall score should not be the procurement metric. A tuned model may look strong overall while still underperforming on PZ or Fourier-like transformation reasoning. In deployment, the weak slice matters more than the average when the weak slice maps to a critical workflow.

A retail company worried about promotion shocks should care about volatility and discontinuity. A trading system should care about regime shifts and ordering. A monitoring system should care about whether the model smooths away incidents. Benchmarks should be weighted by operational consequence, not academic neatness.
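One simple way to express "weighted by operational consequence" is a weighted mean over slice scores. The slice names, scores, and weights below are hypothetical and chosen only to show how the ranking signal changes.

```python
def weighted_score(slice_scores, weights):
    """Weighted mean of per-slice scores.

    Weights encode how much each slice matters to the workflow; they are
    a business judgment, not something a benchmark can supply.
    """
    missing = set(weights) - set(slice_scores)
    if missing:
        raise ValueError(f"no score for slices: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(slice_scores[k] * w for k, w in weights.items()) / total

# Hypothetical scores for one tuned model.
slice_scores = {"tf_mc": 90.0, "pz": 60.0}

neutral = weighted_score(slice_scores, {"tf_mc": 1, "pz": 1})
ordering_heavy = weighted_score(slice_scores, {"tf_mc": 1, "pz": 3})
```

The same model scores 75.0 under equal weights and 67.5 when ordering matters three times as much. Nothing about the model changed, only the honesty of the metric.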

4. Treat volatility as domain knowledge, not noise to be cleaned away

The smoothness bias result is the paper’s most business-relevant failure mode. Many operational systems contain legitimate discontinuities. Sales spikes, web traffic bursts, fraud events, production interruptions, demand shocks, and policy changes do not ask permission before ruining a smooth curve.

If a model prefers smoothness, the remedy is not merely more prompting. The system needs domain-aware validation: event calendars, promotion logs, release notes, incident reports, market news, sensor maintenance records, and other exogenous context. Temporal reasoning without context is often just curve etiquette.
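As a small illustration of domain-aware validation, the sketch below checks each large jump against an event calendar before anyone calls it an anomaly. The threshold, the matching window, and the calendar format are all assumptions.

```python
def find_jumps(series, threshold):
    """Indices where the step from the previous point exceeds the threshold."""
    return [i for i in range(1, len(series))
            if abs(series[i] - series[i - 1]) > threshold]

def label_jumps(series, threshold, known_events, window=1):
    """Attach a known event to each jump, or mark it unexplained.

    known_events: dict mapping a time index to a short description
    (a hypothetical stand-in for promotion logs, release notes, etc.).
    """
    labels = {}
    for i in find_jumps(series, threshold):
        nearby = [desc for idx, desc in known_events.items()
                  if abs(idx - i) <= window]
        labels[i] = nearby[0] if nearby else "unexplained"
    return labels

# Toy daily sales with a promotion and one unexplained drop.
sales = [100, 102, 101, 180, 178, 100, 99, 40]
calendar = {3: "promotion starts", 5: "promotion ends"}
labels = label_jumps(sales, threshold=30, known_events=calendar)
```

Two of the three discontinuities turn out to be the business doing exactly what it planned. Only the last one deserves an investigation, and no amount of smoothing would have found it.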

Where TSAQA stops short

TSAQA is valuable because it makes temporal reasoning testable. But the paper’s own limitations matter for deployment.

The benchmark is static. Real systems drift. Product categories change, user behavior evolves, sensors degrade, reporting definitions move, and markets occasionally behave as if they were designed by committee.

The benchmark uses standardized samples, which is appropriate for cross-domain evaluation but not identical to raw production data. Real data may be irregularly sampled, mixed-frequency, missing, delayed, or affected by exogenous drivers. A CFO does not ask whether a z-scored series has a dispersion pattern; she asks why gross margin moved after pricing changed and inventory lagged. The benchmark is closer to a diagnostic gym than a full business simulation.

The PZ task is also computationally and conceptually demanding. It rewards chronological reconstruction, but it may penalize models that prefer local smoothness, even though that preference is sometimes useful. Future benchmarks will need to separate local continuity, long-range consistency, and domain-specific volatility more cleanly.

None of these limitations weaken the paper’s contribution. They define how to use it.

The better takeaway: temporal QA is an evaluation layer, not a magic layer

The strongest way to read TSAQA is not as another benchmark leaderboard. It is a design warning.

For years, the AI industry has been trying to make everything look like a chat problem. Documents became chat. Databases became chat. Code became chat. Dashboards became chat. This interface shift is useful. But the interface does not erase the structure of the underlying task.

Time series are not just text with brackets and commas. They carry order, spacing, scale, transformations, regimes, and domain-specific discontinuities. A model that can answer a question about a series may still not understand the process that generated it.

TSAQA gives researchers and builders a more disciplined way to expose that gap. It shows where models already improve with task-specific tuning. It shows where general models remain fragile. It shows why puzzling-style ordering tasks reveal failures that standard multiple-choice questions can miss. And it shows why smoothing reality into a nicer curve is not intelligence. It is just a spreadsheet with manners.

For business teams building LLM agents over finance, operations, healthcare, energy, or web metrics, the practical conclusion is straightforward: use LLMs, but test them by temporal capability. Pair them with numerical tools. Validate them against domain volatility. Watch the failure modes, not only the average score.

Time is not another column in the prompt. It is the structure the model has to respect.

Cognaptus: Automate the Present, Incubate the Future.

¹ Baoyu Jing, Sanhorn Chen, Lecheng Zheng, Boyu Liu, Zihao Li, Jiaru Zou, Tianxin Wei, Zhining Liu, Zhichen Zeng, Ruizhong Qiu, Xiao Lin, Yuchen Yan, Dongqi Fu, Jingchao Ni, Jingrui He, and Hanghang Tong, “TSAQA: Time Series Analysis Question And Answering Benchmark,” arXiv:2601.23204.
