Opening — Why this matters now
Large Language Models are increasingly marketed as general problem solvers. They summarize earnings calls, reason about code, and explain economic trends with alarming confidence. But when confronted with time—real, numeric, structured temporal data—that confidence starts to wobble. The TSAQA benchmark arrives at exactly the right moment, not to celebrate LLM progress, but to measure how far they still have to go.
Background — From forecasting to understanding
Traditional time-series research has focused on a narrow set of tasks: forecasting, anomaly detection, classification, and imputation. These are useful, but incomplete. Real decision-making requires something deeper—understanding structure, relationships, transformations, and causality across time.
Recent work reframes time-series analysis as a question-answering problem, translating numeric sequences into reasoning tasks expressed in natural language. TSAQA (Time-Series Analysis Question Answering) pushes this idea further by systematically probing whether LLMs can reason about time series, not merely describe them.
What TSAQA actually tests
TSAQA is not a single task but a carefully constructed benchmark spanning five reasoning categories:
| Category | What it probes |
|---|---|
| Characterization | Trends, seasonality, noise, stationarity |
| Comparison | Cross-series relationships and correlations |
| Data Transformation | Fourier, wavelet, and differencing logic |
| Temporal Relationship (PZ) | Chronological ordering and causality |
| Anomaly & Classification | Structured judgment under constraints |
Each task is instantiated via strict templates: multiple-choice (MC), true/false (TF), and puzzling (PZ). These force models to respond in tightly controlled formats, with no narrative escape hatches.
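To make the template idea concrete, here is a minimal sketch of how an MC characterization item might be instantiated from raw values. The question wording, option set, and helper names are illustrative assumptions, not TSAQA's released generation code:

```python
import numpy as np

rng = np.random.default_rng(0)

def trend_label(series: np.ndarray, tol: float = 0.01) -> str:
    """Label the overall trend via the slope of a least-squares line fit."""
    t = np.arange(len(series))
    slope = np.polyfit(t, series, 1)[0]
    if slope > tol:
        return "upward"
    if slope < -tol:
        return "downward"
    return "flat"

def make_mc_question(series: np.ndarray) -> dict:
    """Instantiate a multiple-choice (MC) characterization item.

    Hypothetical template -- TSAQA's actual generators may differ.
    """
    options = ["upward", "downward", "flat"]
    answer = trend_label(series)
    prompt = (
        "Series: " + ", ".join(f"{x:.2f}" for x in series) + "\n"
        "Question: What is the overall trend of this series?\n"
        + "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
        + "\nAnswer with a single letter."
    )
    return {"prompt": prompt, "answer": chr(65 + options.index(answer))}

# A noisy upward trend should yield answer (A) "upward".
series = 0.5 * np.arange(50) + rng.normal(0, 1.0, 50)
print(make_mc_question(series)["answer"])  # -> "A"
```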
The Puzzling (PZ) task — Where models crack
The most revealing contribution of the paper is the Puzzling (PZ) task. Models are given one time-series segment and four shuffled successors, then asked to reconstruct the correct chronological order.
This sounds trivial. It isn’t.
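A minimal sketch of how such a puzzle can be assembled, assuming equal-length segments and uniform shuffling; the field names and segment count are my own illustration, not the paper's exact schema:

```python
import numpy as np

rng = np.random.default_rng(7)

def make_pz_instance(series: np.ndarray, n_segments: int = 5) -> dict:
    """Cut a series into segments, reveal the first, shuffle the rest.

    The model must recover the chronological order of the shuffled
    successors. Structure is illustrative, not TSAQA's exact format.
    """
    segments = np.array_split(series, n_segments)
    anchor, successors = segments[0], segments[1:]
    order = rng.permutation(len(successors))       # shuffled presentation
    shuffled = [successors[i] for i in order]
    # Ground truth: for each original successor, its position in the shuffle.
    answer = [int(i) for i in np.argsort(order)]
    return {"anchor": anchor, "shuffled": shuffled, "answer": answer}

inst = make_pz_instance(np.sin(np.linspace(0, 12, 100)) + rng.normal(0, 0.1, 100))
print(inst["answer"])  # chronological reading order of the shuffled segments
```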
Performance on PZ improves with longer inputs—the opposite of most language tasks—suggesting that success depends on genuine global temporal reasoning rather than local pattern matching. Errors reveal a consistent smoothness bias: models prefer artificially smooth transitions, even when the ground truth contains legitimate volatility.
In other words, LLMs hallucinate temporal coherence where none exists.
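The bias is easy to reproduce with a naive baseline that orders segments purely by boundary continuity. The brute-force cost below is my own illustration of the failure mode, not the paper's analysis code: wherever the true series contains a legitimate jump, minimizing smoothness confidently picks the wrong order.

```python
import numpy as np
from itertools import permutations

def smoothest_order(anchor: np.ndarray, segments: list[np.ndarray]) -> list[int]:
    """Pick the ordering that minimizes jumps at segment boundaries.

    This mimics the 'smoothness bias': a plausible heuristic that fails
    whenever the true series contains legitimate volatility.
    """
    def boundary_cost(order):
        seq = [anchor] + [segments[i] for i in order]
        return sum(abs(a[-1] - b[0]) for a, b in zip(seq, seq[1:]))
    return list(min(permutations(range(len(segments))), key=boundary_cost))

# A series with a genuine level shift: the smooth guess reorders it wrongly.
anchor = np.array([1.0, 1.1, 1.2])
true_next = [np.array([5.0, 5.1]),   # legitimate jump upward
             np.array([1.3, 1.4]),   # looks like the "natural" continuation
             np.array([5.2, 5.3])]
print(smoothest_order(anchor, true_next))  # prefers [1, 0, 2], not [0, 1, 2]
```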
Results — The uncomfortable numbers
Zero-shot performance is sobering. Even the strongest commercial models average barely above 65%. Instruction tuning lifts open-source models well past that mark, but even the best tuned model still gets roughly one question in five wrong.
| Model | Setting | Avg. Accuracy (%) |
|---|---|---|
| Gemini 2.5 Flash | Zero-shot | ~65 |
| GPT-4.1 | Zero-shot | ~64 |
| LLaMA-3.1-8B | Instruction-tuned | ~78 |
Notably, domain context offers little rescue. Whether time series come from finance, healthcare, or energy, the core difficulty persists.
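How are these accuracies computed? Benchmarks with strict templates typically use exact-match scoring over a parsed answer token, so any response that escapes the format counts as wrong. A minimal scorer along those lines, assuming a single-letter MC format; the regex and scoring rule are my assumptions, not TSAQA's released harness:

```python
import re

def score_mc(response: str, gold: str) -> bool:
    """Exact-match scoring for a single-letter multiple-choice answer.

    Anything that does not parse to exactly one option letter is wrong,
    so verbose or hedged responses are penalized by construction.
    """
    match = re.fullmatch(r"\(?([A-D])\)?\.?", response.strip())
    return bool(match) and match.group(1) == gold

print(score_mc("B", "B"))                   # True
print(score_mc("The answer is (B).", "B"))  # False: narrative escape hatch
```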
What this tells us about LLMs
Three uncomfortable truths emerge:
- Textual fluency ≠ temporal understanding
- Instruction-following breaks under structural constraints
- Numeric time remains a foreign language to LLMs
TSAQA exposes a gap between language reasoning and process reasoning. LLMs excel at describing time, but struggle to inhabit it.
Implications — For research and for business
For researchers, TSAQA sets a higher bar: future models must internalize temporal structure, not just tokenize numbers. For practitioners, the warning is blunt: LLMs are not ready to replace domain-specific time-series systems in high-stakes settings like trading, monitoring, or control.
Hybrid architectures—symbolic operators, structured memory, or explicit temporal modules—are no longer optional. They are the path forward.
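What might a hybrid look like in practice? One hedged sketch: compute explicit temporal structure with conventional tools, then hand the model structured facts instead of a wall of digits. The feature set and prompt shape below are illustrative choices, not a design proposed by the paper.

```python
import numpy as np

def temporal_facts(series: np.ndarray, period: int = 12) -> dict:
    """Compute explicit temporal structure a raw-token LLM tends to miss."""
    t = np.arange(len(series))
    slope = np.polyfit(t, series, 1)[0]
    diffs = np.diff(series)
    # Autocorrelation at the candidate seasonal lag as a seasonality hint.
    centered = series - series.mean()
    acf = float(
        np.dot(centered[:-period], centered[period:]) / np.dot(centered, centered)
    )
    return {
        "trend_slope": round(float(slope), 4),
        "volatility": round(float(diffs.std()), 4),
        "seasonal_acf": round(acf, 4),
    }

def hybrid_prompt(series: np.ndarray) -> str:
    """Feed the model structured facts rather than raw tokens alone."""
    facts = temporal_facts(series)
    return (
        f"Known structure: {facts}\n"
        "Using these computed facts, answer the question about the series."
    )
```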
Conclusion
TSAQA doesn’t argue that LLMs are useless for time series. It argues something more important: we have been testing the wrong things. True intelligence unfolds over time, and until models can reason within it, their understanding will remain—quite literally—out of sequence.
Cognaptus: Automate the Present, Incubate the Future.