Opening — Why this matters now
Large Language Models are increasingly marketed as general problem solvers. They summarize earnings calls, reason about code, and explain economic trends with alarming confidence. But when confronted with time—real, numeric, structured temporal data—that confidence starts to wobble. The TSAQA benchmark arrives at exactly the right moment, not to celebrate LLM progress, but to measure how far they still have to go.
Background — From forecasting to understanding
Traditional time-series research has focused on a narrow set of tasks: forecasting, anomaly detection, classification, and imputation. These are useful, but incomplete. Real decision-making requires something deeper—understanding structure, relationships, transformations, and causality across time.
Recent work reframes time-series analysis as a question-answering problem, translating numeric sequences into reasoning tasks expressed in natural language. TSAQA (Time-Series Analysis Question Answering) pushes this idea further by systematically probing whether LLMs can reason about time series, not merely describe them.
What TSAQA actually tests
TSAQA is not a single task but a carefully constructed benchmark spanning five reasoning categories:
| Category | What it probes |
|---|---|
| Characterization | Trends, seasonality, noise, stationarity |
| Comparison | Cross-series relationships and correlations |
| Data Transformation | Fourier, wavelet, and differencing logic |
| Temporal Relationship (PZ) | Chronological ordering and causality |
| Anomaly & Classification | Structured judgment under constraints |
Each task is instantiated via strict templates: multiple-choice (MC), true/false (TF), and puzzling (PZ). These force models to respond in tightly controlled formats, with no narrative escape hatches.
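To make the template idea concrete, here is a minimal sketch of how an MC characterization item might be instantiated from raw values. The question wording, option set, and helper names are illustrative assumptions, not TSAQA's released generation code:

```python
import numpy as np

rng = np.random.default_rng(0)

def trend_label(series: np.ndarray, tol: float = 0.01) -> str:
    """Label the overall trend via the slope of a least-squares line fit."""
    t = np.arange(len(series))
    slope = np.polyfit(t, series, 1)[0]
    if slope > tol:
        return "upward"
    if slope < -tol:
        return "downward"
    return "flat"

def make_mc_question(series: np.ndarray) -> dict:
    """Instantiate a multiple-choice (MC) characterization item.

    Hypothetical template -- TSAQA's actual generators may differ.
    """
    options = ["upward", "downward", "flat"]
    answer = trend_label(series)
    prompt = (
        "Series: " + ", ".join(f"{x:.2f}" for x in series) + "\n"
        "Question: What is the overall trend of this series?\n"
        + "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
        + "\nAnswer with a single letter."
    )
    return {"prompt": prompt, "answer": chr(65 + options.index(answer))}

# A noisy upward trend should yield answer (A) "upward".
series = 0.5 * np.arange(50) + rng.normal(0, 1.0, 50)
print(make_mc_question(series)["answer"])  # -> "A"
```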
The Puzzling (PZ) task — Where models crack
The most revealing contribution of the paper is the Puzzling (PZ) task. Models are given one time-series segment and four shuffled successors, then asked to reconstruct the correct chronological order.
This sounds trivial. It isn’t.
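A minimal sketch of how such a puzzle can be assembled, assuming equal-length segments and uniform shuffling; the field names and segment count are my own illustration, not the paper's exact schema:

```python
import numpy as np

rng = np.random.default_rng(7)

def make_pz_instance(series: np.ndarray, n_segments: int = 5) -> dict:
    """Cut a series into segments, reveal the first, shuffle the rest.

    The model must recover the chronological order of the shuffled
    successors. Structure is illustrative, not TSAQA's exact format.
    """
    segments = np.array_split(series, n_segments)
    anchor, successors = segments[0], segments[1:]
    order = rng.permutation(len(successors))       # shuffled presentation
    shuffled = [successors[i] for i in order]
    # Ground truth: for each original successor, its position in the shuffle.
    answer = [int(i) for i in np.argsort(order)]
    return {"anchor": anchor, "shuffled": shuffled, "answer": answer}

inst = make_pz_instance(np.sin(np.linspace(0, 12, 100)) + rng.normal(0, 0.1, 100))
print(inst["answer"])  # chronological reading order of the shuffled segments
```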
Performance on PZ improves with longer inputs—the opposite of most language tasks—suggesting that success depends on genuine global temporal reasoning rather than local pattern matching. Errors reveal a consistent smoothness bias: models prefer artificially smooth transitions, even when the ground truth contains legitimate volatility.
In other words, LLMs hallucinate temporal coherence where none exists.
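The bias is easy to reproduce with a naive baseline that orders segments purely by boundary continuity. The brute-force cost below is my own illustration of the failure mode, not the paper's analysis code: wherever the true series contains a legitimate jump, minimizing smoothness confidently picks the wrong order.

```python
import numpy as np
from itertools import permutations

def smoothest_order(anchor: np.ndarray, segments: list[np.ndarray]) -> list[int]:
    """Pick the ordering that minimizes jumps at segment boundaries.

    This mimics the 'smoothness bias': a plausible heuristic that fails
    whenever the true series contains legitimate volatility.
    """
    def boundary_cost(order):
        seq = [anchor] + [segments[i] for i in order]
        return sum(abs(a[-1] - b[0]) for a, b in zip(seq, seq[1:]))
    return list(min(permutations(range(len(segments))), key=boundary_cost))

# A series with a genuine level shift: the smooth guess reorders it wrongly.
anchor = np.array([1.0, 1.1, 1.2])
true_next = [np.array([5.0, 5.1]),   # legitimate jump upward
             np.array([1.3, 1.4]),   # looks like the "natural" continuation
             np.array([5.2, 5.3])]
print(smoothest_order(anchor, true_next))  # prefers [1, 0, 2], not [0, 1, 2]
```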
Results — The uncomfortable numbers
Zero-shot performance is sobering. Even the strongest commercial models average barely above 65%. Instruction tuning lifts open-source models well past that mark, but even the best tuned model still gets roughly one question in five wrong.
| Model | Setting | Avg. Accuracy (%) |
|---|---|---|
| Gemini 2.5 Flash | Zero-shot | ~65 |
| GPT-4.1 | Zero-shot | ~64 |
| LLaMA-3.1-8B | Instruction-tuned | ~78 |
Notably, domain context offers little rescue. Whether time series come from finance, healthcare, or energy, the core difficulty persists.
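How are these accuracies computed? Benchmarks with strict templates typically use exact-match scoring over a parsed answer token, so any response that escapes the format counts as wrong. A minimal scorer along those lines, assuming a single-letter MC format; the regex and scoring rule are my assumptions, not TSAQA's released harness:

```python
import re

def score_mc(response: str, gold: str) -> bool:
    """Exact-match scoring for a single-letter multiple-choice answer.

    Anything that does not parse to exactly one option letter is wrong,
    so verbose or hedged responses are penalized by construction.
    """
    match = re.fullmatch(r"\(?([A-D])\)?\.?", response.strip())
    return bool(match) and match.group(1) == gold

print(score_mc("B", "B"))                   # True
print(score_mc("The answer is (B).", "B"))  # False: narrative escape hatch
```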
What this tells us about LLMs
Three uncomfortable truths emerge:
- Textual fluency ≠ temporal understanding
- Instruction-following breaks under structural constraints
- Numeric time remains a foreign language to LLMs
TSAQA exposes a gap between language reasoning and process reasoning. LLMs excel at describing time, but struggle to inhabit it.
Implications — For research and for business
For researchers, TSAQA sets a higher bar: future models must internalize temporal structure, not just tokenize numbers. For practitioners, the warning is blunt: LLMs are not ready to replace domain-specific time-series systems in high-stakes settings like trading, monitoring, or control.
Hybrid architectures—symbolic operators, structured memory, or explicit temporal modules—are no longer optional. They are the path forward.
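What might a hybrid look like in practice? One hedged sketch: compute explicit temporal structure with conventional tools, then hand the model structured facts instead of a wall of digits. The feature set and prompt shape below are illustrative choices, not a design proposed by the paper.

```python
import numpy as np

def temporal_facts(series: np.ndarray, period: int = 12) -> dict:
    """Compute explicit temporal structure a raw-token LLM tends to miss."""
    t = np.arange(len(series))
    slope = np.polyfit(t, series, 1)[0]
    diffs = np.diff(series)
    # Autocorrelation at the candidate seasonal lag as a seasonality hint.
    centered = series - series.mean()
    acf = float(
        np.dot(centered[:-period], centered[period:]) / np.dot(centered, centered)
    )
    return {
        "trend_slope": round(float(slope), 4),
        "volatility": round(float(diffs.std()), 4),
        "seasonal_acf": round(acf, 4),
    }

def hybrid_prompt(series: np.ndarray) -> str:
    """Feed the model structured facts rather than raw tokens alone."""
    facts = temporal_facts(series)
    return (
        f"Known structure: {facts}\n"
        "Using these computed facts, answer the question about the series."
    )
```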
Conclusion
TSAQA doesn’t argue that LLMs are useless for time series. It argues something more important: we have been testing the wrong things. True intelligence unfolds over time, and until models can reason within it, their understanding will remain—quite literally—out of sequence.
Cognaptus: Automate the Present, Incubate the Future.