Opening — Why this matters now
Multimodal LLMs are increasingly being asked to reason about time series: markets, traffic, power grids, pollution. Charts are rendered. Prompts are polished. The answers sound confident. And yet—too often—they’re wrong for the most boring reason imaginable: the model never actually reasons.
Instead, it pattern-matches.
This paper dissects that failure mode with unusual clarity. The authors argue that the bottleneck is not model scale, data access, or even modality alignment. It’s the absence of explicit reasoning priors that connect observed temporal patterns to downstream outcomes. Without those priors, multimodal LLMs hallucinate explanations after the fact, mistaking surface similarity for causality.
Their proposal—Rationale-Grounded In-Context Learning, implemented as RationaleTS—is deceptively simple: give the model reusable reasoning paths before you ask for predictions.
Background — What existed before (and why it wasn’t enough)
Three dominant paradigms currently shape LLM-based time-series reasoning:
| Paradigm | What’s Retrieved | What’s Missing |
|---|---|---|
| In-Context Learning (ICL) | Similar samples | Explicit reasoning logic |
| Retrieval-Augmented Generation (RAG) | Documents / facts | Temporal structure & causality |
| Chain-of-Thought (CoT) | Free-form reasoning | Grounding, transferability |
Even when time series are visualized as charts for MLLMs, the model typically extrapolates local trends (“it went up last hour, so maybe up next hour”) or latches onto the most salient variable. The explanation follows the prediction—not the other way around.
The paper’s key insight is blunt but accurate: explanations are being treated as decorations, not infrastructure.
Analysis — What RationaleTS actually does
RationaleTS introduces rationales as first-class reasoning units. Not labels. Not examples. But structured, reusable causal paths.
Each rationale is a bulleted list of:
Observation → Implication
Crucially, these are label-conditioned but label-hidden. The model knows the outcome during rationale generation, but the outcome is never stated in the rationale itself. This prevents shortcut learning.
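To make the idea concrete, here is a minimal sketch of how such a rationale record could be represented. The field names and `to_prompt_text` helper are illustrative assumptions, not the paper's schema; only the Observation → Implication structure and the label-hidden constraint come from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class RationaleStep:
    """One Observation -> Implication link in a reasoning path."""
    observation: str   # what the chart shows (e.g., "occupancy near a trough")
    implication: str   # what that pattern suggests about the future

@dataclass
class Rationale:
    """A reusable, label-hidden reasoning path stored in the rationale base."""
    domain: str                                        # e.g., "traffic", "finance", "power"
    steps: list[RationaleStep] = field(default_factory=list)
    # The true outcome conditions rationale *generation*, but it is deliberately
    # never stored in the rationale text itself (label-conditioned, label-hidden).

    def to_prompt_text(self) -> str:
        return "\n".join(f"- {s.observation} -> {s.implication}" for s in self.steps)
```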
The method unfolds in three stages:
1. Abductive Rationale Generation
Given a historical time-series chart and the true future outcome, an MLLM is asked to retrospectively justify the result using causal reasoning paths.
Example (traffic domain):
- Occupancy remains near a trough → signals latent demand and likely mean reversion
- Humidity rises while wind speed stays low → slower traffic speeds → higher occupancy
Over the training set, these rationales form a rationale base—a library of domain-specific reasoning priors.
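A rough sketch of how the abductive step might be wired up is below. The instruction wording and the `call_mllm` placeholder are assumptions for illustration; the paper's actual prompts will differ, but the key constraint (the outcome conditions the request yet must not appear in the rationale) is the one described above.

```python
def build_abductive_prompt(chart_description: str, true_outcome: str) -> str:
    """Ask an MLLM to retrospectively justify a known outcome with causal reasoning paths."""
    return (
        "You are shown a historical time-series chart.\n"
        f"Chart summary: {chart_description}\n"
        f"The realized future outcome was: {true_outcome}\n\n"
        "Explain, as a bulleted list of 'Observation -> Implication' steps, "
        "which patterns in the chart causally point toward this outcome. "
        "Do NOT state the outcome itself anywhere in your answer."
    )

# Hypothetical usage over a training set to build the rationale base:
# rationale_base = [
#     call_mllm(image=chart, prompt=build_abductive_prompt(desc, outcome))
#     for chart, desc, outcome in training_set
# ]
```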
2. Hybrid Retrieval (The clever part)
When faced with a new query, RationaleTS does not retrieve similar samples. It retrieves similar reasoning paths, using a hybrid similarity score:
- Data-centric similarity: temporal embeddings from TabPFN capture cross-variable dynamics.
- Semantic similarity: text embeddings align the query’s summarized patterns with stored rationales.
The final score blends both:
$$ \text{Sim}_{final} = \lambda \cdot \text{Sim}_{data} + (1-\lambda) \cdot \text{Sim}_{semantic} $$
This avoids the classic failure where two time series look similar numerically but arise from completely different mechanisms.
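A minimal sketch of the blended score, assuming both components are cosine similarities over precomputed embeddings. The embedding inputs stand in for TabPFN-derived temporal features and a generic text encoder; the dictionary fields and function names are illustrative, not the paper's implementation.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def hybrid_score(query_data_emb: np.ndarray,
                 query_text_emb: np.ndarray,
                 stored_data_emb: np.ndarray,
                 stored_text_emb: np.ndarray,
                 lam: float = 0.5) -> float:
    """Sim_final = lam * Sim_data + (1 - lam) * Sim_semantic."""
    sim_data = cosine(query_data_emb, stored_data_emb)        # cross-variable dynamics
    sim_semantic = cosine(query_text_emb, stored_text_emb)    # summarized-pattern text
    return lam * sim_data + (1 - lam) * sim_semantic

def retrieve_top_k(query_embs, rationale_base, k=3, lam=0.5):
    """Rank stored rationales by the blended similarity and keep the top K."""
    q_data, q_text = query_embs
    scored = [
        (hybrid_score(q_data, q_text, r["data_emb"], r["text_emb"], lam), r)
        for r in rationale_base
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:k]]
```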
3. Rationale-Grounded In-Context Inference
The model receives:
- The new chart
- Top-K retrieved rationales
It is explicitly instructed to reason with these paths before predicting. Reasoning is generated first; the final prediction follows.
This ordering turns out to matter.
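Assembled end to end, the inference step might look like the sketch below. The instruction wording and the `call_mllm` placeholder are assumptions; what it illustrates is the ordering constraint the paper emphasizes: grounded reasoning is produced first, the prediction last.

```python
def build_inference_prompt(retrieved_rationales: list[str]) -> str:
    """Ground the new query in retrieved reasoning paths, reasoning before predicting."""
    paths = "\n\n".join(
        f"Reasoning path {i + 1}:\n{r}" for i, r in enumerate(retrieved_rationales)
    )
    return (
        "You are shown a new time-series chart (attached as an image).\n\n"
        "Here are reasoning paths from similar historical situations:\n"
        f"{paths}\n\n"
        "First, reason step by step about the new chart using these paths, "
        "adapting or discarding any that do not apply. Only after that reasoning, "
        "state your final prediction on its own line."
    )

# Hypothetical usage, combining the earlier sketches:
# prompt = build_inference_prompt([r.to_prompt_text() for r in retrieve_top_k(...)])
# answer = call_mllm(image=new_chart, prompt=prompt)
```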
Findings — Results that actually move the needle
Across finance, traffic, and power datasets, RationaleTS consistently outperforms:
- Pure LLM-based time-series models
- Multimodal zero-shot inference
- CoT and exemplar-based ICL
Performance Snapshot
| Dataset | Metric | Best Baseline | RationaleTS |
|---|---|---|---|
| Finance | F1 | 66.53 | 69.76 |
| Power | F1 | 70.60 | 71.50 |
| Traffic | F1 | 62.23 | 66.21 |
Efficiency matters too. RationaleTS achieves higher accuracy with fewer input tokens than methods that repeatedly stuff charts and labels into prompts.
More tokens, it turns out, are not a substitute for better priors.
Implications — Why this matters beyond benchmarks
This work quietly reframes how we should think about “reasoning” in LLM systems:
- **Reasoning is a retrievable asset.** Not something regenerated from scratch every prompt.
- **RAG for logic beats RAG for facts.** Especially in domains where causality matters more than trivia.
- **Time-series + LLM ≠ forecasting.** The real value lies in decision-support reasoning: trend direction, regime shifts, early warnings.
For business users, this suggests a new class of AI systems:
- Financial risk monitors grounded in historical causal patterns
- Infrastructure diagnostics that reuse past failure logic
- Operational dashboards that explain why, not just what
Conclusion — Rationality as infrastructure
RationaleTS doesn’t make models smarter. It makes them less lazy.
By forcing multimodal LLMs to anchor predictions in reusable reasoning paths, it closes the gap between explanation and inference—a gap that scaling alone has failed to solve.
If the last wave of AI was about bigger models, this one might be about better priors.
Cognaptus: Automate the Present, Incubate the Future.