Opening — Why this matters now
Everyone wants AI to explain data. Fewer people ask whether those explanations are actually true.
In finance, operations, and industrial monitoring, large language models (LLMs) are increasingly used to narrate time-series data—price movements, sensor signals, demand curves. The narrative sounds convincing. The numbers, less so.
The uncomfortable reality is simple: fluency has outpaced fidelity.
A recent paper, *LLM-as-a-Judge for Time Series Explanations*, explores a subtle but consequential shift: instead of asking LLMs to explain data, what if we asked them to judge explanations?
The answer is not just academic. It reshapes how we design AI systems for decision-making.
Background — Context and prior art
Historically, evaluating AI-generated explanations has been… lazy.
Most systems rely on:
| Method | What it Measures | Core Limitation |
|---|---|---|
| BLEU / ROUGE | Text similarity | Ignores numerical correctness |
| Embedding similarity | Semantic closeness | Cannot detect hallucinated statistics |
| NLI models | Logical entailment | Requires reference answers |
| Time-series models | Numerical patterns | Cannot evaluate language |
Notice the pattern: no method simultaneously understands numbers and language.
That gap becomes fatal in time-series explanation tasks, where correctness depends on:
- Pattern recognition (e.g., spike vs. shift)
- Numeric accuracy (e.g., % change)
- Logical grounding (e.g., causal claims)
Until now, the industry workaround has been human review. Expensive. Slow. Inconsistent.
Analysis — What the paper actually does
The paper proposes a deceptively simple idea: use LLMs as reference-free evaluators.
1. The Core Setup
Each task includes:
- A time series $T$
- A question $q$
- A candidate explanation $e$
The model outputs a score:
| Label | Meaning |
|---|---|
| 0 | Incorrect |
| 1 | Partially correct |
| 2 | Fully correct |
This is not just classification—it’s structured reasoning.
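The setup is compact enough to pin down in code. A minimal sketch of the judge's input triple and output scale, assuming Python; all names (`JudgeTask`, `Verdict`) are illustrative, not from the paper:

```python
from dataclasses import dataclass
from enum import IntEnum

class Verdict(IntEnum):
    """The three-level correctness scale the judge outputs."""
    INCORRECT = 0
    PARTIALLY_CORRECT = 1
    FULLY_CORRECT = 2

@dataclass
class JudgeTask:
    """One evaluation instance: the judge sees all three fields."""
    series: list[float]   # the time series T
    question: str         # the question q
    explanation: str      # the candidate explanation e

task = JudgeTask(
    series=[100.0, 101.2, 99.8, 140.5, 100.9],
    question="What happened around index 3?",
    explanation="There is a single upward spike of roughly 40% at index 3.",
)
# A reference-free judge maps (T, q, e) -> Verdict
# without ever seeing a gold explanation.
```

The key design point: the judge's input contains the raw data, not a reference answer, so scoring never depends on having ground-truth text.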
2. The Hidden Rubric
Instead of relying on ground-truth text, the model evaluates explanations using five implicit criteria:
| Dimension | What it checks |
|---|---|
| Data faithfulness | Does it match the actual pattern? |
| Numeric accuracy | Are calculations correct? |
| Question relevance | Does it answer the question? |
| Logical coherence | Is the reasoning consistent? |
| Unsupported claims | Any hallucinations? |
In other words, the model is forced to audit the explanation against the data—not against another piece of text.
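One way to make such a rubric operational is to bake it into the judge's prompt. A sketch, with the criteria paraphrased from the table above; the prompt wording and function names are assumptions, not the paper's exact template:

```python
# Five implicit criteria, phrased as checks the judge must run against the data.
RUBRIC = {
    "data_faithfulness": "Does the explanation match the actual pattern in the series?",
    "numeric_accuracy": "Are all stated numbers and calculations correct?",
    "question_relevance": "Does the explanation answer the question asked?",
    "logical_coherence": "Is the reasoning internally consistent?",
    "unsupported_claims": "Does it assert anything the data does not support?",
}

def build_judge_prompt(series, question, explanation):
    """Assemble a reference-free judging prompt around the five criteria."""
    criteria = "\n".join(f"- {name}: {check}" for name, check in RUBRIC.items())
    return (
        f"Time series: {series}\n"
        f"Question: {question}\n"
        f"Candidate explanation: {explanation}\n\n"
        f"Audit the explanation against the data on these criteria:\n{criteria}\n"
        "Return a single score: 0 (incorrect), 1 (partially correct), 2 (fully correct)."
    )
```

Note that the prompt includes the series itself, which is what forces the audit-against-data behavior rather than text-against-text comparison.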
3. Multi-Task Evaluation Framework
The authors don’t stop at scoring. They test four distinct capabilities:
| Task | What it measures | Difficulty |
|---|---|---|
| Generation | Can the model explain? | High |
| Ranking | Can it pick the best explanation? | Medium |
| Scoring | Can it label correctness? | Medium |
| Anomaly detection | Can it reason directly from data? | High |
This separation is crucial. Most systems conflate generation and evaluation. The paper deliberately splits them.
Findings — Results with visualization
The results are… mildly embarrassing for generation.
1. Generation Performance Is Fragile
| Query Type | Best Accuracy |
|---|---|
| Structural Break | ~0.96 |
| Linear Spike | ~0.94 |
| Mean Shift | ~0.96 |
| Seasonal Drop | ~0.00–0.12 |
| Volatility Shift | 0.00 |
Yes—zero.
LLMs completely fail to explain volatility changes.
Not degrade. Not struggle. Fail.
2. Evaluation Performance Is Surprisingly Stable
| Task | Typical Accuracy |
|---|---|
| Ranking | Up to ~0.96 |
| Independent Scoring | ~0.70–0.96 |
Even more interesting:
Models can correctly identify the best explanation—even when they cannot generate it themselves.
That’s not a bug. That’s a structural asymmetry.
3. Generation vs Evaluation Gap
| Capability | Behavior |
|---|---|
| Generation | Pattern-dependent, brittle |
| Evaluation | Consistent, robust |
Think of it this way:
- Writing is hard.
- Critiquing is easier.
LLMs, it turns out, behave exactly like humans in this regard—just less self-aware.
4. Anomaly Detection: High Recall, Poor Discipline
| Metric | Observation |
|---|---|
| Recall | High (finds most anomalies) |
| Precision | Low (adds false positives) |
| Count accuracy | Poor |
Translation: the model is paranoid.
It would rather over-detect than miss something—useful in risk systems, problematic in operations.
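The recall/precision asymmetry is easy to see with a toy example (the numbers here are purely illustrative, not the paper's):

```python
def precision_recall(predicted, actual):
    """Precision and recall for predicted vs. true anomaly indices."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# An over-eager detector: catches both real anomalies, plus three false alarms.
p, r = precision_recall(predicted=[3, 7, 12, 20, 31], actual=[7, 31])
# recall = 2/2 = 1.0 (nothing missed), precision = 2/5 = 0.4 (noisy)
```

That trade-off is exactly why the same behavior reads as prudence in a risk system and as alert fatigue on an ops floor.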
Implications — Next steps and significance
1. Stop Treating LLMs as Single-Role Systems
The industry default is: one model, one task.
This paper suggests a better architecture:
| Role | Best Use of LLM |
|---|---|
| Generator | Draft explanations |
| Evaluator | Verify correctness |
In other words: separate thinking from checking.
2. Build “Dual-Agent” Systems
A practical system might look like:
- Model A generates explanation
- Model B evaluates it using a rubric
- Optional: Model C resolves disagreements
This is not redundancy—it’s reliability engineering.
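The three-step loop above can be sketched as a small pipeline. `generate` and `judge` are stand-ins for two separate LLM calls; the interface is a hypothetical illustration of the architecture, not an API from the paper:

```python
def dual_agent_explain(series, question, generate, judge, max_attempts=3):
    """Model A drafts; Model B scores 0/1/2; only fully correct drafts ship.

    `generate(series, question) -> str` and
    `judge(series, question, draft) -> int` are assumed LLM wrappers.
    """
    best_score, best_draft = -1, None
    for _ in range(max_attempts):
        draft = generate(series, question)
        score = judge(series, question, draft)  # 0, 1, or 2
        if score == 2:
            return draft, score                 # fully correct: accept
        if score > best_score:
            best_score, best_draft = score, draft
    # Below threshold after all retries: return best draft for escalation
    # (the optional Model C, or a human reviewer).
    return best_draft, best_score
```

The retry budget is the reliability-engineering knob: raising `max_attempts` trades latency and cost for a higher chance of a fully verified explanation.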
3. Evaluation Is Closer to Governance Than Generation
From a business perspective, the real value is not generation.
It’s assurance.
- Can we trust the explanation?
- Can we detect hallucinated numbers?
- Can we audit reasoning automatically?
That’s governance infrastructure, not UX.
4. Weakness in Higher-Order Statistics Is a Red Flag
The consistent failure on volatility shifts reveals something deeper:
LLMs struggle with second-order properties (variance, distribution changes), not just first-order patterns.
For finance, this is non-trivial.
Volatility is not a feature. It is the system.
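What makes volatility hard is that it is invisible in the level of the series and only appears in a second-order summary. A small numeric sketch of the distinction, using only the standard library:

```python
import statistics

def rolling_std(series, window):
    """Standard deviation over a sliding window: a second-order summary."""
    return [statistics.stdev(series[i:i + window])
            for i in range(len(series) - window + 1)]

# The mean stays near zero throughout, but variance jumps halfway through:
calm = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1]
wild = [2.0, -2.0, 1.5, -1.5, 2.0, -2.0]
vols = rolling_std(calm + wild, window=4)
# Rolling std in the second half is roughly 20x the first half,
# while the series level barely moves: exactly the signal the models miss.
```

A first-order reader sees a flat series; a second-order reader sees a regime change. The paper's results suggest LLMs are still mostly first-order readers.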
Conclusion — Wrap-up
The paper makes a quiet but important point:
LLMs are not yet reliable narrators of data—but they are becoming competent auditors.
That distinction matters.
In practice, the future of AI systems in data-heavy domains will likely not be a single “smart” model, but a structured ecosystem of roles:
- One model to speak
- One model to verify
- One system to arbitrate
If generation is creativity, evaluation is control.
And in business, control usually wins.
Cognaptus: Automate the Present, Incubate the Future.