Seeing Is Judging: Why LLMs Are Better Critics Than Creators in Time-Series Reasoning

A dashboard says revenue has “stabilized.” A monitoring agent says a sensor spike is “temporary.” A trading assistant says volatility has “fallen after the regime shift.”

The sentence is smooth. The chart is nearby. The user is tired. That is usually enough for a bad explanation to survive.

This is the quiet problem behind AI-assisted analytics: not whether a language model can write a plausible story about time-series data, but whether the story is faithful to the numbers. A recent paper, LLM-as-a-Judge for Time Series Explanations, studies exactly this gap by asking models to play two different roles: narrator and critic.¹

The result is useful because it does not flatter the model. Open-ended explanation generation is uneven. Sometimes it works; sometimes it collapses on statistically simple-looking but cognitively awkward patterns. Yet the same family of models can often rank or score candidate explanations more reliably when guided by explicit rubrics.

That distinction matters for business systems. It suggests that the first serious use of LLMs in time-series analytics may not be fully automated storytelling. It may be automated checking.

Not glamorous. Very useful. A familiar pattern.

The real comparison is narrator versus auditor

The obvious way to use an LLM with time-series data is to ask it to explain what happened.

Give it a series. Ask a question. Receive a paragraph.

That workflow is convenient, but it hides two tasks inside one interface. The first task is generation: producing an explanation. The second is evaluation: checking whether an explanation is correct. The paper’s central contribution is to separate these two roles and test them under the same controlled benchmark.

The authors build a reference-free evaluation setting. Instead of comparing a generated answer to a gold textual answer, the model is asked to evaluate whether a candidate explanation is consistent with the raw numerical series. That matters because two correct explanations may use different wording, while one fluent explanation may contain a wrong index, a wrong percentage change, or an invented trend. BLEU, ROUGE, embedding similarity, and ordinary text entailment are not built to catch that kind of error. They judge words against words; the problem here is words against data.

The proposed judging task is ternary:

Label	Meaning	Practical interpretation
0	Incorrect	The explanation contradicts the series or fails the question.
1	Partially correct	The pattern is partly right, but a numeric claim, location, or detail is wrong.
2	Fully correct	The explanation identifies the relevant pattern and supports it accurately.

The rubric behind this label is broader than a single arithmetic check. The evaluator is prompted to consider data faithfulness, numeric accuracy, question relevance, logical coherence, and unsupported claims. In plain business language: does the explanation point to the right event, compute the right thing, answer the actual question, and avoid adding decorative nonsense?

That is a better frame than “Can LLMs understand time series?” The paper asks a more operational question: where in the analytics workflow should the model sit?
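The ternary label and the five rubric dimensions described above can be pictured as a small data structure. The sketch below is a hypothetical illustration, not the paper's prompt or output schema: the field names (`label`, `failed_dimensions`, `rationale`) and the helper `parse_verdict` are assumptions made for this example.

```python
import json
from dataclasses import dataclass, field
from enum import IntEnum


class Label(IntEnum):
    INCORRECT = 0
    PARTIALLY_CORRECT = 1
    FULLY_CORRECT = 2


# The five rubric dimensions named in the text (identifier spellings are assumed).
RUBRIC = (
    "data_faithfulness",
    "numeric_accuracy",
    "question_relevance",
    "logical_coherence",
    "unsupported_claims",
)


@dataclass
class Verdict:
    label: Label
    failed_dimensions: list = field(default_factory=list)
    rationale: str = ""


def parse_verdict(raw_json: str) -> Verdict:
    """Parse a judge's JSON reply, rejecting anything outside the rubric."""
    data = json.loads(raw_json)
    # Label(...) raises ValueError for any value other than 0, 1, or 2.
    label = Label(int(data["label"]))
    failed = list(data.get("failed_dimensions", []))
    unknown = [d for d in failed if d not in RUBRIC]
    if unknown:
        raise ValueError(f"unknown rubric dimensions: {unknown}")
    return Verdict(label, failed, str(data.get("rationale", "")))
```

Parsing strictly is a design choice: a judge reply that names a label or dimension outside the rubric is treated as a failure of the judge, not as a verdict on the explanation.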

TSQueryBench makes the model face controlled mistakes

The benchmark, TSQueryBench, contains 350 synthetic univariate time-series instances across seven query types. Each type has 50 instances, and sequence lengths vary across 100, 200, 300, and 500 time steps. For every instance, the authors construct three candidate explanations: one correct, one partially correct, and one incorrect. This yields 1,050 explanation instances for independent scoring.

The seven query types are not random decorations. They probe different reasoning demands:

Query type	What the model must notice	Why it is business-relevant
Linear spike	A localized transient anomaly	Fraud, incident detection, sensor alerts
Seasonal drop	A patterned downward deviation	Demand cycles, staffing, seasonal operations
Structural break	A persistent level shift	Policy changes, pricing changes, regime shifts
Multi-metric consistency	Several claims must agree	KPI dashboards and cross-checking reports
Relative extremum	A highest or lowest point	Peak demand, bottoming signals, capacity planning
Mean shift	Change in average level	Product launch impact, process drift, trend breaks
Volatility shift	Change in variance	Risk, uncertainty, market turbulence, unstable processes

The design is synthetic, which limits direct claims about messy real-world deployments. But it also gives the paper something real-world anecdotes often lack: controlled wrong answers. A business QA system cannot merely ask whether an explanation sounds reasonable. It must distinguish “right event, wrong number” from “wrong event, confident tone.” TSQueryBench is built around that distinction.
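To make “controlled wrong answers” concrete, here is a minimal sketch of how one linear-spike instance with three labeled candidates might be constructed. It is not the authors' generator: the function name `make_spike_instance`, the trend parameters, and the candidate wording are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    label: int  # 2 = fully correct, 1 = partially correct, 0 = incorrect


def make_spike_instance(length=100, spike_index=40, spike_height=25.0):
    """Build a linear series with one spike and three labeled candidates."""
    # Baseline: a gentle linear trend, y = 10 + 0.1 * t (assumed values).
    series = [10.0 + 0.1 * t for t in range(length)]
    baseline = series[spike_index]
    series[spike_index] = baseline + spike_height
    pct = 100.0 * spike_height / baseline  # true jump relative to the trend

    # A deliberately wrong location, half a series away from the real spike.
    wrong_index = (spike_index + length // 2) % length
    candidates = [
        Candidate(f"Spike at index {spike_index}, about {pct:.0f}% above trend.", 2),
        # Right event, wrong number.
        Candidate(f"Spike at index {spike_index}, about {pct / 2:.0f}% above trend.", 1),
        # Wrong event, confident tone.
        Candidate(f"Clear spike at index {wrong_index}; the rest is stable.", 0),
    ]
    return series, candidates
```

Because the wrong answers are built by perturbing a known ground truth, each candidate's label is known by construction rather than by human annotation.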

The paper then tests three open-weight models: Qwen-3 8B, LLaMA-3.1 8B, and Gemma-2 9B IT. The models are evaluated zero-shot, with standardized structured prompts and JSON-constrained outputs.

This setup is not a claim that one 8B-class model is now a universal analytics referee. It is a test of role separation: what changes when the model judges an explanation rather than generating one?

The four experiments are not four separate stories

The paper’s experimental design is easiest to understand as a ladder of responsibility.

Experiment	Likely purpose	What it supports	What it does not prove
Explanation generation	Main evidence for open-ended narration quality	Whether models can produce faithful explanations from raw series and question	That a fluent answer is safe for deployment
Relative ranking	Main evidence for comparative judging	Whether models can select the best explanation among correct, partial, and wrong candidates	That the model can independently certify any single answer
Independent scoring	Main evidence for reference-free auditing	Whether models can label one candidate explanation without seeing alternatives	That the scoring is calibrated enough for high-stakes automation
Multi-anomaly detection	Exploratory stress test of direct numerical reasoning	Whether models can locate many anomalies without given thresholds	That LLMs can replace statistical anomaly detectors

This classification also tells us how to read the appendix. Its examples are not a second thesis; they illustrate the mechanism behind the results. The model may fail to generate the correct volatility explanation, then correctly rank a provided volatility explanation when the right candidate is placed next to weaker alternatives.

That is the whole business lesson in miniature. The model is not necessarily good at discovering the answer from scratch. It may still be useful at checking whether a proposed answer survives a rubric.

Generation fails unevenly, not randomly

The generation results are the least flattering part of the paper, which makes them the most useful.

In Table 1, Qwen-3 8B performs strongly on several query types: 0.94 on linear spike, 0.96 on structural break, and 0.96 on mean shift. LLaMA-3.1 8B also reaches 0.94 on structural break. These are not trivial tasks; they require the model to connect the question to a visible property of the series and produce a grounded explanation.

But the same table also shows sharp fragility:

Query type	LLaMA-3.1 8B	Gemma-2 9B	Qwen-3 8B	Interpretation
Linear spike	0.70	0.50	0.94	Local anomaly detection is relatively tractable.
Structural break	0.94	0.90	0.96	Persistent level changes are comparatively easy.
Mean shift	0.28	0.15	0.96	Strong model dependence; Qwen dominates.
Relative extremum	0.36	0.35	0.08	Finding the right maximum/minimum is surprisingly brittle.
Volatility shift	0.00	0.00	0.00	All three fail on variance-change explanation generation.

There is also a small but important nuance. The abstract presents seasonal drop and volatility shift as difficult and cites very low accuracies for seasonal drop. Table 1 gives a more model-dependent picture: seasonal drop is 0.12 for LLaMA and 0.00 for Gemma, but 0.82 for Qwen. So the careful interpretation is not “seasonality is always impossible.” It is that certain query types are highly model-sensitive, while volatility shift is the consistent failure case across all three tested models.

That failure is not a cosmetic issue. Volatility is a second-order property. It is about variability, not just level. A model can often see a spike or a mean change because the answer is visually and linguistically local: “this point jumps,” “the series shifts upward.” Volatility requires comparing dispersion across segments. It asks the model to reason about the spread of values, not just the values themselves.

For finance, operations, and industrial monitoring, this distinction is large. A system that can describe a price jump but cannot correctly explain volatility change is not a risk analyst. It is a caption writer with a calculator-shaped hat.
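The point that volatility means comparing dispersion across segments, not comparing levels, is easy to show numerically. The sketch below uses only the Python standard library; the function name `volatility_ratio` and the toy series are assumptions for illustration, and the split index is given rather than detected.

```python
import statistics


def volatility_ratio(series, split):
    """Return std(after) / std(before) for a series split at index `split`."""
    before, after = series[:split], series[split:]
    if len(before) < 2 or len(after) < 2:
        raise ValueError("each segment needs at least two points")
    sd_before = statistics.stdev(before)
    sd_after = statistics.stdev(after)
    if sd_before == 0:
        raise ValueError("first segment has zero variance")
    return sd_after / sd_before


# Both halves have mean 10, but the second half swings four times as widely.
calm = [10 + d for d in (1, -1) * 10]
wild = [10 + d for d in (4, -4) * 10]
series = calm + wild

print(statistics.mean(calm), statistics.mean(wild))
print(volatility_ratio(series, split=len(calm)))
```

In this toy series the mean never moves, so a description based on level alone would report nothing, while the ratio of standard deviations comes out at 4 (up to floating-point rounding).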

Ranking works because the candidate answers narrow the search

The relative ranking task changes the job. The model no longer has to produce the full explanation. It receives three candidates: correct, partially correct, and incorrect. It must choose the best one.

This is where the generation-evaluation asymmetry appears.

Qwen-3 8B reaches 0.96 on linear spike, 0.92 on seasonal drop, 0.94 on structural break, 0.94 on multi-metric consistency, and 0.94 on mean shift. Even on volatility shift, where all models scored 0.00 in generation, Qwen reaches 0.62 in relative ranking. Not excellent, but very different from total collapse.

The mechanism is intuitive. Generation requires the model to decide what matters, compute or estimate the relevant quantity, formulate a response, and avoid unsupported claims. Ranking supplies candidate hypotheses. The model can compare those hypotheses against the series and the rubric. That reduces the search space.

In human terms: writing the correct diagnosis is harder than recognizing the best diagnosis from a shortlist. Medical exams, legal review, investment committee memos, and code reviews all exploit this asymmetry. The paper shows the same pattern in time-series explanation.

This does not mean ranking is solved. LLaMA and Gemma are weaker and more uneven. Relative extremum remains awkward: Qwen reaches only 0.38, while Gemma reaches 0.70. Long inputs also create practical constraints; the paper reports NA for Gemma on longer sequence lengths because the model could not process those inputs due to token limits.

Still, the direction is clear: comparative evaluation is often more reliable than open-ended narration.

Independent scoring is closer to a real QA gate

Ranking is useful, but many business workflows do not present three explanations and ask the system to choose. They generate one explanation and need to know whether it passes.

That is why the independent scoring task is more operationally important. The model receives one candidate explanation and assigns label 0, 1, or 2. This resembles a QA gate for dashboard commentary, analyst notes, incident reports, or automated alerts.

The results again favor evaluation over generation on the categories where generation struggled most, especially for Qwen:

Query type	Qwen-3 8B independent scoring	Business reading
Linear spike	0.96	Strong for localized anomaly explanation checks.
Seasonal drop	0.82	Useful, but model-specific.
Structural break	0.75	Decent but weaker than generation on this category.
Multi-metric consistency	0.65	Multi-claim checking remains difficult.
Relative extremum	0.45	Weak; maxima/minima localization is fragile.
Mean shift	0.91	Strong for average-level change checks.
Volatility shift	0.72	Much better than generation, but not safe enough alone.

This is the most interesting business result in the paper. Volatility shift generation is 0.00 across models, but Qwen’s independent scoring on volatility shift is 0.72. The model cannot reliably write the volatility explanation from scratch, yet it can often judge whether a provided volatility explanation is correct.

That should change how teams design AI analytics systems.

A naive design says:

Time series -> LLM -> published explanation

A more serious design says:

Time series -> statistical feature extraction -> draft explanation -> rubric-guided LLM judge -> publish / revise / escalate

The second design does not treat the LLM as an oracle. It treats the LLM as a language-aware auditor sitting beside statistical checks. Less magical. More deployable.
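The second design can be sketched as a short control loop. Everything here is an assumption for illustration: `draft` and `judge` are placeholder callables standing in for whatever generator and rubric-guided judge a team plugs in, and the policy of escalating a label of 0 immediately is one possible choice, not something the paper prescribes.

```python
def decide(label, revisions_left):
    """Map a ternary judge label to a pipeline action."""
    if label == 2:
        return "publish"
    if label == 1 and revisions_left > 0:
        return "revise"
    # Label 0, or label 1 with no retries left: hand over to a human.
    return "escalate"


def review_loop(series, question, draft, judge, max_revisions=2):
    """Draft, judge, and revise until the explanation passes or is escalated.

    draft(series, question, feedback) -> explanation text
    judge(series, question, explanation) -> label 0, 1, or 2
    """
    if max_revisions < 0:
        raise ValueError("max_revisions must be non-negative")
    feedback = ""
    history = []  # kept so that judge decisions can be logged and sampled
    for revisions_left in range(max_revisions, -1, -1):
        explanation = draft(series, question, feedback)
        label = judge(series, question, explanation)
        action = decide(label, revisions_left)
        history.append((explanation, label, action))
        if action != "revise":
            return action, explanation, history
        feedback = "Judged partially correct; re-check the numeric claims."
    raise AssertionError("unreachable: the final pass never returns 'revise'")
```

Returning the full history alongside the action keeps the judge auditable: every draft, label, and decision is available for later review.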

Multi-anomaly detection shows sensitivity without discipline

The fourth experiment asks models to identify all anomalies in a series and quantify percentage changes. This is not just explanation checking. It is direct numerical reasoning over the series, without explicit anomaly thresholds.

The results are mixed in a very revealing way. Count accuracy is low across sequence lengths. Qwen reaches count accuracy between 0.08 and 0.16 depending on length; LLaMA is mostly 0.00 to 0.04; Gemma is available only for length 100, where it reaches 0.04. Yet F1 scores are moderate: for Qwen, from 0.455 to 0.585 across tested lengths; for Gemma at length 100, 0.619.

The appendix example makes the pattern concrete. In one multi-anomaly instance with 9 ground-truth anomalies, the model predicts 11 anomalies. It localizes all 9 true anomalies but adds 2 false positives. The count is wrong; recall is high; precision suffers.
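The arithmetic behind that example is worth spelling out. The sketch below assumes the standard definitions of precision, recall, and F1 over matched anomaly indices; the paper's exact matching rule may differ, and the function name `detection_scores` is hypothetical.

```python
def detection_scores(true_positives, predicted, actual):
    """Return (precision, recall, f1) from raw detection counts."""
    if predicted == 0 or actual == 0:
        raise ValueError("need at least one predicted and one actual anomaly")
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Appendix example: 9 true anomalies, 11 predicted, all 9 true ones found.
precision, recall, f1 = detection_scores(true_positives=9, predicted=11, actual=9)
count_is_exact = (11 == 9)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} count_exact={count_is_exact}")
```

Under these assumptions precision is 9/11, roughly 0.818, recall is 1.0, and F1 is 0.9, while the exact-count check fails. That is how a model can score zero on count accuracy for an instance and still post a respectable F1.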

The model is sensitive, but undisciplined.

That is not automatically bad. In safety monitoring, over-detection may be acceptable if a downstream filter exists. In operations planning, over-detection can create alert fatigue. In finance, it depends on whether the signal triggers a warning, a trade, or a compliance report. One false positive in a dashboard is a nuisance. One false positive in an automated order system is a bill.

The paper’s anomaly result should therefore not be read as “LLMs can detect anomalies.” A stricter reading is better: LLMs can often notice candidate deviations, but they are poor at threshold calibration when left to infer the rule themselves.
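By contrast, an explicit rule leaves nothing for the model to infer. The sketch below is not from the paper; it swaps in a common statistical convention, the modified z-score based on the median and the median absolute deviation (MAD), with the threshold stated up front. The function name `flag_anomalies` and the toy data are assumptions.

```python
import statistics


def flag_anomalies(series, threshold=3.5):
    """Flag indices whose modified z-score exceeds an explicit threshold.

    The median and MAD are used so that the anomalies themselves do not
    inflate the scale estimate, as they would with a mean and standard
    deviation.
    """
    median = statistics.median(series)
    mad = statistics.median([abs(x - median) for x in series])
    if mad == 0:
        # Simplification for this sketch: with no measurable spread,
        # nothing is flagged. A production rule would need a fallback.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a standard
    # z-score when the data are roughly normal.
    return [
        i for i, x in enumerate(series)
        if 0.6745 * abs(x - median) / mad > threshold
    ]


readings = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 40,
            10, 11, 9, 10, 12, 8, 10, 11, -20]
print(flag_anomalies(readings))
```

On this toy series the rule flags only the two obvious outliers, at indices 10 and 19. The threshold is a parameter someone chose and can defend, which is exactly what a model inferring its own rule cannot offer.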

Why judging is easier than explaining

The paper does not need a mystical theory of intelligence to explain the asymmetry. Three mechanisms are enough.

First, candidate explanations compress the hypothesis space. When generating, the model must discover the relevant pattern and phrase it correctly. When judging, it can inspect a proposed claim: “the break happens around index X,” “the percentage change is Y,” “the volatility decreases after Z.” Each claim becomes checkable.

Second, the rubric turns vague quality into separable failure modes. A bad explanation may be wrong because it misidentifies the event, because its number is wrong, because it answers the wrong question, or because it invents a claim. The ternary label is simple, but the prompting structure encourages decomposition.

Third, evaluation tolerates less creativity. Generation rewards fluency and completion. Evaluation rewards constraint. For business analytics, constraint is often the point. A model that says “I cannot verify this claim” is less charming than one that produces a polished paragraph, but charm is not a control system.

This is also why the result is not merely “use two LLMs.” The deeper point is architectural: separate the role that produces language from the role that verifies language against evidence.

The business value is assurance, not prettier commentary

The paper directly shows that, on a synthetic benchmark of univariate time-series questions, rubric-guided LLM evaluation can be more stable than open-ended explanation generation. It does not directly show production ROI, reduced analyst workload, or safe deployment in regulated environments.

Those are Cognaptus-level inferences, and they need boundaries.

The practical pathway looks like this:

Business workflow	Direct paper result used	Operational implication	Boundary
BI dashboard commentary	LLMs can score candidate explanations against raw series	Add an explanation QA layer before publishing narrative summaries	Synthetic benchmark; real dashboards have messy metadata and multiple series
Financial monitoring	Generation fails on volatility, scoring performs better	Use symbolic volatility checks plus LLM narrative auditing	Do not let the LLM define risk thresholds alone
Industrial alerts	Models over-detect anomalies but often localize true ones	Use LLMs to review and explain statistically detected alerts	Alert thresholds should remain statistical or rule-based
Healthcare signal review	Rubric can check numeric and pattern faithfulness	Use as a second-pass reviewer for clinician-facing summaries	Human review remains essential in high-stakes interpretation
Automated report generation	Ranking/scoring can reject weak explanations	Build generator-judge-revise loops	Judge errors must be logged and sampled, not blindly trusted

The most attractive near-term use is not replacing analysts. It is catching bad paragraphs before they become organizational memory.

An LLM-generated report can be wrong in a way that looks finished. That is dangerous because finished prose reduces scrutiny. A rubric-guided judge reintroduces friction at the right place. It asks: does this sentence actually survive contact with the data?

That is governance infrastructure. Less cinematic than an autonomous analyst, but far more useful.

A practical architecture: generate, verify, then decide

A reasonable production pattern would not copy the paper verbatim. It would combine statistical checks, generated explanations, and LLM-based judging.

One possible architecture:

Stage	Component	Role
1	Time-series feature extractor	Compute spikes, breakpoints, rolling mean, variance, and anomaly candidates.
2	Explanation generator	Draft a plain-language explanation using computed features and chart context.
3	Rubric-guided LLM judge	Check data faithfulness, numeric accuracy, relevance, coherence, and unsupported claims.
4	Symbolic validator	Verify exact arithmetic, indices, units, thresholds, and business rules.
5	Decision layer	Publish, revise, suppress, or escalate to a human reviewer.

The judge should not be asked to do everything. It is best used where language and evidence meet: detecting whether a written explanation is faithful to the data and the question.

Exact calculations should be done outside the model. Threshold rules should be explicit. Long multivariate context should be summarized into structured features before the judge sees it. And judge outputs should be auditable: label, rationale, failed rubric dimension, confidence if available, and the exact claims being disputed.

In other words, the LLM judge is not a priest. It is a compliance clerk with unusually good reading comprehension. Treat it accordingly.
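The advice to keep exact calculations outside the model can be as simple as recomputing any number an explanation asserts. The sketch below is a hypothetical symbolic check for one kind of claim, a percentage change between two indices; the names `ClaimCheck` and `check_pct_change` and the one-point default tolerance are assumptions, and extracting the claim from the text is left out.

```python
from dataclasses import dataclass


@dataclass
class ClaimCheck:
    claim: str
    claimed_pct: float
    computed_pct: float
    passed: bool


def check_pct_change(series, start, end, claimed_pct, tolerance=1.0):
    """Recompute a percentage change and compare it with the claimed value.

    `tolerance` is measured in percentage points.
    """
    base = series[start]
    if base == 0:
        raise ValueError("percentage change is undefined for a zero base")
    computed = 100.0 * (series[end] - base) / abs(base)
    return ClaimCheck(
        claim=f"change from index {start} to {end} is {claimed_pct}%",
        claimed_pct=claimed_pct,
        computed_pct=computed,
        passed=abs(computed - claimed_pct) <= tolerance,
    )


series = [100, 120, 90]
print(check_pct_change(series, 0, 1, claimed_pct=20))
print(check_pct_change(series, 1, 2, claimed_pct=-20))
```

The first claim passes, because the recomputed change is 20%. The second fails, because the drop from 120 to 90 is 25%, not 20%. The returned record carries the disputed claim and both numbers, which is the kind of auditable output the judge stage should also produce.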

The limits are small enough to be useful and large enough to matter

The paper’s boundaries are not minor footnotes.

First, the benchmark is synthetic. That is useful for controlled reasoning, but real time-series data carry missing values, calendar effects, regime changes, metadata problems, business definitions, and multiple interacting signals. A clean univariate sequence is not a messy enterprise dashboard.

Second, the series are univariate. Many business explanations are multivariate by nature: revenue moved because price, promotion, seasonality, supply, and competitor behavior changed together. Evaluating a sentence about one series is easier than evaluating a causal explanation across several related metrics.

Third, the models are relatively small open-weight models. This is good for practical relevance, but it also means the results should not be generalized to all frontier systems without testing. Larger models may improve some tasks; they may also fail differently.

Fourth, volatility and anomaly thresholds remain weak points. This is the paper’s most important warning for risk-heavy domains. A model that can judge a candidate volatility explanation at 0.72 accuracy is useful as a filter. It is not a standalone risk engine.

Fifth, reference-free does not mean truth-free. The absence of a gold textual answer makes the setting more realistic, but correctness still depends on reliable numerical grounding. In production, that grounding should come from computed features and validated data pipelines, not from the model’s internal arithmetic instincts.

These limitations do not erase the result. They tell us where to put the result.

The useful lesson is role separation

The old dream was simple: feed data to an LLM and get a trustworthy explanation.

The paper suggests a better, less naive dream: use LLMs in specialized roles, with one role drafting language and another role auditing that language against data. The judge is not perfect, but it is often better at constrained evaluation than unconstrained explanation.

That is the part businesses should remember.

Not “LLMs understand time series.” Too broad.

Not “LLMs cannot reason about numbers.” Too lazy.

The sharper lesson is this: LLMs may be more valuable as critics of data narratives than as their original authors. Especially when the workflow supplies candidate explanations, explicit rubrics, computed features, and escalation paths.

The future analytics stack will probably not be a single model confidently narrating every chart. It will be a small bureaucracy of models, checks, and validators: one system to detect, one to explain, one to judge, one to verify, and one to decide whether the paragraph deserves to exist.

A little bureaucracy is annoying. In analytics, it is also how false confidence gets taxed.

Cognaptus: Automate the Present, Incubate the Future.

¹ Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, and Dhruv Kumar, “LLM-as-a-Judge for Time Series Explanations,” arXiv:2604.02118, 2026. https://arxiv.org/abs/2604.02118
