Forecasting the Forecasters: How Hierarchical LLM Meteorologists Rewrite Weather Reasoning

Weather reports look simple only after someone has already done the hard part.

A forecast table can tell you that temperature drops, rain appears, wind direction shifts, humidity stays high, and visibility changes. That is data. A useful report tells you whether this is a mild autumn transition, a tropical shower pattern, a frontal passage, a flood warning, or merely Tuesday being dramatic again.

That gap between numbers and interpretation is the real subject of Hierarchical AI-Meteorologist: LLM-Agent System for Multi-Scale and Explainable Weather Forecast Reporting.¹ The paper is not introducing a new weather prediction model. It is not trying to beat GraphCast, Pangu-Weather, or any other machine-learning weather model on numerical accuracy. It treats forecast models and weather APIs as sources of structured tables, then asks a more operational question: how can those tables become readable, consistent, and checkable weather narratives?

That distinction matters. Many business users do not consume weather as raw forecast grids. They consume it as an operational sentence: delay outdoor work, prepare drainage crews, shift delivery windows, reduce energy demand assumptions, warn coastal communities, or do nothing because the sky is only being moody for attention.

The paper’s answer is a hierarchical LLM-agent pipeline. It converts hourly forecasts into daily and six-hour aggregates, asks a “Meteorologist” agent to reason over those scales, extracts weather keywords, attaches a proof block, and passes the result to a “Writer” agent for domain-specific presentation. The clever part is not that an LLM writes prose. We have all seen enough fluent nonsense to stop clapping for grammar. The clever part is that the system tries to make every important phrase in the prose accountable to visible patterns in the data.

The paper is about interpreting forecasts, not forecasting the weather

The easiest misunderstanding is also the most damaging one: “AI meteorologist” sounds like a model that predicts the weather.

Here, it means something narrower and more useful for many enterprises. The system starts after the forecast exists. Its inputs include location metadata, climatological context, and hourly forecast tables from OpenWeather, with Meteostat or ERA5-style climatology used for background comparison. The paper also discusses operational texts such as NOAA Area Forecast Discussions as a weak consistency reference for U.S. contexts, though the core demonstration does not depend on matching those texts.

The pipeline is therefore closer to an automated forecaster’s discussion generator than to a numerical weather prediction engine. It sits between weather data providers and decision-makers.

That placement changes how we should judge the work. The question is not: “Did it predict rain better?” The question is: “Given forecast tables, did it describe the event at the right temporal scale, choose sensible weather concepts, avoid false warnings, and show enough evidence for a human to check the claim?”

The paper evaluates exactly that style of output quality: consistency between summaries and tabular aggregates, alignment between keywords and observed patterns, and adequacy of proof or warnings relative to forecast data and climatology.

The mechanism starts by stopping hourly data from bullying the model

Hourly forecast tables are useful and irritating for the same reason: they contain detail.

For a short horizon, hourly rows can reveal timing. For a five- or six-day horizon, they can also drown the reader in local fluctuations. An LLM given a large hourly table may over-attend to scattered rows, describe noise as structure, or miss the daily pattern that a human forecaster would see immediately. A machine with a large context window is still perfectly capable of staring at the wrong part of the spreadsheet. Very relatable, unfortunately.

The paper’s first mechanism is hierarchical temporal aggregation. The Assistant block builds non-overlapping six-hour and daily windows. Temperature is summarized with mean, minimum, and maximum. Relative humidity and wind speed are averaged. Precipitation is summed. Wind direction is handled carefully: it is averaged on the circle for six-hour windows, but omitted from daily records to avoid misleading circular averages.
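The aggregation rules above are simple enough to sketch in a few lines. What follows is a minimal illustration, not the paper's code: the row keys (`temp_c`, `rh_pct`, `wind_ms`, `precip_mm`), the `aggregate` function, and the synthetic 48-hour series are all assumptions made for the example. Wind direction is left out here because it needs circular handling.

```python
from statistics import mean

def aggregate(rows, window_hours):
    """Collapse hourly rows into non-overlapping windows of `window_hours`.

    Each row is a dict with the hypothetical keys `temp_c`, `rh_pct`,
    `wind_ms`, and `precip_mm`. Rows are assumed to be in time order
    and spaced one hour apart.
    """
    windows = []
    for start in range(0, len(rows), window_hours):
        chunk = rows[start:start + window_hours]
        temps = [r["temp_c"] for r in chunk]
        windows.append({
            "start_hour": start,
            "temp_mean": mean(temps),      # temperature: mean, min, max
            "temp_min": min(temps),
            "temp_max": max(temps),
            "rh_mean": mean(r["rh_pct"] for r in chunk),    # averaged
            "wind_mean": mean(r["wind_ms"] for r in chunk),  # averaged
            "precip_sum": sum(r["precip_mm"] for r in chunk),  # summed
        })
    return windows

# Synthetic 48-hour series, for illustration only.
hourly = [
    {
        "temp_c": 10 + (h % 24) / 2,
        "rh_pct": 80,
        "wind_ms": 4,
        "precip_mm": 0.5 if h % 6 == 0 else 0.0,
    }
    for h in range(48)
]

six_hourly = aggregate(hourly, 6)   # eight six-hour windows
daily = aggregate(hourly, 24)       # two daily windows
```

The same function serves both scales, which keeps the six-hour and daily tables consistent with each other by construction.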

That last detail is small but revealing. Wind direction is not an ordinary scalar. Averaging 350 degrees and 10 degrees as if they were ordinary numbers gives 180 degrees, a southerly wind, when both readings sit within 10 degrees of north. The paper’s decision to keep six-hour circular direction but avoid daily circular averages is not glamorous, but operational systems often live or die on exactly this kind of boring correctness.
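The standard way to average directions is to average unit vectors and convert back to an angle. The sketch below shows that technique. The function name `circular_mean_deg` is ours, and the paper does not say which formula it uses.

```python
import math

def circular_mean_deg(directions_deg):
    """Mean of compass directions, computed on the unit circle."""
    sin_sum = sum(math.sin(math.radians(d)) for d in directions_deg)
    cos_sum = sum(math.cos(math.radians(d)) for d in directions_deg)
    # atan2 recovers the angle of the summed vector; % 360 maps it to [0, 360].
    return math.degrees(math.atan2(sin_sum, cos_sum)) % 360.0

naive = (350 + 10) / 2                    # 180.0, which points due south
circular = circular_mean_deg([350, 10])   # close to 0/360, which points due north
```

The vector mean has its own failure mode: when directions spread around much of the circle, the summed vector becomes short and the resulting angle carries little meaning. That is one plausible reason to report direction for six-hour windows and drop it from daily records.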

The result is two possible context modes:

Context mode	What it contains	Why it matters
Baseline context	Location, climatology, and full hourly table	Useful for short-range detail, but less structured for longer interpretation
Hierarchical context	Location, climatology, daily table, six-hour table, and sometimes hourly rows	Separates persistent trends from intraday variation and reduces attention bias toward noisy tables

For lead times under five days, the system may include hourly rows. For five- to ten-day horizons, it omits the hourly grid to save tokens and reduce attention bias. That is a practical design choice: do not make the model carry every row when the decision problem needs trend, event, and risk structure.
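As a rough sketch, the switch between the two horizons could be a single conditional. The function name, the dictionary keys, and the `hourly_cutoff_days` parameter are assumptions for illustration. The article only says hourly rows "may" be included at short lead times, so treat the hard cutoff as a simplification.

```python
def build_context(location, climatology, daily, six_hourly, hourly,
                  lead_days, hourly_cutoff_days=5):
    """Assemble a hierarchical context; attach hourly rows only for short leads."""
    context = {
        "location": location,
        "climatology": climatology,
        "daily": daily,
        "six_hourly": six_hourly,
    }
    if lead_days < hourly_cutoff_days:
        context["hourly"] = hourly   # short range: keep timing detail
    return context

# Placeholder inputs, for illustration only.
short_ctx = build_context({"name": "Cork"}, {}, [], [], [{"hour": 0}], lead_days=3)
long_ctx = build_context({"name": "Cork"}, {}, [], [], [{"hour": 0}], lead_days=7)
```

Here `short_ctx` carries an `hourly` key and `long_ctx` does not, which mirrors the token-saving behavior described above.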

The Meteorologist agent turns tables into four checkable fields

The first LLM stage is the Meteorologist. It receives the structured context and returns four fields:

Field	Function	Operational value
`summary`	Multi-paragraph narrative grounded in the tables	Gives decision-makers a readable account of the forecast horizon
`proof`	Compact evidence block listing observable table patterns	Makes claims easier to audit against data
`keywords`	Three to five weather descriptors	Compresses the dominant events into semantic anchors
`warnings`	Optional hazard flags with data-grounded justification	Supports risk communication without burying warnings in prose

The important move is the coupling between keywords and proof. A keyword such as `heavy_rain`, `strong_wind`, or `frontal_passage` should not float freely as decorative language. It should map back to observable aggregates: precipitation totals, wind speed changes, wind-direction shifts, humidity levels, temperature changes, or climatological deviations.

This turns the generated text into a semi-auditable object. Not fully verified, not formally proven, but less slippery than ordinary LLM prose. A human reviewer can ask: “Where did this keyword come from?” The proof block is supposed to answer.

That is the core mechanism of the paper: the LLM is not merely writing; it is writing around semantic handles that can be inspected.
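One way to make those handles inspectable is to give the four fields a schema and check the keyword-proof coupling mechanically. The sketch below is an assumption about shape, not the paper's format. In particular, it models `proof` as a mapping from keyword to evidence strings, whereas the paper describes it as a compact text block.

```python
from dataclasses import dataclass, field

@dataclass
class MeteorologistOutput:
    summary: str
    proof: dict                 # hypothetical shape: {keyword: [evidence, ...]}
    keywords: list
    warnings: list = field(default_factory=list)

    def problems(self):
        """Return a list of structural issues; empty means the fields line up."""
        issues = []
        if not 3 <= len(self.keywords) <= 5:
            issues.append("expected three to five keywords")
        for kw in self.keywords:
            if not self.proof.get(kw):
                issues.append(f"keyword without evidence: {kw}")
        return issues

# Illustrative output with one keyword that has no supporting evidence.
output = MeteorologistOutput(
    summary="Cooling through the period with light, intermittent rain.",
    proof={
        "cooling_trend": ["daily maxima decline across the window"],
        "light_rain": ["small daily precipitation totals"],
    },
    keywords=["cooling_trend", "light_rain", "frontal_passage"],
)
issues = output.problems()   # flags frontal_passage as unsupported
```

A check like this says nothing about whether the evidence is true. It only ensures that no keyword reaches the reader without something a reviewer can look up.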

The Writer agent is useful precisely because it is not supposed to invent facts

After the Meteorologist creates the structured analysis, the Writer agent adapts it to a target domain and style. The paper mentions possible users such as power engineers, urban planners, agronomists, risk analysts, and extreme-weather users.

This division of labor is sensible. In many enterprise systems, the same underlying event must be narrated differently for different users:

User type	Same weather event becomes
Emergency management	Flooding risk, warning priority, exposed districts
Energy operations	Demand shift, cooling load, wind or storm disruption
Logistics	Delay risk, outdoor handling constraints, route timing
Agriculture	Soil moisture, irrigation decisions, field access
Insurance	Exposure monitoring, claim-risk pre-alerts

But the paper also sets an important boundary: the Writer is not supposed to change the factual basis produced by the Meteorologist. Its job is formatting and domain adaptation, not meteorological reinterpretation.

That separation is a useful architectural principle for AI reporting systems beyond weather. Put factual interpretation in one stage. Put audience adaptation in another. Then expose the context used by the system, so downstream users can see whether the report came from daily, six-hour, hourly, climatological, or location data. Boring JSON fields, once again saving civilization from elegant nonsense.
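A small sketch of that separation follows. The Writer is represented by a plain `render` callable standing in for an LLM call, and the names `write_report`, `FACT_FIELDS`, and `logistics_render` are hypothetical. The guard only catches in-place changes to the structured fields. It cannot detect prose that drifts from them, which would need a separate check.

```python
import copy

FACT_FIELDS = ("summary", "proof", "keywords", "warnings")

def write_report(analysis, audience, render):
    """Call the Writer, then confirm the Meteorologist's fields are untouched."""
    facts_before = copy.deepcopy({k: analysis[k] for k in FACT_FIELDS})
    text = render(analysis, audience)
    if {k: analysis[k] for k in FACT_FIELDS} != facts_before:
        raise ValueError("Writer modified the Meteorologist's analysis")
    # Ship the untouched facts next to the adapted text so reviewers can compare.
    return {"audience": audience, "text": text, "facts": facts_before}

analysis = {
    "summary": "Cooling through the period with light, intermittent rain.",
    "proof": "Daily maxima decline; daily precipitation totals stay small.",
    "keywords": ["cooling_trend", "light_rain", "moist_conditions"],
    "warnings": [],
}

def logistics_render(analysis, audience):
    # Stand-in for the Writer LLM call: presentation only.
    tags = ", ".join(analysis["keywords"])
    return f"[{audience}] {analysis['summary']} Tags: {tags}"

report = write_report(analysis, "logistics", logistics_render)
```

Carrying the facts alongside the text is the design point: the audience-specific wording can vary freely while the auditable fields stay fixed.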

The case studies are qualitative evidence, not a benchmark leaderboard

The paper demonstrates the system on four locations with distinct weather regimes: Cork, Manila, Chennai, and Hai Châu in Da Nang. Each case covers roughly a five- to six-day forecast horizon in late October 2025, using hierarchical mode.

These examples are the paper’s main qualitative evidence. They are not an ablation study, not a large benchmark, and not a statistical proof that the system generalizes. Their purpose is to show whether the mechanism behaves plausibly across different weather situations.

Paper component	Likely purpose	What it supports	What it does not prove
Figure 1 architecture	Implementation detail	The system decomposition into Assistant, Meteorologist, Writer, tools, and context modes	That the architecture outperforms alternatives
Figure 2 Da Nang report	Main qualitative evidence	Warning generation and proof-keyword linkage under extreme rainfall and wind	Calibrated warning accuracy across regions
Figure 3 supplement context block	Implementation detail	How location, climatology, and forecast tables are serialized	That the chosen serialization is optimal
Figure 4 Cork/Manila/Chennai reports	Main qualitative evidence	Regional adaptation and narrative consistency across different regimes	Statistical robustness or comparative superiority
Future work on AFD benchmark, ensemble interpretation, ReAct tooling	Exploratory extension	A roadmap toward stronger validation and uncertainty-aware reporting	Present system capability

This matters because the paper sometimes uses strong language about improvement. The evidence is best read as a prototype demonstration: the system produces plausible, checkable reports in selected cases, and the authors identify the next steps needed for more rigorous evaluation.

Cork shows trend extraction under mild coastal conditions

The Cork example is a clean case of trend extraction. The paper reports daytime maxima decreasing from roughly 14.4–15.5°C early in the window to about 10–11.7°C by October 23–24. Relative humidity often exceeds 80%. Precipitation is intermittent and light, reaching a daily total of up to 5.4 mm on October 23. Winds are moderate, roughly 2–7.8 m/s, with a shift from southerly to west-northwesterly.

The generated keywords include `cooling_trend`, `light_rain`, `moist_conditions`, `frontal_passage`, and `autumn_transition`.

This is where hierarchy helps. Hourly rows might show scattered rain and wind changes. The daily aggregates reveal the cooling trend. The six-hour aggregates preserve enough direction and timing to support the frontal-passage interpretation. The proof block then ties the words back to declining temperature maxima, elevated humidity, small precipitation totals, and directional wind shift.

For business use, this is not an emergency case. It is a coherence case. The value lies in turning a table into a stable operational narrative without exaggerating the risk. Not every forecast needs a red banner. Sometimes the best AI output is the one that resists the temptation to make drizzle feel cinematic.

Manila reveals why regional thresholds are not a footnote

The Manila case is more interesting than it first appears. The report describes typical warm, humid tropical conditions: maxima around 27.6–31.8°C, minima around 25.4–26.9°C, humidity between 66% and 90%, infrequent light precipitation, and light-to-moderate winds mostly from the east to southeast.

The keywords include `light_rain`, `stable_conditions`, `marine_influence`, `warm_anomaly`, and `clear_sky`.

The paper notes a local ambiguity around `warm_anomaly`: maxima may be slightly below the climatological mean, depending on the reference and threshold. This is not just a tiny labeling issue. It is a warning about portability. A keyword that works in one climate regime may become fragile in another if thresholds are not localized.

For enterprise deployment, this is the difference between a useful automated report and a system that irritates local experts. Tropical climates, monsoon regimes, coastal microclimates, and urban heat effects do not always behave like neat mid-latitude textbook examples. A reporting system can be architecturally correct and still regionally tone-deaf.

The paper identifies this clearly: threshold and reference-normal calibration need refinement for tropical contexts.
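To see why the label is fragile, consider a minimal sketch in which both the reference normal and the threshold are explicit parameters. The function, the regime names, and every number below are made up for illustration. The paper does not publish its thresholds.

```python
def warm_anomaly(daily_max_c, clim_mean_max_c, threshold_c):
    """True if the mean forecast maximum exceeds the reference by the threshold."""
    mean_max = sum(daily_max_c) / len(daily_max_c)
    return (mean_max - clim_mean_max_c) >= threshold_c

# Hypothetical per-regime thresholds in degrees Celsius; not taken from the paper.
THRESHOLDS_C = {"mid_latitude": 3.0, "tropical": 1.5}

forecast_max_c = [32.0, 32.0, 32.0]   # synthetic forecast maxima

# Same forecast and same reference, different threshold: the label flips.
strict = warm_anomaly(forecast_max_c, 30.0, THRESHOLDS_C["mid_latitude"])   # False
local = warm_anomaly(forecast_max_c, 30.0, THRESHOLDS_C["tropical"])        # True

# Same forecast and same threshold, different reference normal: it flips again.
other_reference = warm_anomaly(forecast_max_c, 32.5, THRESHOLDS_C["tropical"])  # False
```

Nothing about the forecast changed between the three calls. Only the calibration did, which is the portability problem in miniature.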

Chennai and Da Nang show the warning pathway more clearly

The Chennai case shows a transition from humid and windy conditions toward cooler and drier conditions. The paper reports maxima falling from about 30.7°C on October 21 to about 25.3°C on October 24, minima falling from about 26.6°C to 23.4°C, humidity remaining high at 73–92%, and wind speeds increasing from about 3.3 m/s to above 10 m/s by October 24. Rainfall is substantial early, up to 27.7 mm daily on October 20, then tapers.

The keywords include `heavy_rain`, `frontal_passage`, `strong_wind`, `overcast`, and `stable_conditions`. The proof block connects these labels to early precipitation totals, rising wind speed, wind-direction shift, and decreasing daily temperature maxima.

Da Nang is the stronger hazard example. The system reports persistently high humidity and frequent precipitation, with an extreme daily precipitation total above 130 mm on October 23 and wind speeds up to 9.2 m/s. The warning field is triggered for localized flooding or waterlogging risk. The keywords include `heavy_rain`, `strong_wind`, `frontal_passage`, `overcast`, and `unstable_airmass`; the proof cites very high humidity, extreme precipitation, strengthening wind, and directional shift.

This is the most business-relevant demonstration in the paper because it shows the full reporting chain: aggregate detection, narrative summary, keyword selection, proof support, and warning output.

Still, it remains a demonstration. We should not read one Da Nang case as calibrated flood-warning performance. The system identifies an obvious extreme in the table. That is useful, but the next enterprise question is harder: how often does it over-warn, under-warn, or misclassify borderline events across seasons and regions?
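Answering that question needs paired records of warnings issued and events observed. A minimal sketch of the bookkeeping is below, using two standard forecast-verification measures: false alarm ratio and probability of detection. The function name and the five-case sample are invented for illustration, and the paper reports no such scores.

```python
def warning_scores(issued, observed):
    """Count hits, false alarms, and misses from paired booleans."""
    pairs = list(zip(issued, observed))
    hits = sum(1 for i, o in pairs if i and o)
    false_alarms = sum(1 for i, o in pairs if i and not o)
    misses = sum(1 for i, o in pairs if o and not i)
    warned = hits + false_alarms
    occurred = hits + misses
    return {
        "hits": hits,
        "false_alarms": false_alarms,
        "misses": misses,
        # Share of issued warnings that did not verify (over-warning).
        "false_alarm_ratio": false_alarms / warned if warned else None,
        # Share of observed events that were warned (inverse of under-warning).
        "probability_of_detection": hits / occurred if occurred else None,
    }

# Synthetic sample: five forecast windows.
issued = [True, True, False, False, True]
observed = [True, False, True, False, True]
scores = warning_scores(issued, observed)
```

In this toy sample, one of three warnings is a false alarm and one of three events is missed. Real calibration would need such counts per region and season, not one headline case.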

The business value is decision translation, not magical meteorology

The practical value of this framework is not that it replaces meteorological models. It makes their outputs easier to use.

A business user rarely wants a raw matrix of hourly temperature, humidity, wind, visibility, and precipitation. They want a decision layer. For weather-sensitive operations, the expensive failure is often not “we lacked data,” but “we had data and did not convert it into a timely operational interpretation.”

The paper’s architecture suggests a useful pattern:

Technical contribution	Operational consequence	ROI relevance
Daily and six-hour aggregation	Separates persistent trends from intraday noise	Reduces analyst time spent reading raw hourly tables
Keyword generation	Compresses dominant weather events	Enables search, alerting, routing, and dashboard summaries
Proof block	Links report claims to data patterns	Improves reviewability and lowers blind trust in LLM prose
Warning field	Highlights hazardous conditions	Supports faster escalation for emergency or asset-protection workflows
REST-style integration and caching	Makes report generation reproducible	Allows replay, audit, model comparison, and API deployment
Writer stage	Adapts reports to user domains	Supports multiple business audiences from the same forecast context
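The caching row in the table is the easiest to make concrete. A hedged sketch: derive a key from the serialized context and the model name, then store the generated report under it so the same inputs can be replayed later. The class and function names are assumptions, and the paper's actual cache design is not described in this article.

```python
import hashlib
import json

def context_key(context, model_name):
    """Stable key for a (context, model) pair; key order does not matter."""
    payload = json.dumps({"context": context, "model": model_name}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ReportCache:
    """In-memory cache; a real deployment would likely persist entries."""

    def __init__(self):
        self._store = {}

    def get_or_create(self, context, model_name, generate):
        key = context_key(context, model_name)
        if key not in self._store:
            self._store[key] = generate(context)   # only runs on a cache miss
        return key, self._store[key]
```

Including the model name in the key is deliberate. Replaying the same context through a different model should produce a new entry, which is what makes side-by-side model comparison possible.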

The strongest business use cases are those where the forecast already exists but interpretation is fragmented:

emergency management dashboards that need short, auditable hazard narratives;
logistics systems that translate weather into route and timing constraints;
energy operations that summarize cooling, heating, wind, and storm-relevant changes;
agriculture advisories that explain rain timing, humidity, and field-access conditions;
insurance and climate-resilience monitoring that need structured incident pre-alerts.

The Cognaptus inference is that this architecture is especially useful as a reporting layer on top of existing weather APIs or AI weather models. It does not need to own the forecast model to create value. It needs to own the interpretation pipeline.

The proof block is a lightweight audit trail, not formal verification

The phrase “proof” needs careful handling. The paper’s proof block is not a mathematical proof and not a programmatic verifier. It is an evidential rationale: a structured text block that lists observed patterns supporting the generated keywords and warnings.

That is still useful. It gives reviewers something to check. It also creates a natural target for future automated validation: if a keyword says `heavy_rain`, a validator can inspect precipitation aggregates; if a warning says strong wind, it can check wind speed and gust thresholds; if the summary mentions cooling, it can test daily maxima or minima.

But in the current paper, the system does not perform separate programmatic hypothesis testing. The authors state this limitation directly. Verifiability comes from the summary-proof-keyword linkage, not from an independent rules engine that certifies every claim.
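For a sense of what such an independent check could look like, here is a minimal sketch of a keyword validator. It is not part of the paper's system. The rule table, the numeric thresholds, the aggregate key names, and the sample data are all hypothetical.

```python
# Hypothetical rules and thresholds; the paper does not publish numeric rules.
CHECKS = {
    "heavy_rain": lambda d: max(d["precip_sum_mm"]) >= 20.0,
    "strong_wind": lambda d: max(d["wind_mean_ms"]) >= 8.0,
    "cooling_trend": lambda d: d["temp_max_c"][-1] < d["temp_max_c"][0],
}

def validate_keywords(keywords, daily):
    """Map each keyword to True, False, or None when no rule exists for it."""
    results = {}
    for kw in keywords:
        check = CHECKS.get(kw)
        results[kw] = None if check is None else bool(check(daily))
    return results

# Synthetic daily aggregates, for illustration only.
daily = {
    "precip_sum_mm": [27.0, 6.0, 1.0],
    "wind_mean_ms": [3.5, 6.0, 10.5],
    "temp_max_c": [30.5, 28.0, 25.5],
}
verdicts = validate_keywords(
    ["heavy_rain", "strong_wind", "cooling_trend", "overcast"], daily
)
```

The `None` verdict for `overcast` matters as much as the `True` ones. It marks a keyword the validator cannot judge, which is different from one it has confirmed.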

This distinction is important for enterprise adoption. A proof block improves transparency. It does not eliminate the need for monitoring, thresholds, human review, or validation logs in high-stakes settings.

What the paper directly shows, and what remains uncertain

A disciplined reading separates the demonstrated result from the plausible product path.

Layer	What the paper shows	What remains uncertain
Mechanism	Hierarchical context can structure forecast interpretation across daily, six-hour, and optional hourly scales	Whether this is optimal against other summarization or retrieval strategies
Output design	Keywords plus proof blocks make reports more inspectable	Whether automated validators can reliably detect all keyword-data mismatches
Case behavior	Four qualitative examples show plausible summaries, keywords, and warnings	General performance across many regions, seasons, extreme-event types, and languages
Warning behavior	The Da Nang case triggers a flooding-relevant warning; non-extreme cases do not trigger false alarms	Calibration of warning thresholds, false positives, and false negatives
Deployment architecture	REST-style services, caching, and JSON outputs support reproducibility	Production reliability under API failures, model changes, and real-time operating constraints
Business relevance	The system can serve as a bridge from forecast data to decision-support text	Actual ROI depends on workflow integration, alert fatigue, and domain-specific validation

The future-work section is therefore not decorative. It identifies the missing pieces needed to move from promising prototype to operational system: an AFD-style benchmark and error taxonomy, a critic-corrector loop, ensemble-aware interpretation with probability-tagged keywords, and ReAct-style tools for targeted validation.

The ensemble point is especially important. Real weather decisions often depend on uncertainty, not just the mean forecast. A report that says “heavy rain” is useful. A report that says heavy rain has a 40–60% probability, with spread increasing after day four, is more useful. It also gives managers a way to match decisions to risk tolerance instead of pretending the future has already signed the delivery receipt.
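A probability-tagged keyword can be as simple as the share of ensemble members that cross a threshold. The sketch below illustrates that idea with invented names, an invented 20 mm threshold, and a synthetic ten-member ensemble. The paper lists this as future work and does not specify a method.

```python
def exceedance_probability(member_values, threshold):
    """Fraction of ensemble members whose value meets or exceeds the threshold."""
    hits = sum(1 for v in member_values if v >= threshold)
    return hits / len(member_values)

def tag_keyword(keyword, member_values, threshold):
    """Attach an exceedance probability and a crude spread to a keyword."""
    return {
        "keyword": keyword,
        "probability": exceedance_probability(member_values, threshold),
        "spread": max(member_values) - min(member_values),
    }

# Synthetic daily precipitation totals (mm) from ten ensemble members.
members_mm = [5.0, 12.0, 18.0, 22.0, 25.0, 31.0, 8.0, 40.0, 19.5, 27.5]
tagged = tag_keyword("heavy_rain", members_mm, threshold=20.0)
```

Five of the ten synthetic members exceed the threshold, so the keyword carries a probability of 0.5 and a spread of 35 mm. A raw member fraction is not a calibrated probability, so an operational system would need to verify it against outcomes.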

The broader lesson: agentic reporting works best when the agents are boxed in

This paper is a good example of a broader pattern in enterprise AI: LLM agents become more useful when they are constrained by structure.

The Assistant does data collection and aggregation. The Meteorologist interprets structured context into summary, proof, keywords, and warnings. The Writer adapts presentation without changing facts. The output includes the context used. Caching allows replay. The proposed future validators would compare claims back to aggregates.

This is not the fantasy version of agentic AI where a model wanders freely through tools and returns wisdom. It is a bounded workflow. Each agent has a job. Each output has a schema. Each claim is supposed to point back to evidence. The result is less magical and more deployable. Good. Magic is a terrible systems architecture.

For Cognaptus-style business automation, this is the part worth remembering. The value is not “LLM writes weather reports.” The value is a repeatable interpretation pipeline:

collect structured operational data;
aggregate it at decision-relevant scales;
generate semantic labels;
attach evidence;
adapt to user context;
expose the source context for review;
later add validators and correction loops.
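The steps above can be sketched as one function that takes each stage as a pluggable callable. Everything here is a toy: the stage functions are rule-based stand-ins for what would be API calls and LLM agents, and the names and the 20 mm threshold are assumptions.

```python
def run_pipeline(rows, aggregate, label, adapt, audience):
    """Run the stages in order and keep the source context in the output."""
    aggregates = aggregate(rows)        # aggregate at a decision-relevant scale
    proof = label(aggregates)           # semantic labels: {keyword: [evidence]}
    unsupported = [k for k, ev in proof.items() if not ev]
    return {
        "context": {"rows": rows, "aggregates": aggregates},  # exposed for review
        "keywords": list(proof),
        "proof": proof,
        "unsupported_keywords": unsupported,   # hook for later validators
        "report": adapt(proof, audience),      # audience adaptation
    }

def daily_totals(rows):
    return {"precip_total_mm": sum(rows)}

def rule_label(aggregates):
    total = aggregates["precip_total_mm"]
    keyword = "heavy_rain" if total >= 20.0 else "light_rain"  # hypothetical rule
    return {keyword: [f"precipitation total {total:.1f} mm"]}

def plain_adapt(proof, audience):
    parts = [f"{k} ({', '.join(ev)})" for k, ev in proof.items()]
    return f"For {audience}: " + "; ".join(parts)

# Synthetic six-hour precipitation totals in mm.
result = run_pipeline([0.0, 12.5, 9.0, 1.5], daily_totals, rule_label,
                      plain_adapt, "logistics")
```

Because each stage is passed in, the same skeleton could carry sensor readings or incident logs instead of rainfall. Only the aggregation and labeling functions would change.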

Weather is only one domain. The same pattern can apply to supply-chain risk reports, financial market surveillance, customer-service incident summaries, industrial sensor monitoring, and policy brief generation. Anywhere tables become decisions, a hierarchical proof-linked narrative layer can reduce cognitive load.

Conclusion: from forecast table to accountable narrative

Hierarchical AI-Meteorologist is not trying to make the sky more predictable. It is trying to make forecast interpretation more usable.

That may sound less glamorous, but it is exactly where many operational bottlenecks live. Forecast APIs already produce abundant data. The harder business problem is turning that data into a report that says what is happening, why the system thinks so, what risk deserves attention, and which evidence supports the claim.

The paper’s mechanism-first contribution is clear: the daily and six-hour hierarchy reduces noise; keywords compress events into semantic anchors; proof blocks make claims easier to inspect; REST-style integration makes the workflow reproducible. The evidence is still early, based on four qualitative case studies rather than a large benchmark. Threshold localization, uncertainty handling, and programmatic validation remain necessary before this becomes a robust operational forecaster’s assistant.

But the direction is useful. The future of AI weather reporting is probably not a poetic chatbot staring at hourly rows and improvising atmospheric drama. It is a structured agent pipeline that knows which data it used, which scale it reasoned over, which words it chose, and which evidence those words must answer to.

That is less theatrical. It is also much closer to something a business can trust.

Cognaptus: Automate the Present, Incubate the Future.

¹ Daniil Sukhorukov et al., “Hierarchical AI-Meteorologist: LLM-Agent System for Multi-Scale and Explainable Weather Forecast Reporting,” arXiv:2511.23387, 2025.
