Storms are easy to see after they arrive. The harder question is what actually made them happen.
That distinction sounds academic until money enters the room. An insurer wants to know whether an event belongs to a changing regional risk pattern. A grid operator wants to understand whether a heatwave was driven by persistent blocking, moisture transport, or local feedback. A government agency wants a report fast enough to support preparedness, not just a polished explanation three months later. The weather event is visible. The mechanism is expensive.
This is where the paper “EWE: An Agentic Framework for Extreme Weather Analysis” becomes interesting.[1] It is tempting to read EWE, short for Extreme Weather Expert, as another AI weather model. That would be the wrong box. EWE is not trying to beat GraphCast or Pangu-Weather at forecasting tomorrow’s atmosphere. It is trying to automate a different bottleneck: the expert diagnostic workflow that explains why an extreme event unfolded the way it did.
In plain terms, EWE is less “weather oracle” and more “junior meteorological analyst with tools, memory, charts, and an auditor standing behind it with a red pen.” Which, frankly, is already more governance than many dashboard products receive.
Prediction scaled first; diagnosis did not
The recent history of AI for weather has been dominated by prediction. Models such as Pangu-Weather, GraphCast, and FengWu are cited in the paper as examples of systems that improved medium-range forecasting. Their job is prognostic: produce future atmospheric states with high numerical skill.
But businesses rarely need prediction alone. After a damaging flood, cyclone, drought, cold wave, or heatwave, organizations need interpretation. They need to know what circulation pattern mattered, what moisture pathway intensified the event, whether mesoscale structure changed the damage profile, and how the event compares with climatological baselines. That is not a simple forecast output. It is a scientific narrative grounded in data.
The paper’s central observation is that this diagnostic work remains labor-intensive. Human experts manually gather meteorological fields, compute derived diagnostics, inspect visualizations, connect mechanisms across scales, and write an explanation. It is valuable precisely because it is not just a chart. It is a causal story constrained by physics.
That is the gap EWE tries to fill: not “can AI predict the storm?” but “can an AI agent conduct the structured investigation that a meteorologist would perform after or around the event?”
EWE is built as a diagnostic loop, not a one-shot chatbot
The paper formalizes extreme-weather diagnosis as an iterative trajectory of Thought, Action, Observation, and Interpretation. This matters because the task is not answer generation. It is workflow execution.
The agent first plans an investigation. It then retrieves data, generates or runs code, creates diagnostic visualizations, interprets the results, checks whether the output is valid, and continues. The loop repeats until the agent can assemble a physically coherent diagnosis.
A simplified version looks like this:
| Stage | What EWE does | Why it matters operationally |
|---|---|---|
| Thought | Decomposes the event into meteorological subquestions | Prevents the agent from jumping directly to a vague explanation |
| Action | Runs code, retrieves data, computes diagnostics, creates plots | Grounds reasoning in actual atmospheric fields |
| Observation | Reads numerical outputs and visualizations | Converts raw data into interpretable evidence |
| Interpretation | Links patterns to physical mechanisms | Produces the explanation that decision-makers actually need |
| Audit and revision | Checks code correctness and figure clarity | Reduces silent failure from bad code or unreadable charts |
This loop is the paper’s real contribution. The model is not being asked to “write an analysis of a storm” from memory. It is placed inside a structured environment where each step must touch data, tools, and visual evidence.
That difference is not cosmetic. In scientific domains, fluent language is cheap; correct intermediate work is expensive. EWE’s design tries to make the intermediate work visible.
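The loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `Step`, `run_diagnosis`, `height_tool`, and `level_auditor` are hypothetical names, the toy tool returns placeholder numbers, and the auditor is reduced to a callback that either accepts a step or proposes corrected parameters.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Step:
    thought: str         # sub-question under investigation
    action: str          # name of the tool that was run
    observation: Any     # raw tool output after any revisions
    interpretation: str  # physical reading of the observation
    revisions: int       # times an auditor sent the step back


# An auditor returns None when satisfied, or a dict of corrected parameters.
Auditor = Callable[[str, Dict[str, Any], Any], Optional[Dict[str, Any]]]


def run_diagnosis(plan, tools, interpret, auditors: List[Auditor], max_revisions=2):
    """Execute each planned step, letting auditors request a bounded number of reruns."""
    trajectory: List[Step] = []
    for thought, tool_name, params in plan:
        params = dict(params)  # copy so the plan itself is never mutated
        revisions = 0
        while True:
            observation = tools[tool_name](**params)
            fixes = [audit(tool_name, params, observation) for audit in auditors]
            fixes = [fix for fix in fixes if fix]
            if not fixes or revisions >= max_revisions:
                break
            for fix in fixes:
                params.update(fix)
            revisions += 1
        trajectory.append(
            Step(thought, tool_name, observation, interpret(thought, observation), revisions)
        )
    return trajectory


# --- toy usage; every name and number below is hypothetical ---
def height_tool(level_hpa):
    # Stand-in for data retrieval plus a diagnostic; values are placeholders.
    return {"level_hpa": level_hpa, "mean_height_gpm": 5600.0 if level_hpa == 500 else 1500.0}


def level_auditor(tool_name, params, observation):
    # The script ran without error, but a mid-level trough question needs 500 hPa.
    if tool_name == "height" and params["level_hpa"] != 500:
        return {"level_hpa": 500}
    return None


plan = [("Is there a mid-level trough?", "height", {"level_hpa": 850})]
trajectory = run_diagnosis(
    plan,
    tools={"height": height_tool},
    interpret=lambda thought, obs: f"{thought} Checked at {obs['level_hpa']} hPa.",
    auditors=[level_auditor],
)
```

In this toy run the first call succeeds at the wrong pressure level, the auditor sends it back once, and the recorded step keeps the revision count. The bounded `max_revisions` is a design choice of the sketch so that a stubborn auditor cannot stall the loop.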
The three mechanisms: planning, tools, and auditors
EWE combines three modules: knowledge-enhanced planning, self-evolving closed-loop reasoning, and a meteorological toolkit. The names are a little grand, as research-paper names often are. The underlying design is practical.
Expert-guided planning keeps the agent from improvising meteorology
The first mechanism is knowledge-enhanced planning using meteorological Chain-of-Thought guidelines. The authors manually annotate step-by-step diagnostic procedures for different categories of extreme events. These expert-style procedures are then used both for initial planning and as persistent memory during execution.
The goal is not to make the model sound clever. The goal is to constrain it.
A general-purpose model may know terms like “positive vorticity advection,” “integrated vapor transport,” or “potential temperature.” Knowing the words is not the same as knowing when to compute them, how to interpret them, and how to connect them across spatial and temporal scales. EWE’s planning module tries to turn latent meteorological knowledge into an ordered diagnostic path.
For business users, this is the first important lesson: domain agents should not merely be prompted with “act like an expert.” They need reusable expert workflows. Otherwise, the system becomes an eloquent intern wandering through a data warehouse with a flashlight.
The toolkit gives the agent physical grounding
The second mechanism is the meteorological toolkit. The paper says EWE retrieves data from the 0.25° ERA5 reanalysis dataset, including a 30-year climatology for anomaly analysis. Retrieved data are packaged as NetCDF files with metadata. The toolkit also provides expert-validated Python functions for domain-specific diagnostics.
This is where EWE stops being a language wrapper and becomes an analytical system.
The paper gives Integrated Vapor Transport as an example. A general model can write basic plotting code, but computing meteorological diagnostics correctly requires scientific formulas, variable handling, units, and domain conventions. If those functions are pre-verified, the agent is less likely to invent a calculation that looks plausible but quietly breaks the analysis.
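To make the point concrete, here is a minimal sketch of Integrated Vapor Transport for a single atmospheric column, using the standard definition: the magnitude of the pressure integral of specific humidity times horizontal wind, divided by gravity. It is not the paper's toolkit function. The names `column_ivt` and `trapezoid` are hypothetical, the profile values are invented placeholders, and a production version would presumably be vectorized over the full grid.

```python
import math

G = 9.80665  # standard gravity, m s^-2


def trapezoid(y, x):
    """Trapezoidal integral of y over x (both plain sequences of equal length)."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))


def column_ivt(q, u, v, p_hpa):
    """IVT in kg m^-1 s^-1 for one column.

    q: specific humidity in kg/kg, u and v: wind components in m/s,
    p_hpa: pressure levels in hPa. Levels may arrive in any order.
    """
    levels = sorted(zip(p_hpa, q, u, v))          # integrate with increasing pressure
    p_pa = [p * 100.0 for p, _, _, _ in levels]   # hPa -> Pa, a classic unit trap
    qu = [qi * ui for _, qi, ui, _ in levels]
    qv = [qi * vi for _, qi, _, vi in levels]
    return math.hypot(trapezoid(qu, p_pa), trapezoid(qv, p_pa)) / G


# Placeholder profile: uniform humidity and purely zonal wind from 1000 to 300 hPa.
p_levels = [1000, 850, 700, 500, 300]
ivt = column_ivt(q=[0.01] * 5, u=[10.0] * 5, v=[0.0] * 5, p_hpa=p_levels)
```

With a uniform profile the integral reduces to `0.01 * 10 * 70000 / G`, which makes a convenient sanity check. Forgetting the hPa-to-Pa conversion would shrink the answer by a factor of 100 without raising any error, which is the kind of quiet failure pre-verified functions are meant to prevent.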
The business implication is straightforward: in high-stakes analytical systems, the value often comes from the tool layer, not the chat layer. The model orchestrates; the tools compute. Confusing these roles is how organizations end up with impressive demos and brittle operations. Nature remains stubbornly unimpressed by interface polish.
Auditors catch errors that normal execution feedback misses
The third mechanism is the dual-auditor system. EWE uses a Code Auditor and a Content Auditor.
The Code Auditor looks for procedural issues such as wrong tool parameters or flawed data indexing. These are not always caught by runtime errors. A script can run successfully and still analyze the wrong variable, time window, pressure level, or location. That kind of error is especially dangerous because the output arrives with the confidence of a completed computation.
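One simple way to catch this class of error is to compare what the plan asked for with what the script actually selected. The sketch below assumes the script logs its selection as a dictionary; the paper does not describe the Code Auditor at this level of detail, so `audit_selection`, the key names, and the placeholder values are all hypothetical.

```python
def audit_selection(planned, used, keys=("variable", "level_hpa", "start", "end", "bbox")):
    """List every mismatch between the planned selection and the one the script used."""
    issues = []
    for key in keys:
        if planned.get(key) != used.get(key):
            issues.append(f"{key}: planned {planned.get(key)!r}, script used {used.get(key)!r}")
    return issues


# Placeholder request; dates and bounding box are illustrative only.
planned = {
    "variable": "geopotential",
    "level_hpa": 500,
    "start": "2000-01-01",
    "end": "2000-01-04",
    "bbox": (45.0, 55.0, 0.0, 15.0),
}
# The script ran cleanly, but at the wrong level and with a truncated window.
used = dict(planned, level_hpa=850, end="2000-01-03")

issues = audit_selection(planned, used)
```

Here `issues` holds two entries, one for the pressure level and one for the end date, even though nothing in the run would have raised an exception.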
The Content Auditor focuses on the visual side: cluttered plots, bad contrast, occluded labels, excessive wind vectors, and other problems that make meteorological interpretation unreliable. The paper’s Figure 6 illustrates this with a 500 hPa geopotential height and wind map. The baseline version is visually cluttered; the auditor-guided version sparsifies the vector field, making the trough and flow structure easier to identify.
This is an underrated point. In weather analysis, visual clarity is not decoration. A bad chart can hide the mechanism. If an agent produces an unreadable map and then writes a confident interpretation, the problem is not aesthetics. It is evidence quality.
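Sparsifying a vector field is often done by plotting only every n-th grid point. The sketch below shows that stride-based approach on plain nested lists; the paper does not say how its auditor-guided revision thins the vectors, so this is one common technique rather than EWE's method, and `thin_vectors` and `max_per_axis` are hypothetical names.

```python
import math


def thin_vectors(u, v, max_per_axis=20):
    """Subsample 2-D wind components so at most max_per_axis arrows remain per axis."""
    n_rows, n_cols = len(u), len(u[0])
    row_step = max(1, math.ceil(n_rows / max_per_axis))
    col_step = max(1, math.ceil(n_cols / max_per_axis))

    def thin(grid):
        return [row[::col_step] for row in grid[::row_step]]

    return thin(u), thin(v), (row_step, col_step)


# A 20 x 30 degree box on a 0.25 degree grid has 81 x 121 points.
rows, cols = 81, 121
u_full = [[1.0] * cols for _ in range(rows)]
v_full = [[0.0] * cols for _ in range(rows)]

u_thin, v_thin, steps = thin_vectors(u_full, v_full)
```

For this box the strides come out as 5 and 7, so 9,801 candidate arrows become 17 by 18, or 306. The contoured height field would still be drawn at full resolution; only the arrows are thinned.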
The benchmark evaluates the workflow, not only the final report
EWE also contributes a benchmark: 103 high-impact extreme-weather events from the past decade, drawn from EM-DAT and WMO State of the Global Climate reports. The dataset covers major IPCC AR6 event categories: temperature extremes, extreme precipitation, droughts, tropical cyclones, and extratropical cyclones.
The authors exclude events without demonstrable human impact. That is a useful design choice. The benchmark is not merely asking whether the agent can explain meteorologically interesting anomalies. It focuses on events that mattered socially or economically.
The dataset distribution is not perfectly uniform, and the paper says so. Cyclonic events account for 26.2% combined, of which extratropical cyclones make up 8.7 percentage points. Extreme precipitation accounts for 23.3%, heat waves for 17.5%, cold waves for 16.5%, and droughts for 15.5%. Geographically, the sample is skewed toward Asia at 32.0%, Europe at 24.3%, and North America at 21.4%, with smaller shares from Africa, South America, and Oceania.
The evaluation is step-wise. Instead of judging only the final report, the benchmark scores multiple stages: planning, data exploration, event identification, synoptic-scale analysis, mesoscale analysis, thermodynamic analysis, and final reporting. It evaluates code fidelity, visualization quality, and depth of physical interpretation.
That design is important because a final report can hide many sins. An agent might produce a fluent conclusion after weak data exploration. It might generate correct code but poor interpretation. It might identify a weather pattern but fail to explain the physical link. Step-wise evaluation makes those failures harder to bury.
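A small sketch shows why the per-stage view matters. The stage names follow the list above, but the scores are invented for illustration and are not results from the paper; `stage_report` and the 0.6 floor are hypothetical choices.

```python
STAGES = (
    "planning",
    "data_exploration",
    "event_identification",
    "synoptic",
    "mesoscale",
    "thermodynamic",
    "final_report",
)


def stage_report(scores, floor=0.6):
    """Summarize per-stage scores in [0, 1] and flag the stages below a chosen floor."""
    missing = [s for s in STAGES if s not in scores]
    if missing:
        raise ValueError(f"missing stage scores: {missing}")
    values = [scores[s] for s in STAGES]
    return {
        "mean": sum(values) / len(values),
        "weakest": min(STAGES, key=lambda s: scores[s]),
        "below_floor": [s for s in STAGES if scores[s] < floor],
    }


# Invented scores: a polished final report sitting on weak data exploration.
example = {
    "planning": 0.90,
    "data_exploration": 0.40,
    "event_identification": 0.80,
    "synoptic": 0.80,
    "mesoscale": 0.55,
    "thermodynamic": 0.70,
    "final_report": 0.95,
}
report = stage_report(example)
```

Judged on `final_report` alone, this run looks excellent. The step-wise summary instead names `data_exploration` as the weakest stage and flags it, along with `mesoscale`, as below the floor.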
| Evaluation element | Role in the paper | What it supports | What it does not prove |
|---|---|---|---|
| 103-event dataset | Main benchmark | Tests agents across event types and regions | Does not prove global operational readiness |
| Step-wise scoring | Main evaluation design | Reveals where the workflow succeeds or fails | Does not eliminate evaluator bias |
| Single-response grading | Main evidence | Scores each output independently | May miss subtle differences between plausible outputs |
| Comparative grading | Main evidence | Makes model-quality gaps more visible | Still depends on judge-model behavior |
| Ablation table | Ablation evidence | Shows which components contribute to performance | Uses a limited sample, so it is not a full sensitivity map |
| Figure-auditor example | Implementation evidence | Shows why visual correction matters | Does not quantify all visualization failure modes |
The results show model specialization, not one universal winner
The paper evaluates five MLLMs: Gemini-2.5-Pro, Claude-4-Sonnet, GPT-4.1-2025-04-14, Llama-4-Maverick, and o4-mini-2025-04-16. The scoring is normalized to $[0,1]$, where higher is better.
In single-response grading, o4-mini scores highest on planning at 0.974, while Claude-4-Sonnet leads in event identification at 0.783, mesoscale analysis at 0.700, thermodynamic analysis at 0.667, and final reporting at 0.981. GPT-4.1 is strongest in synoptic-scale analysis at 0.785 and also performs strongly in planning and data work.
In comparative grading, Claude-4-Sonnet becomes the clearest overall performer. It leads event identification at 0.832, synoptic analysis at 0.837, mesoscale analysis at 0.782, thermodynamic analysis at 0.750, and final report generation at 0.950. GPT-4.1 remains strongest in planning at 0.832 and data exploration at 0.624.
The interpretation should be careful. This is not a generic leaderboard for all weather AI tasks. It is a benchmark for an agentic diagnostic workflow under the authors’ evaluation setup. The useful finding is not “Claude is best at meteorology,” full stop. The useful finding is that different stages of scientific agent work stress different capabilities.
Planning, data handling, physical interpretation, and final synthesis are not one skill. They are a chain. In business systems, the weakest link in that chain determines whether the final report is useful.
The ablation study explains why the architecture matters
The paper’s ablation study is small but revealing. It tests the role of three components: analysis tools, auditors, and meteorological CoT. The full system uses all three. The baseline removes all three.
The reported scores are:
| Variant | Tools | Auditor | CoT | Synoptic | Mesoscale | Thermodynamic |
|---|---|---|---|---|---|---|
| No tools | No | Yes | Yes | 0.752 | 0.619 | 0.537 |
| No auditor | Yes | No | Yes | 0.768 | 0.636 | 0.665 |
| Baseline | No | No | No | 0.548 | 0.467 | 0.502 |
| Full EWE | Yes | Yes | Yes | 0.787 | 0.680 | 0.679 |
The full system performs best across all three dimensions. Compared with the baseline, it improves synoptic analysis by 0.239 and mesoscale analysis by 0.213. Among the single-component removals, dropping the tools causes the larger thermodynamic decline: from 0.679 to 0.537, against 0.665 without the auditor. That is exactly where one would expect tool support to matter, because thermodynamic analysis depends heavily on precise calculations.
Removing the auditor also hurts performance, especially for mesoscale analysis, where the score falls from 0.680 to 0.636. This supports the paper’s argument that agents need more than execution feedback. A chart can be generated successfully and still be too cluttered to support interpretation.
The CoT result requires more caution. The paper does not include a variant where only CoT is removed while tools and auditors remain. Instead, it infers CoT’s foundational role from the poor baseline, where all three components are absent. That is suggestive, not cleanly isolated. The fair interpretation is that structured meteorological reasoning appears important, but the ablation design does not fully separate CoT’s independent marginal contribution.
That distinction matters because this is where many AI papers get over-read. The ablation supports the architecture. It does not prove that each module’s contribution has been completely disentangled.
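The arithmetic behind these comparisons is easy to reproduce from the reported table. The sketch below uses only the scores quoted above; `ABLATION` and `drop_from_full` are hypothetical helper names.

```python
# Scores as reported in the ablation table: (synoptic, mesoscale, thermodynamic).
ABLATION = {
    "No tools": (0.752, 0.619, 0.537),
    "No auditor": (0.768, 0.636, 0.665),
    "Baseline": (0.548, 0.467, 0.502),
    "Full EWE": (0.787, 0.680, 0.679),
}
DIMENSIONS = ("synoptic", "mesoscale", "thermodynamic")


def drop_from_full(variant):
    """Score lost relative to the full system, per dimension, rounded to 3 decimals."""
    full = ABLATION["Full EWE"]
    return {
        dim: round(f - s, 3)
        for dim, f, s in zip(DIMENSIONS, full, ABLATION[variant])
    }


drops = {name: drop_from_full(name) for name in ABLATION if name != "Full EWE"}
```

The table has no row with tools and auditors enabled but CoT removed, so `drops` cannot contain a CoT-only entry. That absence is the gap in the ablation design discussed above.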
The business value is faster diagnosis, not autonomous truth
For business readers, the practical pathway is not “replace meteorologists.” That is the lazy interpretation, and lazy interpretations are how systems become procurement theater.
The more realistic pathway is to turn rare expert workflows into semi-automated climate intelligence pipelines.
An insurer could use EWE-like systems to generate standardized post-event diagnostic reports before underwriting committees meet. A utility could analyze whether a heat event was linked to persistent large-scale circulation, local humidity, or compounding factors that affect load forecasting. A disaster agency could use the agent to create first-pass mechanism reports for multiple events across regions. A climate analytics vendor could turn event diagnosis into a repeatable service layer instead of bespoke consulting labor every time the sky misbehaves.
The key word is first-pass. EWE’s strongest business use is to reduce the cost and latency of structured analysis, not to certify final scientific truth. The agent can gather data, compute diagnostics, generate visualizations, classify mechanisms, and draft explanations. Expert review still matters, especially when the output influences capital allocation, insurance claims, regulatory reporting, or public safety.
A practical deployment architecture would likely look like this:
| Layer | Role in an EWE-like business workflow | Human oversight needed |
|---|---|---|
| Event intake | Define event window, location, type, and relevant impact context | Medium, especially for ambiguous event boundaries |
| Data retrieval | Pull reanalysis, observational, or proprietary datasets | High during setup; lower after validation |
| Diagnostic computation | Run pre-verified meteorological functions | High for new diagnostics; routine for validated functions |
| Visualization | Produce maps, anomaly plots, and diagnostic charts | Medium, because bad charts distort interpretation |
| Agentic interpretation | Link patterns to physical mechanisms | High for high-stakes conclusions |
| Report generation | Produce standardized event intelligence | Medium to high depending on use case |
| Audit trail | Store code, data sources, plots, scores, and revisions | Essential for governance and compliance |
This is where the ROI argument becomes concrete. EWE-like systems may reduce expert hours spent on repetitive data preparation, plotting, and first-draft synthesis. They may also improve consistency across reports because every event passes through a common evaluation structure. But the value depends on tool reliability, dataset coverage, auditability, and the cost of expert review.
In other words: the agent saves time only if the review process is designed, not improvised afterward in a shared folder named “final_final_v7.”
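As one way to make the audit-trail layer tangible, the sketch below stores each analysis step as a record with a content fingerprint and an explicit sign-off field. It is an assumption about how a deployment might do this, not something the paper specifies; `AuditRecord`, its fields, and the example values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AuditRecord:
    event_id: str
    stage: str
    code: str                 # the exact script text that was executed
    data_sources: List[str]
    figure_paths: List[str] = field(default_factory=list)
    reviewer: str = ""        # stays empty until a human signs off

    def fingerprint(self) -> str:
        """SHA-256 over the analysis content, excluding the reviewer field."""
        content = asdict(self)
        content.pop("reviewer")
        payload = json.dumps(content, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    @property
    def signed_off(self) -> bool:
        return bool(self.reviewer)


record = AuditRecord(
    event_id="event-001",
    stage="synoptic",
    code="plot_height(level_hpa=500)",
    data_sources=["ERA5 0.25 degree reanalysis"],
)
before = record.fingerprint()
record.reviewer = "duty meteorologist"
after = record.fingerprint()

# Same step with one parameter changed: the fingerprint must differ.
edited = AuditRecord("event-001", "synoptic", "plot_height(level_hpa=850)",
                     ["ERA5 0.25 degree reanalysis"])
```

Keeping the reviewer out of the hash means sign-off does not alter the identity of the analysis, while any change to code or data sources does. That separation is a design choice of this sketch.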
The boundaries are not footnotes; they define deployment
The paper is promising, but its boundaries are operationally important.
First, the study relies on ERA5 reanalysis data in its current implementation. ERA5 is powerful, but a business deployment may need real-time observational streams, proprietary sensor feeds, satellite data, local radar, or higher-resolution regional models. The paper explicitly mentions future extension to real-time observational data streams. Until then, EWE should be read primarily as a diagnostic framework, not a live warning system.
Second, the evaluation uses MLLM-as-a-judge. The authors initially discuss both GPT-4.1 and Gemini-2.5-Pro as judges, but in experiments they designate GPT-4.1 as the sole judge after observing that Gemini tended to penalize correct code. That choice may improve consistency, but it also concentrates evaluation dependence in one judge model. For enterprise use, independent expert review and task-specific validation would still be necessary.
Third, the ablation study is conducted on a limited set of samples. It supports the importance of tools, auditors, and structured reasoning, but it should not be treated as a complete robustness analysis across every event type, geography, or data condition.
Fourth, the benchmark covers 103 high-impact events, which is a meaningful starting point but not a complete universe of extreme-weather mechanisms. The paper itself notes possible expansion to underrepresented event types. This matters because climate-risk workflows are often local. A system that performs well on broad global cases may still need adaptation for specific regions, hazards, and institutional decisions.
Finally, the output is diagnostic, not causal in the strongest scientific sense. EWE can assemble a physically grounded explanation from data and tools. That is valuable. But formal attribution, liability decisions, or regulatory-grade climate claims may require additional methods, uncertainty quantification, and human sign-off.
The deeper lesson: scientific agents need operating discipline
The most interesting part of EWE is not that it uses an MLLM. Everyone uses an MLLM now; it is practically the new office furniture.
The interesting part is that EWE treats scientific reasoning as an operating system. It combines expert templates, validated tools, data access, visualization, memory, feedback, and evaluation. The language model is important, but it is not asked to carry the whole epistemic burden alone.
That design lesson travels beyond meteorology. In finance, drug discovery, compliance, engineering, and supply-chain risk, organizations face similar problems: high-dimensional data, scarce experts, difficult causal narratives, and expensive interpretation. EWE suggests that the path forward is not a single smarter model answering from a prompt. It is an agent constrained by workflows, tools, intermediate checks, and domain-specific evaluators.
For climate risk specifically, this paper points toward a future where extreme-weather intelligence becomes more standardized and more scalable. Not perfect. Not autonomous truth. But faster, more structured, and easier to audit than today’s fully manual diagnostic bottleneck.
That is already a serious shift. Prediction tells you what may happen. Diagnosis tells you what happened, why it happened, and what kind of risk pattern you may be dealing with next time. Businesses need both.
EWE is an early attempt to automate the second half of that equation. The storm still arrives on its own schedule. At least the analysis may not have to.
Cognaptus: Automate the Present, Incubate the Future.
[1] Zhe Jiang et al., “EWE: An Agentic Framework for Extreme Weather Analysis,” arXiv:2511.21444, 2025. https://arxiv.org/abs/2511.21444