Opening — Why this matters now
Extreme weather is no longer a footnote in climate reports—it’s a recurring headline. Storms intensify, heat waves lengthen, and infrastructure creaks under the weight of unpredictability. Yet the most valuable part of understanding these events—the diagnostic analysis of how and why they formed—remains trapped in a slow, expert‑only workflow. Prediction has scaled; understanding has not.
Enter EWE (Extreme Weather Expert), a new agentic framework that tries to automate what meteorologists do after the storm hits. Think of it as an AI that doesn’t just forecast—it explains. In a world where climate risk pricing, agricultural planning, and disaster‑response coordination depend on rapid insights, automating meteorological diagnosis is no longer a luxury. It’s an economic survival mechanism.
Background — Context and prior art
In Earth sciences, AI progress has leaned heavily toward forecasting. Models like Pangu-Weather, GraphCast, and FengWu pushed medium‑range accuracy to impressive heights. But these models don’t tell you why an event happened. They don’t stitch together multi-level circulation patterns or explain cause-and-effect in human language.
Meanwhile, attempts to inject interpretability—such as explainable AI for climate—have mostly resulted in static, post-hoc overlays that still require an expert to interpret. And when LLMs entered the scene, they brought reasoning ability but lacked grounding in physical data.
The result: powerful tools for prediction, but almost nothing for scalable, data-driven diagnosis—the part that governments, insurers, and grid operators desperately need.
Analysis — What the paper does
The paper proposes EWE, which the authors present as the first agent designed explicitly for extreme weather diagnosis. Rather than treating analysis as a single task, EWE operationalizes the entire expert workflow:
- Knowledge-Enhanced Planning: EWE starts with meteorologist-style Chain-of-Thought (CoT) reasoning, structured around expert-authored diagnostic templates.
- Self-Evolving Closed-Loop Reasoning: A Think → Act → Observe → Interpret loop ties the agent's reasoning to real meteorological data. A dual-auditor system (one auditor for code, one for visualizations) is meant to catch subtle scientific errors.
- Meteorological Toolkit: Pre-verified Python tools retrieve ERA5 reanalysis data (0.25° global resolution) and compute domain-specific diagnostics such as integrated vapor transport (IVT), vorticity, and potential temperature.
- End-to-End Workflow Execution: EWE generates visualizations, interprets them with a multimodal LLM (MLLM), and synthesizes the results into a final report.
The key innovation is integrating structured reasoning, code execution, data retrieval, and multimodal interpretation into a single autonomous diagnostic cycle.
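To make the toolkit's diagnostics concrete, here is a minimal sketch of two of the quantities named above, potential temperature and IVT, for a single atmospheric column. It is not the paper's toolkit code: the function names, the constant `KAPPA = 0.286`, and the choice of a plain trapezoidal integral are all assumptions made for illustration, using the standard textbook definitions.

```python
import math

G = 9.80665      # standard gravity, m s^-2
P0_HPA = 1000.0  # reference pressure for potential temperature, hPa
KAPPA = 0.286    # approximate R_d / c_p for dry air (assumed value)


def potential_temperature(temp_k, pressure_hpa):
    """Potential temperature (K) from temperature (K) and pressure (hPa)."""
    return temp_k * (P0_HPA / pressure_hpa) ** KAPPA


def _trapezoid(values, x):
    """Trapezoidal integral of values over the coordinate x."""
    return sum(
        0.5 * (values[i] + values[i + 1]) * (x[i + 1] - x[i])
        for i in range(len(x) - 1)
    )


def ivt_magnitude(q, u, v, pressure_hpa):
    """IVT magnitude (kg m^-1 s^-1) for one column.

    q is specific humidity (kg/kg); u and v are wind components (m/s);
    all three are given on the same pressure levels (hPa).
    """
    p_pa = [p * 100.0 for p in pressure_hpa]  # hPa -> Pa
    qu = _trapezoid([qi * ui for qi, ui in zip(q, u)], p_pa)
    qv = _trapezoid([qi * vi for qi, vi in zip(q, v)], p_pa)
    # The sign of each integral depends on level ordering; the magnitude does not.
    return math.hypot(qu, qv) / G
```

A gridded version over ERA5 fields would apply the same arithmetic at every grid point, typically with an array library rather than Python lists.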
A benchmark for the new field
To validate EWE, the authors built a 103-event extreme weather diagnosis dataset, which they describe as the first of its kind. It covers heatwaves, cold waves, droughts, tropical cyclones, extratropical storms, and extreme precipitation, and each event is annotated with its local timezone and a multi-year climatology baseline.
A stepwise evaluation rubric scores:
- Code correctness
- Visualization clarity
- Physical diagnostic insight
This granular scoring avoids the common trap of evaluating only the final answer.
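One way to picture stepwise scoring is as a small record per workflow stage, with one field per rubric criterion. The sketch below is an illustration only: the `StepScore` class, the 0-to-1 scale, and the unweighted averaging are assumptions, not the paper's actual rubric.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class StepScore:
    """Scores for one workflow stage; the 0-1 scale is an assumed convention."""
    stage: str
    code_correctness: float
    visualization_clarity: float
    physical_insight: float

    def overall(self) -> float:
        # Unweighted mean of the three criteria (an assumption).
        return mean([
            self.code_correctness,
            self.visualization_clarity,
            self.physical_insight,
        ])


def summarize(steps):
    """Average each criterion across stages and flag the weakest stage."""
    return {
        "code_correctness": mean(s.code_correctness for s in steps),
        "visualization_clarity": mean(s.visualization_clarity for s in steps),
        "physical_insight": mean(s.physical_insight for s in steps),
        "weakest_stage": min(steps, key=StepScore.overall).stage,
    }
```

Keeping the criteria separate per stage is what lets a summary like this show where a run went wrong, for example correct code paired with a weak physical interpretation.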
Findings — Results with visualization
EWE is evaluated with several major MLLMs, and each is scored across seven stages: planning, data exploration, event identification, synoptic analysis, mesoscale analysis, thermodynamic analysis, and final reporting.
A distilled comparison:
| Model | Strengths | Weaknesses |
|---|---|---|
| Claude‑4 Sonnet | Top performer in mesoscale, thermodynamics, and final reporting | Slightly weaker early-stage planning |
| GPT‑4.1 (2025) | Excellent planning and synoptic analysis | Mid‑tier mesoscale performance |
| Gemini‑2.5 Pro | Strong planning | Inconsistent code evaluation in CG mode |
| Llama‑4 Maverick | Reasonable data exploration | Poor high‑level physical diagnosis |
| o4-mini (2025) | Strong planner, efficient | Weak final synthesis |
Ablation tests clarify which components matter most:
| Removed Component | Impact |
|---|---|
| Meteorological Toolkit | Severe degradation in thermodynamic tasks |
| Auditor | Mesoscale patterns misdiagnosed, unclear charts |
| CoT Planning | Analysis collapses entirely—agent cannot structure reasoning |
The toolkit and auditor underpin correctness, while CoT planning gives the analysis its structure.
Conceptual Diagram — EWE Workflow
User Query → Plan (expert CoT) → Execute Code → Generate Plots → Auditor Check → Interpret → Iterate → Final Diagnosis
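The workflow above can be sketched as a small control loop. This is a hypothetical skeleton rather than EWE's implementation: `run_diagnosis`, `StepResult`, the callable signatures, and the retry limit are all invented for illustration, and the model calls, code execution, and plotting are left as caller-supplied functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    step: str
    attempts: int
    passed: bool
    interpretation: Optional[str]  # None when the auditor never approved the step


def run_diagnosis(
    plan: List[str],
    act: Callable[[str, str], object],
    audit: Callable[[object], Tuple[bool, str]],
    interpret: Callable[[str, object], str],
    max_retries: int = 2,
) -> List[StepResult]:
    """Run each planned step, retrying with auditor feedback until it passes."""
    results = []
    for step in plan:  # Think: the plan supplies the next step
        feedback, passed, artifact, attempts = "", False, None, 0
        while not passed and attempts <= max_retries:
            attempts += 1
            artifact = act(step, feedback)      # Act: run code, produce a plot
            passed, feedback = audit(artifact)  # Observe: auditor check
        # Interpret only audited output; failed steps are recorded, not hidden.
        text = interpret(step, artifact) if passed else None
        results.append(StepResult(step, attempts, passed, text))
    return results
```

Passing the auditor's feedback back into `act` is what closes the loop: a rejected plot or script is regenerated with the critique in hand instead of being interpreted as-is.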
Implications — What this means for the AI ecosystem
EWE signals a shift from AI as predictor to AI as scientific analyst. For businesses and institutions, this shift reshapes the climate-risk stack:
1. Climate Intelligence Becomes Scalable
Experts currently diagnose only a small fraction of impactful weather events. EWE-like systems could:
- analyze every storm globally,
- produce standardized, auditable reports,
- shorten analytic latency from days to minutes.
2. Insurance, Energy, and Agriculture Gain a Real-Time Diagnostic Layer
Extreme weather underwriting can integrate near-real-time causal explanations, improving pricing accuracy and risk transfer.
Grid operators can model storm-driven load shifts using agent-generated mesoscale reasoning.
Agricultural planners can receive rapid diagnosis of drought mechanisms rather than waiting for seasonal summaries.
3. Regulatory Pressure Will Increase
As climate risk disclosure frameworks (ISSB, SEC, EU) mature, companies will need traceable, data-backed meteorological interpretations. EWE-like agents help meet these requirements at scale.
4. AI Governance Must Catch Up
Agentic systems executing code on scientific datasets raise governance concerns:
- robustness of closed-loop reasoning,
- reliability of auto-generated diagnostics,
- accountability for incorrect causal claims.
The benchmark introduced here offers a path toward standardizing evaluation.
Conclusion — Wrap-up
EWE is not just another academic agent: it’s an early prototype of a new automation frontier—scientific reasoning agents grounded in physical data. While still experimental, its structure hints at a future where meteorological diagnosis is fast, consistent, and globally accessible.
For economies increasingly shaped by climate volatility, that future cannot arrive soon enough.
Cognaptus: Automate the Present, Incubate the Future.