Synthetic Defenders: How Generative AI Reinvents Smart Grid Security

TL;DR for operators

A digital substation does not need an AI poet. It needs a detector that notices when a GOOSE message behaves just wrong enough to matter.

The paper behind this article makes two claims that should be kept separate. First, it proposes Advanced Adversarial Traffic Mutation, or AATM, as a way to generate synthetic IEC61850 GOOSE datasets that are more balanced and more protocol-realistic than a conditional GAN baseline. Second, it evaluates a GenAI-based task-oriented dialogue anomaly detection system, implemented with Anthropic Claude Pro, against FNN, RNN, and SVM baselines on 5,000 AATM-generated GOOSE datasets.¹

The strongest evidence is in the comparison, not the branding. AATM reaches a Balance Rate of 0.877 against CGAN's 0.454, and a Realism Rate of 0.849 against CGAN's 0.718. The GenAI detector reports 97.5% accuracy, 97.9% F1-score, and 0.945 Matthews correlation coefficient, while the ML baselines cluster around 87–88.5% accuracy and roughly 0.764–0.776 MCC.

For an operator, the implication is not “let an LLM run the substation.” Please do not give the toaster root access to the grid. The useful interpretation is narrower and more practical: synthetic, rule-aware attack data can improve how security teams test anomaly detectors, and a GenAI-style detector may provide more interpretable triage when the anomaly depends on relationships among time, sequence numbers, state numbers, MAC fields, and data changes.

The business path is therefore: improve scarce attack-data generation, train and benchmark detectors under more balanced threat coverage, use contextual outputs to support analyst review, and only then consider operational integration. This is pilot material for utility SOCs, substation cybersecurity vendors, and testbed teams. It is not yet evidence of certified autonomous defence across live grid infrastructure.

The first comparison is about data, not detection

The easiest way to misread this paper is to jump straight to the GenAI detector and treat the synthetic-data step as plumbing. That would be convenient. It would also be wrong.

In digital substations, anomaly detection depends on examples of things that rarely happen, should not happen, and ideally must not be allowed to happen in production. That creates a data problem. Normal traffic is easier to observe. Rare anomalies, replay attacks, data injection, timing irregularities, packet loss, and protocol-specific errors are harder to collect at useful volume. If a detector mostly sees normal patterns, it becomes excellent at recognizing normality and politely useless at recognizing the rare event that ruins the day.

The paper’s first contribution is AATM: a protocol-aware synthetic-data generation method designed for IEC61850 GOOSE messages. GOOSE, or Generic Object Oriented Substation Event, is used for fast event communication in digital substations. The relevant features are not just generic packet columns. They include fields such as destination and source MAC addresses, APPID, dataset, goid, state number, sequence number, timestamps, and binary data values. These fields have relationships. Some changes are normal. Some are suspicious. Some are suspicious only when another field did not change. Naturally, this is where ordinary tabular modelling begins to sweat.

The authors compare AATM with a conditional generative adversarial network. CGAN is the obvious modern baseline: train a generator and discriminator to produce class-conditioned synthetic samples. AATM takes a different route. Instead of training a neural generator from scratch, it applies perturbations and categorical mutations to existing GOOSE samples under explicit protocol constraints. It uses functions for protocol compliance, balance, and novelty, then guides numerical and categorical changes so that generated attacks can deliberately violate selected rules while still resembling plausible GOOSE traffic.

That distinction matters. A synthetic attack record is not useful merely because it is synthetic. It is useful if it falls into the awkward zone between nonsense and normality: realistic enough to test a detector, abnormal enough to represent a threat, and diverse enough not to train the next model into a narrow corner. The paper’s data-generation comparison is really a test of whether the synthetic data generator understands that awkward zone.

AATM beats CGAN by respecting the protocol’s boring details

The paper defines eight GOOSE rules covering sequence progression, data-change logic, state-number behaviour, field integrity, timestamp format, message frequency, transmission gaps, and replay-like patterns. These rules provide the scaffolding for both synthetic-data generation and anomaly reasoning.

AATM’s advantage comes from using that scaffolding directly. For numerical features, it combines protocol compliance, class balance, and novelty objectives. For categorical features, it defines transition logic so fields such as destination MAC, source MAC, type, APPID, dataset, and goid can be mutated within valid sets rather than hallucinated as arbitrary category soup. This is less glamorous than a large neural generator. It is also exactly the kind of dull engineering that tends to survive contact with industrial protocols.
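To make "mutated within valid sets" concrete, here is a minimal Python sketch of the categorical side only. It is not the paper's algorithm. The field names echo the article, but the allowed-value sets, the `mutate_categorical` helper, and the sample record are all hypothetical and exist purely for illustration.

```python
import random

# Hypothetical valid sets per categorical field. A real deployment would
# derive these from its own substation configuration, not hard-code them.
VALID_VALUES = {
    "dst_mac": ["01:0c:cd:01:00:01", "01:0c:cd:01:00:02"],
    "src_mac": ["00:1a:2b:00:00:10", "00:1a:2b:00:00:11"],
    "appid": ["0x0001", "0x0002", "0x0003"],
}

def mutate_categorical(record, field, rng):
    """Return a copy of `record` with `field` swapped to a different valid value."""
    candidates = [v for v in VALID_VALUES[field] if v != record[field]]
    if not candidates:
        raise ValueError(f"no alternative valid value for {field}")
    mutated = dict(record)  # copy, so the original sample stays untouched
    mutated[field] = rng.choice(candidates)
    return mutated

rng = random.Random(7)  # fixed seed so the sketch is repeatable
normal = {
    "dst_mac": "01:0c:cd:01:00:01",
    "src_mac": "00:1a:2b:00:00:10",
    "appid": "0x0001",
    "stNum": 4,
    "sqNum": 12,
}
suspicious = mutate_categorical(normal, "src_mac", rng)
print(suspicious["src_mac"])
```

The point of the design is the constraint, not the randomness: the mutated value is always drawn from a set the protocol would accept, so the record stays plausible while one relationship is deliberately broken.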

The empirical contrast is sharp:

Data-generation result	CGAN	AATM	Practical interpretation
Balance Rate	0.454	0.877	AATM produces a much more even class distribution across normal, attack, and error categories.
Realism Rate	0.718	0.849	AATM better preserves protocol credibility under the paper’s GOOSE rule checks.
Normal class share	19.2%	7.0%	CGAN over-represents common traffic; AATM reduces dominance by majority classes.
SP-time class share	2.2%	6.7%	AATM gives more coverage to a minority timing-error class.
Zero-day class share	3.2%	10.8%	AATM deliberately expands representation of rare, novel-style anomalies.

The Balance Rate result is the clearest. CGAN does not merely fail to fix imbalance; in the authors’ analysis, it can intensify it by reproducing the statistical dominance of common classes. That is not a moral failure by CGAN. It is a reminder that generative models tend to learn what the data makes easy, and rare operational anomalies are not what the data makes easy.
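This article does not reproduce the paper's Balance Rate formula, so the sketch below swaps in a common stand-in: normalized Shannon entropy over class shares. Treat it as an illustration of how a single number can summarize class balance, not as the paper's metric. The `normalized_entropy` helper and both distributions are hypothetical.

```python
import math

def normalized_entropy(shares):
    """Shannon entropy of class shares divided by log(k); 1.0 means perfectly even."""
    if len(shares) < 2:
        return 1.0  # a single class is trivially "balanced"
    total = sum(shares)
    probs = [s / total for s in shares if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # Divide by log of the full class count so empty classes count against balance.
    return entropy / math.log(len(shares))

even = [1] * 13           # 13 equally sized classes
skewed = [40] + [5] * 12  # hypothetical mix where one class dominates

print(normalized_entropy(even))
print(normalized_entropy(skewed))
```

A generator that copies the dominance of common classes pushes a score like this down, while one that deliberately lifts minority classes pushes it toward 1.0.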

The Realism Rate result is equally important but easier to overstate. AATM’s RR of 0.849 means its generated samples better satisfy the paper’s protocol-compliance scoring framework than the CGAN samples do. It does not mean the data has been validated against every substation configuration, vendor implementation, protection scheme, or field condition. A protocol-aware synthetic dataset is closer to the battlefield than a naïve one. It is still a map.

The second comparison asks whether GenAI detects relationships better than baselines

Only after the synthetic data problem is addressed does the paper move to the detector comparison. This sequencing matters. The detector is evaluated on AATM-generated GOOSE datasets, not on a broad public collection of live utility incidents. The detector result is therefore best interpreted as: given this synthetic, protocol-aware, balanced test environment, how do different anomaly detection approaches compare?

The authors benchmark four systems: FNN, RNN, SVM, and a GenAI-based task-oriented dialogue ADS implemented with Anthropic Claude Pro. The GenAI setup is not presented as a generic chatbot casually chatting with packets. It is framed as a task-oriented dialogue system using GOOSE rules, belief-state style tracking, SQL-like rule checks, and contextual interpretation of message sequences.

This is where the comparison becomes interesting. FNNs and SVMs can work well when discriminative patterns are stable in feature space. RNNs add temporal memory, which should help with packet sequences. But GOOSE anomalies often depend on field relationships: whether sqNum increments correctly under the same sender/receiver pair, whether stNum changes when data changes, whether a repeated data pattern suggests replay, whether a time gap crosses a suspicious threshold, or whether a categorical field changes when it should not.
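The field relationships named above can be written as plain pairwise checks. The Python sketch below is a simplified illustration, not the paper's eight-rule set: the `check_pair` helper, the dictionary keys, and the `MAX_GAP_S` threshold are assumptions. It also skips the sqNum check whenever stNum changes, since this article does not spell out how the counter should behave on a state change.

```python
MAX_GAP_S = 2.0  # hypothetical gap threshold in seconds, not a value from the paper

def check_pair(prev, cur):
    """Compare two consecutive messages and list the relationship rules they break."""
    if cur["src_mac"] != prev["src_mac"]:
        return []  # the checks only make sense within one sender's stream
    issues = []
    data_changed = cur["data"] != prev["data"]
    st_changed = cur["stNum"] != prev["stNum"]
    if data_changed and not st_changed:
        issues.append("data changed but stNum did not")
    if not st_changed and cur["sqNum"] != prev["sqNum"] + 1:
        issues.append("sqNum did not increment by one")
    if cur["t"] - prev["t"] > MAX_GAP_S:
        issues.append("time gap above threshold")
    return issues

sender = "00:1a:2b:00:00:10"  # hypothetical source MAC
prev = {"src_mac": sender, "stNum": 4, "sqNum": 12, "data": (0, 1), "t": 10.0}
ok = {"src_mac": sender, "stNum": 4, "sqNum": 13, "data": (0, 1), "t": 10.5}
odd = {"src_mac": sender, "stNum": 4, "sqNum": 12, "data": (1, 1), "t": 13.5}

print(check_pair(prev, ok))
print(check_pair(prev, odd))
```

Note that no single field in `odd` is out of range. Each violation only shows up when one field is read against another, which is the kind of structure a per-column statistical model can miss.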

A detector that can reason across rules and context has an architectural advantage in this setup. That does not make it magical. It means the benchmark favours systems that can combine sequence, rule, and semantic relationships. In this paper, the GenAI ToD detector appears to benefit from precisely that combination.

The headline metrics are:

Metric	FNN	RNN	SVM	Claude Pro GenAI ADS
TPR	79.0%	87.9%	79.1%	97.9%
FPR	0.0%	10.6%	0.0%	3.2%
FNR	21.0%	12.08%	20.9%	2.1%
Precision	100.0%	92.5%	100.0%	97.9%
Accuracy	87.4%	88.5%	87.4%	97.5%
F1-score	88.3%	90.2%	88.3%	97.9%
Markedness	0.760	0.756	0.761	0.947
Informedness	0.790	0.773	0.791	0.947
MCC	0.775	0.764	0.776	0.945

The paper’s prose at one point refers to 97.9% classification accuracy, but its table reports 97.5% accuracy and 97.9% for TPR, precision, and F1-score. The table is the safer number to use. Tiny clerical inconsistency, large interpretive consequence. Welcome to empirical papers.

The most operational number is the missed-anomaly rate

Accuracy is the friendly metric. It smiles, shakes hands, and hides class imbalance in its jacket pocket.

For substation anomaly detection, the more revealing metric is often the false negative rate: the share of actual anomalies the system misses. In the paper’s results, the GenAI ADS reports an FNR of 2.1%. FNN reports 21.0%, RNN reports 12.08%, and SVM reports 20.9%. That is the operationally meaningful spread.

A detector with a very low false positive rate can still be dangerous if it earns that calm dashboard by missing real anomalies. FNN and SVM both show 0% FPR in the table, but their FNR values are around 21%. In a security setting, silence is not automatically virtue. Sometimes silence is just the alarm system taking a nap.
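A toy calculation shows how a quiet dashboard and a useless detector can coexist. The Python sketch below uses a made-up traffic mix, not figures from the paper, and a hypothetical `rates` helper built from the standard confusion-matrix definitions.

```python
def rates(tp, fp, tn, fn):
    """Return (accuracy, false positive rate, false negative rate) from raw counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    fpr = fp / (fp + tn)  # share of normal traffic wrongly flagged
    fnr = fn / (fn + tp)  # share of real anomalies missed
    return accuracy, fpr, fnr

# Hypothetical mix: 950 normal messages and 50 anomalies.
# This "detector" labels everything normal, so it never raises an alert.
accuracy, fpr, fnr = rates(tp=0, fp=0, tn=950, fn=50)
print(accuracy, fpr, fnr)
```

Under these assumed counts the detector scores 95% accuracy and a 0% false positive rate while missing every anomaly, which is why FNR deserves its own column.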

The GenAI detector’s trade-off is more balanced: 3.2% FPR and 2.1% FNR. That means it introduces some false alarms but misses far fewer anomalies in this experimental setting. The MCC of 0.945 reinforces the point because MCC is harder to flatter when class balance and error types matter. Informedness and Markedness at 0.947 also support the claim that the GenAI system is not merely benefiting from one convenient metric.
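The advanced metrics in the table can be sanity-checked from the simpler ones. Informedness is TPR minus FPR, and in the binary case the magnitude of MCC equals the square root of informedness times markedness. The Python sketch below applies both relations to the table's rounded values. One caveat: the paper's task is multi-class, and how it aggregates these metrics is not stated in this article, so small gaps are expected and the binary identity is used here as an assumption.

```python
import math

# Values copied from the table above, expressed as fractions.
TABLE = {
    #         TPR    FPR    markedness  informedness  MCC
    "FNN":   (0.790, 0.000, 0.760, 0.790, 0.775),
    "RNN":   (0.879, 0.106, 0.756, 0.773, 0.764),
    "SVM":   (0.791, 0.000, 0.761, 0.791, 0.776),
    "GenAI": (0.979, 0.032, 0.947, 0.947, 0.945),
}

def informedness(tpr, fpr):
    """Youden's J: TPR + TNR - 1, which simplifies to TPR - FPR."""
    return tpr - fpr

def mcc_magnitude(inf, mark):
    """Binary-case identity: |MCC| = sqrt(informedness * markedness)."""
    return math.sqrt(inf * mark)

for name, (tpr, fpr, mark, inf_reported, mcc_reported) in TABLE.items():
    inf = informedness(tpr, fpr)
    mcc = mcc_magnitude(inf, mark)
    print(f"{name}: informedness {inf:.3f} (table {inf_reported:.3f}), "
          f"MCC {mcc:.3f} (table {mcc_reported:.3f})")
```

The recomputed values land within a few thousandths of the reported ones for all four detectors, which suggests the table is internally consistent even though the exact averaging scheme is unknown here.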

The confusion-matrix discussion serves as supporting evidence rather than a separate thesis. The authors report stronger diagonal concentration for the GenAI detector and less inter-class confusion, especially across categories such as data injection, DoS, SP-time, feature-specific suspicious parameter changes, replay, and normal traffic. In practical terms, this suggests the detector is not just saying “something looks weird”; it is better at assigning the weirdness to the right kind of weirdness.

The evidence map separates the useful claims from the shiny ones

The paper contains several layers of evidence. They are not all doing the same job.

Paper component	Likely purpose	What it supports	What it does not prove
HIL/Wireshark/tshark data extraction description	Implementation context	GOOSE feature extraction can be grounded in realistic substation-style packet capture workflows.	The paper does not make the HIL testbed itself the main experimental contribution.
Eight GOOSE rules	Mechanism definition	Synthetic generation and detection are constrained by protocol-specific relationships, not just generic packet statistics.	The rules do not cover every IEC61850 protocol or every deployment variant.
CGAN vs AATM class distributions	Main evidence for data generation	AATM generates more balanced class coverage across 13 normal/anomaly/error categories.	Balance alone does not certify operational realism.
BR and RR comparison	Main evidence for synthetic-data quality	AATM improves both distributional balance and rule-based realism versus CGAN.	RR is defined by the paper’s rule framework, not by live utility validation.
FNN/RNN/SVM vs GenAI ADS table	Main evidence for detection	Claude Pro ToD ADS outperforms baselines across standard and advanced metrics on AATM-generated datasets.	It does not show superiority across all detector architectures, all prompts, all LLMs, or all live environments.
Confusion matrices	Supporting diagnostic evidence	GenAI classification appears less confused across similar anomaly classes.	The paper does not provide a full deployment study with latency, failover, adversarial prompting, or compliance review.
Example GenAI response	Interpretability illustration	The detector can articulate rule-based reasons for classifications.	A readable explanation is not automatically a verified causal explanation.

This matters because the business reader is usually tempted to compress the paper into one sentence: “GenAI beats ML for smart-grid cyber defence.” That sentence is technically convenient and strategically hazardous. The better sentence is: “Protocol-aware synthetic data plus contextual rule reasoning produced stronger detection results than the selected ML baselines in this AATM-generated GOOSE benchmark.”

Less catchy. More useful. Tragic, really.

The business value starts before the detector

For utilities and vendors, the immediate value is not necessarily replacing the current anomaly detector. The first monetisable layer is better test data.

A utility SOC or substation security vendor faces a stubborn problem: rare attack classes are expensive to collect and risky to stage. Waiting for real incidents is not a data strategy; it is a resignation letter with a timestamp. AATM suggests a way to build richer synthetic testbeds for attack and error categories that would otherwise be underrepresented.

That opens three business pathways.

First, model evaluation becomes less flattering. A detector that looks strong on imbalanced data can be exposed when minority classes receive proper coverage. This is useful for procurement, benchmarking, and internal model governance. Nobody wants to discover during an incident that the detector’s great accuracy came from learning “mostly normal.”

Second, red-team simulation becomes more systematic. Because AATM targets specific protocol-rule violations while preserving broader validity, it can support scenario libraries: replay-like behaviour, suspicious parameter changes, timing gaps, frequency anomalies, and zero-day-style perturbations. That helps security teams test detection logic without relying only on historical incidents.

Third, triage can become more explainable. The GenAI detector’s example outputs describe why a dataset is classified as normal, data injection, or DoS-like, using relationships among stNum, sqNum, data fields, and timing patterns. For analysts, that explanation can reduce the gap between “alert fired” and “engineer understands what to inspect.” This is where GenAI’s language interface may matter: not as a replacement for protection engineering, but as a bridge between packet-level signals and operational reasoning.

The procurement question is not “Do we buy GenAI?”

A better procurement question is: which part of the workflow should be made more adaptive?

The paper implies a layered adoption model:

Layer	Practical use	Buyer question
Synthetic data generation	Create balanced, protocol-aware GOOSE anomaly datasets for testing and training.	Can this generator reproduce our substation rules, configurations, and attack assumptions?
Detector benchmarking	Compare existing ADS tools against balanced synthetic scenarios.	Which detector fails on rare but high-impact categories?
Analyst triage	Generate rule-based explanations for detected anomalies.	Does the explanation help operators act faster without creating false confidence?
Continuous learning	Incorporate new scenarios and feedback loops.	Who validates updates before they touch operational workflows?
Production integration	Connect detection outputs to SOC or substation monitoring systems.	What are the latency, audit, failover, cybersecurity, and compliance controls?

The correct pilot is therefore not “install GenAI in the substation and enjoy the future.” It is more boring and more defensible: generate synthetic GOOSE scenarios, benchmark current ADS performance, validate explanations with protection engineers, and measure whether analysts reduce missed anomalies and triage time without creating new operational risk.

A pilot like that can produce business evidence. A dashboard demo produces screenshots.

The boundary is synthetic GOOSE, not universal grid autonomy

The paper’s own future-work direction acknowledges the need to move beyond single-protocol analysis toward broader IEC61850 coverage, utility-managed deployments, physics-aware models, and secure information exchange across installations. That future work is not a footnote; it is the boundary line.

Several limitations affect interpretation.

First, the detector comparison uses AATM-generated datasets. That is appropriate for testing the proposed pipeline, but it means the strongest result is internal to the paper’s synthetic generation and evaluation environment. Field data, diverse vendors, different substation topologies, and changing operational practices may shift performance.

Second, the focus is GOOSE messages. IEC61850 environments also involve other communication types and operational layers. The paper suggests the method could extend to other multicast messages, but extension is not demonstration.

Third, the GenAI detector is evaluated against FNN, RNN, and SVM baselines. Those are meaningful baselines, but they do not exhaust the modern anomaly detection menu. A production buyer would still want comparisons against stronger domain-specific, hybrid, physics-aware, graph-based, and rule-engine baselines.

Fourth, the paper gives performance metrics but not a complete operational deployment study. Real substations care about latency, deterministic behaviour, auditability, fail-safe design, model update governance, adversarial robustness, data sovereignty, and regulatory approval. The paper’s most useful claim lives upstream of certified control action.

Finally, explanation is not the same as correctness. A GenAI system can provide a plausible reason. The paper’s examples are useful because they show alignment with explicit GOOSE rules. But production usage would need independent verification, logging, and human review. “The model explained itself nicely” is not a compliance framework. It is a meeting note with better grammar.

The strategic reading: synthetic defenders are a data infrastructure play

The paper’s title foregrounds GenAI, but its strongest business contribution is the pairing of synthetic data generation with contextual detection. AATM supplies better threat coverage. The GenAI ToD detector uses rules and message context to classify anomalies with stronger reported metrics than selected ML baselines. Together, they make smart-grid security look less like a pure model-selection problem and more like a data-infrastructure problem.

That is the useful shift.

For years, industrial cybersecurity has been stuck between two uncomfortable facts: the most dangerous events are rare, and the systems that need protection are too important to use as casual experiment platforms. Synthetic, protocol-aware anomaly generation is one way to escape that trap. It gives defenders more scenarios to test before the grid supplies its own, usually at an inconvenient hour.

The GenAI layer then adds a second advantage: contextual interpretation. When anomaly classes depend on relationships among fields and sequences, a detector that can reason over rules may outperform one that only learns statistical surfaces. The paper’s results support that possibility within its benchmark.

The sober conclusion is not that GenAI has reinvented smart-grid security all by itself. The sober conclusion is more interesting: in critical infrastructure, GenAI becomes useful when it is boxed inside domain rules, fed with disciplined synthetic data, and judged by operational error trade-offs rather than vibes.

Synthetic defenders will not replace protection engineers. They may, however, give those engineers a better adversary to practise against and a better assistant when the packets start telling strange little lies.

Cognaptus: Automate the Present, Incubate the Future.

¹ Aydin Zaboli and Junho Hong, “Generative AI for Critical Infrastructure in Smart Grids: A Unified Framework for Synthetic Data Generation and Anomaly Detection,” arXiv:2508.08593, 2025, https://arxiv.org/abs/2508.08593.
