Email looks simple until money is involved.
A suspicious invoice arrives. The subject line is dull, the body is polite, the sender domain looks almost right, and the attachment name is just credible enough to avoid comedy. A traditional filter may look for bad words, suspicious links, known domains, or old campaign signatures. A human may look for tone. An LLM may read the whole thing and decide whether the message is phishing, spam, or valid.
The awkward part is not that LLMs fail. The awkward part is that they fail in a very specific way.
The paper behind this article, The Phish, The Spam, and The Valid: Generating Feature-Rich Emails for Benchmarking LLMs, introduces PhishFuzzer, a metadata-rich email generation and benchmarking framework designed to test exactly this problem [1]. Its most useful finding is not the headline dataset size, although 23,100 emails is not exactly a postcard collection. The useful finding is a mechanism: adding structural email metadata improves phishing detection, but it also pushes more spam into the “valid” bucket.
So the naive lesson — “give the model more context and it will behave better” — needs a small, unfashionable correction.
More context helps only if the decision boundary is the right one.
The real problem is not phishing alone, but the three-way split
Most business discussions about email defense collapse the problem into a binary choice: malicious or safe. That is convenient. It is also operationally incomplete.
Enterprise inboxes do not contain only attacks and harmless communication. They also contain unsolicited marketing, conference invitations, vendor outreach, gray-area promotions, newsletter clutter, aggressive sales sequences, and messages that are annoying but not necessarily malicious. The model has to distinguish at least three classes:
| Class | Operational meaning | Why it is difficult |
|---|---|---|
| Phishing | Deceptive message designed to steal credentials, trigger payment, open malicious files, or manipulate the user into risky action | Usually has stronger security signals, especially links, attachments, impersonation, or urgent action requests |
| Spam | Unwanted or low-value communication, often promotional or gray-area | Depends heavily on recipient context and organizational policy |
| Valid | Legitimate communication that should not be blocked | Can resemble spam in format, tone, or sender unfamiliarity |
That middle category is where the paper becomes interesting. Phishing is often security-coded. Spam is preference-coded. A fake login page is bad for everyone. A vendor webinar invitation may be useful to one employee, useless to another, and suspicious to a third who has been trained to distrust all unsolicited links. Welcome to classification. Please enjoy your ambiguity.
PhishFuzzer’s contribution is therefore not merely “another phishing dataset.” It builds a three-class benchmark where the model must separate phishing, spam, and valid email while also receiving, or not receiving, structural metadata such as sender information, URLs, and attachment filenames.
That setup matters because real email defense is not a reading comprehension exam. It is a policy decision under uncertainty.
PhishFuzzer builds variants from real email templates
The dataset starts with 3,300 seed emails. Of these, 300 are manually curated private emails, balanced across phishing, spam, and legitimate messages. The remaining 3,000 come from public datasets, which are useful but structurally incomplete: URLs may only appear when they are visible in the body, and attachments may be represented as binary flags rather than filenames.
The authors then enrich the dataset in stages.
First, they benchmark LLMs on intent labeling. A 99-email subset, balanced across classes, is labeled by two domain experts according to the primary action requested from the recipient: follow a link, open an attachment, reply, or unknown. Several LLMs are tested repeatedly. Gemini-2.5-Flash performs best in this step, reaching 97.98% strict accuracy with perfect internal consistency.
Second, the authors test LLMs for structural label augmentation: generating plausible URLs and attachment names when public datasets lack these fields. This is not free-form creative writing. The generation is constrained by motivation and class. A phishing message that asks the recipient to follow a link should receive a deceptive or invented domain, not a real official domain that would imply a valid sender. A spam or legitimate email can use real corporate domains where appropriate. The authors report that Gemini-2.5-Flash produced the most reliable metadata, while Claude and GPT sometimes generated placeholders such as “insert link here,” which is charming in a deeply unusable way.
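A quality gate for generated metadata could look something like the sketch below. It is not the authors' pipeline: the pattern list, the function names, and the sample domain are all invented for illustration, and the only placeholder the source actually mentions is "insert link here".

```python
import re
from urllib.parse import urlparse

# Hypothetical placeholder patterns (assumption). The article only names
# "insert link here"; the rest are plausible relatives added for illustration.
PLACEHOLDER_PATTERNS = [
    r"insert\s+(link|url|attachment|file)",
    r"\[(link|url|attachment)\]",
    r"example\.(com|org|net)",
]


def looks_like_placeholder(value: str) -> bool:
    """Return True if the generated field resembles an unfilled placeholder."""
    lowered = value.strip().lower()
    return any(re.search(pattern, lowered) for pattern in PLACEHOLDER_PATTERNS)


def is_usable_url(value: str) -> bool:
    """Accept only http(s) URLs with a dotted host that are not placeholders."""
    if looks_like_placeholder(value):
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in {"http", "https"} and "." in parsed.netloc


# Invented sample values, not taken from the dataset.
candidates = [
    "https://secure-payrol1-update.com/login",
    "insert link here",
    "[LINK]",
]
usable = [url for url in candidates if is_usable_url(url)]
```

A check like this only catches the lazy failures. Whether a generated domain is appropriately deceptive for a phishing email, or appropriately official for a valid one, still needs the class-aware constraints the authors describe.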
Third, the enriched seed dataset is expanded. Each seed email becomes a template. For every template, the authors generate six synthetic variants along two controlled dimensions:
| Dimension | Values |
|---|---|
| Entity type | Globally recognized entity; fabricated but realistic entity |
| Length | Short; medium; long |
The result is 3,300 original seed emails plus 19,800 synthetic variants, for a final dataset of 23,100 emails. Each template has seven instances: the original plus six variants. This design lets the authors ask whether a model understands the underlying email logic or merely reacts to surface wording.
That is the first important mechanism. The benchmark is not only testing whether the model can classify one message. It is testing whether the model remains stable when the same intent is expressed with different names, lengths, and wording.
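The dataset arithmetic follows directly from the two-by-three variant grid. The short sketch below reproduces it; the string labels for entity type and length are shorthand chosen here, not identifiers from the paper.

```python
from itertools import product

# Shorthand labels (assumption); the paper describes these dimensions in prose.
ENTITY_TYPES = ["recognized", "fabricated"]
LENGTHS = ["short", "medium", "long"]
SEED_EMAILS = 3_300

# Every seed email is expanded along both dimensions.
variant_grid = list(product(ENTITY_TYPES, LENGTHS))
variants_per_template = len(variant_grid)

synthetic_total = SEED_EMAILS * variants_per_template
dataset_total = SEED_EMAILS + synthetic_total
instances_per_template = 1 + variants_per_template  # original plus variants

print(variants_per_template, synthetic_total, dataset_total, instances_per_template)
```

The seven instances per template are what make the template-level metrics later in the article possible.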
The benchmark separates four kinds of evidence
The paper’s experiments are easy to misread if treated as one big leaderboard. They are better read as four different tests, each with a different purpose.
| Test or result group | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Intent-label benchmark on 99 expert-labeled emails | Implementation validation | The enrichment pipeline can identify requested recipient actions with high reliability when using the selected model | That all intent labels across the final dataset are perfectly correct |
| Metadata augmentation benchmark | Implementation detail and data-quality control | Structural fields such as URLs and filenames can be generated under explicit constraints | That synthetic metadata is equivalent to real production telemetry |
| Provenance and variant analysis | Robustness and sensitivity test | Models respond differently to private/public seeds and generated variants | That one model is universally safer in all deployment settings |
| Confusion matrices, F1 scores, Conf@7, TFS@7 | Main evidence on classifier behavior | Metadata improves phishing detection while worsening spam sensitivity | That zero-shot LLMs alone are sufficient enterprise filters |
This distinction is useful because the paper contains several numbers, and numbers have a habit of pretending to be conclusions. The most important numbers here are not the aggregate accuracies. They are the class-level shifts.
Metadata makes phishing sharper and spam blurrier
The authors benchmark Qwen-2.5-72B and Gemini-3.1-Pro under two prompting settings:
- Basic: subject and body;
- Full: subject, body, sender, URLs, and attachment filenames.
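To make the two settings concrete, here is a minimal sketch of how such prompts might be assembled. The instruction wording, the `Email` structure, and the field layout are assumptions made for illustration; the paper's actual prompts are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Email:
    """Hypothetical container for one benchmark email."""
    subject: str
    body: str
    sender: str = ""
    urls: List[str] = field(default_factory=list)
    attachments: List[str] = field(default_factory=list)


# Assumed instruction text, not the authors' wording.
INSTRUCTION = "Classify the email as exactly one of: phishing, spam, valid."


def build_prompt(email: Email, mode: str = "basic") -> str:
    """Build a Basic (subject + body) or Full (plus metadata) prompt."""
    if mode not in {"basic", "full"}:
        raise ValueError("mode must be 'basic' or 'full'")
    parts = [INSTRUCTION, f"Subject: {email.subject}", f"Body: {email.body}"]
    if mode == "full":
        # Structural metadata is appended only in the Full setting.
        parts.append(f"Sender: {email.sender or '(none)'}")
        parts.append("URLs: " + (", ".join(email.urls) or "(none)"))
        parts.append("Attachments: " + (", ".join(email.attachments) or "(none)"))
    return "\n".join(parts)


# Invented sample email.
sample = Email(
    subject="Invoice 4471 overdue",
    body="Please review the attached invoice and confirm payment today.",
    sender="billing@acme-payments-support.com",
    urls=["https://acme-payments-support.com/pay"],
    attachments=["invoice_4471.pdf.html"],
)
basic_prompt = build_prompt(sample, "basic")
full_prompt = build_prompt(sample, "full")
```

Running the same email through both builders is the cheapest way to see what the model is, and is not, being told in each setting.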
At first glance, the aggregate accuracy barely moves. Qwen is at 0.688 accuracy in Basic and 0.684 in Full. Gemini is at 0.733 in Basic and 0.731 in Full. A hurried reader might conclude that metadata does not matter.
That would be the wrong conclusion.
The confusion matrices and class-specific F1 scores show the real effect. Metadata changes the model’s posture.
| Model | Prompt | Accuracy | Macro F1 | F1 Phishing | F1 Spam | Conf@7 | TFS@7 |
|---|---|---|---|---|---|---|---|
| Qwen-2.5-72B | Basic | 0.688 | 0.655 | 0.893 | 0.387 | 52.88% | 15.45% |
| Qwen-2.5-72B | Full | 0.684 | 0.636 | 0.917 | 0.310 | 55.33% | 17.42% |
| Gemini-3.1-Pro | Basic | 0.733 | 0.694 | 0.939 | 0.428 | 60.55% | 14.15% |
| Gemini-3.1-Pro | Full | 0.731 | 0.685 | 0.958 | 0.383 | 62.88% | 14.88% |
The phishing result is strong. Qwen’s phishing F1 rises from 0.893 to 0.917 when metadata is added. Gemini’s phishing F1 rises from 0.939 to 0.958. In concrete terms, Qwen misclassifies 1,190 phishing emails as valid under Basic prompting; with metadata, that falls to 882. Gemini starts much stricter: only 33 phishing emails are classified as valid under Basic, falling to 13 with metadata.
That is the good news.
The bad news is spam. Qwen’s spam F1 falls from 0.387 to 0.310. Gemini’s falls from 0.428 to 0.383. Spam recall also drops: Qwen falls from 25.11% to 19.06%, and Gemini falls from 28.65% to 24.74%. Most of those missed spam emails are absorbed into the valid class.
This is not a random side effect. It is the central mechanism.
Metadata gives the model stronger signals for phishing. Sender, URL, and attachment data make impersonation, deceptive domains, and risky action requests easier to spot. But those same signals can make promotional or gray-area messages look structurally legitimate. If a spam email has a plausible sender, a real-looking domain, and a coherent marketing narrative, the model may decide it is valid. The email is not cleanly malicious; it is merely unwanted. Models, like executives, are often terrible at dealing with “merely unwanted.”
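The pattern of flat accuracy with moving class-level scores is easy to reproduce on paper. The sketch below computes per-class precision, recall, and F1 from a three-class confusion matrix. The two matrices are toy numbers invented to mimic the direction of the effect; they are not the paper's confusion matrices.

```python
LABELS = ["phishing", "spam", "valid"]


def class_metrics(confusion):
    """confusion[i][j] = emails of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    report = {"accuracy": correct / total}
    f1_values = []
    for i, label in enumerate(LABELS):
        predicted_i = sum(confusion[r][i] for r in range(n))  # column sum
        actual_i = sum(confusion[i])                          # row sum
        precision = confusion[i][i] / predicted_i if predicted_i else 0.0
        recall = confusion[i][i] / actual_i if actual_i else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
        f1_values.append(f1)
    report["macro_f1"] = sum(f1_values) / n
    return report


# Toy matrices (invented): 100 emails per class, rows are true classes.
TOY_BASIC = [[90, 0, 10], [5, 25, 70], [5, 5, 90]]
TOY_FULL = [[95, 0, 5], [3, 19, 78], [3, 6, 91]]

basic = class_metrics(TOY_BASIC)
full = class_metrics(TOY_FULL)
for name, rep in (("basic", basic), ("full", full)):
    print(name, round(rep["accuracy"], 3),
          round(rep["phishing"]["f1"], 3), round(rep["spam"]["f1"], 3))
```

In this toy case both settings classify the same number of emails correctly, yet phishing F1 rises and spam F1 falls. An aggregate score hides exactly that kind of shift.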
Gemini is stricter; Qwen is more relaxed
The paper’s model comparison is not just “Gemini performs better.” The more useful interpretation is that the models exhibit different security postures.
Qwen-2.5-72B behaves as a more relaxed classifier. It is less likely to flag valid emails as phishing, which reduces false positives on legitimate traffic. But it also misses more phishing emails, especially without metadata. Metadata helps Qwen reduce phishing-as-valid errors, but it worsens spam detection.
Gemini-3.1-Pro behaves more strictly. It misses far fewer phishing emails, but under Basic prompting it penalizes legitimate traffic more often. In the Basic setting, Gemini misclassifies 398 valid emails as phishing. With metadata, that number falls to 215. Here metadata is not merely an attack detector; it also acts as a corrective signal for legitimate messages.
That creates an uncomfortable but useful business point: “best model” depends on the cost structure.
If the organization is a bank, a hospital, or a law firm, missed phishing may be far more expensive than overblocking promotional messages. A stricter model may be tolerable. If the organization depends heavily on external sales leads, procurement messages, or event invitations, false positives and gray-area spam handling become more expensive. In that case, model strictness must be tuned against business workflow, not just security ideology.
The inbox is not a battlefield with only enemies and allies. It is also a networking event with malware.
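One way to make "depends on the cost structure" operational is to weight each cell of the confusion matrix by what that error costs the organization. The sketch below does that for two invented model profiles and two invented cost profiles. None of these numbers come from the paper; the units are arbitrary.

```python
LABELS = ["phishing", "spam", "valid"]


def total_cost(confusion, cost):
    """Sum of count * unit cost over every (true, predicted) cell."""
    size = len(LABELS)
    return sum(confusion[t][p] * cost[t][p]
               for t in range(size) for p in range(size))


# Toy confusion matrices (invented), rows = true class, columns = predicted.
STRICT_MODEL = [[98, 0, 2], [10, 30, 60], [25, 5, 70]]
RELAXED_MODEL = [[94, 0, 6], [4, 26, 70], [2, 3, 95]]

# Hypothetical unit costs. Diagonal cells are correct decisions and cost 0.
# A bank treats a missed phish as very expensive.
BANK_COST = [[0, 20, 100], [1, 0, 1], [2, 1, 0]]
# A sales-driven firm pays heavily for blocked legitimate mail.
SALES_COST = [[0, 5, 20], [1, 0, 1], [15, 10, 0]]

results = {
    (org, model): total_cost(matrix, cost)
    for org, cost in (("bank", BANK_COST), ("sales", SALES_COST))
    for model, matrix in (("strict", STRICT_MODEL), ("relaxed", RELAXED_MODEL))
}
for key, value in results.items():
    print(key, value)
```

With these made-up inputs the strict model is cheaper for the bank and the relaxed model is cheaper for the sales-driven firm. The ranking flips without either model changing, which is the point.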
Conf@7 and TFS@7 reveal whether errors are accidental or structural
One of the more valuable parts of the paper is its template-level reliability analysis.
Because every seed email has six generated variants, the authors can group seven related messages under the same template. They then compute:
- Conf@7: the share of templates where the model correctly classifies all seven versions;
- TFS@7: the share of templates where the model fails on all seven versions.
This is better than ordinary accuracy for one reason: it distinguishes noisy mistakes from systematic blind spots.
A model that misclassifies one variant may have stumbled over phrasing. A model that misclassifies the original plus all six variants has misunderstood the underlying email family. In security terms, that is more serious. Attackers do not need every variation to bypass a filter. They need reliable weak spots.
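The two template-level metrics can be sketched in a few lines, following the definitions given above. The record format `(template_id, true_label, predicted_label)` and the function name are assumptions made here; the toy records are invented.

```python
from collections import defaultdict


def template_metrics(records, group_size=7):
    """Return (Conf@k, TFS@k) as fractions of complete templates.

    records: iterable of (template_id, true_label, predicted_label).
    """
    outcomes = defaultdict(list)
    for template_id, true_label, predicted_label in records:
        outcomes[template_id].append(true_label == predicted_label)

    # Only templates with the full set of instances are scored.
    complete = [hits for hits in outcomes.values() if len(hits) == group_size]
    if not complete:
        raise ValueError("no template has the expected number of instances")

    conf = sum(all(hits) for hits in complete) / len(complete)      # all correct
    tfs = sum(not any(hits) for hits in complete) / len(complete)   # all wrong
    return conf, tfs


# Toy data (invented): four templates with seven instances each.
toy_records = (
    [("t1", "phishing", "phishing")] * 7                       # all correct
    + [("t2", "spam", "valid")] * 7                            # all wrong
    + [("t3", "spam", "spam")] * 4 + [("t3", "spam", "valid")] * 3  # mixed
    + [("t4", "valid", "valid")] * 7                           # all correct
)
conf_at_7, tfs_at_7 = template_metrics(toy_records)
print(conf_at_7, tfs_at_7)
```

Note that the mixed template counts toward neither metric. Conf@7 and TFS@7 deliberately look only at the two extremes, which is why they do not sum to one.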
Gemini has stronger template consistency than Qwen. Under Full prompting, Gemini reaches 62.88% Conf@7 compared with Qwen’s 55.33%. But metadata also increases the share of consistently failed templates for both models. Qwen’s TFS@7 rises from 15.45% to 17.42%, which corresponds to 510 and 575 of the 3,300 templates. Gemini’s rises from 14.15% to 14.88%, or from 467 templates to 491.
Again, more metadata makes correct classifications more stable, but it does not eliminate systematic failure. In some cases, it makes the model more confidently wrong. A very modern problem, though not an exclusively artificial one.
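The template counts quoted in this section line up with the TFS@7 percentages in the results table once they are applied to the 3,300 templates. A quick check, using only figures already given in the article:

```python
TEMPLATES = 3_300

# TFS@7 percentages as reported in the results table above.
tfs_percent = {
    ("Qwen-2.5-72B", "Basic"): 15.45,
    ("Qwen-2.5-72B", "Full"): 17.42,
    ("Gemini-3.1-Pro", "Basic"): 14.15,
    ("Gemini-3.1-Pro", "Full"): 14.88,
}

# Convert each percentage into a whole number of templates.
tfs_counts = {
    key: round(TEMPLATES * pct / 100) for key, pct in tfs_percent.items()
}
for key, count in tfs_counts.items():
    print(key, count)
```

The rounding step is an assumption about how the percentages were derived, but the resulting counts match the ones stated in the text.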
The TFS matrices also expose the class-level nature of these failures. Spam-to-valid is the dominant systematic error. Valid emails are rarely systematically misclassified, and metadata eliminates the small number of valid-template blind spots for both models. For phishing, metadata generally helps, though the paper reports one notable case where manual investigation found that a phishing label inherited from a public dataset was wrong: the email was actually a legitimate inquiry from an MIT Lincoln Laboratory researcher. The authors retained the original label for the experiment but corrected it in the released dataset.
That example matters. Systematic failure analysis does not only diagnose models. It can diagnose the benchmark itself.
The paper is really about benchmark design, not model triumph
The tempting article headline would be: “LLMs are great at phishing detection.” That would be partly true and mostly insufficient.
A better headline is that email-security benchmarks need to preserve the evidence that real systems use. Body text alone is not enough. Sender fields, URLs, attachment names, provenance, intent labels, and controlled variants all change the shape of the evaluation.
PhishFuzzer contributes three practical benchmark-design ideas.
First, structural metadata should be part of evaluation. Real email filters do not see only prose. A benchmark that strips URLs, sender domains, and attachments turns email defense into an artificial reading task.
Second, three-class classification should not be collapsed too quickly. Combining phishing and spam into “unwanted” may improve apparent simplicity, but it hides the operational distinction between malicious deception and low-value communication. Security teams need that distinction because the appropriate response differs.
Third, variant grouping reveals robustness. A model that performs well on individual messages may still fail consistently on a family of semantically equivalent variants. Template-level metrics make those blind spots visible.
For vendors, this matters because demos often reward cherry-picked examples. For SOC teams, it matters because production systems fail in clusters, not in isolated academic rows. For governance teams, it matters because model evaluation has to include failure modes that correspond to real operational harm.
What businesses can infer, and what they should not
The paper directly shows that, on PhishFuzzer, two zero-shot LLMs achieve strong phishing detection, especially when metadata is included. It also shows that spam remains difficult and that metadata can worsen spam-versus-valid separation. Gemini is more consistent across variants, while Qwen appears more sensitive to fuzzing and more relaxed in classification posture.
From this, Cognaptus would infer three business-relevant lessons.
| Business question | Paper-grounded answer | Practical interpretation |
|---|---|---|
| Should email-security evaluations include metadata? | Yes. Metadata materially changes class-level behavior. | Test models under both body-only and full-context settings before deployment. |
| Can LLMs replace conventional email filters? | Not shown. The study tests two zero-shot LLMs on a benchmark. | Treat LLMs as semantic reasoning components, not complete filtering infrastructure. |
| Where is the main deployment risk? | Spam and valid emails blur together, especially with metadata. | Use policy layers, user preferences, sender reputation, and feedback loops for gray-area classification. |
The uncertain part is deployment generalization. PhishFuzzer is partly synthetic and partly enriched. It is designed carefully, but it is still a benchmark, not a live enterprise mailstream. It does not test fine-tuned encoders, traditional ML baselines, commercial secure email gateways, or hybrid systems under production latency and adversarial feedback.
That boundary is not a weakness of the paper. It is the correct place to stop pretending.
Hybrid filtering looks more realistic than LLM-only filtering
The most plausible architecture after reading this paper is not a heroic LLM sitting alone at the gate.
A more realistic system would combine:
- Structural checks for sender authentication, domain reputation, URL risk, attachment type, and known malicious infrastructure;
- LLM-based semantic analysis for intent, deception, urgency, impersonation, and action requests;
- User-aware or organization-aware policy for spam versus valid communication;
- Template-level robustness testing using paraphrases and entity substitutions;
- Feedback loops from user reports, SOC review, and false-positive analysis.
This division of labor matches the paper’s evidence. LLMs are valuable where meaning matters: understanding whether the email asks the recipient to click, open, reply, pay, reset, verify, or disclose. But spam is not merely a semantic class. It is a relationship between message, recipient, timing, role, and organizational tolerance.
A model cannot know whether a vendor invitation is useful unless the system knows something about the recipient’s job, current projects, accepted vendors, prior interactions, and organizational policy. Without that context, “spam” becomes a vibes-based category wearing a lab coat.
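The layered design listed above can be sketched as a simple routing function. This is one possible arrangement, not an architecture the paper proposes or evaluates. The three callables are hypothetical stand-ins for a structural scanner, an LLM classifier, and a recipient-aware policy lookup.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    label: str   # "phishing", "spam", or "valid"
    source: str  # which layer made the decision


def route_email(
    email: dict,
    structural_check: Callable[[dict], bool],
    llm_classify: Callable[[dict], str],
    recipient_policy: Callable[[dict], Optional[bool]],
) -> Verdict:
    """Route one email through structural, semantic, and policy layers."""
    # Layer 1: hard structural evidence (failed authentication, known-bad URL).
    if structural_check(email):
        return Verdict("phishing", "structural")

    # Layer 2: semantic judgment from the LLM.
    llm_label = llm_classify(email)
    if llm_label == "phishing":
        return Verdict("phishing", "llm")

    # Layer 3: spam versus valid is decided by policy when policy has an
    # opinion (True = wanted, False = unwanted, None = unknown).
    wanted = recipient_policy(email)
    if wanted is None:
        return Verdict(llm_label, "llm")
    return Verdict("valid" if wanted else "spam", "policy")


# Stub layers and an invented email, for illustration only.
webinar = {"sender": "events@vendor.example", "subject": "Join our webinar"}
verdict = route_email(
    webinar,
    structural_check=lambda e: False,
    llm_classify=lambda e: "valid",
    recipient_policy=lambda e: False,  # this recipient opted out of vendor mail
)
print(verdict)
```

The design choice worth noticing is that the policy layer can overrule the LLM on spam versus valid but never on phishing. That mirrors the article's argument that phishing is security-coded while spam is preference-coded.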
The limitations are practical, not ceremonial
The authors identify several boundaries that materially affect interpretation.
The evaluation covers only two LLMs in a zero-shot setting. That leaves open the performance of fine-tuned encoder models, classical machine-learning systems, and commercial hybrid filters. It also leaves open whether model-specific prompting or calibration could improve the spam-valid boundary.
The dataset includes synthetic variants and LLM-enriched metadata. This is useful for controlled testing, but synthetic structure is not the same as production telemetry. Real attackers adapt. Real senders have messy histories. Real employees click things because they are tired.
The spam category is inherently subjective. This is the most important limitation for business readers. The paper’s spam failures are not merely model defects; they reflect a category whose ground truth depends on recipient preference and organizational policy. That does not make the metric useless. It means spam classification should be treated as policy-sensitive rather than as a universal label.
Finally, the discovery of a mislabeled public-dataset email is a useful reminder that benchmark labels are not sacred. Sometimes the model is wrong. Sometimes the dataset is wrong. Sometimes both are confidently wrong and everyone has a meeting.
The operational takeaway: evaluate the shift, not just the score
The best way to use this paper is not to ask, “Which LLM won?”
The better question is: “How does adding real email context shift the model’s decision boundary?”
In PhishFuzzer, metadata shifts the boundary in a recognizable direction. Phishing detection improves. Valid-email false positives can improve, especially for a stricter model like Gemini. Spam detection worsens because borderline promotional or gray-area messages increasingly look legitimate.
That is the finding security teams should take seriously.
A production filter should not optimize only for aggregate accuracy. It should evaluate class-level trade-offs, template-level blind spots, and the business cost of each error type. A missed phishing email may cause credential theft. A blocked valid email may disrupt operations. A missed spam email may merely annoy the user — until enough annoyance trains users to ignore warnings altogether.
Email security in the LLM era is not about finding a magic classifier. It is about building systems that understand text, inspect structure, respect policy, and admit uncertainty where the category itself is unstable.
That is less glamorous than “AI stops phishing.” It is also much closer to reality.
And reality, unlike spam, is still worth delivering to the inbox.
Cognaptus: Automate the Present, Incubate the Future.
[1] Rebeka Toth, Nils Gruschka, and Tamas Bisztray, “The Phish, The Spam, and The Valid: Generating Feature-Rich Emails for Benchmarking LLMs,” arXiv:2511.21448.