Opening — Why this matters now
Email security is entering its awkward adolescence. Attackers now wield LLMs capable of generating eerily convincing phishing text, while defenders cling to filters built for a more primitive era of Nigerian princes and typo‑riddled scams. The result is predictable: evasion rates are climbing, and organizations are discovering that legacy rule‑based systems buckle quickly when the attacker speaks fluent machine‑generated politeness.
The paper under review — an extensive effort to build and benchmark a richly labeled phishing–spam dataset — arrives at precisely the moment when businesses need sharper models and sharper thinking. It exposes a gap many executives suspect but rarely quantify: LLMs are good at flagging phishing, mediocre at identifying spam, and still imperfect at decoding the emotional manipulation underlying modern social‑engineering campaigns.
Background — Context and prior art
Traditional defenses treat email filtering as a text‑classification problem: isolate keywords, track sender reputation, look for suspicious attachments, deploy a Bayesian filter, and hope for the best. These models stumble when language becomes fluid — especially when adversaries use LLMs to rewrite the same malicious message into dozens of paraphrased, human‑like variants.
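To make the weakness concrete, here is a minimal sketch of the kind of bag‑of‑words Bayesian filter legacy systems rely on. The training examples and the scikit‑learn setup are illustrative assumptions, not anything from the paper — the point is only that token counts carry the signal, so a paraphrase that swaps trigger words starves the model.

```python
# Minimal bag-of-words Naive Bayes filter of the kind legacy systems use.
# Training examples are illustrative placeholders, not the paper's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "Verify your account now to avoid suspension",    # phishing
    "URGENT: confirm your password within 24 hours",  # phishing
    "Meeting notes from Tuesday attached",             # valid
    "Lunch on Thursday? Let me know",                   # valid
]
train_labels = ["phishing", "phishing", "valid", "valid"]

# Token counts feed a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# A paraphrased attack keeps the intent but swaps the trigger words,
# so the bag-of-words features offer little evidence either way.
paraphrased = "We noticed unusual activity; please re-validate your credentials today."
print(model.predict([paraphrased]))
```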
Past research has examined the evolution of phishing intent, sentiment cues, and adversarial rewriting. What has been missing is an integrated dataset that captures all three: human‑written emails, LLM‑generated text, and LLM‑rephrased variants — each annotated for emotional appeal, attacker motivation, and classification label. This is where the new dataset steps in.
Analysis — What the paper actually did
The authors assembled a dataset of roughly 3,000 manually collected emails, expanded it using public archives, and then multiplied it into ~12,000 messages through controlled LLM‑powered paraphrasing. Every email includes (a record sketch follows the list):
- Sender metadata (anonymized)
- Subject & body
- Full URLs & attachment names
- Emotion labels (urgency, fear, greed, curiosity, altruism, authority…)
- Motivation labels (link‑clicking, credential theft, file execution, financial fraud)
- Source type: human‑written or LLM‑generated
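For orientation, here is a hypothetical sketch of what a single record might look like in code. The field names and values below are assumptions for illustration, not the dataset's published schema.

```python
# Hypothetical record layout for one dataset entry; field names are
# illustrative, not taken verbatim from the paper's release format.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EmailRecord:
    sender: str                                            # anonymized sender metadata
    subject: str
    body: str
    urls: List[str] = field(default_factory=list)          # full URLs found in the body
    attachments: List[str] = field(default_factory=list)   # attachment names only
    emotions: List[str] = field(default_factory=list)      # e.g. ["urgency", "authority"]
    motivation: str = ""                                    # e.g. "credential theft"
    source: str = "human"                                   # "human" or "llm-generated"
    label: str = "valid"                                    # "phishing", "spam", or "valid"

example = EmailRecord(
    sender="user_0412@example.invalid",
    subject="Action required: payroll update",
    body="Please confirm your banking details before Friday...",
    urls=["https://payroll-update.example.invalid/login"],
    emotions=["urgency", "authority"],
    motivation="credential theft",
    source="llm-generated",
    label="phishing",
)
```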
To annotate emotions and motivations, the authors benchmarked multiple LLMs — GPT‑4o‑mini, GPT‑4.1‑mini, DeepSeek‑Chat, and Claude 3.5 Sonnet — against a 100‑email expert‑labeled ground‑truth set.
Their findings were refreshing in their imperfection:
- Emotion detection: The best model (Claude 3.5 Sonnet) achieved a 0.60 Jaccard similarity with expert labels — useful, not infallible (a worked example of the metric follows below).
- Motivation labeling: Ambiguous by nature; close‑enough accuracy hovered around 60%.
- Classification: Claude 3.5 Sonnet hit ~67% strict accuracy, with near‑perfect performance on phishing but chronic confusion between spam and normal emails.
In short: LLMs do not crumble under paraphrasing pressure, but they still struggle with the fuzzy middle of “annoying but not malicious” email.
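For readers unfamiliar with the metric, the Jaccard figure above is simply set overlap between the model's emotion labels and the experts' — intersection over union. The label sets in this sketch are invented for illustration.

```python
# Multi-label Jaccard similarity: intersection over union of two label sets.
from typing import Set

def jaccard(pred: Set[str], truth: Set[str]) -> float:
    """Returns 1.0 for identical sets, 0.0 for disjoint ones."""
    if not pred and not truth:
        return 1.0
    return len(pred & truth) / len(pred | truth)

model_labels = {"urgency", "fear", "authority"}     # illustrative model output
expert_labels = {"urgency", "authority", "greed"}   # illustrative ground truth

print(jaccard(model_labels, expert_labels))  # 2 shared / 4 total = 0.5
```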
Findings — Results with visualization
Here are the headline performance metrics from the study, simplified:
LLM Email Classification Accuracy
| Email Type | Accuracy (Strict) | F1 – Phishing | F1 – Spam | F1 – Valid |
|---|---|---|---|---|
| Original Emails | 66.89% | 0.937 | 0.208 | 0.639 |
| DeepSeek Rephrased | 66.34% | 0.936 | 0.197 | 0.631 |
| GPT‑4o Rephrased | 66.93% | 0.937 | 0.225 | 0.638 |
| Multi‑LLM Rephrased | 66.95% | 0.931 | 0.207 | 0.632 |
Relaxed Classification (Unwanted vs. Valid)
| Email Type | Accuracy | F1 – Unwanted | F1 – Valid |
|---|---|---|---|
| All Groups | ~69–70% | ~0.73–0.74 | ~0.63 |
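The relaxed scoring above simply collapses phishing and spam into a single "unwanted" class before computing accuracy, so phishing-versus-spam confusion is forgiven while unwanted-versus-valid mistakes still count. A minimal sketch, with label strings that are assumptions rather than the paper's exact schema:

```python
# Relaxed evaluation: merge phishing and spam into "unwanted" before scoring.
RELAXED = {"phishing": "unwanted", "spam": "unwanted", "valid": "valid"}

strict_predictions = ["phishing", "phishing", "valid", "spam"]   # illustrative
strict_truth       = ["phishing", "spam",     "valid", "valid"]  # illustrative

strict_acc = sum(p == t for p, t in zip(strict_predictions, strict_truth)) / len(strict_truth)
relaxed_acc = sum(RELAXED[p] == RELAXED[t]
                  for p, t in zip(strict_predictions, strict_truth)) / len(strict_truth)

# 0.5 vs 0.75: the phishing/spam mix-up is forgiven, the spam/valid miss is not.
print(strict_acc, relaxed_acc)
```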
Emotion Labeling Consistency
| Model | Close‑Enough Match | Jaccard Similarity |
|---|---|---|
| Claude 3.5 Sonnet | 42% | 0.60 |
| GPT‑4.1‑mini | 30% | 0.57 |
| GPT‑4o‑mini | 10% | 0.46 |
| DeepSeek‑Chat | 18% | 0.45 |
Overall: LLMs detect phishing with high recall, but spam is the blind spot.
Implications — Why this matters for business and security teams
Three strategic insights emerge:
1. LLM‑generated phishing renders keyword filters obsolete.
Legacy systems cannot meaningfully detect paraphrased or stylistically varied messages. If your organization still relies heavily on static rules, every attacker with access to a free LLM already has an advantage.
2. Emotional cues are the next frontier for email defense — but models are not fully mature.
Understanding urgency, fear, or authority manipulation is critical for catching targeted spear‑phishing. Today’s LLMs can approximate these cues but still misread subtle psychological triggers. Enterprises should treat emotion‑aware filtering as promising, not solved.
3. Spam classification will remain difficult until systems become user‑aware.
The line between “unsolicited marketing” and “legitimate but unwanted outreach” depends on organizational context. Without personalized baselines, even the best general‑purpose LLM guesses wrong.
For organizations deploying AI‑driven email filtering, the message is clear: be bullish on LLM‑powered phishing defense, but calibrate expectations for broader email hygiene.
Conclusion — The bigger picture
This dataset is more than a benchmarking exercise. It is a map of where LLM‑based email detection excels, where it stumbles, and where the next wave of innovation will come from.
The takeaway is pragmatic: LLMs are now robust enough to anchor enterprise‑grade phishing detection — even against paraphrasing attacks — but they are not yet subtle enough to replace human‑aligned judgment in ambiguous communication.
As attackers automate, defenders will need datasets like this one to test, stress, and retrain models continuously. The future of email security won’t depend on a single clever algorithm, but on an ecosystem of adaptive, emotion‑aware, context‑tuned LLMs.
Cognaptus: Automate the Present, Incubate the Future.