Opening — Why this matters now

Email security is entering its awkward adolescence. Attackers now wield LLMs capable of generating eerily convincing phishing text, while defenders cling to filters built for a more primitive era of Nigerian princes and typo‑riddled scams. The result is predictable: evasion rates are climbing, and organizations are discovering that legacy rule‑based systems buckle quickly when the attacker speaks fluent machine‑generated politeness.

The paper under review — an extensive effort to build and benchmark a richly labeled phishing–spam dataset — arrives at precisely the moment when businesses need sharper models and sharper thinking. It exposes a gap many executives suspect but rarely quantify: LLMs are good at flagging phishing, mediocre at identifying spam, and still imperfect at decoding the emotional manipulation underlying modern social‑engineering campaigns.

Background — Context and prior art

Traditional defenses treat email filtering as a text‑classification problem: isolate keywords, track sender reputation, look for suspicious attachments, deploy a Bayesian filter, and hope for the best. These models stumble when language becomes fluid — especially when adversaries use LLMs to rewrite the same malicious message into dozens of paraphrased, human‑like variants.
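
To see why these systems fail, consider a caricature of the legacy approach. The sketch below scores an email by counting known-suspicious tokens; the keyword list and messages are invented for illustration, and no real product is this simple. A single LLM paraphrase drives the score to zero while the intent stays identical:

```python
import re
from collections import Counter

# Illustrative legacy-style filter: a static keyword list over token counts.
# The keyword set is invented for this sketch, not drawn from any real system.
SUSPICIOUS = {"urgent", "verify", "account", "password", "winner", "prince"}

def keyword_score(email_body: str) -> float:
    """Fraction of tokens in the message that match the suspicious list."""
    tokens = re.findall(r"[a-z']+", email_body.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[w] for w in SUSPICIOUS) / len(tokens)

original = "URGENT: verify your account password now or lose access!"
paraphrase = "Hi Dana, finance flagged a small issue with your login; could you reconfirm today?"

print(keyword_score(original))    # ~0.44: the rule fires loudly
print(keyword_score(paraphrase))  # 0.0: same intent, the rules see nothing
```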

Past research has examined the evolution of phishing intent, sentiment cues, and adversarial rewriting. What has been missing is an integrated dataset that captures all three: human‑written emails, LLM‑generated text, and LLM‑rephrased variants — each annotated for emotional appeal, attacker motivation, and classification label. This is where the new dataset steps in.

Analysis — What the paper actually did

The authors assembled a dataset of roughly 3,000 manually collected emails, expanded it using public archives, and then multiplied it into ~12,000 messages through controlled LLM‑powered paraphrasing. Every email includes the following fields (a record sketch follows the list):

  • Sender metadata (anonymized)
  • Subject & body
  • Full URLs & attachment names
  • Emotion labels (urgency, fear, greed, curiosity, altruism, authority…)
  • Motivation labels (link‑clicking, credential theft, file execution, financial fraud)
  • Source type: human‑written or LLM‑generated
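
To make that schema concrete, here is one plausible record layout in Python. The field names, types, and example values are illustrative assumptions rather than the paper's actual file format:

```python
from dataclasses import dataclass, field

@dataclass
class EmailRecord:
    # Hypothetical record layout; field names are assumptions, not the paper's schema.
    sender: str                                        # anonymized sender metadata
    subject: str
    body: str
    urls: list[str] = field(default_factory=list)      # full URLs found in the body
    attachments: list[str] = field(default_factory=list)
    emotions: list[str] = field(default_factory=list)  # e.g. ["urgency", "authority"]
    motivation: str | None = None                      # e.g. "credential theft"
    source: str = "human"                              # "human" or "llm-generated"
    label: str = "valid"                               # "phishing", "spam", or "valid"

example = EmailRecord(
    sender="user-0042@example.org",
    subject="Action required: payroll update",
    body="Please confirm your credentials via the link below...",
    urls=["https://example.com/login"],
    emotions=["urgency", "authority"],
    motivation="credential theft",
    source="llm-generated",
    label="phishing",
)
```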

To annotate emotions and motivations, the authors benchmarked multiple LLMs — GPT‑4o‑mini, GPT‑4.1‑mini, DeepSeek‑Chat, and Claude 3.5 Sonnet — against a 100‑email expert‑labeled ground‑truth set.

Their findings were refreshing in their imperfection:

  • Emotion detection: The best model reached a 0.60 Jaccard similarity with expert labels — useful, not infallible.
  • Motivation labeling: Ambiguous by nature; close‑enough accuracy hovered around 60%.
  • Classification: Claude 3.5 Sonnet hit ~67% strict accuracy, with near‑perfect performance on phishing but chronic confusion between spam and valid emails.

In short: LLMs do not crumble under paraphrasing pressure, but they still struggle with the fuzzy middle of “annoying but not malicious” email.
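
For readers who want to sanity-check the agreement numbers, the two metrics quoted above can be computed along these lines. The "close enough" criterion below is an assumed reading; the paper's exact threshold may differ:

```python
def jaccard(pred: set[str], gold: set[str]) -> float:
    """Intersection over union of two label sets; 1.0 means identical."""
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

def close_enough(pred: set[str], gold: set[str], min_overlap: int = 1) -> bool:
    # Assumed criterion: at least one shared label counts as "close enough".
    return len(pred & gold) >= min_overlap

pred = {"urgency", "fear"}
gold = {"urgency", "authority"}
print(jaccard(pred, gold))       # 0.333...: one shared label out of three total
print(close_enough(pred, gold))  # True
```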

Findings — Results with visualization

Here are the headline performance metrics from the study, simplified:

LLM Email Classification Accuracy

| Email Type | Accuracy (Strict) | F1 – Phishing | F1 – Spam | F1 – Valid |
|---|---|---|---|---|
| Original Emails | 66.89% | 0.937 | 0.208 | 0.639 |
| DeepSeek Rephrased | 66.34% | 0.936 | 0.197 | 0.631 |
| GPT‑4o Rephrased | 66.93% | 0.937 | 0.225 | 0.638 |
| Multi‑LLM Rephrased | 66.95% | 0.931 | 0.207 | 0.632 |

Relaxed Classification (Unwanted vs. Valid)

| Email Type | Accuracy | F1 – Unwanted | F1 – Valid |
|---|---|---|---|
| All Groups | ~69–70% | ~0.73–0.74 | ~0.63 |
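
The relaxed numbers follow from collapsing the strict three-way labels into a binary unwanted-vs-valid task, which is why they sit a few points above strict accuracy: a spam email misread as phishing is wrong strictly but right relaxed. A minimal sketch, assuming the label names used in the tables above:

```python
# Map strict labels onto the binary relaxed task.
RELAXED = {"phishing": "unwanted", "spam": "unwanted", "valid": "valid"}

def relaxed_accuracy(preds: list[str], golds: list[str]) -> float:
    pairs = [(RELAXED[p], RELAXED[g]) for p, g in zip(preds, golds)]
    return sum(p == g for p, g in pairs) / len(pairs)

# Spam predicted as phishing: a strict miss, but a relaxed hit.
print(relaxed_accuracy(["phishing"], ["spam"]))  # 1.0
```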

Emotion Labeling Consistency

| Model | Close‑Enough Match | Jaccard Similarity |
|---|---|---|
| Claude 3.5 Sonnet | 42% | 0.60 |
| GPT‑4.1‑mini | 30% | 0.57 |
| GPT‑4o‑mini | 10% | 0.46 |
| DeepSeek‑Chat | 18% | 0.45 |

Overall: LLMs detect phishing with high recall, but spam is the blind spot.

Implications — Why this matters for business and security teams

Three strategic insights emerge:

1. LLM‑generated phishing renders keyword filters obsolete.

Legacy systems cannot meaningfully detect paraphrased or stylistically varied messages; the keyword-filter sketch in the Background section shows how a single rewrite slips past static rules. If your organization still leans on those rules, every attacker with access to a free LLM already holds an advantage.

2. Emotional cues are the next frontier for email defense — but models are not fully mature.

Understanding urgency, fear, or authority manipulation is critical for catching targeted spear‑phishing. Today’s LLMs can approximate these cues but still misread subtle psychological triggers. Enterprises should treat emotion‑aware filtering as promising, not solved.
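
Emotion tagging is cheap to prototype against a general-purpose model, which makes it easy to pilot before trusting it. The sketch below assumes the openai Python SDK and an OPENAI_API_KEY in the environment; the prompt, label set, and model choice are illustrative, not the paper's annotation protocol:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Label set mirrors the emotion categories listed earlier in this article.
EMOTIONS = ["urgency", "fear", "greed", "curiosity", "altruism", "authority"]

def tag_emotions(email_body: str) -> str:
    # Illustrative prompt; the paper's actual annotation prompt is not reproduced here.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "You label emotional-manipulation cues in emails. "
                        f"Reply with a comma-separated subset of: {', '.join(EMOTIONS)}."},
            {"role": "user", "content": email_body},
        ],
    )
    return response.choices[0].message.content

print(tag_emotions("Final notice: your account closes in 2 hours unless you act."))
# Plausible output: "urgency, fear". Treat it as one signal among several, not a verdict.
```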

3. Spam classification will remain difficult until systems become user‑aware.

The line between “unsolicited marketing” and “legitimate but unwanted outreach” depends on organizational context. Without personalized baselines, even the best general‑purpose LLM guesses wrong.
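
One pragmatic stopgap is to layer a per-user engagement baseline over whatever generic spam score a model produces. The heuristic below is a hypothetical sketch of that idea, not something the paper proposes or evaluates:

```python
from collections import defaultdict

class PersonalBaseline:
    """Downweights a generic spam score for senders this user actually engages with."""

    def __init__(self):
        self.opens = defaultdict(int)    # sender -> messages the user opened
        self.ignores = defaultdict(int)  # sender -> messages deleted or left unread

    def adjust(self, sender: str, model_spam_score: float) -> float:
        seen = self.opens[sender] + self.ignores[sender]
        if seen == 0:
            return model_spam_score  # no history: trust the generic model
        engagement = self.opens[sender] / seen
        return model_spam_score * (1.0 - engagement)

baseline = PersonalBaseline()
baseline.opens["deals@vendor.example"] = 9
baseline.ignores["deals@vendor.example"] = 1
# Spammy in general (0.8), but this user opens 90% of it: score drops to ~0.08.
print(baseline.adjust("deals@vendor.example", 0.8))
```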

For organizations deploying AI‑driven email filtering, the message is clear: be bullish on LLM‑powered phishing defense, but calibrate expectations for broader email hygiene.

Conclusion — The bigger picture

This dataset is more than a benchmarking exercise. It is a map of where LLM‑based email detection excels, where it stumbles, and where the next wave of innovation will come from.

The takeaway is pragmatic: LLMs are now robust enough to anchor enterprise‑grade phishing detection — even against paraphrasing attacks — but they are not yet subtle enough to replace human‑aligned judgment in ambiguous communication.

As attackers automate, defenders will need datasets like this one to test, stress, and retrain models continuously. The future of email security won’t depend on a single clever algorithm, but on an ecosystem of adaptive, emotion‑aware, context‑tuned LLMs.

Cognaptus: Automate the Present, Incubate the Future.