Opening — Why this matters now
Phishing is no longer about bad grammar and suspicious links. It is about plausibility, tone, and timing. As attackers refine their craft, the detection problem quietly shifts from raw accuracy to judgment under uncertainty. That is precisely where today’s AI systems, despite their statistical confidence, begin to diverge from human reasoning.
This paper asks an unusually relevant question: when humans and machines are given the same emails, who reasons better — and who knows when they might be wrong?
Background — From filters to cognition
Traditional phishing detection systems were built like sieves: keyword rules, blacklists, and brittle heuristics. Modern systems replaced them with dense embeddings and deep models, trading interpretability for raw performance. But phishing is a cognitive attack. It exploits urgency, authority, and emotional triggers — mechanisms humans understand instinctively but machines only approximate statistically.
Rather than chasing ever-higher accuracy scores, this study deliberately steps back. It compares interpretable models (Logistic Regression, Decision Trees, Random Forests) with human annotators, focusing not just on predictions, but on confidence calibration, linguistic cues, and demographic effects.
Methodology — A controlled human–machine face-off
The experimental design is refreshingly clean.
Both humans and models evaluated the same set of emails, drawn from the Enron dataset and supplemented with carefully constructed phishing and non-phishing messages. Humans labeled each email, rated their confidence (0–100%), and identified the linguistic cues that influenced their judgment. Models were trained on identical data using two representations:
- TF-IDF (explicit lexical signals)
- Sentence-BERT embeddings (semantic abstraction)
Crucially, the models were chosen not for state-of-the-art performance, but for transparency. This allows a like-for-like comparison of reasoning styles rather than a black-box score contest.
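To make the setup concrete, here is a minimal sketch of the two-representation pipeline. The toy corpus, the scikit-learn defaults, and the SBERT checkpoint (`all-MiniLM-L6-v2`) are assumptions for illustration, not details from the paper:

```python
# Minimal sketch of the two-representation setup. Assumptions: toy data,
# scikit-learn defaults, and the common "all-MiniLM-L6-v2" SBERT checkpoint.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sentence_transformers import SentenceTransformer

emails = [
    "Urgent: verify your account or it will be suspended",
    "Click this link to confirm your password immediately",
    "Your invoice for March is attached, let me know if questions",
    "Team lunch moved to Thursday, same place",
    "Security alert: unusual sign-in, verify your identity now",
    "Minutes from yesterday's planning meeting attached",
    "Final notice: update payment details via the link below",
    "Can you review the draft slides before Friday?",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = phishing, 0 = not phishing

X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, random_state=0, stratify=labels
)

# Representation 1: explicit lexical signals (TF-IDF)
tfidf = TfidfVectorizer(lowercase=True, stop_words="english")
clf_tfidf = LogisticRegression(max_iter=1000).fit(
    tfidf.fit_transform(X_train), y_train
)

# Representation 2: semantic abstraction (Sentence-BERT embeddings)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
clf_emb = LogisticRegression(max_iter=1000).fit(encoder.encode(X_train), y_train)

for name, clf, feats in [
    ("TF-IDF", clf_tfidf, tfidf.transform(X_test)),
    ("Embeddings", clf_emb, encoder.encode(X_test)),
]:
    # Per-class F1, matching the paper's phishing / not-phishing breakdown
    print(name, f1_score(y_test, clf.predict(feats), average=None, zero_division=0))
```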
Findings — Accuracy is cheap; confidence is not
Model performance (per-class F1 score)
| Model | Features | Phishing | Not Phishing |
|---|---|---|---|
| Logistic Regression | TF-IDF | 0.72 | 0.53 |
| Logistic Regression | Embeddings | 0.67 | 0.50 |
| Decision Tree | TF-IDF | 0.73 | 0.67 |
| Decision Tree | Embeddings | 0.59 | 0.70 |
| Random Forest | TF-IDF | 0.73 | 0.67 |
| Random Forest | Embeddings | 0.70 | 0.70 |
TF-IDF — often dismissed as “old-fashioned” — consistently outperformed embeddings for phishing detection. Why? Because phishing is still lexically explicit: verify, account, urgent, link. Semantics help, but frequency still matters.
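Interpretability makes that claim checkable. Continuing the sketch above, the lexical cues can be read straight off the logistic regression coefficients (the top-10 cutoff here is an arbitrary choice, not the paper's):

```python
# Which tokens push the TF-IDF model toward the phishing class?
# Positive coefficients favor label 1 (phishing) in the sketch above.
import numpy as np

feature_names = tfidf.get_feature_names_out()
coefs = clf_tfidf.coef_[0]
top = np.argsort(coefs)[-10:][::-1]  # ten strongest phishing cues
for i in top:
    print(f"{feature_names[i]:>15s}  {coefs[i]:+.3f}")
```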
Humans vs machines
On raw accuracy, tree-based models roughly matched human performance. On confidence, they did not.
- Humans clustered tightly around 60–80% confidence
- Models exhibited sharp swings — overconfident on some emails, uncertain on others
This matters more than it sounds. In security systems, miscalibrated confidence is how alerts get ignored or blindly trusted.
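Calibration can be quantified rather than eyeballed. A standard metric is expected calibration error (ECE): the average gap between how confident a classifier is and how often it is actually right. The sketch below assumes probability outputs and true labels as inputs; the 10-bin scheme is a conventional default, not a detail from the paper.

```python
# Expected calibration error (ECE): gap between stated confidence and
# actual accuracy, averaged over confidence bins.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    probs, labels = np.asarray(probs, float), np.asarray(labels, int)
    predicted = (probs >= 0.5).astype(int)
    confidence = np.where(predicted == 1, probs, 1 - probs)  # confidence in predicted class
    correct = (predicted == labels).astype(float)
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    bin_idx = np.minimum(np.digitize(confidence, edges) - 1, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidence[mask].mean())
    return ece

# Example with the TF-IDF model from the sketch above:
# probs = clf_tfidf.predict_proba(tfidf.transform(X_test))[:, 1]
# print(expected_calibration_error(probs, y_test))
```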
Linguistic reasoning gap
Humans relied on a broader and richer vocabulary when explaining decisions, especially for phishing emails. Models gravitated toward a small set of recurring tokens. Efficient — but narrow.
This difference exposes a core limitation: models optimize for separability; humans reason by contextual plausibility.
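The breadth gap is measurable too. A rough proxy is to count distinct cue terms on each side; the human annotations below are hypothetical examples, and the coefficient threshold defining the model's "active vocabulary" is an assumption:

```python
# Rough proxy for reasoning breadth: distinct cue terms per side.
# Human cues are hypothetical annotations; the 0.5 coefficient threshold
# for the model's active vocabulary is an assumption. Reuses feature_names
# and coefs from the coefficient-inspection sketch above.
human_cues = [
    "urgent tone and threat of account suspension",
    "sender domain does not match the claimed company",
    "generic greeting instead of my actual name",
    "pressure to act before a deadline",
]
human_vocab = {word for cue in human_cues for word in cue.lower().split()}
model_vocab = {feature_names[i] for i in np.where(np.abs(coefs) > 0.5)[0]}
print(f"human cue terms: {len(human_vocab)}, model cue terms: {len(model_vocab)}")
```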
Demographics — Who spots the phish?
Age mattered. Language, surprisingly, did not.
| Age Group | Accuracy |
|---|---|
| 18–25 | 71% |
| 26–35 | 68% |
| 36–45 | 78% |
| 45+ | 74% |
Mid-career and older participants performed best, likely reflecting accumulated exposure rather than raw technical skill. Native English speakers did not outperform non-native speakers on average — but showed higher variance, producing both the best and worst individual results.
The takeaway: phishing resistance is learned, not linguistic.
Implications — Designing AI that knows when it’s unsure
This study quietly undermines a popular assumption: that better embeddings automatically mean better detection. In adversarial, human-targeted domains, interpretability and confidence calibration may matter more than marginal accuracy gains.
For practitioners, the implications are concrete:
- Use interpretable models where alerts must be trusted
- Expose confidence, not just predictions
- Design systems that complement human judgment rather than replace it
Hybrid human-in-the-loop systems are not a compromise. They are an architectural necessity.
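Here is what such a system can look like in code: route by calibrated confidence, auto-deciding only at the extremes and escalating the ambiguous middle to a human analyst. The 0.9 threshold is an assumed starting point to tune against the deployment's alert budget; the model and vectorizer come from the earlier sketch.

```python
# Confidence-gated triage: auto-decide only when the model is near-certain,
# otherwise escalate to a human. Threshold (0.9) is an assumed starting point.
def route_email(email, model, vectorizer, threshold=0.9):
    prob = model.predict_proba(vectorizer.transform([email]))[0, 1]
    if prob >= threshold:
        return f"auto-flag as phishing (p={prob:.2f})"
    if prob <= 1 - threshold:
        return f"auto-pass as legitimate (p={prob:.2f})"
    return f"escalate to human review (p={prob:.2f})"  # expose confidence, not just a label

print(route_email("Please verify your account immediately", clf_tfidf, tfidf))
```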
Conclusion — Intelligence is knowing when not to click
Phishing detection is not a benchmark problem. It is a judgment problem.
Machines are fast, scalable, and consistent. Humans are cautious, contextual, and better calibrated. The future does not belong to one or the other — it belongs to systems that understand the difference.
Cognaptus: Automate the Present, Incubate the Future.