Opening — Why this matters now
Phishing is no longer about bad grammar and suspicious links. It is about plausibility, tone, and timing. As attackers refine their craft, the detection problem quietly shifts from raw accuracy to judgment under uncertainty. That is precisely where today’s AI systems, despite their statistical confidence, begin to diverge from human reasoning.
This paper asks an unusually relevant question: when humans and machines are given the same emails, who reasons better — and who knows when they might be wrong?
Background — From filters to cognition
Traditional phishing detection systems were built like sieves: keyword rules, blacklists, and brittle heuristics. Modern systems replaced them with dense embeddings and deep models, trading interpretability for raw performance. But phishing is a cognitive attack. It exploits urgency, authority, and emotional triggers — mechanisms humans understand instinctively but machines only approximate statistically.
Rather than chasing ever-higher accuracy scores, this study deliberately steps back. It compares interpretable models (Logistic Regression, Decision Trees, Random Forests) with human annotators, focusing not just on predictions, but on confidence calibration, linguistic cues, and demographic effects.
Methodology — A controlled human–machine face-off
The experimental design is refreshingly clean.
Both humans and models evaluated the same set of emails, drawn from the Enron dataset and supplemented with carefully constructed phishing and non-phishing messages. Humans labeled each email, rated their confidence (0–100%), and identified the linguistic cues that influenced their judgment. Models were trained on identical data using two representations:
- TF-IDF (explicit lexical signals)
- Sentence-BERT embeddings (semantic abstraction)
Crucially, the models were chosen not for state-of-the-art performance, but for transparency. This allows a like-for-like comparison of reasoning styles rather than a black-box score contest.
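To make the setup concrete, here is a minimal sketch of the two-representation pipeline. The toy corpus, the scikit-learn defaults, and the SBERT checkpoint (`all-MiniLM-L6-v2`) are assumptions for illustration, not details from the paper:

```python
# Minimal sketch of the two-representation setup. Assumptions: toy data,
# scikit-learn defaults, and the common "all-MiniLM-L6-v2" SBERT checkpoint.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sentence_transformers import SentenceTransformer

emails = [
    "Urgent: verify your account or it will be suspended",
    "Click this link to confirm your password immediately",
    "Your invoice for March is attached, let me know if questions",
    "Team lunch moved to Thursday, same place",
    "Security alert: unusual sign-in, verify your identity now",
    "Minutes from yesterday's planning meeting attached",
    "Final notice: update payment details via the link below",
    "Can you review the draft slides before Friday?",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0]  # 1 = phishing, 0 = not phishing

X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, random_state=0, stratify=labels
)

# Representation 1: explicit lexical signals (TF-IDF)
tfidf = TfidfVectorizer(lowercase=True, stop_words="english")
clf_tfidf = LogisticRegression(max_iter=1000).fit(
    tfidf.fit_transform(X_train), y_train
)

# Representation 2: semantic abstraction (Sentence-BERT embeddings)
encoder = SentenceTransformer("all-MiniLM-L6-v2")
clf_emb = LogisticRegression(max_iter=1000).fit(encoder.encode(X_train), y_train)

for name, clf, feats in [
    ("TF-IDF", clf_tfidf, tfidf.transform(X_test)),
    ("Embeddings", clf_emb, encoder.encode(X_test)),
]:
    # Per-class F1, matching the paper's phishing / not-phishing breakdown
    print(name, f1_score(y_test, clf.predict(feats), average=None, zero_division=0))
```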
Findings — Accuracy is cheap; confidence is not
Model performance (per-class F1 score)
| Model | Features | Phishing | Not Phishing |
|---|---|---|---|
| Logistic Regression | TF-IDF | 0.72 | 0.53 |
| Logistic Regression | Embeddings | 0.67 | 0.50 |
| Decision Tree | TF-IDF | 0.73 | 0.67 |
| Decision Tree | Embeddings | 0.59 | 0.70 |
| Random Forest | TF-IDF | 0.73 | 0.67 |
| Random Forest | Embeddings | 0.70 | 0.70 |
TF-IDF — often dismissed as “old-fashioned” — consistently outperformed embeddings for phishing detection. Why? Because phishing is still lexically explicit: verify, account, urgent, link. Semantics help, but frequency still matters.
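Interpretability makes that claim checkable. Continuing the sketch above, the lexical cues can be read straight off the logistic regression coefficients (the top-10 cutoff here is an arbitrary choice, not the paper's):

```python
# Which tokens push the TF-IDF model toward the phishing class?
# Positive coefficients favor label 1 (phishing) in the sketch above.
import numpy as np

feature_names = tfidf.get_feature_names_out()
coefs = clf_tfidf.coef_[0]
top = np.argsort(coefs)[-10:][::-1]  # ten strongest phishing cues
for i in top:
    print(f"{feature_names[i]:>15s}  {coefs[i]:+.3f}")
```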
Humans vs machines
On raw accuracy, tree-based models roughly matched human performance. On confidence, they did not.
- Humans clustered tightly around 60–80% confidence
- Models exhibited sharp swings — overconfident on some emails, uncertain on others
This matters more than it sounds. In security systems, miscalibrated confidence is how alerts get ignored or blindly trusted.
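Calibration can be quantified rather than eyeballed. A standard metric is expected calibration error (ECE): the average gap between how confident a classifier is and how often it is actually right. The sketch below assumes probability outputs and true labels as inputs; the 10-bin scheme is a conventional default, not a detail from the paper.

```python
# Expected calibration error (ECE): gap between stated confidence and
# actual accuracy, averaged over confidence bins.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    probs, labels = np.asarray(probs, float), np.asarray(labels, int)
    predicted = (probs >= 0.5).astype(int)
    confidence = np.where(predicted == 1, probs, 1 - probs)  # confidence in predicted class
    correct = (predicted == labels).astype(float)
    edges = np.linspace(0.5, 1.0, n_bins + 1)
    bin_idx = np.minimum(np.digitize(confidence, edges) - 1, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidence[mask].mean())
    return ece

# Example with the TF-IDF model from the sketch above:
# probs = clf_tfidf.predict_proba(tfidf.transform(X_test))[:, 1]
# print(expected_calibration_error(probs, y_test))
```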
Linguistic reasoning gap
Humans relied on a broader and richer vocabulary when explaining decisions, especially for phishing emails. Models gravitated toward a small set of recurring tokens. Efficient — but narrow.
This difference exposes a core limitation: models optimize for separability; humans reason by contextual plausibility.
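The breadth gap is measurable too. A rough proxy is to count distinct cue terms on each side; the human annotations below are hypothetical examples, and the coefficient threshold defining the model's "active vocabulary" is an assumption:

```python
# Rough proxy for reasoning breadth: distinct cue terms per side.
# Human cues are hypothetical annotations; the 0.5 coefficient threshold
# for the model's active vocabulary is an assumption. Reuses feature_names
# and coefs from the coefficient-inspection sketch above.
human_cues = [
    "urgent tone and threat of account suspension",
    "sender domain does not match the claimed company",
    "generic greeting instead of my actual name",
    "pressure to act before a deadline",
]
human_vocab = {word for cue in human_cues for word in cue.lower().split()}
model_vocab = {feature_names[i] for i in np.where(np.abs(coefs) > 0.5)[0]}
print(f"human cue terms: {len(human_vocab)}, model cue terms: {len(model_vocab)}")
```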
Demographics — Who spots the phish?
Age mattered. Language, surprisingly, did not.
| Age Group | Accuracy |
|---|---|
| 18–25 | 71% |
| 26–35 | 68% |
| 36–45 | 78% |
| 45+ | 74% |
Mid-career and older participants performed best, likely reflecting accumulated exposure rather than raw technical skill. Native English speakers did not outperform non-native speakers on average — but showed higher variance, producing both the best and worst individual results.
The takeaway: phishing resistance is learned, not linguistic.
Implications — Designing AI that knows when it’s unsure
This study quietly undermines a popular assumption: that better embeddings automatically mean better detection. In adversarial, human-targeted domains, interpretability and confidence calibration may matter more than marginal accuracy gains.
For practitioners, the implications are concrete:
- Use interpretable models where alerts must be trusted
- Expose confidence, not just predictions
- Design systems that complement human judgment rather than replace it
Hybrid human-in-the-loop systems are not a compromise. They are an architectural necessity.
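Here is what such a system can look like in code: route by calibrated confidence, auto-deciding only at the extremes and escalating the ambiguous middle to a human analyst. The 0.9 threshold is an assumed starting point to tune against the deployment's alert budget; the model and vectorizer come from the earlier sketch.

```python
# Confidence-gated triage: auto-decide only when the model is near-certain,
# otherwise escalate to a human. Threshold (0.9) is an assumed starting point.
def route_email(email, model, vectorizer, threshold=0.9):
    prob = model.predict_proba(vectorizer.transform([email]))[0, 1]
    if prob >= threshold:
        return f"auto-flag as phishing (p={prob:.2f})"
    if prob <= 1 - threshold:
        return f"auto-pass as legitimate (p={prob:.2f})"
    return f"escalate to human review (p={prob:.2f})"  # expose confidence, not just a label

print(route_email("Please verify your account immediately", clf_tfidf, tfidf))
```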
Conclusion — Intelligence is knowing when not to click
Phishing detection is not a benchmark problem. It is a judgment problem.
Machines are fast, scalable, and consistent. Humans are cautious, contextual, and better calibrated. The future does not belong to one or the other — it belongs to systems that understand the difference.
Cognaptus: Automate the Present, Incubate the Future.