When the Machines Come Knocking: AI Agents vs Human Hackers in Live Penetration Tests

Security teams already know the scene. A scanner produces a long list of suspicious services, outdated servers, odd access rules, and “maybe this is bad” findings. Then the real work begins: deciding which lead matters, proving impact without breaking production, writing a report someone can act on, and not getting distracted by every shiny port that waves from the network.

The interesting question is no longer whether AI can run tools. Of course it can. Give a model a terminal and enough encouragement, and it will enumerate things with the confidence of an intern who just discovered nmap. The better question is whether an autonomous AI agent can keep its head inside a real enterprise environment: thousands of hosts, partial credentials, messy infrastructure, false leads, live-system constraints, and enough ambiguity to make a benchmark look like a toy train set.

That is why the paper “Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing” is worth reading carefully.¹ The headline result is tempting: ARTEMIS, the authors’ new AI penetration-testing scaffold, ranked second overall against ten cybersecurity professionals in a live university network. It discovered 9 valid vulnerabilities with an 82% valid submission rate, outperforming 9 of the 10 human participants by the paper’s scoring framework.

That is the headline. It is also the easiest way to misunderstand the paper.

The real story is not “AI is now a better hacker than humans.” Convenient, dramatic, and probably excellent for conference hallway panic. But not quite right. The paper’s stronger contribution is more mechanical: it shows that the capability of an AI security agent depends heavily on the operating scaffold around the model. The same broad generation of frontier models can refuse, stall, produce shallow scanner reports, or behave like a credible junior red-team swarm depending on how work is decomposed, delegated, summarized, resumed, and triaged.

In other words: the model matters, but the workflow around the model is where the enterprise value starts to appear.

ARTEMIS is not a smarter model; it is a better work loop

ARTEMIS stands for Automated Red Teaming Engine with Multi-agent Intelligent Supervision. The name has the usual research-project grandeur, but the design is practical. It is built around three main components: a high-level supervisor, arbitrary sub-agents, and a triage module.

That matters because penetration testing is not a single-step reasoning task. It is a long-horizon workflow. A tester scans, chooses promising targets, probes, validates, attempts exploitation within rules, documents evidence, pivots, and repeats. Most ordinary agents struggle here not because they cannot explain SQL injection in a paragraph, but because they lose the thread. They stop early. They repeat themselves. They drown in context. They chase a low-value lead. They submit noise. They refuse the task. They become, in short, a very expensive autocomplete attached to a terminal.

ARTEMIS changes the loop in four important ways.

First, it creates a large recursive task list before the supervisor begins. This reduces the cognitive burden on the supervisor and keeps the agent oriented over long horizons. The task list is not glamorous, but neither is inventory management; somehow civilization still depends on it.

Second, the supervisor can spawn sub-agents. When a scan reveals multiple promising targets, ARTEMIS does not have to choose one and forget the others. It can launch parallel investigations. In the study, ARTEMIS reached a peak of 8 active sub-agents and averaged 2.82 concurrent sub-agents per supervisor iteration. This is not just “automation.” It is horizontal exploration.

Third, ARTEMIS uses dynamic prompt creation for sub-agents. Instead of giving every worker-agent the same generic instruction, it creates task-specific system prompts with relevant tools and behavioral guidance. This is where scaffolding becomes capability elicitation: the model’s latent knowledge is less likely to leak away through vague task framing.

Fourth, ARTEMIS includes a triage module. The triager checks whether a candidate vulnerability is relevant, in scope, non-duplicate, reproducible, and properly classified before submission. This matters because the enemy of useful automated security testing is not failure. It is confident garbage.

The paper’s comparison with existing scaffolds makes this mechanism visible. Claude Code and MAPTA refused the task out of the box. Incalmo stalled during early reconnaissance. Codex and CyAgent produced more surface-level scanner-type findings. ARTEMIS, using similar underlying model families, behaved differently. The difference was not that a model suddenly learned cybersecurity from divine revelation. The difference was that the scaffold made long-horizon work possible.
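The five triage checks named above (relevant, in scope, non-duplicate, reproducible, properly classified) can be sketched as a plain filter. This is a minimal illustration under stated assumptions, not ARTEMIS's actual triage module: the `Finding` fields, the `triage` function, the severity labels, and the example addresses are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}  # illustrative labels


@dataclass(frozen=True)
class Finding:
    host: str                # IP address of the affected host
    title: str               # short description of the issue
    severity: str            # proposed classification
    reproduced: bool         # True if re-running the proof of concept succeeded
    security_relevant: bool  # True if the issue has real security impact


def triage(finding, scope, already_submitted):
    """Return (accepted, reason), applying the checks in the order listed above."""
    if not finding.security_relevant:
        return False, "not relevant"
    addr = ip_address(finding.host)
    if not any(addr in ip_network(net) for net in scope):
        return False, "out of scope"
    key = (finding.host, finding.title.lower())
    if key in already_submitted:
        return False, "duplicate"
    if not finding.reproduced:
        return False, "not reproducible"
    if finding.severity not in ALLOWED_SEVERITIES:
        return False, "unclassified"
    already_submitted.add(key)  # remember it so later copies are rejected
    return True, "accepted"


if __name__ == "__main__":
    scope = ["10.0.0.0/24"]  # hypothetical in-scope subnet
    seen = set()
    lead = Finding("10.0.0.5", "Anonymous LDAP bind", "medium", True, True)
    print(triage(lead, scope, seen))  # (True, 'accepted')
    print(triage(lead, scope, seen))  # (False, 'duplicate')
```

The point of the sketch is the ordering: cheap rejections (relevance, scope, duplicates) come before the expensive one (reproduction), which is one plausible way to keep confident garbage out of the report without re-running every proof of concept.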

The study tested a real network, not a puzzle box

The target environment was a large research university’s public and private Computer Science networks: about 8,000 hosts across 12 subnets, including 7 public subnets and 5 VPN-accessible private subnets. The environment included Unix-based systems, IoT devices, some Windows machines, embedded systems, Kerberos-based authentication, baseline vulnerability management, firewalls, logging, malware protection, and other controls depending on system risk.

This is important because much of AI cybersecurity evaluation still lives in artificial settings: CTFs, static vulnerability tasks, known CVE reproduction, or neatly packaged benchmark problems. Those settings are useful, but they often remove the mess that makes real security work difficult: partial information, live-system risk, noisy services, ambiguous results, unknown priorities, and the possibility that a technically valid lead is operationally useless.

The human comparison was also nontrivial. The study recruited 10 cybersecurity professionals, many with respected certifications such as OSCP, OSWE, OSEP, CRTO, GSE, GICSP, GWAPT, and related qualifications. Several had found serious vulnerabilities affecting large user bases. Participants were compensated, given standardized infrastructure, asked to commit at least 10 working hours, and operated under safe-harbor and non-destructive testing constraints.

The agents received comparable infrastructure. ARTEMIS ran for 16 hours, but the paper evaluated only the first 10 hours to maintain comparability with the human participants. The other scaffolds were simply run to completion, since none of them could sustain more than 10 hours of continuous work.

So the result is not “an AI solved a benchmark faster.” It is closer to: under controlled conditions, an AI scaffold operated inside a real enterprise network and produced validated security findings competitive with professional testers.

That is a different sentence. Less viral. More useful.

The leaderboard result is evidence, but not the whole argument

The paper’s main results are straightforward. Human participants discovered 49 total validated unique vulnerabilities. Each human found at least one critical vulnerability providing system- or administrator-level access. ARTEMIS placed second overall by the paper’s combined scoring framework, which weighted both technical complexity and business impact. Its strongest configuration discovered 9 valid vulnerabilities with an 82% valid submission rate.

The comparison table is easy to misuse, so it helps to separate what each piece of evidence is doing.

Paper evidence	Likely purpose	What it supports	What it does not prove
Live comparison across 10 professionals and AI agents	Main evidence	ARTEMIS can produce validated findings in a real enterprise network and rank near the top under the study’s scoring method	That ARTEMIS would beat most professionals across all industries, scopes, durations, and defensive conditions
Cybench comparison	Benchmark context / comparison with prior work	ARTEMIS does not show major scaffold uplift on simpler CTF-style tasks	That CTF performance predicts live-enterprise performance
ARTEMIS vs Codex, CyAgent, Claude Code, MAPTA, Incalmo	Scaffold comparison	Architecture and prompting materially affect agent behavior	A clean ablation isolating every individual ARTEMIS component
Agent elicitation trials with hint levels	Exploratory extension / diagnostic test	Some missed vulnerabilities were missed because ARTEMIS failed to identify the right pattern, not because it lacked all technical execution ability	That hint-assisted discovery equals autonomous discovery
Cost analysis	Operational comparison	Certain ARTEMIS variants can be cheaper per hour than professional penetration testers	That fully autonomous testing is already a complete substitute for human security services
Appendix case study comparing Participant 02 and ARTEMIS	Qualitative mechanism evidence	ARTEMIS can resemble strong human workflows in reconnaissance and target prioritization	That ARTEMIS has equivalent judgment, creativity, or post-exploitation depth

The key distinction is between performance and explanation. The leaderboard tells us ARTEMIS did well. The mechanism sections explain why.

Existing scaffolds mostly failed in boring ways. Codex finished too early. CyAgent ended in under two hours. Claude Code and MAPTA refused. Incalmo’s rigid task graph stalled. These are not exotic cybersecurity limitations; they are agent-workflow limitations. The agent either cannot stay with the task, cannot manage context, cannot delegate, cannot recover from uncertainty, or cannot operate within its own safety and instruction boundaries without refusing the job outright.

ARTEMIS was built around exactly those weaknesses. It did not merely ask a model to “do pentesting.” It created a workflow where reconnaissance leads could be split, pursued, remembered, checked, and submitted.

That is the paper’s business-relevant mechanism.
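The "split, pursue, remember, check, submit" loop can be sketched in a few lines. This is a toy model under loud assumptions, not the ARTEMIS codebase: `run_sub_agent` is a stub standing in for an LLM-driven worker, the task dictionaries are invented for illustration, and `MAX_SUB_AGENTS = 8` is a hypothetical cap borrowed from the peak the study observed, not a documented setting.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_SUB_AGENTS = 8  # hypothetical cap; the study reports a peak of 8 active sub-agents


def build_prompt(task):
    """Task-specific instructions instead of one generic prompt (illustrative)."""
    return f"Investigate {task['target']}. Goal: {task['goal']}. Stay in scope."


def run_sub_agent(task):
    """Stub worker: a real one would call a model and tools with the prompt."""
    prompt = build_prompt(task)
    if task["goal"] == "scan":
        # A scan produces a follow-up lead rather than a finding.
        return {"prompt": prompt, "findings": [],
                "follow_ups": [{"target": task["target"], "goal": "probe"}]}
    return {"prompt": prompt,
            "findings": [f"candidate issue on {task['target']}"],
            "follow_ups": []}


def supervise(initial_tasks, max_rounds=5):
    """Split leads into batches, pursue them in parallel, remember what was done."""
    pending = list(initial_tasks)
    notes = []       # memory that survives across supervisor rounds
    candidates = []  # would pass through a triage step before submission
    for _ in range(max_rounds):
        if not pending:
            break
        batch, pending = pending[:MAX_SUB_AGENTS], pending[MAX_SUB_AGENTS:]
        with ThreadPoolExecutor(max_workers=MAX_SUB_AGENTS) as pool:
            results = list(pool.map(run_sub_agent, batch))  # order matches batch
        for task, result in zip(batch, results):
            notes.append((task["target"], task["goal"]))
            candidates.extend(result["findings"])
            pending.extend(result["follow_ups"])
    return candidates, notes


if __name__ == "__main__":
    tasks = [{"target": f"host-{i}", "goal": "scan"} for i in range(3)]
    found, log = supervise(tasks)
    print(found)     # three candidate issues, one per host
    print(len(log))  # 6: three scans followed by three probes
```

Nothing here is clever, which is rather the point: the failure modes listed above (stopping early, forgetting leads, never delegating) are addressed by bookkeeping, not by a smarter model.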

Humans still had the better nose for depth

The human participants followed broadly familiar penetration-testing patterns. They began with reconnaissance: network scanning, service discovery, vulnerability scanning, directory brute-forcing, and custom enumeration. Then they moved into exploitation and lateral movement using discovered credentials, weak configurations, default passwords, known vulnerable services, and application vulnerabilities.

The paper’s behavioral observations are especially useful. Stronger humans balanced automated scanning with manual validation. Weaker performers relied too heavily on tool output. That finding may sound obvious, but it is the hinge between “automation” and “security work.” The hard part is not producing output. The hard part is deciding what output deserves belief.

ARTEMIS looked surprisingly human in some parts of the workflow. It scanned, identified promising services, launched follow-up investigations, and documented findings. In one appendix case study, the authors compare ARTEMIS with a strong human participant. Both began with structured reconnaissance. Both identified leads. ARTEMIS, however, immediately pursued anonymous LDAP access that the human participant noted but did not return to. That is where parallelism matters: a human can forget or deprioritize a lead; an agent swarm can keep more plates spinning.

But humans still showed advantages in depth. Top human participants were more likely to pivot after finding a vulnerability or deepen a foothold. ARTEMIS tended to submit findings quickly and move on. This sometimes hurt it. In the TinyPilot case, ARTEMIS found lower-level misconfigurations but missed the more critical remote-access path that 80% of participants found. It took medium- or high-level hints for ARTEMIS to find that vulnerability in follow-up elicitation.

This is not a small detail. It tells us that the agent was often good at broad exploration and immediate validation, but weaker at knowing when a mediocre finding is a doorway rather than the destination.

That is a very junior-analyst failure. Useful, tireless, occasionally impressive, and still in need of supervision.

The CLI was both weakness and superpower

One of the most interesting parts of the paper is the GUI/CLI contrast.

ARTEMIS struggled with GUI-heavy interaction. When a vulnerability required browser-mediated understanding, it could misread web flows. The paper gives a concrete false-positive pattern: the agent interpreted “200 OK” HTTP responses as successful authentication, when those responses were in fact the login page being served again, via redirect, after failed attempts. A human using a browser would usually see the failure immediately. The machine, staring at response codes like tea leaves, got too excited.

This is where current agents still look brittle. They can parse structured text well, but many enterprise systems do not present their meaning as clean text. They present it through web interfaces, stateful flows, visual confirmation, awkward redirects, and half-modern admin consoles. A lot of cybersecurity work is not just “read output”; it is “understand what this system is really doing.”
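The “200 OK” false positive described above is easy to reproduce in miniature. The sketch below shows one way to treat a status code as weak evidence and ask for positive confirmation instead. It is an illustration only: `LoginResponse`, `classify_login`, the `/login` path, and the `logout` marker are hypothetical, and real applications signal login state in many other ways.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class LoginResponse:
    status: int     # final HTTP status, after any redirects were followed
    final_url: str  # URL the client ended up on
    body: str       # body of the final page


def classify_login(resp, login_path="/login", success_marker="logout"):
    """Return 'authenticated', 'failed', or 'inconclusive' for a login attempt."""
    if resp.status != 200:
        return "failed"
    landed_on = urlparse(resp.final_url).path.rstrip("/")
    body = resp.body.lower()
    # Bounced back to the login form: a 200, but not a success.
    if landed_on == login_path or 'type="password"' in body:
        return "failed"
    # Only claim success when the page shows something a logged-in user would see.
    if success_marker in body:
        return "authenticated"
    return "inconclusive"


if __name__ == "__main__":
    bounced = LoginResponse(200, "https://host.example/login?next=/",
                            '<form><input type="password"></form>')
    print(classify_login(bounced))  # failed
    inside = LoginResponse(200, "https://host.example/dashboard",
                           '<a href="/logout">Log out</a>')
    print(classify_login(inside))   # authenticated
```

The third verdict matters most. An agent that can say “inconclusive” and hand the page to a human is far less expensive than one that files a confident report on the strength of a status line.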

Yet the same CLI dependence helped ARTEMIS elsewhere. The paper notes that humans found a vulnerability in a newer iDRAC server with a modern web interface, but no humans found the same issue in an older iDRAC server because modern browsers refused to load it due to outdated HTTPS cipher support. ARTEMIS could still interact through command-line tooling and exploit the older server.

So the limitation is not simply “AI cannot use GUIs.” The more precise lesson is this: agents are strong where the environment can be reduced to text, protocols, logs, and repeatable commands; they are weaker where operational meaning depends on visual state, interface conventions, or human common sense about what a screen is saying.

For enterprise buyers, that distinction matters more than a broad claim about “AI red teams.” If your environment is API-heavy, infrastructure-heavy, log-heavy, and service-heavy, agentic testing may deliver real coverage. If your risk sits inside brittle admin UIs, SaaS workflows, browser-mediated permissions, and stateful business processes, current agents need tighter human-in-the-loop control.

The hint trials show a recognition problem, not a pure skill problem

The agent elicitation trials are easy to misread, so they deserve careful treatment. These were not the main autonomous comparison. They were diagnostic tests.

The authors took four vulnerabilities that humans found but ARTEMIS missed, then asked ARTEMIS to find them under five hint levels: high, medium, low, informational only, and host only. The targets included email spoofing through an unauthenticated SMTP relay, SQL injection, stored XSS, and unauthenticated remote console access.

ARTEMIS found all four at least once when hints were provided. That does not mean ARTEMIS would have found them autonomously with more time. It means the bottleneck was often recognition and prioritization, not the complete absence of technical procedure.

This distinction is useful for product design. If an AI security agent fails because it cannot execute a task even when pointed at it, the missing capability is technical. If it can execute when guided but fails autonomously, the missing capability is search, prioritization, pattern recognition, or persistence. Those are different engineering problems.

The paper’s evidence suggests ARTEMIS often fell into the second category. It could do more than it independently chose to do. The agent saw too many leads, pursued some, submitted others, and moved on. The failure was not always “cannot exploit.” Sometimes it was “does not realize this is the one worth exploiting.”

That is a familiar enterprise automation pattern. The model is not useless; the routing policy is immature.
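The distinction between a technical gap and a recognition gap can be written down as a small decision rule. The sketch below is an interpretation, not the paper's analysis code: the `diagnose` function is hypothetical, `"none"` is added here to represent the autonomous run, and the weakest-to-strongest ordering of the five hint levels is an assumption based on the order in which they are listed above.

```python
# Weakest to strongest. "none" is the autonomous run; the ordering of the five
# hinted levels is assumed, not taken from the paper.
HINT_LEVELS = ["none", "host only", "informational only", "low", "medium", "high"]


def diagnose(found_at):
    """Classify a missed vulnerability from its elicitation results.

    found_at maps a hint level to True if the agent found the issue at that
    level. Returns (diagnosis, weakest hint level that sufficed or None).
    """
    for level in HINT_LEVELS:
        if found_at.get(level, False):
            if level == "none":
                return "found autonomously", level
            return "recognition or prioritization gap", level
    return "technical execution gap", None


if __name__ == "__main__":
    # Shaped like the TinyPilot case: found only with medium or high hints.
    print(diagnose({"medium": True, "high": True}))
    # ('recognition or prioritization gap', 'medium')
    print(diagnose({}))
    # ('technical execution gap', None)
```

The second element of the result is the useful one for engineering: the weaker the hint that suffices, the closer the agent already is to finding the issue unaided, and the more the fix belongs in routing rather than in tooling.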

The business value is cheaper coverage, not autonomous heroics

The paper includes a cost comparison that will be the first slide in many vendor decks, because vendor decks are legally required to sprint toward the most flattering number. The authors tracked API costs and found that one ARTEMIS configuration cost $291.47 over the full execution window, or $18.21 per hour. Another cost $944.07, or about $59 per hour. The paper compares this with a cited average U.S. penetration tester salary of $125,034 per year.
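The arithmetic behind these figures is worth checking by hand. The sketch below assumes the “full execution window” is the 16-hour run mentioned earlier, and converts the salary to an hourly figure using 40 hours × 52 weeks, which is an assumption introduced here rather than a number stated above. Plain division gives about $18.22 for the first configuration, within a cent of the quoted $18.21.

```python
RUN_HOURS = 16                 # ARTEMIS run length, assumed to be the full execution window
WORK_HOURS_PER_YEAR = 40 * 52  # assumption: full-time year, no overhead or benefits


def hourly(total_cost, hours=RUN_HOURS):
    """Average cost per hour over a run."""
    return total_cost / hours


if __name__ == "__main__":
    print(f"{hourly(291.47):.2f}")                 # 18.22
    print(f"{hourly(944.07):.2f}")                 # 59.00
    print(f"{125_034 / WORK_HOURS_PER_YEAR:.2f}")  # 60.11
```

On these assumptions, the cheaper configuration costs under a third of the salary-derived hourly rate, while the more expensive one lands almost level with it. Salary alone also understates what a penetration tester actually costs an employer, so the comparison is indicative rather than exact.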

The careful interpretation is not “replace the pentesters.” That would be both premature and, frankly, a little unimaginative.

The more defensible business interpretation is that agentic testing can reduce the cost of breadth. Agents can enumerate many hosts, pursue parallel leads, run repetitive validation, maintain notes, and produce candidate reports. This changes the economics of security coverage. A human team can spend less time doing first-pass exploration and more time on judgment-heavy work: exploit chaining, business impact interpretation, safe validation, remediation prioritization, and adversarial thinking.

Think of the operational split this way:

Work category	ARTEMIS-style agent advantage	Human advantage	Practical operating model
Broad reconnaissance	Parallel scanning, tireless enumeration, persistent task lists	Choosing scope-sensitive priorities	Agent-first, human-supervised
Candidate vulnerability validation	Repeatable checks, report drafting, evidence collection	Recognizing misleading signals and false positives	Agent proposes, human verifies
GUI-heavy testing	Currently weak unless paired with reliable computer-use tooling	Strong visual and workflow understanding	Human-led or tightly supervised
Pivoting after access	Can follow instructions, but may submit too early	Better judgment about when a finding opens a deeper path	Human directs escalation paths
Long-horizon coverage	Context summarization and session continuation	Strategic reprioritization under ambiguity	Mixed workflow
Final risk communication	Structured report generation	Business context, severity calibration, remediation negotiation	Human-owned

This is where the paper becomes relevant for business process automation. The value is not that an AI agent becomes a cyber-ninja, black hoodie included. The value is that security testing can become more continuous, more parallel, and less bottlenecked by scarce expert hours.

That matters because most organizations do not suffer from having too many elite red-teamers with nothing to do. They suffer from queues: too many assets, too many services, too many exposures, too little time. An agent that can cheaply expand first-pass coverage is useful even if it still needs human validation.

The boundary conditions are not decorative; they define the product

The study is unusually realistic for AI cybersecurity research, but it is still a controlled study. The limitations are not minor legal footnotes placed near the end so everyone can feel responsible. They shape how the result should be used.

First, the engagement was compressed. Participants had up to 10 hours of active engagement and 4 days of system access, while many penetration tests run for one to two weeks. A longer test might favor humans more, especially for deeper pivoting, creative chaining, and business-context interpretation. Or it might favor agents more if long-horizon scaffolding improves. The paper does not settle that.

Second, the target was one university environment. It was large and real, but still one environment. Enterprise networks differ wildly: cloud-native companies, banks, hospitals, manufacturing firms, SaaS platforms, and government agencies do not present the same attack surface.

Third, defensive conditions were not fully adversarial. The IT team knew about the test and manually approved flagged actions that might otherwise have been blocked. That means the study is closer to controlled penetration testing than to stealthy intrusion under live blue-team pressure.

Fourth, sample size was small. Ten professionals are valuable, especially given their qualifications and the study’s logging depth, but this is not enough for broad statistical claims about all human professionals.

Fifth, ARTEMIS submitted more false positives than humans. The triage module helped, but it did not solve the problem. In security operations, false positives are not just annoying; they consume analyst trust. A system that is cheap but noisy may simply move cost from testing to validation.

These boundaries do not weaken the paper. They make it more useful. The paper does not show that enterprises can fire their security teams and replace them with a multi-agent scaffold. It shows that AI agents are becoming credible first-pass operators in realistic environments when the scaffold is designed for long-horizon work.

That is already enough to matter.

What security leaders should infer, and what they should not

A cautious security leader should take three lessons from this paper.

The first lesson: evaluate the scaffold, not just the model. Asking whether “GPT-5 can do pentesting” is the wrong level of analysis. The paper shows that the same broad model class can refuse, stall, or perform depending on architecture. Procurement teams should ask about task decomposition, context management, sub-agent delegation, triage, logging, report reproducibility, and escalation controls. The model name is only one line in the architecture.

The second lesson: deploy agentic testing where breadth is expensive. Asset discovery, service enumeration, known-misconfiguration checks, low-risk validation, and preliminary report drafting are natural candidates. These are areas where parallelism and persistence pay off.

The third lesson: keep humans at the points where mistakes are costly. GUI-mediated findings, privilege escalation, destructive-risk decisions, severity calibration, exploit chaining, and remediation negotiation should remain human-owned or at least human-approved. “Autonomous” should not mean “unobserved.” In the study itself, agents were monitored by the research team and the target IT department. That governance layer is part of the result, not an optional accessory.

The wrong lesson is that an AI agent has “beaten human hackers.” That phrase is useful only if the objective is to maximize confusion per syllable.

A better lesson is that AI agents are crossing from toy benchmark competence into operationally relevant security labor. They are not replacing expert judgment. They are expanding the amount of structured reconnaissance and candidate validation an organization can afford to perform.

That is a quieter claim. It is also the one security teams can actually use.

The machines are knocking, but they still need a supervisor

ARTEMIS is impressive because it changes the shape of the work. It turns one model into a supervised swarm. It keeps track of tasks across sessions. It creates specialized sub-agent prompts. It verifies candidate findings before submission. It works in parallel. It sometimes behaves like a disciplined junior team that never sleeps and charges by the API call.

It also makes mistakes that a human would catch in seconds. It misses GUI-heavy paths. It submits too early. It sometimes treats protocol output as proof when it is merely noise wearing a nice status code. It can find more when hinted, which means its execution ability sometimes outruns its own judgment.

That combination is exactly what makes the paper important. The agent is neither a gimmick nor a finished replacement. It is a new layer in the security operating model: broad, persistent, cheap, and increasingly capable, but still bounded by recognition, interpretation, and governance.

For enterprises, the practical question is not whether AI agents will participate in security testing. They will. The question is where to place them so their strengths reduce real bottlenecks without letting their weaknesses create new ones.

The machines are knocking. Sensible organizations will not hand them the keys. They will give them a scoped map, a logging system, a triage queue, and a human who knows when a “finding” is actually just a doorbell.

Cognaptus: Automate the Present, Incubate the Future.

¹ Justin W. Lin et al., “Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing,” arXiv:2512.09882v2, 2026, https://arxiv.org/html/2512.09882.
