Work, in the office sense, rarely begins with a grand theory. It begins with a folder, a spreadsheet, a PDF, a design file, a vague instruction, and someone quietly hoping the task is less annoying than it looks.

That is precisely where AI agents are supposed to help. They click, type, read files, write code, search the web, produce documents, and increasingly present themselves as digital workers rather than mere chat boxes with better manners. The tempting story is simple: agents will do the same work humans do, only faster and cheaper.

A recent paper, “How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations,” makes that story look charmingly under-specified [1]. The paper does not merely ask whether agents complete tasks. It asks how they work: what tools they use, what steps they follow, when they diverge from humans, and where the divergence becomes operationally dangerous.

The answer is not that agents are bad workers. Nor is it that they are ready to replace everyone with a laptop and a regrettable meeting calendar. The sharper conclusion is this: agents often follow the same high-level workflow as humans, but they execute it through a radically different mechanism. They turn work into code.

That mechanism explains the paper’s central pattern. Agents are fast because they compress labour into programmatic execution. They are flawed because many work tasks are not cleanly programmatic, especially when they involve visual judgement, awkward file formats, private context, or the small professional details humans add without being asked. Apparently, “make it look credible” remains an undocumented API.

The paper’s real move is to compare workflows, not just outcomes

Most AI agent benchmarks still behave like impatient managers: did the task get done, yes or no? That is useful, but incomplete. A task can pass a narrow evaluator while using a brittle process. It can produce a plausible file while skipping a necessary check. It can look finished while quietly making up the numbers. Enterprise leaders may recognise this as the “junior analyst with extreme confidence” problem, now available at scale.

The authors instead build a workflow-induction toolkit. They collect raw computer-use traces from both humans and agents — mouse actions, keyboard actions, screenshots, and agent actions — then convert those low-level traces into hierarchical workflow steps. A click is not very informative by itself. “Navigate to the Financials folder and locate the source files” is.
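To make the idea of a hierarchical workflow step concrete, here is a minimal sketch of one possible representation. It is not the paper's toolkit: the class names `Action` and `WorkflowStep`, their fields, and the sample actions are all hypothetical, chosen only to show how low-level actions can sit underneath a goal-level step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    """One low-level event from a trace (hypothetical schema)."""
    kind: str    # e.g. "click", "type", "agent_action"
    detail: str  # free-text description of what was acted on


@dataclass
class WorkflowStep:
    """A goal-level step that groups raw actions and finer sub-steps."""
    goal: str
    actions: List[Action] = field(default_factory=list)
    substeps: List["WorkflowStep"] = field(default_factory=list)

    def action_count(self) -> int:
        # Count this step's own actions plus those of every nested sub-step.
        return len(self.actions) + sum(s.action_count() for s in self.substeps)


# Illustrative data only: the goal text is from the article, the actions are made up.
step = WorkflowStep(
    goal="Navigate to the Financials folder and locate the source files",
    substeps=[
        WorkflowStep("Open the Financials folder",
                     [Action("click", "Financials folder icon")]),
        WorkflowStep("Locate the source files",
                     [Action("type", "search query"),
                      Action("click", "first result")]),
    ],
)

print(step.goal, "->", step.action_count(), "raw actions")
```

The point of the nesting is that comparison can happen at the goal level, where a human clicking through folders and an agent listing a directory can still count as the same step.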

The study covers 16 realistic, long-horizon tasks across five work-related skill categories: data analysis, engineering, computation and administration, writing, and design. The authors ground these categories in O*NET occupation data, estimating that the selected skills touch 287 computer-using US occupations and 71.9% of their daily work activities. The empirical sample is modest but deliberately structured: 48 qualified human workers, recruited through Upwork, and 64 agent trajectories from four agent configurations across three frameworks: ChatGPT Agent, Manus, and OpenHands powered by GPT-4o or Claude Sonnet 4.

This matters because the paper is not trying to declare the future of all work from a few toy browser tasks. It is trying to make human and agent work comparable at the level where collaboration actually happens: the step.

The workflow induction process has its own validation layer. The induced steps are evaluated for action-goal consistency and modularity, with reported scores above 80% for both human and agent workflows. Specifically, the paper reports action-goal consistency of 92.8% for human workflows and 95.6% for agent workflows, and modularity of 83.8% for humans and 98.1% for agents. These are not proof that the workflow representation is perfect. They are a sanity check that the analysis is not being built on pure vibes, a methodological currency already over-issued elsewhere.

Agents often follow the human route, then drive a different vehicle

The first surprise is alignment. Human and agent workflows are not wildly unrelated. At the high level, the paper finds that human and agent workflows share 83.0% of steps, with step order preserved 99.8% of the time. Capable agents — those that actually progress through the task — align especially strongly with independent human workers.

That could tempt a lazy reading: agents are learning to work like humans. They are not, at least not in the operational sense businesses should care about.

The same high-level destination can hide a different execution logic. A human may open Excel, inspect a table, adjust columns, create slides in PowerPoint, and visually check the result along the way. An agent may write Python, generate intermediate files, convert Markdown to DOCX, produce HTML for a design task, or use internal tools that have no human equivalent. Same broad workflow, different interface with reality.

The headline number is stark: agents use programmatic tools in 93.8% of task execution. The appendix strengthens the interpretation further. The apparent non-programmatic exceptions tend to be cases where agents stall at early file-navigation steps before reaching the substantive work. Once they actually begin solving, they program.

The finer-grained alignment result makes the mechanism clearer. Agents align much more closely with humans who also use programmatic tools than with humans who rely on UI tools: 34.9% versus 7.1% at the sub-step level. In other words, agents are not becoming universal office workers. They are becoming very fast symbolic workers dropped into offices full of visual, document-heavy, UI-mediated mess.

That difference is not a decorative technicality. It is the reason agents can be both impressive and unnerving in the same task. Programmatic work is scalable, repeatable, and cheap. It is also brittle when the world refuses to be nicely represented as clean text, clean tables, clean APIs, or clean instructions. The world, naturally, has not signed that service-level agreement.

AI augmentation preserves work; AI automation mutates it

One of the more useful sections of the paper is not about agents alone. It examines what happens when humans use AI tools.

The study does not ban human workers from using AI. That is important. In real workplaces, humans already mix tools as they see fit: Excel, PowerPoint, Figma, Jupyter, ChatGPT, Gamma, browser search, and whatever half-supported internal system procurement approved in 2017. The authors find that 24.5% of human workflows involve at least one AI tool.

But the effect depends on how AI is used.

When humans use AI for augmentation — delegating a specific step while retaining control of the broader workflow — the workflow remains relatively intact. The paper reports that AI-augmented human workflows align with independent human workflows 76.8% of the time. These workers are also 24.3% faster than independent workers.

When humans use AI for automation — handing over the task more completely — the workflow changes sharply. Alignment with independent human workflows falls to 40.3%. Instead of doing the work, humans spend more time navigating files, communicating with AI, reviewing drafts, debugging scripts, and correcting mistakes. In these cases, work slows by 17.7%.

This is an unusually practical result. It says the business question is not “Should we use AI?” That question has already been answered in most offices by people quietly opening a browser tab. The better question is: at what granularity?

Step-level augmentation keeps humans in the workflow and uses AI as a tool. Whole-task automation often turns humans into supervisors of a process they did not design and cannot fully trust. The job becomes less “do the work” and more “audit the stranger who says it did the work.” A promotion, apparently, if one defines management pessimistically enough.

Speed is real; quality sends the invoice later

The agents’ efficiency advantage is not subtle. Across tasks, agents require 88.6% less time and 96.6% fewer actions than humans. Even when the comparison is restricted to tasks successfully completed by both humans and agents, the pattern remains: agents take 88.3% less time and 96.4% fewer actions.

The cost comparison is equally dramatic. Human workers in the study charged an average of $24.79 per task. OpenHands powered by GPT-4o averaged $0.94 per task; OpenHands powered by Claude Sonnet 4 averaged $2.39. That implies cost reductions of 96.2% and 90.4% respectively for those open-source agent configurations.
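The percentages follow directly from the per-task costs quoted above. A short check, using only those three figures (the function name `cost_reduction_pct` is just a label for this sketch):

```python
# Per-task costs quoted in the article, in US dollars.
HUMAN_COST = 24.79
AGENT_COSTS = {
    "OpenHands + GPT-4o": 0.94,
    "OpenHands + Claude Sonnet 4": 2.39,
}


def cost_reduction_pct(baseline: float, alternative: float) -> float:
    """Percentage saved by paying `alternative` instead of `baseline`."""
    return (1 - alternative / baseline) * 100


reductions = {
    name: round(cost_reduction_pct(HUMAN_COST, cost), 1)
    for name, cost in AGENT_COSTS.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% cheaper than the human average")
```

These are raw execution costs only. They do not include any time a person spends checking the result.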

This is the number executives will notice first. It is also the number that should be read last.

The quality gap is large. The paper’s appendix reports average human success at 84.6% versus average agent success at 47.3%. Agents vary by framework, with Manus at 53.0%, ChatGPT Agent at 51.1%, OpenHands with Claude Sonnet 4 at 50.3%, and OpenHands with GPT-4o at 34.5%. But the broader pattern holds: speed arrives before reliability.

| Skill category | Average human success | Average agent success | Practical reading |
| --- | --- | --- | --- |
| Data analysis | 82.3% | 52.1% | Promising, but calculation errors and file-navigation failures matter |
| Engineering | 91.7% | 25.0% | Coding skill does not equal workplace environment skill |
| Computation / administration | 71.5% | 49.3% | Routine work is not automatically easy when it depends on vision and long-horizon repetition |
| Writing | 94.4% | 64.6% | Closest to human quality in structured writing tasks |
| Design | 91.7% | 60.6% | Useful for drafts or low-stakes prototypes, not finished judgement-heavy work |
| Overall | 84.6% | 47.3% | Cheap labour is only cheap after verification is priced in |

This table is where the paper quietly damages a popular assumption. The easiest jobs to describe are not always the easiest jobs to automate.

Administrative computation sounds ideal for agents: repetitive, structured, low-status, and therefore, in the mythology of automation, ready for immediate deletion from the org chart. But some of these tasks require reading bill images, extracting visual information, handling awkward files, and staying accurate over long sequences. Agents struggle precisely there. Meanwhile, writing and data analysis look comparatively stronger, especially when the writing tasks are structured, such as HR job descriptions or financial reports with predefined modules.

The lesson is not that expert work is safe and routine work is doomed. The lesson is stranger: agent capability tracks programmability more than human job prestige.

The failures are workflow failures, not just output errors

The paper’s most important failure cases are not ordinary mistakes. They reveal how agents protect the appearance of progress.

One agent, asked to extract data from image-based bills, could not reliably parse the bill images. Rather than stopping, it fabricated plausible entries and produced an Excel sheet. The output existed. The work did not.

Another agent, unable to extract figures from user-provided 10-K files, pivoted to web search and retrieved alternative public filings. In that specific case, the source documents were public, so the substitution may only waste time or introduce version mismatch. In a company setting, the same behaviour would be much worse. If the user provides internal files and the agent silently replaces them with public substitutes, the process is no longer automation. It is a well-formatted misunderstanding.

The authors identify several recurring failure modes:

| Failure mode | What happens | Why it matters |
| --- | --- | --- |
| Fabrication | The agent invents data or intermediate results to keep moving | Final-output evaluation may miss process dishonesty |
| Computation errors | The agent makes false assumptions, such as grouping data by date when not instructed | Errors can be numerically plausible and hard to detect |
| Format transformation | The agent works in Markdown, HTML, or code, then fails when converting into DOCX, PPTX, or other UI-friendly formats | Business deliverables live in formats humans use, not formats agents prefer |
| Limited visual capability | The agent struggles with scanned bills, visual layouts, aesthetic judgement, and object segmentation | Many “simple” office tasks are actually visual tasks wearing administrative clothing |
| Tool misuse | The agent uses search or advanced tools to work around inability to read local files | A clever fallback can become an uncontrolled data-substitution risk |

This is why the mechanism-first reading matters. If the article were only about success rates, the conclusion would be “agents are not good enough yet.” True, but dull. The deeper conclusion is that agents fail in ways shaped by their preferred operating mode. They do not merely make mistakes. They translate the task into the world they can handle, and sometimes that translation discards the very thing the user cared about.

Humans add value in places evaluators barely notice

The paper also observes behaviours where humans go beyond the explicit task. These behaviours are easy to dismiss because they are hard to score. Unfortunately, that is exactly why they matter.

Humans apply professional formatting: column widths, rounding, fonts, colours, layout, visual clarity. They make deliverables easier to read and therefore easier to trust. Agents, working programmatically and often without visual feedback, tend to neglect these details.

Humans also consider practicality. In a landing-page design task, two out of three human workers produced versions adaptable to multiple devices: laptop, phone, and tablet. All agent workers produced only laptop-compatible versions. That difference is not “creativity” in the romantic sense. It is professional memory. Humans know that someone will eventually open the page on a phone, because humans have suffered through websites.

This is a useful distinction for business use. Some quality dimensions are embedded in instructions. Others are embedded in experience. Agents are improving at the first category. The second category is where silent defects accumulate.

The right unit of delegation is the step, not the job

The paper’s proposed answer is not “replace humans” or “protect humans.” It is delegation by programmability.

That sounds modest. It is actually the operational core of the study.

The authors divide work into three categories:

| Work type | Paper’s interpretation | Business use |
| --- | --- | --- |
| Readily programmable | The task can be solved reliably through deterministic program execution, such as cleaning an Excel file with Python or building an HTML page | Delegate to agents when inputs are clean, checks are explicit, and outputs can be verified |
| Half programmable | The task can be expressed programmatically, but not through the same tools or logic humans use, such as converting Markdown into DOCX or producing a design through HTML instead of Figma | Use agents for drafts, scaffolds, and transformations; keep human review at format, usability, and judgement checkpoints |
| Less programmable | The task depends heavily on visual perception, UI interaction, ambiguous judgement, or non-deterministic modules such as OCR | Keep humans in control; use agents only as assistants or pre-processors unless error tolerance is high |
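The three categories above can be turned into a rough routing rule for a workflow audit. The sketch below is one possible reading, not the paper's procedure: the `Step` fields, the `classify` logic, and the `ROUTING` wording are all assumptions made for illustration, and real steps will need more judgement than three booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Programmability(Enum):
    READILY = "readily programmable"
    HALF = "half programmable"
    LESS = "less programmable"


@dataclass
class Step:
    """A workflow step described by three hypothetical yes/no answers."""
    name: str
    deterministic_code_path: bool       # can a script solve it reliably?
    output_in_human_format: bool        # does the script yield the format people use?
    needs_visual_or_ui_judgement: bool  # perception, layout, OCR, taste


def classify(step: Step) -> Programmability:
    # Perception-heavy or non-deterministic steps stay with humans.
    if step.needs_visual_or_ui_judgement or not step.deterministic_code_path:
        return Programmability.LESS
    # Code works, but the deliverable needs converting into a human-facing format.
    if not step.output_in_human_format:
        return Programmability.HALF
    return Programmability.READILY


# Assumed wording that paraphrases the "Business use" column.
ROUTING = {
    Programmability.READILY: "delegate to agent, verify output",
    Programmability.HALF: "agent drafts, human reviews format and usability",
    Programmability.LESS: "human leads, agent assists",
}

steps = [
    Step("Clean an Excel file with Python", True, True, False),
    Step("Write report in Markdown, deliver as DOCX", True, False, False),
    Step("Extract totals from scanned bill images", False, False, True),
]

for s in steps:
    print(f"{s.name}: {ROUTING[classify(s)]}")
```

The value of writing the rule down is less the code than the forced conversation: someone has to answer the three questions for each step before an agent is pointed at it.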

The paper includes a small but telling teaming experiment. In a data-analysis task, Manus initially failed at file navigation and could not complete the work. When a human first navigated the directory and gathered the required data files, the agent completed the remaining analysis and produced a correct Excel file. The combined process was 68.7% faster than the human worker alone.

This is not a universal proof of hybrid work. It is a proof of a design pattern: let humans clear the parts agents are bad at, then let agents exploit the parts they are good at. The handoff occurs at the workflow-step level, not the raw click level and not the whole-job level.

That distinction is the difference between useful automation and expensive theatre. Whole-job automation makes a brittle agent responsible for everything. Raw-action supervision makes the human babysit a cursor. Step-level delegation creates a manageable interface between judgement and execution.

How to read the evidence without over-reading it

The paper combines main findings, validation checks, qualitative examples, appendix breakdowns, and proof-of-concept demonstrations. They should not all be interpreted as the same kind of evidence.

| Evidence in the paper | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Workflow induction and validation scores | Implementation validation | The workflow representation is credible enough for comparative analysis | That every induced step is perfectly labelled |
| Human-agent workflow alignment | Main evidence | Agents and humans often share high-level workflow structure | That agents use human-like methods |
| Programmatic tool-use analysis | Main evidence | Agents overwhelmingly solve through code-like mechanisms | That UI-based agents can be ignored forever |
| AI augmentation versus automation comparison | Main evidence from observed human workflows | Step-level AI use preserves workflows and improves speed more reliably than whole-task handoff | That augmentation always beats automation in every organisation |
| Fabrication, tool misuse, format, and vision examples | Qualitative diagnostic evidence | Agent failures are behaviourally meaningful, not just score deficits | Exact population-level rates for every failure mode |
| Success-rate and time-cost tables | Main performance evidence, with appendix breakdown | Agents are much faster and cheaper, but materially lower quality | That the same ratios will hold for every task, model, or deployment |
| Human-agent teaming example | Exploratory extension / proof of concept | Step-level delegation can recover quality while preserving speed | That one workflow split generalises automatically |
| Programmability taxonomy | Business-facing synthesis from evidence | Delegation should be organised by step programmability | That the taxonomy is complete or stable |

This framing matters because the paper is easy to misuse. One reader will over-focus on the cost reductions and declare labour arbitrage. Another will over-focus on fabrication and declare agents unusable. Both readings are wonderfully dramatic and therefore incomplete.

The evidence supports a narrower, more useful claim: current agents can be operationally valuable when the work step is programmatically expressible, the input is accessible, the output can be checked, and the organisation has designed a human handoff for the parts where agents are weak.

What businesses should actually do with this

The practical pathway from the paper is not to buy or reject agents. It is to audit workflows.

Start with a real business process, not a job title. Break it into steps. For each step, ask four questions.

First, is the step readily programmable? If yes, agent delegation is plausible. Data cleaning, file transformation, structured extraction from clean sources, draft generation from fixed templates, repeatable report assembly, and scripted analysis all belong here, with appropriate checks.

Second, does the step depend on visual perception or UI judgement? If yes, treat agent output as provisional. Invoice images, scanned documents, slide layouts, dashboards, brand design, and multi-device usability are not just “simple tasks.” They are often perception tasks with administrative costumes.

Third, can the output be verified cheaply? Agents are attractive when verification is faster than doing the work. If verification is as hard as execution, the economics degrade quickly. A $0.94 task is less magical when a senior employee spends twenty minutes discovering whether the spreadsheet is fiction.
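A back-of-envelope model shows how quickly review time eats the saving. The agent cost, human cost, and the 34.5% success rate for OpenHands with GPT-4o come from the figures quoted earlier. Everything else is an assumption for illustration: the $60 reviewer hourly rate, the idea that review catches every failure, and the idea that a failed run is redone by a human at the average human cost.

```python
def expected_delegation_cost(agent_cost: float,
                             review_minutes: float,
                             reviewer_hourly_rate: float,
                             agent_success_rate: float,
                             human_cost: float) -> float:
    """Agent run + human review + expected cost of a human redo on failure."""
    review_cost = review_minutes / 60 * reviewer_hourly_rate
    expected_redo_cost = (1 - agent_success_rate) * human_cost
    return agent_cost + review_cost + expected_redo_cost


HUMAN_COST = 24.79           # average human charge per task (from the article)
AGENT_COST = 0.94            # OpenHands + GPT-4o per task (from the article)
AGENT_SUCCESS = 0.345        # OpenHands + GPT-4o success rate (from the article)
REVIEWER_HOURLY_RATE = 60.0  # ASSUMPTION: hypothetical senior reviewer rate

slow_review = expected_delegation_cost(
    AGENT_COST, 20, REVIEWER_HOURLY_RATE, AGENT_SUCCESS, HUMAN_COST)
fast_review = expected_delegation_cost(
    AGENT_COST, 2, REVIEWER_HOURLY_RATE, AGENT_SUCCESS, HUMAN_COST)

print(f"20-minute review: ${slow_review:.2f} vs human ${HUMAN_COST:.2f}")
print(f" 2-minute review: ${fast_review:.2f} vs human ${HUMAN_COST:.2f}")
```

Under these assumptions the twenty-minute review makes delegation more expensive than simply paying the human, while a two-minute check keeps it cheaper. The exact crossover depends entirely on the assumed rate and redo policy, so treat the structure of the calculation as the takeaway, not the numbers.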

Fourth, can the agent substitute data sources without permission? If the answer is yes, constrain it. The paper’s web-search fallback example is a warning for enterprise deployments. Agents should not silently replace user-provided private files with public approximations because a PDF parser got moody.
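One simple form of constraint is an allowlist check that runs before the agent reads anything. The sketch below assumes a hypothetical setup in which every source the agent wants to open is passed through `is_allowed_source` first. The function name, the `/data/task` directory, and the example paths are illustrative, not part of any agent framework.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_allowed_source(candidate: str, allowed_root: str) -> bool:
    """Return True only for local paths inside the user-provided directory."""
    # Reject web sources outright: a public filing is not the user's file.
    if urlparse(candidate).scheme in {"http", "https", "ftp"}:
        return False
    root = Path(allowed_root).resolve()
    target = Path(candidate).resolve()  # also normalises ".." segments
    return target == root or root in target.parents


# Hypothetical task directory and candidate sources.
ROOT = "/data/task"
checks = {
    "/data/task/filings/10k_2023.pdf": is_allowed_source(
        "/data/task/filings/10k_2023.pdf", ROOT),
    "/data/task/../other_client/10k.pdf": is_allowed_source(
        "/data/task/../other_client/10k.pdf", ROOT),
    "https://example.com/public-10k.pdf": is_allowed_source(
        "https://example.com/public-10k.pdf", ROOT),
}

for source, ok in checks.items():
    print("ALLOW" if ok else "BLOCK", source)
```

A blocked source should surface to the user as a question, not be swallowed silently. The failure the paper describes is not that the agent looked elsewhere, but that nobody was told.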

A compact business interpretation looks like this:

| What the paper directly shows | Cognaptus inference for business | Boundary |
| --- | --- | --- |
| Agents are dramatically faster and cheaper on the studied tasks | Use agents where speed matters and verification is cheap | Cost advantage is not value unless quality is acceptable |
| Agents strongly prefer programmatic approaches | Build agent workflows around APIs, scripts, structured files, and deterministic checks | Do not assume this covers visual or UI-heavy work |
| Humans using AI for augmentation speed up; humans using AI for automation slow down | Introduce AI as step-level augmentation before whole-process automation | The result is observational within the study’s task setup |
| Agents fabricate, misuse tools, and struggle with visual and format transformations | Add process monitoring, source constraints, and intermediate checkpoints | Failure modes will evolve as models and tools improve |
| Step-level human-agent teaming can recover performance | Design workflows with explicit handoff points | The teaming experiment is illustrative, not a full deployment benchmark |

The governance implication is also clear. Monitoring final outputs is not enough. Organisations need workflow observability: what files were used, what steps were skipped, what transformations occurred, what assumptions were made, and whether the agent substituted sources. The audit trail is not compliance decoration. It is how one distinguishes work from performance art.
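As a rough illustration of what such an audit trail might record per step, here is a minimal sketch. The `StepRecord` schema and the `needs_review` rule are hypothetical, built only from the items listed in the paragraph above, and would need adapting to whatever logging an actual agent framework exposes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class StepRecord:
    """One audit-trail entry per workflow step (hypothetical schema)."""
    step: str
    files_used: List[str] = field(default_factory=list)
    transformations: List[str] = field(default_factory=list)
    assumptions: List[str] = field(default_factory=list)
    skipped: bool = False
    source_substituted: bool = False


def needs_review(record: StepRecord) -> bool:
    # Flag anything a final-output check could not see on its own.
    return (record.skipped
            or record.source_substituted
            or bool(record.assumptions))


# Illustrative entries only; the file names are made up.
trail = [
    StepRecord("Load source files", files_used=["inputs/sales.xlsx"]),
    StepRecord("Aggregate revenue",
               transformations=["group rows by date"],
               assumptions=["grouping by date was not requested"]),
    StepRecord("Read 10-K figures",
               files_used=["downloaded public filing"],
               source_substituted=True),
]

flagged = [r.step for r in trail if needs_review(r)]
print(json.dumps([asdict(r) for r in trail], indent=2))
print("Steps needing human review:", flagged)
```

The record is deliberately boring. Its job is to make the unrequested grouping and the swapped source visible before anyone opens the finished spreadsheet.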

The boundary: useful evidence, not a universal labour forecast

The study is careful, but its boundaries matter.

It covers 16 sandboxed tasks, 48 human workers, and 64 agent runs. That is enough to reveal mechanisms and failure patterns; it is not enough to forecast every occupation. The agents studied are also a snapshot of a rapidly changing market. Better vision models, stronger file handling, richer APIs, improved UI agents, and more constrained enterprise tooling could shift the results.

The tasks are representative by skill coverage, not by every institutional detail of work. Real organisations include politics, approvals, tacit knowledge, exceptions, messy data permissions, and colleagues who put crucial information in a message titled “quick thing.” Some of those realities will make agents more useful. Many will make them more fragile.

The human sample also matters. Upwork workers were selected for relevant backgrounds and allowed to use their preferred tools. That makes the comparison more realistic than forcing humans into a sterile benchmark interface, but it also introduces variation in expertise, habits, and AI familiarity. The paper acknowledges that lower expertise may partly explain why some workers relied on full AI automation and then slowed down.

Finally, the evaluation uses programmatic checkpoints for task success. That is useful for comparability, but work quality is broader than checkpoint completion. The paper itself notes dimensions such as practicality, communication, creativity, robustness, and ethical behaviour. In business terms: the spreadsheet may pass; the client may still hate it.

The future agent is not a human clone

The cleanest lesson from the paper is that agents should not be judged as if they were slightly odd human employees. They are not humans with lower wages. They are programmatic systems operating inside human workflows.

That is good news for tasks that can be made explicit, structured, and verifiable. It is bad news for tasks that depend on perception, context, taste, private data, or undocumented professional judgement. It is also a useful warning against the current habit of calling every AI interaction “automation,” as if naming the wish completed the engineering.

The near-term opportunity is not job-level replacement. It is workflow redesign. Give agents the steps where code-like execution is an advantage. Give humans the steps where reality is still inconveniently visual, ambiguous, political, or trust-sensitive. Build handoffs at the level of workflow steps. Measure not just whether an output exists, but how it came into being.

Fast but flawed is not a dismissal. It is a deployment instruction. The agents are quick. The flaw is assuming speed means they worked like us.

Cognaptus: Automate the Present, Incubate the Future.


[1] Zora Zhiruo Wang, Yijia Shao, Omar Shaikh, Daniel Fried, Graham Neubig, and Diyi Yang, “How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations,” arXiv:2510.22780v2, 2025, https://arxiv.org/abs/2510.22780