Copilot at Work: How Generative AI is Quietly Rewriting Job Descriptions

TL;DR for operators

A new Microsoft Research paper does something more useful than another round of “AI will change everything” bingo: it looks at roughly 200,000 anonymised U.S. Bing Copilot conversations and asks which work activities people actually use generative AI for.¹ The result is not an automation forecast. It is a map of where AI already touches work.

The paper’s most important move is to split every conversation into two sides. The user goal is what the human is trying to do. The AI action is what the model does in response. That sounds pedantic until it saves the analysis from the usual nonsense. A worker asking how to print a document is trying to operate office equipment; the AI is training someone to use equipment. Same conversation, different work activities, different business meaning.

The strongest signal is in information work: gathering information, maintaining knowledge, writing, editing, explaining, responding to customer inquiries, providing information to customers, and preparing informational material. The highest-scoring occupations include interpreters and translators, writers and authors, service sales representatives, customer service representatives, technical writers, editors, data scientists, market research analysts, and management analysts. The common thread is not prestige. It is information moving through language.

For operators, the immediate use is task triage. Do not ask, “Which roles can we replace?” Ask: which recurring activities involve information creation, transformation, explanation, retrieval, or customer response; which of those can AI assist; which can it perform directly; and which require human judgement, accountability, relationship management, or physical execution? Yes, that is less exciting than a mass-layoff spreadsheet. It also has the advantage of being connected to reality.

The boundaries matter. The study is U.S.-centric, based on Copilot usage from January to September 2024, relies on LLM classification, and uses O*NET’s task taxonomy. It does not directly measure productivity, ROI, quality risk, regulatory exposure, job loss, or wage effects. It measures applicability: where observed AI usage overlaps with occupational work activities and appears to complete a meaningful part of them.

The mistake is treating an applicability score as a layoff forecast

The lazy version of this paper is already obvious: rank occupations by AI applicability, write “these jobs are at risk,” add a dramatic chart, collect clicks, and go home. Conveniently, the paper is better than that.

Its central claim is not that Copilot usage predicts job destruction. The claim is narrower and more interesting: real-world AI conversations reveal which work activities are already being assisted or performed by a general-purpose chatbot, and those activities can be mapped back to occupations.

That difference matters because occupations are not single tasks. A customer service representative does not “answer questions” in the abstract. They interpret policy, calm customers, document issues, escalate exceptions, navigate systems, and absorb organisational dysfunction like a cheerful sponge. An AI system may perform one part of that bundle well and still be useless, dangerous, or merely awkward at the rest.

The paper therefore avoids the heroic assumption that a job is either automated or untouched. Instead, it decomposes jobs into O*NET intermediate work activities, or IWAs. Then it asks three questions:

Question	What the paper measures	Why it matters operationally
Is the activity present in Copilot use?	Activity share and coverage	Shows where users are already bringing work-like tasks to AI
Does the conversation appear successful?	Completion and user feedback	Separates mere experimentation from plausible utility
Is the AI contribution broad or narrow?	Scope, rated from none to complete	Distinguishes a toy answer from a meaningful chunk of the activity

That is the measurement machine. Without it, the occupation ranking is just a leaderboard with a tuxedo.

The real unit of analysis is the conversation split in two

The paper’s most elegant design choice is the user-goal versus AI-action split.

In many AI-at-work discussions, augmentation and automation are treated as vibes. A model “augments” if the speaker wants reassurance. It “automates” if the speaker wants drama. Here, the authors operationalise the distinction.

A user goal captures what the human is trying to accomplish. This is the assistive side of AI: the worker remains the actor, and the model helps. An AI action captures what the model itself does. This is the delegation side: the model performs a work-like activity inside the conversation.

The split reveals that conversations are not symmetric. In the uniform Copilot sample (the roughly 100k conversations the paper treats as its representative dataset), each conversation is matched to an average of about 2.9 user-goal IWAs and about 6.6 AI-action IWAs. The sets are disjoint in 40% of conversations, and in 96% of conversations there are more activities unique to one side than shared. In plain English: the user asks for one kind of work, and the AI often performs several different support activities to help.
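For readers who want to run the same check on their own labelled conversations, here is a minimal sketch of the two overlap statistics. It assumes each conversation has already been tagged with a set of user-goal IWAs and a set of AI-action IWAs, and it reads "more activities unique to one side than shared" as comparing the symmetric difference with the intersection. The function name, the labels, and the four sample conversations are invented for illustration and are not taken from the paper.

```python
def overlap_stats(conversations):
    """Summarise how user-goal and AI-action IWA sets overlap.

    `conversations` is a list of (user_goal_iwas, ai_action_iwas) pairs,
    where each side is a set of IWA labels.
    """
    if not conversations:
        raise ValueError("need at least one conversation")
    disjoint = 0
    mostly_unique = 0
    for user_goals, ai_actions in conversations:
        shared = user_goals & ai_actions   # IWAs that appear on both sides
        unique = user_goals ^ ai_actions   # IWAs that appear on exactly one side
        if not shared:
            disjoint += 1
        if len(unique) > len(shared):
            mostly_unique += 1
    n = len(conversations)
    return {
        "disjoint_share": disjoint / n,
        "mostly_unique_share": mostly_unique / n,
    }


# Invented examples with placeholder labels, not data from the paper.
sample = [
    ({"Operate office equipment"}, {"Train others to use equipment"}),
    ({"Gather information", "Write material"},
     {"Gather information", "Write material"}),
    ({"Gather information"},
     {"Gather information", "Explain technical details",
      "Prepare informational materials"}),
    ({"Edit documents"}, {"Provide information to customers"}),
]
stats = overlap_stats(sample)
print(stats)
```

The first sample is the printing example from the TL;DR: the two sides share nothing, so a single combined label would lose one of them.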

That has a direct business consequence. If you are redesigning work around AI, you should not only ask which tasks employees want help with. You should also ask what the model is actually doing in the background. A salesperson may use AI to prepare for a client conversation; the model may be gathering information, drafting language, explaining technical details, and structuring a response. The workflow impact lives in that chain, not merely in the employee’s initial request.

Frequency alone would reward curiosity, not usefulness

The paper does not stop at counting what people ask Copilot to do. That would be a mistake, because usage can mean many things: curiosity, desperation, convenience, confusion, boredom, or a manager having just sent a memo with the words “AI-first” in it. Tragic, but measurable.

To separate usage from usefulness, the authors combine three ingredients.

First, activity share allocates each conversation across the IWAs assigned to it. If a conversation maps to several work activities, each gets a fraction of the conversation rather than a full count. This reduces double-counting.

Second, completion estimates whether the AI completed the user’s task. The authors use an LLM classifier for completion and validate its signal against explicit thumbs-up/thumbs-down feedback. The feedback data is useful but selected; not everyone gives feedback, and the people who do may not be a representative sample of humanity, as anyone who has read product reviews will understand too well.

Third, scope asks how much of the IWA the AI demonstrated it could assist or perform. This matters because an AI might help with a narrow fragment of a broad activity. “Analyze data using mathematics” can mean anything from explaining a formula to producing a reliable, auditable model for a regulated decision. One of those is a chatbot demo. The other is work.
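The first ingredient, activity share, is the easiest to make concrete. The sketch below assumes an equal split across the IWAs attached to a conversation, which is the simplest reading of "each gets a fraction"; the paper's exact weighting may differ. The function name and labels are placeholders.

```python
from collections import defaultdict


def activity_shares(conversations):
    """Fractionally allocate each conversation across its IWAs.

    `conversations` is a list of IWA-label lists. A conversation tagged with
    k distinct IWAs contributes 1/k to each (equal split is an assumption).
    Returns each IWA's share of the total conversation weight.
    """
    totals = defaultdict(float)
    for iwas in conversations:
        labels = set(iwas)            # ignore repeated labels in one conversation
        if not labels:
            continue                  # unlabelled conversations carry no weight
        fraction = 1.0 / len(labels)
        for label in labels:
            totals[label] += fraction
    counted = sum(totals.values())
    if counted == 0:
        return {}
    return {label: total / counted for label, total in totals.items()}


# Invented examples, not data from the paper.
sample = [
    ["Gather information"],
    ["Gather information", "Write material"],
    ["Write material", "Edit documents"],
]
shares = activity_shares(sample)
print(shares)
```

With a full count per label, "Gather information" and "Write material" would tie at two conversations each. With fractional allocation, the single-purpose conversation counts for more, and the shares sum to one.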

The resulting applicability score is built from observed activity share, completion, and the share of conversations where scope is at least moderate. For AI actions, the authors also adjust for physical tasks, because an IWA such as “provide general assistance” can include activities that are partly informational and partly physical. Copilot can explain how to move a refrigerator. It cannot move the refrigerator, despite what procurement may hope after seeing the SaaS invoice.
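One way to see how the pieces fit is the sketch below. It is an illustration under stated assumptions, not the paper's formula: it assumes an IWA counts only when its activity share clears a usage threshold, that completion and scope combine multiplicatively, that the physical adjustment is a simple discount factor, and that an occupation's score is a weighted average over its IWAs. The threshold value, the `nonphysical_share` field, the weights, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IwaStats:
    activity_share: float           # fraction of total conversation weight
    completion_rate: float          # share of conversations judged complete
    scope_moderate_plus: float      # share with scope rated at least moderate
    nonphysical_share: float = 1.0  # hypothetical discount for physical tasks


def iwa_score(stats, min_share=0.001):
    """Score one IWA; `min_share` is an illustrative usage threshold."""
    if stats.activity_share < min_share:
        return 0.0  # too rarely observed to count as evidence
    return (stats.completion_rate
            * stats.scope_moderate_plus
            * stats.nonphysical_share)


def occupation_score(iwa_weights, observed, min_share=0.001):
    """Weighted average of IWA scores; unobserved IWAs contribute zero."""
    total_weight = sum(iwa_weights.values())
    if total_weight == 0:
        return 0.0
    weighted = sum(
        weight * iwa_score(observed[iwa], min_share)
        for iwa, weight in iwa_weights.items()
        if iwa in observed
    )
    return weighted / total_weight


# Invented numbers, for illustration only.
observed = {
    "Explain technical details": IwaStats(0.04, 0.8, 0.5),
    "Provide general assistance": IwaStats(0.01, 0.6, 0.5, nonphysical_share=0.2),
    "Rarely seen activity": IwaStats(0.0001, 0.9, 0.9),
}
weights = {"Explain technical details": 0.5, "Provide general assistance": 0.5}
score = occupation_score(weights, observed)
print(round(score, 3))
```

The refrigerator point shows up in the second entry: decent completion and scope, heavily discounted because most of the underlying activity is physical.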

The strongest signal is information work, not elite work

The findings are strongest where the work product is information: creating it, editing it, retrieving it, explaining it, translating it, packaging it, or communicating it to someone else.

At the activity level, common user goals include gathering information, obtaining information about goods or services, maintaining knowledge, reading documents, responding to customer problems or inquiries, providing information to the public or customers, explaining product or technical details, writing, editing, and creating visual designs. AI actions skew toward service-like activities: gathering information, maintaining knowledge, responding to customer inquiries, providing information, explaining product details, teaching subjects, preparing informational material, and offering general assistance.

The success pattern narrows this further. Copilot performs best on activities around communicating, teaching or explaining, and writing. It performs worse on visual design and data analysis tasks. That last point is worth pausing on, because it cuts against a common executive fantasy: “AI will do analysis.” Sometimes, yes. But in this paper’s observed Copilot conversations, data analysis is not where the strongest success signal appears. The machine is more reliable as a language-and-explanation instrument than as a fully trusted analytic worker.

At the occupation level, the top scores make the pattern concrete.

Occupation	AI applicability score	Practical reading
Interpreters and translators	0.492	Language transformation is directly inside the model’s strongest capability zone
Historians	0.462	Research, synthesis, and explanation-heavy work maps strongly to information activities
Writers and authors	0.454	Drafting, editing, and rewriting are core generative-AI use cases
Sales representatives of services	0.449	Product explanation, customer communication, and persuasion support are prominent
Customer service representatives	0.408	Inquiry response and information delivery are highly represented
Technical writers	0.373	Structured explanation and documentation fit the tool
Editors	0.367	Revision and language quality work are natural targets
Data scientists	0.357	Applicability exists, but the evidence is less “AI replaces analysts” than “AI helps with parts of analytical work”
Management analysts	0.353	Information gathering, structuring, and recommendation support are plausible entry points
Market research analysts and marketing specialists	0.350	Research synthesis and commercial communication overlap strongly with Copilot use

This is not a neat blue-collar versus white-collar split. Sales and office support roles score highly because they contain large volumes of information work. Some high-status occupations do not automatically dominate. The paper finds only a weak employment-weighted relationship between AI applicability and wage: $r = 0.13$. Without employment weighting, the relationship looks stronger, especially for AI actions, but that weaker weighted result is the one operators should take seriously. High-employment sales and administrative occupations complicate the tidy “AI hits only high-paid knowledge workers” story.
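Why employment weighting can move a correlation this much is easier to see with a small example. The sketch below computes a weighted Pearson correlation, with employment counts as weights, and compares it with the unweighted version. The four occupations, scores, wages, and headcounts are invented to show the mechanism; they are not the paper's data and will not reproduce its figures.

```python
import math


def weighted_pearson(x, y, weights):
    """Pearson correlation of x and y, with one weight per observation."""
    total = sum(weights)
    mean_x = sum(w * xi for w, xi in zip(weights, x)) / total
    mean_y = sum(w * yi for w, yi in zip(weights, y)) / total
    cov = sum(w * (xi - mean_x) * (yi - mean_y)
              for w, xi, yi in zip(weights, x, y))
    var_x = sum(w * (xi - mean_x) ** 2 for w, xi in zip(weights, x))
    var_y = sum(w * (yi - mean_y) ** 2 for w, yi in zip(weights, y))
    return cov / math.sqrt(var_x * var_y)


# Invented data: the last occupation is large, high-scoring, and modestly paid.
applicability = [0.10, 0.20, 0.40, 0.45]
wage_thousands = [30, 40, 90, 35]
employment = [1, 1, 1, 10]

unweighted = weighted_pearson(applicability, wage_thousands, [1, 1, 1, 1])
weighted = weighted_pearson(applicability, wage_thousands, employment)
print(round(unweighted, 3), round(weighted, 3))
```

One large, high-applicability, modestly paid occupation is enough to pull the weighted correlation toward zero, which is the role sales and administrative work plays in the paper's result.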

The same pattern holds for education. Applicability spans a broad range of educational requirements. The model is not politely checking degree certificates before helping people write, explain, search, and respond.

Low scores mostly mean physical work, not immunity

The bottom end of the ranking is also instructive. Low-scoring occupations include many jobs involving physical manipulation, equipment operation, direct bodily care, maintenance, and manual service: roofers, dishwashers, tire repairers, industrial truck operators, maids and housekeeping cleaners, phlebotomists, highway maintenance workers, and water and wastewater treatment plant operators.

That does not mean these jobs are “safe from AI” in any serious strategic sense. It means that the work activities observed in Copilot conversations overlap less with the core of those occupations. A model may still help with scheduling, compliance documentation, customer messaging, training, troubleshooting, or procurement. But the central work object is often a body, vehicle, pipe, machine, surface, room, or physical site. Current text-based Copilot cannot patch a roof by being eloquent about shingles.

The right conclusion is compositional. Even physically grounded occupations contain information components. Those components are where AI enters first. The mistake is to confuse the point of entry with the whole job.

User goals identify assistance; AI actions identify delegation

The paper’s user-goal and AI-action split becomes especially useful when comparing occupation groups.

Some groups show stronger applicability on the user-goal side. That suggests workers are using AI to perform their own tasks better or faster. Computer and mathematical occupations are an example: the worker remains active, using AI as part of an existing workflow.

Other groups skew toward AI-action applicability. That suggests greater potential for delegating bounded activities to the model. The paper highlights media and financial operations occupations as more likely to delegate tasks involving communication and training. In the paper’s Figure 3, business and financial operations, management, and arts/design/entertainment/sports/media sit on the AI-action side, while food preparation and serving, installation/maintenance/repair, and protective service skew more toward user-goal assistance.

The distinction is subtle but commercially important.

Pattern	What it means	Business move
High user-goal applicability	Workers are likely using AI to assist their own activities	Embed AI into workflow tools, templates, search, drafting, review, and decision support
High AI-action applicability	AI is performing work-like activities inside the conversation	Identify bounded delegation opportunities with review gates
High overall applicability, balanced split	AI can both assist and perform parts of the work	Redesign the workflow around human review, exception handling, and quality control
Low applicability	Core activities are less represented in Copilot use	Focus on support functions rather than the job’s physical or relational centre

This is where the study becomes more useful than a ranking. The ranking tells you where to look. The split tells you how to intervene.

What the evidence supports, and what it does not

The paper contains several types of evidence. Treating them all as equivalent would be sloppy, and nobody needs more sloppy AI strategy. We have vendors for that.

Evidence or test	Likely purpose	What it supports	What it does not prove
100k uniform Copilot conversations	Main evidence	Representative picture of how U.S. Copilot users employ a mainstream chatbot for work-like activities	Actual productivity gains in workplaces
100k feedback-bearing conversations	Supporting evidence	Cross-checks which activities receive positive or negative user reactions	Unbiased satisfaction across all users
User-goal versus AI-action classification	Main methodological contribution	Separates assistance from performance/delegation	Clean prediction of automation versus augmentation outcomes
O*NET IWA mapping	Main measurement bridge	Links observed conversations to occupations through task structure	Complete representation of every job’s real value
Completion classifier	Success metric	Estimates whether conversations complete the user’s task at scale	Objective quality, compliance, or economic value
Scope classifier	Success-depth metric	Distinguishes narrow from broader demonstrated capability	Full capability across all workplace contexts
Threshold robustness checks	Robustness/sensitivity test	Shows relative occupation rankings are more stable than absolute “share affected” claims	A definitive absolute number of workers affected
Wage and education correlations	Comparison with prior expectations	Shows applicability is not simply a high-wage or high-education phenomenon	Distributional outcomes for wages or employment
Physical-task classifier	Implementation detail with validation	Prevents AI-action scores from over-crediting physical work	A complete model of robotics or embodied AI exposure
Information-work classifier	Interpretive support and robustness check	Helps explain why high-scoring occupations cluster around information creation and communication	A timeless taxonomy of work under future AI systems

The most important warning is about absolute estimates. The authors argue that usage-threshold choices make “what share of workers are affected?” numbers highly sensitive. Their score is designed for relative comparison, not for declaring that exactly some percentage of the workforce is now inside the automation blast radius. In other words, the paper refuses to produce the one number executives most want. This is a virtue, not a defect.
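A toy example shows the mechanism. It assumes an occupation's score sums over only those IWAs whose activity share clears a usage threshold, and that a worker counts as "affected" when the occupation's score clears a fixed cutoff. Both rules, the three occupations, and all numbers are invented; this illustrates the sensitivity and does not reproduce the paper's robustness checks.

```python
# occupation -> list of (IWA weight, activity share, completion-and-scope quality)
OCCUPATIONS = {
    "writer": [(0.6, 0.050, 0.5), (0.4, 0.004, 0.4)],
    "analyst": [(0.5, 0.020, 0.4), (0.5, 0.002, 0.3)],
    "roofer": [(0.2, 0.003, 0.3), (0.8, 0.0001, 0.1)],
}
EMPLOYMENT = {"writer": 100, "analyst": 300, "roofer": 600}


def score(iwas, min_share):
    """Sum weight * quality over IWAs observed often enough to count."""
    return sum(w * q for w, share, q in iwas if share >= min_share)


def ranking(min_share):
    """Occupations ordered from highest to lowest score."""
    return sorted(OCCUPATIONS,
                  key=lambda occ: score(OCCUPATIONS[occ], min_share),
                  reverse=True)


def share_affected(min_share, cutoff=0.25):
    """Employment share in occupations whose score reaches the cutoff."""
    affected = sum(EMPLOYMENT[occ] for occ, iwas in OCCUPATIONS.items()
                   if score(iwas, min_share) >= cutoff)
    return affected / sum(EMPLOYMENT.values())


for threshold in (0.001, 0.01):
    print(threshold, ranking(threshold), share_affected(threshold))
```

In this toy, tightening the usage threshold cuts the "affected" share from 40% to 10% of workers while the ordering of occupations does not move. That is the pattern the authors describe: trust the ranking more than the headline percentage.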

The business use case is a task audit, not a replacement list

The paper directly shows that Copilot usage concentrates in information work and that this concentration maps unevenly across occupations. Cognaptus’ business inference is that firms should use this kind of score as a task-audit instrument.

Start with roles, but do not end there. For each role, break work into recurring activities:

Information gathering
Document reading and summarisation
Writing, rewriting, editing, and formatting
Customer or stakeholder response
Explanation of products, policies, or technical details
Preparation of instructional or informational materials
Analysis support, with quality gates
Record creation, documentation, and reporting
Training and onboarding support
Exception handling, negotiation, judgement, and accountability

Then ask three questions for each activity.

First, is AI already good enough to assist the worker? This is where user-goal applicability matters. Drafting customer emails, preparing internal notes, explaining policies, summarising documents, and generating first-pass knowledge material are typical candidates.

Second, is AI good enough to perform a bounded subtask? This is where AI-action applicability matters. Answering standard product questions, creating initial training explanations, producing structured summaries, or preparing first drafts may be delegated if the process includes review and escalation.

Third, what remains human because it is relational, physical, accountable, legally sensitive, politically delicate, or simply too context-dependent? This is where management earns its keep, allegedly.

A practical implementation table might look like this:

Work activity	AI role	Control mechanism	Suitable KPI
Drafting routine customer replies	Assist or first-draft delegation	Human approval before send	Handle time, edit distance, complaint rate
Summarising long documents	Assist	Source links and spot checks	Time saved, error rate, reviewer confidence
Explaining internal policy	Assist	Approved knowledge base grounding	Escalation rate, consistency, employee satisfaction
Sales research preparation	Assist	CRM and source verification	Prep time, meeting quality, conversion support
Technical documentation	Assist	SME review and version control	Draft cycle time, defect reports
Data analysis support	Assist only unless validated	Reproducible code, audit logs, statistical review	Error rate, analyst time saved, model reliability
Physical service work	Peripheral assist	Mobile workflow integration	Admin time, scheduling accuracy, documentation completeness

The phrase “assist only unless validated” deserves to be printed and taped to several conference room doors.
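The three questions and the table above can be turned into a first-pass triage rule. The sketch below is one possible encoding, not a method from the paper: the `Activity` fields, the rule order, and the output labels are all assumptions a firm would need to adapt to its own risk appetite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    information_heavy: bool        # the work product is mostly information
    bounded: bool                  # clear inputs and outputs, easy to review
    validated: bool                # output quality has been checked in a pilot
    needs_human_ownership: bool    # relational, physical, or legally sensitive


def triage(activity):
    """Return a coarse AI role for one recurring work activity."""
    if activity.needs_human_ownership or not activity.information_heavy:
        return "human-led, peripheral assist"
    if activity.bounded and activity.validated:
        return "delegate with review gate"
    return "assist only"


# Invented audit entries for illustration.
audit = [
    Activity("Drafting routine customer replies", True, True, True, False),
    Activity("Data analysis support", True, True, False, False),
    Activity("Negotiating a contract exception", True, False, False, True),
    Activity("Physical service work", False, False, False, False),
]
plan = {a.name: triage(a) for a in audit}
for name, role in plan.items():
    print(f"{name}: {role}")
```

The ordering of the rules is the design choice that matters: accountability is checked before capability, so a task does not become delegable merely because the model happens to be good at it.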

The strongest near-term ROI is in work redesign, not model worship

The study suggests a clear ordering for AI adoption.

The first wave is not “replace the occupation.” It is “compress the information layer around the occupation.” That means faster drafting, faster search, faster explanation, faster translation, faster preparation, faster documentation, and faster response. These may or may not reduce headcount. More often, they change the bottleneck.

In customer service, for example, AI may draft responses and retrieve policy details, but the bottleneck may move to exception handling, empathy, system authority, or escalation rights. In sales, AI may prepare account summaries and product explanations, but the bottleneck may move to trust, timing, negotiation, and deal strategy. In technical writing, AI may accelerate first drafts, but the bottleneck may move to accuracy, architecture, and version discipline.

That is the useful sense in which job descriptions get rewritten. The title may stay the same. The valuable centre of the job shifts.

For managers, the wrong question is: “How many people can AI replace?” The better question is: “Which work activities become cheap enough that we should do more of them, and which human activities become more valuable once the information layer is automated?”

This is the ATM lesson the paper itself invokes. ATMs automated cash dispensing, but bank teller employment did not simply collapse; branch economics and teller responsibilities changed. The analogy is not a prediction that AI will preserve jobs. It is a warning that task automation and labour-market outcomes are mediated by business decisions. Firms decide whether to cut staff, expand service, lower prices, increase quality, open new channels, or simply produce more PowerPoint. History contains all options, including the depressing one.

The classifier is useful, but not magic

Because the paper relies heavily on LLM classification, its validation details matter.

The authors use GPT-4o to classify conversations into O*NET IWAs. The task is genuinely difficult: even human annotators show only moderate agreement. On the test set, human-human Cohen’s kappa ranges from 0.41 to 0.58 for user-goal classification and 0.49 to 0.57 for AI-action classification. The model’s agreement with human annotators is slightly lower but in the same rough range: 0.35 to 0.44 for user goals and 0.34 to 0.53 for AI actions.
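For readers who do not keep kappa in their heads: Cohen's kappa is raw agreement corrected for the agreement two annotators would reach by chance, given how often each uses each label. Here is a minimal sketch for two annotators assigning one label per item. How the paper adapts this to multi-label IWA classification is not covered here, and the ten yes/no judgements are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented judgements: does a given IWA apply to each of ten conversations?
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

These two invented annotators agree on 8 of 10 items, yet kappa comes out near 0.58, because roughly half that agreement would be expected by chance. That is the scale on which the paper's 0.35 to 0.58 figures should be read.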

That is not a fatal flaw. It is a reminder that the score is a statistical signal, not a divine transcript of workplace reality.

Scope classification is noisier. Human-LLM agreement is weaker than human-human agreement, and the paper is transparent about that. Still, scope contributes useful signal because it captures something completion does not: whether the demonstrated AI capability covers a meaningful portion of the work activity or only a narrow slice.

The physical-task adjustment is better validated. The authors use GPT-5 to label whether O*NET tasks require touching or moving people or objects, and two human annotators agree with the model at high rates, with Cohen’s kappa around 0.82 and 0.84. This matters because without that adjustment, a text chatbot would receive too much credit for work that sounds informational but is physically executed.

The result is a good measurement apparatus for relative occupational comparison. It is not an HR oracle. Those remain mostly built from spreadsheets and vibes.

Boundaries for serious use

The paper’s limitations are not decorative. They change how the results should be used.

First, the dataset reflects U.S. Bing Copilot conversations from January to September 2024. It is not all AI usage, not all workplace usage, and not all countries. Enterprise deployments, specialised tools, code assistants, domain-specific agents, and future multimodal systems may shift the frontier.

Second, the authors do not know users’ occupations. They observe work activities in conversations and map those activities to occupations through O*NET. That is a clever bridge, but still a bridge. It does not prove that a specific translator, sales representative, analyst, or teacher used Copilot in a particular way.

Third, O*NET is U.S.-centric and can lag actual work. It decomposes occupations into tasks and activities, but it cannot fully capture the connective tissue of work: judgement, sequencing, trust, coordination, tacit knowledge, political navigation, accountability, and knowing when the official process is fiction.

Fourth, completion and scope are not productivity. A conversation that appears complete may still be low quality, non-compliant, off-brand, insecure, or simply wrong in a subtle way. That is why firms need workflow-level measurement: time saved, error rates, review burden, customer outcomes, audit findings, and downstream rework.

Fifth, applicability is not destiny. High applicability marks where work is likely to change. Whether that change becomes augmentation, delegation, deskilling, upskilling, cost-cutting, service expansion, or organisational chaos depends on incentives and implementation. The model does not decide the operating model. Management does. Occasionally.
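The fourth boundary, workflow-level measurement, does not require heavy tooling to start. As one hedged example, review burden can be approximated by how much of an AI draft a human changed before sending it. The sketch below uses Python's standard `difflib` at word level. The function names and the choice of metric are assumptions for illustration, and a low revision share says nothing about whether the final text was correct.

```python
from difflib import SequenceMatcher


def revision_share(ai_draft, final_text):
    """Rough share of a draft that was changed, from 0.0 to 1.0.

    Compares word sequences; 0.0 means the draft was sent unchanged.
    """
    draft_words = ai_draft.split()
    final_words = final_text.split()
    matcher = SequenceMatcher(None, draft_words, final_words, autojunk=False)
    return 1.0 - matcher.ratio()


def review_burden(pairs):
    """Average revision share over (ai_draft, final_text) pairs."""
    if not pairs:
        raise ValueError("need at least one draft/final pair")
    return sum(revision_share(d, f) for d, f in pairs) / len(pairs)


# Invented drafts for illustration.
pairs = [
    ("Your refund is approved", "Your refund is approved"),
    ("hello world", "hello there"),
]
print(review_burden(pairs))
```

Tracked alongside error rates and downstream rework, a number like this at least separates drafts that are sent as written from drafts that are quietly rewritten.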

How to use the paper without abusing it

A responsible AI operating plan based on this paper would look like this:

Step	Action	Reason
1	Decompose roles into recurring work activities	Occupations are bundles; AI enters through activities
2	Mark information-heavy activities	The strongest observed signal is creation, processing, and communication of information
3	Separate assistance from delegation	User-goal and AI-action applicability imply different workflow designs
4	Pilot high-scope, high-completion activities first	Frequency alone is not enough
5	Add review gates where quality matters	Completion is not compliance or correctness
6	Measure workflow outcomes, not adoption theatre	Usage dashboards do not prove value
7	Reallocate human attention deliberately	The point is to move bottlenecks, not merely sprinkle chatbots around

The paper is especially useful for organisations drowning in vague AI mandates. It gives them a more disciplined starting point: map the information layer of work, identify where AI already demonstrates meaningful capability, and then design pilots around task-level economics.

That is not as glamorous as announcing an “AI transformation office.” It is also less likely to produce a dozen disconnected tools, three procurement disputes, and one lonely chatbot named after a Greek deity.

The quiet rewrite of job descriptions

The most important conclusion is not that AI is coming for jobs. It is that AI is already nibbling at the verbs inside jobs.

Gather. Explain. Draft. Edit. Translate. Respond. Summarise. Prepare. Advise. Maintain knowledge. Provide information.

Those verbs cut across occupations. They live inside customer service, sales, analysis, education, writing, administration, and management. They also sit at the edges of physical work, where AI may not perform the core task but can still reshape documentation, training, scheduling, and communication.

That is why the paper’s mechanism-first design matters. It does not hand us a prophecy. It gives us a measurement grammar. The unit is not the job title. The unit is the work activity, split between what the human is trying to do and what the AI actually does.

For business leaders, the implication is simple enough to be uncomfortable: stop asking whether AI will replace a role. Ask which activities in that role are becoming cheaper, faster, more scalable, or easier to delegate. Then ask what the human should do with the freed attention.

That second question is where strategy begins. The first one is just panic with a budget.

Cognaptus: Automate the Present, Incubate the Future.

¹ Kiran Tomlinson, Sonia Jaffe, Will Wang, Scott Counts, and Siddharth Suri, “Working with AI: Measuring the Applicability of Generative AI to Occupations,” arXiv:2507.07935.
