Work gets finished. The deck is sharper, the spreadsheet less embarrassing, the email sounds as though it passed through an adult, and the analyst who was supposed to spend three hours wrestling with a problem now appears serenely productive after forty minutes with ChatGPT.
This is the managerial dream version of AI: not replacing people, merely making them better. A polite fiction, but a useful one.
The harder question is whether “better” means more capable, or merely better equipped. A calculator makes arithmetic faster. It does not, by itself, produce a mathematician. A navigation app gets people across town. It does not necessarily improve their spatial memory; in some households it mostly improves the speed at which everyone can confidently take the wrong exit.
A recent pilot study, “Efficiency Without Cognitive Change: Evidence from Human Interaction with Narrow AI Systems,” lands exactly in this uncomfortable gap.[1] The study does not ask whether ChatGPT can help people perform tasks. That part is almost boring now. It asks whether short-term use of ChatGPT changes the underlying cognitive abilities those tasks supposedly exercise: problem solving and verbal comprehension.
Its answer is useful precisely because it is not dramatic. AI helped participants perform better on several applied tasks. It did not produce measurable differential gains on standardized WAIS-III subtests after four weeks. In business terms: the tool improved the workflow, not the worker’s underlying cognitive machinery. At least not in a way this study could detect.
That distinction is where the entire article lives.
The mechanism: a scaffold is not a skill
The cleanest way to read the paper is not as an anti-AI warning. It is a test of where the intelligence is located during AI-assisted work.
There are two very different stories a manager might tell after watching a team use ChatGPT successfully.
The first story is capability growth: employees are learning to reason better, comprehend faster, and solve problems more effectively because AI interaction is training the mind. This is the attractive story. It lets the organisation record productivity gains as a form of upskilling. Convenient. Very board-deck friendly.
The second story is performance scaffolding: employees are completing tasks more efficiently because part of the work has moved outside the head. The AI supplies suggestions, structure, search, wording, examples, and sometimes answers. The human performance system improves, but the human alone may not.
The paper tests this distinction by separating task-level performance from standardized cognitive ability. That separation matters more than any single p-value. If a participant solves a crossword faster with ChatGPT, that tells us the combined human-AI system performed better. It does not tell us the participant’s verbal comprehension improved. To test that, the researchers used WAIS-III subtests before and after the intervention.
That is the central mechanism: AI can reduce cognitive load and redistribute effort without changing underlying ability. It can make the system look smarter while leaving the person roughly where they were. This is not a moral failure. It is how tools work. The mistake is calling every tool-mediated gain a human capability gain. That is how “AI transformation” slides quietly into “we bought everyone a faster crutch and called it leg day.”
What the study actually did
The study used a randomized 2 × 2 mixed factorial design: Group (AI-assisted or control) by Time (pre-intervention or post-intervention). Thirty adults aged 18 to 45 were randomly assigned to the AI-assisted condition or the non-assisted control condition, with 15 participants in each group.
At baseline, participants completed Raven’s Progressive Matrices to check fluid intelligence comparability. They also completed four WAIS-III subtests: Picture Completion, Arithmetic, Similarities, and Vocabulary. The paper treats these WAIS-III subtests as standardized indicators of problem solving and verbal comprehension.
Then came the intervention. Over four weeks, participants completed eight programmed cognitive activities, two sessions per week. The activities included crossword puzzles, problem-solving tasks, a word-guessing game, lateral thinking, trivia, Tower of Hanoi, reading comprehension, and brainstorming. The AI-assisted group had unrestricted access to ChatGPT during these activities. The control group completed the same activities without AI assistance.
This is important: the AI condition was deliberately ecological rather than tightly scripted. Participants could use ChatGPT as they wished. No prompts were specified, monitored, or recorded. That makes the setup closer to real-world adoption, where employees are rarely polite enough to use tools according to a researcher’s taxonomy. It also limits mechanistic interpretation, because we cannot know whether participants used ChatGPT for direct answers, hints, strategy, validation, rewriting, or some mixture of all the above.
After the intervention, participants repeated the WAIS-III subtests. The critical test was not whether everyone improved from pre to post. Practice effects and familiarity can do that. The critical test was the Group × Time interaction: did the AI group improve more than the control group on standardized cognitive measures?
They did not.
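For readers who want to see what that interaction test is actually asking, here is a minimal Python sketch. In a two-group, two-timepoint design, the Group × Time interaction is equivalent to comparing the groups' gain scores (post minus pre) with an independent-samples t-test; the interaction F is that t squared. The scores below are invented for illustration and are not the study's data, and the function name `gain_score_t` is an assumption of this sketch rather than anything taken from the paper.

```python
from math import sqrt
from statistics import mean, stdev

def gain_score_t(pre_a, post_a, pre_b, post_b):
    """Pooled-variance t statistic comparing mean gains (post - pre) of two groups.

    For a 2 x 2 mixed design, t squared equals the Group x Time interaction F.
    Returns (t, degrees_of_freedom).
    """
    gains_a = [post - pre for pre, post in zip(pre_a, post_a)]
    gains_b = [post - pre for pre, post in zip(pre_b, post_b)]
    n_a, n_b = len(gains_a), len(gains_b)
    pooled_var = (
        (n_a - 1) * stdev(gains_a) ** 2 + (n_b - 1) * stdev(gains_b) ** 2
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(gains_a) - mean(gains_b)) / standard_error
    return t, n_a + n_b - 2

# Hypothetical subtest scores (NOT the study's data): both groups improve.
ai_pre, ai_post = [10, 11, 9, 12, 10], [12, 12, 10, 14, 11]
ctrl_pre, ctrl_post = [10, 12, 9, 11, 10], [11, 14, 10, 13, 11]

t_equal, df = gain_score_t(ai_pre, ai_post, ctrl_pre, ctrl_post)
print(f"t({df}) = {t_equal:.2f}, interaction F = {t_equal ** 2:.2f}")
```

Both made-up groups gain 1.4 points on average, so the statistic is zero: a main effect of Time with no interaction, which is the shape of result the paper reports.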
The main evidence: task performance rose, ability measures did not
The paper’s evidence has two layers. The first layer is task performance during the intervention. The second is standardized cognitive change from pre to post. Conflating the two is the misconception the paper quietly dismantles.
On applied tasks, ChatGPT helped. AI-assisted participants achieved higher accuracy on the crossword task, with a large reported effect size: $t(29) = -5.67$, $p < .001$, $d = 1.25$. They also performed better on the problem-solving task: $t(29) = -4.32$, $p < .001$, $d = 0.98$. In the word-guessing game, AI assistance improved both accuracy, $t(29) = -2.44$, $p = .021$, $d = 0.58$, and response time, $t(29) = 3.48$, $p = .002$, $d = 0.80$. Trivia showed the largest reported effect: $t(29) = -6.57$, $p < .001$, $d = 1.45$.
Tower of Hanoi showed only a trend toward faster completion in the AI-assisted group, just short of conventional significance: $t(29) = 2.03$, $p = .051$, $d = 0.46$. The paper reports no significant group differences for the lateral thinking, Tower of Hanoi, and brainstorming activities, nor for completion times on several tasks, including crossword, problem-solving, lateral thinking, trivia, and brainstorming.
So the performance picture is not “AI improves everything.” It is more specific: AI improved several structured, applied tasks, especially where external language, retrieval, semantic support, or problem structuring could plausibly help. It did not uniformly improve every activity, and some speed outcomes were not significant.
The standardized cognitive results are the more important part. Across the WAIS-III subtests, the researchers found no significant Group × Time interactions:
| WAIS-III subtest | Group × Time result | Interpretation |
|---|---|---|
| Picture Completion | $F(1, 29) = 0.30$, $p = .587$, $\eta_p^2 = .010$ | No evidence that AI users improved more than controls |
| Arithmetic | $F(1, 29) = 0.50$, $p = .483$, $\eta_p^2 = .017$ | No differential gain from AI exposure |
| Vocabulary | $F(1, 29) = 0.11$, $p = .743$, $\eta_p^2 = .004$ | No differential gain from AI exposure |
| Similarities | $F(1, 29) = 0.70$, $p = .408$, $\eta_p^2 = .024$ | No differential gain from AI exposure |
There were main effects of Time for Picture Completion and Arithmetic, reflecting improvement across both groups, and a main effect of Time for Vocabulary, reflecting a decline across both groups. But that is not evidence that ChatGPT improved cognition. It is evidence that something changed over time across participants, regardless of AI condition. The AI-specific claim depends on differential improvement, and the paper did not find it.
That is the useful result: the combined human-AI system got better at certain tasks, while the measured human cognitive abilities did not show corresponding short-term enhancement.
The figures are visual confirmation, not a second thesis
The figures in the paper mainly present the evidence rather than add separate analyses. Figures 1 and 2 show individual pre-post trajectories for Picture Completion and Arithmetic in the control and AI-assisted groups. Their likely purpose is to visualize the main evidence: both groups changed over time, but without a clear AI-specific pattern.
Figures 3 through 8 show group comparisons on intervention tasks. Their likely purpose is the same: to make the task-performance advantage visible where it exists. They are not robustness tests, ablations, or variant experiments. There is no hidden secondary methodology in the charts. They support the paper’s core dissociation: task efficiency improved in several places; standardized ability did not.
That matters because readers often overread visuals. A taller bar on a trivia chart is not a theory of mind. It is a taller bar on a trivia chart. Useful, yes. Transformative, no. AI-assisted participants answered more accurately under that task design. The chart does not show durable learning, transfer, or independence from the tool.
This is where the paper is more disciplined than many AI adoption conversations. It refuses to let assisted output masquerade as internal capability.
Why ChatGPT helped where it helped
The most plausible mechanism is cognitive offloading. ChatGPT can absorb operations that would otherwise consume working memory, retrieval effort, or linguistic search. It can suggest candidate answers, organise possibilities, rephrase questions, provide semantic associations, or reduce the time spent staring blankly at a clue while pretending that thinking is happening.
For a crossword, word-guessing game, trivia item, or structured problem-solving task, that is directly useful. These tasks reward access to language, retrieval, pattern generation, and quick narrowing of possibilities. ChatGPT is rather good at producing candidates in exactly those zones. The participant still has to judge and apply output, but the search space is no longer entirely internal.
This does not mean the participant’s mind has become more capable. It means the task environment has changed. The relevant unit is no longer “human cognition unaided”; it is “human cognition plus external language model plus interface plus prompt behaviour.” In many workplaces, that is the real production unit now. Fine. But managers should name it correctly.
A useful operational distinction looks like this:
| What improves | What it means | What it does not prove |
|---|---|---|
| Assisted task accuracy | The human-AI system can produce better answers in that context | The human has acquired the underlying skill |
| Assisted task speed | The workflow has lower friction | The worker can perform equally well without AI |
| Better structured output | The tool supplies organisation and language support | The employee has improved reasoning architecture |
| Repeated AI use | The employee gains tool fluency | The employee has gained transferable cognitive ability |
Tool fluency is valuable. It may even be the correct training target in many roles. But it is not the same as reasoning ability. The difference becomes expensive when organisations design assessments, promotions, training budgets, or compliance processes around the wrong construct.
The business lesson: measure the system, then test the person without the scaffold
For companies adopting generative AI, the paper suggests a simple but frequently ignored principle: decide whether you are buying productivity, training, or both.
If the goal is productivity, AI-assisted performance is the right metric. Measure cycle time, error rates, rework, customer satisfaction, output quality, and throughput. If the human-AI system performs better, the business case may be real. There is no need to pretend everyone has become cognitively upgraded. Efficiency is not a small thing; entire operating models are built on it.
If the goal is upskilling, assisted performance is not enough. You need transfer tests. Can employees solve related problems without AI? Can they explain their reasoning? Can they detect faulty AI output? Can they perform under tool outage, audit conditions, or high-stakes exception handling? Can they generalise from AI-supported examples to novel cases?
This distinction changes how AI training should be evaluated.
| Business question | Wrong metric | Better metric |
|---|---|---|
| Did AI improve productivity? | Employee self-reports of usefulness | Assisted workflow throughput, accuracy, rework, and completion time |
| Did AI improve employee capability? | Quality of AI-assisted output | Unaided post-training performance and transfer to new tasks |
| Did employees learn to use AI well? | Number of prompts submitted | Prompt quality, verification behaviour, error detection, and escalation judgement |
| Did AI reduce cognitive burden safely? | Faster completion alone | Faster completion plus maintained comprehension and accountability |
| Did AI create hidden dependency? | Adoption rate | Performance drop when the tool is removed or restricted |
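The last row of that table can be made concrete with very little machinery. The sketch below is one possible way to compute a "dependency gap": the relative drop in score when the tool is removed. Everything in it is hypothetical, including the `Assessment` record, the scores, the employee labels, and the 25% review threshold; it illustrates the metric, it is not a validated instrument and nothing like it appears in the paper.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.25  # hypothetical cut-off; choose one that fits the role's risk

@dataclass
class Assessment:
    employee: str
    assisted_score: float  # score on a task with the AI tool available
    unaided_score: float   # score on a comparable task with the tool removed

def dependency_gap(a: Assessment) -> float:
    """Relative drop in score when the tool is removed (0.0 means no drop)."""
    if a.assisted_score <= 0:
        raise ValueError("assisted_score must be positive")
    return (a.assisted_score - a.unaided_score) / a.assisted_score

# Invented example records, for illustration only.
records = [
    Assessment("analyst_a", assisted_score=90, unaided_score=85),
    Assessment("analyst_b", assisted_score=92, unaided_score=55),
]

gaps = {r.employee: dependency_gap(r) for r in records}
flagged = [name for name, gap in gaps.items() if gap > REVIEW_THRESHOLD]

for name, gap in gaps.items():
    print(f"{name}: {gap:.0%} drop without the tool")
print("Needs a closer look:", flagged)
```

Both invented analysts look nearly identical on assisted output. Only the unaided score separates them, which is the whole point of measuring it.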
The uncomfortable practical implication is that many AI programmes currently marketed as “upskilling” may actually be “scaffold deployment.” Again, not bad. Just different. A forklift does not upskill the human back muscles. It changes the work system. Nobody objects because nobody is silly enough to call warehouse mechanisation a lumbar development strategy. With AI, apparently, everyone suddenly became poetic.
The paper’s strongest contribution is conceptual discipline
This is a small study, but it usefully enforces three separations that business AI debates often blur.
First, it separates performance from ability. Performance is what happens in a task environment. Ability is what a person can do across contexts, especially when support is absent or changed. AI can improve the first without altering the second.
Second, it separates tool benefit from learning. A tool can be beneficial even if it does not train the user. This should be obvious, yet AI discourse keeps smuggling pedagogy into productivity software. A person may become excellent at using ChatGPT to complete a report while becoming no better at writing the report independently. Depending on the job, that may be acceptable. Depending on the risk environment, it may be reckless.
Third, it separates ecological realism from mechanistic visibility. The study’s unrestricted ChatGPT condition resembles real-world use, but because prompts were not logged, it cannot tell us which styles of use produced the gains. Shallow offloading and reflective offloading are very different behaviours. One asks the model for answers. The other asks for structure, critique, alternative hypotheses, or checks against one’s own reasoning. Both may improve task output. Only one plausibly supports learning.
This distinction should become part of AI governance. Not all usage is equal. “Used AI” is not a meaningful behavioural category, just as “used the internet” tells us nothing about whether someone read a primary source, copied a paragraph, or watched a raccoon steal cat food.
Where the study should not be overclaimed
The study is best treated as pilot evidence, not a universal law of human cognition in the AI age.
The sample is small: 30 participants, recruited through online and snowball methods, with 15 per group. That limits statistical power and generalisability. The absence of significant cognitive change is not proof that AI can never support cognitive development. It means this study did not detect differential gains under this design.
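To put a rough number on "limits statistical power," here is a back-of-envelope sketch. It uses a normal approximation to a two-sided, two-group comparison and treats the effect size `d` as the standardized difference between the groups' gains. Both choices are simplifying assumptions of this sketch, not the paper's own power analysis, and an exact t-based calculation would come out slightly lower.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided, two-group test (normal approximation)."""
    z_crit = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    expected_z = d * sqrt(n_per_group / 2)  # expected test statistic under effect d
    return STANDARD_NORMAL.cdf(expected_z - z_crit)  # ignores the negligible far tail

def min_detectable_d(n_per_group: int, power: float = 0.80, alpha: float = 0.05) -> float:
    """Smallest standardized effect detectable at the requested power."""
    z_crit = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STANDARD_NORMAL.inv_cdf(power)
    return (z_crit + z_power) / sqrt(n_per_group / 2)

power_medium = approx_power(d=0.5, n_per_group=15)
smallest_d = min_detectable_d(n_per_group=15)
print(f"Power for d = 0.5 with 15 per group: {power_medium:.2f}")
print(f"Smallest d detectable at 80% power: {smallest_d:.2f}")
```

Under this approximation, 15 participants per group gives a bit under 30% power for a medium effect (d = 0.5), and the smallest effect detectable with 80% power is around d = 1.0. A design of this size can catch large effects, like those reported for some of the applied tasks, but a modest AI-specific gain in measured ability could plausibly slip through.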
The intervention was short: four weeks of AI-assisted task activity within a seven-week protocol. Durable changes in reasoning, verbal comprehension, or executive function may require longer, more adaptive, feedback-rich training. A month of using ChatGPT during structured tasks is not the same as a year of deliberately coached AI-supported problem solving.
The cognitive measures were limited to selected WAIS-III subtests. Those are standardized instruments, which is a strength, but they do not exhaust the universe of cognition. The study cannot rule out changes in metacognition, confidence calibration, domain-specific knowledge, prompt strategy, verification habits, or collaborative reasoning. It measured specific cognitive outcomes, not every possible human adaptation to AI.
The unrestricted AI condition is realistic but opaque. Since prompts were not recorded, the study cannot distinguish whether gains came from answer retrieval, language scaffolding, strategic prompting, metacognitive reflection, or simple shortcutting. For business use, that is a major boundary. Two employees may both “use ChatGPT” and engage in completely different cognitive behaviours.
Finally, time-on-task may have differed between groups. If AI users completed tasks faster, they may also have spent less active thinking time. That creates a tricky interpretation problem: AI might improve output while reducing practice. For productivity, splendid. For learning, possibly less splendid.
These limitations do not weaken the article’s business lesson. They sharpen it. The right conclusion is not “AI does not improve cognition.” The right conclusion is: do not infer cognitive improvement from assisted task improvement unless you have measured transfer, retention, and unaided performance.
A better AI adoption model: productivity first, capability by design
The study points toward a more mature operating model for AI adoption.
Start by treating AI as workflow infrastructure. It helps people search, draft, compare, summarise, generate options, and compress routine effort. Evaluate it like infrastructure: does it reduce bottlenecks, improve quality, and lower operational cost without increasing hidden risk?
Then, separately, decide whether the organisation wants AI to support learning. If it does, the system has to be designed differently. It should not merely provide answers. It should require prediction before assistance, ask users to explain reasoning, provide feedback on errors, show alternative solution paths, and occasionally remove the scaffold to test retention. Annoying, yes. Also known as learning.
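Two of those requirements, prediction before assistance and occasional removal of the scaffold, are simple enough to sketch as a gating rule. The code below is a hypothetical illustration only: the `Attempt` record, the `may_show_assistance` function, and the "every fourth item is an unaided probe" schedule are all assumptions made for the example, not features of any real product or of the study.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attempt:
    item_index: int            # zero-based position of the item in the practice sequence
    prediction: Optional[str]  # learner's own answer, entered before any AI help

def may_show_assistance(attempt: Attempt, probe_every: int = 4) -> bool:
    """Decide whether a (hypothetical) training tool may reveal AI help for this item."""
    is_probe = (attempt.item_index + 1) % probe_every == 0
    if is_probe:
        return False  # scaffold removed: this item is an unaided retention check
    if not attempt.prediction or not attempt.prediction.strip():
        return False  # no help until the learner has committed to a prediction
    return True

# Eight practice items, each with a prediction already entered.
schedule = [may_show_assistance(Attempt(i, "my first guess")) for i in range(8)]
print(schedule)
```

With these settings, help is available on three items out of every four and only after the learner has committed to an answer. The fourth item is done cold, and that result is the one worth recording.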
The difference can be expressed bluntly:
| Adoption mode | Tool behaviour | Human behaviour | Likely outcome |
|---|---|---|---|
| Productivity scaffold | Generate, retrieve, complete, summarise | Select, edit, approve | Faster output |
| Reflective assistant | Question, critique, compare, explain | Reason, revise, justify | Better judgement, possibly |
| Training system | Sequence difficulty, test recall, remove support | Practise, fail, transfer | Actual skill development |
| Dependency engine | Answer everything immediately | Accept, copy, forget | Smooth performance, fragile capability |
Most organisations are building the first and accidentally drifting into the fourth while describing both as the second. There lies the little comedy.
The business opportunity is not to reject scaffolding. It is to label it properly and build complementary capability checks. A company can enjoy AI-driven efficiency while still protecting human expertise. But that requires assessment design, not vibes.
For example, an AI-enabled legal team might use models to accelerate clause comparison, but still test lawyers on unaided issue spotting. A finance team might use AI for variance commentary, but still require analysts to explain causal drivers without model-generated prose. A customer support centre might use AI-generated replies, but still audit whether agents can recognise when escalation is needed. In each case, the scaffold is useful; the human capability remains separately measured.
The quiet warning: efficiency can hide skill decay
The paper does not prove AI causes cognitive decline. It does, however, clarify a pathway by which organisations could miss decline if they look only at output metrics.
When tools improve performance, managers see success. Cycle times fall. Quality scores rise. People feel empowered. The dashboard glows with the soft light of operational convenience. Meanwhile, the organisation may stop exercising certain internal capabilities because the tool now supplies them. That may be rational for routine work. It may be dangerous for judgement-heavy work.
The risk is not that employees become stupid. That is too crude. The risk is that the organisation gradually loses visibility into what people can do without the scaffold. Capability becomes embedded in the toolchain, not the workforce. When the tool fails, hallucinates, changes, becomes unavailable, or encounters an edge case, the human is expected to reappear as an expert. This is charmingly optimistic.
The solution is not nostalgia for unaided labour. It is better instrumentation. Measure the combined system for productivity. Measure the human separately for competence. Measure the interaction for dependency. And design AI use patterns that preserve effort where effort is the point.
Conclusion: AI makes the work smarter before it makes the worker wiser
The paper’s best contribution is not that ChatGPT helped people complete tasks. Anyone who has watched a competent user draft, search, summarise, and reformat with AI already knows that. Its contribution is the sharper distinction: assisted performance can improve without measurable short-term change in standardized cognitive ability.
That distinction should reshape how businesses talk about AI transformation. AI can be a productivity amplifier, a cognitive scaffold, a training partner, or a dependency machine. Which one it becomes depends less on the model’s marketing page than on the workflow, incentives, assessments, and habits wrapped around it.
So yes, use AI to accelerate work. Use it to reduce pointless friction. Use it to give employees better starting points and fewer blank pages. But do not confuse smoother execution with deeper understanding. The spreadsheet may be cleaner, the memo may be sharper, and the answer may arrive faster. The mind behind it may not have moved very far.
Smarter work is valuable. Wiser workers require something more irritating: practice, reflection, feedback, and the occasional removal of the crutch.
Cognaptus: Automate the Present, Incubate the Future.
[1] María Angélica Benítez, Rocío Candela Ceballos, Karina Del Valle, Mundo Araujo Sofía, Sofía Evangelina Victorio Villaroel, and Nadia Justel, “Efficiency Without Cognitive Change: Evidence from Human Interaction with Narrow AI Systems,” arXiv:2510.24893, 2025. https://arxiv.org/pdf/2510.24893