TL;DR for operators
ChatGPT did not fail the writing task in this study. The humans did something more interesting: when allowed to use it, they reported doing less of the mentally expensive work.
The paper randomly assigned 40 participants to write a short argumentative essay either with ChatGPT 3.5 or without assistance. After the task, participants completed a four-item cognitive engagement scale covering deep understanding, effortful thinking, sustained attention, and exploration of alternative approaches. The ChatGPT group scored lower: 2.95 versus 4.19 on a five-point scale, with a statistically significant group effect.
That is not proof that AI makes students stupid, lazy, doomed, or whatever phrase gets the most engagement on LinkedIn this week. The study is small, self-reported, immediate, and limited to one writing task. It does not test essay quality, long-term learning, productivity, or better-designed AI scaffolding.
The useful operational lesson is sharper: AI changes the distribution of effort inside a task. If the tool supplies ideas, phrasing, and argument structure too early, the user may outsource the very cognitive work the task was supposed to train. For schools, corporate training, consulting, legal analysis, research, and management work, the question is no longer only “Did AI improve the output?” It is also “Which part of the thinking did the human still have to do?”
The experiment is small, but the contrast is clean
Writing is one of the few business-adjacent tasks where effort still leaves fingerprints. A weak argument usually reveals shallow reading. A strong one usually contains evidence of struggle: framing, comparison, revision, uncertainty, and the occasional sign that someone has been forced to think for longer than eleven seconds. Cruel, but educational.
Georgios P. Georgiou’s paper, ChatGPT produces more “lazy” thinkers: Evidence of cognitive engagement decline, tests a simple version of that situation.[1] Forty participants were randomly split into two groups. One group wrote a structured argumentative text without assistance. The other completed the same task with access to ChatGPT 3.5.
The writing prompt was not obscure: participants had to argue for or against the statement that educational institutions should integrate AI tools into standard academic practice. They were asked to produce at least 300 words within a maximum of 30 minutes. The choice matters. Argumentative writing is not merely a typing exercise; it requires planning, position-taking, synthesis, and some capacity to hold competing ideas in mind without immediately outsourcing the discomfort.
The participants were not children discovering autocomplete for the first time. They were aged 25 to 47, with a mean age of 35.12. They were current or former students in linguistics, applied linguistics, or related language fields. All were native Greek speakers with at least a bachelor’s degree. They also had basic familiarity with AI chatbots, reporting average prior use of 16.4 hours per week. The groups did not significantly differ in prior AI use.
So the core comparison is not between AI sophisticates and AI novices. It is between similar participants completing the same argumentative task under two conditions: one with ChatGPT available for ideas, phrasing, or argument development, and one without external help.
That is the main evidence. There are no elaborate ablation studies, no multi-model comparison, no longitudinal follow-up, and no hidden benchmark maze. The paper’s strength is its directness. Its weakness is also its directness. A single contrast can be clear without being universal.
The measured drop is large enough to deserve attention
After writing, participants completed the CES-AI, a four-item self-report scale developed for the study. Each item was rated from 1 to 5, with higher scores indicating greater cognitive engagement.
The four items covered the following facets:
| CES-AI facet | What the item tries to capture |
|---|---|
| Deep processing | Whether the participant tried to understand the task deeply rather than skim it |
| Effortful thinking | Whether the participant thought through the problem themselves |
| Sustained attention | Whether the participant stayed mentally focused throughout the task |
| Strategic engagement | Whether the participant explored different ways to approach the task |
The ChatGPT group scored a mean of 2.95 with a standard deviation of 1.18. The control group scored 4.19 with a standard deviation of 0.45. The difference is 1.24 points on a five-point scale.
The paper then uses a one-way ANOVA, with average CES-AI score as the dependent variable and group as the independent variable. The result is statistically significant: $F(1,38)=19.2$, $p<0.001$.
That is the central finding. The participants who wrote without ChatGPT reported substantially higher cognitive engagement than those who wrote with it.
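As a sanity check, the reported F value can be roughly reproduced from the summary statistics alone. The sketch below assumes an even split of 20 participants per group and treats the reported standard deviations as sample standard deviations; the function name `anova_f_from_summary` is ours, not the paper's.

```python
def anova_f_from_summary(groups):
    """One-way ANOVA F statistic from (mean, sd, n) tuples, one per group.

    Assumes each sd is a sample standard deviation (n - 1 denominator).
    """
    total_n = sum(n for _, _, n in groups)
    grand_mean = sum(mean * n for mean, _, n in groups) / total_n

    # Between-group sum of squares: how far each group mean sits from the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for mean, _, n in groups)
    # Within-group sum of squares, recovered from each group's variance.
    ss_within = sum((n - 1) * sd ** 2 for _, sd, n in groups)

    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Reported means and SDs; n = 20 per group assumes an even split of the 40 participants.
chatgpt = (2.95, 1.18, 20)
control = (4.19, 0.45, 20)

f_stat, df1, df2 = anova_f_from_summary([chatgpt, control])
print(f"F({df1},{df2}) = {f_stat:.1f}")
```

Run as written, this lands at about 19.3, close to the reported 19.2. A gap of that size is what you would expect when working from rounded means and standard deviations.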
Two details are worth slowing down for.
First, this is not a tiny movement buried under statistical theatre. A 1.24-point difference on a five-point engagement scale is interpretable without needing ceremonial p-value incense. The control group sits above 4, which roughly corresponds to agreement that they were deeply engaged. The ChatGPT group sits below 3, closer to a neutral-to-low engagement region. The gap is practically meaningful, not merely publishable.
Second, the ChatGPT group has a much larger standard deviation: 1.18 versus 0.45. That suggests the AI-assisted condition may have produced more varied experiences. Some participants may still have engaged seriously while using the tool. Others may have let the machine absorb more of the work. The paper does not break this down further, so we should not pretend it does. But for operators, the variance is a quiet warning: “AI use” is not one behaviour. It is a container for many behaviours, ranging from productive sparring to elegant abdication.
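Both points can be put into rough numbers. The sketch below computes a standardised effect size (Cohen's d with a pooled standard deviation) and the ratio of the two group variances. Neither figure is quoted above; both are back-of-envelope values derived from the reported means and standard deviations, again assuming 20 participants per group.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d using the pooled sample standard deviation."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# Control group first, so a positive d means higher engagement without ChatGPT.
d = cohens_d(4.19, 0.45, 20, 2.95, 1.18, 20)

# How much more spread the ChatGPT group shows, expressed as a ratio of variances.
variance_ratio = (1.18 ** 2) / (0.45 ** 2)

print(f"Cohen's d ~ {d:.2f}")
print(f"Variance ratio ~ {variance_ratio:.1f}")
```

This gives a d of roughly 1.4, well above the 0.8 conventionally labelled large, and a variance ratio of roughly 7. Because the two spreads are so unequal, the pooled standard deviation is a crude choice, so treat d as an order-of-magnitude indicator rather than a precise estimate.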
CES-AI measures engagement, not intelligence
The paper’s title is deliberately sharp. “Lazy thinkers” is sticky. It will travel. It may also overperform relative to the evidence, because of course it will. Internet discourse has never met a nuance it could not flatten into a snack.
The study does not measure intelligence. It does not show that ChatGPT users became less capable. It does not show that their essays were worse. It does not show that their long-term learning declined. It measures immediate self-reported cognitive engagement after one task.
That distinction is not pedantic. It is the whole point.
The CES-AI scale asks participants to report their experience of thinking: whether they worked through the task, stayed focused, tried to understand it deeply, and explored alternative approaches. The scale has a reported Cronbach’s alpha of 0.88, which indicates strong internal consistency among the four items. In plain language: the items appear to hang together as a measure of a common construct.
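For readers who have not met Cronbach's alpha, the sketch below implements the standard formula, $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_i \sigma_i^2}{\sigma_T^2}\right)$, where $k$ is the number of items, $\sigma_i^2$ is the variance of item $i$, and $\sigma_T^2$ is the variance of respondents' total scores. The response matrix is invented purely for illustration and is not the study's data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical 1-5 ratings on four items from six respondents (not the study's data).
hypothetical_responses = [
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
    [1, 2, 1, 2],
    [4, 4, 5, 4],
]

alpha = cronbach_alpha(hypothetical_responses)
print(f"alpha = {alpha:.2f}")
```

Items that rise and fall together across respondents, as in this made-up matrix, push alpha towards 1. Items that vary independently of one another pull it towards 0.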
But internal consistency is not magic. It does not turn a new four-item self-report scale into a full behavioural, neurological, and longitudinal account of cognition. Self-reports can be affected by inaccurate introspection, social desirability, task interpretation, or the participant’s own theory of what using AI means. The paper acknowledges this and calls for future work using neurophysiological recordings, behavioural indicators, think-aloud protocols, interviews, and larger samples.
So the right reading is not: “ChatGPT makes students lazy.”
The right reading is: “In this controlled writing task, participants with ChatGPT reported doing less of the deep, effortful, focused, strategic work that the unaided participants reported doing.”
That is less theatrical. It is also more useful.
The mechanism is cognitive offloading, not moral decay
The paper interprets the result through cognitive offloading: the tendency to shift mental work onto external tools. This is not new. Humans have always offloaded cognition. We use notebooks, calendars, calculators, spreadsheets, search engines, checklists, dashboards, and, in severe cases, consultants.
The difference with generative AI is the type of cognition being offloaded.
A calculator offloads arithmetic. A calendar offloads memory. A search engine offloads retrieval. ChatGPT can offload framing, wording, synthesis, argument generation, counterargument discovery, and even the first draft of judgement. That is a larger slice of the cognitive stack. Convenient, yes. Also slightly suspicious.
In the study, participants in the ChatGPT condition were explicitly allowed to consult the tool for ideas, phrasing, or argument development, while being encouraged not to rely solely on its suggestions. This is already more restrained than many real-world uses. No one told them to paste the prompt, accept the answer, and call it a personality. Yet the engagement score still dropped.
The plausible mechanism is not that ChatGPT hypnotised them. It is that the task’s hardest early moves became easier to bypass.
Argumentative writing usually forces a sequence:
- Understand the claim.
- Choose a position.
- Generate reasons.
- Organise evidence.
- Anticipate objections.
- Translate thought into language.
- Revise the structure.
When AI is introduced too early, it can supply candidate positions, reasons, phrasing, and structure before the learner has wrestled with the problem. The user may still edit. The user may still approve. But approval is not the same as construction.
That is the uncomfortable part for education and for business. Many organisations are rolling out AI tools precisely because they reduce friction. But some friction is not waste. Some friction is where the learning, judgement, and domain intuition are built. Remove it casually and you may get faster output with thinner human understanding. Efficient, in the same way that replacing the gym with a chairlift is efficient.
What the paper directly shows, and what it does not
The evidence is strong enough to inform design decisions. It is not strong enough to support civilisational panic. A useful reading separates the paper’s direct claims from the operational inferences.
| Layer | What is supported | What is not supported |
|---|---|---|
| Direct experimental result | ChatGPT-assisted participants reported lower cognitive engagement than unaided participants on the CES-AI after one argumentative writing task | ChatGPT universally reduces cognition across all users, tasks, and designs |
| Measurement contribution | CES-AI offers a task-specific four-item scale for self-reported engagement in AI-assisted writing | CES-AI is a fully validated general-purpose measure of learning quality |
| Mechanism | The results are consistent with cognitive offloading | The study directly observes the mental process of offloading in real time |
| Educational implication | AI integration should preserve active reflection, critique, and learner autonomy | AI should be banned from writing or learning tasks |
| Business implication | Organisations should measure process quality, not only output speed | Every AI-assisted workflow damages expertise |
This is where many AI debates become tedious. One side treats any negative result as proof that tools are corrupting the youth, the workforce, and probably sentence structure. The other side treats every limitation as permission to ignore the result entirely. Both moves are lazy. Ironically, that gives the paper’s title additional range.
The disciplined interpretation is narrower: the study gives experimental evidence that, under one common mode of AI assistance, users may feel less cognitively engaged while doing a demanding language task. That is exactly the kind of result that should affect how AI tools are introduced into learning and knowledge work.
The business lesson is workflow design, not anti-AI theatre
For business operators, this paper is less about students and more about the architecture of work.
Most enterprise AI adoption is still measured by output indicators: time saved, documents produced, tickets resolved, drafts generated, calls summarised, code completed. These metrics are easy to count, so naturally they become the dashboard. The fact that they may miss the decay of human judgement is inconvenient, and therefore usually postponed until after the vendor renewal.
But if AI reduces engagement in some tasks, organisations need a second measurement layer: cognitive process quality.
That means asking whether the human still had to:
- define the problem;
- generate an initial position before seeing AI output;
- compare alternatives;
- explain why one answer is better than another;
- challenge the AI’s assumptions;
- connect the output to domain-specific constraints;
- retain enough understanding to act without the tool.
This matters in any setting where the deliverable is not the only asset. In training, the asset is skill formation. In consulting, it is judgement. In legal work, it is reasoning under constraint. In research, it is interpretive discipline. In management, it is the ability to notice when the polished memo is wrong.
If AI produces the document while the human merely approves it, the organisation may gain throughput while losing apprenticeship. That trade-off might be acceptable for low-stakes formatting, summarisation, or first-pass clerical work. It is much more dangerous when the task exists partly to train the operator’s judgement.
Preserve the struggle, automate the sludge
The obvious response is to say, “Use AI as a tool, not a crutch.” This is true, but also useless. It belongs on a poster next to “Communicate better” and “Be proactive,” where good advice goes to die.
A more operational rule is: preserve the cognitive struggle that builds expertise, and automate the sludge that merely consumes time.
| Workflow stage | Bad AI insertion | Better AI insertion |
|---|---|---|
| Before thinking | Ask AI to generate the argument immediately | Ask the human to write a rough position first, then use AI to challenge it |
| During drafting | Let AI produce the structure and phrasing together | Use AI to surface missing counterarguments, weak assumptions, or unclear transitions |
| During review | Ask AI whether the draft is “good” | Ask AI to identify unsupported claims, alternative interpretations, and domain risks |
| After completion | Submit the polished output and move on | Require a short reflection: what changed, what was rejected, and why |
| In training | Grade only the final document | Assess process artefacts: outline, rejected options, critique notes, revision trail |
The difference is subtle but consequential. AI can either replace the learner’s first cognitive move or force a better second move. It can be an answer engine or a resistance machine. Most current deployments quietly choose the former because it feels efficient and demos beautifully. The latter is less glamorous, but it keeps the human awake.
For corporate use, this suggests a practical design pattern: require human pre-commitment before AI assistance. Before an analyst asks the model for a market view, they should write their initial hypothesis. Before a manager asks for a strategy memo, they should specify decision criteria. Before a trainee uses AI to draft an answer, they should outline the reasoning path.
Then the AI can be used to attack, expand, compare, or refine. That turns the tool from a substitute into a scaffold. Same technology. Different cognitive contract.
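A minimal sketch of that pre-commitment pattern follows. Everything in it is hypothetical: the class name `PreCommitGate`, the `MIN_WORDS` threshold, and the `assistant` callable, which stands in for whatever model client an organisation actually uses. The point is only the ordering: the human's position is recorded first, and the model is then asked to challenge it rather than to produce the answer.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MIN_WORDS = 20  # hypothetical floor for a "real" initial position


class PreCommitmentMissing(Exception):
    """Raised when AI help is requested before the human has committed a position."""


@dataclass
class PreCommitGate:
    task: str
    human_position: Optional[str] = None

    def commit(self, position: str) -> None:
        """Record the human's own first-pass position before any AI output is seen."""
        if len(position.split()) < MIN_WORDS:
            raise ValueError(f"Initial position must be at least {MIN_WORDS} words.")
        self.human_position = position

    def challenge(self, assistant: Callable[[str], str]) -> str:
        """Ask the assistant to critique the committed position, not to write it."""
        if self.human_position is None:
            raise PreCommitmentMissing("Write your own position before asking the AI.")
        prompt = (
            f"Task: {self.task}\n"
            f"My initial position: {self.human_position}\n"
            "List the weakest assumptions and the strongest counterarguments. "
            "Do not rewrite the position."
        )
        return assistant(prompt)


def stub_assistant(prompt: str) -> str:
    """Placeholder for a real model call; it only reports the prompt length."""
    return f"[critique of a {len(prompt.split())}-word prompt would go here]"


gate = PreCommitGate(task="Assess entry into the Nordic market")
try:
    gate.challenge(stub_assistant)  # no position committed yet, so this is blocked
except PreCommitmentMissing as err:
    print(f"Blocked: {err}")

gate.commit(
    "I expect entry to be viable only through a local partner, because our "
    "direct sales costs are high and the regulatory approval cycle is slow."
)
print(gate.challenge(stub_assistant))
```

A word-count floor is a crude proxy for effort, and a determined user can pad it. The design choice that matters is that the tool's first output is a critique of something the human already wrote.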
AI governance should include engagement metrics
The paper’s measurement approach is simple, which makes it useful for operators. The exact CES-AI scale should not be copy-pasted into every enterprise dashboard as if four Likert items can save civilisation. Still, the categories are a good starting point.
An organisation deploying AI into knowledge work can periodically ask workers or trainees:
- Did you understand the task more deeply after using AI, or did you move faster without understanding more?
- Did you think through the problem yourself before consulting the system?
- Did the tool help you stay focused, or did it encourage shallow acceptance?
- Did you explore more approaches, or simply accept the first plausible answer?
Those questions are not a replacement for performance measurement. They are a complement to it. Output quality, cycle time, error rate, customer satisfaction, and compliance outcomes still matter. But when organisations only measure output, they may miss the slow hollowing-out of capability.
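One way to operationalise this is a lightweight pulse survey scored like the CES-AI: four items on a 1 to 5 scale, averaged per respondent. The sketch below is illustrative only. The item keys, the sample responses, and the `REVIEW_THRESHOLD` cut-off are assumptions, not values from the paper.

```python
from statistics import mean

ITEMS = ("deep_understanding", "own_thinking", "sustained_focus", "explored_alternatives")
REVIEW_THRESHOLD = 3.0  # hypothetical cut-off, not from the paper


def engagement_score(response):
    """Average the four 1-5 ratings for one respondent."""
    ratings = [response[item] for item in ITEMS]
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("Ratings must be between 1 and 5.")
    return mean(ratings)


def flag_low_engagement(responses):
    """Return the workflows whose average engagement falls below the threshold."""
    by_workflow = {}
    for response in responses:
        by_workflow.setdefault(response["workflow"], []).append(engagement_score(response))
    return sorted(w for w, scores in by_workflow.items() if mean(scores) < REVIEW_THRESHOLD)


# Invented responses for illustration.
sample = [
    {"workflow": "client memo drafting", "deep_understanding": 2, "own_thinking": 2,
     "sustained_focus": 3, "explored_alternatives": 2},
    {"workflow": "client memo drafting", "deep_understanding": 3, "own_thinking": 2,
     "sustained_focus": 3, "explored_alternatives": 3},
    {"workflow": "meeting summaries", "deep_understanding": 4, "own_thinking": 4,
     "sustained_focus": 4, "explored_alternatives": 3},
]
print(flag_low_engagement(sample))
```

A flag here is a prompt for a conversation about how the tool is being used in that workflow, not a performance score for the individuals who answered.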
This is especially relevant for junior employees. Senior staff can often use AI productively because they already have internal models of quality. They know what to reject. They can smell nonsense through formatting. Juniors may not have that filter yet. If AI gives them polished answers before they have built rough judgement, the learning ladder loses several rungs. Very convenient. Also a superb way to produce confident mediocrity at scale.
The boundary conditions are not small print
The limitations of the paper materially affect how far we can generalise.
The sample is small: 40 participants, 20 in each condition. The participants were educated, Greek-speaking adults from language-related academic backgrounds. The task was a short argumentative writing exercise. The AI system was ChatGPT 3.5. Engagement was measured immediately after the task using self-report. The paper does not evaluate the quality of the essays, the accuracy of arguments, the durability of learning, or whether different AI instructions would have produced a better engagement profile.
Those boundaries matter.
A well-designed AI tutor that asks questions before giving answers might increase engagement. A workflow that forces critique of AI-generated claims might deepen learning. A professional using AI after years of domain training might benefit without the same engagement loss. A team using AI for mechanical summarisation may face a very different risk profile from a student using it to generate arguments.
So the paper should not be used as an anti-AI cudgel. It should be used as a design warning: when AI is inserted at the point where human effort is supposed to occur, lower engagement is a plausible outcome.
That warning is enough.
The real risk is not laziness; it is silent deskilling
“Lazy” is an emotionally satisfying word. It is also a blunt instrument. In business settings, the sharper concept is silent deskilling.
Silent deskilling happens when people continue to produce acceptable outputs while losing the habits that created competence in the first place. The dashboard looks fine. The documents ship. The clients receive polished prose. The slide deck has gradients, because apparently we deserved that. Meanwhile, fewer humans are practising the underlying judgement.
This is not unique to AI. Automation has always changed skill formation. GPS changed navigation. Calculators changed arithmetic fluency. Spreadsheets changed financial modelling. The difference with generative AI is that it operates directly on language, reasoning, and explanation—the surface layer through which knowledge workers demonstrate thought.
That makes cognitive engagement a business risk, not an educational nicety.
If analysts stop forming independent hypotheses, the firm becomes more dependent on model-shaped consensus. If junior lawyers stop reasoning through cases before seeing generated arguments, review quality becomes harder to train. If consultants stop building issue trees and only edit AI-generated ones, the organisation may preserve deliverable volume while weakening its talent pipeline.
The first-order productivity gain is visible. The second-order capability loss is quieter. Naturally, that is the one executives are more likely to miss.
The conclusion: effort is now a design variable
This paper’s best contribution is not that it proves ChatGPT makes people lazy. It does not. Its value is that it gives a small but controlled piece of evidence for a larger operational truth: AI systems do not merely assist tasks; they redistribute cognitive effort inside them.
That redistribution can be good. It can remove drudgery, widen access, accelerate feedback, and help people see alternatives. It can also remove the productive difficulty that builds understanding.
The difference will not be decided by model capability alone. It will be decided by task design.
For schools, that means AI assignments should require pre-AI reasoning, critique of AI output, and reflection on rejected suggestions. For companies, it means AI governance should measure whether employees remain engaged in the parts of work where judgement is formed. For product teams, it means the best AI interface may not be the one that answers fastest. Sometimes the better tool is the one that makes the user think before it helps.
Convenience is not the enemy. Unexamined convenience is.
And if that sounds a little uncomfortable, good. That may be the cognitive engagement returning.
Cognaptus: Automate the Present, Incubate the Future.
1. Georgios P. Georgiou, “ChatGPT produces more ‘lazy’ thinkers: Evidence of cognitive engagement decline,” arXiv:2507.00181, 2025, https://arxiv.org/pdf/2507.00181.