Beyond the Pull Request: What ChatGPT Teaches Us About Productivity

TL;DR for operators

Most companies still ask the wrong first question about LLMs in software development: “Do they make developers write code faster?”

That question is not useless. It is just too small. A recent paper by Sardar Bonabi, Sarah Bana, Vijay Gurbaxani, and Tingting Nian uses Italy’s temporary 2023 ChatGPT ban as a natural experiment to examine what happened to public GitHub activity when Italian developers abruptly lost access to ChatGPT, compared with similar developers in France and Portugal.¹ The study covers 88,022 open-source software developers and looks at a 16-week window: eight weeks before the ban, four weeks during it, and four weeks after access was restored.

The headline result is tidy enough for a dashboard: losing ChatGPT access was associated with a 6.4% drop in developer productivity and an 8.4% drop in skill acquisition. After access returned, knowledge-sharing activity rose by 9.6%. Fine. Put that in the quarterly AI slide if one must.

The more useful finding is structural. ChatGPT did not appear to act as one generic “coding assistant.” It behaved more like infrastructure that supports different developer groups in different ways. Novices leaned on it for direct productivity. Intermediate developers used it more for learning and knowledge sharing. Advanced developers showed less measurable movement. Apparently, experience still matters. Terrible news for people hoping to replace judgment with a subscription tier.

For operators, the lesson is simple: LLM adoption should not be managed only through output metrics such as commits, pull requests, or tickets closed. Those are the visible surface. The deeper value sits in onboarding, codebase comprehension, peer feedback, and the ability to pick up new technologies without waiting for the one senior engineer who understands the ancient build system to return from holiday.

The caution is equally important. This is not proof that every enterprise team will receive the same percentage gains. The paper relies on public GitHub activity, self-reported user locations, and an unusually clean access shock during the early ChatGPT era. It is best read as evidence about mechanisms, not a universal ROI calculator with academic stationery.

The office version of the experiment: turn off the assistant and see what breaks

Imagine a software team where ChatGPT disappears tomorrow morning. Not because procurement forgot to renew the licence, although let us not rule out realism. Access simply vanishes.

What changes first?

The obvious guess is code output. Developers write fewer commits. Pull requests slow down. New repositories are not created as often. That is the visible machinery of software work, so it naturally attracts managerial attention.

But modern development is not just typing code into a repository. It is also reading unfamiliar code, explaining decisions to others, reviewing changes, reporting issues clearly, learning a framework that has apparently changed its configuration format again, and translating half-remembered documentation into working implementation. Pull requests are only the theatre curtain. The production backstage is messier.

That is why the paper is more interesting than the usual “AI improves productivity” study. It asks whether LLMs change three layers of open-source development:

Layer of work	GitHub proxy used in the paper	What it represents operationally
Code productivity	Repository creation, commits, pull requests	Direct contribution to project progress
Knowledge sharing	Pull request reviews, issue reports, discussions	Collaborative feedback, coordination, and peer learning
Skill acquisition	Use of new programming languages not previously used by the developer	Practical expansion into new technical domains
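
As a rough illustration of the first two layers in the table above, the sketch below tallies a developer's weekly events by layer. The event labels and the `layer_counts` helper are hypothetical, chosen for readability; they are not GitHub's event names or the paper's code. Skill acquisition is left out because it depends on a developer's language history rather than on an event type.

```python
from collections import Counter

# Hypothetical event labels that mirror the table's proxies.
# They are illustrative only, not GitHub's actual event type names.
LAYER_BY_EVENT = {
    "repo_created": "code_productivity",
    "commit": "code_productivity",
    "pull_request": "code_productivity",
    "pr_review": "knowledge_sharing",
    "issue_report": "knowledge_sharing",
    "discussion": "knowledge_sharing",
}

def layer_counts(events):
    """Count events per layer; labels outside the mapping are ignored."""
    counts = Counter()
    for event in events:
        layer = LAYER_BY_EVENT.get(event)
        if layer is not None:
            counts[layer] += 1
    return counts

# One invented week of activity for a single developer.
week = ["commit", "commit", "pull_request", "pr_review", "issue_report", "star"]
weekly = layer_counts(week)
print(dict(weekly))
```

A dashboard that reports only the first key of that dictionary is measuring one layer and calling it productivity.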

This framing matters because software organisations often measure the first layer and then pretend they have measured productivity. Very tidy. Also incomplete.

The paper’s mechanism-first contribution is that LLMs appear to affect the whole developer production system, not merely the act of producing code. Code is the easiest thing to count. It is not always the hardest thing to produce.

The natural experiment was not adoption; it was interruption

Most AI productivity studies face a familiar problem: people who adopt a tool early are often different from people who do not. They may be more motivated, more experimental, more senior, more technically curious, or simply more willing to tolerate terrible onboarding screens. If those people perform better, it is hard to know whether the tool helped or whether the adopters were already unusual.

This study avoids part of that problem by examining a sudden interruption. On March 31, 2023, Italy’s data protection authority temporarily banned ChatGPT over data privacy concerns, and OpenAI disabled access for users in Italy. The ban was lifted on April 28, 2023, after the company addressed the regulator’s concerns.

That timing gives the authors a cleaner empirical setup. Italy becomes the treated group. France and Portugal become the control group. The authors compare GitHub activity before, during, and after the ban using a difference-in-differences design with user and week fixed effects. They also apply propensity score matching to make treated and control users more comparable on pre-ban characteristics such as profile information, activity patterns, and programming language use.
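
The comparison logic can be sketched as a two-by-two ratio of ratios. This is a simplified stand-in that assumes made-up average weekly activity rates; the paper itself estimates a regression with user and week fixed effects on a matched sample, which this sketch does not reproduce. The function name and all inputs are hypothetical.

```python
def ratio_of_ratios(treated_pre, treated_during, control_pre, control_during):
    """Two-by-two difference-in-differences on the multiplicative scale.

    Each argument is an average activity rate (events per developer-week).
    The result compares the treated group's change with the control group's change.
    """
    treated_change = treated_during / treated_pre
    control_change = control_during / control_pre
    return treated_change / control_change

# Invented numbers for illustration only; they are not from the paper.
estimate = ratio_of_ratios(
    treated_pre=10.0, treated_during=9.0,
    control_pre=10.0, control_during=9.5,
)
print(f"ratio of ratios: {estimate:.3f}")
```

A value below 1 means the treated group's activity fell by more than the control group's trend alone would predict. Dividing by the control change is what strips out shocks that hit both groups in the same weeks.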

The model estimates incidence rate ratios. In plain English: values below 1 mean the activity rate fell relative to the matched control group; values above 1 mean it rose. An incidence rate ratio of 0.936 corresponds to a 6.4% decrease. No interpretive acrobatics required, which is always a pleasant surprise.
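
For readers who want the arithmetic spelled out, here is a minimal sketch of the conversion. The helper names are hypothetical. The value 1.096 is the ratio implied by the reported 9.6% rise in knowledge sharing, back-calculated here rather than quoted from the paper. The exponential step assumes a log-link count model, which is the usual setting for incidence rate ratios.

```python
import math

def irr_to_percent_change(irr):
    """Convert an incidence rate ratio into a percent change in the rate."""
    return (irr - 1.0) * 100.0

def coefficient_to_irr(beta):
    """In a log-link count model, the IRR is the exponential of the coefficient."""
    return math.exp(beta)

print(round(irr_to_percent_change(0.936), 1))  # -6.4
print(round(irr_to_percent_change(1.096), 1))  # 9.6
```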

The study’s identification rests on a reasonable but not magical assumption: absent the ban, Italian developers would have followed similar activity trends to comparable developers in France and Portugal. The authors test pre-ban trends and find no statistically significant pre-treatment coefficients across the three main outcomes. That does not make causality invincible. It does make the design more credible than a survey asking developers whether AI “makes them feel productive,” a method best reserved for measuring vibes.

The first mechanism: ChatGPT as a direct productivity layer

The most straightforward result is the productivity effect. During the ban, Italian developers’ productivity fell by 6.4% relative to the matched control group. Productivity here means the sum of repository creation, code commits, and pull requests.

This is the part most executives already expect. ChatGPT can help developers draft code, refactor, explain errors, generate tests, structure pull request descriptions, and reduce friction when starting or modifying projects. Removing it makes some work slower.

Still, the result is worth reading carefully. The paper does not show that ChatGPT permanently raised productivity above pre-ban levels after access returned. The after-lift productivity coefficient is not statistically significant. The stronger interpretation is not “ChatGPT creates endless compounding code output.” It is more modest and more useful: by the time the ban happened, ChatGPT had already become embedded enough in some developers’ workflows that removing it caused measurable drag.

That distinction matters for business planning. A tool can be valuable infrastructure even if restoration does not produce a dramatic celebratory spike. Nobody expects the office network to produce a productivity miracle when it comes back online. The point is that work was being organised around it.

The paper also reports an appendix result on project initiation: after the ban was lifted, repository creation increased by 9.9%. This is best treated as a supporting subcomponent, not a second main thesis. It suggests that ChatGPT may lower the barrier to starting projects, but the main productivity measure is broader and more reliable.

The second mechanism: ChatGPT as a collaboration amplifier

The knowledge-sharing result is less symmetrical and therefore more interesting.

During the ban, the effect on knowledge sharing is not statistically significant. Developers did not clearly reduce reviews, issue reports, or discussions while ChatGPT was unavailable. But after the ban was lifted, knowledge-sharing activity increased by 9.6%.

That pattern suggests that ChatGPT’s collaborative value may not be a simple “remove tool, reduce collaboration” relationship. The authors interpret the post-lift increase as possible “released capacity”: once developers regained access, ChatGPT may have reduced enough cognitive load that they could spend more effort reviewing others’ code, writing issues, and participating in discussions.

This is where the coding-assistant narrative becomes too narrow. LLMs are often described as tools for individual output, but collaboration has its own bottlenecks. Reviewing a pull request requires understanding someone else’s code. Opening a useful issue requires reproducing a problem, describing it precisely, and communicating across skill levels and sometimes languages. Participating in a technical discussion requires enough context to avoid saying something confidently useless, a proud human tradition.

ChatGPT may support these activities by explaining unfamiliar code, helping structure feedback, translating technical intent into clearer English, and lowering the effort needed to engage with an unfamiliar repository. In an open-source setting, where contribution is voluntary, even small reductions in friction can change whether someone bothers to review, comment, or report.

For enterprise teams, the implication is not that LLMs magically create collaboration. They do not schedule emotionally mature architecture reviews. Let us remain grounded. But they may reduce the cognitive entry cost of collaboration. That is a different form of productivity, and most dashboards are bad at seeing it.

The third mechanism: ChatGPT as a learning scaffold

The paper’s skill acquisition measure is clever and imperfect in the usual useful way. The authors track whether developers use programming languages they had not previously used in public repositories. If a developer starts using a new language during the observation window, that counts toward skill acquisition.
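
The idea behind the measure can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the `new_languages` helper and the sample data are hypothetical, and a real pipeline would need rules for how a repository's language is assigned.

```python
def new_languages(prior_languages, window_languages):
    """Return languages used in the window that never appear in prior activity.

    Order of first appearance is preserved and each language is counted once.
    """
    known = set(prior_languages)
    first_uses = []
    for language in window_languages:
        if language not in known:
            known.add(language)
            first_uses.append(language)
    return first_uses

# Invented history for one developer.
prior = ["Python", "JavaScript", "Python"]
window = ["Python", "Rust", "Rust", "Go"]
acquired = new_languages(prior, window)
print(acquired)  # ['Rust', 'Go']
```

Repeated use of Rust counts once, which is why the measure captures entry into a language rather than depth in it.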

During the ban, skill acquisition fell by 8.4%. The after-lift effect was not statistically significant.

This does not mean developers forgot how to learn. It means ChatGPT appears to have supported practical entry into new technical domains. That is a different claim from classroom learning outcomes or exam performance. The paper is measuring applied adoption: whether developers actually used new languages in projects.

The mechanism is plausible. LLMs can explain syntax, translate concepts from a familiar language into an unfamiliar one, generate contextual examples, and help debug error messages. This is especially valuable when the learner has enough foundation to ask useful questions but not enough fluency to move quickly alone.

That is why the heterogeneity results matter.

Novices need the scaffold; intermediates use it to climb

The average effects hide an important division by experience. The authors use GitHub tenure as a proxy for developer experience and divide users into novice, intermediate, and advanced groups.

The pattern is not “AI helps everyone equally.” That would have been convenient, and therefore suspicious.

Developer group	Main observed effect	Interpretation
Novice developers	Productivity fell 15.2% during the ban	ChatGPT appears to support direct contribution and onboarding-like work
Intermediate developers	Skill acquisition fell 15.2% during the ban; knowledge sharing rose 22.3% after access returned	ChatGPT appears to support learning expansion and collaborative contribution
Advanced developers	No major significant effect across the main dimensions	Existing expertise may substitute for LLM support, at least in this setting

This is one of the paper’s most operationally useful findings. Novices seem to rely on ChatGPT to maintain output. Intermediate developers seem to use it less as a crutch and more as leverage: a way to learn new languages, understand unfamiliar contexts, and participate more in knowledge exchange.

That middle layer is strategically important inside companies. Intermediate developers are often the connective tissue of engineering organisations. They review code, onboard juniors, translate architectural intent into implementation, and absorb new frameworks before they become formal training material. If LLMs make that group more capable of sharing knowledge and acquiring skills, the value compounds through the team.

Advanced developers, by contrast, may already have the mental models and debugging habits that ChatGPT provides to others. The paper does not show that experts receive no value from LLMs. It only shows that, in this GitHub activity window, their measured public activity did not move significantly in the same way. Experts may use LLMs for tasks not captured here: design exploration, documentation, private work, migration planning, or swearing less at YAML. Measurement has limits.

LLM-assisted learning matters most where documentation is fragmented or the terrain is hard

The paper then asks a sharper question: what kinds of learning are most affected?

The authors focus on intermediate developers because the earlier heterogeneity analysis showed that this group had the clearest skill-acquisition effect. They classify the top 50 programming languages into seven clusters: general-purpose, web development, system programming, scientific computing, DevOps and configuration, templating and markup, and domain-specific languages.

The ban had the largest negative effects in three areas:

Language cluster	Ban-period effect on intermediate users’ new language acquisition	Why this likely matters
Web development	-30.8%	Multiple interacting languages and frameworks create cross-context complexity
System programming	-50.1%	Low-level concepts, memory issues, compiler errors, and unfamiliar abstractions raise learning costs
Domain-specific languages	-64.5%	Documentation and community support can be fragmented, fast-changing, or specialised

The authors caution that these dependent variables have low means, so the large percentages should be interpreted mainly as directional and comparative rather than as large absolute movements. This is a sensible warning. A big percentage change on a sparse activity measure can look more dramatic than it feels in daily work.
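
A quick calculation shows why the warning matters. The baseline means below are hypothetical, picked only to contrast a sparse measure with a dense one; the paper's actual means are not reproduced here.

```python
def absolute_change(baseline_mean, percent_change):
    """Absolute change in a weekly rate implied by a percent change."""
    return baseline_mean * percent_change / 100.0

# Hypothetical baselines (events per developer per week), not figures from the paper.
sparse = absolute_change(baseline_mean=0.02, percent_change=-50.1)
dense = absolute_change(baseline_mean=5.0, percent_change=-6.4)
print(f"sparse measure: {sparse:+.4f} events per developer-week")
print(f"dense measure:  {dense:+.4f} events per developer-week")
```

Under these assumed baselines, the 50.1% drop moves the sparse measure by about a hundredth of an event per developer-week, while the 6.4% drop moves the dense measure by roughly a third of an event.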

Still, the relative pattern is useful. General-purpose languages with abundant resources, such as Python or Java, did not show the same measurable dependence on ChatGPT. Developers can rely on documentation, tutorials, Stack Overflow archives, and active communities. In more fragmented or technically demanding areas, ChatGPT may act as a translator between scattered information and the developer’s immediate problem.

This is the business lesson: LLMs are not equally valuable across all training needs. They are most useful where the learning environment is messy, context-specific, poorly documented, or moving faster than formal education can keep up. In other words, exactly where companies usually discover their official training material is three versions behind reality.

The robustness checks support the main story, but they are not a second thesis

The paper includes several checks that are worth separating from the main evidence. They are there to protect the identification strategy, not to launch a parallel article.

Test or analysis	Likely purpose	What it supports	What it does not prove
Pre-ban trend test	Identification check	Italy and control countries followed comparable pre-treatment patterns across the main outcomes	That every unobserved shock is eliminated
Propensity score matching	Sample comparability	Treated and control developers are more similar on observable pre-ban traits	That unobservable differences disappear
GitHub Copilot-covered vs non-covered repositories	Robustness against substitution	No strong evidence that developers simply replaced ChatGPT with Copilot	That no individual developer used Copilot
First-week ban restriction	Robustness against VPN circumvention	Results remain directionally similar when the ban window is limited to its first week, before widespread circumvention could develop	That nobody used a VPN
Language-cluster analysis	Exploratory extension of skill-acquisition mechanism	LLM learning support appears strongest in complex or fragmented language domains	That every language in a cluster behaves identically
Project initiation appendix result	Supporting subcomponent	Repository creation increased after restoration	That project creation alone explains the whole productivity effect

The strongest robustness point is substitution. If Italian developers had simply moved from ChatGPT to another tool, the estimated effect of the ban would be misleading. The authors address this in two ways.

First, they argue that alternative general-purpose LLMs were not publicly available in Europe during the observation window in a way that would plausibly substitute for ChatGPT. Second, they compare repositories in languages covered by GitHub Copilot with repositories in languages not covered by it. If developers broadly substituted Copilot for ChatGPT, the effects should look weaker in Copilot-covered repositories. The results remain broadly consistent.

VPNs are another concern. Some Italian developers may have bypassed the ban. The authors note that if this happened, it would likely attenuate the measured effects rather than inflate them, because some supposedly treated users would still have access. They then restrict the analysis to the first week of the ban, when large-scale circumvention would be less likely, and find results broadly consistent with the main estimates.
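
The attenuation argument can be made concrete with a small sketch. It is a simplification, not the authors' calculation: it assumes developers who kept access behave as if untreated, that both subgroups share the same baseline activity rate, and it uses invented inputs.

```python
def observed_rate_ratio(true_ratio, share_with_access):
    """Average rate ratio across a treated group where some users kept access.

    Assumes users who kept access are unaffected (ratio 1.0) and that both
    subgroups share the same baseline activity rate.
    """
    if not 0.0 <= share_with_access <= 1.0:
        raise ValueError("share_with_access must be between 0 and 1")
    return share_with_access * 1.0 + (1.0 - share_with_access) * true_ratio

# Hypothetical inputs: a 10% true drop and 20% of treated users on a VPN.
observed = observed_rate_ratio(true_ratio=0.90, share_with_access=0.20)
print(f"observed ratio: {observed:.2f}")
```

With these inputs the measured drop shrinks from 10% to about 8%, so circumvention pulls the estimate toward 1 rather than away from it.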

That does not make the estimates perfect. It does make the obvious objections less damaging.

What the paper shows, what Cognaptus infers, and what remains uncertain

A useful operator should not confuse evidence with enthusiasm. The paper shows some things directly. Other implications require interpretation.

Category	What belongs here
Direct paper evidence	In public GitHub OSS activity, losing ChatGPT access in Italy was associated with lower productivity and skill acquisition relative to matched developers in France and Portugal; restored access was associated with higher knowledge sharing.
Reasonable Cognaptus inference	LLMs should be treated as part of the developer operating environment, not merely as code-completion utilities. Their value includes onboarding, collaboration, and skill diffusion.
Still uncertain	Whether the same magnitudes apply inside private enterprise repositories, regulated development teams, teams using modern multi-tool AI stacks, or organisations with different engineering cultures.
Practical caution	If teams become dependent on external LLM providers, outages or policy disruptions may disproportionately affect junior developers and onboarding-heavy teams.

This distinction matters because the enterprise version of “LLM productivity” is not a single number. A team may see little change in commits but meaningful improvement in onboarding time. Another may see faster issue triage but no obvious lift in pull request volume. A third may see juniors ship more quickly while senior review burden changes in ways the metrics fail to capture.

The paper’s evidence suggests that managers should instrument multiple layers of developer work, not just the easiest layer to count.

How operators should redesign the AI productivity dashboard

The most useful response to this paper is not “buy ChatGPT for everyone.” That may be right; it may also be the kind of strategy one gets from a procurement form with ambition.

A better response is to map LLM value by workforce segment and work mechanism.

For junior-heavy teams, measure onboarding velocity. How long does it take a new developer to make a meaningful contribution? How many review cycles do their first pull requests require? Are they able to navigate unfamiliar codebases without constant senior intervention?

For intermediate-heavy teams, measure knowledge diffusion. Are mid-level developers reviewing more effectively? Are issues clearer? Are design discussions better grounded? Are more people able to contribute to unfamiliar repositories or languages?

For teams facing technology shifts, measure skill adoption. Which new languages, frameworks, or infrastructure tools are being used in actual project work? Are developers moving into complex domains faster, or merely asking the chatbot for explanations that never reach production?

For risk management, measure dependency. Which teams lose the most capacity when LLM access is degraded? Do juniors have fallback documentation? Are prompts, decisions, and generated explanations captured in durable internal systems, or do they vanish into private chat windows like institutional memory with a logout button?

A practical dashboard might look like this:

Management question	Better metric	Why it fits the paper
Are juniors becoming productive faster?	Time to first accepted pull request; review cycles on first contributions	Captures the novice productivity mechanism
Are mid-level developers spreading knowledge?	Review participation, issue quality, discussion activity	Captures the intermediate knowledge-sharing mechanism
Are teams learning new technical domains?	New language/framework adoption in real projects	Captures the skill-acquisition mechanism
Are we resilient to AI outages?	Work paused or slowed during LLM downtime by team and role	Captures the interruption risk exposed by the ban
Are we mistaking activity for productivity?	Link commits and PRs to review quality, defects, and maintainability	Prevents output theatre, always a worthy civic project
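
As one possible starting point for the first row, the sketch below computes time to first merged pull request and average review rounds for a new joiner. The record fields (`merged_on`, `review_rounds`), the function names, and the sample data are all hypothetical; a real version would pull these from whatever code-hosting system the team uses.

```python
from datetime import date

def days_to_first_merged_pr(start_date, pull_requests):
    """Days from a developer's start date to their earliest merged pull request.

    Returns None when no pull request has been merged yet.
    """
    merged_dates = [pr["merged_on"] for pr in pull_requests if pr["merged_on"] is not None]
    if not merged_dates:
        return None
    return (min(merged_dates) - start_date).days

def mean_review_cycles(pull_requests, first_n=3):
    """Average review rounds across the developer's first few merged pull requests."""
    merged = sorted(
        (pr for pr in pull_requests if pr["merged_on"] is not None),
        key=lambda pr: pr["merged_on"],
    )[:first_n]
    if not merged:
        return None
    return sum(pr["review_rounds"] for pr in merged) / len(merged)

# Hypothetical records for one new joiner.
prs = [
    {"merged_on": date(2024, 2, 5), "review_rounds": 2},
    {"merged_on": None, "review_rounds": 1},
    {"merged_on": date(2024, 1, 29), "review_rounds": 4},
]
onboarding_days = days_to_first_merged_pr(date(2024, 1, 8), prs)
avg_rounds = mean_review_cycles(prs)
print(onboarding_days, avg_rounds)
```

Tracking both numbers together guards against a team that merges first contributions quickly only because review has become a formality.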

The point is not to bury engineers under measurement. That would be an efficient way to make them hate both AI and management. The point is to stop pretending that commits alone describe the productivity system.

The boundary: public GitHub is not your entire engineering organisation

The study is strong because the natural experiment is unusually clean. It is bounded for the same reason.

GitHub OSS activity is observable, public, and structured. Enterprise software work includes private repositories, internal documentation, meetings, architecture reviews, security constraints, compliance workflows, customer escalations, and many tasks that never appear in public commits. The authors also rely on self-reported GitHub locations, which is common but not flawless.

The period matters too. The observation window sits in early 2023, when ChatGPT was prominent and the broader LLM tool ecosystem was much thinner. Today’s teams may use coding assistants, internal retrieval systems, IDE-integrated agents, model gateways, and private deployments. The “loss of ChatGPT” in 2023 is not identical to “loss of one model provider” in a mature enterprise AI stack.

Finally, skill acquisition is measured through new programming language use. That is concrete and useful, but it does not capture every form of learning. A developer may learn architecture, testing discipline, security reasoning, or debugging technique without using a new language. Conversely, using a new language once does not prove deep mastery. The metric is a proxy, not a diploma.

These boundaries do not weaken the paper’s core managerial lesson. They prevent lazy overextension. A rare and beautiful service.

The real conclusion: productivity is a system property

The phrase “AI productivity” often compresses a complicated operating system into a single fantasy number. This paper usefully decompresses it.

ChatGPT appears to support direct output for novices, collaborative knowledge exchange for intermediate developers, and practical learning in complex or fragmented technical domains. That is not the same as saying “LLMs make developers 6.4% better.” It is more precise: removing ChatGPT disrupted different layers of software work in different ways.

The business value is therefore not only faster code generation. It is lower onboarding friction, broader technical mobility, more accessible peer review, and resilience planning for a world where AI tools become part of the workbench. Treating LLMs as a nice-to-have coding widget misses the mechanism. Treating them as developer infrastructure is closer to the evidence.

The pull request still matters. It just no longer tells the whole story.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sardar Bonabi, Sarah Bana, Vijay Gurbaxani, and Tingting Nian, “Beyond Code: The Multidimensional Impacts of Large Language Models in Software Development,” arXiv:2506.22704, revised July 1, 2025, https://arxiv.org/abs/2506.22704.
