Mind the Gap: How AI Papers Misuse Psychology

TL;DR for operators

AI teams love borrowing psychology. It gives messy model behaviour a tidy name: “reasoning,” “empathy,” “Theory of Mind,” “bias,” “motivation,” “attention.” The problem is that a borrowed label is not the same as a valid construct. A new paper, The Incomplete Bridge: How AI Research (Mis)Engages with Psychology, studies this borrowing directly by mapping 1,006 LLM-related papers from major AI venues and the 2,544 psychology papers they cite.¹

The operational message is blunt: more psychology citations do not automatically mean better grounding. Sometimes they mean the opposite: a thin layer of academic legitimacy spread over a benchmark, prompt technique, product claim, or evaluation protocol. Very scholarly. Very clickable. Not always very sound.

The paper finds three patterns worth translating into business practice.

First, psychology has become increasingly visible in LLM research since 2023. The authors trace this through citation flows across top AI venues, finding that psychology references begin rising around the GPT-3.5/GPT-4 period, accelerate later in 2023, and then slow in growth by mid-to-late 2024. This is mapping evidence, not causal proof that specific model releases drove the trend.

Second, the citations are uneven. LLM research draws most heavily on Psychometrics & Judgment and Decision-Making, Neural Mechanisms, Language, and Social Cognition. Education and Social-Clinical psychology appear comparatively underused. That matters because many commercial AI products are being sold precisely in education, mental health, coaching, employee support, customer service, and social interaction—domains where weak psychology can become weak product design.

Third, the paper’s Theory of Mind case study shows how interdisciplinary borrowing can go wrong. It identifies four recurring failure modes: overgeneralising concepts, citing only a narrow slice of the literature, misrepresenting findings, and building closed citation loops inside AI research instead of returning to primary psychology sources. For operators, these are not academic etiquette problems. They are product-risk problems.

The practical checklist is simple:

Operator question	Why it matters
Are we using a psychology term as a scientific construct or as decorative vocabulary?	Decorative vocabulary creates false confidence.
Does the cited literature actually measure the capability we claim to measure?	A benchmark can be precise and still measure the wrong thing.
Are we citing primary psychology research or only AI papers summarising it?	Secondary citation loops amplify early misunderstandings.
Have domain experts helped define the construct before implementation?	Late expert review often becomes reputation management, not design.
Are our claims limited to task performance, or do they imply human-like mental capacity?	The gap between “passes a task” and “has the capacity” is where the trouble lives.

The paper does not show that psychology is useless for AI. Quite the opposite. It shows that psychology is useful enough to be dangerous when used carelessly.

The business problem is not citation count. It is construct debt.

A product team building an AI tutor might say the system supports “metacognition.” A mental-health chatbot might claim to use “CBT-informed” dialogue. A sales assistant might be described as having “Theory of Mind” because it adapts to customer intent. A culture-aware assistant might be evaluated through “moral reasoning” or “empathy” tasks.

These phrases sound useful because they come from real disciplines. They also travel badly.

A psychological construct is not just a label. It comes with assumptions, measurement traditions, boundary conditions, populations, task designs, and debates. When AI research imports only the label, it creates what we might call construct debt: the hidden cost of building systems, benchmarks, or claims on terms whose meaning has not been paid for.

The paper’s central contribution is to make this debt visible. It does not run a new LLM benchmark. It does not claim that one model has better or worse psychology-grounded reasoning than another. Instead, it asks a science-of-science question: how has LLM research been citing and operationalising psychology?

That distinction matters. A model-performance paper asks, “Can the system do X?” This paper asks, “When researchers say X, what intellectual machinery are they borrowing, and are they borrowing it responsibly?”

For commercial AI teams, the second question is often more important than it looks. If a benchmark is based on a weakly imported construct, then improving the score may improve the benchmark result without improving the real product capability. That is how organisations end up optimising dashboard numbers that have been wearing a lab coat.

What the paper actually does: a map, not a model leaderboard

The authors start with 25,843 papers from seven major AI and NLP venues: NeurIPS, ICLR, ICML, ACL, EMNLP, NAACL, and TACL. They focus on work from 2023 and 2024, plus a small 2025 TACL slice, and filter for papers with “LLM” or “language model” in the title or abstract. That narrows the set to 3,962 LLM-related papers.

Then comes the psychology filter. Using Semantic Scholar field classifications, the authors identify references tagged as Psychology but not Computer Science. They keep only LLM papers that cite at least one such psychology reference. The final corpus becomes 1,006 LLM papers and 2,544 cited psychology papers.
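The two filters described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Paper` and `Reference` classes, the `fields_of_study` tags, and the toy records are all hypothetical, and the keyword check is a naive substring match.

```python
from dataclasses import dataclass, field

KEYWORDS = ("llm", "language model")  # the two keywords named in the article


@dataclass
class Reference:
    title: str
    fields_of_study: set[str]  # hypothetical field-of-study tags


@dataclass
class Paper:
    title: str
    abstract: str
    references: list[Reference] = field(default_factory=list)


def is_llm_paper(paper: Paper) -> bool:
    """Naive case-insensitive keyword match on title or abstract."""
    text = f"{paper.title} {paper.abstract}".lower()
    return any(keyword in text for keyword in KEYWORDS)


def psychology_refs(paper: Paper) -> list[Reference]:
    """References tagged Psychology but not Computer Science."""
    return [
        ref for ref in paper.references
        if "Psychology" in ref.fields_of_study
        and "Computer Science" not in ref.fields_of_study
    ]


def build_corpus(papers: list[Paper]) -> list[Paper]:
    """Keep LLM papers that cite at least one psychology-only reference."""
    return [p for p in papers if is_llm_paper(p) and psychology_refs(p)]


# Toy records, invented for illustration only.
papers = [
    Paper("LLM false-belief evaluation", "",
          [Reference("psych ref A", {"Psychology"})]),
    Paper("Language model scaling", "",
          [Reference("cs ref B", {"Computer Science"})]),
    Paper("Graph neural networks", "A study of message passing.",
          [Reference("psych ref C", {"Psychology"})]),
    Paper("LLM personality probes", "",
          [Reference("mixed ref D", {"Psychology", "Computer Science"})]),
]
corpus = build_corpus(papers)
print([p.title for p in corpus])
```

Only the first toy paper survives: the second cites no psychology, the third is not an LLM paper, and the fourth cites a reference tagged as both Psychology and Computer Science, which the filter excludes.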

The analysis then proceeds in three layers.

First, the authors embed paper titles and abstracts using SPECTER, a citation-informed document embedding model, and cluster the LLM papers and psychology papers separately with K-means. The selected structure yields eight LLM clusters and six psychology clusters.

Second, they map citation flows between those clusters. This is the paper’s main evidence for how psychology enters LLM research.

Third, they examine psychology theories and frameworks more granularly. Domain experts help label secondary psychology clusters and identify prominent theories or frameworks. GPT-4.1 is then used to connect specific papers to those theories based on titles and abstracts. The result is not a definitive taxonomy of psychology. It is a structured map of what this slice of LLM research appears to cite.

That methodology is useful, but it should not be overread. Clustering is a lens, not an oracle. SPECTER embeddings, K-means, GPT-assisted labelling, and Semantic Scholar classifications all shape the map. The authors are not claiming to have discovered the true ontology of psychology. They are showing a defensible picture of citation behaviour in a defined corpus.
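To make the first two layers concrete, here is a small sketch of clustering two sets of papers separately and then counting citations between the resulting clusters. It is an assumption-laden toy: the 2-D vectors stand in for SPECTER embeddings, the K-means is a plain Lloyd's loop seeded with the first `k` points (the paper's own implementation and cluster-count selection are not reproduced), and every id and citation edge is invented.

```python
import math
from collections import Counter


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


def citation_flows(edges, citing_cluster, cited_cluster):
    """Count citations between (citing cluster, cited cluster) pairs."""
    flows = Counter()
    for citing, cited in edges:
        flows[(citing_cluster[citing], cited_cluster[cited])] += 1
    return flows


# Toy vectors standing in for document embeddings (hypothetical values).
llm_ids = ["llm_1", "llm_2", "llm_3", "llm_4"]
llm_vecs = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
psych_ids = ["psy_a", "psy_b", "psy_c", "psy_d"]
psych_vecs = [(1.0, 0.0), (0.0, 1.0), (0.9, 0.1), (0.1, 0.9)]

# The two sets are clustered separately, as the article describes.
llm_labels, _ = kmeans(llm_vecs, k=2)
psych_labels, _ = kmeans(psych_vecs, k=2)
llm_cluster = dict(zip(llm_ids, llm_labels))
psych_cluster = dict(zip(psych_ids, psych_labels))

# Hypothetical citation edges: (citing LLM paper, cited psychology paper).
citations = [
    ("llm_1", "psy_a"), ("llm_1", "psy_c"), ("llm_2", "psy_b"),
    ("llm_3", "psy_a"), ("llm_4", "psy_d"), ("llm_4", "psy_a"),
]
flows = citation_flows(citations, llm_cluster, psych_cluster)
print(flows)
```

The flow counts depend entirely on the cluster labels, which in turn depend on the embeddings, the seeding, and the choice of `k`. Change any of them and the map changes with it.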

Paper element	Likely purpose	What it supports	What it does not prove
Corpus construction from major AI venues	Main sampling frame	Shows how psychology appears in selected top-venue LLM research	Does not cover all AI work, industry research, non-English research, or HCI-heavy venues
SPECTER embeddings and K-means clustering	Implementation detail for mapping	Creates thematic groups for LLM and psychology papers	Does not establish natural or immutable disciplinary categories
Bipartite citation network	Main evidence	Shows which LLM clusters cite which psychology clusters	Does not judge whether citations are accurate
Temporal citation-flow figures	Main evidence with trend interpretation	Shows psychology citation growth over time	Does not prove that specific model releases caused the growth
Theory/framework tables	Exploratory extension and gap analysis	Identifies commonly cited and underexplored psychology frameworks	Does not prove the listed underexplored theories will improve models
Theory of Mind case study	Diagnostic qualitative analysis	Illustrates how misapplication happens in practice	Does not estimate the prevalence of every misuse across all psychology citations

This distinction prevents a common misreading: the paper is not saying “AI research cites psychology badly everywhere.” It is saying the field is building an increasingly busy bridge to psychology, and parts of that bridge are made of cardboard.

Category one: measurement psychology is popular because AI needs scorekeeping

The most cited psychology cluster is Psychometrics & Judgment and Decision-Making. This is not surprising. AI research has a measurement addiction, and psychometrics is, at its best, the discipline of disciplined measurement.

Classical Test Theory, Item Response Theory, Likert scales, inter-rater reliability, and related methods fit naturally into LLM evaluation. When researchers build benchmarks, compare annotators, measure bias, assess reasoning, or analyse preference data, psychometrics offers ready-made tools. In the paper’s top-ten theory/framework list, Dual-Process Theories receive 434 citations, Heuristics and Biases 210, and Classical Test Theory 145. Those numbers are not ornamental; they reveal what LLM research most urgently wants from psychology: ways to measure behaviour and talk about judgment under uncertainty.

For operators, this category is the most immediately useful. If you are evaluating an AI customer-support assistant, tutor, analyst, or internal copilot, you are already doing psychometrics whether you admit it or not. You are defining constructs, writing items, rating outputs, sampling cases, calibrating annotators, and turning messy behaviour into scores.
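Annotator calibration is one place where the discipline is cheap to apply. The sketch below computes Cohen's kappa, a standard chance-corrected agreement statistic for two raters; the function name and the toy ratings are illustrative, and the paper itself does not prescribe this particular statistic.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(rater_a)
    # Observed agreement: share of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled independently
    # at their own base rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy ratings of four model outputs (invented for illustration).
rater_a = ["helpful", "helpful", "unhelpful", "unhelpful"]
rater_b = ["helpful", "unhelpful", "unhelpful", "unhelpful"]
kappa = cohens_kappa(rater_a, rater_b)
print(kappa)  # 0.5
```

In this toy case the raters agree on three of four items, so raw agreement is 0.75, but kappa is 0.5 once chance agreement is removed. Reporting only the raw figure would overstate how well the annotators are calibrated.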

The paper’s implication is not “use more psychometrics.” It is “stop using psychometric-looking rituals without psychometric discipline.”

A customer satisfaction score is not a measure of empathy. A human preference vote is not automatically a measure of helpfulness. A pass rate on a reasoning benchmark is not necessarily evidence of reasoning ability. A synthetic user simulation is not a validated population sample. The problem is not that metrics are fake. The problem is that metrics inherit the assumptions of their construction, and those assumptions often remain conveniently offstage.

The business risk is straightforward: poorly grounded measurement creates false product maturity. A team may think it has improved safety, trust, empathy, or reasoning because the score moved. But if the construct mapping is weak, the score is only evidence that the system has adapted to the measurement surface.

This is how benchmarks become theatre. Expensive theatre, naturally.

Category two: neural and cognitive mechanisms are attractive because they make black boxes feel less black

The second major citation magnet is Neural Mechanisms. The paper shows that LLM research often borrows from neuroscience and cognitive psychology when trying to explain reasoning, memory, adaptation, and learning. The appeal is obvious. LLMs are opaque systems. Psychology and neuroscience have long dealt with opaque systems called humans. A certain amount of borrowing is not only reasonable; it is historically baked into AI.

The problem is level confusion.

A neural mechanism in humans is not the same thing as a transformer mechanism in an LLM. A theory about working memory, executive function, or Theory of Mind may inspire hypotheses about model behaviour, but it does not automatically map onto attention heads, hidden states, context windows, or agent modules.

The paper is careful here. It does not mock analogy. Analogy is often productive. But analogy becomes fragile when it changes level without telling the reader. For example, a study about brain-region activation during mental-state reasoning may be relevant to understanding human Theory of Mind. It does not directly justify a claim about a chatbot’s social interaction strategy. The same phrase—“Theory of Mind”—can refer to behavioural tasks, developmental ability, neural substrates, social reasoning, or model evaluation. Those are related, but not interchangeable.

For business teams, the lesson is to separate inspiration from validation.

It is legitimate to say: “This product design is inspired by research on scaffolding, executive function, or perspective-taking.” It is much stronger—and much riskier—to say: “This system has metacognition,” “this agent has Theory of Mind,” or “this architecture implements human-like cognitive control.”

The first is a design influence. The second is a capability claim. Investors, regulators, clinicians, educators, and enterprise buyers do not always hear the difference. Product teams need to.

Category three: language and social cognition are where useful borrowing becomes easiest to oversell

The paper’s Language and Social Cognition clusters are especially relevant for applied AI. They include frameworks such as Connectionism vs. Symbolism, Usage-Based Models of Language, Schema Theory, Theory of Mind, Simulation Theory, and Dual-Process Theory. These are natural magnets for LLM research because LLMs produce language and increasingly mediate social interaction.

The commercial temptation is to move too quickly from output fluency to psychological competence.

A model can generate empathetic language without possessing empathy. It can describe another person’s belief without having Theory of Mind in the human developmental sense. It can produce a moral justification without sharing the human architecture of moral reasoning. This is not a philosophical nitpick. It changes how systems should be tested.

If the product requirement is “write a comforting response,” then output evaluation may be enough. If the requirement is “detect user distress and respond safely,” then the construct is broader. It involves context, uncertainty, risk, escalation boundaries, population differences, and failure modes. If the product claim becomes “AI therapist,” the construct debt compounds quickly.

The paper’s Theory of Mind case study is useful because ToM is one of the most seductive labels in this category. It sounds precise. It has classic tasks. It maps neatly onto AI agents. It also comes with enough conceptual complexity to punish lazy borrowing.

The authors show that LLM research cites ToM papers from at least two different psychology orientations. Social Cognition papers often provide behavioural tasks and social-reasoning paradigms, such as false-belief tasks. Neural Mechanisms papers focus more on biological substrates, such as brain regions implicated in mental-state reasoning. Both can be relevant. Neither should be casually swapped for the other.

The business translation: if your product requires social reasoning, define the construct at the level of use. Are you testing whether the model can track a user’s stated goal? Infer a hidden preference? Handle deception? Maintain another agent’s belief across turns? Recognise emotional change? Avoid manipulative persuasion? These are not the same capability. Calling them all “Theory of Mind” makes the roadmap look elegant and the evaluation plan weaker.

Category four: education and social-clinical psychology are underused where products most need them

One of the paper’s more commercially important findings is that Education and Social-Clinical psychology are less frequently cited than psychometrics, neural mechanisms, language, and social cognition. The authors offer plausible reasons: these domains often require long-term human feedback, sensitive data, strict ethics, and HCI-style methods that fall outside the surveyed venues.

That boundary is important. The finding does not prove that education and clinical psychology are ignored across the whole AI ecosystem. It shows they are comparatively less visible in this specific top-venue LLM citation map.

Still, for operators, this underuse is uncomfortable. Education and mental-health products are among the most obvious markets for LLM applications. They are also domains where shallow psychological grounding can cause direct harm.

Consider education. An AI tutor that claims to support learning should care about more than answer correctness. It should consider scaffolding, motivation, prior knowledge, metacognition, reading comprehension, feedback timing, developmental context, and classroom realities. The paper notes that frameworks such as Bloom’s Taxonomy appear in LLM evaluation and benchmark design, while theories such as Self-Determination Theory, Bronfenbrenner’s Ecological Systems Theory, and the Simple View of Reading are identified as comparatively underexplored opportunities.

The key word is “opportunities,” not “magic ingredients.” A theory does not improve a product by being mentioned in the pitch deck. It improves a product only if it changes task design, interaction design, measurement, or deployment constraints.

The same applies to mental health. The paper identifies Cognitive Behavioural Therapy, Goffman’s Theory of Stigma, and the DSM as prominent within the Social-Clinical cluster. These frameworks can inform mental-health chatbots, stigma analysis, virtual patient simulation, and clinical-support tools. But each also carries serious boundaries. CBT is not a generic “be supportive” template. DSM categories are not casual labels for model-generated diagnosis. Stigma theory is not a keyword list for toxicity detection.

This is where the article’s title earns its keep: mind the gap. The gap is not between AI and psychology as disciplines. The gap is between a product claim and the construct that supposedly supports it.

The four misuse patterns should become an AI due-diligence checklist

The paper’s Theory of Mind case study identifies four misuse patterns. They are framed academically, but they translate cleanly into operational risk controls.

Misuse pattern	What it looks like in research	What it looks like in product work	Practical audit question
Conceptual overgeneralisation and misclassification	Treating different ToM tasks or social-cognition processes as the same thing	Calling many forms of user modelling “empathy,” “intent understanding,” or “Theory of Mind”	What exact capability are we claiming, and which tasks actually measure it?
Partial or incomplete citation	Relying on a few classic studies while ignoring more relevant or contested work	Building around familiar frameworks because they are pitch-friendly	Which less famous but more applicable studies would change the design?
Misinterpretation or misrepresentation	Using a paper because it shares a topic label, not because it supports the argument	Turning psychological caveats into product certainty	What does the cited evidence not support?
Secondary citation errors	Citing AI papers that summarise psychology instead of the original psychology literature	Letting one early benchmark define the company’s evaluation language	Have we traced the claim back to primary sources?

These are not abstract sins. Each produces a predictable business failure.

Overgeneralisation produces vague requirements. Incomplete citation produces narrow design. Misrepresentation produces inflated claims. Secondary citation loops produce inherited errors that become harder to challenge once they are embedded in benchmarks, documentation, investor decks, or procurement materials.

The phrase “consensus of misreading” is especially useful here. In a fast-moving field, one influential interpretation can be copied by later papers, benchmark suites, product teams, and blog posts until it feels established. Nobody needs to be dishonest. They only need to be busy.

AI teams are, famously, never busy. So no issue there.

The category map: useful, underused, and dangerous borrowing

A category-based reading of the paper is more useful than a linear summary because the practical problem is not simply “AI cites psychology.” The problem is that different kinds of borrowing have different risk profiles.

Psychology category	Why LLM research borrows it	Business value	Main danger
Psychometrics & Judgment/Decision-Making	Evaluation, annotation, uncertainty, bias, reasoning scores	Better benchmarks, more disciplined measurement	Treating scores as construct validity
Neural & Cognitive Mechanisms	Explanations of reasoning, memory, planning, adaptation	Inspiration for architectures and interpretability hypotheses	Confusing analogy with mechanism
Language & Psycholinguistics	LLMs are language systems; language theories feel directly relevant	Better evaluation of comprehension, pragmatics, multilingual behaviour	Equating fluent generation with human-like understanding
Social Cognition	Agents need to model intent, belief, emotion, persuasion, morality	Safer social agents and better human-AI interaction	Overclaiming empathy or Theory of Mind
Education	Tutors, assessment, learning support, feedback design	More effective AI learning products	Reducing learning to content delivery and quiz performance
Social-Clinical	Mental-health support, stigma, risk detection, therapeutic dialogue	Safer support tools and clearer escalation boundaries	Turning clinical constructs into chatbot branding

This table is where the paper’s business relevance sits. The categories do not merely describe academic clusters. They suggest where product teams should apply different levels of scrutiny.

If a framework is being used for internal inspiration, the bar is moderate: be clear, avoid overclaiming, test empirically. If it is being used for external product claims, the bar rises: define the construct, cite primary literature, involve experts, document boundaries. If it is being used in sensitive domains such as mental health, education, employment, or public services, the bar should be higher still: validate in context, monitor harms, and avoid pretending that a benchmark result is a professional credential.

What operators should infer—and what they should not

The paper directly shows that psychology citations in selected LLM research have grown, that citation patterns are uneven across psychology domains, and that Theory of Mind provides a concrete example of how interdisciplinary borrowing can become conceptually sloppy.

Cognaptus infers a broader operational lesson: AI teams should treat psychology-grounded claims as design liabilities until validated. Not liabilities in the sense of “avoid them.” Liabilities in the accounting sense: obligations that must be paid down through evidence.

A useful internal review process would ask four questions before any psychology-derived term appears in a product requirement, evaluation report, benchmark name, or marketing claim.

First: what is the construct? Define it without relying on the impressive word itself. “Empathy” is not a definition. “The model recognises user distress and responds with validated support strategies while avoiding diagnosis and escalation failures” is closer to a product requirement.

Second: what is the evidence chain? Identify primary sources, not only AI papers that cite other AI papers. If the team cannot explain why the original psychology literature supports the use case, the claim is not ready.

Third: what is the operationalisation? Specify which task, rating instrument, behavioural trace, or outcome measure connects the construct to the system. If the construct is broad, one metric will rarely be enough.

Fourth: what is the boundary? State where the system should not be interpreted as having the human capability. A chatbot passing a false-belief-style prompt does not settle whether it has human-like mental-state reasoning. It settles whether it handled that prompt under those conditions.

Layer	Paper directly shows	Cognaptus business inference	Still uncertain
Citation growth	LLM papers increasingly cite psychology papers in the surveyed venues	Psychology-grounded language will increasingly shape AI evaluation and product claims	Whether citation growth reflects deeper collaboration or surface-level borrowing
Cluster concentration	Citations concentrate around psychometrics/JDM, neural mechanisms, language, and social cognition	AI teams are prioritising measurement, mechanism analogies, and social-behaviour labels	Whether underused domains are absent or simply outside the sampled venues
Theory of Mind misuse	ToM citations show overgeneralisation, incomplete citation, misrepresentation, and secondary loops	High-value social-agent claims need stricter construct audits	How common the same misuse patterns are across other psychology constructs
Recommendations	The authors call for theoretical accountability, construct operationalisation, collaborative parity, and open infrastructure	Product governance should include psychology review for sensitive use cases	What review model is most cost-effective across industries

This framing prevents the paper from becoming either a scolding sermon or an empty “interdisciplinary collaboration is good” poster. The point is narrower and more useful: psychology can improve AI systems when it changes design and measurement. It becomes theatre when it merely changes vocabulary.

The limits are real, but they do not dissolve the warning

The paper’s limitations matter because this is a mapping study, not an audit of all AI practice.

The corpus is restricted to selected top AI and NLP venues, mostly English-language research, within a short period from late 2022 to March 2025. HCI venues, clinical informatics, education technology, non-English scholarship, and industry-only product work may be underrepresented. That particularly affects conclusions about Education and Social-Clinical psychology, because much serious work in those areas may appear outside the surveyed AI conference pipeline.

The data pipeline also depends on database classification, citation extraction, embeddings, clustering, and GPT-assisted labelling. Each step can introduce noise. Some psychology references may be missed. Some papers may be grouped imperfectly. Some theories may be more or less visible depending on titles and abstracts.

These limits do not make the paper weak. They make it properly bounded.

The right reading is not: “This paper proves AI researchers misuse psychology everywhere.” The right reading is: “Within a defined and influential slice of LLM research, psychology is being increasingly cited, unevenly distributed, and sometimes operationalised in ways that create conceptual risk.”

For a business reader, that is enough. Product teams do not need universal proof before improving their due diligence. They need plausible evidence that a common practice can create avoidable risk. This paper supplies that.

The practical standard: cite less like tourists

The authors recommend theoretical accountability, better construct operationalisation, collaborative parity, and open interdisciplinary infrastructure. For operators, those can sound like academic governance phrases. They become more useful when translated into product standards.

Theoretical accountability means no psychology term should enter a roadmap without its assumptions and limits. “CBT-informed” should specify which CBT principles, for which interaction type, under which safety constraints. It should not mean “the bot says things that sound like reframing.”

Construct operationalisation means evaluation must connect the claimed capability to appropriate evidence. If a system claims to support learning, test learning outcomes, not merely user satisfaction. If it claims to detect distress, test distress detection and escalation safety, not just warmth of tone.

Collaborative parity means psychologists, educators, clinicians, or domain experts should shape the research question early. Bringing them in after the demo is built is not collaboration. It is laundering.

Open interdisciplinary infrastructure means reusable construct maps, measurement templates, benchmark documentation, and citation guides. This would reduce repeated reinvention and slow the spread of secondary citation errors.

The immediate version for an AI company is a construct review memo. It does not need to be grand. It needs to answer:

What psychological construct are we invoking?
Which primary sources define it?
Which parts of the construct are relevant to our use case?
Which parts are not relevant?
How do we measure it?
Who reviewed the mapping?
What claims are prohibited because the evidence does not support them?

That last question may be the most valuable. Good governance is not only deciding what the company can say. It is deciding what the company should stop itself from saying before someone in marketing discovers adjectives.
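One lightweight way to keep the memo honest is to treat it as a structured record and flag whichever questions are still blank. The sketch below is a hypothetical shape, not a standard: the `ConstructReviewMemo` class, its field names, and the example entries are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ConstructReviewMemo:
    """One field per question in the memo (names are hypothetical)."""
    construct: str
    primary_sources: list[str] = field(default_factory=list)
    relevant_aspects: list[str] = field(default_factory=list)
    excluded_aspects: list[str] = field(default_factory=list)
    measures: list[str] = field(default_factory=list)
    reviewers: list[str] = field(default_factory=list)
    prohibited_claims: list[str] = field(default_factory=list)

    def unanswered(self) -> list[str]:
        """Names of the questions this memo leaves blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative draft memo; the entries are invented.
memo = ConstructReviewMemo(
    construct="user distress detection",
    relevant_aspects=["recognising stated distress", "escalation to a human"],
    measures=["escalation recall on a labelled test set"],
    prohibited_claims=["diagnoses conditions", "replaces a therapist"],
)
print(memo.unanswered())
```

Here the draft still has no primary sources, no stated exclusions, and no reviewer, which is exactly the kind of gap a review gate could block on before the term reaches a roadmap or a sales deck.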

Conclusion: psychology is not seasoning

The paper’s strongest contribution is not its citation tallies. It is the taxonomy of interdisciplinary failure.

AI research increasingly reaches for psychology because LLMs increasingly behave like social, linguistic, adaptive systems. That impulse is reasonable. Psychology offers decades of work on measurement, learning, reasoning, emotion, development, social interaction, bias, mental health, and intervention. Ignoring that literature would be foolish.

But using it badly is not much better.

The trap is to treat psychology as seasoning: sprinkle on a little Theory of Mind, a little dual-process reasoning, a little empathy, a little CBT, and suddenly a technical system tastes more human-centred. The paper shows why that does not work. Constructs have histories. Tasks have boundaries. Evidence has levels. Citations have direction. Secondary summaries lose detail. Classic papers are not always the most relevant papers.

For operators, the lesson is practical. Before turning a psychology concept into a product feature, benchmark, sales claim, or safety assurance, ask whether the company has actually imported the theory—or merely imported the word.

The difference is expensive. As usual, the expensive part arrives after the demo.

Cognaptus: Automate the Present, Incubate the Future.

¹ Han Jiang, Pengda Wang, Xiaoyuan Yi, Xing Xie, and Ziang Xiao, “The Incomplete Bridge: How AI Research (Mis)Engages with Psychology,” arXiv:2507.22847, 2025. https://arxiv.org/abs/2507.22847
