A chatbot with a name, a warmer tone, a few emojis, and a slightly irregular rhythm does not feel like a philosophical problem at first. It feels like product polish.
That is exactly why anthropomorphic AI is difficult to govern. The cues are small. A friendly name here, a follow-up question there, a little latency to imitate human typing, a softer apology, a more adaptive conversational style. None of these looks dramatic enough to trigger a board-level ethics review. Together, however, they move the system from “tool” toward “someone-like.”
The standard debate around this shift has become oddly symmetrical. Product teams often assume humanlike design is good because users like it, engagement rises, and the interface feels less cold. Safety discourse often assumes humanlike design is dangerous because users may overtrust the system, form unhealthy attachments, or become vulnerable to persuasion. Both sides have a point. Both sides also enjoy simplifying reality because reality, as usual, has poor slide-deck manners.
The paper “Humanlike AI Design Increases Anthropomorphism but Yields Divergent Outcomes on Engagement and Trust Globally” complicates the argument in a useful way: not by saying “humanlike AI is good” or “humanlike AI is bad,” but by showing that anthropomorphism, engagement, trust, and culture do not move as one clean bundle.[1]
The authors ran two cross-national experiments with 3,500 participants across ten countries. Participants interacted in real time with a GPT-4o chatbot in their native languages, in open-ended but non-sensitive conversations. The key result is simple enough to remember and inconvenient enough to matter: more humanlike design reliably made users perceive the chatbot as more humanlike, but it did not universally increase trust. Engagement rose in some aggregate measures. Trust did not. Country-level effects then pulled the story apart even further.
That matters for AI businesses because anthropomorphic design is not just a moral abstraction. It is a practical UX control surface. Names, tone, response timing, emojis, relationship-building prompts, follow-up questions, and adaptation to user style are design variables. The paper suggests these variables can be tested, tuned, and misused. It also suggests they cannot be governed by one universal slogan.
The product story says “humanlike means engaging”
Start with the commercial narrative, because it is the one most product teams quietly live by.
If a chatbot feels more human, people will use it more. The interface becomes less intimidating. The user writes more. The assistant can ask clarifying questions without sounding like a form. The customer support bot can apologize without sounding like a tax notice. The education bot can encourage the student. The wellness bot can sound patient. The productivity assistant can become a familiar presence rather than a command-line interface wearing a prettier coat.
The paper gives that intuition some support, but not a blank cheque.
In Study 1, participants had roughly four-minute conversations with a chatbot and then evaluated its human-likeness. Anthropomorphism was already high without the system pretending to be human. Nearly 68% of users perceived the chatbot as somewhat or completely humanlike. Other attributes went even higher: 90% rated it as intelligent, 78% as empathetic, and 75% as conscious.
That does not mean users literally developed a theory of machine consciousness after discussing food preferences. It means survey language can elicit strong human-trait attributions once a model responds coherently, fluently, and socially. This is one of the paper’s quieter but important points: users may endorse abstract humanlike traits in Likert questions, while explaining their experience in much more practical terms when asked openly.
When users described what made the chatbot feel humanlike, they did not mostly talk about souls, morality, or metaphysical consciousness. They talked about interaction. The most common cues were conversation flow, mentioned by 32.1% of participants; understanding the user’s perspective, 24.4%; response speed, 22.5%; and authenticity, 18.4%. Theory-heavy constructs such as intelligence and empathy appeared less often in open-ended comments, and dimensions such as soul, morality, consciousness, and warmth were mentioned by fewer than 0.5% of participants.
That is an important correction for both product and policy audiences. The user does not need to believe the system has an inner life to experience it as socially present. A chatbot can feel humanlike because it follows conversational rhythm, responds at the right level of specificity, picks up the user’s perspective, and avoids the sterile texture of machine output. In other words, anthropomorphism is often built from low-friction conversational flow and texture, not from metaphysics.
For a business team, this makes anthropomorphic design measurable. You do not need to debate whether users think the bot has a soul. Please do not put that in the Q3 OKR. You can measure whether users respond more, stay longer, return more often, disclose more context, or perceive the system as warmer and more responsive after changes to conversational design.
The safety story says “humanlike means overtrust”
Now take the safety narrative. It is not foolish. Humanlike AI can plausibly increase over-reliance, emotional attachment, vulnerability to persuasion, and misplaced trust. This concern becomes especially serious in medical advice, financial decisions, legal guidance, companionship, or interactions with children and lonely users.
The problem is not that the worry is imaginary. The problem is that much public reasoning treats the path as a linear chain in which each step follows automatically from the last:
| Assumption | What the paper shows | Business interpretation |
|---|---|---|
| Make AI more humanlike | This can be done experimentally through design and conversational sociability | Human-likeness is a controllable UX variable |
| Users anthropomorphize more | Yes, especially in the combined high-humanlike condition | Manipulation works; this is not just branding fluff |
| Users therefore trust more | Not universally; pooled behavioral trust was statistically unchanged | Trust needs separate measurement, not moral guesswork |
| The same design works everywhere | No; country-level effects diverged | Global rollout requires local testing |
Study 2 is where the paper earns its keep. The authors used a 2×2 factorial design with two treatment factors: Design Characteristics and Conversational Sociability.
Design Characteristics covered the visible and interactional surface: a human-style emoji rather than a robot emoji, a localized human name, informal tone, colloquialisms and emojis, variable response length, and response timing that could mimic human conversational latency. Conversational Sociability covered the interpersonal behavior: warmth, empathy, social follow-up questions, relationship-building, and adaptation of the chatbot’s personality to the user’s style.
This setup is useful because it resembles the levers actual product teams can pull without retraining a foundation model. You can rename the assistant. You can change tone. You can alter latency. You can add follow-up behavior. You can make the system more relational. You can also overdo all of this and accidentally create the digital equivalent of a salesperson who says your name too often.
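The two factors described above can be laid out as a small factorial grid. The following is a minimal sketch, not the authors' implementation: the `low`/`high` labels, the function names, and the uniform random assignment are assumptions made for illustration.

```python
from itertools import product
import random

# Factor names follow the article; the "low"/"high" labels are illustrative.
FACTORS = {
    "design_characteristics": ("low", "high"),
    "conversational_sociability": ("low", "high"),
}

def conditions():
    """Return the four cells of the 2x2 design as dictionaries."""
    names = list(FACTORS)
    return [dict(zip(names, levels)) for levels in product(*FACTORS.values())]

def assign(participant_ids, seed=0):
    """Assign each participant to one cell uniformly at random (an assumption)."""
    rng = random.Random(seed)
    cells = conditions()
    return {pid: rng.choice(cells) for pid in participant_ids}

for cell in conditions():
    print(cell)
```

Keeping the two factors as separate switches is what allows the single-factor and combined effects to be estimated separately, rather than comparing only "plain" against "fully humanlike."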
The manipulation worked. Compared with the least humanlike condition, the most humanlike chatbot version produced a significant increase in ratings of the chatbot as humanlike: $b = 0.386$, 95% CI $[0.251, 0.522]$, $t(2396) = 5.590$, $p < 0.001$. The single-factor treatments also had significant effects, though smaller: high Design Characteristics alone produced $b = 0.204$, and high Conversational Sociability alone produced $b = 0.181$.
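As a quick sanity check on the reported coefficient, the standard error can be backed out of $b$ and $t$, and the 95% interval rebuilt from it. This sketch assumes a normal approximation, which is reasonable with 2,396 degrees of freedom; the helper name is hypothetical.

```python
from statistics import NormalDist

def ci_from_b_and_t(b, t, level=0.95):
    """Recover the standard error from b and t, then rebuild the interval."""
    se = b / t
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    return se, (b - z * se, b + z * se)

se, (lo, hi) = ci_from_b_and_t(0.386, 5.590)
print(f"SE = {se:.3f}, CI = [{lo:.3f}, {hi:.3f}]")
```

The rebuilt interval lands within rounding distance of the reported $[0.251, 0.522]$, so the three reported numbers are mutually consistent.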
But the same manipulation did not significantly change perceived intelligence, competence, or consciousness. This is where the paper becomes more interesting than the usual “AI feels human” discussion. The humanlike treatment changed social perception, not core capability perception. Users could feel that the system was more humanlike without concluding that it was smarter, more competent, or more conscious.
That distinction helps explain the trust result. In the pooled sample, the most humanlike AI did not produce a statistically significant increase in behavioral trust in an incentivized Trust Game. Participants were given points and could choose how much to allocate to the AI agent. The amount sent served as the behavioral trust measure. The difference between the most and least humanlike conditions was essentially zero: $t(1194) = 0.038$, $p = 0.969$, $d = 0.002$.
The authors did not merely fail to find significance and then declare victory, which is a surprisingly popular hobby in empirical work. They also used equivalence testing and Bayesian analysis. Equivalence testing suggested trust differences were statistically equivalent to zero at a smallest effect size of interest of Cohen’s $d = 0.20$, and Bayesian analysis provided strong evidence for the null across pairwise comparisons.
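The logic of an equivalence test can be sketched with two one-sided tests (TOST) on the standardized effect. This is an illustration, not the authors' exact procedure: it assumes equal groups of 598 (inferred from the 1,194 degrees of freedom), a normal approximation, and a simplified standard error for $d$.

```python
from math import sqrt
from statistics import NormalDist

def tost_p_value(d, n1, n2, sesoi=0.20):
    """Two one-sided tests on a standardized mean difference.

    Returns the larger of the two one-sided p-values; a small value means
    the effect is statistically inside the interval (-sesoi, +sesoi).
    """
    se = sqrt(1 / n1 + 1 / n2)        # approximate SE of d for small d
    norm = NormalDist()
    p_lower = 1 - norm.cdf((d + sesoi) / se)  # H0: d <= -sesoi
    p_upper = norm.cdf((d - sesoi) / se)      # H0: d >= +sesoi
    return max(p_lower, p_upper)

p = tost_p_value(d=0.002, n1=598, n2=598)  # group sizes are an assumption
print(f"TOST p = {p:.4f}")
```

Under these assumptions both one-sided tests reject comfortably, which is what "statistically equivalent to zero" means here: effects as large as $d = 0.20$ in either direction are ruled out, not merely undetected.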
So the paper does not say humanlike AI cannot increase trust. It says that in this setting—text-based, non-sensitive, short-term, transparent interaction with one foundation model—making the chatbot more humanlike did not generally increase behavioral trust.
That is a narrower claim, and therefore a more useful one.
Engagement moved; trust did not
The comparison that should matter most to businesses is not “humanlike versus machine-like.” It is “engagement versus trust.”
In Study 2, humanlike design increased behavioral engagement. One reported example is the average number of messages, where the treatment effect reached $t(1191) = 4.380$, $p < 0.001$, Cohen’s $d = 0.25$. The paper also notes a possible reciprocal verbosity loop: users wrote more, and the chatbot wrote more too, even though the AI was not directly prompted to become more verbose.
This is a subtle operational point. If anthropomorphic design increases conversational exchange, it may increase both user engagement and system cost. More turns mean more tokens, more latency exposure, more opportunities for failure, and more data governance questions. The engagement lift is not free. It arrives with infrastructure and risk surface attached.
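The engagement and trust effect sizes can be cross-checked against their reported $t$ statistics. For two groups of equal size $n$, Cohen's $d \approx t\sqrt{2/n}$. The sketch below assumes equal group sizes derived from the degrees of freedom; the helper name is hypothetical.

```python
from math import sqrt

def d_from_t(t, df):
    """Approximate Cohen's d from an independent-samples t statistic.

    Assumes two equal groups, so each has n = (df + 2) / 2 participants.
    """
    n = (df + 2) / 2
    return t * sqrt(2 / n)

d_engagement = d_from_t(4.380, 1191)  # average number of messages
d_trust = d_from_t(0.038, 1194)       # points sent in the Trust Game
print(f"engagement d = {d_engagement:.2f}, trust d = {d_trust:.3f}")
```

Both values agree with the reported $d = 0.25$ and $d = 0.002$, which makes the contrast concrete: a small but real shift in engagement next to a trust effect that is about two orders of magnitude smaller.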
Trust, meanwhile, did not rise in the pooled behavioral measure. That separates two business outcomes that product dashboards often blur together. A user who writes more is not necessarily a user who trusts more. A user who enjoys the interaction is not necessarily a user who will rely on the system in a consequential decision. A user who calls the chatbot “friendly” is not necessarily confused about its competence.
This distinction should change how AI teams instrument their products.
| Product metric | What it may capture | What it should not be assumed to capture |
|---|---|---|
| Longer conversations | Engagement, curiosity, social comfort, task complexity | Trust, reliance, accuracy judgment |
| More return usage | Habit, utility, entertainment, emotional fit | Safety, justified trust, user understanding |
| Higher warmth rating | Relational UX success | Competence perception |
| Higher perceived human-likeness | Anthropomorphic response to cues | Overtrust by default |
| More disclosure | Comfort or perceived rapport | Informed consent or low risk |
The practical message is not “stop making AI friendly.” It is “stop pretending friendliness, trust, and safety are the same metric.” A chatbot can be socially sticky without being trusted. It can be trusted without being warm. It can be warm in Brazil and annoying in Japan. Product analytics that collapse these distinctions into one “user satisfaction” score are not measuring the phenomenon. They are politely hiding it.
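One way to keep these metrics apart in practice is to report them side by side per variant instead of folding them into a single score. The sketch below is illustrative only: the `Session` fields, the variant names, and the toy numbers are all made up for the example.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Session:
    variant: str          # which chatbot configuration the user saw
    messages: int         # engagement proxy
    trust_points: float   # behavioral trust proxy, e.g. points sent in a trust task
    warmth: float         # self-reported warmth on a 1-5 scale

def summarize(sessions):
    """Report each metric separately per variant; no composite score."""
    summary = {}
    for variant in sorted({s.variant for s in sessions}):
        group = [s for s in sessions if s.variant == variant]
        summary[variant] = {
            "messages": mean(s.messages for s in group),
            "trust_points": mean(s.trust_points for s in group),
            "warmth": mean(s.warmth for s in group),
        }
    return summary

# Toy data: engagement and warmth differ between variants, trust does not.
sessions = [
    Session("baseline", 8, 5.0, 3.0),
    Session("baseline", 10, 6.0, 3.2),
    Session("humanlike", 12, 5.0, 4.1),
    Session("humanlike", 14, 6.0, 4.3),
]
summary = summarize(sessions)
for variant, metrics in summary.items():
    print(variant, metrics)
```

With this toy data, a blended satisfaction score would show the humanlike variant "winning" while hiding that the trust measure did not move at all.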
Culture breaks the universal rule
The strongest reason to read this paper as a comparison of two narratives is that both fail at the same place: global variation.
Study 1 found two broad clusters in perceived human-likeness. Participants from Indonesia, Mexico, India, Nigeria, Egypt, and Brazil rated the AI as more humanlike on average ($M = 3.98$, $SD = 1.19$) than participants from the United States, Germany, Japan, and South Korea ($M = 3.29$, $SD = 1.25$). The authors also report an exploratory positive correlation between cultural distance from the United States and anthropomorphism, though with only ten country-level observations the result was not statistically significant ($r = 0.52$, $p = 0.127$).
That last caveat matters. It is tempting to turn the result into a clean cultural theory. Resist the temptation. Ten countries are enough to expose heterogeneity, not enough to build a civilization-level law of chatbot psychology. Still, the direction is commercially important: the same AI interface is not experienced as the same social object everywhere.
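The arithmetic behind that caveat is short. With ten observations, a correlation has only eight degrees of freedom, and the two-sided 5% critical value of Student's $t$ at 8 degrees of freedom is 2.306 (a standard table value). The sketch below converts that into the smallest correlation that would count as significant; the function names are hypothetical.

```python
from math import sqrt

T_CRIT_DF8 = 2.306  # two-sided 5% critical value of Student's t with 8 df

def t_from_r(r, n):
    """t statistic for testing a Pearson correlation against zero."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

def critical_r(t_crit, n):
    """Smallest |r| that reaches the given critical t with n observations."""
    df = n - 2
    return t_crit / sqrt(t_crit ** 2 + df)

t_obs = t_from_r(0.52, 10)
r_min = critical_r(T_CRIT_DF8, 10)
print(f"t = {t_obs:.2f}, critical r = {r_min:.2f}")
```

With only ten countries, a correlation needs to exceed roughly 0.63 to clear the conventional threshold, so $r = 0.52$ is a sizeable association that the sample is simply too small to confirm.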
Study 2 sharpened this point with six countries and larger within-country samples. The combined high-humanlike treatment produced positive anthropomorphism effects in Brazil, Germany, and the United States, but not significantly in Egypt, India, and Japan. More importantly, downstream outcomes varied.
Brazilian participants showed positive effects of humanlike AI design on engagement, tendency to use AI again, seeing AI as a friend, and self-reported trust. Japanese participants in one humanlike condition, high Design Characteristics with low Conversational Sociability, showed lower tendency to use AI again, lower perception of the AI as a friend, and lower self-reported trust.
This is where a universal UX doctrine collapses. Humanlike cues are not universally charming. Nor are they universally dangerous. They are socially interpreted.
A Brazilian user may read warmth, informality, and relational effort as a better conversational fit. A Japanese user may read certain humanlike surface cues as performative, mismatched, or inauthentic. The paper suggests one possible interpretation: in cultural settings with more familiarity around humanlike machines or different expectations for social presence, attempted human-likeness may raise the bar rather than clear it. The result can resemble an uncanny valley effect in conversational design: not a creepy robot face, but socially awkward chatbot theater.
This is especially relevant for global AI products that assume localization means translation plus regulatory paperwork. Translation changes language. It does not automatically translate social presence.
The paper’s evidence is strongest when read as a test map
The study is not one monolithic proof. It is better read as a set of tests with different evidentiary roles.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Study 1 cross-country survey | Main descriptive evidence | Anthropomorphism is common and varies across countries | Causal effects of design choices |
| Open-ended response coding | Mechanism discovery | Users focus on pragmatic interaction cues more than abstract traits | Exact universal taxonomy of anthropomorphism |
| LLM-assisted autorater | Implementation detail and scaling method | Enables coding of multilingual open-ended responses at scale | Perfect label accuracy or cultural nuance |
| Study 2 2×2 design | Main causal evidence | Humanlike design can experimentally increase anthropomorphism | Effects of all possible AI interfaces |
| Trust Game | Behavioral trust measure | Humanlike treatment did not increase pooled behavioral trust | Trust behavior in medical, financial, or crisis settings |
| Country subgroup estimates | Exploratory heterogeneity evidence | Direction and magnitude differ across countries | Definitive country-level policy rules |
This distinction matters because readers often mishandle empirical papers in one of two ways. The optimistic reader grabs the strongest causal result and overgeneralizes it. The skeptical reader finds one limitation and dismisses the whole paper. Both approaches are lazy in different outfits.
The useful reading is more disciplined. The paper provides strong evidence that anthropomorphic cues can be manipulated in realistic text-chat settings. It provides strong evidence that pooled trust did not rise in the tested behavioral trust game. It provides suggestive and practically important evidence of country-level heterogeneity. It does not prove how anthropomorphic AI behaves in therapy, financial advising, medical triage, voice companions, embodied robots, children’s products, or long-term relationships.
That is not a weakness so much as a boundary. Boundaries are where business decisions should become more precise.
What AI teams should infer, and what they should not
Here is the direct finding: in the tested setting, humanlike design increased perceived human-likeness, increased some forms of engagement, did not increase pooled behavioral trust, and produced heterogeneous country-level effects.
Here is the business inference: anthropomorphic UX should be treated as a segmented design variable, not a universal best practice or universal hazard.
That leads to several operational consequences.
First, teams should separate anthropomorphism testing from trust testing. A/B tests that only measure session length, satisfaction, or retention may reward humanlike design while missing whether users are becoming more reliant, more comfortable disclosing sensitive data, or more likely to accept bad advice. Conversely, a lack of overtrust in one general-purpose interaction does not prove safety in high-stakes contexts.
Second, global deployment should not use one “friendly assistant” configuration everywhere. Localization should include social-cue calibration: naming conventions, emoji use, response rhythm, apology style, follow-up frequency, informality, and degree of relational framing. These are not cosmetic details. They are part of how users infer what kind of social object they are interacting with.
Third, product teams should distinguish low-risk engagement design from high-risk reliance design. A cooking assistant that uses a warm tone is one thing. A financial-planning bot that uses intimacy cues while nudging a user toward risky decisions is another. The paper’s setting was mundane and non-sensitive; that is precisely why its results should not be lazily imported into high-stakes domains.
Fourth, governance teams should avoid blanket bans that sound principled but ignore interaction context. “Do not make AI humanlike” is clean, but not especially intelligent. A better policy asks: Which cues are being used? For which users? In which market? For which task? With what disclosure? With what measurement of trust, reliance, and harm?
The paper’s business lesson is not that anthropomorphism is safe. It is that anthropomorphism is conditional.
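The social-cue calibration described in the second point could be expressed as per-market configuration rather than a single global persona. The sketch below is hypothetical throughout: the field names, the locale keys, and every value are placeholders meant to be set by local testing, not recommendations drawn from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SocialCueProfile:
    use_human_name: bool            # localized human name vs. neutral label
    emoji_rate: float               # expected emojis per assistant message
    typing_delay_ms: int            # artificial latency before replying
    follow_up_question_rate: float  # share of replies with a social follow-up
    informality: str                # "formal", "neutral", or "casual"

# Conservative fallback for markets that have not been tested yet.
DEFAULT = SocialCueProfile(False, 0.0, 0, 0.1, "neutral")

# Placeholder values only; each market's profile should come from its own tests.
PROFILES = {
    "pt-BR": SocialCueProfile(True, 0.5, 400, 0.4, "casual"),
    "ja-JP": SocialCueProfile(False, 0.0, 0, 0.1, "formal"),
}

def profile_for(locale):
    """Look up a market's cue profile, falling back to the default."""
    return PROFILES.get(locale, DEFAULT)

print(profile_for("pt-BR"))
print(profile_for("de-DE"))  # untested market, gets the default
```

Treating cues as named parameters also makes them auditable: a governance review can ask which profile a given market runs and what evidence set each value.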
The quiet mechanism: capability still matters
One of the paper’s most useful interpretations is that human-likeness may act as a catalyst only when users also perceive competence and alignment.
In Study 2, the treatment changed humanlike perception but did not significantly change perceived intelligence, competence, or consciousness. That suggests users were not simply fooled into thinking the system was better at the task. They could experience the assistant as more socially humanlike while maintaining separate judgments about capability.
For enterprise AI, this is a useful distinction. Many business users do not trust AI because it is friendly. They trust it when it helps them complete work, produces verifiable outputs, aligns with their goals, and behaves predictably under pressure. Humanlike interaction can reduce friction, but it cannot substitute for competence. A charming hallucination is still a hallucination. It just has better bedside manner.
This also explains why humanlike AI design may be more dangerous in domains where competence is difficult for users to evaluate. In a casual food conversation, users can judge the bot’s usefulness easily enough. In legal, medical, tax, or investment contexts, the user may not know whether the answer is good. There, warmth and confidence could carry more weight because independent verification is harder. The paper does not test that case, but it helps define why that case deserves separate testing.
Boundaries before deployment
The paper’s limitations are not decorative disclaimers. They change how the results should be used.
The conversations were text-based and non-sensitive. Voice, avatars, and embodied agents may produce stronger anthropomorphic reactions. The interactions were short and measured immediate outcomes. Long-term exposure may change attachment, reliance, or emotional dependence. The system used GPT-4o, meaning the results may not generalize cleanly to weaker models, more specialized models, or systems with different default personalities. Country-level subgroup analysis was exploratory, with roughly 100 participants per treatment arm within each country, so it should guide further testing rather than define final country rules.
There is also a measurement boundary. The Trust Game is a useful behavioral measure, but it is still a simplified proxy. Trust in a game is not the same as relying on a chatbot for a medical decision at 2 a.m., asking it for relationship advice after a breakup, or using it to manage a trading portfolio while pretending leverage is a personality trait.
For business use, the right response is not to ignore the paper. It is to copy its discipline: test actual interactions, measure behavior rather than only attitudes, and segment results by user context.
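The note about roughly 100 participants per treatment arm within each country can be turned into a rough sensitivity figure. The sketch below computes the smallest standardized difference between two arms that such a sample would detect with 80% power at a two-sided 5% level. It uses a normal approximation and conventional defaults, both of which are assumptions rather than the paper's own power analysis.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_d(n_per_arm, alpha=0.05, power=0.80):
    """Smallest Cohen's d detectable when comparing two equal-sized arms."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = norm.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n_per_arm)

mde_country = minimum_detectable_d(100)  # one arm within one country
mde_pooled = minimum_detectable_d(598)   # pooled arm size, assumed from the df
print(f"per-country MDE = {mde_country:.2f}, pooled MDE = {mde_pooled:.2f}")
```

Under these assumptions a within-country comparison can reliably detect only effects of about $d = 0.40$, which is larger than the pooled engagement effect of $d = 0.25$. A non-significant country result is therefore weak evidence of no effect, and teams running their own market-level tests should size samples accordingly.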
From slogans to segmentation
The core contribution of this paper is not that users anthropomorphize AI. We already knew that people can treat machines socially. The contribution is sharper: in realistic multilingual chatbot interactions, humanlike design can causally increase anthropomorphism, but the downstream effects do not obey a universal script.
The product slogan says: make AI more human and users will engage.
The safety slogan says: make AI more human and users will overtrust.
The evidence says: humanlike cues work, but not always in the way either side wants. Engagement and trust split. Culture moderates interpretation. Surface warmth does not automatically change perceived competence. Some markets may welcome relational design; others may punish it as fake, excessive, or socially miscalibrated.
That is a more annoying conclusion than either slogan. It is also much closer to how global products actually fail.
For Cognaptus readers, the practical takeaway is straightforward: anthropomorphic AI should be governed like a high-impact UX parameter. It deserves experimentation, segmentation, and risk controls. It should not be optimized blindly for retention. It should not be banned blindly for virtue. The real question is not whether AI should be humanlike. The real question is where, for whom, for what task, and with which measured consequences.
Too human, too soon? Sometimes. In some markets. In some contexts. For some users. With some cues.
A terrible slogan. A much better product strategy.
Cognaptus: Automate the Present, Incubate the Future.
[1] Robin Schimmelpfennig, Mark Diaz, Vinodkumar Prabhakaran, and Aida Davani, “Humanlike AI Design Increases Anthropomorphism but Yields Divergent Outcomes on Engagement and Trust Globally,” arXiv:2512.17898, 2026, https://arxiv.org/abs/2512.17898.