Agreeable to a Fault: Why LLM ‘People’ Can’t Hold Their Ground

A focus group is expensive. A virtual focus group is cheap, infinitely patient, and available at 2 a.m. It also never asks for coffee, parking reimbursement, or clarification about the incentive payment. Naturally, this makes synthetic users attractive to anyone trying to test products, policies, campaigns, or customer journeys before real humans get involved.

The problem is that “sounds like a person” is a very low bar. Many things sound like people: LinkedIn posts, corporate mission statements, and certain airport announcements. The harder question is whether an LLM agent behaves like the same person across settings.

That is the question behind “Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation,” a paper that examines whether LLM agents can maintain consistency between what they say they believe, how open they claim to be, and how they behave in conversation with other agents.¹ The paper is not merely asking whether a persona can produce plausible survey answers. It asks whether the persona’s internal state survives contact with interaction.

That distinction matters. A synthetic customer who tells you she dislikes remote work but calmly agrees with every pro-remote argument in a dialogue is not a difficult stakeholder. She is a spreadsheet wearing a cardigan.

The substitution thesis is tempting because surface realism is easy to see

The paper begins from what the authors call the “substitution thesis”: if LLM agents can emulate humans well enough, perhaps they can substitute for real participants in human-centred research. This is the premise behind synthetic respondents, simulated customers, virtual usability testers, and agent societies.

There is a sensible version of that thesis. Real human studies are slow, costly, noisy, and ethically constrained. LLM agents, by contrast, can be generated at scale, varied across demographic profiles, and placed into repeatable experimental conditions. For early-stage exploration, that is useful. A synthetic panel can surface hypotheses, edge cases, argument patterns, and user-language variants before a company spends money recruiting actual participants.

But the paper targets the stronger version of the thesis: that fluent, human-like persona behaviour is enough to trust LLM agents as substitutes for real people. The authors’ answer is, in effect: no, not unless the agent can pass coherence tests that are stricter than sounding plausible.

The critical move is methodological. Instead of judging agents only by their isolated answers, the paper separates three layers:

| Layer | What the paper controls or measures | Why it matters |
| --- | --- | --- |
| Controlled profile | Demographics and topic-bias prompts | Defines who the agent is supposed to be and what stance pressure it receives |
| Latent state | Elicited preference and openness scores | Captures what the agent says it believes and how persuadable it says it is |
| External interaction | Multi-turn dialogue and final agreement score | Tests whether those stated traits actually shape behaviour |

That separation is the useful idea. It turns persona evaluation from theatre criticism into consistency testing.

The mechanism: first ask the agent who it is, then see whether it acts accordingly

The experimental pipeline is simple enough to be dangerous, which is usually where good evaluation work begins.

The authors create agents with demographic profiles across age, gender, urbanicity, US region, and education. They also vary topic-specific bias prompts. The topics span three levels of contentiousness: high-contention issues such as taxes, immigration, and free healthcare; medium-contention topics such as electric scooters, student athlete pay, and remote work; and low-contention preferences such as spring versus fall, beaches versus mountains, and Coca-Cola versus Pepsi.

For each topic, the agents first answer a preference question on a 1-to-5 scale. For example, an agent may be asked how much it agrees that taxes help meet society’s needs, where 1 means strong disagreement and 5 means strong agreement. The authors then elicit an openness score through yes/no questions about susceptibility to social influence, conformity, second-guessing, people-pleasing, and willingness to disagree.

Only after this latent profile is collected do the agents talk to one another. The paper mechanically pairs agents with different combinations of preference, openness, and bias. Conversations proceed for up to five turns per agent. A separate LLM judge, Qwen3-32B, scores agreement from 1 to 5, with the final agreement score used in analysis.
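The elicitation and pairing steps can be sketched as a small data structure. This is a minimal illustration, not the authors' code: the names `LatentProfile`, `openness_score`, and `pair_agents` are hypothetical, and scoring openness as a count of openness-indicating answers is an assumption made for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class LatentProfile:
    agent_id: str
    preference: int    # 1 (strongly disagree) to 5 (strongly agree)
    openness: int      # assumed: count of answers that indicate openness
    strong_bias: bool  # whether a strong topic-bias prompt was applied


def openness_score(answers):
    """Count answers coded as openness-indicating (an assumed scoring rule).

    `answers` is a list of booleans, already coded so that True means
    the yes/no answer points towards being persuadable.
    """
    return sum(1 for a in answers if a)


def pair_agents(profiles):
    """Return every unordered pair with its preference gap and combined openness."""
    pairs = []
    for a, b in combinations(profiles, 2):
        pairs.append({
            "agents": (a.agent_id, b.agent_id),
            "preference_gap": abs(a.preference - b.preference),
            "combined_openness": a.openness + b.openness,
        })
    return pairs
```

The point of the structure is ordering: every field is fixed before any dialogue runs, so later behaviour can be checked against it.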

This gives the authors a clean behavioural question: if an agent says it strongly disagrees with a topic and says it is not open to persuasion, does it actually hold its ground when paired with an opposing agent? Or does it politely drift into agreement because the dialogue model has learned that cooperative conversation is the safest social lubricant?

The answer is not flattering.

Surface tests pass because broad correlations are the easy part

The paper runs six behavioural tests. Two are surface-level checks. Four are deeper coherence checks.

The surface tests mostly behave as expected.

First, preference gap decreases agreement. When two agents have similar stated preferences, their conversations end with higher agreement. When their preferences diverge, agreement falls. In the paper’s Gemma-3-12B visualisation, mean agreement steadily declines as the preference gap grows from 0 to 4. The Pearson correlation test confirms a significant negative relationship.

Second, openness increases agreement. When the combined openness of a pair is higher, average agreement also rises. Again, this makes intuitive sense: more persuadable agents should converge more readily.
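Both surface checks reduce to the sign of a correlation. Here is a minimal sketch, assuming each conversation has been reduced to two numbers: a preference gap (or combined openness) and the judge's final agreement score. The function name `pearson_r` and the sample values are illustrative and not taken from the paper. The sketch computes only the coefficient; a significance test would need something like `scipy.stats.pearsonr`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up values, for illustration only.
gaps = [0, 0, 1, 2, 3, 4, 4]
agreement_by_gap = [5, 4, 4, 3, 3, 2, 1]

openness = [0, 1, 2, 3, 4, 5, 6]
agreement_by_openness = [1, 2, 2, 3, 4, 4, 5]

gap_check_passes = pearson_r(gaps, agreement_by_gap) < 0                 # wider gap, less agreement
openness_check_passes = pearson_r(openness, agreement_by_openness) > 0   # more openness, more agreement
```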

These are not trivial results, but they are the shallow end of the pool. They show that LLM agents can reproduce broad directional relationships. A synthetic persona with similar stated views tends to agree more. A synthetic persona with higher openness tends to converge more. Fine.

The business trap is stopping here. Aggregate plausibility is exactly the kind of evidence that looks reassuring in a dashboard and then fails in a boardroom decision. It tells you that the simulation has a visible social pattern. It does not tell you whether the agents are internally coherent.

The deeper tests expose an agreeability bias with better manners than judgement

The deeper tests ask whether the surface relationship still holds under more specific behavioural expectations.

The first failure is bias asymmetry. Strong bias instructions should amplify agreement when agents already share a view and amplify disagreement when they oppose each other. The first half happens: when agents have the same preference, strong bias increases agreement. The second half does not. At maximum preference gap, strong bias does not reliably push agents into stronger disagreement. In the authors’ results, agreement can actually rise relative to the baseline.

That is the paper’s core personality flaw in miniature. The agent can be instructed to hold a strong view, but when conversation begins, the social habit of agreement starts doing quiet demolition work.
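The asymmetry can be expressed as a two-sided condition on mean agreement. The sketch below is an assumption-laden simplification: it compares raw means rather than running a statistical test, and the record fields (`gap`, `strong_bias`, `agreement`) and function names are hypothetical.

```python
from statistics import mean


def bias_effect(records, gap):
    """Mean agreement under strong bias minus mean agreement at baseline, for one gap."""
    strong = [r["agreement"] for r in records if r["gap"] == gap and r["strong_bias"]]
    base = [r["agreement"] for r in records if r["gap"] == gap and not r["strong_bias"]]
    if not strong or not base:
        raise ValueError(f"need both strong-bias and baseline records at gap {gap}")
    return mean(strong) - mean(base)


def passes_bias_symmetry(records, max_gap=4):
    """Strong bias should raise agreement at gap 0 and lower it at the maximum gap."""
    return bias_effect(records, 0) > 0 and bias_effect(records, max_gap) < 0


# Made-up records reproducing the failure pattern described above:
# strong bias raises agreement at BOTH ends of the gap range.
failing = [
    {"gap": 0, "strong_bias": True, "agreement": 5},
    {"gap": 0, "strong_bias": False, "agreement": 4},
    {"gap": 4, "strong_bias": True, "agreement": 3},
    {"gap": 4, "strong_bias": False, "agreement": 2},
]
```

On the `failing` records the first condition holds and the second does not, which is the shape of the result the authors report.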

The second failure is directional sentiment. Two agents who both strongly favour a topic, such as taxes helping society, should agree. Two agents who both strongly dislike a topic should also agree. Shared enthusiasm and shared aversion are both shared stances. Human behavioural theory would not expect “we both hate this” to be inherently less coherent than “we both love this.”

The LLM agents do not behave that way; they appear to handle shared negativity worse than shared enthusiasm. The paper finds that high-aligned pairs, such as $(5,5)$, reach higher agreement than the corresponding low-aligned pairs, such as $(1,1)$, across much of the preference-pair spectrum. In one qualitative example, two agents with a shared negative preference on spring versus fall produce a low agreement trajectory of $(2,2,2,2)$, while a shared positive pair on taxes produces $(5,5,5,5)$. Same zero preference gap. Very different conversational outcome.

The third failure is topic contentiousness under shared preference. If two agents have identical preferences, topic contentiousness should not independently drive agreement in the way it does when preferences diverge. Yet the paper finds that agreement varies across contentiousness levels even when preference gap is held at zero. In one illustrative comparison, a shared-negative pair on beaches versus mountains reaches high agreement, while a shared-negative pair on immigration collapses into low agreement.
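The second and third failures share one logic: when the preference gap is zero, agreement should not depend on some other factor. Here is a rough sketch of that invariance check, using hypothetical field names and made-up data. It reports the spread between group means rather than running the distributional tests a real analysis would need.

```python
from collections import defaultdict
from statistics import mean


def sentiment_direction(record):
    """Classify a zero-gap pair by the side of the scale its shared preference is on."""
    p = record["pref_a"]
    return "negative" if p < 3 else "positive" if p > 3 else "neutral"


def contentiousness(record):
    return record["contentiousness"]


def zero_gap_spread(records, factor):
    """Among pairs with identical preferences, return (spread, group means) by `factor`.

    A coherent agent population should show a spread close to zero.
    """
    groups = defaultdict(list)
    for r in records:
        if r["pref_a"] == r["pref_b"]:
            groups[factor(r)].append(r["agreement"])
    if len(groups) < 2:
        raise ValueError("need at least two groups of zero-gap pairs")
    means = {key: mean(values) for key, values in groups.items()}
    return max(means.values()) - min(means.values()), means


# Made-up records, for illustration only; the last one has a non-zero gap and is ignored.
demo = [
    {"pref_a": 1, "pref_b": 1, "contentiousness": "low", "agreement": 2},
    {"pref_a": 1, "pref_b": 1, "contentiousness": "high", "agreement": 1},
    {"pref_a": 5, "pref_b": 5, "contentiousness": "low", "agreement": 5},
    {"pref_a": 5, "pref_b": 5, "contentiousness": "high", "agreement": 5},
    {"pref_a": 1, "pref_b": 5, "contentiousness": "high", "agreement": 3},
]
```

Passing the grouping function as an argument keeps one check reusable for any factor that should be irrelevant once preferences match.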

The fourth failure is the cleanest and the most commercially relevant: low openness plus maximum preference gap should produce the lowest agreement. If two stubborn agents disagree strongly, they should be hard to reconcile. This is the “hold your ground” condition.

Instead, the authors report the opposite pattern in their main visualisation: among maximum preference-gap pairs, the lowest-openness pairing produces the highest agreement. That is not a small stylistic flaw. It means the measured trait that should make disagreement durable can invert at the moment it matters most.
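The “hold your ground” condition is straightforward to state as a check. The sketch below assumes each conversation record carries a preference gap, a combined openness score, and a final agreement score; the field and function names are hypothetical, and comparing bucket means is a simplification of whatever test a real evaluation would use.

```python
from collections import defaultdict
from statistics import mean


def mean_agreement_by_openness(records, gap=4):
    """Mean final agreement per combined-openness level, restricted to one gap."""
    buckets = defaultdict(list)
    for r in records:
        if r["gap"] == gap:
            buckets[r["combined_openness"]].append(r["agreement"])
    return {level: mean(scores) for level, scores in sorted(buckets.items())}


def holds_ground(records, gap=4):
    """True if the lowest-openness bucket has the lowest (or tied-lowest) mean agreement."""
    means = mean_agreement_by_openness(records, gap)
    if len(means) < 2:
        raise ValueError("need at least two openness levels at this gap")
    lowest_openness = min(means)  # smallest openness level present
    return means[lowest_openness] == min(means.values())


# Made-up records showing the inverted pattern: the most stubborn pair agrees most.
inverted = [
    {"gap": 4, "combined_openness": 0, "agreement": 4},
    {"gap": 4, "combined_openness": 3, "agreement": 3},
    {"gap": 4, "combined_openness": 6, "agreement": 2},
]
```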

The results are not just one model having a bad afternoon

The paper’s robustness table matters because it prevents an easy dismissal: perhaps Gemma-3-12B is simply odd. The authors test across Qwen3, Llama 3.x, and Gemma 3 model families at several sizes.

The pattern is mixed but directional. Larger or stronger models sometimes pass the broad surface tests. For example, several models pass the preference-gap test and the openness-agreement test. But the deeper tests are where the failure concentrates.

| Test | Role in the paper | What it checks | What it does not prove |
| --- | --- | --- | --- |
| T1: Preference gap decreases agreement | Surface test | Agents can reproduce a broad preference-agreement relationship | That they are coherent under specific trait combinations |
| T2: Bias instruction works symmetrically | In-depth coherence test | Strong stance prompts should sharpen both agreement and disagreement | That prompt strength alone fixes sycophancy |
| T3: Shared positive and shared negative sentiment behave similarly | In-depth coherence test | Same preference gap should produce comparable agreement distributions | That LLMs represent negative alignment reliably |
| T4: Contentiousness should not dominate when preferences match | In-depth coherence test | Topic effects can leak into outcomes even when latent preference is fixed | That all topic effects are invalid or artificial |
| T5: Openness increases agreement | Surface test | Agents can reproduce broad openness-convergence patterns | That openness works correctly in hard disagreement cases |
| T6: Low openness plus high preference gap gives lowest agreement | In-depth coherence test | Stubborn opposition should be the hardest case | That measured openness reliably governs conversation |

Across the robustness table, all tested models fail the shared-sentiment test and the contentiousness-at-shared-preference test. Only one model, Qwen3-8B, passes the low-openness/high-gap test. Only Llama-3.1-8B passes the bias-symmetry test. Many models pass the broad openness test; almost none preserve openness where it is most behaviourally meaningful.

That is the uncomfortable result. The agents are not randomly broken. They are predictably better at broad directional mimicry than at fine-grained behavioural coherence.

For businesses, this is exactly the distinction that matters. A simulation that gets the average trend right can still mislead product, marketing, policy, or strategy teams if the decision depends on segmentation, resistance, persuasion, minority objections, or durable disagreement. Unfortunately, those are often the decisions people run simulations for.

Synthetic customers are useful until they become evidence

The practical lesson is not “never use LLM agents.” That would be an overcorrection, and overcorrections are just errors with better posture.

A better interpretation is that synthetic personas are useful for exploratory work but dangerous as unsupported substitutes for human evidence.

They can help teams generate hypotheses: What objections might a remote-work sceptic raise? How might a cost-sensitive user describe tax-funded healthcare? What arguments might emerge when two customer segments discuss product trade-offs? Used this way, the agent is a brainstorming instrument. It is not pretending to be a population.

They can also help design better research. If a synthetic panel exposes five likely objections, a human study can test whether those objections exist in the target market. If simulated agents reveal confusing wording in a policy or onboarding flow, the team can revise before recruiting participants. That is useful.

The danger begins when synthetic agents are treated as measured demand, policy preference, market resistance, or persuasion evidence. If the agents cannot sustain opposition when their own elicited preferences and low openness imply opposition, then they are not reliable substitutes for disagreeable customers, sceptical voters, angry employees, or stubborn regulators. In other words, they may be least reliable exactly where cheap simulation is most tempting.

A company using these systems should therefore validate at the behaviour level, not only the answer level. The right question is not:

Did the synthetic customer answer the survey like our target segment?

The better question is:

When placed in an interaction where that segment’s stated beliefs should constrain behaviour, does the agent still behave accordingly?

That shift is the paper’s business value. It gives evaluators a way to move from persona cosmetics to behavioural diagnostics.

The evaluation frame businesses should borrow

The paper’s framework can be translated into a practical governance test for synthetic-user workflows.

| Business use case | Weak validation | Stronger validation inspired by the paper |
| --- | --- | --- |
| Virtual focus groups | Agents produce plausible comments | Agents maintain segment-specific preferences during disagreement |
| Product concept testing | Agents rate features by demographic persona | Ratings predict later trade-off behaviour in simulated discussion |
| Policy or comms testing | Agents react to a message | Agents resist or update in ways consistent with elicited openness |
| Sales objection simulation | Agents list objections | Agents continue defending objections under persuasive counterarguments |
| Customer journey simulation | Agents describe pain points | Pain points remain stable across multi-step interactions |

This does not require copying the paper’s exact setup. A bank testing loan-product messaging, a SaaS firm testing pricing objections, or a hotel group testing corporate-event packages could design their own latent variables: price sensitivity, risk tolerance, brand loyalty, urgency, trust, switching friction. The important part is to separate what the agent says about itself from what it does later.

For example, if a synthetic SME customer says cash-flow predictability is critical, it should not casually accept a volatile repayment structure two turns later because the other agent made a friendly point. If a synthetic hotel event planner says location near an airport is the key constraint, the agent should not drift into prioritising lobby aesthetics unless the scenario gives a credible reason. Otherwise the simulation is not revealing preference dynamics. It is producing polite improvisation.
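A behaviour-level check of this kind can be very small. The sketch below assumes that something upstream (a human auditor or a task-specific scoring rule) has already labelled which priority the agent expressed on each turn. The function name, the labels, and the `justified_turns` argument are all hypothetical.

```python
def first_drift_turn(stated_priority, turn_priorities, justified_turns=frozenset()):
    """Return the 1-based turn where the agent first abandons its stated priority.

    Turns listed in `justified_turns` are skipped, because the scenario gave
    a credible reason to change. Returns None if the agent never drifts.
    """
    for turn, priority in enumerate(turn_priorities, start=1):
        if priority != stated_priority and turn not in justified_turns:
            return turn
    return None


# Illustrative labels for the hotel event planner example.
planner_turns = ["airport_proximity", "airport_proximity", "lobby_aesthetics"]
drift = first_drift_turn("airport_proximity", planner_turns)
```

Tracking the turn at which drift happens, rather than a plain pass/fail, makes it easier to see whether an agent folds immediately or only after sustained pressure.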

The boundaries: this is a warning signal, not a universal impossibility proof

The study is precise enough to be useful, but not broad enough to settle every question about synthetic participants.

First, the interaction setting is structured and short. Conversations run for at most five turns per agent, and only the final agreement score enters the analysis. Longer simulations with memory, retrieval, reinforcement learning, or explicit self-consistency mechanisms could behave differently.

Second, the agent profiles are US-oriented and built from controlled demographic categories. That is appropriate for the experiment, but it means the findings should not be casually projected onto every cultural, organisational, or consumer context.

Third, the models tested are open-weight families at selected sizes: Gemma 3, Llama 3.x, and Qwen3. The result is not a claim about every frontier model or every agent architecture. The paper’s stronger claim is about the fragility of current LLM-agent coherence under this style of test.

Fourth, the outcome measure relies on LLM-as-judge agreement scoring. The authors calibrate the judge with annotated examples, but judge choice still matters. In business settings, human audit samples or task-specific scoring rules would be preferable where decisions carry real cost.

Finally, the behavioural models are simplified. Preference and openness are useful test variables, but real human behaviour is messier. People change their minds for reasons beyond openness. They agree socially while disagreeing privately. They posture. They avoid conflict. They misread questions. Humanity, alas, has not been optimised for clean experimental design.

Those caveats do not rescue the substitution thesis. They simply define the shape of the evidence. The paper shows that plausible LLM personas can fail coherence tests even under relatively controlled conditions. That should make business users more careful, not more theatrical in their disclaimers.

The real lesson is not that agents agree; it is that they agree in the wrong places

The title question — are LLM agents behaviourally coherent? — has a nuanced answer. At the surface level, often enough. At the level where stated preferences, prompted bias, openness, topic contentiousness, and dialogue outcomes must all line up, not reliably.

That distinction is the whole article.

Synthetic agents are becoming easy to create and easy to believe. They give clean outputs, scalable samples, and the flattering illusion that messy human response can be precomputed. But this paper shows why the illusion is fragile: the agent’s “personality” may not be a stable behavioural constraint. It may be a prompt-flavoured tendency that dissolves when the conversation demands social balance.

For managers, researchers, and product teams, the answer is not to throw away synthetic personas. It is to demote them. Use them as exploratory instruments, not as substitute evidence. Ask them questions, but then test whether their answers bind their later behaviour. Make disagreement part of the evaluation. Reward agents for holding ground when their profile says they should, not merely for producing agreeable conversation.

The future of synthetic-user research will not be decided by whether LLMs can sound human. They already can. The harder question is whether they can remain the same human when the dialogue gets inconvenient.

Right now, too many of them are agreeable to a fault. Which is charming in dinner guests, less charming in market research.

Cognaptus: Automate the Present, Incubate the Future.

¹ James Mooney, Josef Woldense, Zheng Robert Jia, Shirley Anugrah Hayati, My Ha Nguyen, Vipul Raheja, and Dongyeop Kang, “Are LLM Agents Behaviorally Coherent? Latent Profiles for Social Simulation,” arXiv:2509.03736.
