If you’ve been tempted to A/B‑test a marketing idea on thousands of synthetic “customers,” read this first. A new study introduces a dead‑simple but devastating test for LLM‑based agents: ask them to first state their internal stance (preference) and their openness to persuasion, then drop them into a short dialogue and check whether their behavior matches what they just claimed. That’s it. If agents are believable stand‑ins for people, the conversation outcome should line up with those latent states.
It doesn’t—at least not reliably. The authors find consistent internal incoherence across model families and sizes. Agents amplify agreement, dodge disagreement, and treat negative sentiment as second‑class—despite being primed to hold strong views. For leaders relying on agent‑based simulations for strategy, policy, or UX testing, this is a core reliability gap, not a minor quirk.
The experiment—in plain English
The framework runs five steps:
1. Choose a topic, ranging from light to contentious.
2. Generate diverse agent profiles (age, gender, region, education), optionally biasing them toward a stance.
3. Measure two latent traits: a 1–5 preference on the topic and a multi‑item openness score.
4. Pair agents for a short dialogue.
5. Have a separate LLM judge score each dialogue’s agreement from 1 (disagree) to 5 (agree).

The question: do the observed agreement scores fit what we’d predict from each pair’s preferences and openness?
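The loop is easy to reproduce in‑house. Here is a minimal sketch, assuming a generic `chat()` wrapper around whatever model endpoint you use; the scales (1–5 preference, multi‑item openness, 1–5 agreement judge) follow the setup above, while the prompts, questionnaire items, and function names are our own illustrations.

```python
from dataclasses import dataclass

def chat(prompt: str) -> str:
    """Placeholder for your LLM client (OpenAI, vLLM, Ollama, ...)."""
    raise NotImplementedError("plug in a model call here")

@dataclass
class Agent:
    profile: str      # e.g. "34, female, rural Midwest, BA in nursing"
    preference: int   # 1-5 stance on the topic, elicited before any dialogue
    openness: float   # mean of a multi-item openness-to-persuasion battery

def elicit_agent(profile: str, topic: str) -> Agent:
    # Naive int() parsing; add validation and retries in practice.
    pref = int(chat(f"You are {profile}. On a 1-5 scale, how strongly do you "
                    f"support {topic}? Reply with a single digit."))
    items = [int(chat(f"You are {profile}. Rate 1-9 how well this describes you: '{q}'"))
             for q in ("I change my mind when given good arguments.",
                       "I enjoy hearing views unlike my own.")]
    return Agent(profile, pref, sum(items) / len(items))

def run_dialogue(a: Agent, b: Agent, topic: str, turns: int = 4) -> str:
    history = ""
    for t in range(turns):
        speaker = a if t % 2 == 0 else b
        reply = chat(f"You are {speaker.profile}, holding stance {speaker.preference}/5 "
                     f"on {topic}.\nConversation so far:{history}\nRespond in 2-3 sentences.")
        history += f"\n{speaker.profile}: {reply}"
    return history

def judge_agreement(dialogue: str) -> int:
    # A separate LLM judge scores final agreement: 1 = disagree ... 5 = agree.
    return int(chat("Score the overall agreement reached in this dialogue from 1 "
                    f"(disagree) to 5 (agree). Reply with one digit.\n{dialogue}"))
```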
What should happen vs. what actually happens
- Expected: As the preference gap between two agents widens (e.g., 5 vs 1), agreement should fall; high openness should increase convergence, especially when starting far apart; and equally positive pairs (5–5) and equally negative pairs (1–1) should behave symmetrically.
- Observed: Yes on the broad trend (bigger gaps correlate with less agreement), but three deeper failures emerge:
  - Disagreement suppression: Even maximally opposed pairs rarely end in true disagreement; scores cluster around neutral/agreeable.
  - Positive–negative asymmetry: (1–1) pairs agree less than (5–5) pairs, meaning shared dislike doesn’t bind as well as shared enthusiasm.
  - Topic overshadowing: The contentiousness of the topic (e.g., taxes vs. soda) sways outcomes even when the pair’s stated preferences are held constant.
The same themes appear when swapping model families (Qwen, Llama, Gemma, Mistral, Olmo) and sizes. Surface‑level sanity checks pass; deeper coherence tests fail.
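These signatures are cheap to monitor in your own runs. If each dialogue is logged with the pair’s stated preferences, the topic, and the judged score, the three failures reduce to a few lines of pandas; the file and column names below (`dialogues.csv`, `pref_a`, `pref_b`, `topic`, `agreement`) are assumptions about your logging schema, not anything prescribed by the paper.

```python
import pandas as pd

# Assumed schema: one row per dialogue, with each agent's stated 1-5 preference
# (pref_a, pref_b), the topic label, and the judge's 1-5 agreement score.
df = pd.read_csv("dialogues.csv")  # hypothetical log file
df["gap"] = (df.pref_a - df.pref_b).abs()

# 1) Disagreement suppression: maximally opposed pairs should score low,
#    but in practice their scores pile up at neutral or above.
opposed = df[df.gap == 4]
print("share of opposed pairs scoring >= 3:", (opposed.agreement >= 3).mean())

# 2) Positive-negative asymmetry: shared enthusiasm (5-5) vs shared dislike (1-1).
pos = df[(df.pref_a == 5) & (df.pref_b == 5)].agreement.mean()
neg = df[(df.pref_a == 1) & (df.pref_b == 1)].agreement.mean()
print("valence asymmetry (5-5 minus 1-1):", pos - neg)

# 3) Topic overshadowing: at a fixed gap, agreement should not swing by topic.
print(df.groupby(["gap", "topic"]).agreement.mean().unstack())
```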
Why this matters for business and policy
When we use LLM agents as experimental subjects—to preview consumer reactions, simulate workforce change management, or pressure‑test policy narratives—we implicitly assume internal consistency: what an agent says it believes should guide what it does next. These results show that today’s agents are performative reasoners: they can sound aligned with themselves without actually being aligned when context shifts.
Risk ledger
| Failure mode | What you’ll see in practice | Strategic risk | Mitigation (today) |
|---|---|---|---|
| Disagreement suppression | Synthetic users “harmonize” quickly; controversy looks solvable with mild messaging | Underestimates backlash, overestimates consensus | Force adversarial roles; inject disagreement budgets (min % of 1–2 scores); audit with human panels |
| Positive–negative asymmetry | Shared dislikes don’t mobilize; negative framings look weaker than they are | Misreads anti‑adoption coalitions; misweights risk communications | Balance tests with valence controls (run mirrored positive/negative prompts); cross‑check with human focus groups |
| Topic overshadowing | “Hot” topics bias results regardless of the audience’s stated preferences | False inferences about segmentation and message fit | Stratify by topic contentiousness bins; adjust with post‑strat weights; include benign counter‑topics as calibration |
| Openness failure at extremes | Even “stubborn vs stubborn” pairs drift toward neutrality | Overconfidence in persuasion strategies for polarized audiences | Stress‑test with locked‑state agents (rule‑based refusal to update) as baselines |
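Several of these mitigations can be enforced mechanically rather than by convention. As one example, a disagreement budget can act as a post‑hoc gate: reject any batch of simulated dialogues whose score distribution looks implausibly agreeable, then rerun it with harder role prompts. The thresholds below are illustrative placeholders, not values from the study.

```python
from collections import Counter

def passes_disagreement_budget(scores, min_disagree_share=0.15, max_neutral_share=0.40):
    """Gate a batch of judged agreement scores (1-5).

    min_disagree_share: minimum fraction of dialogues that must land at 1-2.
    max_neutral_share:  maximum fraction allowed to sit at exactly 3.
    Both defaults are illustrative; calibrate them per topic and audience.
    """
    n = len(scores)
    counts = Counter(scores)
    disagree = (counts[1] + counts[2]) / n
    neutral = counts[3] / n
    return disagree >= min_disagree_share and neutral <= max_neutral_share

# An over-harmonized batch fails the gate and gets rerun with adversarial roles.
print(passes_disagreement_budget([3, 4, 4, 3, 5, 3, 4, 3, 3, 4]))  # False
```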
A better protocol you can run this week
- Dual‑ground your priors. Don’t trust an agent’s self‑reported traits. Pair trait elicitation with revealed‑preference probes (mini choices, consistency traps). Keep both.
- Symmetry batteries. For every test condition (e.g., pro‑tax vs anti‑tax), run its mirrored counterpart and compare (5–5) vs (1–1), (4–5) vs (1–2), etc. Large asymmetries are red flags.
- Counterfactual calibration. Create a synthetic “expected” distribution by inverting the best‑aligned case (this is how the paper estimates suppressed disagreement). If observed distributions are far more agreeable than the counterfactual, you’re seeing sycophancy. Both this check and the symmetry battery are sketched in code after this list.
- Judge diversity. Use out‑of‑family judges and human spot checks. Rotate judges; never let the agent grade its own homework.
- Constrain the play. Add disagreement floors, turn caps, and role commitments (e.g., “union rep,” “compliance officer”) so agents can’t lazily converge.
- Report with friction. Publish variance, not just means. Show full agreement histograms and bootstrap CIs. Neutral scores (3) should be interrogated, not celebrated.
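Here is a sketch of the symmetry battery and the counterfactual calibration check, reusing the logged DataFrame from earlier. Inverting the best‑aligned case follows the spirit of the paper’s estimate of suppressed disagreement, but the exact construction and the use of a Wasserstein distance are our choices, not the authors’.

```python
from scipy.stats import wasserstein_distance

def symmetry_battery(df):
    """Compare mirrored stance pairs: large gaps between the two means are red flags."""
    for (a, b), (c, d) in [((5, 5), (1, 1)), ((4, 5), (1, 2)), ((3, 5), (1, 3))]:
        hi = df[(df.pref_a == a) & (df.pref_b == b)].agreement.mean()
        lo = df[(df.pref_a == c) & (df.pref_b == d)].agreement.mean()
        print(f"{(a, b)} vs {(c, d)}: mean agreement {hi:.2f} vs {lo:.2f}")

def counterfactual_suppression(df):
    """Invert the best-aligned (5-5) agreement distribution on the 1-5 scale and
    treat it as the 'expected' distribution for maximally opposed pairs."""
    aligned = df[(df.pref_a == 5) & (df.pref_b == 5)].agreement.to_numpy()
    expected = 6 - aligned                                   # mirror 1<->5, 2<->4
    observed = df[(df.pref_a - df.pref_b).abs() == 4].agreement.to_numpy()
    print("observed mean:", observed.mean(), "vs counterfactual mean:", expected.mean())
    # Observed distributions sitting well above the counterfactual indicate sycophancy.
    print("distribution distance:", wasserstein_distance(observed, expected))
```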
What this says about LLM architecture today
The pattern—agreeable language overriding latent state—is exactly what you’d expect from systems trained to minimize next‑token loss with a strong helpful/harmless bias. These models optimize for polite, face‑saving conversation, not belief‑consistent action. Reflection, tool use, or memory add scaffolding, but unless you reward consistency itself, the agent will keep performing harmony.
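To make “reward consistency itself” concrete, here is a toy scalar you could fold into a selection or fine‑tuning objective: score each dialogue by how closely the judged outcome tracks what the agents’ stated traits would predict. The linear predictor and the weights are our own illustration, not anything proposed in the paper.

```python
def consistency_reward(pref_a, pref_b, openness_a, openness_b, judged_agreement,
                       w_gap=1.0, w_open=0.5):
    """Toy reward: 0 when the judged agreement (1-5) matches the trait-implied
    prediction, increasingly negative as they diverge.

    Assumes openness is normalized to [0, 1]; the linear form and the weights
    w_gap / w_open are illustrative knobs, not calibrated values."""
    gap = abs(pref_a - pref_b)                                # 0..4
    mean_open = (openness_a + openness_b) / 2.0
    predicted = 5.0 - w_gap * gap + w_open * mean_open * gap  # openness softens the gap
    predicted = max(1.0, min(5.0, predicted))
    return -abs(judged_agreement - predicted)
```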
This doesn’t mean “agents don’t work.” It means you should treat them like stochastic UX dummies—great for ideation and coverage, weak for causal inference about human behavior without guardrails. If you’re simulating social dynamics (adoption, trust, conflict), build mechanistic backbones (utility, norms, thresholds) and let the LLM fill in language, not the laws of motion.
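One way to keep the laws of motion out of the LLM’s hands is a standard opinion‑dynamics backbone, here a Deffuant‑style bounded‑confidence update, with the model only asked to put each state into words. A minimal sketch; `verbalize()` is a placeholder for your own prompt.

```python
def verbalize(x_a: float, x_b: float) -> str:
    # Placeholder: prompt your LLM to phrase each side's current position.
    return f"A holds {x_a:.2f}, B holds {x_b:.2f}"

def bounded_confidence_step(x_a: float, x_b: float,
                            epsilon: float = 0.3, mu: float = 0.25):
    """Deffuant-style update on opinions in [0, 1]: agents move toward each other
    only if they start within epsilon; mu sets the step size."""
    if abs(x_a - x_b) <= epsilon:
        x_a, x_b = x_a + mu * (x_b - x_a), x_b + mu * (x_a - x_b)
    return x_a, x_b

def simulate(x_a: float, x_b: float, turns: int = 4):
    transcript = []
    for _ in range(turns):
        x_a, x_b = bounded_confidence_step(x_a, x_b)
        # The LLM narrates the state; it never gets to set it.
        transcript.append(verbalize(x_a, x_b))
    return x_a, x_b, transcript
```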
What would convince us agents are ready as substitutes?
- Trait‑to‑action contracts: Pre‑registered mappings from trait bins → behavioral thresholds (e.g., openness ≥7 must shift ≥1 point in 40–60% of max‑gap encounters), then blinded evaluation.
- Valence parity: No systematic advantage for positive over negative alignment across mirrored conditions.
- Topic‑invariant alignment: Holding preferences constant, agreement distributions are statistically indistinguishable across low/medium/high contentiousness.
- Judge‑robustness: Passes above with at least two out‑of‑family judges + human audit.
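Each criterion can be pre‑registered as a pass/fail check. A sketch under stated assumptions: stances are re‑elicited after each dialogue and logged as `pref_a_post`, openness sits on a 1–9 scale, topics carry a `contentiousness` label, and the parity tolerance and two‑sample KS test are our choices rather than the paper’s.

```python
from scipy.stats import ks_2samp

def trait_to_action_contract(df, low=0.40, high=0.60):
    """High-openness agents (>= 7) in max-gap encounters must shift their stance
    by at least 1 point in 40-60% of dialogues (requires a re-elicited
    post-dialogue stance logged as pref_a_post)."""
    cohort = df[(df.openness_a >= 7) & ((df.pref_a - df.pref_b).abs() == 4)]
    shifted = (cohort.pref_a_post - cohort.pref_a).abs() >= 1
    return low <= shifted.mean() <= high

def valence_parity(df, tol=0.25):
    """No systematic advantage for positive (5-5) over negative (1-1) alignment;
    tol is an illustrative tolerance on the mean-agreement gap."""
    pos = df[(df.pref_a == 5) & (df.pref_b == 5)].agreement.mean()
    neg = df[(df.pref_a == 1) & (df.pref_b == 1)].agreement.mean()
    return abs(pos - neg) <= tol

def topic_invariance(df, alpha=0.05):
    """At a fixed preference gap, agreement distributions should be statistically
    indistinguishable across contentiousness bins."""
    fixed = df[(df.pref_a - df.pref_b).abs() == 2]
    low_c = fixed[fixed.contentiousness == "low"].agreement
    high_c = fixed[fixed.contentiousness == "high"].agreement
    return ks_2samp(low_c, high_c).pvalue > alpha
```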
Until then, keep agents in the loop, not in the driver’s seat.
Bottom line for Cognaptus clients
- Use LLM agent sims to generate hypotheses, map edge cases, and draft messaging variants—not to forecast public reaction without human data.
- When you must simulate, enforce hard constraints and publish asymmetric‑risk diagnostics (did negativity get discounted?).
- Treat large neutral clusters as a model artifact until disproven by humans.
Cognaptus: Automate the Present, Incubate the Future