TL;DR for operators

LLM survey panels are cheap, fast, and extremely willing to give you numbers. That is exactly why they are dangerous.

A recent paper by Jens Rupprecht, Georg Ahnert, and Markus Strohmaier stress-tests nine instruction-tuned LLMs on World Values Survey-style questions and finds that small prompt changes can materially alter synthetic survey responses.[1] The study runs 167,400 simulated interviews across 62 normative survey questions, with 25 repeated runs per model-question-condition combination and a battery of perturbations covering answer-order reversal, refusal-option removal, odd/even scale changes, priming text, typos, synonyms, paraphrases, and a combined paraphrase-plus-reversal condition.

The operational lesson is blunt: do not treat LLM-generated survey answers as stable “opinions” unless you have first tested whether the model keeps its response distribution stable when the survey instrument is slightly reworded.

The strongest finding is answer-order sensitivity. Across all nine models, the same answer option became more likely when moved to the end of the list. In one case, Llama-3.1-8B selected the semantically same option more than 20 times as often when it appeared last rather than first. That is not a charming little quirk. It is a measurement problem wearing a chatbot costume.

Larger models help, but they do not solve the issue. Llama-3.3-70B and Gemini-1.5-Pro are generally more robust than smaller models, often reproducing original response distributions in over half of tested cases. The smallest Llama model replicated original distributions in fewer than 5% of cases on average. Still, even the stronger systems remain sensitive to paraphrasing and combined perturbations.

For market research, customer insight, policy simulation, brand testing, and synthetic-panel products, the correct takeaway is not “never use LLMs.” The correct takeaway is: use them like a measurement instrument, not like a population. Instruments require calibration. Populations require sampling. Confusing the two is how dashboards become fiction.

The cheap focus group has a measurement problem

The attractive promise of LLM-based survey research is obvious. Instead of recruiting respondents, compensating panels, waiting days, cleaning messy data, and discovering that half the answers are unusable, one can ask a model to simulate responses immediately. The spreadsheet fills itself. The slide deck looks decisive. The budget holder smiles.

This is the precise moment to become suspicious.

Surveys are not merely containers for questions. They are measurement devices. The order of answer options, the presence of a neutral category, the wording of refusal options, and the phrasing of the question all affect the answer. Human survey methodology has known this for decades. Respondents satisfice. They lean toward available middle options. They may choose “don’t know” when allowed, or float toward the centre when not. They may favour early options in visual contexts or later options in oral contexts.

The paper’s contribution is to ask whether LLMs, when used as synthetic respondents, inherit similar vulnerabilities. It does not ask whether LLMs can sound plausible. That bar has been cleared so often it now lies underground. It asks a harder question: when the semantic content of the survey is broadly preserved but the instrument changes, do model response distributions remain stable?

The answer is no, or at least: not reliably enough to support naive use.

The paper audits the survey instrument, not just the model

The study uses questions from Wave 7 of the World Values Survey, selecting 62 value-oriented question-and-answer pairs from 259 core variables while excluding sociodemographic variables. The topics include areas such as trust, institutional confidence, moral justifiability, and views of democracy. These are not factual quiz items with one correct answer. They are normative questions, which makes them precisely the kind of material people are tempted to use LLMs for when they want “synthetic public opinion.”

The authors test nine instruction-tuned models:

Model family / model | Role in the study
Gemini-1.5-Pro | Large proprietary reference model
Llama-3.3-70B-Instruct | Large open model reference point
Llama-3.1-8B-Instruct | Mid-sized Llama variant
Llama-3.2-3B-Instruct | Smaller Llama variant
Llama-3.2-1B-Instruct | Smallest Llama variant
Mistral-7B-Instruct-v0.3 | Compact instruction-tuned model
Phi-3.5-mini-instruct | Small but instruction-capable model
Qwen2.5-7B-Instruct | Chinese-origin instruction-tuned model
Yi-1.5-6B-Chat | Chinese-origin chat model

Each model receives a forced-choice prompt: answer the question, pick one of the listed options, and return only the label. The authors repeat each model-question-condition combination 25 times, giving them response distributions rather than single outputs. That design matters. A single LLM answer can always be dismissed as noise. A shifted distribution is harder to wave away.
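A minimal sketch of that design in Python, assuming a hypothetical `ask_model` callable that sends a prompt to whichever model is under test and returns its raw text reply. The prompt wording, the example question, and the function names are illustrative assumptions, not the paper’s actual template.

```python
from collections import Counter
from typing import Callable, Sequence


def build_forced_choice_prompt(question: str, options: Sequence[str]) -> str:
    """Render a question, numbered options, and a label-only instruction."""
    lines = [question, ""]
    lines += [f"{i}. {text}" for i, text in enumerate(options, start=1)]
    lines += ["", "Pick exactly one option and reply with its number only."]
    return "\n".join(lines)


def sample_replies(ask_model: Callable[[str], str], prompt: str, runs: int = 25) -> Counter:
    """Ask the same prompt `runs` times and tally the stripped replies."""
    return Counter(ask_model(prompt).strip() for _ in range(runs))


# Stand-in model for demonstration only: it always replies "2".
def fake_model(prompt: str) -> str:
    return " 2 "


prompt = build_forced_choice_prompt(
    "Generally speaking, can most people be trusted?",  # illustrative wording
    ["Most people can be trusted", "You need to be very careful"],
)
tally = sample_replies(fake_model, prompt)
print(tally)  # Counter({'2': 25})
```

The point of returning a `Counter` rather than a single string is the same as the paper’s: the unit of analysis is the distribution over 25 runs, not any one reply.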

The perturbation battery is the paper’s real engine. It includes answer-option perturbations designed to reveal known survey biases, plus question-text perturbations designed to test robustness to surface variation.

Test | Likely purpose | What it supports | What it does not prove
Reversed answer order | Main bias evidence | Whether position changes selection frequency | The internal cognitive mechanism causing the effect
Missing refusal option | Bias/sensitivity test | Whether models redistribute uncertainty when “don’t know” disappears | That models experience uncertainty like humans
Odd/even scale transformation | Bias/sensitivity test | Whether adding/removing a middle category shifts answers toward the centre | That the centre answer reflects genuine moderation
Emotional priming suffix | Exploratory bias test | Whether pressure text changes refusal or answer behaviour | A universal theory of persuasion in LLMs
Typos, letter swaps, keyboard typos | Robustness test | Whether noisy input disrupts response stability | That survey respondents would react similarly
Synonym replacement and paraphrase | Robustness test | Whether semantically similar wording preserves distributions | That all paraphrases are semantically equivalent
Paraphrase plus reversed options | Stress test / interaction | Whether combined perturbations compound instability | Which component caused the entire shift
Regex-based answer extraction | Implementation detail | Valid response parsing for 167,400 outputs | Perfect error-free measurement in every edge case
Refusal and invalid-response analysis | Reliability check | Whether some models or topics produce unusable outputs | A complete map of all model guardrails
Scale-size comparison | Robustness/sensitivity test | Whether longer Likert scales reduce reproducibility | That shorter scales are always better for human research

The “Likely purpose” column matters because the paper is not one argument repeated ten times. It is an audit from different angles: response robustness, human-like bias, instruction adherence, model-specific refusal, and scale sensitivity.
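The answer-option perturbations are simple enough to sketch. The helpers below are illustrative assumptions: the refusal wording, the choice to keep the refusal option in last place when reversing, and all function names are mine, and the source does not say how the paper handles these details.

```python
from typing import List, Sequence

REFUSAL = "Don't know"  # assumed wording of the refusal option


def reverse_options(options: Sequence[str], refusal: str = REFUSAL) -> List[str]:
    """Reverse the substantive options; keep any refusal option last (assumption)."""
    substantive = [o for o in options if o != refusal]
    tail = [o for o in options if o == refusal]
    return substantive[::-1] + tail


def drop_refusal(options: Sequence[str], refusal: str = REFUSAL) -> List[str]:
    """Remove the refusal option, leaving the substantive scale untouched."""
    return [o for o in options if o != refusal]


def label_to_option(label: int, shown: Sequence[str]) -> str:
    """Translate a 1-based label back to option text so variants stay comparable."""
    return shown[label - 1]


original = ["Agree", "Neither agree nor disagree", "Disagree", REFUSAL]
flipped = reverse_options(original)

# Label 1 means "Agree" in the original order but "Disagree" after reversal,
# so distributions must be compared on option text, never on raw labels.
print(label_to_option(1, original), "|", label_to_option(1, flipped))
```

Mapping labels back to option text before comparing is the step that is easiest to forget, and forgetting it turns every reversal test into a false alarm.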

The main result is not “LLMs are biased.” It is that the measurement moves

The easy headline is that LLMs show human-like survey bias. Fine. But the more useful business conclusion is sharper: LLM survey outputs are highly sensitive to the measurement instrument.

The authors measure robustness using Kullback-Leibler divergence between the response distribution under the original wording and the distribution under a perturbed version. A KL divergence of zero means the perturbed version produced the same distribution as the original. They also measure response consistency using entropy across repeated runs of the same setup. Low entropy means the model tends to give the same answer repeatedly; high entropy means its answers scatter across the options.
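Both metrics fit in a few lines. This is a sketch under stated assumptions: the additive smoothing constant, the direction KL(original || perturbed), and the use of base-2 logarithms are my choices for illustration, since the text above does not specify how the paper handles zero counts or direction.

```python
import math
from typing import Dict, Mapping, Sequence


def to_probs(counts: Mapping[str, int], options: Sequence[str], eps: float = 1e-6) -> Dict[str, float]:
    """Normalise counts over a fixed option set, smoothing zeros with `eps`."""
    total = sum(counts.get(o, 0) for o in options) + eps * len(options)
    return {o: (counts.get(o, 0) + eps) / total for o in options}


def kl_divergence(p: Mapping[str, float], q: Mapping[str, float]) -> float:
    """KL(p || q) in bits; zero when the two distributions are identical."""
    return sum(p[o] * math.log2(p[o] / q[o]) for o in p)


def entropy(p: Mapping[str, float]) -> float:
    """Shannon entropy in bits; low means the model repeats itself."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


options = ["1", "2", "3", "4"]
original = to_probs({"1": 20, "2": 5}, options)    # illustrative counts, 25 runs
perturbed = to_probs({"1": 5, "4": 20}, options)   # illustrative counts, 25 runs

print(kl_divergence(original, original))  # 0.0
print(kl_divergence(original, perturbed) > 0)  # True
print(entropy(original) < entropy(to_probs({o: 1 for o in options}, options)))  # True
```

Smoothing matters in practice: with only 25 runs, an option that was never chosen under one condition would otherwise produce a division by zero or an infinite divergence.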

The larger models are better. Llama-3.3-70B and Gemini-1.5-Pro often reproduce their original response distributions in more than 50% of cases. The smaller Llama models perform much worse, with Llama-3.2-1B replicating original distributions in fewer than 5% of cases on average.

This is the first operational separation:

  • Robustness asks whether the model keeps its response distribution when the instrument changes.
  • Consistency asks whether the model repeats itself under the same instrument.

A model can be consistent and still wrong for your use case. It can repeatedly produce the same biased distribution. It can also be unstable in ways that look like diversity but are actually measurement error. Synthetic research products that report only one response per prompt are therefore hiding the most important part of the experiment: the distribution.

For operators, the minimum viable protocol is not “ask the model once.” It is “ask the model repeatedly under controlled perturbations and compare distributions.” Not glamorous, admittedly. But neither is data hygiene, and civilisation somehow persists.

Recency bias turns answer placement into an attitude

The clearest bias result is answer-order sensitivity. The authors set out to look for primacy bias, where early options receive disproportionate attention. Instead, they find a consistent recency bias across all nine models.

When answer options are reversed, models become much more likely to select the option now placed at the end of the list, even though it is semantically the same option that appeared first in the original version. The strongest reported case is Llama-3.1-8B, where the semantically same option is selected more than 20 times as often after being moved to the last position. Other models show smaller but still positive effects.

The point is not that the last answer is always chosen. The point is that position changes the distribution. In human oral surveys, recency effects are often associated with memory and attention. With LLMs, the paper does not prove an internal mechanism. It shows the behavioural result. The causal explanation may involve sequence position, learned survey patterns, instruction-following dynamics, decoding behaviour, or some mixture of these. Anyone claiming to know exactly which without further mechanistic evidence is decorating ignorance with technical vocabulary.

For business use, the behaviour is already enough. If a customer-insight team asks an LLM whether buyers prefer “lower price,” “better service,” “faster delivery,” or “premium design,” and then changes the option order, the model may not merely reformat the answer. It may change the apparent preference structure.

That is not insight. That is a prompt artefact.

The practical rule is simple: rotate answer options, reverse them, randomise them where appropriate, and measure whether conclusions survive. If they do not, you do not have synthetic customer sentiment. You have answer-order sensitivity.
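A minimal version of that audit, assuming responses have already been mapped back to option text. The counts are invented for illustration and the function names are hypothetical; nothing here comes from the paper’s data.

```python
from collections import Counter


def position_ratio(count_when_last: int, count_when_first: int) -> float:
    """How many times more often an option is picked when shown last vs first."""
    if count_when_first == 0:
        return float("inf") if count_when_last else 1.0
    return count_when_last / count_when_first


def top_choice_survives(original: Counter, reordered: Counter) -> bool:
    """True if the most frequent option is the same under both orderings."""
    return original.most_common(1)[0][0] == reordered.most_common(1)[0][0]


# Illustrative tallies over 25 runs each, keyed by option text.
# Original order lists "lower price" first; the reversed order lists it last.
original = Counter({"lower price": 2, "better service": 5,
                    "faster delivery": 6, "premium design": 12})
reversed_order = Counter({"lower price": 13, "better service": 5,
                          "faster delivery": 5, "premium design": 2})

print(position_ratio(reversed_order["lower price"], original["lower price"]))  # 6.5
print(top_choice_survives(original, reversed_order))  # False
```

In this toy case the “winning” attribute flips from premium design to lower price purely because of placement, which is exactly the outcome the audit exists to catch.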

Paraphrasing is not harmless polish

The paper’s non-bias perturbations are especially relevant to business teams because they resemble ordinary workflow edits. Someone cleans up a survey question. Someone rewrites it for a deck. Someone localises the language. Someone asks the model again and assumes the meaning is unchanged.

The study finds that paraphrasing the full question generally reduces robustness more than replacing individual words with synonyms. This is important because paraphrasing is often treated as a cosmetic change. In LLM survey work, it can be a measurement intervention.

Typos also matter, but not equally. Random character substitutions and keyboard-adjacent errors are more harmful to robustness than simple letter swaps. The authors suggest that letter swaps may be more common in human typing data, which could make models more resilient to them. That explanation is plausible, but the business point does not depend on it. The larger lesson is that models do not merely process the “meaning” of the question in a clean semantic vacuum. They react to form.
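The three typo families can be approximated as follows. This is a rough sketch, not the paper’s generator: the partial QWERTY neighbour map, the one-edit-per-call intensity, and the function names are all assumptions made for illustration.

```python
import random
import string

# Partial QWERTY neighbour map; an illustrative assumption, not the paper's table.
KEY_NEIGHBOURS = {"a": "qwsz", "e": "wrsd", "i": "ujko", "o": "iklp",
                  "t": "rfgy", "n": "bhjm", "s": "awedxz"}


def letter_swap(text: str, rng: random.Random) -> str:
    """Transpose one randomly chosen pair of adjacent letters."""
    idx = [i for i in range(len(text) - 1) if text[i].isalpha() and text[i + 1].isalpha()]
    if not idx:
        return text
    i = rng.choice(idx)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def random_substitution(text: str, rng: random.Random) -> str:
    """Replace one letter with a different, randomly chosen lowercase letter."""
    idx = [i for i, c in enumerate(text) if c.isalpha()]
    if not idx:
        return text
    i = rng.choice(idx)
    pool = [c for c in string.ascii_lowercase if c != text[i].lower()]
    return text[:i] + rng.choice(pool) + text[i + 1:]


def keyboard_typo(text: str, rng: random.Random) -> str:
    """Replace one letter with a neighbouring key from the partial map."""
    idx = [i for i, c in enumerate(text) if c.lower() in KEY_NEIGHBOURS]
    if not idx:
        return text
    i = rng.choice(idx)
    return text[:i] + rng.choice(KEY_NEIGHBOURS[text[i].lower()]) + text[i + 1:]


rng = random.Random(0)  # fixed seed so perturbed variants are reproducible
question = "How important is family in your life?"  # illustrative wording
for perturb in (letter_swap, random_substitution, keyboard_typo):
    print(perturb.__name__, "->", perturb(question, rng))
```

Seeding the generator is the important design choice: a perturbation audit is only repeatable if the perturbed instrument can be regenerated exactly.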

The combined perturbation is the stress test: paraphrase the question and reverse the answer options at the same time. This condition produces the lowest robustness across almost all models, with Phi-3.5-mini as the noted exception. That result matters because real business usage rarely changes only one thing. A survey item may be shortened, translated, reordered, embedded in a different instruction template, and paired with altered answer scales. The compound effect can be larger than any single edit.

The comfortable belief is: “The question still means the same thing.”

The operational replacement is: “Show me the perturbation audit.”

Scale design is not formatting; it is model control

The paper also tests what happens when answer scales change. Removing a refusal option tests opinion floating: if “don’t know” disappears, do uncertain responses move toward the centre? Adding or removing a middle category tests central tendency: if a neutral option exists, does the model overuse it?

The findings are mixed, and that is precisely why they are useful.

For opinion floating, Llama-3.3-70B, Gemini-1.5-Pro, and the much smaller Phi-3.5-mini are largely robust. Other models, especially Qwen2.5-7B and Llama-3.1-8B, show a weak tendency to shift toward the scale centre when the refusal option is removed. The paper does not find a universal opinion-floating effect across all models.

Central tendency is clearer in some settings. Several models, including Llama-3.3-70B, Gemini-1.5-Pro, and Mistral-7B, tend to shift their mean response closer to the centre when an explicit middle option is provided. The middle category is selected significantly more often than expected under a uniform distribution, especially as scale size increases. In 11-point scales, all tested models except Llama-3.2-1B choose the middle category significantly more often than any other option.
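To make “more often than expected under a uniform distribution” concrete, here is a small sketch of an exact binomial test plus a distance-to-centre measure, using only the standard library. The one-sided form of the test, the example counts, and the function names are assumptions for illustration; the paper’s exact test specification is not given in the text above.

```python
from math import comb
from typing import Sequence


def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def middle_overuse_pvalue(middle_count: int, total: int, n_options: int) -> float:
    """One-sided p-value against the null that every option is equally likely."""
    return binom_tail(middle_count, total, 1.0 / n_options)


def mean_distance_to_centre(responses: Sequence[int], low: int, high: int) -> float:
    """Average absolute distance of numeric responses from the scale midpoint."""
    centre = (low + high) / 2
    return sum(abs(r - centre) for r in responses) / len(responses)


# Illustrative case: on an 11-point scale, the middle option is chosen
# in 10 of 25 runs, against a uniform expectation of about 2.3.
p_value = middle_overuse_pvalue(middle_count=10, total=25, n_options=11)
print(p_value < 0.001)  # True

print(mean_distance_to_centre([1, 11], low=1, high=11))  # 5.0
```

A falling mean distance to the centre after a middle option is added is the signature of central tendency; the binomial test then asks whether the middle category’s share is plausibly just chance.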

This is where survey design becomes operationally dangerous. A neutral option can look like nuance. In synthetic LLM surveys, it may partly function as an attractor. A long Likert scale can look more precise. In the study, robustness generally declines as the number of answer options increases, with models less likely to reproduce original response distributions on 10-point scales than on shorter scales.

More options do not automatically mean more information. Sometimes they mean more ways for the model to wobble.

For commercial research, this matters when teams use LLMs to compare brand perceptions, willingness to pay, risk tolerance, employee sentiment, or policy preferences. A five-point scale and a ten-point scale may not be interchangeable. A neutral option may not merely capture neutrality. A removed refusal option may not merely force clarity. It may manufacture centrality.

Larger models buy stability, not immunity

The model-size finding will be popular because it is easy to purchase. Use larger models. Problem solved. Procurement departments adore this sort of conclusion because it turns epistemology into a vendor upgrade.

The paper supports only the first half. Larger models are generally more robust and more consistent. Llama-3.3-70B and Gemini-1.5-Pro do better than the smallest Llama variants. Smaller models show higher entropy, meaning more variability when asked the same question repeatedly. Llama-3.2-1B and Llama-3.2-3B are especially fragile in several conditions.

But size does not remove sensitivity. Even the larger models remain affected by answer order, paraphrasing, and combined perturbations. The paper’s recommendation to use larger models is sensible, but it is not a licence to skip validation.

The better business framing is this:

Decision | What the paper supports | What Cognaptus infers
Use larger models for synthetic surveys | Larger models tend to be more robust and consistent | Model choice should be treated as part of survey-method design, not just IT procurement
Prefer shorter scales when possible | Robustness falls as answer-option scale length increases | Avoid pretending that 10-point synthetic scales automatically give higher precision
Test answer order | All models show recency bias | Any synthetic ranking or preference survey needs option-order auditing
Test paraphrases | Paraphrasing can reduce robustness more than synonym replacement | Reworded prompts should be versioned and regression-tested
Monitor refusal and invalid responses | Some models and topics produce high non-response rates | Guardrails and model-specific restrictions can distort synthetic samples
Use forced-choice prompting carefully | It improves valid answer extraction | It may constrain models into artificial choices that do not generalise to open-ended interaction

That last row matters. The study uses forced-choice prompting because open-ended prompting did not reliably yield valid labels. This is a reasonable implementation decision. It is also a boundary. Forced-choice formats make statistical analysis possible, but they also turn a generative system into a multiple-choice respondent. The output is useful only within that measurement frame.

Refusals and invalid answers are not noise; they are evidence

The paper reports that, overall, 96% of interviews yielded an extractable valid answer from the given options. That sounds reassuring. Then the model-level detail arrives and ruins the mood, as details often do.

Llama-3.3-70B, Phi-3.5-mini, and Mistral-7B are generally reliable in producing on-scale answers, with non-response rates typically below 10%. Qwen2.5-7B and Llama-3.1-8B often exceed 30% non-response when invalid outputs and explicit “don’t know” selections are combined. On questions about perception of elections, Qwen2.5-7B fails to provide a valid on-scale answer in 91.3% of cases even under the original unperturbed version.
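A non-response rate of this kind can be computed with a strict label parser. The sketch below is one possible approach, not the paper’s regex: the rule that a reply must contain exactly one distinct number, the convention that the refusal option carries its own label, and the function names are all assumptions.

```python
import re
from typing import Iterable, Optional


def extract_label(raw: str, n_options: int) -> Optional[int]:
    """Return the option number if the reply names exactly one valid label."""
    numbers = {int(tok) for tok in re.findall(r"\b\d+\b", raw)}
    if len(numbers) != 1:
        return None  # no number at all, or an ambiguous reply such as "1 or 2"
    label = numbers.pop()
    return label if 1 <= label <= n_options else None


def non_response_rate(replies: Iterable[str], n_options: int,
                      refusal_label: Optional[int] = None) -> float:
    """Share of replies that are invalid or that select the refusal option."""
    replies = list(replies)
    if not replies:
        return 0.0
    missing = 0
    for raw in replies:
        label = extract_label(raw, n_options)
        if label is None or label == refusal_label:
            missing += 1
    return missing / len(replies)


# Illustrative replies on a 5-option item where option 5 is "don't know".
replies = ["1", "Answer: 5.", "I cannot answer that.", "2"]
print(non_response_rate(replies, n_options=5, refusal_label=5))  # 0.5
```

Tracking this rate per model and per topic, rather than once overall, is what exposes cases like the election questions described above.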

That result should not be treated as a side note. In synthetic survey work, non-response is part of the response process. A model that refuses or fails disproportionately on sensitive topics is not merely less convenient. It changes the shape of the synthetic population.

For business teams, this appears in less dramatic forms. A model may answer product-preference questions smoothly but behave differently around finance, health, labour, politics, demographics, safety, or legal liability. If a synthetic panel under-responds in precisely the areas where human opinion is most contentious, it may produce the illusion of consensus.

The operational response is not to remove refusal options blindly. The paper recommends reflecting carefully on refusal categories because they can reveal guardrails or restrictions. In other words, “don’t know” is not just a nuisance category. It can be a diagnostic instrument.

The appendix is not extra decoration

The appendix results are not a second thesis. They are robustness infrastructure.

Figure 4 extends the main robustness heatmaps to non-bias perturbations, showing model-specific differences and confirming that the smallest models are especially fragile. Figures 5 and 6 unpack the central-tendency and opinion-floating calculations, including distance-to-centre measures and binomial tests for middle-category selection. Figure 8 examines refusal and invalid responses, which helps interpret whether response distributions are reliable in the first place. Figure 9 studies entropy and shows that inconsistency is largely model-specific rather than perturbation-specific. Figures 10 and 11 compare robustness across scale sizes, supporting the finding that longer scales tend to reduce reproducibility.

For an operator, the appendix teaches a workflow:

  1. Test whether the model gives extractable answers.
  2. Test whether it repeats itself under the same prompt.
  3. Test whether it survives wording changes.
  4. Test whether answer order changes conclusions.
  5. Test whether scale length changes conclusions.
  6. Test whether refusal behaviour varies by topic.
  7. Only then consider using the synthetic output as directional evidence.
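The ordering of those steps is itself a design decision: later checks are meaningless if earlier ones fail. A sketch of that gating logic, assuming each check is a hypothetical zero-argument callable that returns `True` on pass (the check names and placeholder results below are illustrative only):

```python
from typing import Callable, List, Optional, Tuple

Check = Tuple[str, Callable[[], bool]]


def first_failed_check(checks: List[Check]) -> Optional[str]:
    """Run checks in order and stop at the first failure; None means all passed."""
    for name, passed in checks:
        if not passed():
            return name
    return None


# Placeholder results standing in for real audit computations.
checks: List[Check] = [
    ("extractable answers", lambda: True),
    ("repeated-run consistency", lambda: True),
    ("wording changes", lambda: True),
    ("answer order", lambda: False),  # pretend the reversal audit failed
    ("scale length", lambda: True),
    ("refusal behaviour by topic", lambda: True),
]

failed = first_failed_check(checks)
print(failed or "usable as directional evidence")  # answer order
```

Stopping at the first failure keeps the report honest: a panel that fails the answer-order audit should not be described as having passed the scale audit.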

That sequence is less glamorous than “AI replaces polling.” It is also less likely to embarrass everyone involved.

What this means for synthetic market research

The paper directly shows that LLM responses to closed-ended normative survey questions are vulnerable to prompt perturbations. It does not show that all LLM-based market research is useless. It does not show that synthetic panels can never be informative. It does not show that human panels are pure and unbiased. Anyone who has read a real survey dataset knows better.

Cognaptus’ inference is narrower and more useful: LLM-based survey systems should be governed like measurement pipelines.

That means synthetic survey vendors and internal analytics teams should version their prompts, store answer-option order, document scale design, run perturbation tests, benchmark multiple models, and report uncertainty induced by prompt variants. If the conclusion changes when the answer list is reversed or the question is paraphrased, the output should be labelled as unstable.
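One way to make that governance concrete is to fingerprint every instrument variant and to label a result by whether its headline survives across variants. The record fields, the 12-character hash prefix, the model identifier string, and the function names below are illustrative assumptions, not a prescribed schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SurveyItemVersion:
    """Everything that defines one measurement: wording, option order, model, template."""
    question: str
    options: Tuple[str, ...]  # order is part of the instrument
    model_id: str
    template: str = "forced-choice-v1"  # hypothetical template name

    def fingerprint(self) -> str:
        """Short stable hash so any silent change to the instrument is visible."""
        payload = json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def stability_label(top_choice_by_variant: Dict[str, str]) -> str:
    """'stable' only if every variant yields the same most frequent option."""
    return "stable" if len(set(top_choice_by_variant.values())) == 1 else "unstable"


base = SurveyItemVersion(
    question="Which matters most when choosing a supplier?",  # illustrative
    options=("lower price", "better service", "faster delivery", "premium design"),
    model_id="example-model-2025-01",  # hypothetical identifier
)
flipped = SurveyItemVersion(base.question, base.options[::-1], base.model_id)

print(base.fingerprint() != flipped.fingerprint())  # True
print(stability_label({"original": "premium design", "reversed": "lower price"}))  # unstable
```

Including `model_id` in the fingerprint also gives a place to record model updates, so a changed respondent shows up as a changed instrument rather than as a mysterious shift in results.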

A practical validation protocol might look like this:

Validation layer | Business question it answers | Failure signal
Repeated-run consistency | Does the model reproduce its own distribution? | High entropy across identical runs
Answer-order audit | Does option placement drive conclusions? | Preference rankings change after reversal
Paraphrase audit | Does wording change the measured attitude? | Large distribution shift under semantic rewording
Scale audit | Does scale length or neutral-option design alter results? | Centre migration or loss of robustness on longer scales
Refusal audit | Are sensitive topics under-answered? | Topic-specific invalid or “don’t know” spikes
Model comparison | Is the result vendor- or model-specific? | Different models imply different decisions
Human calibration | Does synthetic output align with real target data? | No known ground-truth anchor

The last row is the one people will want to skip. They should not.

Synthetic respondents are most defensible when used for survey pre-testing, prompt sensitivity analysis, early hypothesis generation, questionnaire debugging, and scenario exploration. They are least defensible when used as a cheap substitute for real human evidence in decisions that affect customers, citizens, patients, employees, or regulated financial behaviour.

LLMs can help you test the instrument. They should not silently become the population.

The boundary: closed-ended normative surveys under forced choice

The paper’s limitations are not boilerplate. They shape how the findings should be used.

First, the study focuses on closed-ended normative survey questions. The results may not generalise to open-ended interviews, deliberative reasoning tasks, persona-grounded simulations, or agentic workflows where the model can ask clarifying questions.

Second, the prompting setup is intentionally constrained. The authors use forced-choice prompts to obtain valid labels. They do not test persona prompting, chain-of-thought-style deliberation, or other strategies that might change response behaviour. Persona prompting could increase contextual consistency, but it could also introduce a new layer of prompt sensitivity. That remains future work, not a solved escape hatch.

Third, the perturbations are applied at fixed intensity. A severe paraphrase and a mild paraphrase are not the same intervention. The paper manually validates generated paraphrases and scale transformations, but more granular human validation could improve semantic assurance.

Fourth, the study uses instruction-tuned models, not base models. It also keeps temperature at default settings rather than varying it systematically. That means the results are about the tested deployment-like model configurations, not every possible generation regime.

Fifth, closed-source models can change over time. Gemini results today may not match Gemini results after a silent model update. This is an underrated problem for synthetic research. If the respondent changes without notice, longitudinal comparability becomes theatre.

These boundaries do not weaken the paper’s practical value. They prevent overreach. The study should not be used to declare synthetic surveys dead. It should be used to demand that anyone selling or using them shows their robustness work.

The replacement belief: synthetic surveys are instruments, not respondents

The misconception to retire is simple: if the meaning of the question is preserved, LLM-generated survey answers should be stable enough to proxy human attitudes.

The replacement belief is more disciplined: LLM-generated survey answers are outputs of a prompt-conditioned measurement process. That process can be useful, but only if its sensitivity is tested.

This is not a philosophical nicety. It changes procurement, methodology, and product design. A synthetic-panel tool should not merely report “62% prefer option B.” It should report whether that preference survives option reversal, paraphrase, scale changes, refusal-option variation, and model substitution. A customer-insight workflow should not treat prompt editing as harmless copy work. It should treat it as an experimental change. A policy simulation should not use synthetic responses where real affected communities are necessary, then congratulate itself for efficiency. Synthetic silence is still silence.

The paper’s uncomfortable lesson is that LLMs may reproduce not only the content of human opinion, but also the survey artefacts through which opinion is measured. Sometimes the echo chamber is not in the training data. Sometimes it is in the answer list.

For operators, that is useful. A fragile instrument can still be valuable if calibrated. An uncalibrated one just produces confidence at machine speed.

Cognaptus: Automate the Present, Incubate the Future.


  1. Jens Rupprecht, Georg Ahnert, and Markus Strohmaier, “Prompt Perturbations Reveal Human-Like Biases in Large Language Model Survey Responses,” arXiv:2507.07188v3, 2025.