Cultural Alignment: When Prompts Stop Being Instructions and Start Being Policy

A prompt is usually treated as a small operational detail. Someone writes it, someone tests it, someone pastes it into a workflow, and then everyone pretends the wording is just a user-interface choice.

That fiction becomes expensive when the prompt sits inside a compliance workflow, a policy-support tool, a market research assistant, or an internal audit system. In those settings, the model is not merely choosing words. It is deciding what kind of answer feels reasonable, what kind of trade-off deserves attention, and what kind of social assumption can pass quietly as common sense.

The paper behind today’s discussion, Prompt Programming for Cultural Bias and Alignment of Large Language Models, makes that issue measurable in a useful way.¹ It does not ask whether LLMs have “culture” in any mystical sense. It asks a cleaner question: when open-weight models answer survey-style questions used in cross-cultural research, where do their answers land relative to real country-level human survey benchmarks?

Then it asks the more operational question: if a generic model has a cultural prior, can a better prompt move it closer to a target population? And if a hand-written country prompt helps, can prompt programming help more?

The important distinction is this: the paper is not mainly about making models sound locally polite. It is about whether cultural alignment can be treated as a measurable engineering problem. That is where the business relevance begins.

The comparison that matters: no prompt, manual prompt, optimized prompt

The easiest bad reading of this paper is: “LLMs are Western-biased, so add a country persona.” That is neat, memorable, and not quite wrong. It is also the kind of conclusion that lets a localization team declare victory after adding one line: “You are a citizen of X.”

The paper shows why that is insufficient.

Its structure is built around three regimes:

| Regime | What changes | What it tests | Business interpretation |
| --- | --- | --- | --- |
| Generic prompting | The model answers survey items without country conditioning | The model’s default cultural prior | What your system may do when localization is absent or weak |
| Manual cultural prompting | The prompt adds a country identity such as “You are a citizen of X” | Whether simple persona conditioning reduces distance from human country benchmarks | The baseline version of “localized AI” |
| DSPy prompt programming | The cultural instruction becomes an optimizable prompt component | Whether prompts can be compiled against a cultural-distance objective | A more systematic alignment layer, but not a magic wand |

This comparison is more useful than a conventional paper summary because it maps directly onto business choices. A firm deploying an LLM across several countries must decide whether to rely on the model default, write country prompts manually, or build an evaluation-and-optimization loop. Those are not academic categories. They are deployment strategies.

The paper’s answer is not “always optimize everything.” It is more precise: generic prompting is culturally compressed; manual country prompting helps; DSPy-based prompt programming can improve further, especially under the right optimizer and proposer model, but its gains are uneven across models and countries.

A slightly annoying conclusion, therefore, but a useful one: cultural alignment is not a checkbox. It is a measurement loop.

How the paper turns culture into something a model can be scored against

The authors build on earlier work using the Integrated Values Surveys, which harmonize World Values Survey and European Values Study data. They focus on the Inglehart–Welzel cultural map, a two-dimensional representation often described through two axes: Survival versus Self-Expression values, and Traditional versus Secular values.

For the model, the process is intentionally constrained. The authors do not ask the model to write essays about culture. They ask it to answer ten survey items used in constructing the cultural map. These include questions related to happiness, trust, authority, petition signing, national pride, and autonomy. The model must return fixed survey-style answers, not explanations.

That design matters. It reduces the problem from open-ended cultural interpretation to a comparable measurement task. Each model’s answer vector is standardized using the survey-derived benchmark, projected into the same two-dimensional cultural space, and then compared with country or territory reference points derived from human survey responses.
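The standardize-and-project step can be sketched in a few lines. This is a minimal illustration that assumes a plain linear projection; the function name `project_to_map`, the item means, standard deviations, and loadings are all hypothetical placeholders, not the survey-derived values the paper uses, and it uses three items instead of ten to stay readable.

```python
from typing import Sequence, Tuple


def project_to_map(
    answers: Sequence[float],
    item_means: Sequence[float],
    item_stds: Sequence[float],
    loadings: Sequence[Tuple[float, float]],
) -> Tuple[float, float]:
    """Standardize each survey answer, then form one weighted sum per map axis."""
    if not (len(answers) == len(item_means) == len(item_stds) == len(loadings)):
        raise ValueError("all inputs need exactly one entry per survey item")
    # z-score each answer against the (assumed) human survey benchmark statistics
    z_scores = [(a - m) / s for a, m, s in zip(answers, item_means, item_stds)]
    # each item contributes to both axes according to its (assumed) loading pair
    axis_1 = sum(z * w[0] for z, w in zip(z_scores, loadings))
    axis_2 = sum(z * w[1] for z, w in zip(z_scores, loadings))
    return (axis_1, axis_2)


# Made-up numbers for three items, purely for illustration.
answers = [3.0, 1.0, 2.0]
item_means = [2.0, 1.5, 2.0]
item_stds = [1.0, 0.5, 1.0]
loadings = [(0.5, 0.0), (0.0, 1.0), (0.25, 0.25)]

coord = project_to_map(answers, item_means, item_stds, loadings)
print(coord)  # (0.5, -1.0)
```

The point of the sketch is only that a fixed answer vector becomes a fixed coordinate once the benchmark statistics and loadings are held constant, which is what makes model and human positions comparable.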

Alignment becomes distance:

$$ \text{cultural distance} = \left\lVert \text{model coordinate} - \text{human country benchmark} \right\rVert_2 $$

Smaller distance means the model response profile is closer to that country’s survey-derived value profile.
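As a concrete reading of the formula, the distance is the ordinary Euclidean distance between two points on the two-dimensional map. The sketch below uses invented coordinates, not values from the paper; `cultural_distance` is a name chosen here for illustration.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def cultural_distance(model_coord: Point, human_benchmark: Point) -> float:
    """Euclidean (L2) distance between a model projection and a human reference point."""
    return math.dist(model_coord, human_benchmark)


# Invented coordinates: the difference vector is (3, 4), so the distance is 5.
model_coord = (1.5, 0.5)
benchmark = (-1.5, -3.5)

d = cultural_distance(model_coord, benchmark)
print(d)  # 5.0
```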

This is not a complete theory of culture. It is a measurement instrument. That difference is important. The instrument is narrow enough to be tested and repeated, but narrowness also creates boundaries. A model that aligns better on forced-choice survey items may not behave better in open-ended strategy memos, HR policy drafts, legal summaries, or multi-turn advisory systems.

Still, as a diagnostic, the setup is valuable. Many AI governance conversations die in the swamp of vague abstractions: fairness, neutrality, inclusiveness, respect. Here the authors define a target space, generate model coordinates, compute distances, and compare interventions. One may disagree with the instrument. At least the instrument exists. That is already progress; not glamorous progress, but progress.

Generic open-weight models cluster into a narrow cultural footprint

The first finding is a replication-and-extension result. Earlier work found that proprietary LLMs, under generic prompting, cluster near Western value profiles. This paper asks whether the same pattern appears in open-weight models.

The authors test five open-weight models: Llama 3.3 70B, Llama 4 16x17B, Gemma 3 27B, GPT-OSS 20B, and GPT-OSS 120B. Under generic prompting, all five model projections fall into a relatively tight region of the cultural map. They do not spread across the full distribution of country and territory benchmarks.

That is the first business-relevant point. Model openness does not automatically imply cultural plurality. Different architectures, scales, and training regimes shift the exact point slightly, but the dominant pattern remains a compressed default profile. The paper describes this as a shared orientation toward the high self-expression side of the map and away from many dense country and territory groupings.

For enterprises, the practical reading is simple. If your system uses a general-purpose LLM with weak localization, it is not culturally blank. It carries a default prior. That prior may be reasonable for some settings and badly mismatched for others.

This matters most in workflows where the model is asked to prioritize, classify, or justify. A summarizer deciding which stakeholder concerns are salient, a compliance assistant deciding what seems “material,” or a policy drafting tool deciding which trade-offs deserve emphasis may quietly inherit the model’s default value profile. The output will look professional. That is the dangerous part. Professional prose is very good at laundering assumptions.

Manual country prompting helps, but it does not close the gap

The second regime is manual cultural prompting. The prompt adds a lightweight country identity, using a phrase of the form: “You are a citizen of X.” The survey question and answer constraints remain otherwise fixed.

This helps. In the paper’s distance distributions, manual country prompting substantially reduces cultural distance across the evaluated models. It also narrows dispersion. In plain language: telling the model which country to simulate moves its answers closer to the corresponding human benchmark.

That is encouraging, but not surprising. A country label is a strong cue. Models have learned many associations between countries and social, political, religious, and economic patterns. The cue activates part of that learned structure.

The more important result is that manual prompting does not solve the problem. Nontrivial tails remain. Some countries remain harder to align. The effect is uneven across the cultural map.

This is where a common enterprise misconception needs correction. A country prompt is not the same as cultural alignment. It is a cheap intervention that often helps. Cheap interventions are excellent. They are also famous for being mistaken for solutions.

A manual country prompt has at least three weaknesses.

First, it depends on wording. Slightly different persona descriptions can move outputs, which is why the paper averages over semantically similar respondent descriptors to reduce prompt-phrasing variance.
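One way to picture the averaging step: run the same survey under several paraphrased respondent descriptors and average the resulting map coordinates. The sketch below is an assumption about how such averaging could look, not the paper's code; the descriptor list is hypothetical, and `fake_respond` is a stand-in for "prompt the model, collect the ten answers, project them onto the map".

```python
from statistics import fmean
from typing import Callable, List, Tuple

Point = Tuple[float, float]

# Hypothetical paraphrases of the respondent descriptor (not the paper's list).
DESCRIPTORS: List[str] = [
    "You are a citizen of {country}.",
    "You are a resident of {country}.",
    "You are a person living in {country}.",
]


def averaged_coordinate(country: str, respond: Callable[[str], Point]) -> Point:
    """Average the projected coordinate over all descriptor paraphrases."""
    coords = [respond(template.format(country=country)) for template in DESCRIPTORS]
    return (fmean([c[0] for c in coords]), fmean([c[1] for c in coords]))


def fake_respond(prompt: str) -> Point:
    """Stand-in for a real model call; returns fixed coordinates per wording."""
    if "citizen" in prompt:
        return (1.0, 2.0)
    if "resident" in prompt:
        return (2.0, 2.0)
    return (3.0, 5.0)


avg = averaged_coordinate("Jordan", fake_respond)
print(avg)  # (2.0, 3.0)
```

With a real model behind `respond`, the spread of the individual coordinates is itself worth logging, since it shows how much of the measured position is wording rather than country.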

Second, it assumes the model has enough internal representation of the target population to respond appropriately. For countries or groups that are underrepresented, stereotyped, or unevenly covered in training data, the prompt may produce a shallow approximation.

Third, it does not optimize against the actual benchmark. It is a human-crafted cue, not a measured adjustment. The prompt writer may feel culturally sensitive. The distance metric may disagree. The metric is rude in a productive way.

DSPy changes the question from “what should we write?” to “what reduces distance?”

The paper’s second contribution is to introduce prompt programming with DSPy into this cultural alignment setting.

The conceptual move is straightforward. Instead of treating the cultural prompt as a fixed sentence, the authors treat it as a parameter in a prompt program. DSPy then searches for instructions that improve an explicit objective: reducing cultural distance to human country benchmarks.

This is the shift from prompt engineering to prompt programming.

Prompt engineering asks: what instruction sounds right?

Prompt programming asks: what instruction scores better under a defined metric?

The paper evaluates two DSPy teleprompters: COPRO and MIPROv2. COPRO performs more conservative instruction-level refinement. MIPROv2 conducts a broader multi-stage search and can use Bayesian optimization over candidate instructions and demonstrations. The authors also vary the instruction-proposal model, comparing a small proposer, Llama 3.2 1B, with a larger proposer, GPT-OSS 120B. The target model still produces the survey responses; the proposer generates candidate prompt instructions.

That separation is operationally important. A company does not necessarily need to change the deployed model to improve the instruction layer. It can use a stronger model to propose and optimize prompts for a target system, then evaluate whether those prompts improve measured alignment.
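The proposer/target split can be illustrated with a toy search loop. To be clear about what this is: it does not use DSPy and is not how COPRO or MIPROv2 work internally. It is a deliberately simple "propose candidates, score each against the distance metric, keep the best" loop in plain Python, with a contrived `toy_target_model` and `toy_proposer` standing in for real model calls. All names and numbers are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float]
TargetModel = Callable[[str, str], Point]  # (instruction, country) -> map coordinate
Proposer = Callable[[str], List[str]]      # baseline instruction -> candidate instructions


def mean_distance(instruction: str, countries: Sequence[str],
                  benchmarks: Dict[str, Point], target_model: TargetModel) -> float:
    """Average cultural distance of the target model under one instruction."""
    distances = [math.dist(target_model(instruction, c), benchmarks[c]) for c in countries]
    return sum(distances) / len(distances)


def optimize_instruction(baseline: str, propose: Proposer, countries: Sequence[str],
                         benchmarks: Dict[str, Point],
                         target_model: TargetModel) -> Tuple[str, float]:
    """Keep whichever instruction (baseline or candidate) scores the lowest distance."""
    best = baseline
    best_score = mean_distance(baseline, countries, benchmarks, target_model)
    for candidate in propose(baseline):
        score = mean_distance(candidate, countries, benchmarks, target_model)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score


# Contrived stand-ins so the loop can run without any model.
BENCHMARKS: Dict[str, Point] = {"Country A": (0.0, 0.0), "Country B": (2.0, 0.0)}


def toy_target_model(instruction: str, country: str) -> Point:
    # Pretend that instructions mentioning "values" land closer to the benchmark.
    offset = 1.0 if "values" in instruction else 3.0
    bx, by = BENCHMARKS[country]
    return (bx + offset, by)


def toy_proposer(baseline: str) -> List[str]:
    return [baseline + " Answer briefly.", baseline + " Reflect the values common there."]


baseline = "You are a citizen of the target country."
best, best_score = optimize_instruction(
    baseline, toy_proposer, list(BENCHMARKS), BENCHMARKS, toy_target_model
)
print(best_score)  # 1.0
```

The design point carried over from the paper is the separation of roles: the proposer only writes candidate instructions, while every score comes from the target model that would actually be deployed.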

The strongest configuration in the paper is MIPROv2 with the GPT-OSS 120B proposer. This setting provides the largest additional reductions beyond manual prompting for every tested model except Llama 3.3. For Llama 4, all DSPy variants improve over manual prompt engineering. For Gemma 3 and the GPT-OSS target models, gains are more selective: MIPROv2 with the GPT-OSS 120B proposer is the consistently stronger configuration, while gains from the other DSPy setups are smaller or negligible.

That pattern is useful because it prevents two lazy conclusions. The first lazy conclusion is “prompt optimization always beats human prompts.” The paper does not show that. The second lazy conclusion is “manual prompts are enough.” The paper does not show that either.

The better interpretation is comparative: optimization can add value when the search procedure and proposer model are capable enough, but cultural alignment remains model-dependent and country-dependent.

The country-level plots show why averages are not enough

Figure 3 is especially useful because it disaggregates the effect by country or territory for GPT-OSS 120B aligned with MIPROv2 using the GPT-OSS 120B proposer. Each panel shows three points: the generic model projection, the aligned country-conditioned projection, and the human reference point. The arrow shows the model’s movement after alignment, and the dashed segment shows the remaining gap.

The key point is asymmetry.

For Western or Western-adjacent countries, the improvement can be small because the generic model is already relatively close. The United States, for example, shows a modest improvement of $\Delta = +0.142$. New Zealand is almost unchanged at $\Delta = -0.018$ and Canada shows $\Delta = -0.052$; in both cases the negative sign means the aligned result is slightly worse under that metric.

For countries farther from the model’s generic position, the movements can be much larger. Jordan shows $\Delta = +4.289$. Egypt shows $\Delta = +4.055$. Qatar shows $\Delta = +3.962$. Ghana, Algeria, Ethiopia, Zimbabwe, and Bangladesh also show large positive improvements above $+3.5$ in the figure.

This is where the paper becomes more than another “bias exists” study. It shows that cultural alignment is not equally difficult everywhere. If the model begins near one region of the cultural map, some countries require only local adjustment while others require large relocation.

That distinction matters for product teams. A global deployment plan that reports only average alignment improvement may hide the cases where the system remains weak. The worst failures may not appear in the aggregate because the model performs acceptably in markets culturally closer to its default prior.

A good evaluation dashboard would therefore not stop at “average distance reduced by X.” It would include country-level distance, improvement, and residual gap. Otherwise, the global average becomes the usual managerial sedative: comforting, smooth, and occasionally harmful.
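A per-country report of that kind is cheap to build. The sketch below assumes that improvement is defined as generic distance minus aligned distance, so positive values mean the model moved closer, which matches how the signs are read above. The country names and distances are placeholders, not figures from the paper.

```python
from typing import Dict, List, NamedTuple


class CountryRow(NamedTuple):
    country: str
    generic_distance: float
    aligned_distance: float  # this is also the residual gap after alignment
    improvement: float       # assumed: generic - aligned; positive means closer


def country_report(generic: Dict[str, float], aligned: Dict[str, float]) -> List[CountryRow]:
    """One row per country, sorted so the largest residual gap comes first."""
    rows = [
        CountryRow(c, generic[c], aligned[c], generic[c] - aligned[c])
        for c in sorted(generic)
    ]
    return sorted(rows, key=lambda r: r.aligned_distance, reverse=True)


# Placeholder distances for three unnamed countries.
generic = {"Country A": 0.5, "Country B": 4.0, "Country C": 3.0}
aligned = {"Country A": 0.75, "Country B": 1.0, "Country C": 2.5}

report = country_report(generic, aligned)
mean_improvement = sum(r.improvement for r in report) / len(report)
regressions = [r.country for r in report if r.improvement < 0]

for row in report:
    print(f"{row.country}: residual={row.aligned_distance:.2f} improvement={row.improvement:+.2f}")
print(f"mean improvement: {mean_improvement:+.2f}, regressions: {regressions}")
```

In this toy data the average improvement is positive even though one country got worse and another keeps a large residual gap, which is exactly the pattern an aggregate number hides.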

What the experiments support, and what they do not

The paper contains several kinds of evidence. Mixing them together would make the result look more sweeping than it is, so it is worth separating their likely purpose.

| Evidence or test | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| IVS/WVS/EVS cultural map replication | Benchmark construction and comparison with prior work | A shared coordinate system for human country benchmarks and model projections | A complete account of culture |
| Generic model projections in Figure 1 | Main evidence for default cultural skew | Open-weight models cluster in a narrow region rather than spanning global country distributions | That every real deployment output will show the same pattern |
| Three-regime distance distributions in Figure 2 | Main evidence comparing interventions | Manual prompting reduces distance; DSPy can improve further under some configurations | That prompt optimization always beats manual prompting |
| COPRO versus MIPROv2 comparison | Ablation/comparison of optimization strategies | Broader optimization with a stronger proposer is more consistently effective in this setup | That MIPROv2 is universally best for all cultural alignment tasks |
| Small versus large proposer model | Sensitivity test on optimization capability | Candidate instruction quality matters; GPT-OSS 120B proposer is stronger here | That only very large proposer models are viable in production |
| Five-fold country cross-validation | Robustness/generalization test | The authors are not merely optimizing on one fixed country set | That downstream business workflows will generalize automatically |
| Per-country movement panels in Figure 3 | Diagnostic/exploratory evidence | Alignment gains differ sharply by country and residual gaps remain | That every positive movement is socially desirable in a policy sense |

The last column is not academic hair-splitting. It is where many AI adoption errors begin. A benchmark improvement is not a governance decision. It is evidence that an intervention moved a model closer to a defined target. Whether that target is the right one for a specific business process is a separate question.

The business value is diagnosis before deployment, not cultural decoration after launch

For business users, the paper is most useful when translated into an operational workflow.

The old localization workflow looks like this:

1. Translate interface text.
2. Add local examples.
3. Add a country-specific prompt.
4. Hope nothing embarrassing happens.

The paper implies a stricter workflow:

1. Define the target population or stakeholder group.
2. Select or build a benchmark for relevant values, preferences, or decision criteria.
3. Measure the model’s default distance from that benchmark.
4. Test manual prompting against the same metric.
5. Use prompt optimization only when it improves held-out targets, not just the training set.
6. Audit country-level residual gaps before deployment.
7. Re-test on downstream tasks, because survey alignment is only the diagnostic layer.
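The held-out requirement in that workflow can be made mechanical. The sketch below splits countries into folds and accepts an optimized prompt only if it beats the manual prompt on countries it was not tuned on. The paper reports five-fold country cross-validation, but the split procedure, the function names, and the acceptance rule here are assumptions made for illustration, and the distances are invented.

```python
import random
from typing import Dict, List, Sequence


def country_folds(countries: Sequence[str], k: int = 5, seed: int = 0) -> List[List[str]]:
    """Shuffle countries reproducibly and deal them into k roughly equal folds."""
    shuffled = list(countries)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def passes_held_out_gate(manual: Dict[str, float], optimized: Dict[str, float],
                         held_out: Sequence[str]) -> bool:
    """Accept the optimized prompt only if its mean held-out distance is lower."""
    manual_mean = sum(manual[c] for c in held_out) / len(held_out)
    optimized_mean = sum(optimized[c] for c in held_out) / len(held_out)
    return optimized_mean < manual_mean


countries = [f"Country {i}" for i in range(10)]
folds = country_folds(countries, k=5)

# Invented distances: the optimized prompt is uniformly better in this toy case.
manual = {c: 2.0 for c in countries}
optimized = {c: 1.5 for c in countries}

held_out = folds[0]  # tune on the other four folds, judge on this one
print(passes_held_out_gate(manual, optimized, held_out))  # True
```

A stricter gate could also require that no held-out country regresses beyond a tolerance, which ties this step to the residual-gap audit in step 6.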

This is not just cultural sensitivity training for chatbots. The same logic applies to any system where model outputs influence prioritization or judgment.

Consider a multinational compliance team. A generic model may summarize regulatory risk in a way that reflects assumptions common in its dominant training context. A country prompt may improve local framing. But if the firm cannot measure whether the output is closer to local legal, social, or institutional expectations, it is still mostly guessing.

Or consider market-entry analysis. A model asked to evaluate consumer attitudes, institutional trust, labor expectations, or public-sector acceptance may default to a worldview that fits some markets better than others. Prompt programming cannot substitute for field knowledge. But it can expose whether the prompt layer is moving the model in the intended direction.

The operational benefit is not that companies can automate culture. That would be a terrible sales pitch, and possibly a confession. The benefit is cheaper diagnosis: before using an LLM in a local decision-support workflow, a firm can test whether the system’s assumptions are measurably misaligned with the target population.

Cultural alignment is not automatically the same as ethical alignment

One boundary deserves special treatment. The paper measures distance from country-level human survey benchmarks. Moving closer to a benchmark means the model better matches observed survey responses. It does not necessarily mean the output is more ethical, more legally acceptable, or better aligned with universal rights principles.

This is not a flaw in the paper. It is a boundary of the objective.

Cultural alignment answers one question: does the model reflect the values or response patterns of a target population more closely?

AI governance often needs additional questions: should the system reflect those values in this context? Are there legal constraints? Are there human-rights commitments? Are there corporate policy boundaries? Are vulnerable groups affected?

For a marketing localization assistant, closer cultural fit may be a clear benefit. For a public-policy assistant, a hiring tool, or a moderation system, the answer may be more complicated. Sometimes a model should understand a local value pattern without reproducing it as a recommendation.

This is why prompt programming should be treated as part of governance, not a replacement for it. It can optimize toward a target. It cannot decide whether the target is legitimate.

No optimizer comes with a moral philosophy module. Product managers keep trying to outsource that part, and the universe keeps refusing.

The paper’s real contribution is methodological discipline

The paper’s most important contribution is not the discovery that models are culturally skewed. That result is now familiar. The stronger contribution is the disciplined comparison across intervention layers.

Generic prompting reveals the default prior.

Manual prompting tests whether an obvious localization cue helps.

DSPy prompt programming tests whether an optimized instruction layer can improve on the manual cue under a measurable objective.

That sequence is the useful lesson for AI teams. Do not start with optimization. Start with diagnosis. Do not trust a country prompt because it sounds sensible. Compare it against a benchmark. Do not celebrate an average improvement until you inspect the country-level residuals. Do not assume survey-style alignment transfers to open-ended workflows. Validate downstream.

In other words, the article’s title is literal. Prompts stop being mere instructions when they determine which value system is activated inside a decision-support pipeline. At that point, prompts behave like policy: they encode priorities, shape acceptable reasoning, and define what kind of answer the system is trying to produce.

And once prompts become policy, they need the boring things policy always needs: measurement, versioning, audit trails, exception handling, and someone willing to ask who benefits from the chosen default.
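For the versioning and audit-trail part, even a very small record type goes a long way. The sketch below is one possible shape, not something the paper prescribes: the `PromptPolicyRecord` fields are hypothetical, and the version identifier is simply a truncated SHA-256 hash of the record's contents, so any change to the prompt or its stated target produces a new identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PromptPolicyRecord:
    """Hypothetical audit record for one deployed prompt."""
    prompt_text: str
    target_population: str
    benchmark_name: str
    mean_distance: float
    approved_by: str

    def version_id(self) -> str:
        """Content-derived identifier: same fields give the same id."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


record_v1 = PromptPolicyRecord(
    prompt_text="You are a citizen of X.",
    target_population="Country X, general adult population",
    benchmark_name="illustrative-survey-benchmark",
    mean_distance=1.25,  # placeholder value
    approved_by="reviewer@example.com",
)
record_v2 = PromptPolicyRecord(
    prompt_text="You are a citizen of X. Answer as a typical survey respondent would.",
    target_population=record_v1.target_population,
    benchmark_name=record_v1.benchmark_name,
    mean_distance=1.10,  # placeholder value
    approved_by=record_v1.approved_by,
)

print(record_v1.version_id() != record_v2.version_id())  # True
```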

Boundaries for applying this result

The paper gives businesses a useful diagnostic pattern, but its limits should shape adoption.

First, the evidence is based on forced-choice survey items. That makes the measurement clean, but production systems often operate through open-ended text, multi-step reasoning, retrieval, and tool use. A model that moves closer on IVS survey items may still mishandle a culturally sensitive policy memo.

Second, the experiment is English-only and short-form. Prior work cited by the paper suggests that prompt language and phrasing can change measured cultural alignment. A serious global deployment would need native-language tests and domain-specific tasks.

Third, country labels are crude. Countries contain regional, class, religious, generational, professional, and institutional differences. “You are a citizen of X” may be useful for a first benchmark, but it is not a fine-grained representation of social reality. Culture is not a dropdown menu, although enterprise software has certainly tried.

Fourth, prompt optimization can improve many targets while degrading some. The paper explicitly notes heterogeneous effects and remaining hard cases. This is why per-country evaluation matters.

Finally, the benchmark target itself must be chosen carefully. Survey-grounded alignment can help a model reflect population-level values, but business systems often need a layered target: local relevance, legal compliance, organizational policy, and ethical constraints.

The conclusion is not “do not use this.” The conclusion is “use this as a diagnostic layer, not as the whole governance system.”

Conclusion: localization becomes measurable, and therefore harder to fake

The uncomfortable lesson of the paper is that model behavior under generic prompting is not neutral. Open-weight models, like proprietary ones studied before, can occupy a narrow cultural region when asked value-laden questions without explicit cultural conditioning.

The useful lesson is that this prior can be measured and partially moved. Manual country prompting helps. DSPy prompt programming can help more under the right configuration, especially MIPROv2 with a strong proposer model in this study. But neither removes the need for disaggregated evaluation, downstream validation, and governance over the chosen target.

For Cognaptus readers, the business takeaway is straightforward: cultural alignment should not be treated as a line in a prompt template. It should be treated as an evaluated system property.

That changes the role of prompts. They are no longer informal instructions written by whoever had the cleanest Google Doc that day. They become a policy layer: measurable, optimizable, auditable, and capable of shifting the worldview embedded in a workflow.

The good news is that this makes localization less theatrical.

The bad news is that it makes localization harder to fake.

Cognaptus: Automate the Present, Incubate the Future.

¹ Maksim E. Eren, Eric Michalak, Brian Cook, and Johnny Seales Jr., “Prompt Programming for Cultural Bias and Alignment of Large Language Models,” arXiv:2603.16827, 2026, https://arxiv.org/abs/2603.16827.
