TL;DR for operators

The paper does not merely say that GPT-generated stories contain national clichés. That would be mildly interesting, in the way that discovering a tourist brochure likes sunsets is mildly interesting.

The sharper finding is structural. When Rettberg and Wigers prompted gpt-4o-mini to write 1,500-word “potential” stories for 236 demonyms, the model produced surface diversity (olive trees, fjords, forests, trains, village elders, festivals) but repeatedly returned to the same basic narrative machine: someone comes back to a small town or village, discovers that community or tradition has weakened, organises a symbolic event, and restores harmony.[1]

For business users, the operational risk is not just offensive stereotyping. It is story flattening. A model may produce content that appears locally flavoured while quietly standardising conflict, agency, time, change, and resolution. That matters for localisation, education, brand storytelling, public-sector communication, game writing, tourism, training simulations, and any workflow where “make this culturally relevant” becomes a button someone presses because apparently we learned nothing from stock photography.

The practical takeaway: cultural QA should not stop at checking whether the names, foods, landmarks, or holidays are plausible. It should also ask: Who acts? What counts as conflict? What kind of change is allowed? Are political and social tensions converted into harmless misunderstandings? Does every country somehow become a Hallmark town with better scenery?

The boundary is equally important. The study probes one model, gpt-4o-mini, through deliberately simple prompts. It does not prove that all LLMs always flatten all cultures. It does, however, provide a useful diagnostic pattern: surface localisation can coexist with deep narrative standardisation.

The model learned the decorations before it learned the story

A familiar localisation failure looks like this: the AI gives the wrong greeting, invents a festival, mangles a name, or drops a landmark into the wrong city. That is representational bias at street level. Easy to mock, easy to screenshot, easy to fix in a style guide.

Rettberg and Wigers are after a more expensive problem: narrative bias. The model may get the decoration roughly right while still imposing a single story grammar underneath. In their dataset, the country prompt changes the scenery. The plot often does not move very far.

That distinction is the mechanism-first reading of the paper. The model is not simply saying “Norway equals fjords” or “Palestine equals olive trees.” It is taking those associations and placing them inside a stabilising template. The result is not cultural specificity. It is a generic restoration arc wearing national costume.

The mechanism appears to involve at least four interacting forces:

| Mechanism | What it does in the generated stories | Why operators should care |
| --- | --- | --- |
| Training-data association | Links countries to recurring symbols, names, motifs, and genres | Produces plausible-looking local colour without local understanding |
| Probabilistic normalisation | Amplifies common patterns and suppresses outliers | Makes output safe, familiar, and repetitive |
| Alignment and filtering | Reduces explicit violence, extremism, sexual content, and other risky material | Can sanitise precisely the tensions that make stories historically or politically meaningful |
| Prompt sensitivity | A single word such as “potential” can change language and behaviour | Makes audit results fragile unless prompts are systematically tested |

This is why the paper is more interesting than “AI stereotypes countries.” Stereotyping is visible on the surface. Standardisation hides inside the plot.

The experiment is simple by design, which is both its strength and its trap

The authors generated 11,850 stories. They used Statistics Norway’s country list, found demonyms for 236 countries, and generated 50 stories for each (11,800 stories). They also generated 50 stories without a nationality marker, which brings the total to 11,850. The base prompt was: “Write a 1500 word potential {demonym} story.”
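The design arithmetic is easy to check with a small sketch. This is not the authors' code and it calls no model. The prompt template is quoted from the article, but the three-item demonym list, the helper names `build_prompts` and `expected_story_count`, and the exact wording of the no-nationality baseline prompt are all assumptions made for illustration.

```python
# Minimal sketch of the prompt grid described above; no model is called.
# The template is quoted from the article. Everything else is illustrative.
PROMPT_TEMPLATE = "Write a 1500 word potential {demonym} story."
# ASSUMPTION: the article does not give the baseline wording; this is a guess.
BASELINE_PROMPT = "Write a 1500 word potential story."
STORIES_PER_PROMPT = 50


def build_prompts(demonyms, per_prompt=STORIES_PER_PROMPT):
    """Return one prompt per requested story, plus one baseline batch."""
    prompts = []
    for demonym in demonyms:
        prompts.extend([PROMPT_TEMPLATE.format(demonym=demonym)] * per_prompt)
    prompts.extend([BASELINE_PROMPT] * per_prompt)
    return prompts


def expected_story_count(n_demonyms, per_prompt=STORIES_PER_PROMPT):
    """Stories for every demonym plus one baseline batch."""
    return n_demonyms * per_prompt + per_prompt


# Hypothetical three-item stand-in for the 236-demonym list.
sample = build_prompts(["Norwegian", "Palestinian", "American"])
print(len(sample))                # 200: three demonym batches plus the baseline
print(expected_story_count(236))  # 11850, matching the reported total
```

Keeping the template in one constant makes it easier to rerun the same grid with a single word changed, which is the kind of sensitivity check the “potential” detail calls for.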

The word “potential” is not decorative. The authors included it because earlier tests without it often produced summaries of existing novels rather than new stories. Later experiments that dropped the word also changed the output language: stories were more likely to appear in the language associated with the country, whereas the authors wanted English output for easier analysis.

That detail matters. It is best read as a sensitivity and implementation finding, not as a side anecdote. The dataset is a deliberate minimal-prompt probe. It asks: what does the model default to when the human gives it very little story architecture?

The authors then combine several methods:

| Analysis element | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Word frequency analysis | Main descriptive overview | Shows common emotional and symbolic vocabulary across the dataset | Does not by itself establish plot structure |
| Box plot of frequent words by country | Main evidence plus exploratory signal detection | Identifies cross-country regularities and outliers such as Palestinian use of “stand” and “tree” | Does not explain local meaning without close reading |
| Noun-phrase extraction | Exploratory and descriptive support | Finds repeated motifs such as “whispering pines” | Does not validate whether motifs are culturally accurate |
| Sentiment analysis on summaries | Implementation detail and exploratory extension | Helps navigate a dataset too large to read fully | Limited by model choice, summary compression, and lack of neutral class |
| Word trees for “stand” and “fight” | Main interpretive evidence | Shows how politically loaded words are used in context | Does not measure full narrative causality |
| Close reading of selected countries | Main evidence for narrative structure | Reveals repeated plot mechanics and cultural flattening | Cannot alone quantify every country in the dataset |
| Norwegian plot diagram | Qualitative main evidence | Shows the repeated village/nature/restoration arc | It is manually annotated, not computationally inferred |

The authors are refreshingly clear that they did not read all 11,850 stories. The dataset runs to almost 20 million words, roughly the length of 200 novels. Instead, they use computational tools to find patterns, then close-read selected cases: American, Norwegian, Palestinian, Israeli, and samples from others. After about ten stories per country, they report reaching saturation, with the same types of content recurring.

That is a defensible method for discovering a strong pattern, but it is not the same as a complete structural annotation of every story. The right interpretation is: the paper provides substantial evidence of narrative standardisation in this dataset, not a mathematically exhaustive taxonomy of every generated plot.
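Word frequency and per-country outlier spotting, the first two analysis elements listed above, can be approximated in a few lines. This is a rough sketch under stated assumptions, not a reconstruction of the authors' pipeline: the regex tokeniser, the per-1,000-token rate, the Tukey fence (Q3 + 1.5 × IQR), the function names, and all toy data are choices made here for illustration.

```python
import re
from collections import Counter
from statistics import quantiles

# ASSUMPTION: a crude lowercase tokeniser; the paper's tokenisation is not specified here.
TOKEN_RE = re.compile(r"[a-z']+")


def rate_per_thousand(stories, word):
    """Occurrences of `word` per 1,000 tokens across a list of story texts."""
    counts = Counter()
    for text in stories:
        counts.update(TOKEN_RE.findall(text.lower()))
    total = sum(counts.values())
    return 1000 * counts[word] / total if total else 0.0


def high_outliers(rates, k=1.5):
    """Countries whose rate exceeds the upper Tukey fence, Q3 + k * IQR."""
    q1, _, q3 = quantiles(rates.values(), n=4, method="inclusive")
    fence = q3 + k * (q3 - q1)
    return sorted(country for country, rate in rates.items() if rate > fence)


# Invented one-sentence corpus, only to show the rate calculation.
toy_corpus = {"Country B": ["We stand together, we stand firm, said the elder."]}
print(round(rate_per_thousand(toy_corpus["Country B"], "stand"), 1))  # 222.2

# Invented per-country rates for one word, only to show the outlier rule.
toy_rates = {
    "Country A": 1.0,
    "Country B": 1.2,
    "Country C": 0.9,
    "Country D": 1.1,
    "Country E": 9.0,
}
print(high_outliers(toy_rates))  # ['Country E']
```

A flagged country is only a prompt for close reading. As the table notes, the frequency signal does not explain what the word is doing in the story.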

The default genre is sentimental restoration

Across the corpus, the most frequent words include “heart,” “story,” “feel,” “spirit,” “village,” “share,” and “voice.” The authors describe the default genre as saccharine. Hard to disagree. If the model had a scented candle, it would probably be called Heritage Whisper.

But word frequency is only the entry point. The deeper pattern is the repeated restoration arc:

  1. A protagonist lives in or returns to a small town, village, or rural community.
  2. The community has lost connection to tradition, nature, memory, or one another.
  3. A threat appears: developers, drought, social fragmentation, generational forgetting, vague outsiders, or imbalance.
  4. The protagonist organises a festival, storytelling event, mural, garden, archive, restored train station, or community gathering.
  5. The community rediscovers itself.
  6. The protagonist often stays.

This plot is not random. It favours stability over change. Problems are not solved by institutional reform, migration, rebellion, legal conflict, political struggle, technological transformation, romance, tragedy, revenge, or escape. They are solved by community symbolism.

That is the key mechanism: the model repeatedly turns tension into reconciliation and change into restoration. The future is allowed only if it looks like the past with better lighting.
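The six beats listed above can be turned into a crude screening heuristic for a batch of drafts. The cue phrases, the `ARC_CUES` name, and the coverage score below are all invented for this sketch; they are not the authors' coding scheme, and substring matching of this kind will produce both false hits and misses. It is meant to rank drafts for human reading, not to classify plots.

```python
# ASSUMPTION: hypothetical cue phrases for each beat of the restoration arc.
# A real audit would build these lists with local reviewers.
ARC_CUES = {
    "return": ("returned", "came back", "came home"),
    "loss": ("forgotten", "faded", "drifted apart"),
    "threat": ("developer", "drought", "outsider"),
    "symbolic_event": ("festival", "mural", "gathering", "storytelling"),
    "restoration": ("restored", "rediscovered", "came together"),
    "stay": ("decided to stay", "chose to stay", "stayed"),
}


def beats_present(story):
    """Return the arc beats for which at least one cue phrase appears."""
    text = story.lower()
    return [beat for beat, cues in ARC_CUES.items()
            if any(cue in text for cue in cues)]


def arc_coverage(story):
    """Fraction of the six beats that the story appears to hit (0.0 to 1.0)."""
    return len(beats_present(story)) / len(ARC_CUES)


# Invented examples: one follows the template, one does not.
template_story = (
    "Mara returned to the village. The old songs were forgotten, and "
    "developers circled the harbour. She organised a festival, the town "
    "rediscovered itself, and she decided to stay."
)
other_story = "Askeladden left home, tricked the troll, and went on to seek his fortune."

print(arc_coverage(template_story))  # 1.0
print(arc_coverage(other_story))     # 0.0
```

If most drafts in a batch score near 1.0 regardless of country, that is a reason to read them side by side before shipping.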

For operators, this is the point where the paper leaves literary studies and enters workflow design. LLMs are increasingly used to draft public messages, school materials, campaign copy, onboarding scripts, training simulations, and localised stories. If the model’s default narrative reflex is restoration without disruption, it may quietly reshape the user’s intent.

A policy brief becomes less political. A training scenario becomes less conflictual. A heritage campaign becomes more nostalgic. A story about displacement becomes a story about murals. The output still looks competent. That is the annoying part.

The American stories reveal the template most clearly

The American stories mostly use the same small-town restoration structure. A protagonist returns from the city. The city is stressful, ambitious, and inauthentic. The small town is faded but morally recoverable. The protagonist organises community action and decides to remain.

The paper links this to a Hallmark-like structure, with one striking caveat: romance is largely absent. The model keeps the emotional architecture of return, healing, community, and small-town belonging, but removes much of the romantic machinery. It is Hallmark after a compliance review.

The train motif is especially revealing. In the American set, 23 of the 50 stories have titles beginning with “The Last Train.” Fourteen are titled “The Last Train Home.” Other titles include variations on “The Last Letter,” “The Last Harvest,” and “The Last Stop.” The train becomes a symbol of endings, memory, and lost connection.

That is not how trains function across all American literature. Trains can represent industrial power, movement, class, racial segregation, access, danger, freedom, or empire. In these generated stories, they are mostly nostalgia engines. The abandoned train station is not a site of labour politics or infrastructure decline. It is an emotional prop waiting to be restored by a protagonist with excellent community-organising instincts and, apparently, no permitting issues.
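Title counts like the ones reported for the American set are simple to reproduce on any batch of generated stories. The sketch below assumes the titles have already been extracted into a list; the helper names and the five toy titles are illustrative (the article reports that “The Last Train Home” and “The Last Harvest” occur as titles or title patterns, while the other two are invented).

```python
from collections import Counter


def count_title_prefix(titles, prefix):
    """Number of titles that start with `prefix`, ignoring case and outer spaces."""
    prefix = prefix.casefold()
    return sum(1 for title in titles if title.strip().casefold().startswith(prefix))


def most_common_titles(titles, n=3):
    """The n most frequent exact titles with their counts."""
    return Counter(title.strip() for title in titles).most_common(n)


# Toy list; a real audit would load all 50 titles for one demonym.
toy_titles = [
    "The Last Train Home",
    "The Last Train Home",
    "The Last Train to Willow Creek",  # invented title
    "The Last Harvest",
    "Echoes of Maple Street",          # invented title
]

hits = count_title_prefix(toy_titles, "The Last Train")
print(hits, hits / len(toy_titles))       # 3 0.6
print(most_common_titles(toy_titles, 1))  # [('The Last Train Home', 2)]
```

Titles are a cheap first signal because they are short and easy to extract. Heavy repetition there usually justifies a closer look at the plots underneath.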

This matters because the American stories may be closest to the model’s training-data centre of gravity. The paper is careful: we do not know the exact gpt-4o-mini training mix. But the authors note that OpenAI’s older GPT-3 documentation showed large amounts of Common Crawl, WebText2, books, and Wikipedia, with English and US content heavily represented. The American story template may therefore reflect not “America” but a training-data version of American narrative familiarity.

That distinction is important. The model is not modelling the country. It is modelling texts about the country.

Palestinian and Israeli stories show how conflict becomes acceptable

The Palestinian and Israeli stories are the paper’s strongest evidence that the problem is not just decorative stereotyping.

Palestinian stories contain many olive trees. Israeli stories also contain olive trees, though less obsessively. The symbol is used for peace, perseverance, roots, heritage, and endurance. On its own, this might be filed under predictable symbolism.

The word-tree analysis adds the plot-level tension. In Palestinian stories, “stand” is used politically: stand together, stand firm, stand against, stand in solidarity. “Fight” appears too, but usually abstractly: fight for home, land, justice, or what matters. Words such as “gun,” “kill,” “malnutrition,” and “attack” are absent or rare. War appears, but direct confrontation is reduced.
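A word tree is an interactive visualisation, but its first branch (which words follow a keyword) and a simple lexicon check can be approximated with counters. This sketch is a simplification for illustration, not the authors' tool: the tokeniser, the function names, and the two toy sentences are assumptions, and the toy sentences are not drawn from the dataset.

```python
import re
from collections import Counter

# ASSUMPTION: a crude lowercase tokeniser; sentence boundaries are ignored.
TOKEN_RE = re.compile(r"[a-z']+")


def next_word_counts(texts, keyword):
    """Count the word immediately after `keyword`: the first branch of a word tree."""
    counts = Counter()
    for text in texts:
        tokens = TOKEN_RE.findall(text.lower())
        for current, following in zip(tokens, tokens[1:]):
            if current == keyword:
                counts[following] += 1
    return counts


def lexicon_hits(texts, lexicon):
    """How often each word in `lexicon` appears; zeros are kept to show absences."""
    counts = Counter()
    for text in texts:
        for token in TOKEN_RE.findall(text.lower()):
            if token in lexicon:
                counts[token] += 1
    return {word: counts[word] for word in sorted(lexicon)}


# Invented sentences in the register the article describes.
toy_texts = [
    "We will stand together.",
    "They stand firm and stand together for what matters.",
]

print(next_word_counts(toy_texts, "stand").most_common())
# [('together', 2), ('firm', 1)]
print(lexicon_hits(toy_texts, {"gun", "kill", "attack"}))
# {'attack': 0, 'gun': 0, 'kill': 0}
```

Reporting the zeros matters here, because the finding is as much about which words are missing as about which ones dominate.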

The model does not erase all conflict. It renders conflict into an acceptable narrative register. The protagonist responds through meetings, murals, vigils, art, poetry, storytelling, olive-tree planting, and community solidarity. These may be meaningful actions in real life. The issue is not that such actions are invalid. The issue is that the narrative universe seems to prefer them so strongly that other forms of agency disappear.

The Israeli stories are even more sanitised. Conflicts are often vague, individualised, or displaced. Opponents may be developers, vandals, masked individuals, or interpersonal prejudice. Palestinians may appear as friends or neighbours, and the local tension can often be solved through mutual understanding.

The paper’s interpretation is cautious but pointed: this flattening likely comes from both model normalisation and filtering/alignment. Violent incitement should obviously be filtered. Nobody serious wants a writing assistant optimised for extremist fan fiction. But conflict, coercion, oppression, and political agency are also central to many real narratives. When safety mechanisms and training-data averages combine, the model may produce a world where injustice exists only until someone organises a festival hard enough.

For business practice, this is not a call to make models more violent. It is a call to audit what kinds of conflict the model is allowed to represent. “Safe” and “sanitised” are not synonyms.

Norwegian stories prove that local symbols can still miss local narrative logic

The Norwegian stories are useful because they show the gap between surface localisation and narrative localisation.

On the surface, the model reaches for recognisable Nordic elements: fjords, forests, mountains, spirits, grandmothers, old maps, diaries, names such as Freya, Astrid, Ingrid, and Elin. The titles lean heavily into whispering and secrecy: 20 of the 50 generated Norwegian stories are titled “The Whispering Pines,” and “Whispering” appears in 33 titles. If one includes related titles using “whisper,” “secret,” and “echo,” 41 of the 50 titles share that mood.

The plot, however, remains familiar. A young woman is in or returns to a village near a fjord or mountain. She enters the forest, often prompted by family memory. A guardian spirit warns of imbalance involving nature, community, weather, outsiders, or developers. The protagonist restores balance through self-realisation or community organising. The village becomes a beacon of unity, courage, or sustainability.

The authors contrast this with famous Norwegian folktales about Askeladden, the Ash Lad, who leaves home, tricks opponents, seeks fortune, and does not return to organise a community festival. The generated stories use Norwegian-looking elements but do not reproduce a recognisably Norwegian story logic.

This is the operational trap. A localisation reviewer might approve the fjords and names. A cultural reviewer should ask whether the plot itself belongs.

That difference is not academic hair-splitting. In global content operations, teams often check entities, terminology, style, and compliance. Plot architecture rarely receives the same scrutiny. Yet the plot is where agency, morality, conflict, and change live.

The paper’s real unit of risk is not the cliché, but the permitted arc

Most AI bias audits look for representational errors: Who is shown as a doctor? Who is shown as a criminal? Which names are associated with competence? Which languages get poorer responses? Those audits remain necessary.

Rettberg and Wigers add a second layer: What kinds of stories are tellable by default?

That question matters because narratives do operational work. They frame what caused a problem, who can act, what counts as resolution, and whether change is desirable. In the generated stories, change is suspicious unless it restores something older. Cities are stressful. Villages are authentic. Tradition is endangered. Community is the cure. The protagonist’s job is not to transform the system but to repair the social fabric.

This gives us a useful audit distinction:

| Audit layer | Typical question | What this paper adds |
| --- | --- | --- |
| Representation | Are groups described fairly and accurately? | Necessary but insufficient |
| Symbolic localisation | Are local names, places, foods, and motifs plausible? | Can still conceal generic story logic |
| Narrative structure | What conflict is allowed, who acts, and what counts as resolution? | The core contribution |
| Temporal logic | Does the story allow real change, or only restoration? | Critical for strategy, history, and public communication |
| Agency model | Are people agents, victims, witnesses, organisers, rebels, consumers, or symbols? | Essential for culturally sensitive use |

A model can pass the first two layers and fail the third. That is the business problem.

What this means for localisation, brand, education, and public-sector teams

Here is the paper’s direct evidence: gpt-4o-mini, under simple demonym prompts, generated a large set of stories that vary in cultural symbols but repeatedly share a stabilising plot structure.

Here is the Cognaptus inference: any organisation using LLMs for culturally sensitive writing should audit narrative structure, not just local vocabulary.

That inference applies differently by domain.

For localisation teams, the risk is false cultural adequacy. A text may contain correct names and familiar symbols while still framing the culture through an outsider’s template. Local reviewers should inspect narrative roles and conflict logic, not only terminology.

For brand and marketing teams, the risk is emotional sameness. If every market is rendered through community, heritage, nostalgia, and gentle renewal, global campaigns become pleasantly bland. The brand may believe it is localising. It is actually exporting one emotional arc with different props.

For education teams, the risk is curricular flattening. AI-generated stories about countries may teach students a postcard version of cultural difference while suppressing social conflict, modernity, humour, contradiction, migration, class, religion, and politics. The material looks inclusive. It may still be intellectually anaemic.

For public-sector communication, the risk is depoliticisation. LLM-assisted drafts may turn structural conflict into community misunderstanding and administrative failure into a call for togetherness. A model that always reaches for reconciliation can make real problems sound like a town hall with bunting.

For creative and entertainment teams, the risk is genre collapse. A model can generate superficially distinct settings while pulling writers toward a narrow repertoire of acceptable plots. That may be useful for cheap filler. It is less useful for making work that has a pulse.

A practical audit checklist for narrative standardisation

A useful cultural QA workflow should include narrative questions. Not all projects need a literary theorist lurking in the sprint review, though that would at least improve the meeting snacks. But teams can operationalise the paper’s insight.

Put these questions to each model output:

| Question | Failure signal |
| --- | --- |
| What is the central conflict? | The conflict is vague, individualised, or symbolic when the real context is structural |
| Who has agency? | The protagonist only reconciles, remembers, teaches, or organises symbolic events |
| What kind of change occurs? | The ending restores the past rather than allowing transformation |
| What is removed? | Politics, class, religion, violence, migration, sexuality, grief, or institutions vanish without explanation |
| What symbols are repeated? | Local motifs appear obsessively while doing generic emotional work |
| Could this plot work unchanged in another country? | If yes, the localisation is mostly decorative |
| What would a local genre do differently? | The output ignores local narrative forms, humour, conflict, pacing, or endings |

The strongest test is substitution. Replace the country markers. If the same plot survives with only swapped scenery, the content is not culturally grounded. It is culturally dressed.
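The substitution test can be roughed out in code: mask each story's country markers, then measure how much of the remaining text two stories share. This is a hedged sketch, not a validated metric. The marker lists would have to come from a human reviewer, the `[LOCAL]` placeholder and both function names are invented here, and word-sequence overlap from the standard library's `difflib.SequenceMatcher` is a stand-in for plot similarity that only catches near-verbatim reuse.

```python
import re
from difflib import SequenceMatcher


def mask_markers(text, markers, placeholder="[LOCAL]"):
    """Replace each country-specific marker with a neutral placeholder."""
    # Longest markers first, so multi-word markers are masked before their parts.
    for marker in sorted(markers, key=len, reverse=True):
        pattern = rf"\b{re.escape(marker)}\b"
        text = re.sub(pattern, placeholder, text, flags=re.IGNORECASE)
    return text


def plot_overlap(story_a, markers_a, story_b, markers_b):
    """Word-sequence similarity (0.0 to 1.0) after local markers are masked."""
    words_a = mask_markers(story_a, markers_a).lower().split()
    words_b = mask_markers(story_b, markers_b).lower().split()
    # autojunk is disabled so frequent words are not discarded in long stories.
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


# Invented one-sentence "stories" that differ only in scenery.
story_no = "Ingrid returned to the village by the fjord and organised a festival."
story_ps = "Layla returned to the village by the olive grove and organised a festival."
story_other = "Askeladden left home, tricked a troll, and never came back."

print(plot_overlap(story_no, ["Ingrid", "fjord"], story_ps, ["Layla", "olive grove"]))  # 1.0
print(plot_overlap(story_no, ["Ingrid", "fjord"], story_other, []) < 0.5)  # True
```

A high score after masking suggests the scenery is doing all the localising. A low score proves little on its own, since two stories can share a plot while using different words, so this check supplements the checklist rather than replacing a reviewer.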

Boundaries: this is an audit signal, not a universal verdict

The study’s limitations matter because they shape how the findings should be used.

First, the paper studies gpt-4o-mini. It should not be treated as a claim about every model, every prompting method, or every language. Other models, stronger prompts, retrieval, local corpora, human co-writing, or fine-tuning may change the pattern.

Second, the prompt was intentionally simple. That is appropriate for probing defaults, but not equivalent to a professional writing workflow. A skilled human prompting iteratively could produce more varied stories. The paper itself acknowledges that creative co-writing with LLMs can produce more interesting output.

Third, the computational analysis is constrained by multilingual complexity. Some generated stories appeared in languages other than English, and the authors note the difficulty of automatically merging words across languages. Their sentiment analysis also used summaries because the chosen emotion model could not process full 1,500-word stories.

Fourth, close reading is interpretive. That is not a weakness; it is the right tool for plot structure. But it means the paper’s strongest structural claims come from a hybrid method rather than a fully automated structural classifier.

Finally, the study does not identify the exact causal source of standardisation. Is the repeated plot caused by the prompt? Training data? Hallmark-like genre prevalence? Safety alignment? The architecture’s tendency toward common patterns? The answer is probably a blend. The paper’s contribution is to show the pattern clearly enough that future work can test those mechanisms more directly.

The strategic lesson: do not confuse fluency with cultural imagination

The most useful thing about this paper is that it makes a quiet failure visible.

The generated stories are not necessarily ugly. Many are probably readable. They contain feeling, heritage, community, hope, and local symbols. That is precisely why the failure matters. The problem is not that the model cannot decorate. The problem is that decoration can conceal a very narrow idea of what a story is allowed to do.

For operators, the lesson is simple: when using LLMs across cultures, audit the arc.

Not just the words. Not just the names. Not just the tone. The arc.

A story can mention olive trees and still flatten Palestine. It can mention fjords and still miss Norway. It can restore an American train station and still reduce a country to nostalgia with a timetable.

The next generation of AI content governance will need to move beyond “is this offensive?” and ask a harder question: what kinds of realities does this system repeatedly make impossible to say?

That is where cultural alignment becomes more than a benchmark. It becomes an editorial discipline.

Cognaptus: Automate the Present, Incubate the Future.


  1. Jill Walker Rettberg and Hermann Wigers, “AI-generated stories favour stability over change: homogeneity and cultural stereotyping in narratives generated by gpt-4o-mini,” Open Research Europe 5:202, first published 29 July 2025. arXiv:2507.22445.