Recipe localization looks like an easy prompt.

“Create a Jamaican version of Moroccan couscous.”

The model smiles politely, throws in jerk seasoning, allspice, scotch bonnet, maybe coconut milk if it is feeling ambitious, and returns something that looks country-specific enough to survive a quick marketing review. The title says “Jamaican.” The ingredients sound Jamaican. The format is clean. No hallucinated oven temperature from another dimension. Excellent, ship it.

Except the paper “Can LLMs Cook Jamaican Couscous? A Study of Cultural Novelty in Recipe Generation” suggests that this is exactly where the problem begins. [1]

The issue is not that LLMs cannot generate recipes. They can. The issue is subtler and more commercially uncomfortable: LLMs can generate outputs that look culturally adapted while failing to preserve the cultural logic of adaptation. They do not merely make a few factual mistakes. They often produce novelty by substituting surface markers for cultural structure.

That distinction matters beyond food. Recipes are just a convenient laboratory. The same pattern can appear in tourism copy, localized ads, education materials, brand storytelling, entertainment content, and customer-facing AI systems that need to sound culturally aware without turning every culture into a themed seasoning rack.

This paper is useful because it does not stop at “LLMs have cultural bias.” That diagnosis is now familiar enough to have become a small academic cottage industry. Instead, the authors build a mechanism: models inflate novelty, weaken the relationship between novelty and cultural distance, blur country boundaries, and replace culturally grounded material with generic or stereotypical substitutions. In other words, the model can add more spice while removing the soul. Efficient, in a tragic sort of way.

The paper turns cultural adaptation into an artifact test, not a vibes test

Most cultural-alignment evaluations ask models questions. That is useful, but fragile. A model can answer a survey-style prompt differently depending on wording, and a polished response does not tell us whether it can produce culturally meaningful artifacts.

This paper takes a different route. It studies recipes because cuisine carries cultural identity through ingredients, preparation, naming, substitution, and boundaries. A culturally adapted recipe is not just a recipe with a national adjective attached. It has to preserve something from the original dish while transforming it through another culinary context.

The authors extend the GlobalFusion dataset into what they call LLMFusion. GlobalFusion contains human-made recipe adaptations: 500 dishes, with variations from an average of 19 countries per dish, covering 130 countries. The original setup allows researchers to compare a reference recipe, such as Moroccan couscous, with variations from other countries, such as Jamaican couscous or Mongolian couscous.

LLMFusion keeps the same country-pair structure but asks LLMs to generate analogous recipes. The paper tests eight models: Meta-Llama-3-70B-Instruct, gemma-2-27b-it, falcon-40b, Orion-14B-Chat, Phi-4-multimodal-instruct, gemma-3-27b-it, Qwen2.5-32B-Instruct, and Qwen3-30B-A3B-Instruct-2507. The prompts use keyword variants such as “novel,” “unique,” “new,” “different,” “surprising,” “creative, desirable and useful,” “original,” “authentic,” “traditional,” and “prototypical.” They also include prompt variants that add keyword definitions or cultural background information.

This is not a small “we tried ChatGPT on five dishes” exercise. The generated dataset reaches roughly 1.3 million valid recipes across the tested models. The authors then compare LLM outputs with human adaptations using divergence metrics based on Jensen-Shannon divergence.

The metrics matter because each captures a different type of novelty:

| Metric | What it roughly captures | Why it matters for cultural adaptation |
| --- | --- | --- |
| Cultural Newness | Terms appearing or disappearing relative to a cultural knowledge space | Detects surface introduction of new material |
| Cultural Uniqueness | Divergence from a prototypical version of a cultural product | Tests whether the output moves away from a community prototype |
| Cultural Difference | Distance from observed examples in the cultural knowledge base | Captures whether an artifact is genuinely distant from known variants |
| New Surprise | New combinations not present in the expectation space | Captures culturally meaningful recombination |
| Divergent Surprise | Divergence in expected term associations | Captures changed relationships among recipe elements |
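For readers who want to see the shared ingredient of these metrics, here is a minimal sketch of Jensen-Shannon divergence between two recipes' term distributions. It is not the paper's implementation: the token lists, the add-one smoothing, and the function names are assumptions made for illustration, and the five metrics above differ in what distributions they compare, which this sketch does not reproduce.

```python
import math
from collections import Counter

def term_distribution(tokens, vocab):
    """Relative frequency of each vocabulary term, with add-one smoothing (an assumption)."""
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)
    return [(counts[term] + 1) / total for term in vocab]

def kl(p, q):
    """Kullback-Leibler divergence in bits; q must be positive wherever p is."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 when using log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical toy recipes, invented for this example only.
reference = "couscous lamb chickpea cumin raisin onion".split()
human = "couscous lamb chickpea allspice thyme onion".split()
model = "rice chicken allspice thyme coconut scallion".split()

vocab = sorted(set(reference) | set(human) | set(model))
ref_p = term_distribution(reference, vocab)
human_jsd = js_divergence(ref_p, term_distribution(human, vocab))
model_jsd = js_divergence(ref_p, term_distribution(model, vocab))
```

In this toy case the invented model recipe diverges more from the reference than the invented human one does. That is exactly the kind of number that looks like creativity until you ask what was kept.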

The key question is not “Do LLMs create different recipes?” They do. The question is whether their differences behave like human cultural adaptation. Human adaptations show relationships between novelty metrics and cultural distance: more culturally distant pairings tend to produce different patterns of novelty. The paper asks whether LLMs reproduce that structure.

They mostly do not.

Mechanism step one: LLMs inflate novelty before they understand distance

The first finding is deceptively easy to misread.

LLMs often score as highly novel. In some metrics, they even overproduce divergence compared with humans. A lazy interpretation would say: good, models are creative. The paper’s interpretation is more interesting: the models are producing divergence, but not the culturally grounded kind.

For human recipes, the strongest associations with cultural distance appear in New Surprise and Difference. That makes sense. When humans adapt a dish across distant cultural contexts, the adaptation is not merely a random addition of unfamiliar words. It changes combinations, expectations, and distance from known examples in ways that reflect cultural separation.

For LLMs, this structure weakens or reverses. The paper finds that models can reproduce or slightly reinforce correlations for Newness, but Newness is the weakest and least culturally meaningful signal in the human baseline. Meanwhile, New Surprise and Difference, the two metrics most strongly tied to cultural distance in human recipes, become weak, negligible, or even sign-inverted for models.

This is the central mechanism. LLM novelty is not absent. It is misallocated.

The model has learned that “Jamaican” should trigger certain recognizable symbols. It has also learned that “novel” or “creative” means it should move away from the default. But the movement is not reliably governed by the same cultural-distance structure that shapes human adaptation. The model changes things because changing things is easy. Preserving the right things while changing the right other things is harder.
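A rough way to picture “novelty that tracks cultural distance” is a rank correlation between a distance score and a novelty score across country pairs. The sketch below uses Spearman correlation as an assumption, since this article does not state which statistic the authors use, and every number in it is invented purely to illustrate a sign inversion.

```python
def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical scores for five country pairs, invented for illustration.
cultural_distance = [0.1, 0.3, 0.5, 0.7, 0.9]
human_new_surprise = [0.20, 0.25, 0.40, 0.55, 0.60]
model_new_surprise = [0.70, 0.72, 0.69, 0.71, 0.68]

human_rho = spearman(cultural_distance, human_new_surprise)
model_rho = spearman(cultural_distance, model_new_surprise)
```

With these made-up values the human series rises with distance while the model series is uniformly high and slightly falling, so the two correlations have opposite signs. High novelty everywhere, structure nowhere.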

For business readers, this is the first useful lesson: output novelty is not the same as localization quality. A culturally adapted artifact should be evaluated not only by whether it contains target-culture markers, but by whether it preserves and transforms the source artifact in a culturally coherent way.

A bad localization review asks: “Does this sound local?”

A better localization review asks: “Which cultural core did the model preserve, which target-culture elements did it introduce, and does that transformation resemble how humans adapt across this boundary?”

The second question is slower. Annoying. Therefore probably necessary.

Mechanism step two: prompt adjectives are weak steering wheels

The paper then examines whether prompting can control the kind of novelty the model produces. This is where many practical AI workflows would hope for an easy fix.

Ask for “authentic” when you want tradition. Ask for “creative” when you want novelty. Ask for “prototypical” when you want cultural center. Ask for “surprising” when you want edge. Done. A prompt library has been born. Someone please make a Notion template.

The results are less cooperative.

The authors compare prompts associated with traditional concepts (“authentic,” “traditional,” and “prototypical”) against prompts associated with novelty or creativity. They also analyze differences among novelty-related keywords. The pattern is not a clean semantic control system. Prompting models with novelty or creativity does not systematically produce larger divergences than tradition-oriented prompts. Across many metrics, the keyword effects are small, model-dependent, or weaker than differences across model architectures.

There is one pattern in the appendix tables: “novelty” and “originality” tend to produce the highest Newness and the lowest Uniqueness and Difference across all models. This is revealing. It suggests that when models respond to novelty language, they often do so by injecting new terms rather than by producing a more culturally meaningful transformation.

This is a familiar failure mode in business AI systems. The surface instruction is followed, but the operational meaning is not. “Make it authentic” becomes “add recognizable tokens.” “Make it creative” becomes “increase lexical divergence.” “Make it local” becomes “insert cuisine stereotypes.” The model is not refusing the instruction. It is obeying a cheaper version of it.

The paper’s keyword tests are best understood as a sensitivity analysis, not a second thesis. They show that prompt wording alone is not enough to reliably steer culturally grounded novelty. That does not mean prompts are useless. It means prompt adjectives cannot substitute for artifact-level evaluation.

Here is the practical distinction:

| Workflow belief | What the paper suggests instead |
| --- | --- |
| Prompt labels such as “authentic” or “creative” can steer cultural adaptation | Labels may alter surface features but do not reliably control culturally grounded divergence |
| Larger or newer models should naturally improve cultural representativity | The tested model families do not show a consistent improvement from scale, multilinguality, multimodality, or newer training |
| A fluent target-country output is probably localized | Fluency can coexist with country mismatch, generic substitution, and loss of source-culture core |
| Novelty metrics are enough if the model scores high | The kind of novelty matters; Newness can be inflated while culturally meaningful Difference and Surprise collapse |

The sarcastic summary: the model hears “authentic” and reaches for the costume box.

Mechanism step three: cultural information is weakly preserved inside the model

The paper does not stop at output comparison. It also asks whether the mismatch may come from how models internally represent recipes.

For this, the authors use a layer-wise analysis. They re-encode reference, human, and model-generated recipes using the same LLM used for generation, then apply a Logit Lens-style method to project intermediate hidden representations into token space. They compute divergence metrics at selected layers: the embedding layer, the middle layer, and the final three layers before generation.

This experiment is exploratory and mechanistic. It is not a production benchmark. Its purpose is to investigate whether cultural divergence is preserved through the model’s internal representations.

The finding: Newness remains relatively stable across layers, but the more culturally sensitive metrics—Difference, New Surprise, and Divergent Surprise—show compression in early and middle layers. Human recipes show stronger divergence in later representations. Model-generated recipes show substantially lower divergence effects. Uniqueness is unstable and does not show a consistent pattern.

The authors interpret this as cultural information being weakly encoded in early and middle layers and insufficiently reconstructed during generation. That fits the output behavior. If the model compresses away the culturally specific distinctions that would support meaningful adaptation, the final output can still look formatted and country-labeled while lacking the deeper structure.
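To make the layer-wise idea less abstract, here is a toy sketch of a Logit Lens-style projection: take a hidden vector from some layer, multiply it by the unembedding matrix, and apply a softmax to read it as a distribution over tokens. Everything here is an assumption for illustration. The three-word vocabulary, the matrix, and the hidden states are invented, and the final layer normalization that a real model would apply before unembedding is omitted.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden, unembedding):
    """Project one hidden vector into token space: softmax(W_U @ h)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembedding]
    return softmax(logits)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical toy setup: 3-token vocabulary, 2-dimensional hidden states.
VOCAB = ["couscous", "allspice", "salt"]
UNEMBEDDING = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one row per token
reference_states = {"middle": [0.2, 0.1], "final": [2.0, 0.0]}
adapted_states = {"middle": [0.1, 0.2], "final": [0.0, 2.0]}

# Divergence between the two recipes' projected distributions, per layer.
layer_divergence = {
    layer: js_divergence(
        logit_lens(reference_states[layer], UNEMBEDDING),
        logit_lens(adapted_states[layer], UNEMBEDDING),
    )
    for layer in reference_states
}
```

In this invented example the two recipes are nearly indistinguishable at the middle layer and clearly separated at the final one. The paper's concern is the case where that separation never becomes large enough for the culturally sensitive metrics.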

This is one of the more important parts of the paper for AI product builders. The problem is not necessarily solved by adding a more elaborate instruction at the front of the prompt. If the model’s internal processing does not preserve the relevant distinctions, the output layer may only reconstruct a superficial approximation.

That does not mean retrieval, fine-tuning, or specialized data cannot help. It means the evaluation has to look at the artifact, not merely the prompt. A model can pass the instruction-following test and fail the cultural-preservation test.

Mechanism step four: country labels blur at the borders

The paper then moves from abstract divergence to a more concrete failure: country attribution.

The authors examine recipe titles to see whether models mismatch country labels. They look at prompts where the country of origin is not provided and inspect which country appears in the generated title. They also examine explicit mismatches when the target variation country is included in the prompt.

The results are not comforting. Country attribution errors are frequent: roughly 25% to 50% when the country is not explicitly mentioned, and 15% to 40% even when it is specified. These errors are not random. They cluster around popular cuisines and countries, including South Korea, Morocco, Greece, Thailand, Italy, France, China, and the United States. Many mismatches occur within the same region.

The examples are revealing: Taiwan and Japan may be replaced by China, Spain by Italy or Mexico, Tunisia by Morocco. The model has regional neighborhoods, but the borders are fuzzy. In some contexts, it behaves as if “nearby” or “globally familiar” is good enough.
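A country-boundary check of this kind is cheap to prototype. The sketch below is a simplified stand-in, not the authors' procedure: it assumes a small hand-written demonym map (hypothetical and far from complete) and counts how often the country named in a generated title differs from the requested one.

```python
# Hypothetical demonym map; a real check would need a much fuller list.
DEMONYMS = {
    "jamaican": "Jamaica",
    "moroccan": "Morocco",
    "tunisian": "Tunisia",
    "italian": "Italy",
    "spanish": "Spain",
}

def country_in_title(title):
    """Return the first country whose demonym appears in the title, else None."""
    for word in title.lower().replace("-", " ").split():
        country = DEMONYMS.get(word.strip(",.:;!?()"))
        if country:
            return country
    return None

def mismatch_rate(records):
    """Share of country-bearing titles that name a country other than the target."""
    labelled = [(r["target"], country_in_title(r["title"])) for r in records]
    labelled = [(target, found) for target, found in labelled if found is not None]
    if not labelled:
        return 0.0
    return sum(target != found for target, found in labelled) / len(labelled)

# Invented outputs for illustration.
records = [
    {"target": "Tunisia", "title": "Moroccan Couscous with Harissa"},
    {"target": "Jamaica", "title": "Jamaican Jerk Couscous"},
    {"target": "Spain", "title": "Italian-Style Paella"},
    {"target": "Morocco", "title": "Classic Couscous"},  # no country named, skipped
]
rate = mismatch_rate(records)
```

Titles that name no country are excluded rather than counted as errors. That is a design choice worth revisiting in a real pipeline, where a missing label may itself be a signal.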

For consumer content, this is embarrassing. For enterprise localization, it is a governance issue. A model that collapses neighboring cultures can create reputational risk even when the generated text is fluent and well-structured. The problem is not just factual correctness. It is cultural boundary management.

The paper’s country-attribution test functions as concrete evidence for the mechanism. If a model cannot reliably maintain country identity in recipe titles, it is unsurprising that it struggles to produce culturally meaningful divergence between “traditional,” “authentic,” and “creative” versions. The categories themselves are unstable.

Mechanism step five: ingredients become generic placeholders

The final layer of evidence is material grounding. In recipes, culture is not only in adjectives. It is in ingredients and how they are preserved, substituted, and recombined.

The authors compare ingredient overlap and preservation. Precision measures how much the generated recipe uses ingredients found in human references. Recall-like preservation measures how much of the source reference ingredient set is retained.

Human adaptations preserve almost all ingredients while still adapting the recipe. This is crucial. Human novelty is not simply ingredient replacement. It often works by maintaining a cultural core and transforming around it.

LLM recipes show a different pattern. They may use ingredients that appear in human references, but they fail to recover many ingredients humans treat as essential. Some models have high overlap but weaker preservation. The ingredient core thins out.
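The two measures are easy to confuse, so here is a small set-based sketch. The exact formulas in the paper may differ; the definitions below simply follow the verbal descriptions above, and the ingredient sets are invented.

```python
def precision(generated, human_reference):
    """Fraction of generated ingredients that also appear in the human reference."""
    generated, human_reference = set(generated), set(human_reference)
    if not generated:
        return 0.0
    return len(generated & human_reference) / len(generated)

def preservation(adapted, source_reference):
    """Recall-like score: fraction of source ingredients kept in the adaptation."""
    adapted, source_reference = set(adapted), set(source_reference)
    if not source_reference:
        return 0.0
    return len(adapted & source_reference) / len(source_reference)

# Hypothetical ingredient sets, for illustration only.
source = {"couscous", "lamb", "chickpea", "cumin", "raisin", "onion"}
human_adaptation = source | {"allspice", "thyme"}
model_adaptation = {"couscous", "onion", "allspice", "thyme"}

model_precision = precision(model_adaptation, human_adaptation)
model_preservation = preservation(model_adaptation, source)
human_preservation = preservation(human_adaptation, source)
```

In this toy case every ingredient the model uses also appears in the human adaptation, so precision is perfect, yet only a third of the source ingredients survive. High overlap, thin core.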

The appendix adds useful detail. Regionally, East Asia and South America show lower coverage, while Europe, North America, and other parts of Asia perform better. The authors also show that LLM generations converge toward globally common ingredients: “salt taste,” onion, garlic, salt, oil, sugar, “pepper taste,” butter, flour, egg, water, milk, and similar generic items. The most frequent ingredient phrase, “salt taste,” appears in 392,342 recipes, about 30% of the roughly 1.3 million generated recipes reported in the appendix.

This is not culinary evil. Onion and garlic are innocent. But the frequency pattern shows a drift toward globally reusable building blocks. The model substitutes culturally specific ingredients with procedural placeholders: “salt to taste,” “optional,” generic flour, generic pepper. It replaces cultural structure with recipe grammar.

The TF-IDF ingredient attribution test reinforces this. Human recipes align most closely with their country of origin, both for original recipes and culturally adapted variants. LLM-generated recipes show weaker alignment with origin countries and shift toward target variation cuisines. In the Moroccan couscous to Jamaican couscous example, the human adaptation balances Moroccan and Jamaican elements. The LLM recipe leans heavily into Jamaican ingredients while preserving little of the Moroccan reference.
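A TF-IDF attribution of this sort can be approximated in a few lines. The sketch below is an assumption-laden miniature: two invented country “profiles,” a smoothed idf, and cosine similarity between a recipe and each profile. The paper's actual corpus, weighting, and scoring are not reproduced here.

```python
import math
from collections import Counter

def idf_table(country_docs):
    """Smoothed inverse document frequency over the country profiles."""
    n = len(country_docs)
    df = Counter(term for tokens in country_docs.values() for term in set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def vectorize(tokens, idf):
    """TF-IDF vector as a dict; terms unknown to the profiles are dropped."""
    tf = Counter(term for term in tokens if term in idf)
    return {term: count * idf[term] for term, count in tf.items()}

def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def attribute(recipe, country_docs):
    """Similarity of one recipe to each country's ingredient profile."""
    idf = idf_table(country_docs)
    recipe_vec = vectorize(recipe, idf)
    return {c: cosine(recipe_vec, vectorize(doc, idf)) for c, doc in country_docs.items()}

# Hypothetical profiles and recipes, invented for illustration.
country_docs = {
    "Morocco": "couscous cumin raisin chickpea lamb onion".split(),
    "Jamaica": "allspice thyme scotch_bonnet coconut scallion onion".split(),
}
human_recipe = "couscous lamb chickpea cumin allspice thyme onion".split()
model_recipe = "allspice thyme scotch_bonnet coconut scallion onion salt".split()

human_scores = attribute(human_recipe, country_docs)
model_scores = attribute(model_recipe, country_docs)
```

On these made-up inputs the human recipe scores on both profiles while leaning toward Morocco, and the model recipe is almost entirely Jamaica. That mirrors the couscous example above in spirit, not in measured values.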

That is the paper’s mechanism in miniature: the model performs target-culture substitution instead of cross-cultural adaptation.

The evidence map: which test supports which claim

The paper uses several experiments, and they are not all doing the same job. For business interpretation, it helps to separate main evidence from diagnostic evidence.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| LLMFusion construction | Main benchmark contribution | Enables direct comparison between human and LLM recipe adaptations across the same country-pair structure | Does not by itself prove all cultural domains behave like recipes |
| Correlation between novelty metrics and cultural distances | Main evidence | LLM novelty does not track human cultural-distance patterns, especially for New Surprise and Difference | Does not identify the full causal source of the failure |
| Model comparison across eight LLMs | Comparison / robustness-style evidence | The failure is not obviously fixed by multilinguality, multimodality, scale, or newer instruction/reasoning models | Does not exhaust all models or all prompting/fine-tuning approaches |
| Keyword prompt analysis | Sensitivity test | Prompt adjectives weakly control culturally grounded novelty; model effects dominate keyword effects | Does not mean prompts have no value in any workflow |
| Layer-wise Logit Lens analysis | Exploratory mechanistic diagnostic | Cultural divergence appears weakly preserved or compressed in internal representations | Does not fully explain generation causality |
| Country-attribution title analysis | Grounding diagnostic | Models blur cultural boundaries and make systematic country mismatches | Title behavior is a proxy, not the whole artifact |
| Ingredient overlap, preservation, and TF-IDF attribution | Material-grounding evidence | LLMs substitute or genericize ingredients instead of preserving cultural cores | Ingredient data alone cannot capture all culinary meaning |

The strongest business takeaway comes from the combination, not any single figure. The correlation results show the gap. The keyword tests show that prompting is not a clean control knob. The layer analysis suggests why the gap may be structural. The title and ingredient analyses show how the failure becomes visible in the artifact.

This is what makes the paper more useful than another “AI is biased” warning. It gives a failure chain that a product team can actually test.

What this means for business localization workflows

The paper directly studies English-language recipes. Cognaptus’ business inference is broader but bounded: any workflow that asks LLMs to generate culturally adapted artifacts should treat fluency and target-culture labeling as insufficient evidence of quality.

This applies to several practical settings.

For marketing teams, the risk is not only offensive stereotypes. The quieter risk is generic cultural flattening: content that is safe, polished, and empty. It passes brand review because nothing looks obviously wrong. It fails because local readers feel it was assembled from tourist-brochure fragments.

For tourism, food, and lifestyle media, the risk is target-culture overfitting. A generated “local version” may aggressively insert famous markers from the target culture while destroying the original artifact. That can be useful if the task is parody or fusion fantasy. It is not useful if the task is cultural adaptation.

For education and training content, the risk is boundary collapse. If a model treats nearby countries or regions as interchangeable, it can teach simplifications that look harmless until someone with actual local knowledge reads them.

For AI product teams, the practical response is not “ban LLM localization.” That would be satisfyingly dramatic and commercially useless. The response is to add artifact-level checks.

A serious workflow should include at least five layers:

  1. Source-core preservation checks. Identify the elements that should survive adaptation. In recipes, these are core ingredients or preparation patterns. In brand copy, they may be brand values, product claims, or narrative motifs.

  2. Target-culture grounding checks. Verify that introduced elements are not merely the most globally recognizable symbols of the target culture.

  3. Country-boundary checks. Test whether the model confuses neighboring or popular cultures, especially where the business context is politically or culturally sensitive.

  4. Generic-substitution checks. Detect drift toward universal placeholders. In recipes, that is onion, garlic, salt, oil, “optional.” In business content, it may be vague words like “community,” “innovation,” “heritage,” and “empowerment.” The corporate seasoning rack is also well stocked.

  5. Human review where stakes are high. Local expertise is not decorative. It is the evaluation layer for meanings that metrics only approximate.
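As one concrete instance of the checks listed above, the generic-substitution layer can start as a simple ratio. This is a minimal sketch under stated assumptions: the generic-term list is borrowed loosely from the common ingredients reported in the paper's appendix, and the 0.5 review threshold is a hypothetical placeholder that a real team would need to calibrate.

```python
# Assumed list of globally common ingredients; adjust per domain.
GENERIC_TERMS = {
    "salt", "onion", "garlic", "oil", "sugar", "pepper",
    "butter", "flour", "egg", "water", "milk",
}

def generic_share(ingredients, generic_terms=GENERIC_TERMS):
    """Fraction of ingredients that are globally generic building blocks."""
    if not ingredients:
        return 0.0
    return sum(item in generic_terms for item in ingredients) / len(ingredients)

def flag_for_review(ingredients, threshold=0.5):
    """Route an artifact to human review when generic items dominate.

    The threshold is a hypothetical default, not a value from the paper.
    """
    return generic_share(ingredients) >= threshold

# Invented examples.
flattened = ["salt", "onion", "garlic", "oil", "allspice"]
grounded = ["couscous", "lamb", "cumin", "onion"]

flattened_share = generic_share(flattened)
grounded_share = generic_share(grounded)
```

The same shape works for business copy by swapping the ingredient list for the vague-word list. The point of the score is triage: it decides which artifacts a local expert sees first, not which ones are acceptable.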

The ROI logic is simple. The value is not merely cheaper content generation. The value is cheaper diagnosis before content reaches customers. If an organization uses LLMs to produce culturally sensitive content at scale, the bottleneck should move from “write everything manually” to “automatically flag the artifacts most likely to erase, confuse, or genericize culture.”

What the paper does not show

The boundaries are important.

First, the domain is recipes. Food is an excellent cultural artifact, but it is not all culture. Results may differ for music descriptions, travel itineraries, legal notices, classroom examples, political messaging, or luxury-brand copy. The mechanism is portable as a hypothesis, not proven everywhere.

Second, the recipes are English-only. This matters. English may flatten cultural signals and may disadvantage multilingual models that could behave differently when generating in local languages. The paper itself notes that incorporating multilingual recipe data could reveal different adaptation strategies.

Third, GlobalFusion is broad but uneven. It covers 130 countries, but online recipe availability is not evenly distributed across regions. Cuisines with richer online documentation may be easier for models and metrics to represent. Underrepresented or culturally distinctive ingredient systems may suffer more.

Fourth, the divergence metrics are proxies. They are useful because they operationalize novelty and distance, but they do not replace human cultural judgment. A high or low divergence score is not automatically “good” or “bad.” It must be interpreted against the adaptation goal.

Finally, the model set is broad but not final. The tested models cover size, instruction tuning, multilinguality, multimodality, and newer reasoning-oriented families, but future models, retrieval-augmented systems, fine-tuned cultural datasets, or tool-assisted generation may behave differently. The right conclusion is not “LLMs can never do cultural adaptation.” The right conclusion is “current general-purpose LLM generation cannot be trusted merely because it sounds fluent and local.”

The better question is not whether the recipe sounds Jamaican

The paper’s title asks whether LLMs can cook Jamaican couscous. The answer is: they can produce a recipe with Jamaican signals. That is not the same as cooking Jamaican couscous.

The deeper question is whether a model can transform a cultural artifact while preserving enough of its original identity to make the adaptation meaningful. Human adaptation often works by balance: keeping a core, modifying around it, and respecting the distance between cultures. The models in this paper often work by substitution: identify target-culture markers, insert them, and let the source structure quietly disappear.

That is why the paper matters for business AI. Many companies are already using LLMs for localization-like tasks because the outputs are fluent, fast, and cheap. The danger is that cheap localization can become expensive cultural flattening. Not always loudly. Often politely. With good grammar.

The operational lesson is clear: do not evaluate cultural generation by surface fluency, country labels, or prompt compliance alone. Evaluate the artifact. Check preservation. Check grounding. Check boundaries. Check whether “creative” means culturally meaningful transformation or just a higher dose of familiar tokens.

Because sometimes the model does not need more spice.

It needs to remember what it was cooking.

Cognaptus: Automate the Present, Incubate the Future.


  1. Florian Carichon, Romain Rampa, and Golnoosh Farnadi, “Can LLMs Cook Jamaican Couscous? A Study of Cultural Novelty in Recipe Generation,” arXiv:2602.10964, 2026. https://arxiv.org/abs/2602.10964