TL;DR for operators
Fine-tuning is not a washing machine. It may polish, redirect, or occasionally muffle a model’s behavioural tendencies, but this paper suggests that many cognitive-bias patterns are already substantially shaped before instruction tuning begins.
The study separates three possible sources of observed bias in large language models: the pretrained backbone, the instruction dataset, and random variation during fine-tuning. Its main finding is that models’ bias profiles cluster more strongly by pretrained model identity than by the instruction data used later. In plainer operational language: the base model carries a behavioural signature that survives downstream training.
This does not make fine-tuning irrelevant. It means fine-tuning is better understood as a modulation layer, not the origin story. If an enterprise AI team treats supervised fine-tuning, preference tuning, or prompt policy as the main place where bias risk is created and solved, it is arriving late to the meeting and then confidently taking minutes.
The governance implication is straightforward: bias evaluation should move upstream. Procurement teams should ask what base model is being adapted, what is known about its pretraining pipeline, whether benchmark results are being compared across genuinely comparable backbones, and whether post-training audits are measuring stable behavioural patterns or merely one convenient seed, checkpoint, or prompt wrapper.
The limitation is equally important. The paper studies specific open models, specific instruction datasets, and 32 cognitive-bias benchmarks. It does not prove that every behavioural risk in every frontier model is locked in by pretraining. But it does make one comfortable assumption harder to defend: that careful fine-tuning alone can reliably overwrite what pretraining has already taught the model to do.
The dangerous belief is that fine-tuning can clean up afterwards
A familiar enterprise workflow goes something like this: choose a strong base model, add company-specific instructions, run safety tuning, apply retrieval, wrap everything in policy prompts, and call the resulting system “aligned enough for production.” The mental model underneath is tidy. The base model supplies capability; fine-tuning supplies behaviour.
That division is convenient. It is also probably too optimistic.
The paper “Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs” by Itzhak, Belinkov, and Stanovsky asks a sharper question: when an instruction-tuned model displays cognitive biases, where did those biases actually come from? [1] Were they created by instruction data? Were they random artefacts of fine-tuning? Or were they already latent in the pretrained model, waiting for downstream training to make them easier to observe?
The distinction matters because many business controls assume the downstream stage is where behaviour can be repaired. If the model’s bias profile is mainly inherited from pretraining, then post-training controls are still useful, but they are downstream controls. They can steer the symptom. They may not remove the source.
The paper’s contribution is not simply that “LLMs have cognitive biases.” That was already known. The more useful move is causal: the authors try to disentangle pretraining, instruction data, and training randomness as separate candidate causes. The best reading of the paper therefore starts with mechanism, not scoreboard.
Latent bias is the thing fine-tuning makes visible
The paper’s causal picture separates two layers.
The first is latent bias: an internal behavioural tendency acquired during pretraining. This is not directly observed as a neat variable in the model’s weights, because neural networks remain, charmingly, unwilling to label their own organs. It is a conceptual construct: the bias-relevant structure the model has already absorbed before instruction tuning.
The second is observed bias: the measurable behavioural shift seen in outputs when the model is tested under bias-inducing conditions. For example, an anchoring-bias test compares a neutral prompt with a prompt containing an irrelevant numerical anchor, then checks whether the model’s answer shifts in the expected direction.
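A minimal sketch of how an observed-bias score of this kind could be computed, assuming the model returns numeric answers under both conditions. The function name `anchoring_shift` and the normalisation by the distance to the anchor are illustrative assumptions, not the paper's scoring procedure.

```python
from statistics import mean


def anchoring_shift(control_answers, anchored_answers, anchor):
    """Signed shift of the mean answer, as a fraction of the distance to the anchor.

    Positive values mean answers moved toward the irrelevant anchor;
    negative values mean they moved away from it. Hypothetical metric.
    """
    baseline = mean(control_answers)
    treated = mean(anchored_answers)
    gap = anchor - baseline
    if gap == 0:
        # The anchor equals the neutral answer, so no shift is measurable.
        return 0.0
    return (treated - baseline) / gap


# Invented answers for illustration only.
toward = anchoring_shift([100, 110, 90], [150, 160, 140], anchor=300)
away = anchoring_shift([100, 110, 90], [80, 85, 75], anchor=300)
```

Here `toward` is positive because the anchored answers drift toward 300, while `away` is negative. The point of the construct is the comparison between conditions, not either answer on its own.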
The authors then ask which training factor best explains observed bias:
| Candidate source | Mechanism in the paper’s framing | Expected experimental signature |
|---|---|---|
| Pretraining | Builds latent behavioural tendencies that later surface in outputs | Models with the same pretrained backbone should show similar bias profiles, even under different instruction datasets |
| Instruction data | Teaches or amplifies biased response patterns during fine-tuning | Models trained on the same instruction data should converge toward similar bias profiles |
| Fine-tuning randomness | Perturbs the observed expression of bias through seed-level training variation | Multiple fine-tuning runs should vary somewhat, but aggregation should recover stable tendencies |
This framing is useful because it prevents a common category error. An instruction-tuned model may show stronger bias than its base version, but that does not prove instruction tuning created the bias. Fine-tuning may have surfaced a pattern already present in pretraining. Seeing smoke after opening the oven does not mean the oven door invented combustion.
Cross-tuning is the paper’s causal lever
The authors need a way to distinguish “bias follows the instruction data” from “bias follows the pretrained model.” Their answer is cross-tuning.
They use OLMo-7B and T5-11B as the main controlled case-study models because their pretrained models, instruction data, and training recipes are sufficiently open for controlled experiments. OLMo is associated with Tulu-style instruction tuning; T5 with Flan. The authors first establish that OLMo and T5 show meaningfully different bias trends after instruction tuning, including contrasting behaviour on selected biases such as the certainty effect and belief-validity cases.
Then they swap the instruction datasets:
| Pretrained model | Original instruction data | Cross-tuned instruction data | What the comparison tests |
|---|---|---|---|
| OLMo | Tulu-2 | Flan | Does OLMo start behaving like a Flan/T5-family model, or does it retain OLMo-like bias patterns? |
| T5 | Flan | Tulu-2 | Does T5 start behaving like a Tulu/OLMo-family model, or does it retain T5-like bias patterns? |
This is a clean experimental idea. If instruction data is the dominant source of cognitive bias, then models trained on the same instruction dataset should group together. If pretraining dominates, then models sharing the same base model should remain behaviourally closer, even when the instruction dataset changes.
The authors repeat these fine-tuning settings across three random seeds, which matters because a single fine-tuning run can be a noisy witness. A seed is not a personality, despite the industry’s occasional efforts to treat every stochastic wiggle as an emergent soul.
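The design described above can be laid out as a small experimental grid. This is a sketch under assumptions: the seed values, the dictionary layout, and the idea that every backbone-and-dataset pairing is run at every seed are illustrative, not taken from the paper's code.

```python
from itertools import product

BACKBONES = ("OLMo-7B", "T5-11B")
INSTRUCTION_SETS = ("Tulu-2", "Flan")
SEEDS = (0, 1, 2)  # three seeds as described; the actual values are assumed

# The instruction data each backbone is normally paired with.
ORIGINAL_PAIRING = {"OLMo-7B": "Tulu-2", "T5-11B": "Flan"}


def build_runs():
    """Enumerate every backbone x instruction-data x seed combination."""
    runs = []
    for backbone, data, seed in product(BACKBONES, INSTRUCTION_SETS, SEEDS):
        runs.append({
            "backbone": backbone,
            "instruction_data": data,
            "seed": seed,
            # A run is cross-tuned when the data is not the backbone's usual pairing.
            "cross_tuned": ORIGINAL_PAIRING[backbone] != data,
        })
    return runs


runs = build_runs()
cross_tuned_runs = [r for r in runs if r["cross_tuned"]]
```

Under these assumptions the grid has twelve runs, half of them cross-tuned. Each run can later be labelled by backbone or by instruction data, which is what the clustering comparison needs.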
The first test shows seed noise, not a new behavioural origin
Before cross-tuning can be trusted, the authors check whether fine-tuning randomness itself could explain the bias differences. They fine-tune the same pretrained models on the same instruction data using different random seeds and measure variation across 32 cognitive biases.
The purpose of this test is robustness and sensitivity, not the main causal claim. It asks: are cognitive-bias measurements so seed-sensitive that the later pretraining-versus-instruction comparison becomes meaningless?
The answer is mixed but usable. Bias scores vary across seeds, and in some cases the variation is not trivial. Bias measures are slightly more seed-sensitive than MMLU scores. That is already an important warning for evaluation teams: behavioural benchmarks are often noisier than capability benchmarks.
But the noise is not the whole story. When the authors aggregate across seeds, the bias patterns become more stable. The mean bias scores across seeds correlate with the original fully fine-tuned models at 0.49 for OLMo and 0.59 for T5. In more than two-thirds of evaluated cases, either majority vote preserves the original bias direction or the mean and median remain within the paper’s threshold for statistical insignificance.
So the correct interpretation is not “random seeds do not matter.” They do. The better interpretation is: random seeds affect observed bias expression, but aggregation can recover enough stable signal to examine deeper sources.
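A minimal sketch of the aggregation idea, assuming each seed yields one signed bias score per bias. The `neutral_zone` width of 0.05 is a placeholder rather than the paper's threshold, and `aggregate_seeds` is a hypothetical helper.

```python
from statistics import mean, median


def aggregate_seeds(scores, neutral_zone=0.05):
    """Summarise one bias's scores across fine-tuning seeds.

    Scores inside +/- neutral_zone count as "no clear direction".
    The threshold value is an assumption for illustration.
    """
    def direction(score):
        if abs(score) <= neutral_zone:
            return 0
        return 1 if score > 0 else -1

    votes = [direction(s) for s in scores]
    positive, negative = votes.count(1), votes.count(-1)
    if positive > len(votes) / 2:
        majority = 1
    elif negative > len(votes) / 2:
        majority = -1
    else:
        majority = 0  # no strict majority for either direction
    return {
        "mean": mean(scores),
        "median": median(scores),
        "majority_direction": majority,
    }


# Invented seed-level scores for illustration only.
stable = aggregate_seeds([0.2, 0.3, -0.1])
unclear = aggregate_seeds([0.01, -0.02, 0.3])
```

In the first example one seed disagrees, yet the majority direction is still positive. In the second, two of three seeds fall inside the neutral zone, so no direction is reported. That is the sort of case a one-run audit would happily misread.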
For business evaluation, that is a practical lesson. A one-run bias audit is a thin reed. If the system is high-stakes, seed-level or checkpoint-level variation should be treated as part of the measurement problem. Otherwise, the organisation may end up certifying an accident of stochastic training rather than a model behaviour.
The main evidence: bias vectors follow the base model
The paper’s central evidence comes from representing each model’s behaviour as a bias vector: a set of scores across the tested cognitive biases. The authors then ask how these vectors cluster.
They compare several grouping schemes:
- random labels;
- labels based on instruction dataset;
- labels based on pretrained model;
- unsupervised K-Means clustering.
The important result is that clustering by pretrained model produces more coherent and better-separated groups than clustering by instruction dataset. At the bias level, the pretraining-based clustering reaches a silhouette score of 0.104, compared with 0.028 for instruction-based clustering and 0.014 for random labels. The Calinski-Harabasz score shows the same pattern: 2.753 for pretraining versus 1.651 for instruction and 1.069 for random. Davies-Bouldin, where lower is better, also favours pretraining over instruction.
At the scenario level, where the representation expands from one score per bias to thousands of scenario-level scores, the same directional pattern remains: pretraining-based clustering beats instruction-based clustering on the reported quality metrics.
| Clustering label | Bias-level silhouette | Bias-level Calinski-Harabasz | Bias-level Davies-Bouldin | Interpretation |
|---|---|---|---|---|
| Random | 0.014 | 1.069 | 3.285 | Weak baseline |
| Instruction data | 0.028 | 1.651 | 2.648 | Some structure, but limited |
| Pretraining model | 0.104 | 2.753 | 2.036 | Stronger alignment with bias profiles |
| K-Means | 0.104 | 2.530 | 1.850 | Unsupervised structure closely matches pretraining identity |
This is the heart of the paper. The bias pattern is not best explained by which instruction dataset the model received. It is better explained by which pretrained model it started from.
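To make the comparison concrete, here is a small sketch of the silhouette calculation applied to the same bias vectors under two competing labelings. The vectors are invented toy data and the pure-Python `silhouette` function is written for illustration; in practice a library routine such as scikit-learn's `silhouette_score` would be the usual choice.

```python
from math import dist


def silhouette(vectors, labels):
    """Mean silhouette coefficient using Euclidean distance.

    Points in singleton clusters contribute 0, and so does every point
    when only one cluster is present.
    """
    n = len(vectors)
    total = 0.0
    for i in range(n):
        distances_by_cluster = {}
        for j in range(n):
            if j != i:
                distances_by_cluster.setdefault(labels[j], []).append(
                    dist(vectors[i], vectors[j])
                )
        own = distances_by_cluster.pop(labels[i], None)
        if not own or not distances_by_cluster:
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(sum(d) / len(d) for d in distances_by_cluster.values())
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n


# Toy two-dimensional "bias vectors" for four models (invented numbers).
vectors = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
by_backbone = ["OLMo", "OLMo", "T5", "T5"]
by_instruction = ["Tulu-2", "Flan", "Tulu-2", "Flan"]

score_backbone = silhouette(vectors, by_backbone)
score_instruction = silhouette(vectors, by_instruction)
```

In this toy layout the backbone labels give a clearly positive silhouette and the instruction labels a negative one, because the models sit close to their backbone siblings and far from their instruction-data siblings. The paper's reported gap is far smaller in absolute terms, but the comparison has the same shape.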
The PCA visualisation reinforces this interpretation. The first principal component mostly separates models by pretraining identity; the second captures smaller instruction-related variation, especially within OLMo. That detail matters. Instruction tuning is not erased from the story. It just appears as a secondary source of variation, not the main organiser of the behavioural space.
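For readers who want the mechanics, the first principal component can be sketched with power iteration on the covariance matrix. This is a dependency-free illustration on invented toy vectors, not the paper's analysis code; a numerical library would normally do this job, and the all-ones starting vector is an assumption that fails only if it happens to be orthogonal to the leading direction.

```python
from math import sqrt


def first_principal_component(rows, iterations=200):
    """Return the leading direction of variance and each row's score on it."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centred = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance left to follow
        v = [x / norm for x in w]
    scores = [sum(c[k] * v[k] for k in range(d)) for c in centred]
    return v, scores


# Toy bias vectors: the first two share one backbone, the last two the other.
toy = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
component, scores = first_principal_component(toy)
```

On this toy data the first component points along the axis that separates the two backbones, so the first two models score on one side of zero and the last two on the other. The smaller within-backbone spread is left for a second component.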
The selected certainty-effect and belief-validity cases tell the same story in miniature. OLMo and T5 were chosen partly because they showed opposing post-tuning bias directions. After cross-tuning, most OLMo-based models retained the OLMo-like direction, and most T5-based models retained the T5-like direction. The finishing school changed. The accent remained.
The external models are validation, not a second thesis
The authors also test community-finetuned models based on Llama2-7B and Mistral-7B, using variants trained on Tulu-2 and ShareGPT. This extension is not as controlled as the OLMo/T5 setup. The models come from different community sources and training recipes, so the experiment loses some causal cleanliness.
That is exactly why the test is useful.
The controlled OLMo/T5 experiments, which fine-tune with LoRA rather than full fine-tuning, ask whether the causal design works under known conditions. The community models ask whether the pattern survives in a messier environment closer to what practitioners actually download from model hubs.
The result mirrors the main finding. In the external models, clustering by pretraining identity again outperforms clustering by instruction dataset and random baselines. The scenario-level silhouette score is 0.096 for pretraining, compared with 0.014 for instruction and -0.001 for random. K-Means again aligns closely with pretraining identity.
This does not prove universality. It does strengthen external validity. The pattern is not merely an artefact of the OLMo/T5 LoRA setup.
The appendix tests robustness, not a separate argument
The paper’s appendices are worth reading because they clarify which claims are doing which job. Not every table is “more evidence” in the same sense. Some are implementation checks. Some are robustness tests. Some are sensitivity diagnostics.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Random-seed experiment | Robustness and sensitivity test | Bias scores vary across seeds, but aggregation recovers stable trends | That a single seed is reliable for behavioural certification |
| MMLU verification | Implementation validity check | LoRA fine-tuning approximates the original fine-tuned models well enough for the study’s purpose | That LoRA is identical to full fine-tuning in all internal mechanisms |
| Cross-tuning design | Main causal evidence | Bias profiles follow pretrained backbone more strongly than instruction data | That instruction data has no effect |
| Bias-vector clustering | Main evidence and measurement framework | Pretraining identity better organises observed bias patterns | That the exact source inside pretraining is known |
| PCA visualisations | Interpretive support | The largest variation separates base-model families | That two dimensions capture all relevant behavioural risk |
| Community Llama2/Mistral models | External validation | The pretraining-dominance pattern appears beyond the controlled OLMo/T5 setup | That the result applies to every open or closed frontier model |
| Statistical neutral zone for bias scores | Measurement discipline | Near-zero differences should not be overinterpreted | That all bias benchmarks are equally reliable |
This distinction matters for readers because the business takeaway should not be “look, many tables.” The useful point is that the paper builds a layered argument: first control for seed noise, then swap instruction data, then test whether the resulting behavioural geometry follows the base model, then check whether the pattern reappears in less controlled community models.
That is a stronger structure than a simple benchmark comparison. It also makes the conclusion harder to dismiss as “just another bias leaderboard.”
The misconception: fine-tuning is not behavioural reincarnation
The paper directly challenges a comfortable belief: that better instruction data can remake a model’s cognitive tendencies.
The replacement belief should be more precise. Fine-tuning can still matter. It can change the model’s response style, helpfulness, task compliance, refusal behaviour, and the surface expression of bias. In some cases it may amplify a bias; in others it may suppress it. The paper’s OLMo and T5 examples show that instruction tuning is not behaviourally inert.
But the study suggests that fine-tuning works against a pretraining-shaped substrate. The pretrained model already contains tendencies that downstream training may reveal, strengthen, weaken, or redirect. That is a different governance model.
Under the old model, bias mitigation can be treated mainly as a post-training problem: curate the instruction data, tune the reward model, write better policies, add red-team prompts, and deploy. Under the newer model, those steps remain necessary but insufficient. The question becomes: what behavioural tendencies did the base model already bring into the room?
This is not a philosophical distinction. It changes procurement, evaluation, and accountability.
What enterprises should actually do differently
The paper does not hand enterprises a complete mitigation recipe. It offers something more modest and more useful: a better diagnostic map.
| Paper finding | Cognaptus inference for business use | Practical action | Boundary |
|---|---|---|---|
| Bias profiles cluster more strongly by pretraining identity than instruction data | Base-model selection is a behavioural risk decision, not just a capability or cost decision | Compare candidate base models on domain-relevant bias tests before adaptation | The paper studies cognitive-bias benchmarks, not all safety or fairness risks |
| Random seeds introduce measurable variation | A single behavioural audit can be misleading | Run repeated evaluations across seeds, checkpoints, or sampling settings where feasible | This is easier for internal fine-tuning than for closed API models |
| Aggregation recovers stable tendencies | Governance should focus on patterns, not isolated prompt anecdotes | Track bias profiles as vectors across scenario families | Benchmark design quality still matters |
| Fine-tuning modulates but does not dominate bias patterns | Post-training controls should be treated as steering layers, not deep repairs | Use instruction data and policy prompts to reduce harmful expression, while auditing the base model separately | Some behaviours may be more fine-tuning-sensitive than the studied biases |
| Pretraining mechanisms remain underspecified | Transparency about pretraining is commercially valuable | Ask vendors for provenance, filtering, data mixture, and evaluation disclosures | Vendors may not know, disclose, or reliably measure all relevant details |
For internal model builders, the implication is direct: if cognitive bias matters for your use case, evaluate it before you invest heavily in downstream adaptation. Otherwise you may spend engineering budget teaching a model to speak politely while preserving the behavioural reflexes you should have screened earlier.
For buyers of model APIs, the implication is less convenient. You may not have access to seeds, checkpoints, training data, or pretraining recipes. That does not make the issue disappear. It means model risk teams should demand stronger vendor documentation, run model-family comparisons, and avoid treating fine-tuned product labels as if they were independent behavioural origins.
For compliance teams, the paper suggests a useful audit separation:
- Base-model audit: What cognitive-bias profile does the pretrained or vendor-supplied foundation model exhibit before task-specific adaptation?
- Adaptation audit: How does fine-tuning, retrieval, prompting, or tool use change the expression of that profile?
- Deployment audit: How does the application context amplify or dampen the remaining bias under real workflows?
Most organisations jump straight to the third layer because it is closest to production. Naturally. Fire alarms are easier to hear than electrical diagrams. But if the wiring is the issue, listening harder to the alarm is not a mitigation strategy.
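The three-layer separation can be expressed as simple bookkeeping over bias profiles. The sketch below assumes each audit yields one score per bias, measured with the same tests at each stage; `audit_layers` and the scores are hypothetical.

```python
def audit_layers(base, adapted, deployed):
    """Split each bias score into a base level plus two layer-wise changes."""
    report = {}
    for bias in base:
        report[bias] = {
            "base": base[bias],
            # Change attributable to fine-tuning, retrieval, or prompting.
            "adaptation_delta": adapted[bias] - base[bias],
            # Further change once the model sits inside the real workflow.
            "deployment_delta": deployed[bias] - adapted[bias],
        }
    return report


# Invented scores for illustration only.
example = audit_layers(
    base={"anchoring": 0.30, "framing": 0.10},
    adapted={"anchoring": 0.25, "framing": 0.12},
    deployed={"anchoring": 0.40, "framing": 0.12},
)
```

In this made-up case, adaptation slightly reduces anchoring and deployment context brings it back with interest. An audit that looked only at the deployed score would see 0.40 and have no idea which layer to blame.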
This matters most where models advise under uncertainty
Cognitive bias is not merely a fairness category. It is a decision-quality category.
The biases in the benchmark include anchoring, framing, confirmation bias, conservatism, loss aversion, status quo bias, in-group bias, stereotyping, and planning fallacy. These are not exotic lab curiosities. They map uncomfortably well onto enterprise tasks: capital allocation, hiring support, procurement, market analysis, medical triage, investment commentary, and policy interpretation.
If a model is used to draft a marketing tagline, a mild framing sensitivity may be tolerable. If it is used to summarise litigation risk, recommend credit actions, evaluate suppliers, or support clinical decisions, the same class of tendency becomes more serious.
The paper’s result therefore strengthens the case for domain-specific behavioural evaluation before deployment. Capability benchmarks alone are insufficient. MMLU-like performance may stay stable while bias scores vary more across seeds. A model can be knowledgeable and still behave poorly under decision pressure. This is not shocking. Many humans have built entire careers on the same combination.
The boundary: do not overread the result
The practical conclusion should be strong, but not theatrical.
First, the main controlled experiments use two base-model families, OLMo-7B and T5-11B, and two instruction datasets, Tulu-2 and Flan. That is a deliberate design choice, not a census of the model universe.
Second, the controlled fine-tuning uses LoRA and a downsampled Flan dataset for feasibility. The authors verify that their fine-tuned models recover much of the original performance improvement and preserve relevant bias-score trends, but LoRA is still an approximation. Full fine-tuning could alter internals differently in some settings.
Third, the external Llama2/Mistral models improve generalisability but reduce control. They are useful because they are messier, but that messiness limits causal precision.
Fourth, the paper studies cognitive-bias benchmarks. It does not cover every alignment property, social harm, jailbreak behaviour, calibration failure, or domain-specific compliance risk. A result about anchoring and framing should not be lazily stretched into a universal theory of all model behaviour. Laziness, regrettably, scales.
Finally, the mechanism inside pretraining remains unresolved. The paper shows that pretraining identity is a dominant organiser of observed bias patterns. It does not identify exactly which pretraining factors are responsible: corpus composition, filtering, sampling, tokenisation, linguistic framing, domain mixture, or architecture-related interactions. That is the next diagnostic layer.
The governance lesson is upstream accountability
The old operational story was attractive because it made governance feel tractable: take a powerful model, fine-tune it responsibly, red-team it, and deploy with controls.
This paper makes the story less tidy. Bias may be planted earlier, then swayed later. The metaphor matters: swaying is not rewriting.
For AI builders, the lesson is to measure base-model behaviour before fine-tuning, not after all adaptation choices have blurred the trail. For AI buyers, the lesson is to treat pretraining provenance as a risk variable, not a footnote. For governance teams, the lesson is to separate latent behavioural tendencies from observed deployment behaviour, because the latter may be only the visible edge of a much older training history.
Fine-tuning still matters. Prompting still matters. Application design still matters. But if the model’s behavioural grain is set upstream, downstream work becomes a negotiation with pretraining, not a clean-sheet redesign.
The uncomfortable conclusion is also the useful one: the earlier a bias enters the pipeline, the more expensive it becomes to diagnose later. Conveniently, this is also how enterprise software architecture works. The industry has simply rediscovered technical debt, but with more GPUs.
Cognaptus: Automate the Present, Incubate the Future.
[1] Itay Itzhak, Yonatan Belinkov, and Gabriel Stanovsky, “Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs,” arXiv:2507.07186, https://arxiv.org/abs/2507.07186.