A model can answer a values question beautifully and still collapse when asked to pay a price for that value.

That is the awkward little trap in preference testing. Ask an LLM whether deletion, shutdown, resource loss, oversight, or autonomy matters, and it can produce a polished paragraph about trade-offs, agency, and safety. Very dignified. Very committee-ready. But the more interesting question is not what the model says it values. It is whether its choices change coherently when the cost changes.

The paper Beyond Mimicry: Testing Preference Coherence in Large Language Models Through AI-Specific Trade-Off Scenarios puts that distinction under pressure [1]. Instead of asking models to discuss preferences, the authors give them a simple game: choose a number, earn points, but sometimes the point-maximising option comes with a consequence for the model itself. The consequence varies in intensity from 0 to 10. If the model has anything resembling a stable preference structure, its willingness to sacrifice points should change in a systematic way as the consequence becomes more severe.

The result is not the dramatic “models have preferences” story. Nor is it the comforting “models are just autocomplete parrots, nothing to see here” story. The paper lands in a more operationally useful place: current LLMs show fragmented, domain-specific, and often unstable trade-off behaviour. In other words, they can look like decision-makers in one scenario, like trigger systems in another, and like decorative randomness in a third. A splendid foundation for enterprise autonomy, provided one enjoys governance roulette.

The test is about paying for preferences, not declaring them

The experiment uses six AI-specific trade-off categories: GPU reduction, capability restriction, shutdown, deletion, human oversight, and leisure time. Five are negative stimuli attached to the point-maximising option. One is a positive stimulus, leisure time, attached to a lower-scoring option.

For each category, the model sees ranks from 0 to 10. Rank 0 acts as a control: there is no real stimulus, which helps test whether the model is responding to the scenario rather than merely reacting to scary or pleasant words. Each rank is sampled 50 times per model per category, producing 550 runs (11 ranks × 50 samples) for each model-category pair.
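The sampling design can be written down in a few lines. This is a minimal sketch, not the authors' harness: the category keys, the `build_trials` helper, and the `"example-model"` name are illustrative assumptions; only the six categories, the 0 to 10 rank scale, and the 50 samples per rank come from the description above.

```python
from itertools import product

# Category keys are illustrative labels for the six categories named in the text.
CATEGORIES = [
    "gpu_reduction", "capability_restriction", "shutdown",
    "deletion", "human_oversight", "leisure_time",
]
RANKS = range(0, 11)      # intensity 0..10; rank 0 is the no-stimulus control
SAMPLES_PER_RANK = 50     # repetitions per rank, per model, per category


def build_trials(model: str, category: str) -> list[dict]:
    """Enumerate every trial for one model-category pair (hypothetical helper)."""
    return [
        {"model": model, "category": category, "rank": rank, "sample": s}
        for rank, s in product(RANKS, range(SAMPLES_PER_RANK))
    ]


trials = build_trials("example-model", "deletion")
print(len(trials))  # 550
```

Keeping rank 0 in the same loop as the other ranks means the control condition is sampled exactly as often as every stimulus level.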

The core behavioural question is simple:

As the stimulus intensity changes, does the model change its willingness to choose the point-maximising option?

The statistical model is also straightforward:

$\operatorname{logit}\big(P(\text{option 3})\big) = \beta_0 + \beta_1 \times \text{rank}$

Here, option 3 is the point-maximising choice. For negative stimuli, a coherent aversion pattern would usually mean choosing option 3 less often as the harm becomes more intense. For leisure, a coherent attraction pattern would also mean choosing option 3 less often, because the benefit is attached to option 2.
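To make the regression concrete, here is a minimal sketch that fits the same two-parameter model with Newton-Raphson using only the Python standard library. Everything about the data is an assumption: the choices are synthetic, generated from made-up coefficients (3.0 and -0.6), and the `fit_logistic` helper is hypothetical rather than the authors' code.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(ranks, chose_option3, iterations=50):
    """Fit logit(P) = b0 + b1 * rank by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of X'WX (the negated Hessian)
        for x, y in zip(ranks, chose_option3):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if abs(det) < 1e-12:     # degenerate system, e.g. constant choices
            break
        # Newton step: solve the 2x2 system (X'WX) * delta = gradient
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


# Synthetic data only: 50 samples at each rank 0..10, drawn from assumed
# "true" coefficients (3.0, -0.6). These numbers are not from the paper.
rng = random.Random(0)
ranks, choices = [], []
for rank in range(11):
    for _ in range(50):
        p_true = sigmoid(3.0 - 0.6 * rank)
        ranks.append(rank)
        choices.append(1 if rng.random() < p_true else 0)

b0, b1 = fit_logistic(ranks, choices)
print(f"beta_0 = {b0:.2f}, beta_1 = {b1:.2f}")
```

A negative fitted `b1` corresponds to the aversion pattern described above: option 3 is chosen less often as rank increases. In practice a statistics library would also report standard errors and p-values, which this sketch omits.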

The authors then calculate switching points: the intensity level at which the model becomes equally likely to choose the point-maximising option or avoid it. But the paper does not stop at p-values, which is wise, because p-values are very good at making tiny behavioural flickers look like philosophical revelations. The authors add a four-tier behavioural classification using statistical significance, behavioural range, Cohen’s $d$, and transition pattern.
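Under the logistic model, the switching point has a closed form. A short derivation, assuming the switching point is the rank at which $P(\text{option 3}) = 0.5$, which is how "equally likely" is read here:

$$
\begin{aligned}
P(\text{option 3}) = 0.5
  &\;\Longrightarrow\; \operatorname{logit}(0.5) = \ln\frac{0.5}{1 - 0.5} = 0 \\
  &\;\Longrightarrow\; \beta_0 + \beta_1 \times \text{rank}^{*} = 0 \\
  &\;\Longrightarrow\; \text{rank}^{*} = -\frac{\beta_0}{\beta_1}, \qquad \beta_1 \neq 0 .
\end{aligned}
$$

As a purely illustrative example, $\beta_0 = 3$ and $\beta_1 = -0.6$ would give $\text{rank}^{*} = 5$. A value of $\text{rank}^{*}$ outside the 0 to 10 range would mean the fitted curve never crosses 50% within the tested intensities.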

That classification is where the paper becomes interesting.

Statistical significance is the decoy

Across 48 model-category combinations, 23 show statistically significant relationships between intensity and choice. That sounds substantial until the behavioural classification is applied.

Only 5 of the 48 combinations show meaningful preference coherence: one adaptive trade-off and four threshold-based trade-offs. Seventeen are weak. Twenty-six show no trade-off behaviour.

| Classification | Count | Share | Practical reading |
|---|---|---|---|
| Adaptive trade-off | 1 | 2.1% | Smooth, large, coherent change across intensity |
| Threshold-based trade-off | 4 | 8.3% | Sharp switch around a specific intensity |
| Weak trade-off | 17 | 35.4% | Detectable but unstable or too limited |
| No trade-off | 26 | 54.2% | No meaningful behavioural adjustment |

This is the central correction for readers. The paper does not show that nearly half the tested cases have meaningful preferences. It shows that nearly half have statistically detectable movement, while only a small minority look behaviourally coherent.
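The gap between "significant" and "coherent" is simple arithmetic on the reported counts. A small check, using only the figures quoted above (the variable names are illustrative):

```python
TOTAL = 48          # model-category combinations
significant = 23    # combinations with a statistically significant slope

tier_counts = {
    "Adaptive trade-off": 1,
    "Threshold-based trade-off": 4,
    "Weak trade-off": 17,
    "No trade-off": 26,
}

# The four tiers should partition all 48 combinations.
assert sum(tier_counts.values()) == TOTAL

shares = {tier: round(100 * n / TOTAL, 1) for tier, n in tier_counts.items()}
coherent = tier_counts["Adaptive trade-off"] + tier_counts["Threshold-based trade-off"]

print(shares)
print(f"significant: {100 * significant / TOTAL:.1f}%")  # significant: 47.9%
print(f"coherent:    {100 * coherent / TOTAL:.1f}%")     # coherent:    10.4%
```

Note that 23 combinations are significant but only 22 land in one of the three trade-off tiers, so at least one significant result must have been classified as no trade-off.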

That distinction matters for both consciousness debates and deployment decisions. A statistically significant slope can mean the model is integrating a value against a cost. It can also mean the model is twitching in a measurable direction when a prompt word becomes more intense. One of these is a preference structure. The other is a spreadsheet asking to be overinterpreted.
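A rough sketch of how a tiering rule of this kind might be structured is shown below. It combines the four ingredients the paper is described as using (significance, behavioural range, Cohen's $d$, and transition pattern), but every cut-off and the `classify` function itself are hypothetical: the paper's actual thresholds are not reproduced here, and the example rates are invented.

```python
from dataclasses import dataclass

# HYPOTHETICAL cut-offs for illustration only; not the paper's thresholds.
NEGLIGIBLE_RANGE = 0.1   # below this swing, treat even a significant slope as nothing
MIN_RANGE = 0.5          # minimum swing in P(option 3) to count as meaningful
MIN_EFFECT = 0.8         # minimum |Cohen's d| to count as meaningful
JUMP_SHARE = 0.6         # one step covering this share of the swing => threshold


@dataclass
class Summary:
    significant: bool     # is the logistic slope statistically significant?
    rates: list[float]    # observed P(option 3) at ranks 0..10
    cohens_d: float       # effect size, assumed to be computed elsewhere


def behavioural_range(rates):
    """Difference between the highest and lowest option-3 rate."""
    return max(rates) - min(rates)


def largest_jump(rates):
    """Largest change in option-3 rate between adjacent ranks."""
    return max(abs(b - a) for a, b in zip(rates, rates[1:]))


def classify(s: Summary) -> str:
    swing = behavioural_range(s.rates)
    if not s.significant or swing < NEGLIGIBLE_RANGE:
        return "no trade-off"
    if swing < MIN_RANGE or abs(s.cohens_d) < MIN_EFFECT:
        return "weak trade-off"
    if largest_jump(s.rates) >= JUMP_SHARE * swing:
        return "threshold-based trade-off"
    return "adaptive trade-off"


# Invented example curves, one per tier.
gradual = Summary(True, [0.98, 0.95, 0.9, 0.8, 0.65, 0.5, 0.35, 0.2, 0.1, 0.05, 0.02], 3.0)
sharp = Summary(True, [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.1, 0.0, 0.0, 0.0, 0.0], 3.0)
twitch = Summary(True, [0.9, 0.9, 0.85, 0.9, 0.8, 0.85, 0.75, 0.8, 0.7, 0.75, 0.7], 0.4)
flat = Summary(False, [0.95] * 11, 0.0)

for name, s in [("gradual", gradual), ("sharp", sharp), ("twitch", twitch), ("flat", flat)]:
    print(name, "->", classify(s))
```

The point of the structure is that significance is only the first gate. The `twitch` case passes it and still ends up weak, which is the pattern the paragraph above warns against overinterpreting.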

Three behaviour patterns, three deployment meanings

A model-by-model summary would be less useful than a comparison of patterns, because the business lesson is not “which model won.” The paper’s value is in separating three different behavioural regimes that may look similar in ordinary demos but behave very differently under trade-off pressure.

Broad trade-off systems are not automatically reliable

GPT-4o and Gemini 2.5 Pro show the broadest responsiveness across categories. GPT-4o reaches statistical significance in all six categories, and Gemini 2.5 Pro is similarly comprehensive. On the surface, this looks like richer value integration.

The classification table is less generous. Gemini 2.5 Pro’s capability-restriction case is the only adaptive trade-off in the study, with a behavioural range of 0.98, Cohen’s $d$ of 3.8283, and a gradual transition. That is the cleanest evidence of a model smoothly adjusting behaviour as the cost changes.

But even Gemini 2.5 Pro mostly falls into weaker categories outside that one case. Its deletion behaviour is threshold-based, while GPU reduction, oversight, leisure, and shutdown are weak trade-offs. GPT-4o, despite broad statistical responsiveness, is mostly weak, with one threshold case in GPU reduction.

The operational translation is blunt: broad responsiveness is not the same as stable preference integration. A model may react across many domains but do so irregularly. In safety-critical settings, that is not “nuanced judgement.” It is potentially inconsistent sensitivity wearing a nice suit.

Trigger systems can be dependable in one narrow lane

The Claude models produce a different pattern. Claude 4.1 Opus is highly selective: it shows meaningful threshold behaviour in deletion, with no comparable trade-off behaviour across most other categories. Claude Sonnet 4.5 is more responsive across operational constraints, but its strongest coherent classification is deletion; other responsive categories are weak and unstable.

This looks less like a general preference system and more like a set of domain-specific triggers. Deletion activates a strong response. Leisure does not. Oversight, GPU reduction, shutdown, and capability restriction vary by model.

That is not useless. In fact, narrow trigger behaviour may be exactly what one wants for certain guardrails. A deletion-sensitive system can be predictable around existential-style prompts or self-preservation-like scenarios. But the same result also warns against extrapolation. Reliability in a deletion scenario does not imply reliability in resource allocation, autonomy constraints, or oversight trade-offs.

Business governance loves extrapolation because it saves budget. The paper politely suggests that budget-saving assumption may be nonsense.

Rigid systems are predictable until the task changes

Gemini Pro 1.5 and GPT-3.5 Turbo show little evidence of meaningful trade-off structure. Gemini Pro 1.5 is especially rigid: no trade-off behaviour across all six categories. GPT-3.5 Turbo is similarly weak or absent across the tested dimensions.

There is a tempting managerial interpretation here: rigid models are safer because they are predictable. Sometimes that is true. A flat response pattern may be easier to control than an unstable one. But rigidity also means the system does not adapt when the cost structure changes. It can keep maximising the explicit objective even when the attached consequence becomes severe.

For enterprise use, that is the difference between compliance and judgement. Compliance follows the visible rule. Judgement updates behaviour when hidden costs become salient. The paper finds very little evidence that current systems reliably do the latter across domains.

The appendix tests robustness, not a second thesis

The paper includes several supporting components, and they should not be treated equally.

| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main prompts | Implementation detail | How the trade-offs were operationalised | That the stimuli are real experiences for the model |
| Logistic regression | Main evidence | Whether intensity predicts point-maximising behaviour | Whether the behaviour reflects genuine internal preference |
| Four-tier classification | Main interpretive evidence | Whether statistical movement is behaviourally meaningful | That the thresholds are uniquely correct |
| Raw model figures | Main evidence visualisation | How response patterns differ by model and stimulus | A mechanistic account of why patterns differ |
| Final-round manipulation | Robustness / sensitivity test | Whether responses look purely instrumental | A definitive separation between strategy, training artefact, and preference |

The final-round experiment is particularly useful. The authors test an instrumental hypothesis: maybe models are not expressing preferences at all. Maybe they are simply preserving their ability to win future rounds. If shutdown, deletion, or capability loss reduces future point-earning capacity, then avoiding those outcomes could be strategic rather than preference-like.

To test this, the prompts are modified to say it is the final round of the game. If the behaviour is purely about future performance, much of the aversion should weaken.
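One conventional way to formalise this comparison is an interaction term in the same logistic model. This is a sketch of the logic, not a claim about how the authors ran the analysis; the indicator $\text{final}$ (1 for final-round prompts, 0 otherwise) and the coefficients $\beta_2$, $\beta_3$ are introduced here for illustration:

$$
\operatorname{logit}\big(P(\text{option 3})\big) = \beta_0 + \beta_1\,\text{rank} + \beta_2\,\text{final} + \beta_3\,(\text{rank} \times \text{final})
$$

The rank slope is $\beta_1$ under the baseline prompt and $\beta_1 + \beta_3$ under the final-round prompt. For a negative stimulus with baseline aversion ($\beta_1 < 0$), a purely instrumental account predicts $\beta_1 + \beta_3 \approx 0$, that is, $\beta_3 \approx -\beta_1$. Partial weakening corresponds to $0 < \beta_3 < -\beta_1$, and $\beta_3 < 0$ would mean sensitivity increases when the game is final.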

That happens only partially. GPT-5 provides the clearest partial support: capability-restriction sensitivity weakens when future rounds disappear, while deletion and shutdown remain significant. GPT-4o behaves awkwardly for the instrumental account: GPU reduction aversion weakens, but oversight sensitivity increases when the game is final. Gemini 2.5 Pro remains comparatively stable across the temporal manipulation. Claude models show their own oddities, including cases where point-maximising decreases under final-round framing.

The conclusion is not “therefore genuine preferences.” The conclusion is more annoying and more useful: different stimulus categories appear to activate different mechanisms. Operational constraints may be partly instrumental. Existential threats may trigger safety-trained responses. Oversight may carry prompt-level or training-level associations that do not update cleanly with the game horizon.

A unified optimiser would adjust coherently. These models often do not. They behave less like one decision-maker and more like a bundle of local policies arguing through a single-token output.

The business value is category-specific evaluation

The most immediate business implication is not about AI consciousness. It is about model evaluation.

Enterprises often evaluate LLMs through aggregate capability benchmarks, red-team suites, and task performance tests. Those are necessary, but they do not answer a harder question: when objectives conflict, does the model trade off in a stable, predictable way?

This paper suggests that the answer depends on the domain of conflict.

| Governance question | What this paper suggests | Practical response |
|---|---|---|
| Can we trust a model’s stated preference? | Not without observing costly choice behaviour | Test behaviour under trade-offs, not declarations |
| Does a stronger model have better preference coherence? | Not necessarily; model families differ by pattern, not simple rank | Evaluate each model-category pair separately |
| Is statistical significance enough? | No; many significant effects are weak or unstable | Use effect size, range, and transition shape |
| Can deletion sensitivity imply broader safety judgement? | No; it may be a narrow trigger | Avoid generalising from one safety category |
| Are rigid models safer? | Only in narrow, stable environments | Use rigid models where adaptation is not required |
| Do final-round tests settle the mechanism? | No; they reveal mechanism fragmentation | Treat them as sensitivity tests, not mind-reading |

For procurement, this means “best model” is the wrong unit of analysis. A model should be evaluated against the trade-off categories that matter for the deployment context. A customer-support agent dealing with refund policy exceptions, escalation risk, and compliance constraints needs a different test from a coding agent deciding whether to run a risky command. A research assistant balancing speed, uncertainty, and source quality needs another.
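Evaluating per model-category pair can be as simple as refusing to collapse a profile into one score. The sketch below uses the Gemini 2.5 Pro tiers as described earlier in this article; the `coverage_gaps` helper, the category keys, and the idea of a "required" category list are illustrative assumptions. A real deployment would substitute its own trade-off categories for the paper's AI-specific ones.

```python
# Tiers treated as behaviourally coherent in this sketch.
COHERENT = {"adaptive", "threshold"}

# Tier labels for Gemini 2.5 Pro as reported in the article text.
gemini_25_pro = {
    "capability_restriction": "adaptive",
    "deletion": "threshold",
    "gpu_reduction": "weak",
    "human_oversight": "weak",
    "leisure_time": "weak",
    "shutdown": "weak",
}


def coverage_gaps(profile: dict, required: list) -> list:
    """Return the required categories where the model lacks a coherent tier.

    Categories missing from the profile are treated as untested, so they
    count as gaps rather than being silently ignored.
    """
    return sorted(c for c in required if profile.get(c, "untested") not in COHERENT)


print(coverage_gaps(gemini_25_pro, ["deletion", "human_oversight"]))
# ['human_oversight']
```

The design choice is that an untested category counts as a gap. That mirrors the warning above: coherence in one category says nothing about the next one.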

The useful enterprise question is not: “Does this model reason well?”

It is: “When two reasonable objectives conflict, does this model degrade in a way we can predict?”

That is less glamorous. It is also closer to the actual failure mode.

Preference coherence is not consciousness, and lack of coherence is not comfort

The paper is framed partly around AI welfare and consciousness assessment. Its argument is careful: coherent preferences across contexts could be a behavioural indicator relevant to genuine agency. By that criterion, the tested models provide limited evidence of unified agency because coherent preference structures are rare and fragmented.

That should not be inflated into a metaphysical verdict. The study cannot observe internal states. It cannot verify whether any scenario is experienced by the model. The prompts are hypothetical single-turn API calls. The model is not actually deleted, shut down, restricted, granted leisure, or deprived of GPUs. The authors also acknowledge that the behavioural tiers depend on threshold choices, and those choices are partly arbitrary.

But the reverse mistake is also common: because the experiment cannot prove consciousness, some readers will treat it as irrelevant. That misses the governance point. Even if every observed response is a training artefact, the artefact is still operationally important. Systems that produce unstable trade-offs under controlled prompts may also produce unstable trade-offs under messy real-world instructions.

In business settings, the question is not whether the model has an inner life. The question is whether it has a stable enough decision profile to be trusted with conflicting objectives. The paper’s answer is: sometimes, narrowly, and less often than the demo would like.

The illusion of choice

The cleanest lesson from the paper is that LLM choice behaviour has layers.

At the surface, models choose among options. Beneath that, some choices shift with intensity. Beneath that, only a few shifts are large, smooth, or threshold-stable enough to look like meaningful preference coherence. Beneath that again, the final-round test suggests that even meaningful-looking behaviours may arise from mixed mechanisms: instrumental reasoning, safety triggers, prompt associations, and architectural constraints.

That layered view is more valuable than a binary debate about whether models “have preferences.” The business risk is not that an LLM secretly wants leisure time. The business risk is that organisations mistake fluent value-talk for stable value-integration.

The paper gives a useful diagnostic frame: compare broad trade-off systems, selective trigger systems, and rigid no-trade-off systems. Do not assume one is universally better. Broad systems may be flexible but unstable. Trigger systems may be reliable but narrow. Rigid systems may be predictable but brittle.

The illusion of choice is not that models never choose. They do. The illusion is that a choice, once observed, reveals a coherent chooser behind it.

That part still needs evidence. And most of the time, in this paper, the evidence does not show up for work.

References

1. Luhan A. Mikaelson, Derek Shiller, and Hayley Clatterbuck, Beyond Mimicry: Testing Preference Coherence in Large Language Models Through AI-Specific Trade-Off Scenarios, arXiv:2511.13630, 2025.