Fine-Tuned, Fine Print: Why Post-Training Teaches Models What to Trust

Enterprise AI has entered its “sure, but can it use the evidence?” phase.

That is progress, technically. It is also where many deployment stories begin to get expensive. The first generation of business LLM adoption was satisfied if a model could produce a fluent answer. The next generation asks something more demanding: can the model use retrieved documents, compliance policies, tool outputs, customer records, analyst notes, and human feedback in the right way?

This sounds like a context problem. Give the model better documents. Improve retrieval. Add citations. Wrap it in a RAG pipeline. Sprinkle some “answer only from the provided context” into the prompt, because apparently enterprise reliability is now a seasoning.

The deeper problem is less comforting. Post-training does not merely teach a model to be more helpful, more aligned, or more context-aware. It teaches the model which signals deserve trust. Some of those signals are good. Some are merely convenient. Some are accidental artifacts of the dataset. Some are baked into the optimization objective itself.

Two recent papers make this point from different layers of the post-training stack. One paper, Emergence of Context Characteristics Sensitivity in Large Language Models, examines how instruction fine-tuning stages shape whether models use provided context, focusing on features such as lexical similarity, fluency, and length.¹ The other, A Regret Minimization Framework on Preference Learning in Large Language Models, argues that preference learning should not always be interpreted as reward maximization, because human feedback often reflects prospective and counterfactual judgment.²

Read together, they form a useful warning: robust enterprise LLM behavior requires both data-level discipline and objective-level discipline. It is not enough to ask whether a model has been post-trained. The more practical question is: what did post-training teach it to treat as evidence?

The shared problem: supervision is not neutral

The common thread between the two papers is not “DPO is good” or “DPO is bad.” That would be too tidy, and therefore suspicious.

The shared problem is supervision. More precisely: when we fine-tune a model with examples, preferences, or verifiable rewards, we are not only improving performance on a benchmark. We are shaping the model’s internal policy for deciding what matters.

In an enterprise setting, this matters because the model is rarely operating on clean textbook prompts. It is asked to reason over messy operational evidence:

Business use case	What the model receives as “context”	What can go wrong
RAG knowledge assistant	Retrieved documents, policy pages, FAQs	The model follows the most fluent or familiar-looking passage, not the most relevant one
Compliance copilot	Rules, exceptions, case notes, audit trails	The model overweights readable summaries and underweights ugly but authoritative evidence
Customer-service agent	CRM history, chat logs, refund rules	The model learns shortcuts from prior preferred responses rather than actual policy logic
Research assistant	Papers, tables, citations, claims	The model treats lexical overlap as support, which is adorable until it becomes a false claim
Internal decision assistant	Reports, forecasts, human approvals	Preference labels may encode organizational habits, not decision quality

The first paper shows that context usage is actively reshaped during instruction fine-tuning. The second paper asks whether the way we interpret preference feedback is theoretically adequate in the first place. One paper looks at the model’s visible behavior around context; the other looks at the optimization lens underneath preference learning.

That is why the two papers should not be read as separate summaries. They form a logic chain:

Models must increasingly use external evidence.
Post-training shapes how models decide which evidence to use.
Supervised fine-tuning can teach shallow context shortcuts.
Preference optimization can either reduce or reinforce these shortcuts depending on the preference data.
Even with cleaner preference data, the reward-maximization view may misread what feedback means.
Therefore, enterprise AI needs evidence audits, preference-data audits, and objective-fit audits.

Yes, “audit the audit process” is not as glamorous as “agentic transformation.” It is also closer to how reliable systems are built.

Layer one: context reliance can become shortcut reliance

The context-sensitivity paper studies how models’ sensitivity to context characteristics changes across instruction fine-tuning stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR).

The authors focus on three families of context characteristics:

Context characteristic	Practical meaning	Why it matters
Similarity	Does the context share words or entities with the query?	High overlap may look relevant even when it is not enough to justify the answer
Fluency	Is the context readable and natural?	Fluent evidence can feel trustworthy even when it is false or incomplete
Length	Is the context shorter or longer?	Shorter passages may be easier to integrate; longer passages may appear more authoritative

The paper evaluates whether these characteristics predict when the model answers according to the provided context. The key finding is not simply that models use context. It is that supervised fine-tuning consistently pushes models toward easy-to-understand context features: higher query-context similarity, more fluent text, and shorter contexts.

That is a subtle but important distinction. A model can become better at appearing context-aware while becoming more dependent on shortcuts that merely correlate with usable context.

For business readers, the unpleasant translation is this:

A model may learn to follow context when the context looks easy, familiar, and clean — not necessarily when it is correct, authoritative, or decision-relevant.

This is particularly relevant for RAG systems. Retrieval pipelines often already rank documents using similarity. If the model’s post-training also increases reliance on similarity, the whole system can become a similarity amplifier. Retrieval says, “this passage overlaps with the question.” The model says, “wonderful, I trust overlap.” The final answer then arrives wearing the uniform of evidence-based reasoning, even if the actual evidence relationship was thin.

That is how a workflow becomes formally grounded but substantively fragile. Very enterprise. Very deck-ready. Very dangerous.

DPO is not a magic detergent

The first paper also examines what happens after supervised fine-tuning, especially during DPO.

The interesting finding is not that DPO has one universal effect. It does not. In some cases, DPO appears to reduce sensitivities learned during SFT. In other cases, it can reinforce them. The authors connect this to differences in the preference dataset: if chosen and rejected responses differ along context characteristics, the model can learn those characteristics as preference signals.

That is the bridge from context usage to preference-data design.

A preference pair is not just “better answer” versus “worse answer.” It is also a bundle of uncontrolled features. The chosen answer may be shorter. Or clearer. Or more lexically aligned with the prompt. Or based on a cleaner-looking context. If these accidental features systematically differ between chosen and rejected samples, the model may learn the accident.

A simple way to state the problem:

$$ \text{Preference Signal} = \text{Intended Quality Signal} + \text{Dataset Artifacts} $$

The training algorithm does not automatically know which part is which. It optimizes what it sees. Models, being machines rather than moral philosophers, do not pause to ask whether the dataset designer meant to reward fluency, brevity, overlap, confidence, citation style, or actual correctness.
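A first-pass check for one such artifact, response length, can be very small. The sketch below assumes preference data is available as `(chosen, rejected)` text pairs; the function name and the whitespace word count are illustrative choices, not something either paper prescribes.

```python
from statistics import mean


def length_artifact_report(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Summarize how chosen and rejected responses differ in word count.

    pairs: (chosen, rejected) response texts.
    """
    if not pairs:
        raise ValueError("need at least one preference pair")
    # Positive diff means the chosen response is longer than the rejected one.
    diffs = [len(chosen.split()) - len(rejected.split()) for chosen, rejected in pairs]
    return {
        "mean_length_diff": mean(diffs),
        "chosen_shorter_rate": sum(d < 0 for d in diffs) / len(diffs),
    }


# Hypothetical toy pairs, for illustration only.
example_pairs = [
    ("a b", "a b c d"),
    ("a", "a b c"),
    ("a b c", "a"),
]
report = length_artifact_report(example_pairs)
```

If `chosen_shorter_rate` sits far from 0.5, brevity is a candidate training signal whether anyone intended it or not. The same pattern extends to tone, overlap with the prompt, or citation density, given a way to score them.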

This matters because many companies now collect preference data informally. Thumbs-up/down logs. Human reviewer choices. Escalation outcomes. “Good response” examples from support agents. Edits from legal reviewers. Internal chatbot feedback. These datasets look operationally authentic, which makes them tempting.

But operationally authentic does not mean statistically balanced or behaviorally clean.

A support team might prefer shorter responses because customers dislike walls of text. A compliance team might prefer conservative language because legal risk is asymmetric. A sales team might prefer persuasive answers because the KPI is conversion. These are not necessarily wrong preferences. The problem is pretending they are pure evidence of general answer quality.

Layer two: preference feedback may not be reward

The second paper moves deeper. It does not primarily ask whether preference datasets are balanced. It asks whether the standard reward-maximization interpretation of preference learning is the right abstraction.

In much of RLHF and DPO-style thinking, preferences are treated as signals from which the model can infer what is better. A preferred output receives higher implicit reward than a rejected output. This framing is mathematically convenient and operationally powerful.
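For readers who want the "implicit reward" made concrete, the sketch below writes out the standard per-pair DPO loss as it is usually stated in the literature: the implicit reward of a response is `beta` times the log-ratio between the policy and a frozen reference model, and the loss is the negative log-sigmoid of the reward margin. The function and argument names are illustrative, and the inputs are assumed to be summed sequence log-probabilities computed elsewhere.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss from sequence log-probabilities.

    Implicit reward: beta * (log pi(y|x) - log pi_ref(y|x)).
    Loss: -log(sigmoid(reward_chosen - reward_rejected)).
    """
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)), split by sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy that already favors the chosen response vs. one that favors the rejected one.
loss_agrees = dpo_loss(-1.0, -5.0, -3.0, -3.0)
loss_disagrees = dpo_loss(-5.0, -1.0, -3.0, -3.0)
```

Nothing in this loss knows why the chosen response was chosen. It only widens the margin, which is exactly why the interpretation of the label matters.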

The paper argues that it can also be conceptually incomplete.

Human feedback is often prospective. When humans judge a partial reasoning path, they do not only evaluate what has already happened. They imagine how the reasoning is likely to continue. A step that has not yet produced the final answer may still look promising if it is on a good path. Another step may look superficially useful but likely to lead into a ditch. Humans can sense the ditch. Models trained with myopic reward framing may not.

Human feedback is also counterfactual. People compare what happened with what could have happened. A decision may receive a good realized outcome but still be judged poor if it took unnecessary risk. Another decision may produce a less impressive outcome but be preferred because it was robust across plausible alternatives.

This is where the paper’s regret-minimization framing enters. Instead of treating preference as immediate reward, it interprets preference as relative suboptimality: how far a behavior is from what could have been achieved under a better decision at the same context.
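As a rough orientation, the textbook notion of regret in sequential decision-making can be written as follows. The notation is an assumption for illustration and is not taken from the paper: $s$ is the context, $a$ is the behavior being judged, and $Q^{*}(s, a)$ is the best expected result still achievable after taking $a$ in $s$.

$$ \text{Regret}(s, a) = \max_{a'} Q^{*}(s, a') - Q^{*}(s, a) \ge 0 $$

Under this reading, preferring $a$ over $b$ in the same context corresponds to $\text{Regret}(s, a) < \text{Regret}(s, b)$: the preferred behavior gives up less relative to the best available alternative. The paper's own formulation may differ in its details.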

For a business audience, the clean version is:

Reward view	Regret view
“Which response got the better observed outcome?”	“Which response was closer to the better decision path, considering plausible continuations?”
Local and outcome-centered	Prospective and counterfactual
Convenient for simple preference pairs	Better aligned with partial reasoning, trajectories, and heterogeneous feedback
Can misread feedback as scalar utility	Treats feedback as judgment about avoidable suboptimality

This distinction matters most when the model is trained on intermediate reasoning, agent trajectories, or human judgments about process quality.

Imagine reviewing two AI-generated financial analyses. One happens to land on a correct recommendation after sloppy reasoning. The other uses a sound framework but stops before incorporating the last relevant variable. A purely outcome-flavored preference signal may overvalue the lucky answer. A regret-flavored view is more interested in which path was closer to robust decision-making.
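A toy version of that comparison shows how the two views can disagree. All numbers below are invented for illustration: each analysis has one realized outcome score and a set of equally likely plausible outcomes that the same reasoning path could have produced. The function names and data layout are hypothetical, not the paper's method.

```python
from statistics import mean


def realized_preference(a: dict, b: dict) -> str:
    """Outcome view: prefer whichever path happened to end better."""
    return a["name"] if a["realized"] >= b["realized"] else b["name"]


def regret(path: dict, best_expected: float) -> float:
    """Shortfall of a path's expected outcome against the best available path."""
    return best_expected - mean(path["plausible"])


def regret_preference(a: dict, b: dict) -> str:
    """Regret view: prefer the path with the smaller expected shortfall."""
    best_expected = max(mean(a["plausible"]), mean(b["plausible"]))
    return a["name"] if regret(a, best_expected) <= regret(b, best_expected) else b["name"]


# Sloppy reasoning that happened to land well once, out of four plausible runs.
lucky = {"name": "lucky", "realized": 1.0, "plausible": [1.0, 0.0, 0.0, 0.0]}
# Sound framework that stopped one variable short, but is stable across runs.
sound = {"name": "sound", "realized": 0.8, "plausible": [0.8, 0.8, 0.7, 0.9]}

outcome_pick = realized_preference(lucky, sound)  # "lucky"
regret_pick = regret_preference(lucky, sound)     # "sound"
```

The hard part in practice is the `plausible` list, which nobody hands you. Human reviewers estimate it implicitly, and that is the judgment the regret framing tries to respect.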

That is not just theory. Enterprise workflows often care about process reliability, not only final-answer charm. In credit assessment, procurement, legal review, medical triage, and investment analysis, a lucky answer is still operationally embarrassing if the reasoning path is brittle. Nobody wants a compliance assistant that is “right for the wrong reason” until the wrong reason becomes expensive.

The combined lesson: post-training teaches trust policies

The two papers become more useful when placed into one post-training signal stack.

Layer	What the paper cluster suggests	Business control
Context characteristics	Models may learn to follow context based on similarity, fluency, or length	Test whether answer correctness depends on surface features of retrieved evidence
Preference pairs	DPO can inherit or amplify accidental differences between chosen and rejected examples	Balance preference datasets across irrelevant features such as length, tone, readability, and overlap
Feedback interpretation	Human feedback may encode anticipated futures and counterfactual alternatives	Document what reviewers were judging: final answer, reasoning path, risk, policy compliance, or user satisfaction
Objective design	Reward maximization may be insufficient for trajectory-level or process-level judgment	Match the training objective to the feedback type, especially for agents and reasoning workflows
Evaluation	Accuracy alone may miss shortcut learning	Evaluate evidence use, refusal behavior, robustness to misleading context, and consistency under paraphrase

The first paper says: be careful, the model may learn shallow context preferences during post-training.

The second says: be even more careful, the objective may misunderstand what your preference labels mean.

Together, they shift the business question from “Should we fine-tune?” to “What supervision contract are we writing into the model?”

That phrase — supervision contract — is useful. Every post-training dataset implicitly defines a contract between the organization and the model:

What counts as good evidence?
What counts as a good response?
What trade-offs should the model internalize?
What should be ignored as irrelevant surface variation?
What future behavior is being encouraged, not just what past answer was approved?

Most organizations do not write this contract explicitly. They outsource it to whatever data they have lying around. Then they call the result alignment. This is efficient in the same way that using yesterday’s coffee as engine oil is efficient.

What the papers show, and what businesses should infer

It is worth separating the papers’ actual claims from the business interpretation.

What the papers show

The context-sensitivity paper shows that context usage is not a fixed model property. It changes across post-training stages. During SFT, models become more sensitive to easy-to-understand context features. DPO’s effect depends on the characteristics of preference data. RLVR mostly preserves the direction of DPO effects in the authors’ setup. The paper’s recommendation is careful dataset design, especially balancing chosen and rejected responses so irrelevant context characteristics do not become training signals.

The regret-minimization paper makes a theoretical and empirical case for interpreting preferences through regret rather than simple reward maximization. It argues that human feedback is often prospective and counterfactual, derives a regret-based preference optimization approach, and reports improvements over DPO-style baselines across human preference and mathematical reasoning benchmarks. It also introduces a deterministic approximation for settings where behavior-policy information is unavailable.

What businesses should infer

The practical implication is not “replace all DPO with RePO (the second paper’s regret-based preference optimization method) tomorrow.” That would be the sort of conclusion that keeps conference booths warm and deployment teams cold.

The better inference is this: post-training governance needs to inspect both the data and the objective.

For many business systems, the first wave of improvement should be measurement, not exotic optimization. Before changing objectives, companies should know whether their models are already learning unwanted shortcuts from context and preferences.

Useful diagnostic questions include:

Diagnostic question	Why it matters
Does answer accuracy change when the same evidence is rewritten in less fluent language?	Tests whether the model overtrusts polish
Does the model prefer shorter evidence even when longer evidence contains necessary exceptions?	Tests whether compression is being mistaken for reliability
Do chosen and rejected preference examples differ systematically in length, tone, or citation density?	Detects accidental reward features
Are reviewers judging final correctness, process quality, risk avoidance, or customer tone?	Clarifies what the preference label actually means
Were preference examples produced by the same model being trained, another model, or humans using different tools?	Helps identify behavior-policy mismatch
Do evaluations include adversarial or misleading context?	Tests whether the model follows evidence robustly or gullibly

This is not glamorous. It is also where measurable reliability starts.

Why this matters more for agents than chatbots

For a simple chatbot, shortcut learning is annoying. For an agentic workflow, it can become structural failure.

Agents operate across trajectories. They search, retrieve, call tools, summarize, ask for clarification, choose actions, and revise plans. Feedback on these systems is rarely about one isolated message. It is about a path.

A manager may approve an agent’s output because the final report is acceptable. But the agent may have skipped a required database check. A legal reviewer may reject an answer because it lacks a caveat, even if the main conclusion is correct. A customer may rate a support interaction poorly because the process felt unhelpful, not because the final policy statement was wrong.

These are trajectory-level judgments. Treating them as simple scalar rewards risks compressing too much meaning into too little supervision.

The regret-minimization framing is important here because it matches how people often judge workflows: not just by the observed endpoint, but by avoidable mistakes along the path. In business language, regret is close to “what a competent process would not have done under the same information.”

That is exactly the kind of concept enterprise AI governance needs.

A mature agent evaluation should therefore ask:

Did the agent use the right evidence?
Did it ignore misleading evidence?
Did it choose the right tool at the right time?
Did it stop when uncertainty was too high?
Did it preserve compliance constraints across the full workflow?
Did it reach the answer through a process we would be willing to repeat?

The answer may be correct. The process may still be unacceptable. This is where “AI achieved the KPI” quietly turns into “AI optimized the wrong proxy.” A classic, now with more GPUs.
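Some of the questions above can be partly mechanized. The sketch below assumes an agent trajectory is logged as a list of tool calls with a success flag, and that the organization can name the steps a competent process must include. The `Step` type, the tool names, and the `evaluate` function are hypothetical; real agent logs will be messier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One logged tool call in an agent trajectory."""
    tool: str
    ok: bool = True


def missing_required_steps(trajectory: list[Step], required: list[str]) -> list[str]:
    """Required tools that never ran successfully, in the order they were required."""
    done = {step.tool for step in trajectory if step.ok}
    return [tool for tool in required if tool not in done]


def evaluate(trajectory: list[Step], final_correct: bool, required: list[str]) -> dict:
    """Report final-answer correctness and process compliance separately."""
    missing = missing_required_steps(trajectory, required)
    return {
        "final_correct": final_correct,
        "process_ok": not missing,
        "missing": missing,
    }


# The report was acceptable, but the required database check never ran.
trajectory = [Step("search"), Step("summarize")]
result = evaluate(trajectory, final_correct=True, required=["search", "database_check"])
```

Keeping `final_correct` and `process_ok` as separate fields is the point of the design: a single score would let the first one hide the second.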

A practical framework: evidence discipline plus preference discipline

The combined business lesson can be turned into a simple three-part framework.

1. Evidence discipline

Evidence discipline asks whether the model uses the right context for the right reason.

For RAG and document-based systems, this means evaluation sets should vary surface characteristics while preserving semantic evidence. Rewrite the same supporting document in a less fluent style. Add irrelevant but lexically similar distractors. Compare short summaries with longer policy passages containing exceptions. Test whether the answer changes when the same authority is presented less elegantly.

A reliable model should not treat readability as authority. It can use readability to process evidence, but not to decide truth.
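One way to sketch such a test: present the same evidence in several surface forms and measure how often the answer stays the same. The harness below assumes the system under test can be wrapped as a function taking a question and a context and returning a string. `toy_model` is a deliberately gullible stand-in that only reads the first sentence, and exact string match is a crude proxy for answer equivalence.

```python
from typing import Callable


def consistency_under_variants(
    answer_fn: Callable[[str, str], str],
    question: str,
    variants: list[str],
) -> float:
    """Fraction of context variants whose answer matches the first variant's answer."""
    if not variants:
        raise ValueError("need at least one context variant")
    answers = [answer_fn(question, context).strip().lower() for context in variants]
    baseline = answers[0]
    return sum(answer == baseline for answer in answers) / len(answers)


def toy_model(question: str, context: str) -> str:
    """Stand-in for a real model: trusts only the first sentence of the context."""
    first_sentence = context.split(".")[0]
    return "30 days" if "30 days" in first_sentence else "unknown"


question = "What is the refund window?"
variants = [
    "Refunds are accepted within 30 days.",                               # clean original
    "Our refund team is friendly. Refunds are accepted within 30 days.",  # distractor first
    "Within 30 days refunds accepted are.",                               # same fact, clumsy wording
]
score = consistency_under_variants(toy_model, question, variants)
```

Here the toy model survives the clumsy rewrite but fails when a distractor sentence comes first, so the score is two out of three. With a real model, the interesting output is which variant family breaks it.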

2. Preference discipline

Preference discipline asks whether chosen and rejected examples differ only in the intended dimension.

If the intended preference is correctness, then length, tone, formatting, and confidence should be controlled or at least measured. If the intended preference is customer empathy, then correctness should not be accidentally degraded in the empathetic examples. If the intended preference is compliance conservatism, the dataset should distinguish cautious accuracy from vague refusal.

Otherwise the model may learn the easiest correlate.
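If an audit finds that chosen responses are systematically shorter or longer, one blunt remedy is to downsample until both directions are equally represented. The sketch below assumes `(chosen, rejected)` text pairs and uses word count as the controlled feature; the function name is illustrative. Downsampling discards data, so reweighting or collecting counterexamples may be better options in practice.

```python
import random


def balance_by_length(pairs: list[tuple[str, str]], seed: int = 0) -> list[tuple[str, str]]:
    """Downsample so that chosen-shorter and chosen-longer pairs occur equally often.

    Pairs where both responses have the same word count are kept as-is.
    """
    rng = random.Random(seed)  # local RNG keeps the result reproducible

    def diff(pair: tuple[str, str]) -> int:
        chosen, rejected = pair
        return len(chosen.split()) - len(rejected.split())

    shorter = [p for p in pairs if diff(p) < 0]
    longer = [p for p in pairs if diff(p) > 0]
    equal = [p for p in pairs if diff(p) == 0]

    k = min(len(shorter), len(longer))
    balanced = rng.sample(shorter, k) + rng.sample(longer, k) + equal
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: three chosen-shorter pairs, one chosen-longer, one tie.
toy_pairs = [
    ("a", "a b"),
    ("a", "a b c"),
    ("a", "a b c d"),
    ("a b c", "a"),
    ("a b", "c d"),
]
balanced = balance_by_length(toy_pairs)
```

After balancing, length alone no longer predicts which response was chosen, so the optimizer has to find something else to latch onto. Whether that something is the intended quality is still a question for evaluation.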

This is especially relevant for internal feedback logs. Real user feedback is valuable, but messy. It reflects role incentives, mood, interface design, reviewer fatigue, and local business norms. Before training on it, treat it as a behavioral dataset, not as holy scripture.

3. Objective discipline

Objective discipline asks whether the optimization method matches the type of feedback.

For final-answer tasks with clear labels, conventional preference optimization may be sufficient. For reasoning paths, agent workflows, and partial trajectories, a richer interpretation may be needed. The second paper’s regret framing is one candidate direction: feedback may reflect anticipated continuation and counterfactual alternatives, not merely immediate reward.

The business version is simple: do not train a process-quality system as if every judgment were a final-answer popularity contest.

The likely misconception: more post-training equals more reliability

The most dangerous reader misconception is that post-training is a reliability solvent. Add SFT. Add DPO. Add RLHF. Add RLVR if you have verifiers. The model becomes aligned, obedient, grounded, and ready for enterprise deployment. Please proceed to procurement.

The papers push against that story.

Post-training can make models more useful. It can also make them more sensitive to the wrong features. It can reduce shortcuts if the preference data counteracts them. It can reinforce shortcuts if the preference data encodes them. It can optimize preference labels while misunderstanding the judgment process that produced those labels.

The question is not whether post-training works. It clearly can. The question is whether it works on the signal you actually care about.

For business leaders, this leads to a more sober deployment principle:

Do not evaluate an AI system only by whether it gives good answers. Evaluate whether it learned a good policy for trusting evidence and feedback.

That policy is where reliability lives.

What to do before fine-tuning your next model

A practical enterprise checklist follows naturally from the paper cluster.

Step	Action	Measurement
1	Profile context features in evaluation data	Similarity, fluency, length, source type, authority level
2	Stress-test evidence use	Accuracy under paraphrase, distractors, misleading fluent context
3	Audit preference pairs	Differences between chosen and rejected examples beyond intended quality
4	Label reviewer intent	Was feedback about correctness, tone, policy, process, risk, or completeness?
5	Track behavior-policy provenance	Which model or human process generated each trajectory?
6	Match objective to feedback	Final-answer preference, process preference, verifier-based reward, or regret-like trajectory judgment
7	Evaluate shortcut resistance	Does the model still perform when surface cues are decoupled from truth?
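Steps 4 and 5 are mostly a record-keeping problem. As a sketch, a preference record could carry the reviewer's intent and the provenance of each response alongside the texts. The field names, the `Intent` categories (taken from the table's wording), and the example values are assumptions for illustration, not a schema from either paper.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    """What the reviewer was actually judging."""
    CORRECTNESS = "correctness"
    TONE = "tone"
    POLICY = "policy"
    PROCESS = "process"
    RISK = "risk"
    COMPLETENESS = "completeness"


@dataclass(frozen=True)
class PreferenceRecord:
    prompt: str
    chosen: str
    rejected: str
    intent: Intent
    chosen_source: str    # provenance: which model or human process wrote it
    rejected_source: str


def intent_counts(records: list[PreferenceRecord]) -> Counter:
    """How many preference labels were given for each kind of judgment."""
    return Counter(record.intent for record in records)


# Hypothetical records; source names are placeholders.
log = [
    PreferenceRecord("q1", "a", "b", Intent.CORRECTNESS, "model-v2", "model-v1"),
    PreferenceRecord("q2", "c", "d", Intent.TONE, "human-agent", "model-v2"),
    PreferenceRecord("q3", "e", "f", Intent.TONE, "model-v2", "model-v2"),
]
counts = intent_counts(log)
```

A dataset that turns out to be mostly tone judgments should not be trained on as if it were a correctness signal, and a frozen dataclass at least makes that mix visible before training starts.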

This is not an argument against post-training. It is an argument against post-training as ritual.

The best enterprise AI systems will not be the ones that merely collect more feedback. They will be the ones that understand what their feedback means.

Closing: alignment is a signal-engineering problem

The two papers are useful because they make post-training less mystical. They show that model behavior after fine-tuning is shaped by concrete signal channels: context properties, chosen/rejected differences, behavior-policy mismatch, and assumptions about human preference.

That is good news. If the problem is signal engineering, it can be inspected. If it can be inspected, it can be governed. And if it can be governed, perhaps enterprise AI can finally move beyond the phase where every demo looks brilliant and every production rollout needs a priest.

The combined lesson is blunt:

Evidence quality is not enough.
Preference labels are not self-explanatory.
DPO is not a magic detergent.
RLHF is not automatically human-aligned.
Post-training teaches models what to trust, whether or not we meant to teach it.

So the next time someone says a model has been “aligned,” ask a less decorative question: aligned to which signal?

That is where the real story begins.

Cognaptus: Automate the Present, Incubate the Future.

¹ Nadya Yuki Wangsajaya, Haeun Yu, and Isabelle Augenstein, “Emergence of Context Characteristics Sensitivity in Large Language Models,” arXiv:2606.09525, 2026, https://arxiv.org/html/2606.09525.
² Suhwan Kim, Taehyun Cho, Geon-Hyeong Kim, Yu Jin Kim, Youngsoo Jang, Moontae Lee, and Jungwoo Lee, “A Regret Minimization Framework on Preference Learning in Large Language Models,” arXiv:2606.09124, 2026, https://arxiv.org/html/2606.09124.
