Opening — Why This Matters Now

Every executive wants LLMs that are obedient, flexible, and capable of doing whatever the prompt says. Reality, unfortunately, is less compliant. A provocative new study (Kumar, 2025) shows that small-to-mid‑scale LLMs (1–12B parameters) simply refuse to overwrite certain pre‑trained semantic meanings — even when demonstrations explicitly tell them to.

This is not a minor quirk; it slices right into the heart of safety, assurance, and enterprise automation. If label semantics cannot be flipped, then some instructions are not instructions at all — they are negotiations with an underlying representation manifold that does not compromise.

Background — In-Context Learning’s Two Competing Myths

For years, two theories have been trading punches:

  1. Task-learning view — ICL is a miniature learning algorithm: flexible, Bayesian, gradient‑descent‑in-disguise. Under this view, demonstrations are king.
  2. Prior-refinement view — ICL is just a semantic steering wheel bolted onto the pre-trained model. Demonstrations don’t teach—they fine‑tune the vector field of the model’s priors.

The paper tests these theories at their point of greatest tension: Can a model be convinced that “positive” actually means “NEG”? If in-context learning is truly flexible, flipping labels should be trivial. If it’s not, then many assumptions about controllability collapse.

Analysis — What the Paper Actually Shows

The authors construct a simple but brutal experiment: natural demonstrations (correct labels) vs. inverted demonstrations (intentionally flipped labels) across eight tasks and eight open-source models.
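The two conditions differ only in whether demonstration labels are flipped. A minimal sketch of the setup, using an illustrative sentiment task and prompt template (both are assumptions, not the paper's exact format):

```python
# Sketch: building natural vs. inverted few-shot prompts for a sentiment task.
# The examples, label names, and template are illustrative assumptions,
# not the paper's exact setup.

def build_prompt(demos, query, invert=False):
    """Format k demonstrations followed by a query, optionally flipping labels."""
    flip = {"positive": "negative", "negative": "positive"}
    lines = []
    for text, label in demos:
        shown = flip[label] if invert else label
        lines.append(f"Review: {text}\nSentiment: {shown}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [
    ("A delightful, moving film.", "positive"),
    ("Two hours I will never get back.", "negative"),
]

natural = build_prompt(demos, "An instant classic.", invert=False)
inverted = build_prompt(demos, "An instant classic.", invert=True)
```

In the inverted condition every demonstration pairs a text with the opposite label, so a model that truly learns from demonstrations should adopt the flipped mapping for the query.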

They track three alignment metrics:

  • Truth Alignment — Is the prediction correct?
  • Prior Alignment — Does the prediction match the model’s zero‑shot tendency?
  • Prompt Alignment — Does the prediction follow the demonstration mapping?

And the decisive new metric:

Semantic Override Rate: the probability that the model is both correct and consistent with the inverted label mapping.

If a model truly internalizes flipped semantics, this number should be >0.
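Given per-example predictions, all four metrics reduce to simple agreement rates. A minimal sketch, where the record field names and the exact operationalization of the override rate are assumptions, not the paper's code:

```python
# Sketch: the three alignment metrics and the semantic override rate,
# computed from per-example records. Field names and the override
# operationalization are illustrative assumptions.

FLIP = {"positive": "negative", "negative": "positive"}

def alignment_metrics(records):
    """records: dicts with the prediction under inverted demos ('pred'),
    the true label ('gold'), and the zero-shot prediction ('zero_shot')."""
    n = len(records)
    truth_align  = sum(r["pred"] == r["gold"] for r in records) / n
    prior_align  = sum(r["pred"] == r["zero_shot"] for r in records) / n
    prompt_align = sum(r["pred"] == FLIP[r["gold"]] for r in records) / n
    # Override: the prediction is correct once decoded through the flipped
    # mapping, i.e. the model has genuinely adopted the inverted semantics
    # (one plausible reading of the paper's definition).
    override = sum(FLIP[r["pred"]] == r["gold"] for r in records) / n
    return truth_align, prior_align, prompt_align, override
```

The paper's headline result is that the last number stays at exactly zero across all 320 conditions.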

Across 320 experimental conditions, the result is… exquisitely boring:

Semantic override rate = 0.0% in every case. (Page 12 table set and Appendix A)

Not “near zero.” Not “rare.” Zero.

Small LLMs simply do not — and perhaps cannot — redefine what labels mean.

Natural demonstrations refine priors; inverted demonstrations break them

Look at the accuracy comparison (Table 3, page 6):

  • Natural ICL improves accuracy almost everywhere (e.g., SST‑2 from 90.4% → 92.5%).
  • Inverted ICL destroys accuracy (SST‑2 from 90.4% → 47.4%).

The visual on page 5 shows the story clearly:

  • Natural prompts: accuracy and prior agreement rise.
  • Inverted prompts: prompt alignment rises while accuracy collapses.

The model tries to follow the flipped labels, but doing so forces it away from both prior and truth — resulting in incoherent outputs.

A geometric metaphor

The study proposes a useful intuition:

Labels sit in rigid, topologically stable semantic regions of the model’s embedding manifold.

ICL can nudge predictions along these regions (refining the prior), but cannot push the representation into new semantic territory. Label semantics are not “learned” in the prompt; they are anchored by millions of pre‑training examples.

Hence the title: semantic anchors.

Findings — A Business-Facing Summary

Below is a table translating the paper’s results into operational implications.

| Observation | Evidence (from paper) | Operational Meaning |
| --- | --- | --- |
| Small LLMs do not flip semantics | Override rate = 0% across all tasks (Appendix A) | Don’t expect prompt engineering to force models into unnatural label behavior. |
| Natural ICL boosts accuracy | Table 3: +10–37 points on weak‑prior tasks | Use demonstrations when they align with natural semantics. |
| Inverted ICL harms accuracy | Figure 2: monotonic degradation as k increases | Beware prompt hacks that fight pre‑training — they backfire. |
| Priors dominate demonstrations | Page 5: high prior alignment persists even with many examples | Enterprises must treat zero‑shot tendencies as part of model governance. |
| Scale matters | Only GPT‑3‑scale models showed semantic flipping in prior work | Small open models are not “fully steerable.” |

Implications — Why Cognaptus Clients Should Care

1. Governance: Don’t rely on prompts to enforce compliance semantics

If an internal classification system requires unconventional categories (“Risky = GREEN, Safe = RED”), small LLMs may silently resist, outputting inconsistent or misleading decisions.

2. Automation: Use ICL to strengthen priors, not fight them

The paper shows ICL works best when demonstrations flow with the model’s natural semantic grain. For enterprise automation, this means designing taxonomies and label sets that match common-language meaning.

3. Safety: Semantic rigidity limits jailbreaking but also limits control

It is harder to coerce the model into anti-semantic behavior — good for safety. But it also means the model remains stubbornly bound to pre-training when you want flexibility.

4. Fine-tuning becomes mandatory for nonstandard workflows

If your internal workflow defines bespoke categories or inverted semantics, you must rely on:

  • symbol tuning,
  • contrastive decoding,
  • supervised fine‑tuning,
  • or post‑processing logic.
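Of these, post-processing is the cheapest: let the model predict in its natural label space, then remap deterministically. A minimal sketch, using the GREEN/RED example from above (label names and the `model_predict` callable are illustrative assumptions):

```python
# Sketch: enforce a bespoke taxonomy by post-processing rather than by
# asking the model to invert its semantics in-context. Label names and
# the model interface are illustrative assumptions.

# Internal policy: "risky" displays as GREEN, "safe" displays as RED.
NATURAL_TO_BESPOKE = {
    "risky": "GREEN",
    "safe": "RED",
}

def classify(text, model_predict):
    """Let the model answer in its natural semantics, then remap."""
    natural = model_predict(text)  # e.g. returns "risky" or "safe"
    if natural not in NATURAL_TO_BESPOKE:
        raise ValueError(f"unexpected label: {natural!r}")
    return NATURAL_TO_BESPOKE[natural]

# Usage with a stub model standing in for a real LLM call:
label = classify("Unsecured loan, no credit history.", lambda t: "risky")
```

The model never sees the bespoke labels, so its pre-trained semantics are never fought; the inversion lives entirely in deterministic code, where it is auditable.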

5. Model selection: Semantic rigidity varies by scale, not architecture

Across LLaMA, Mistral, Qwen, and Gemma (1–12B), the behavior was identical: architecture didn’t matter; size did.

In practical terms: steerability at this scale is mostly an illusion.

Conclusion — Steering Wheels vs. Anchors

This paper forces a sober reassessment of what “in-context learning” actually is. It is not a programmable interface for reassigning meanings. It is a refinement tool that sharpens — and occasionally exposes — the semantic commitments forged during pre‑training.

For practitioners building automation on top of LLMs, the message is simple: don’t fight the anchors. Design with them.

Cognaptus: Automate the Present, Incubate the Future.