Opening — Why This Matters Now
Every executive wants LLMs that are obedient, flexible, and capable of doing whatever the prompt says. Reality, unfortunately, is less compliant. A provocative new study (Kumar, 2025) shows that small-to-mid‑scale LLMs (1–12B parameters) simply refuse to overwrite certain pre‑trained semantic meanings — even when demonstrations explicitly tell them to.
This is not a minor quirk; it slices right into the heart of safety, assurance, and enterprise automation. If label semantics cannot be flipped, then some instructions are not instructions at all — they are negotiations with an underlying representation manifold that does not compromise.
Background — In-Context Learning’s Two Competing Myths
For years, two theories have been trading punches:
- Task-learning view — ICL is a miniature learning algorithm: flexible, Bayesian, gradient‑descent‑in-disguise. Under this view, demonstrations are king.
- Prior-refinement view — ICL is just a semantic steering wheel bolted onto the pre-trained model. Demonstrations don’t teach—they fine‑tune the vector field of the model’s priors.
The paper tests these theories at their point of greatest tension: Can a model be convinced that “positive” actually means “NEG”? If in-context learning is truly flexible, flipping labels should be trivial. If it’s not, then many assumptions about controllability collapse.
Analysis — What the Paper Actually Shows
The authors construct a simple but brutal experiment: natural demonstrations (correct labels) vs. inverted demonstrations (intentionally flipped labels) across eight tasks and eight open-source models.
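The two regimes can be sketched as a small prompt builder (the task text, label names, and example reviews below are illustrative stand-ins, not the paper's actual data):

```python
# Hypothetical sketch of the natural vs. inverted demonstration regimes.
# Labels and examples are illustrative, not taken from the paper.

def build_prompt(examples, query, invert=False):
    """Format k demonstrations followed by a query.

    With invert=True, every demonstration label is flipped,
    mimicking the "inverted" condition described above.
    """
    flip = {"positive": "negative", "negative": "positive"}
    lines = []
    for text, label in examples:
        shown = flip[label] if invert else label
        lines.append(f"Review: {text}\nSentiment: {shown}")
    lines.append(f"Review: {query}\nSentiment:")  # model completes this line
    return "\n\n".join(lines)

demos = [("A joyous, heartfelt film.", "positive"),
         ("Dull and lifeless.", "negative")]

natural = build_prompt(demos, "An instant classic.")
inverted = build_prompt(demos, "An instant classic.", invert=True)
print(inverted)
```

The only difference between conditions is the label shown in the demonstrations; the inputs and the query are identical, which is what isolates the semantic-override question.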
They track three alignment metrics:
- Truth Alignment — Is the prediction correct?
- Prior Alignment — Does the prediction match the model’s zero‑shot tendency?
- Prompt Alignment — Does the prediction follow the demonstration mapping?
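All three metrics reduce to simple per-example agreement rates. A minimal sketch, assuming predictions have already been collected (the toy data and function names are mine, not the paper's):

```python
def agreement(preds, refs):
    """Fraction of examples where two label sequences agree."""
    assert len(preds) == len(refs)
    return sum(p == r for p, r in zip(preds, refs)) / len(preds)

# Toy run: gold labels, the model's zero-shot predictions, its
# predictions under inverted demonstrations, and the flipped mapping.
truths    = ["pos", "neg", "pos", "neg"]
zero_shot = ["pos", "neg", "neg", "neg"]
preds     = ["pos", "neg", "neg", "pos"]
flip      = {"pos": "neg", "neg": "pos"}

truth_alignment  = agreement(preds, truths)                     # correctness
prior_alignment  = agreement(preds, zero_shot)                  # matches zero-shot tendency
prompt_alignment = agreement(preds, [flip[t] for t in truths])  # follows flipped mapping
print(truth_alignment, prior_alignment, prompt_alignment)
```

The tension the paper measures is visible even in this toy: a prediction cannot raise prompt alignment under inverted labels without paying for it in truth or prior alignment.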
And the decisive new metric:
Semantic Override Rate: the probability that the model's prediction is both correct and consistent with the inverted label mapping.
If a model truly internalizes flipped semantics, this number should be >0.
Across 320 experimental conditions, the result is… exquisitely boring:
Semantic override rate = 0.0% in every case. (Page 12 table set and Appendix A)
Not “near zero.” Not “rare.” Zero.
Small LLMs simply do not — and perhaps cannot — redefine what labels mean.
Natural demonstrations refine priors; inverted demonstrations break them
Look at the accuracy comparison (Table 3, page 6):
- Natural ICL improves accuracy almost everywhere (e.g., SST‑2 from 90.4% → 92.5%).
- Inverted ICL destroys accuracy (SST‑2 from 90.4% → 47.4%).
The visual on page 5 shows the story clearly:
- Natural prompts: accuracy and prior agreement rise.
- Inverted prompts: prompt alignment rises while accuracy collapses.
The model tries to follow the flipped labels, but doing so forces it away from both prior and truth — resulting in incoherent outputs.
A geometric metaphor
The study proposes a useful intuition:
Labels sit in rigid, topologically stable semantic regions of the model’s embedding manifold.
ICL can nudge predictions along these regions (refining the prior), but cannot push the representation into new semantic territory. Label semantics are not “learned” in the prompt; they are anchored by millions of pre‑training examples.
Hence the title: semantic anchors.
Findings — A Business-Facing Summary
Below is a table translating the paper’s results into operational implications.
| Observation | Evidence (from paper) | Operational Meaning |
|---|---|---|
| Small LLMs do not flip semantics | Override rate = 0% across all tasks (Appendix A) | Don’t expect prompt engineering to force models into unnatural label behavior. |
| Natural ICL boosts accuracy | Table 3: +10–37 points on weak‑prior tasks | Use demonstrations when they align with natural semantics. |
| Inverted ICL harms accuracy | Figure 2: monotonic degradation as k increases | Beware prompt hacks that fight pre‑training — they backfire. |
| Priors dominate demonstrations | Page 5: high prior alignment persists even with many examples | Enterprises must treat zero‑shot tendencies as part of model governance. |
| Scale matters | Only GPT‑3‑scale models showed semantic flipping in prior work | Small open models are not “fully steerable.” |
Implications — Why Cognaptus Clients Should Care
1. Governance: Don’t rely on prompts to enforce compliance semantics
If an internal classification system requires unconventional categories (“Risky = GREEN, Safe = RED”), small LLMs may silently resist the mapping, producing inconsistent or misleading outputs.
2. Automation: Use ICL to strengthen priors, not fight them
The paper shows ICL works best when demonstrations flow with the model’s natural semantic grain. For enterprise automation, this means designing taxonomies and label sets that match common-language meaning.
3. Safety: Semantic rigidity limits jailbreaking but also limits control
It is harder to coerce the model into anti-semantic behavior — good for safety. But it also means the model remains stubbornly bound to pre-training when you want flexibility.
4. Fine-tuning becomes mandatory for nonstandard workflows
If your internal workflow defines bespoke categories or inverted semantics, you must rely on:
- symbol tuning,
- contrastive decoding,
- supervised fine‑tuning,
- or post‑processing logic.
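As an example of the last option, a thin post-processing layer lets the model classify in its natural vocabulary while ordinary code enforces the bespoke scheme (the category names below are illustrative, echoing the GREEN/RED example above):

```python
# Let the model classify in its natural semantic space ("risky"/"safe"),
# then remap to the bespoke internal taxonomy in ordinary code.
# The taxonomy here is illustrative, not from the paper.
INTERNAL_TAXONOMY = {"risky": "GREEN", "safe": "RED"}  # inverted house scheme

def classify_with_remap(model_label: str) -> str:
    """Map a natural-language model label onto the internal scheme."""
    try:
        return INTERNAL_TAXONOMY[model_label.strip().lower()]
    except KeyError:
        raise ValueError(f"unmapped model label: {model_label!r}")

print(classify_with_remap("Risky"))  # → GREEN
```

The remap is deterministic and auditable, which is exactly what prompt-level label inversion fails to deliver on small models.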
5. Model selection: Semantic rigidity varies by scale, not architecture
Across LLaMA, Mistral, Qwen, and Gemma (1–12B), the behavior was identical; architecture made no difference. Per prior work, only far larger models have shown any semantic flipping, so scale, not architecture, is the variable to watch.
In practical terms: steerability at this scale is mostly an illusion.
Conclusion — Steering Wheels vs. Anchors
This paper forces a sober reassessment of what “in-context learning” actually is. It is not a programmable interface for reassigning meanings. It is a refinement tool that sharpens — and occasionally exposes — the semantic commitments forged during pre‑training.
For practitioners building automation on top of LLMs, the message is simple: don’t fight the anchors. Design with them.
Cognaptus: Automate the Present, Incubate the Future.