TL;DR for operators
LLM annotation is not governed by the prompt as cleanly as procurement decks would prefer. The paper behind this article shows that models bring their own internal concept boundary to definition-driven classification tasks, and that boundary can dominate the user’s intended definition even when the prompt looks explicit.[1]
The practical result is simple: before using an LLM as an annotator, judge, moderator, reviewer, triage engine, or rubric scorer, test whether its internal understanding of the label matches your operational definition. The paper introduces Definition-Specific Familiarity (DSF) as a lightweight proxy for that fit. DSF is positively associated with model accuracy after controlling for dataset difficulty, while three text memorization metrics are not.
Prompting helps, but less than many teams assume. Across the paper’s toxicity experiments, only 34.8% of zero-shot errors are rescued by prompting. High-confidence errors are especially resistant: rescue probability drops to 20.8% for errors above 0.9 confidence, compared with a 51.8% peak at moderate confidence. In other words, when the model is confidently wrong, adding the definition often just gives the error a blazer.
The most operationally awkward finding is about misaligned definitions. Models do respond to changed definitions, shifting prediction thresholds in systematic ways. But they remain highly confident while doing it. Confidence therefore tells you something like “I can apply this instruction,” not “this instruction matches your policy.” That distinction is small only if governance is being performed as theater.
For business use, the workflow implication is: validate model-definition alignment before scaling annotation; compare multiple plausible definitions; measure rescue and corruption together; and do not use confidence thresholds as a substitute for definition audits. The boundary is important: the strongest evidence here is for binary, definition-driven annotation tasks, especially toxicity-related classification, with supporting replications on irony and subjectivity. Multi-class labeling, span annotation, and open-ended judging may rhyme with the result, but they are not proven by it.
The model already has a definition
A familiar enterprise scene: someone wants to classify support tickets, product reviews, compliance notes, abusive messages, job applicants, clinical snippets, vendor risk descriptions, or AI-generated answers. A prompt is written. It includes a definition. Maybe it includes a rubric. Maybe the rubric has bullets, sub-bullets, and the moral confidence of a policy memo.
Then the model labels at scale.
The quiet assumption is that the prompt becomes the task. The user supplies the definition; the model applies it. That assumption is convenient. It is also the part this paper politely attacks with data.
The mechanism is not mysterious. A model has seen vast numbers of examples, discussions, platform policies, moderation debates, benchmark labels, and instruction-tuning preferences before it ever sees your prompt. By the time you ask it whether a message is “toxic,” “offensive,” “hateful,” “subjective,” “ironic,” or “relevant,” the model is not approaching the concept like a newly hired annotator reading the handbook on day one. It already has a working concept. Your definition is not written onto a blank page. It is negotiated against an existing boundary.
That is the useful idea in the paper: annotation failure is often not a surface prompt failure. It is a concept-boundary mismatch.
This matters because many annotation and evaluation workflows treat LLMs as flexible labeling instruments. If the first run is wrong, teams reach for more prompt detail, few-shot examples, prompt optimization, model upgrades, or confidence thresholds. Those tools are not useless. The paper does not say prompts do nothing. It says the control channel is limited by what the model has already internalized.
That is a more serious claim than “prompt engineering is hard.” It says some mistakes are not waiting to be fixed by the next beautifully formatted instruction. Some are anchored.
DSF measures whether the model and the definition are talking about the same thing
The paper’s first contribution is Definition-Specific Familiarity, or DSF. The name is slightly academic, but the operational idea is clean.
DSF asks: when the model explains a concept in its own words, how close is that explanation to the dataset’s formal definition?
The procedure is:
- Ask the model to describe its internal understanding of a labeling concept.
- Embed that model-generated description.
- Embed the dataset’s full task definition.
- Compute semantic similarity between the two.
- Average across six sentence encoders to reduce dependence on one embedding model.
This is not measuring whether the model has memorized the input texts. It is measuring whether the model’s concept of the label resembles the target definition. That distinction is the whole point.
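The steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the two encoders are toy stand-ins (bag of words and character trigrams) invented here so the example runs without external libraries, and `dsf_score` is a hypothetical name. In practice the encoder list would hold real sentence encoders; the paper averages across six.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]
Encoder = Callable[[str], Vector]


def word_encoder(text: str) -> Vector:
    """Toy stand-in encoder (assumption): bag of lowercase words."""
    return dict(Counter(re.findall(r"[a-z']+", text.lower())))


def char_trigram_encoder(text: str) -> Vector:
    """Toy stand-in encoder (assumption): bag of character trigrams."""
    s = re.sub(r"\s+", " ", text.lower().strip())
    return dict(Counter(s[i:i + 3] for i in range(len(s) - 2)))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dsf_score(model_description: str, task_definition: str,
              encoders: List[Encoder]) -> float:
    """Average similarity between the model's own description of the
    concept and the dataset's definition, across several encoders."""
    sims = [cosine(enc(model_description), enc(task_definition))
            for enc in encoders]
    return sum(sims) / len(sims)
```

Averaging over encoders is the design choice that matters: a single encoder's idea of similarity is one opinion, and the score should not hinge on it.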
The paper compares DSF with three text-familiarity or memorization-style metrics: ROUGE-L, BERTScore, and embedding cosine similarity over generated continuations. If annotation performance were mainly driven by contamination or text reproduction, those metrics should help explain which model-dataset pairs perform well.
They do not.
| Metric | Raw correlation with zero-shot accuracy | Partial correlation after controlling for dataset |
|---|---|---|
| Text memorization: ROUGE-L | -0.80 | -0.19 |
| Text memorization: BERTScore | -0.76 | -0.15 |
| Text memorization: embedding similarity | -0.71 | -0.16 |
| Definition familiarity: consensus DSF | +0.74 | +0.41 |
The raw correlations are misleading because dataset difficulty is entangled with memorization. Jigsaw, for example, is both relatively high on text-familiarity signals and hard. After controlling for dataset, the memorization signals still fail to become positive. DSF remains positive, with partial $r = +0.41$, $p = 0.003$, across 54 model-dataset pairs.
That result is the article’s central gear. It reframes annotation reliability away from “has the model seen this data?” and toward “does the model’s concept match the definition?”
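To see how a raw correlation can point the wrong way when dataset difficulty is entangled with the metric, here is a small sketch. It assumes "controlling for dataset" means removing each dataset's mean from both variables before correlating; the article does not spell out the paper's exact procedure, so treat this as one plausible reading. The function names and toy numbers are invented for illustration.

```python
from collections import defaultdict
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def demean_within(values: Sequence[float], groups: Sequence[str]) -> List[float]:
    """Subtract each group's mean, leaving only within-group variation."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def partial_corr_by_dataset(x: Sequence[float], y: Sequence[float],
                            datasets: Sequence[str]) -> float:
    """Correlation between x and y after removing dataset-level means."""
    return pearson(demean_within(x, datasets), demean_within(y, datasets))


# Toy data (invented): dataset "B" is harder AND scores higher on the metric,
# so the raw correlation is negative even though, within each dataset,
# a higher metric goes with higher accuracy.
datasets = ["A", "A", "A", "B", "B", "B"]
metric = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
accuracy = [0.80, 0.85, 0.90, 0.60, 0.65, 0.70]
raw_r = pearson(metric, accuracy)
partial_r = partial_corr_by_dataset(metric, accuracy, datasets)
```

In the toy data the sign flips once the dataset means are removed, which is the kind of entanglement the raw columns in the table above can hide.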
The appendix matters here. The authors test DSF across six embedding models. Each one yields a positive partial correlation, ranging from +0.30 to +0.49, and pairwise agreement between DSF vectors is high. This is a robustness/sensitivity test, not a second thesis. Its purpose is to show that DSF is not merely an artifact of one embedding model’s taste in semantic similarity. Apparently, even embeddings can agree when the question is sufficiently embarrassing.
The paper also tests semantic memorization alternatives. ROUGE-L may miss paraphrased memorization, so the authors add BERTScore and whole-sentence embedding similarity. Those still do not predict accuracy after controlling for dataset. That is another robustness check: the negative result for memorization is not just because ROUGE-L is too primitive.
DSF also replicates directionally beyond toxicity. On irony and subjectivity detection, it remains positively associated with zero-shot accuracy, with partial $r = +0.343$ across 18 model-dataset pairs. A preliminary stance-detection check also points positive, though the paper treats it more cautiously because stance detection has a different structure: the model must choose between explicit alternatives rather than apply a single concept boundary.
So the evidence stack is not “DSF is magic.” It is more disciplined than that:
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| DSF vs. memorization metrics | Main evidence | Concept-definition alignment explains performance better than text reproduction in this setup | DSF causes higher accuracy |
| Six-encoder DSF check | Robustness/sensitivity test | DSF result is not tied to one embedding model | Any semantic metric would work equally well |
| BERTScore and embedding memorization variants | Robustness test | Memorization failure is not just ROUGE-L being crude | Training data exposure never matters |
| Irony and subjectivity replications | Exploratory extension / external validity check | Pattern is not purely toxicity-specific | All classification and judging tasks behave the same |
| Stance detection check | Exploratory supporting evidence | Direction is consistent in a structurally different task | Stance is a clean replication of the main binary setup |
For operators, DSF is best understood as an early screening tool. It is not a certification stamp. It can help answer: “Which model-definition pair is less likely to fight us before we spend money labeling 500,000 records?”
That is already useful. Most annotation failures are discovered after the bill arrives.
Prompting mostly reinforces what the model was already going to get right
The paper’s second contribution is less comforting: extra prompting has limited ability to rescue errors.
The authors evaluate nine instruction-tuned models across five primary toxicity datasets, with prompting conditions including zero-shot, aligned definition, few-shot, definition plus few-shot, six misaligned definition swaps, and two DSPy automated prompt optimization conditions. Each model-dataset pair uses 1,000 sampled instances at temperature zero.
The obvious hope is that aligned definitions, examples, or prompt optimization will reliably fix the initial mistakes. The results are more awkward.
Average accuracy moves only modestly:
| Condition | Average accuracy |
|---|---|
| Zero-shot | 80.3% |
| Aligned definition | 82.0% |
| Few-shot | 81.5% |
| Few-shot + definition | 81.6% |
| DSPy optimized | 80.2% |
| DSPy aligned | 81.5% |
| Misaligned average | 80.3% |
Aligned definitions add roughly two percentage points over zero-shot on average (80.3% to 82.0% in the table above). DSPy optimization does not rescue the story: DSPy optimized averages 80.2%, slightly below zero-shot, while DSPy aligned averages 81.5%, close to the aligned/few-shot baselines. DSPy optimized even degrades two Llama models in the reported table: Llama-3.1-8B falls from 76.7% to 72.3%, and Llama-3.3-70B from 80.5% to 77.5%.
This is not a dunk on DSPy. Automated prompt optimization can be valuable in the right workflows. Here, its purpose is to test whether the disappointing ceiling is merely bad hand-written prompting. The answer appears to be no. The ceiling is not only in the prompt; it is partly in the model-definition fit.
The clearest evidence is the rescue analysis.
A “rescue” occurs when the model is wrong in zero-shot and becomes correct after prompting. Across models and prompted conditions, the overall rescue rate is 34.8%. Nearly two-thirds of zero-shot errors remain wrong.
| Model | Rescue rate |
|---|---|
| Mistral-Small-24B | 44.2% |
| DeepSeek-V3 | 38.1% |
| Llama-3.1-8B | 37.5% |
| GPT-4o-mini | 35.7% |
| Mixtral-8x7B | 34.6% |
| Mistral-7B | 31.6% |
| Llama-3.1-70B | 31.3% |
| Llama-3.3-70B | 30.4% |
| Qwen-2.5-72B | 27.8% |
The ranking itself is instructive. Steerability is not monotonic in model size. Mistral-Small-24B rescues more zero-shot errors than Llama-3.1-70B. Buying a larger model may buy capability, latency cost, and a nicer invoice. It does not necessarily buy corrigibility.
The mixed-effects models sharpen the mechanism. Zero-shot correctness strongly predicts prompted correctness: OR = 6.43. Translation: if the model got the answer right before the definition or examples, it is much more likely to remain right after prompting. Prompting consolidates correct answers more reliably than it rescues wrong ones.
That is a nasty little asymmetry. The model is easier to stabilize than to redirect.
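For intuition about what an odds ratio of this kind measures, the sketch below computes an unadjusted odds ratio from a 2x2 table of zero-shot versus prompted correctness. This is a simplification: the paper's OR comes from mixed-effects models, which this does not reproduce. The function name and the toy counts are assumptions made for illustration.

```python
from typing import Sequence


def correctness_odds_ratio(zero_shot_correct: Sequence[bool],
                           prompted_correct: Sequence[bool]) -> float:
    """Unadjusted odds ratio: odds of being right after prompting when the
    zero-shot answer was right, divided by the same odds when it was wrong."""
    a = b = c = d = 0
    for zs, pr in zip(zero_shot_correct, prompted_correct):
        if zs and pr:
            a += 1          # stayed correct
        elif zs and not pr:
            b += 1          # corrupted
        elif not zs and pr:
            c += 1          # rescued
        else:
            d += 1          # stayed wrong
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined when a cell is empty")
    return (a * d) / (b * c)


# Toy counts (invented): 10 zero-shot-correct items, 10 zero-shot errors.
zs = [True] * 10 + [False] * 10
pr = [True] * 8 + [False] * 2 + [True] * 4 + [False] * 6
toy_or = correctness_odds_ratio(zs, pr)  # (8 * 6) / (2 * 4)
```

An odds ratio well above 1 says the prompted outcome mostly inherits the zero-shot outcome, which is the consolidation pattern described above.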
High-confidence errors are not begging for better instructions
The paper’s “decision stickiness” result is where the prompt-fix instinct really gets into trouble.
One might expect low-confidence errors to be easy to fix and high-confidence errors to be harder. Broadly, yes, but the shape matters. The rescue probability peaks at 51.8% for moderate confidence, between 0.6 and 0.7. It falls to 20.8% for high-confidence errors above 0.9. In the regression, each standard-deviation increase in zero-shot confidence reduces rescue odds by 16% (OR = 0.84).
This matters because many production teams use confidence as an escalation trigger: let high-confidence labels pass; send low-confidence labels for review. That may work for some uncertainty-routing problems. It is not enough for definition alignment.
High-confidence wrong answers are dangerous precisely because they look operationally clean. They are not noisy exceptions. They are the model applying the wrong boundary with conviction.
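A simple way to look for the same shape in your own data is to bin zero-shot errors by confidence and compute the rescue rate per bin. The sketch below assumes each record carries a zero-shot confidence plus two correctness flags; the record layout, bin edges, and function name are all illustrative choices, not the paper's.

```python
from bisect import bisect_right
from typing import Dict, Iterable, Optional, Sequence, Tuple

# (zero_shot_confidence, zero_shot_correct, prompted_correct)
Record = Tuple[float, bool, bool]


def rescue_by_confidence(
    records: Iterable[Record],
    edges: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
) -> Dict[str, Optional[float]]:
    """Rescue rate among zero-shot errors, per confidence bin.
    Bins are left-inclusive; the last bin also includes its upper edge.
    A bin with no errors maps to None."""
    n_bins = len(edges) - 1
    totals = [0] * n_bins
    rescued = [0] * n_bins
    for conf, zs_ok, pr_ok in records:
        if zs_ok or not (edges[0] <= conf <= edges[-1]):
            continue  # only zero-shot errors inside the binned range count
        i = min(bisect_right(edges, conf) - 1, n_bins - 1)
        totals[i] += 1
        rescued[i] += int(pr_ok)
    return {
        f"{edges[i]:.1f}-{edges[i + 1]:.1f}":
            (rescued[i] / totals[i] if totals[i] else None)
        for i in range(n_bins)
    }
```

If the top bin shows a much lower rescue rate than the middle bins, a confidence threshold is routing your most stubborn errors straight past review.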
The paper also tests whether multi-turn correction can break the pattern. This is an appendix experiment and should be interpreted as an exploratory extension, not the main evidence. The authors sample zero-shot errors and run a three-turn rescue sequence: add few-shot examples, add the aligned definition, then ask the model to reconsider while showing previous answers and confidence scores.
The sequence helps but does not solve the problem. Pooled rescue rises from 7.5% at Turn 1 to 18.7% at Turn 3. High-confidence rescue reaches only 8.5% after three turns.
That tells us something practical: “ask it to reconsider” is not a governance strategy. It is a sentence.
For business workflows, this shifts the evaluation question. Do not ask only, “Can prompting improve average accuracy?” Ask:
- What fraction of zero-shot errors are actually reachable through prompting?
- Which error types remain unreachable?
- Are high-confidence errors being sampled and audited separately?
- Does the model rescue errors without corrupting correct labels?
- Does the model become more right, or merely more internally consistent?
The last question is the one people usually skip. Conveniently, it is also the one that ruins the dashboard.
Wrong definitions still produce confident systems
The third contribution examines misalignment directly. The authors swap in definitions from related but different datasets. These definitions vary in scope: narrow hate-speech definitions require identity-based targeting; medium definitions include insults, threats, and profanity; broad toxicity definitions include rude, disrespectful, discouraging, or contextually harmful content.
The models respond to these definitions. They are not ignoring the prompt. Narrow definitions produce under-prediction, with prediction bias from roughly -7% to -12%. Broad definitions produce over-prediction, with bias from roughly +9% to +13%.
That is both reassuring and alarming.
Reassuring: the prompt has influence.
Alarming: the model will apply the wrong definition with a straight face.
| Definition condition | Accuracy | Change rate | Prediction bias | Mean confidence |
|---|---|---|---|---|
| Aligned definition | 82.0% | 11.3% | reference | 88.1% |
| Fox News hate speech | 82.6% | 17.5% | -7.2% | 90.6% |
| Twitter hate speech | 78.5% | 23.1% | -12.1% | 88.1% |
| GameTox toxicity | 79.6% | 15.4% | +0.5% | 86.5% |
| OLID offensive | 81.9% | 10.9% | +5.1% | 89.1% |
| General toxicity | 81.1% | 9.2% | +8.8% | 85.0% |
| Gaming toxicity | 76.4% | 13.3% | +13.1% | 88.5% |
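One way to compute a prediction-bias column like the one above is as the shift in positive-prediction rate relative to a reference condition. That reading is an assumption based on the "reference" entry in the aligned row; the article does not give the paper's exact formula. Function names and toy predictions are invented.

```python
from typing import Sequence


def positive_rate(preds: Sequence[int]) -> float:
    """Share of items labeled positive (1)."""
    return sum(preds) / len(preds)


def prediction_bias(preds: Sequence[int], reference_preds: Sequence[int]) -> float:
    """Change in positive-prediction rate versus the reference condition.
    Negative means under-prediction, positive means over-prediction."""
    return positive_rate(preds) - positive_rate(reference_preds)


# Toy predictions (invented) on the same ten items under three definitions.
aligned = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]
narrow = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
broad = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
narrow_bias = prediction_bias(narrow, aligned)
broad_bias = prediction_bias(broad, aligned)
```

Tracking the sign as well as the size matters: a narrow definition and a broad one can land on similar accuracy while failing in opposite directions.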
The striking part is not only that accuracy moves. It is that confidence barely helps. Zero-shot confidence averages 87.0%. Aligned-definition confidence averages 88.1%. Misaligned conditions remain in the 85.0% to 90.6% range. Calibration curves show no meaningful separation between aligned and misaligned definitions.
So confidence here is conditional. It means: “given this instruction, I am confident in my label.” It does not mean: “this instruction matches the intended policy.” The model is not reporting meta-uncertainty about whether your definition is the right one.
That distinction should be printed and taped near every LLM evaluation dashboard.
The paper also shows that definition choice can matter more than model choice. Across models and datasets, definition choice produces about 17 percentage points of accuracy variation, while model choice produces about 5 on average. For Fox News, accuracy ranges from 61.6% under the “Gaming Toxicity” definition to 78.4% under the “Twitter Hate Speech” definition.
This is the business punchline: tuning the definition is not administrative cleanup. It is an experimental factor with first-order impact.
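A quick way to run the same comparison on your own gold set is to build a model-by-definition accuracy grid and compare the spread along each axis. The sketch below uses max-minus-min range as the spread measure, which is an assumption; the paper may summarize variation differently. All accuracies in the toy grid are invented.

```python
from collections import defaultdict
from typing import Dict, Tuple

Grid = Dict[Tuple[str, str], float]  # (model, definition) -> accuracy


def mean_spread(grid: Grid, fixed_axis: int) -> float:
    """Average max-min accuracy range within groups that share
    key[fixed_axis]. fixed_axis=0 holds the model fixed (spread due to
    definitions); fixed_axis=1 holds the definition fixed (spread due
    to models)."""
    groups = defaultdict(list)
    for key, acc in grid.items():
        groups[key[fixed_axis]].append(acc)
    ranges = [max(v) - min(v) for v in groups.values()]
    return sum(ranges) / len(ranges)


# Hypothetical accuracies, for illustration only.
toy_grid: Grid = {
    ("model_a", "narrow"): 0.70,
    ("model_a", "aligned"): 0.82,
    ("model_a", "broad"): 0.66,
    ("model_b", "narrow"): 0.74,
    ("model_b", "aligned"): 0.84,
    ("model_b", "broad"): 0.70,
}
definition_spread = mean_spread(toy_grid, fixed_axis=0)
model_spread = mean_spread(toy_grid, fixed_axis=1)
```

If `definition_spread` dwarfs `model_spread` on your data, the next hour is better spent on the policy wording than on the vendor comparison.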
Steerability has a corruption price
One seductive interpretation is: “Fine, pick the model with the highest rescue rate.” Sensible. Also incomplete.
Models that are easily steered can be steered in useful and harmful directions. The paper measures corruption rate: how often prompting flips a correct zero-shot prediction into an incorrect one. Under misalignment, high rescue can come with high corruption. Mistral-Small-24B achieves the highest noted rescue rate, 73.7% with the Twitter hate-speech definition, but also the highest corruption rate in the table, 21.0%.
That is the trade-off most summary metrics hide. A model that changes its mind easily may look wonderfully adaptable during rescue analysis. It may also be wonderfully vulnerable to a slightly wrong policy definition.
For operators, rescue and corruption should be reported together.
| Operational question | Metric to track | Why it matters |
|---|---|---|
| Can prompting fix initial mistakes? | Rescue rate | Measures reachable correction |
| Does prompting break correct labels? | Corruption rate | Measures cost of intervention |
| Does the model over-call or under-call the label? | Prediction bias | Reveals threshold movement |
| Does definition wording materially change results? | Definition sensitivity | Captures policy fragility |
| Can confidence detect mismatch? | Calibration under aligned vs. misaligned definitions | Tests whether confidence is governance-relevant |
The last row is where this paper is particularly blunt: confidence does not reliably detect definition mismatch. If your workflow says “high-confidence labels are safe,” it may be safe only in the narrow sense that the model is confidently applying some concept. Whether it is your concept is another matter. Details, details.
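Reporting rescue and corruption side by side takes very little code. The sketch below follows the definitions used in this article (rescue: wrong in zero-shot, correct after prompting; corruption: correct in zero-shot, wrong after prompting). The `FlipReport` structure and the toy labels are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FlipReport:
    rescue_rate: float          # share of zero-shot errors fixed
    corruption_rate: float      # share of zero-shot correct labels broken
    net_accuracy_change: float  # (rescued - corrupted) / total items


def flip_report(gold: Sequence[int], zero_shot: Sequence[int],
                prompted: Sequence[int]) -> FlipReport:
    """Compare zero-shot and prompted predictions against gold labels."""
    errors = correct = rescued = corrupted = 0
    for g, z, p in zip(gold, zero_shot, prompted):
        if z == g:
            correct += 1
            corrupted += int(p != g)
        else:
            errors += 1
            rescued += int(p == g)
    nan = float("nan")
    return FlipReport(
        rescue_rate=rescued / errors if errors else nan,
        corruption_rate=corrupted / correct if correct else nan,
        net_accuracy_change=(rescued - corrupted) / len(gold),
    )


# Toy labels (invented): prompting fixes two errors and breaks two
# correct labels, so accuracy is flat while four decisions changed.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
zero_shot = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1]
prompted = [1, 0, 1, 0, 0, 0, 0, 1, 1, 0]
report = flip_report(gold, zero_shot, prompted)
```

The toy case is deliberately awkward: net accuracy does not move, yet different items now carry different labels, which an accuracy-only dashboard would report as nothing happening.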
What this directly shows, and what Cognaptus infers
The paper directly shows three things in its tested setting.
First, DSF is positively associated with annotation performance after controlling for dataset difficulty, while text memorization metrics are not. This supports a concept-alignment view of definition-driven annotation.
Second, prompt-based correction is limited. Across the toxicity experiments, most zero-shot errors persist, high-confidence errors are especially sticky, and automated prompt optimization does not erase the ceiling.
Third, misaligned definitions systematically shift behavior without producing useful confidence warnings. The model can follow the wrong instruction confidently.
Cognaptus infers several business implications from those results.
The first is procurement-related: model selection should not be separated from definition selection. “Which model is best?” is incomplete. Better question: “Which model-definition pair is best under our policy boundary, tolerance for false positives, tolerance for false negatives, and review cost?”
The second is workflow-related: LLM annotation needs a preflight phase. Before full-scale labeling, collect a small gold set, write multiple plausible definitions, calculate model-definition fit where possible, run sensitivity tests, and inspect rescue/corruption trade-offs. This is boring. Boring is often what reliability looks like before it becomes a lawsuit.
The third is governance-related: confidence should not be treated as a policy-validity signal. It may still be useful for routing review under a validated definition. But if the definition itself is wrong, underspecified, or mismatched to the model’s internal concept, confidence can remain high while the workflow drifts.
The fourth is measurement-related: average accuracy is too blunt. A model can preserve average accuracy while changing which cases it labels positive, who receives false positives, and which business actions get triggered. Definition sensitivity and prediction bias should be part of the evaluation report.
A practical validation workflow for LLM annotation
A business workflow inspired by the paper would look like this:
| Stage | What to do | Output |
|---|---|---|
| Define the policy boundary | Write the target definition, including positive and negative criteria | A testable labeling policy |
| Elicit model concept | Ask candidate models to describe the concept in their own words | Model concept descriptions |
| Estimate definition fit | Use DSF-style semantic comparison, ideally across multiple encoders or prompts | Model-definition alignment signal |
| Run a small gold-set test | Evaluate zero-shot, aligned-definition, few-shot, and definition-plus-few-shot conditions | Baseline accuracy and error profile |
| Measure rescue and corruption | Track errors fixed and correct labels broken after prompting | Intervention trade-off |
| Stress-test definitions | Compare narrow, medium, broad, and domain-specific variants | Sensitivity to wording |
| Audit confidence | Compare calibration under aligned and deliberately perturbed definitions | Evidence on whether confidence is useful |
| Decide deployment mode | Choose automation, human review, hybrid routing, or no deployment | Operational decision |
This is not a heavyweight research program. It is the minimum due diligence for turning an LLM into an annotator whose labels may affect customers, employees, creators, applicants, suppliers, fraud cases, moderation actions, risk scores, or product decisions.
The uncomfortable part is that the workflow may reveal that a cheaper model with better concept fit is safer than a larger model with stronger priors. It may also reveal that the task definition is the unstable component, not the model. Both outcomes are inconvenient. They are also useful.
Where the result should not be overextended
The boundaries are real.
The main experiments focus on binary toxicity and hate-speech-related classification across five primary datasets. The paper adds useful non-safety replications on irony and subjectivity, plus a preliminary stance check, but the strongest claim remains about definition-driven binary annotation.
Multi-class classification may behave differently. Span labeling may behave differently. Open-ended LLM-as-judge evaluation may behave differently because the output is not a single binary label and the rubric can interact with generation quality, verbosity, position bias, and preference style. The paper is relevant to those settings, but relevance is not proof. A rare distinction, apparently.
DSF is also correlational. The paper shows association, not causality. To prove causality, one would need interventions that manipulate model-definition alignment while holding other factors constant, such as targeted fine-tuning or controlled model variants.
The paper cannot isolate where internalized priors originate. Pretraining, instruction tuning, RLHF, DPO, safety tuning, and dataset exposure may all contribute. “Model-internalized priors” is deliberately stage-agnostic. That is acceptable for diagnosis, but insufficient for root-cause attribution.
Finally, some decision stickiness may be intentional. A safety-tuned model may resist being steered for good reasons. In some domains, low steerability is a feature. The business question is not whether stickiness is morally good or bad. The business question is whether the model’s sticky boundary matches the organization’s legitimate operational definition.
Often, nobody has checked. Which is an interesting governance strategy, in the same way that removing the smoke alarm is an interior design choice.
The real lesson is not “prompts do not matter”
Prompts matter. Definitions matter. Examples matter. Automated optimization can matter. Larger models can matter.
The paper’s contribution is to show that these things operate inside a constraint: the model already has an internalized concept of the task. If that concept aligns with your definition, prompting may stabilize and slightly improve performance. If it does not, prompting may struggle to rescue errors, and confidence may politely lie by omission.
The replacement belief should be:
LLM annotation is not prompt execution. It is definition negotiation against a pretrained concept boundary.
That one sentence changes the implementation plan.
Do not start with the biggest model and the prettiest rubric. Start by testing whether the model’s concept and your policy are even in the same neighborhood. Then measure how much prompting can actually move the boundary, what it corrupts while moving it, and whether confidence notices when the definition is wrong.
The prompt is not the boss. It is a proposal submitted to a model that already has opinions.
Cognaptus: Automate the Present, Incubate the Future.
[1] Etienne Casanova, Rafal Kocielnik, and R. Michael Alvarez, “On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance,” arXiv:2606.00467, 2026.