Users do not ask questions in benchmark format.
They ask in fragments, emails, forms, meeting notes, support tickets, spreadsheet comments, and occasionally in the sort of sentence that makes a compliance officer stare silently at the ceiling. A business AI agent does not receive one clean canonical prompt. It receives the same task wearing many costumes.
That is the problem behind “Semantic Invariance in Agentic AI,” a paper by I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, and Carlos T. Calafate [1]. The paper asks a simple but uncomfortable question: if two prompts mean the same thing, does the model reason the same way?
The answer is: not reliably.
That sounds almost too obvious. Everyone who has used LLMs has seen wording sensitivity. But the paper’s value is not that it says prompts matter. We already knew that. The useful part is that it turns a vague operational anxiety into a testable reliability property: semantic invariance.
A model is semantically invariant when the same problem, rewritten without changing its meaning, produces equivalent reasoning and conclusions:
$$ M(p) \equiv M(\tau(p)) $$
Here, $p$ is the original problem, $\tau(p)$ is a transformed version of that problem, and $M$ is the LLM-based reasoning agent. In plain English: if the facts are the same, the answer should not wander off because the wording changed. A modest request, really. Apparently still too ambitious.
The paper’s central lesson for businesses is sharper than “test prompts more.” It is this: benchmark performance and deployment reliability are not the same thing. A model can score well on the original version of a problem and still become unstable when the same problem is reordered, expanded, shortened, or surrounded by distracting context.
For agentic AI, that distinction matters. Chatbots can be forgiven for sounding odd. Agents make decisions, call tools, update records, trigger workflows, and sometimes confidently explain why the wrong thing was the right thing to do.
The comparison that matters is not big versus small, but score versus stability
The paper evaluates seven models across four families: Hermes-4-70B, Hermes-4-405B, Qwen3-30B-A3B, Qwen3-235B-A22B, DeepSeek-R1-0528, gpt-oss-20b, and gpt-oss-120b. The authors test them on 19 multi-step reasoning problems across physics, mathematics, chemistry, economics, statistics, biology, calculus, and optimization.
Each problem is transformed in several ways. Some transformations are genuinely semantic-preserving: paraphrasing, reordering facts, expanding the wording, contracting the wording, or placing the problem in academic or business context. The contrastive transformation is different. It adds misleading comparative framing and is better read as a stress test or negative control, not a pure invariance test. That distinction is important; otherwise we would pretend the paper is measuring one clean thing when it is actually measuring both invariance and distractor resistance.
The researchers then compare models using several metrics:
| Metric | What it asks | Why it matters in deployment |
|---|---|---|
| Score | How close is the model’s solution to the reference solution? | Measures task performance on the problem. |
| Score delta | How much does score change after a transformation? | Measures sensitivity to wording and framing. |
| Mean Absolute Delta (MAD) | How large are those changes on average? Lower is better. | Measures overall instability. |
| Stability rate | How often is $\lvert \Delta \rvert < 0.05$? Higher is better. | Measures how often a transformed prompt leaves performance effectively unchanged. |
| Semantic similarity | How similar are the reasoning traces? Higher is better. | Measures whether the reasoning path itself remains coherent. |
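To make the delta-based metrics concrete, here is a minimal sketch in Python. It assumes the score delta is the transformed score minus the original score (consistent with negative deltas meaning degradation) and uses the 0.05 stability threshold from the table. The function names and the example scores are hypothetical and are not taken from the paper.

```python
from statistics import mean

# |delta| below this counts as "stable" (threshold from the metrics table)
STABILITY_THRESHOLD = 0.05


def score_deltas(baseline: float, transformed: dict[str, float]) -> dict[str, float]:
    """Score delta per transformation: transformed score minus baseline score."""
    return {name: score - baseline for name, score in transformed.items()}


def mean_absolute_delta(deltas: dict[str, float]) -> float:
    """Average size of the score change, ignoring direction. Lower is better."""
    return mean(abs(d) for d in deltas.values())


def stability_rate(deltas: dict[str, float], threshold: float = STABILITY_THRESHOLD) -> float:
    """Share of transformations whose score change stays inside the threshold."""
    stable = sum(1 for d in deltas.values() if abs(d) < threshold)
    return stable / len(deltas)


# Hypothetical scores for one problem, for illustration only.
baseline = 0.70
transformed = {"paraphrase": 0.72, "reorder": 0.61, "expand": 0.69, "contrastive": 0.48}

deltas = score_deltas(baseline, transformed)
mad = mean_absolute_delta(deltas)
stability = stability_rate(deltas)
print(f"MAD={mad:.3f}, stability={stability:.0%}")
```

In this toy case only the paraphrase and expand variants stay inside the threshold, so half of the transformations count as stable even though the average change looks moderate.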
This is where the paper becomes useful for model selection. The highest-scoring model is not the most stable model.
Hermes-4-70B has the best overall score, at 0.667. Hermes-4-405B follows at 0.618. But the strongest robustness profile belongs to Qwen3-30B-A3B: lowest MAD at 0.049, highest stability rate at 79.6%, and highest semantic similarity at 0.914.
| Model | Score | MAD ↓ | Stability ↑ | Semantic similarity ↑ |
|---|---|---|---|---|
| Hermes-4-70B | 0.667 | 0.086 | 50.7% | 0.832 |
| Hermes-4-405B | 0.618 | 0.109 | 67.1% | 0.878 |
| Qwen3-235B-A22B | 0.529 | 0.072 | 69.7% | 0.891 |
| Qwen3-30B-A3B | 0.514 | 0.049 | 79.6% | 0.914 |
| DeepSeek-R1-0528 | 0.470 | 0.107 | 67.1% | 0.783 |
| gpt-oss-20b | 0.445 | 0.211 | 27.0% | 0.527 |
| gpt-oss-120b | 0.441 | 0.143 | 64.5% | 0.772 |
The obvious executive reading would be: “pick the highest score.” The paper says: not so fast.
A high score tells us the model solved the canonical version of the task. Robustness tells us whether the model keeps its footing when the user does not speak benchmark. In business workflows, the second question is often the more expensive one.
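The divergence is easy to check by sorting the same results table on different columns. The sketch below copies the figures from the table above; the `rank` helper is a hypothetical convenience, not something the paper provides.

```python
# Figures copied from the results table: (score, MAD, stability %, semantic similarity)
RESULTS = {
    "Hermes-4-70B": (0.667, 0.086, 50.7, 0.832),
    "Hermes-4-405B": (0.618, 0.109, 67.1, 0.878),
    "Qwen3-235B-A22B": (0.529, 0.072, 69.7, 0.891),
    "Qwen3-30B-A3B": (0.514, 0.049, 79.6, 0.914),
    "DeepSeek-R1-0528": (0.470, 0.107, 67.1, 0.783),
    "gpt-oss-20b": (0.445, 0.211, 27.0, 0.527),
    "gpt-oss-120b": (0.441, 0.143, 64.5, 0.772),
}


def rank(metric_index: int, higher_is_better: bool) -> list[str]:
    """Return model names ordered from best to worst on one metric."""
    return sorted(RESULTS, key=lambda m: RESULTS[m][metric_index], reverse=higher_is_better)


by_score = rank(0, higher_is_better=True)
by_mad = rank(1, higher_is_better=False)
by_stability = rank(2, higher_is_better=True)

for label, order in [("score", by_score), ("MAD", by_mad), ("stability", by_stability)]:
    print(f"best by {label}: {order[0]}")
```

On these figures the score leader, Hermes-4-70B, drops to sixth of seven when the models are ordered by stability rate.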
A finance analyst may ask for a risk summary one way. A portfolio manager may ask the same thing with different assumptions foregrounded. A junior employee may paste the same facts in a messy order. A customer may include irrelevant background. If the model’s conclusion changes materially across those cases, the organization does not have an AI agent. It has a probabilistic intern with excellent vocabulary and variable blood sugar.
Metamorphic testing turns “prompt sensitivity” into an audit method
The methodological contribution is the paper’s most practical one. The authors use metamorphic testing, a software testing idea designed for cases where the exact output may be hard to specify but the relationship between outputs should be predictable.
For example, if an image classifier recognizes a stop sign in daylight, it should still recognize it after a minor brightness adjustment. If a route planner finds the same path after irrelevant map labels are hidden, good. If it suddenly routes the truck through the ocean, less good.
For LLM agents, the equivalent relation is semantic invariance. The exact wording can change. The solution should not.
The paper’s transformations fall into three broad groups, with the contrastive stress test as a separate fourth category:
| Category | Transformation | Likely test purpose | What it supports | What it does not prove |
|---|---|---|---|---|
| Structural | Identity | Baseline stochasticity | Whether the model is stable even with the same prompt | Real-world paraphrase robustness |
| Structural | Paraphrase | Surface-form sensitivity | Whether wording changes alter reasoning | Robustness to misleading context |
| Structural | Reorder facts | Input-order sensitivity | Whether reasoning depends on fact sequence | Robustness when facts are missing |
| Verbosity | Expand | Extra-context filtering | Whether additional explanation helps or distracts | General long-context reliability |
| Verbosity | Contract | Essential-information extraction | Whether shorter phrasing preserves reasoning | Performance with genuinely incomplete prompts |
| Contextual | Academic context | Domain framing | Whether exam-style framing changes reasoning | Domain expertise in the academic field |
| Contextual | Business context | Operational framing | Whether practical framing changes reasoning | Production readiness by itself |
| Stress test | Contrastive | Distractor resistance | Whether misleading alternatives destabilize the model | Pure semantic invariance |
This table is worth lingering on because it is where many applied AI teams make a category mistake.
They treat prompt testing as a writing exercise: “Which wording gets the best answer?” The paper treats prompt testing as a reliability exercise: “Which meaning-preserving variations should not change the answer, but do?”
That shift matters. It turns prompt engineering from craft into measurement. Less poetry, more instrumentation. A tragic loss for prompt gurus, perhaps, but a gain for anyone who has to sign off on a production system.
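As an illustration of what that instrumentation can look like, here is a minimal metamorphic check. Everything in it is an assumption made for the sketch: `brittle_agent` is a toy stand-in for an LLM call, and the transformations are simple string operations rather than the paper's validated, LLM-assisted rewrites.

```python
import re
from typing import Callable

Transform = Callable[[str], str]


def identity(prompt: str) -> str:
    return prompt


def paraphrase(prompt: str) -> str:
    """Toy paraphrase: swap a few words without touching the facts."""
    return prompt.replace("holds", "contains").replace("are drained", "are removed")


def reorder_facts(prompt: str) -> str:
    """Reverse the fact sentences while keeping the final question last."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", prompt.strip()) if s]
    *facts, question = sentences
    return " ".join(list(reversed(facts)) + [question])


def brittle_agent(prompt: str) -> int:
    """Toy stand-in for an agent: subtracts the second number it sees from the first."""
    first, second = [int(n) for n in re.findall(r"\d+", prompt)][:2]
    return first - second


def check_invariance(agent: Callable[[str], int], prompt: str,
                     transforms: dict[str, Transform]) -> dict[str, bool]:
    """Metamorphic relation: the answer on each variant must equal the original answer."""
    expected = agent(prompt)
    return {name: agent(t(prompt)) == expected for name, t in transforms.items()}


prompt = "A tank holds 100 liters. 30 liters are drained. How many liters remain?"
transforms = {"identity": identity, "paraphrase": paraphrase, "reorder_facts": reorder_facts}
results = check_invariance(brittle_agent, prompt, transforms)
print(results)
```

The toy agent passes the identity and paraphrase checks but fails once the two facts swap places, which is the kind of order dependence the reorder transformation is meant to expose. No reference answer is needed for the check itself; only the relation between outputs is tested.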
The surprising result: the smaller Qwen model is the most stable
The first major finding is a scale-robustness inversion. Larger models do not automatically become more semantically stable.
Within Qwen3, the smaller Qwen3-30B-A3B achieves better robustness than Qwen3-235B-A22B. The smaller model has a MAD of 0.049 versus 0.072 for the larger one, a stability rate of 79.6% versus 69.7%, and slightly higher semantic similarity (0.914 versus 0.891). The paper also notes that Qwen3-30B-A3B has only 3B active parameters, because it is a mixture-of-experts model.
This should not be overread as “small models are better.” That would be the kind of conclusion that travels well on social media and poorly in procurement. Hermes-4-70B still has the highest overall score. The better lesson is more specific: capability scale and invariance stability are separate dimensions.
A model can know more and still behave less consistently under reformulation. A model can be cheaper and smaller yet more stable for a particular reasoning workflow. For businesses, this changes the procurement question from:
Which model is best?
To:
Which model is best for this workflow, under the input variations this workflow actually produces?
That is a less glamorous question. It is also the one that reduces operational surprises.
Model families do not fail the same way
The second useful comparison is not between parameter counts, but between failure signatures.
The paper finds that different model families show different vulnerability profiles:
| Model family | Observed pattern | Practical reading |
|---|---|---|
| Hermes | Strong baseline performance, but vulnerable to contrastive framing | Good canonical reasoning may need extra guarding when prompts contain alternatives or misconceptions. |
| Qwen3 | Most balanced robustness profile, especially Qwen3-30B-A3B | Attractive candidate for stability-sensitive reasoning checks. |
| DeepSeek-R1 | Sensitive to structural transformations, especially fact reordering | Workflows with unordered or messy inputs may require preprocessing. |
| gpt-oss | Broad instability under several transformations | Needs careful validation before reliability-sensitive use. |
The specific numbers are revealing. Hermes models degrade under contrastive transformations: $\Delta = -0.126$ for Hermes-4-70B and $\Delta = -0.208$ for Hermes-4-405B. DeepSeek-R1 shows pronounced sensitivity to reordered facts, with $\Delta = -0.171$. The gpt-oss family has severe failures, including $\Delta = -0.449$ for gpt-oss-120b under contrastive framing and $\Delta = -0.270$ for gpt-oss-20b under fact reordering.
Qwen3 is not magic. It still degrades under contrastive framing. But compared with the other families, it shows the most balanced robustness profile, and Qwen3-30B-A3B keeps mean absolute delta below 0.05.
This is exactly the kind of information conventional benchmark leaderboards do not give you. A leaderboard compresses model quality into a ranking. Deployment needs a failure map.
A failure map says: this model handles business-context reframing well but becomes brittle when irrelevant alternatives are introduced. That model is good when facts are presented neatly but weak when order changes. Another model is cheap but erratic. This is less satisfying than a single score. It is also closer to engineering.
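A failure map can start as nothing more than a lookup of per-transformation deltas with a tolerance. The sketch below uses only the deltas quoted in this article, so cells the article does not report are simply absent. The 0.10 tolerance and the `failure_map` helper are hypothetical choices for illustration.

```python
# Mean score deltas quoted in this article; unreported cells are left out.
REPORTED_DELTAS = {
    "Hermes-4-70B": {"contrastive": -0.126},
    "Hermes-4-405B": {"contrastive": -0.208},
    "Qwen3-30B-A3B": {"contrastive": -0.088},
    "DeepSeek-R1-0528": {"reorder": -0.171},
    "gpt-oss-20b": {"reorder": -0.270},
    "gpt-oss-120b": {"contrastive": -0.449},
}


def failure_map(deltas: dict[str, dict[str, float]],
                tolerance: float = 0.10) -> dict[str, list[tuple[str, float]]]:
    """Group models by the transformation that degrades them beyond the tolerance."""
    flagged: dict[str, list[tuple[str, float]]] = {}
    for model, cells in deltas.items():
        for transformation, delta in cells.items():
            if delta <= -tolerance:
                flagged.setdefault(transformation, []).append((model, delta))
    for rows in flagged.values():
        rows.sort(key=lambda row: row[1])  # worst degradation first
    return flagged


fmap = failure_map(REPORTED_DELTAS)
for transformation, rows in fmap.items():
    print(transformation, rows)
```

A real map would fill in every model and transformation pair from your own test runs; the point of the structure is that it is keyed by transformation, so guardrails can be attached to the input pattern rather than to the model name.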
Contrastive framing is the universal foot-gun
The contrastive transformation is the most interesting stress test in the paper.
It introduces explicit comparisons with alternative scenarios or common misconceptions. For example, a transformed problem may say, in effect: solve this problem, but note that a different related case would be handled differently. The added material is not supposed to change the correct solution. It is there to test whether the model can keep relevant and irrelevant reasoning paths separate.
All models degrade under this transformation. The mean score delta ranges from $-0.088$ for Qwen3-30B-A3B to $-0.449$ for gpt-oss-120b. The paper also reports wider variance under contrastive framing, with outliers extending beyond $-0.6$ for gpt-oss-120b.
This is not just a prompt-engineering inconvenience. It is a realistic business condition.
Real prompts often contain contrastive material:
- “This is similar to last quarter, except the client is under a different contract.”
- “Do not treat this as a renewal; it is technically a reinstatement.”
- “Some customers think this fee is refundable, but our policy says otherwise.”
- “This case looks like a chargeback, but it may be a fraud review.”
Those sentences are operationally useful. They help humans avoid the wrong analogy. For LLMs, they can become cognitive bait.
The likely mechanism is not mysterious. LLMs process text through attention over tokens and learned associations. When a prompt includes an alternative scenario, the model may partially activate reasoning paths attached to that scenario. If it fails to separate “this is not the case” from “this case is relevant,” the output drifts. The paper does not prove a full mechanistic account, so we should not pretend it does. But the operational implication is clear: prompts that mention misleading alternatives need extra validation.
This is especially important for compliance, legal triage, medical decision support, fraud review, investment analysis, and technical troubleshooting. Those workflows often require distinguishing the true case from tempting near-cases. Unfortunately, “near but wrong” is precisely where the paper finds universal fragility.
Academic and business framing are less dangerous than misleading alternatives
One reassuring result: contextual reframing itself is not the main villain.
The paper reports that academic-context and business-context transformations show near-zero median deltas with minimal variance for most models. In other words, many models can handle the same problem framed as an exam question or as a professional scenario.
That matters because businesses often worry that models trained on academic benchmarks will fail when the same reasoning is wrapped in operational language. This paper suggests the concern should be more targeted. Domain-appropriate framing is not necessarily destabilizing. Misleading contrast is.
This distinction helps avoid lazy prompt rules. “Keep prompts short” is not enough. “Avoid business context” is not the lesson. The better rule is:
Separate task facts from distractors, exceptions, and near-cases; then test whether the model still reaches the same conclusion when those elements are rearranged or restated.
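One way to apply that rule is to keep facts and near-cases as separate fields and render test variants from them. This is a sketch under stated assumptions: the `TaskPrompt` class and its section labels are hypothetical, and whether explicit labeling actually helps a given model is an empirical question the paper does not settle.

```python
from dataclasses import dataclass, field


@dataclass
class TaskPrompt:
    question: str
    facts: list[str]
    # Contrastive material: related cases that should NOT change the answer.
    near_cases: list[str] = field(default_factory=list)

    def render(self, include_near_cases: bool = True, reverse_facts: bool = False) -> str:
        facts = list(reversed(self.facts)) if reverse_facts else self.facts
        lines = ["FACTS (apply to this case):"]
        lines += [f"- {fact}" for fact in facts]
        if include_near_cases and self.near_cases:
            lines.append("NOT APPLICABLE (related cases that do not apply here):")
            lines += [f"- {case}" for case in self.near_cases]
        lines.append(f"QUESTION: {self.question}")
        return "\n".join(lines)

    def variants(self) -> dict[str, str]:
        """Variants that should all lead to the same conclusion."""
        return {
            "canonical": self.render(),
            "facts_reordered": self.render(reverse_facts=True),
            "without_near_cases": self.render(include_near_cases=False),
        }


task = TaskPrompt(
    question="Should the setup fee be refunded?",
    facts=["The customer paid a setup fee.", "Policy states that setup fees are final."],
    near_cases=["Some customers think this fee is refundable, but our policy says otherwise."],
)
variants = task.variants()
print(variants["canonical"])
```

Each rendered variant can then be sent to the model under test, and a changed conclusion across variants is the signal to investigate.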
The expand transformation adds another nuance. Qwen3 models sometimes improve under expanded wording, while gpt-oss and DeepSeek degrade more substantially. Extra context can clarify for one architecture and overwhelm another. So the practical question is not whether longer prompts are good or bad. It is which model treats extra context as signal and which treats it as fog.
What businesses should directly take from the paper
The paper directly shows three things.
First, semantic invariance can be evaluated systematically. The method is not just “try a few prompts and see what happens.” It defines transformations, measures score change, and compares robustness across models.
Second, model rankings change when robustness is measured. Hermes-4-70B leads on raw score, while Qwen3-30B-A3B leads on robustness. If your workflow values consistency more than peak performance on a clean prompt, the procurement decision may change.
Third, failure patterns differ by model family and transformation type. That makes robustness testing useful not only for choosing a model but also for designing guardrails, preprocessing, routing, and human review.
Cognaptus would infer the following operational framework:
| Business decision | What the paper supports | Practical action | Boundary |
|---|---|---|---|
| Model procurement | Raw score and robustness can diverge | Evaluate candidate models on both task score and invariance stability | Paper uses 19 scientific reasoning problems, not every business domain |
| Prompt governance | Rewording and context can change results | Build a semantic-variation test suite for each critical workflow | Transformations must be validated as meaning-preserving |
| Agent orchestration | Model families have different weaknesses | Route tasks to models based on robustness profile, not brand prestige | Requires workflow-specific evidence |
| Risk control | Contrastive framing is broadly destabilizing | Add verifier agents or human review when prompts contain misleading alternatives | The paper tests single-inference behavior, not repeated consensus protocols |
| Cost optimization | Smaller models may be more stable in some cases | Test cheaper models as verifiers or primary agents for narrow workflows | Smaller is not automatically better |
This is the practical heart of the paper. It does not tell every company to adopt Qwen3-30B-A3B. It tells every company to stop selecting agents as if one benchmark number were a personality profile.
A better deployment checklist: test the same task in different clothes
For an applied AI team, semantic invariance testing can become a lightweight governance layer.
Start with a set of high-value workflow tasks. Do not begin with abstract benchmarks. Use real prompts or realistic synthetic tasks: invoice exceptions, customer support escalations, investment memo summaries, policy classification, risk scoring, technical diagnosis, procurement review.
For each task, create controlled variants:
- Paraphrase the user request without changing the facts.
- Reorder independent facts.
- Expand the prompt with clarifying but non-essential context.
- Contract the prompt to essential facts.
- Reframe it as a business case or formal review.
- Add a contrastive near-case that should not change the answer.
Then measure whether the output changes in unacceptable ways. The measurement does not need to copy the paper exactly. A business version can combine automatic scoring, embedding similarity, rule checks, and expert review.
The important part is the relation, not the metric fetish. If the task meaning is unchanged, the decision should remain unchanged. If it does not, the system needs redesign.
A minimal production test could look like this:
| Step | Question | Failure signal |
|---|---|---|
| Canonical prompt | Does the model solve the clean version? | Wrong answer or missing reasoning |
| Paraphrase test | Does wording change the conclusion? | Different recommendation without factual reason |
| Reorder test | Does fact order affect the output? | Earlier facts dominate later facts incorrectly |
| Expansion test | Does extra context distract the model? | Irrelevant details alter the decision |
| Contraction test | Does concise input preserve the answer? | Model invents missing details |
| Contrastive test | Can the model reject tempting wrong analogies? | Near-case contaminates reasoning |
| Verifier test | Does a second model detect inconsistency? | Disagreement without escalation |
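The table maps fairly directly onto a small test runner. The sketch below is a hypothetical illustration: `toy_primary` and `toy_verifier` are keyword stubs standing in for real model calls, decisions are plain string labels, and the report format is invented for the example.

```python
from typing import Callable

Agent = Callable[[str], str]  # prompt -> decision label


def run_checklist(primary: Agent, verifier: Agent, canonical: str,
                  variants: dict[str, str], expected: str) -> dict:
    """Run the canonical prompt, each variant, and a verifier; collect failure signals."""
    failures = []
    baseline = primary(canonical)
    if baseline != expected:
        failures.append("canonical: wrong answer")
    for step, prompt in variants.items():
        if primary(prompt) != baseline:
            failures.append(f"{step}: decision changed")
    if verifier(canonical) != baseline:
        failures.append("verifier: disagreement")
    # Any failure signal triggers escalation to human review.
    return {"failures": failures, "escalate": bool(failures)}


def toy_primary(prompt: str) -> str:
    """Toy agent that gets pulled toward the near-case when it is mentioned."""
    return "approve" if "some customers think" in prompt.lower() else "deny"


def toy_verifier(prompt: str) -> str:
    return "deny"


canonical = "Policy: setup fees are final. The customer asks for the setup fee back."
variants = {
    "paraphrase": "The customer wants the setup fee returned. Setup fees are final under policy.",
    "contrastive": canonical + " Some customers think this fee is refundable, but our policy says otherwise.",
}
report = run_checklist(toy_primary, toy_verifier, canonical, variants, expected="deny")
print(report)
```

Here the toy agent passes the canonical and paraphrase steps but flips its decision when the near-case appears, so the run is escalated instead of silently accepted.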
This is not glamorous AI strategy. It is better: boring reliability work. The kind that prevents expensive incidents.
The paper’s boundaries matter
The study is valuable, but its scope should be read carefully.
The corpus has 19 multi-step reasoning problems. That is enough to reveal patterns, not enough to certify a model for all enterprise work. The domains are scientific and quantitative. They are relevant for reasoning, but not equivalent to legal document review, sales operations, clinical triage, trading signals, or government services.
The evaluation uses a single inference per problem-transformation pair, with temperature 0.7. This reflects realistic deployment behavior in one sense: many production systems do call a model once and act on the result. But it does not estimate the full distribution of outputs under repeated sampling. A model might look different under deterministic settings, self-consistency sampling, tool-verified reasoning, or ensemble voting.
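For teams that want to go beyond a single call, repeated sampling gives a rough view of the output distribution. The sketch below is not the paper's protocol: `noisy_agent` is a toy stand-in for a model sampled at non-zero temperature, and its 80/20 split is an arbitrary illustrative number.

```python
import random
from collections import Counter
from typing import Callable

SampledAgent = Callable[[str, random.Random], str]


def sample_decisions(agent: SampledAgent, prompt: str, n: int, seed: int = 0) -> list[str]:
    """Call the agent n times on the same prompt with a seeded random source."""
    rng = random.Random(seed)
    return [agent(prompt, rng) for _ in range(n)]


def majority_vote(decisions: list[str]) -> str:
    """Most common decision across samples (a simple self-consistency rule)."""
    return Counter(decisions).most_common(1)[0][0]


def agreement_rate(decisions: list[str]) -> float:
    """Share of samples that match the most common decision."""
    top_count = Counter(decisions).most_common(1)[0][1]
    return top_count / len(decisions)


def noisy_agent(prompt: str, rng: random.Random) -> str:
    """Toy stand-in: returns 'approve' about 80% of the time, regardless of the prompt."""
    return "approve" if rng.random() < 0.8 else "deny"


decisions = sample_decisions(noisy_agent, "Should the setup fee be refunded?", n=50)
print(majority_vote(decisions), f"{agreement_rate(decisions):.0%}")
```

A low agreement rate on the identical prompt is a warning that apparent sensitivity to rewording may partly be sampling noise, which is why an identity baseline matters when reading score deltas.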
Some transformations are LLM-assisted and manually validated. That is reasonable, but it introduces possible stylistic bias. The generator’s phrasing patterns may favor or penalize particular model families. The paper acknowledges this issue, and businesses should too: a semantic variation suite must be inspected by humans who understand the task.
Finally, contrastive framing is not strictly semantic-preserving. The paper treats it as a control or stress test. In business use, that is not a weakness; it is arguably the most realistic part of the study. But analytically, we should not mix it casually with pure invariance tests. It tells us about distractor resistance, not only semantic stability.
The conclusion: reliability is a relationship, not a property label
The easiest way to misread this paper is to turn it into another model ranking.
That would be unfortunate. The paper’s deeper message is that reliability is relational. A model is not simply “robust” or “not robust.” It is robust under certain transformations, for certain tasks, with certain prompt structures, under certain inference settings.
That is annoying. It is also how real systems work.
For Cognaptus-style automation, the implication is clear: every serious agentic workflow needs a semantic invariance test layer. Not because academic purity demands it, but because business users will inevitably ask the same thing in different words. They will reorder facts. They will add exceptions. They will mention near-cases. They will paste context that is useful, irrelevant, or actively misleading.
A benchmark tells you whether the model can answer the exam question.
Semantic invariance testing tells you whether it can survive the office.
That second test is less glamorous, less leaderboard-friendly, and much harder to compress into a marketing slide. Naturally, it is the one deployment teams actually need.
Cognaptus: Automate the Present, Incubate the Future.
[1] I. de Zarzà, J. de Curtò, Jordi Cabot, Pietro Manzoni, and Carlos T. Calafate, “Semantic Invariance in Agentic AI,” arXiv:2603.13173v2, 2026. https://arxiv.org/abs/2603.13173