Opening — Why this matters now

Agentic AI is quickly becoming the operating system of modern automation. From financial analysis to medical triage, organizations increasingly deploy large language models (LLMs) not merely as chat interfaces but as reasoning agents capable of multi‑step decision making.

There is, however, an awkward question hiding behind the benchmarks:

What happens if the same problem is asked in slightly different words?

In traditional software, identical inputs produce identical outputs. With LLM agents, the situation is… more interpretive. A physics problem phrased as a textbook exercise might produce one answer; framed as a logistics scenario, another. Sometimes the difference is subtle. Occasionally it is catastrophic.

A recent paper, Semantic Invariance in Agentic AI, introduces a rigorous way to test this reliability problem: a metamorphic testing framework that measures whether LLM reasoning remains stable under semantically equivalent inputs.

The results reveal something counter‑intuitive: bigger models are not necessarily more reliable.


Background — The benchmark illusion

Most AI evaluation today revolves around benchmark scores: MMLU, GSM8K, MATH, ARC, and similar datasets. These benchmarks measure accuracy using fixed problem formulations.

In effect, they assume something very convenient:

If a model solves a problem once, it will solve the same problem phrased differently.

Reality disagrees.

The research community has repeatedly observed that LLMs can be sensitive to small variations in phrasing, prompt order, or contextual framing.

The problem becomes especially serious when LLMs operate as autonomous agents rather than passive chat tools. In business workflows, inputs are messy and unpredictable:

  • Users phrase requests differently
  • Context changes across domains
  • Systems inject additional information

If reasoning changes under these variations, the agent’s reliability collapses.

The concept used to capture this property is semantic invariance.

Semantic invariance

A reasoning system is semantically invariant if:

Equivalent inputs → Equivalent reasoning and conclusions.

Or formally:

$$ M(p) \equiv M(\tau(p)) $$

Where:

  • $p$ is the original problem
  • $\tau(p)$ is a semantically equivalent transformation
  • $M$ is the reasoning agent

In other words, wording should not matter.


Analysis — How the paper tests reasoning stability

The authors introduce a metamorphic testing framework, a technique originally developed for validating software systems when ground‑truth outputs are difficult to define.

Instead of verifying a single correct answer, metamorphic testing checks whether relationships between outputs remain valid under controlled transformations.

For LLM reasoning agents, the metamorphic relation is simple:

If two problem statements mean the same thing, the solution should also be equivalent.
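This relation can be expressed as a small test harness. The sketch below is illustrative, not the paper's implementation: `solve` is a placeholder standing in for any LLM agent call that returns a parsed numeric answer.

```python
# Minimal metamorphic check: semantically equivalent prompts should yield
# equivalent answers, even though no single ground-truth output is asserted.

def solve(prompt: str) -> float:
    """Placeholder agent: a real version would call an LLM and parse
    the final numeric answer out of its response."""
    return 42.0

def metamorphic_check(original: str, transformed: str, tol: float = 1e-6) -> bool:
    """The metamorphic relation: if two statements mean the same thing,
    the extracted answers must agree (within a numeric tolerance)."""
    return abs(solve(original) - solve(transformed)) <= tol

base = "A train travels 120 km in 2 hours. What is its average speed?"
paraphrase = "What is the average speed of a train covering 120 km over 2 hours?"
print(metamorphic_check(base, paraphrase))  # True for this placeholder agent
```

Note that the harness never needs to know the correct answer is 60 km/h; it only checks that the two phrasings agree, which is exactly what makes metamorphic testing useful when ground truth is hard to specify.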

Eight semantic transformations

The framework applies eight transformations to each problem, grouped into three categories (structural, verbosity, and contextual) plus a contrastive stress test.

| Category | Transformation | What Changes | What Should NOT Change |
|---|---|---|---|
| Structural | Paraphrase | Different wording | Meaning |
| Structural | Fact reorder | Order of information | Logical solution |
| Structural | Identity | No change | Baseline stability |
| Verbosity | Expansion | Extra explanation | Final reasoning |
| Verbosity | Contraction | Remove redundant text | Final reasoning |
| Contextual | Academic framing | Exam‑style wording | Solution |
| Contextual | Business framing | Real‑world scenario | Solution |
| Stress test | Contrastive | Add misleading alternatives | Correct reasoning |
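As a rough sketch of how such a transformation suite might look in code, the registry below uses simple string templates. This is an assumption-laden toy: the paraphrase and fact-reorder transforms are omitted because they require model-generated rewrites, and the remaining templates are invented for illustration.

```python
# Toy transformation registry: each entry rewrites a problem statement while
# (ideally) preserving its meaning. Real paraphrase / fact-reorder transforms
# would be produced by an LLM, so they are left out of this sketch.

TRANSFORMS = {
    "identity":    lambda p: p,
    "expansion":   lambda p: p + " Explain each step before giving the final answer.",
    "contraction": lambda p: " ".join(p.split()),  # placeholder: strip redundancy
    "academic":    lambda p: f"Exam question: {p}",
    "business":    lambda p: f"A client asks: {p}",
    "contrastive": lambda p: (
        "Solve the problem below. Note that some people mistakenly "
        f"think the answer is zero. {p}"
    ),
}

problem = "A train travels 120 km in 2 hours. What is its average speed?"
variants = {name: transform(problem) for name, transform in TRANSFORMS.items()}
```

Each variant can then be fed to the same agent, and the answers compared against the identity baseline.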

The models tested include several major open‑model families:

| Model Family | Models Tested | Architecture |
|---|---|---|
| Hermes | 70B, 405B | Dense Transformer |
| Qwen3 | 30B‑A3B, 235B‑A22B | Mixture‑of‑Experts |
| DeepSeek | R1‑0528 | MoE reasoning model |
| GPT‑OSS | 20B, 120B | Dense Transformer |

The evaluation used 19 multi‑step reasoning problems spanning physics, math, chemistry, economics, statistics, biology, calculus, and optimization.


Findings — The strange economics of model reliability

The results challenge several common assumptions about AI capability.

1. Scale does not guarantee reliability

Contrary to intuition, the most robust model in the study was not the largest.

| Model | Mean Absolute Delta (MAD) | Stability Rate |
|---|---|---|
| Qwen3‑30B‑A3B | 0.049 | 79.6% |
| Qwen3‑235B‑A22B | 0.072 | 69.7% |
| Hermes‑70B | 0.086 | 50.7% |
| DeepSeek‑R1 | 0.107 | 67.1% |
| GPT‑OSS‑120B | 0.143 | 64.5% |
| GPT‑OSS‑20B | 0.211 | 27.0% |

Lower MAD means the model’s answers change less when wording changes.

Surprisingly, the smaller Qwen3‑30B model was the most stable, outperforming models up to 100× larger in effective parameters.

2. Each model family fails differently

The research identifies architectural “vulnerability signatures”:

| Model Family | Weakness Pattern |
|---|---|
| Hermes | Sensitive to contrastive framing |
| Qwen3 | Most balanced robustness |
| DeepSeek‑R1 | Sensitive to fact ordering |
| GPT‑OSS | Highly unstable under multiple transformations |

This suggests robustness is architecture‑dependent, not purely scale‑dependent.

3. The universal weakness: misleading context

One transformation destabilized every model tested: contrastive framing.

Example structure:

“Solve the problem below. Note that some people mistakenly think…”

Even when the misleading context is irrelevant, the model’s reasoning often degrades.

Performance drops reached −0.45 score delta in the worst cases.

This suggests attention‑based reasoning systems struggle when distractor information competes with relevant facts.


Implications — What this means for real AI systems

For businesses deploying agentic AI, the findings carry several practical lessons.

1. Benchmark accuracy is not reliability

A model that scores highly on benchmarks may still behave unpredictably when inputs vary.

Testing must include semantic perturbations, not just accuracy metrics.

2. Smaller models may be safer in production

The study’s results reinforce a growing theme in applied AI:

Reliability sometimes improves when models become simpler and more specialized.

This has direct implications for cost‑efficient AI deployment.

3. Agent orchestration should consider model weaknesses

Different models fail in different ways.

A robust multi‑agent system might deliberately combine models whose weaknesses do not overlap.

Example architecture:

| Role | Model Type |
|---|---|
| Primary reasoning | High‑accuracy model |
| Verification | Robust smaller model |
| Adversarial check | Consistency‑focused model |

This resembles how safety‑critical systems are engineered in aviation and finance.

4. Prompt design alone cannot solve the problem

Many teams assume prompt engineering fixes reliability issues.

The study suggests the root problem is architectural, not merely prompt‑level.


Conclusion — The next frontier of AI reliability

The history of AI evaluation has largely focused on capability.

The next phase will focus on stability.

As LLMs transition from chat assistants to autonomous agents, the key question becomes:

Not “Can the model solve the problem?” but “Will it solve the same problem the same way every time?”

Semantic invariance testing provides one of the first rigorous frameworks for answering that question.

And the early evidence suggests something humbling:

Even our most advanced reasoning models can lose their footing when the question is asked a little differently.

Which, for a reasoning system, is rather like forgetting that two plus two equals four — depending on how politely you ask.

Cognaptus: Automate the Present, Incubate the Future.