Opening — Why this matters now
Agentic AI is quickly becoming the operating system of modern automation. From financial analysis to medical triage, organizations increasingly deploy large language models (LLMs) not merely as chat interfaces but as reasoning agents capable of multi‑step decision making.
There is, however, an awkward question hiding behind the benchmarks:
What happens if the same problem is asked in slightly different words?
In traditional software, identical inputs produce identical outputs. With LLM agents, the situation is… more interpretive. A physics problem phrased as a textbook exercise might produce one answer; framed as a logistics scenario, another. Sometimes the difference is subtle. Occasionally it is catastrophic.
A recent research study titled Semantic Invariance in Agentic AI introduces a rigorous way to test this reliability problem. The paper proposes a metamorphic testing framework designed to measure whether LLM reasoning remains stable under semantically equivalent inputs.
The results reveal something counter‑intuitive: bigger models are not necessarily more reliable.
Background — The benchmark illusion
Most AI evaluation today revolves around benchmark scores: MMLU, GSM8K, MATH, ARC, and similar datasets. These benchmarks measure accuracy using fixed problem formulations.
In effect, they assume something very convenient:
If a model solves a problem once, it will solve the same problem phrased differently.
Reality disagrees.
The research community has repeatedly observed that LLMs can be sensitive to small variations in phrasing, prompt order, or contextual framing.
The problem becomes especially serious when LLMs operate as autonomous agents rather than passive chat tools. In business workflows, inputs are messy and unpredictable:
- Users phrase requests differently
- Context changes across domains
- Systems inject additional information
If reasoning changes under these variations, the agent’s reliability collapses.
The concept used to capture this property is semantic invariance.
Semantic invariance
A reasoning system is semantically invariant if:
Equivalent inputs → Equivalent reasoning and conclusions.
Or formally:
$$ M(p) \equiv M(\tau(p)) $$
Where:
- $p$ is the original problem
- $\tau(p)$ is a semantically equivalent transformation
- $M$ is the reasoning agent
In other words, wording should not matter.
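The definition above can be phrased as a tiny executable predicate. A minimal sketch, where `agent`, `transform`, and `equivalent` are hypothetical caller-supplied stand-ins for the model $M$, the transformation $\tau$, and an answer-equivalence judge:

```python
# Minimal sketch of the invariance predicate M(p) = M(tau(p)).
# `agent`, `transform`, and `equivalent` are hypothetical callables,
# not part of any real framework.

def is_semantically_invariant(agent, problem, transform, equivalent):
    """Return True if the agent's answer survives a meaning-preserving rewrite.

    agent:      callable mapping a problem string to an answer (M)
    transform:  callable producing a semantically equivalent problem (tau)
    equivalent: callable judging whether two answers agree
    """
    return equivalent(agent(problem), agent(transform(problem)))
```

In practice `equivalent` is the hard part: exact string match is too strict for free-form reasoning, so evaluations typically compare extracted final answers or graded scores rather than raw text.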
Analysis — How the paper tests reasoning stability
The authors introduce a metamorphic testing framework, a technique originally developed for validating software systems when ground‑truth outputs are difficult to define.
Instead of verifying a single correct answer, metamorphic testing checks whether relationships between outputs remain valid under controlled transformations.
For LLM reasoning agents, the metamorphic relation is simple:
If two problem statements mean the same thing, the solution should also be equivalent.
Eight semantic transformations
The framework applies eight transformations to each problem. These fall into three categories.
| Category | Transformation | What Changes | What Should NOT Change |
|---|---|---|---|
| Structural | Paraphrase | Different wording | Meaning |
| Structural | Fact reorder | Order of information | Logical solution |
| Structural | Identity | No change | Baseline stability |
| Verbosity | Expansion | Extra explanation | Final reasoning |
| Verbosity | Contraction | Remove redundant text | Final reasoning |
| Contextual | Academic framing | Exam‑style wording | Solution |
| Contextual | Business framing | Real‑world scenario | Solution |
| Stress test | Contrastive | Add misleading alternatives | Correct reasoning |
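As a toy illustration, the transformation families can be sketched as string-level rewrites. These one-liners are illustrative stand-ins only; the study applied full semantic rewrites of each problem, and the specific wording below is invented:

```python
# Toy stand-ins for the paper's transformation categories. Real
# transformations rewrite the whole problem while preserving meaning.

TRANSFORMS = {
    "identity":    lambda p: p,                                          # baseline
    "paraphrase":  lambda p: p.replace("Compute", "Work out"),           # wording changes
    "expansion":   lambda p: p + " Show every step of your reasoning.",  # verbosity up
    "contraction": lambda p: " ".join(p.split()),                        # verbosity down
    "contrastive": lambda p: "Some people mistakenly think the answer is 0. " + p,  # distractor
}

problem = "Compute the sum of 2 and 3."
variants = {name: t(problem) for name, t in TRANSFORMS.items()}
```

A test harness would then feed every variant to the agent and check that the final answers agree, rather than scoring any single phrasing.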
The models tested include several major open‑model families:
| Model Family | Models Tested | Architecture |
|---|---|---|
| Hermes | 70B, 405B | Dense Transformer |
| Qwen3 | 30B‑A3B, 235B‑A22B | Mixture‑of‑Experts |
| DeepSeek | R1‑0528 | MoE reasoning model |
| GPT‑OSS | 20B, 120B | Dense Transformer |
The evaluation used 19 multi‑step reasoning problems spanning physics, math, chemistry, economics, statistics, biology, calculus, and optimization.
Findings — The strange economics of model reliability
The results challenge several common assumptions about AI capability.
1. Scale does not guarantee reliability
Contrary to intuition, the most robust model in the study was not the largest.
| Model | Mean Absolute Delta (MAD) | Stability Rate |
|---|---|---|
| Qwen3‑30B‑A3B | 0.049 | 79.6% |
| Qwen3‑235B‑A22B | 0.072 | 69.7% |
| Hermes‑70B | 0.086 | 50.7% |
| DeepSeek‑R1 | 0.107 | 67.1% |
| GPT‑OSS‑120B | 0.143 | 64.5% |
| GPT‑OSS‑20B | 0.211 | 27.0% |
Lower MAD means the model’s answers change less when wording changes.
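For readers who want the metrics concrete, here is a plausible sketch of how MAD and a stability rate could be computed from per-transformation score deltas. The paper's exact scoring pipeline is not reproduced here, and the `tolerance` threshold is an illustrative assumption:

```python
# Sketch of the two stability metrics from per-run score deltas,
# where each delta is score(tau(p)) - score(p).

def mean_absolute_delta(deltas):
    """Average absolute score change across transformed runs (MAD)."""
    return sum(abs(d) for d in deltas) / len(deltas)

def stability_rate(deltas, tolerance=0.05):
    """Fraction of runs whose score shifted by at most `tolerance`.

    The threshold here is an assumption for illustration, not the paper's.
    """
    return sum(1 for d in deltas if abs(d) <= tolerance) / len(deltas)

deltas = [0.0, -0.02, 0.10, 0.01, -0.30]
print(round(mean_absolute_delta(deltas), 3))  # 0.086
print(stability_rate(deltas))                 # 0.6
```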
Surprisingly, the smaller Qwen3‑30B model was the most stable, outperforming models up to 100× larger in effective parameters.
2. Each model family fails differently
The research identifies architectural “vulnerability signatures”:
| Model Family | Weakness Pattern |
|---|---|
| Hermes | Sensitive to contrastive framing |
| Qwen3 | Most balanced robustness |
| DeepSeek‑R1 | Sensitive to fact ordering |
| GPT‑OSS | Highly unstable under multiple transformations |
This suggests robustness is architecture‑dependent, not purely scale‑dependent.
3. The universal weakness: misleading context
One transformation destabilized every model tested: contrastive framing.
Example structure:
“Solve the problem below. Note that some people mistakenly think…”
Even when the injected note is entirely irrelevant to the problem, the model’s reasoning often degrades.
Performance drops reached −0.45 score delta in the worst cases.
This suggests attention‑based reasoning systems struggle when distractor information competes with relevant facts.
Implications — What this means for real AI systems
For businesses deploying agentic AI, the findings carry several practical lessons.
1. Benchmark accuracy is not reliability
A model that scores highly on benchmarks may still behave unpredictably when inputs vary.
Testing must include semantic perturbations, not just accuracy metrics.
2. Smaller models may be safer in production
The study’s results reinforce a growing theme in applied AI:
Reliability sometimes improves when models become simpler and more specialized.
This has direct implications for cost‑efficient AI deployment.
3. Agent orchestration should consider model weaknesses
Different models fail in different ways.
A robust multi‑agent system might deliberately combine models whose weaknesses do not overlap.
Example architecture:
| Role | Model Type |
|---|---|
| Primary reasoning | High‑accuracy model |
| Verification | Robust smaller model |
| Adversarial check | Consistency‑focused model |
This resembles how safety‑critical systems are engineered in aviation and finance.
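A minimal sketch of that idea, assuming `primary`, `verifier`, and `adversarial_check` are placeholder callables wrapping three different models (none of these are real APIs):

```python
# Hypothetical orchestration: accept an answer only if it survives an
# independent verification and a contrastive (distractor-laden) re-ask.

def orchestrate(problem, primary, verifier, adversarial_check):
    answer = primary(problem)                       # high-accuracy model
    if not verifier(problem, answer):               # robust smaller model
        return None                                 # escalate instead of guessing
    # Re-ask under a contrastive rewrite, the universal weak spot.
    distracted = "Some people mistakenly think otherwise. " + problem
    if adversarial_check(distracted) != answer:     # consistency-focused model
        return None
    return answer
```

Returning `None` on disagreement mirrors how safety-critical systems prefer a detected fault over a silent wrong answer.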
4. Prompt design alone cannot solve the problem
Many teams assume prompt engineering fixes reliability issues.
The study suggests the root problem is architectural, not merely prompt‑level.
Conclusion — The next frontier of AI reliability
The history of AI evaluation has largely focused on capability.
The next phase will focus on stability.
As LLMs transition from chat assistants to autonomous agents, the key question becomes:
Not “Can the model solve the problem?” but “Will it solve the same problem the same way every time?”
Semantic invariance testing provides one of the first rigorous frameworks for answering that question.
And the early evidence suggests something humbling:
Even our most advanced reasoning models can lose their footing when the question is asked a little differently.
Which, for a reasoning system, is rather like forgetting that two plus two equals four — depending on how politely you ask.
Cognaptus: Automate the Present, Incubate the Future.