Consistency Is Not a Coincidence: When LLM Agents Disagree With Themselves

A support ticket arrives. The agent reads the same customer history, sees the same policy document, and has access to the same tools. On Monday, it searches for the refund rule, retrieves the correct clause, and gives a clean answer. On Tuesday, with the same input, it searches for a different phrase, retrieves a less relevant document, wanders through two extra steps, and ends with a confident answer that is only approximately useful.

Nothing crashed. No API failed. No one changed the prompt. The system simply disagreed with itself.

That is the uncomfortable starting point of “When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents,” a paper that studies not whether an LLM agent can solve a task once, but whether it behaves similarly when asked to solve the same task repeatedly.¹ This distinction matters more than it first appears. In ordinary software, repeated execution under the same input is supposed to be boring. In agentic AI, boredom has become an achievement.

The paper’s central claim is not that agents are stochastic. Everyone has learned that lesson, usually after watching a chatbot explain the same thing in three styles and one mild hallucination. The sharper claim is that inconsistency in the intermediate behavior of an agent — its tool calls, action paths, step count, and first point of divergence — can predict whether the final answer is likely to be correct.

That moves reliability analysis away from the final answer alone. The answer is the receipt. The action path is the transaction record.

The real failure mode starts before the final answer

Many business discussions of LLM reliability still orbit around the final output: Was the answer right? Was the recommendation compliant? Did the generated report contain a hallucination? These are necessary questions, but they arrive late.

Agent systems create a more useful trace. A ReAct-style agent does not simply emit an answer. It alternates between reasoning and action: think, search, retrieve, observe, think again, and eventually finish. In this paper, the agent has three tools:

Tool	Operational meaning
`Search(query)`	Find matching document titles from the HotpotQA context using keyword matching
`Retrieve(title)`	Retrieve the full text of a selected document
`Finish(answer)`	Stop execution and provide the final answer

This is a deliberately small action space. The point is not to recreate a full enterprise workflow with calendars, CRMs, databases, code execution, and ticketing tools. The point is to isolate a mechanism: given the same task, does the agent follow the same path?
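To make the action space concrete, here is a minimal sketch of three such tools over an in-memory context. The `ToyContext` class, its tokenizer, and the overlap-based ranking are assumptions made for illustration; the paper's actual implementation is not reproduced here.

```python
import re
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a stand-in for real keyword matching."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class ToyContext:
    """Hypothetical in-memory stand-in for one question's paragraphs."""
    docs: dict[str, str]  # title -> full text
    trace: list[tuple[str, str]] = field(default_factory=list)

    def search(self, query: str) -> list[str]:
        """Search(query): titles of documents sharing a keyword with the query."""
        self.trace.append(("Search", query))
        q = _tokens(query)
        scored = [(len(q & _tokens(title + " " + text)), title)
                  for title, text in self.docs.items()]
        ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
        return [title for score, title in ranked if score > 0]

    def retrieve(self, title: str) -> str:
        """Retrieve(title): full text of the selected document."""
        self.trace.append(("Retrieve", title))
        return self.docs.get(title, "")

    def finish(self, answer: str) -> str:
        """Finish(answer): record the final answer and stop."""
        self.trace.append(("Finish", answer))
        return answer


# Made-up documents, used only to show the call pattern.
ctx = ToyContext(docs={
    "Refund policy": "Refunds are issued within 30 days of purchase.",
    "Shipping policy": "Orders ship within two business days.",
})
hits = ctx.search("refund window")
ctx.retrieve(hits[0])
ctx.finish("30 days")
print([name for name, _ in ctx.trace])  # ['Search', 'Retrieve', 'Finish']
```

The `trace` list is the part that matters for everything that follows: it is the record that repeated runs can be compared on.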

The authors run 3,000 experiments: 100 hard HotpotQA questions, 10 repeated runs per question, across three models — Llama 3.1 70B, GPT-4o, and Claude Sonnet 4.5. Each question is multi-hop, with two gold paragraphs and eight distractors. That is enough complexity for search choices to matter, but still controlled enough that the traces can be compared.

The paper then measures consistency along several dimensions:

Metric	What it captures	Why it matters
Answer consistency	Whether repeated runs converge on the same final answer	A familiar output-level reliability signal
Action sequence diversity	How many distinct tool-action paths appear across repeated runs	A behavior-level instability signal
Step variance ratio	How much trajectory length varies	A proxy for wandering, backtracking, or uncertainty
First divergence point	The earliest step where repeated runs split	A diagnostic clue about where reliability breaks
Correctness	Whether the final answer matches the gold answer under fuzzy matching	The downstream outcome

This measurement design is the paper’s first useful contribution. It treats agent behavior as something observable, not just as mysterious “reasoning” hidden behind a final answer. That is exactly the kind of shift enterprise AI needs. A production agent should not be judged only by whether its final message sounds reasonable. It should be judged by whether its path through evidence and tools looks stable enough to trust.
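A rough sketch of how three of these metrics might be computed from logged runs. The `Run` shape, the choice to compare tool names only (ignoring arguments), and the definition of the step variance ratio as standard deviation divided by mean are all assumptions; the paper's exact formulas may differ.

```python
from collections import Counter
from statistics import mean, pstdev

# One run = the ordered tool names plus the final answer (assumed shape).
Run = tuple[tuple[str, ...], str]


def answer_consistency(runs: list[Run]) -> float:
    """Share of runs that agree with the most common final answer."""
    answers = Counter(answer.strip().lower() for _, answer in runs)
    return answers.most_common(1)[0][1] / len(runs)


def unique_action_sequences(runs: list[Run]) -> int:
    """Number of distinct tool-action paths across repeated runs."""
    return len({actions for actions, _ in runs})


def step_variance_ratio(runs: list[Run]) -> float:
    """Std. dev. of step counts divided by their mean (an assumed definition)."""
    steps = [len(actions) for actions, _ in runs]
    return pstdev(steps) / mean(steps)


# Four made-up runs of the same question.
runs = [
    (("Search", "Retrieve", "Finish"), "Paris"),
    (("Search", "Retrieve", "Finish"), "Paris"),
    (("Search", "Search", "Retrieve", "Retrieve", "Finish"), "paris"),
    (("Search", "Retrieve", "Search", "Retrieve", "Finish"), "Lyon"),
]
print(answer_consistency(runs))       # 0.75
print(unique_action_sequences(runs))  # 3
print(step_variance_ratio(runs))      # 0.25
```

Whether to compare tool names only or tool names plus arguments is a design choice. Including arguments makes the metric stricter, since two runs that both search but with different phrases then count as different paths.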

Consistency is not the same as correctness, but it is suspiciously informative

The paper’s headline result is simple: more consistent agents are more likely to be correct.

Across the three models, tasks with consistent behavior (at most two unique action sequences across the repeated runs) achieve 80–92% accuracy. Highly inconsistent tasks (six or more unique action sequences) fall to 25–60% accuracy. The gap ranges from 32 to 55 percentage points depending on the model.

Model	Accuracy on consistent tasks	Accuracy on inconsistent tasks	Gap
Claude Sonnet 4.5	84.8%	43.3%	41.5 pp
GPT-4o	80.1%	25.0%	55.1 pp
Llama 3.1 70B	92.0%	60.0%	32.0 pp

The exact ordering should be read carefully. Llama has the largest behavioral variance overall, yet its consistent-task accuracy is very high. GPT-4o has lower overall accuracy in this setup than Claude and Llama, and its inconsistency penalty is especially severe. Claude is both the most accurate and the most behaviorally stable in the paper’s aggregate results.

That last point is tempting to turn into a model-ranking headline. Resist the temptation. The experiment covers three models, one benchmark, one agent design, and a constrained tool environment. It is useful evidence, not a universal procurement memo.

The better interpretation is structural: when an agent solves the same problem through many different paths, that variation often reflects uncertainty rather than creativity. Sometimes variation is harmless. Sometimes there are multiple valid routes to the same answer. But in this setting, high path diversity is associated with lower correctness.
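The comparison behind that gap can be sketched as a simple bucketing step. The `TaskSummary` type and the sample values are made up for illustration; the default thresholds echo the two and six unique sequences quoted above, and treating them as inclusive bounds is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class TaskSummary:
    """Hypothetical per-task summary over repeated runs."""
    unique_sequences: int  # distinct action paths across the runs
    accuracy: float        # share of runs with a correct final answer


def bucket_accuracy(tasks, low=2, high=6):
    """Mean accuracy for consistent (<= low) and inconsistent (>= high) tasks.

    Returns (consistent accuracy, inconsistent accuracy, gap in percentage points).
    Tasks between the two thresholds are left out of both buckets.
    """
    consistent = [t.accuracy for t in tasks if t.unique_sequences <= low]
    inconsistent = [t.accuracy for t in tasks if t.unique_sequences >= high]
    if not consistent or not inconsistent:
        raise ValueError("both buckets need at least one task")
    acc_c = sum(consistent) / len(consistent)
    acc_i = sum(inconsistent) / len(inconsistent)
    return acc_c, acc_i, (acc_c - acc_i) * 100


# Made-up numbers, not results from the paper.
tasks = [
    TaskSummary(1, 1.0),
    TaskSummary(2, 0.8),
    TaskSummary(4, 0.7),
    TaskSummary(6, 0.5),
    TaskSummary(8, 0.3),
]
acc_c, acc_i, gap = bucket_accuracy(tasks)
print(f"consistent={acc_c:.2f} inconsistent={acc_i:.2f} gap={gap:.1f} pp")
```

Run on an internal benchmark, the same bucketing shows whether the association holds for a given workflow before anyone builds monitoring around it.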

For business use, this is the important replacement belief:

Common belief	Better replacement
“If the final answer is confident, it is probably fine.”	Confidence is cheap; stable evidence-seeking behavior is more informative.
“Repeated runs are only useful for majority voting.”	Repeated runs can also reveal whether the agent’s tool path is unstable.
“Agent reliability is mostly an output-evaluation problem.”	Some reliability signals appear before the final answer.
“Inconsistency just means the model is creative.”	In tool-using agents, inconsistency may mean the agent does not know where to look.

This is not a call to run every enterprise agent ten times forever. That would be a delightful way to multiply cloud bills tenfold and annoy the finance department. The point is more selective: repeated-run consistency can become a diagnostic tool, especially for high-risk tasks, uncertain tasks, or offline evaluation.

The first search query is where the agent quietly chooses its fate

The most interesting result is not merely that inconsistency predicts lower accuracy. It is where the inconsistency begins.

For Llama 3.1 70B, the model with the highest variance in the study, 69% of first divergence occurs at step 2. In this setup, step 2 is the first search query after the initial reasoning step.

First divergence point	Number of tasks	Share	Average correctness
Step 1	1	—	—
Step 2	59	69%	71.7%
Step 3 or later	26	30%	85.8%

This is a useful result because it localizes the failure. The agent does not slowly become unreliable after a long chain of reasoning. Much of the variance appears almost immediately, when the agent decides how to translate the question into a search query.
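Locating the first divergence point is mechanically simple once traces are logged. The sketch below assumes each run is a list of hashable steps (a thought label, or a tool name with its argument), which is an illustrative format rather than the paper's. The share calculation at the end reuses the task counts from the table above, with "step 3 or later" collapsed to 3.

```python
from collections import Counter


def first_divergence_step(runs):
    """1-indexed step where repeated runs first differ, or None if identical.

    A run that ends early is treated as diverging at the step after it ends.
    """
    longest = max(len(r) for r in runs)
    for i in range(longest):
        at_step = {r[i] if i < len(r) else None for r in runs}
        if len(at_step) > 1:
            return i + 1
    return None


def divergence_shares(steps):
    """Share of tasks per first-divergence step, ignoring tasks that never split."""
    counts = Counter(s for s in steps if s is not None)
    total = sum(counts.values())
    return {step: n / total for step, n in sorted(counts.items())}


# Two made-up runs that agree on the initial thought and split at the first search.
runs = [
    ["Thought", ("Search", "refund rule"), ("Retrieve", "Refund policy"), ("Finish", "30 days")],
    ["Thought", ("Search", "return window"), ("Retrieve", "Returns FAQ"), ("Finish", "14 days")],
]
print(first_divergence_step(runs))  # 2

# Task counts from the table: 1 at step 1, 59 at step 2, 26 at step 3 or later.
shares = divergence_shares([1] * 1 + [2] * 59 + [3] * 26)
print({step: round(share * 100) for step, share in shares.items()})  # {1: 1, 2: 69, 3: 30}
```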

That should feel familiar to anyone building retrieval-augmented generation or workflow automation. A system’s first query often determines the evidence universe. A slightly different query can retrieve a different document, which changes the next thought, which changes the next tool call, which changes the final answer. By the time the answer arrives, the “reasoning” may look coherent, but the path has already drifted.

In business terms, the first query is not a small implementation detail. It is a routing decision.

For enterprise agents, this suggests several practical interventions:

Intervention	Mechanism	When it helps
Query rewriting	Generate and normalize better initial search queries	Knowledge-base search, policy lookup, research agents
Query expansion	Include synonyms, entity variants, and structured filters	Customer support, legal retrieval, compliance search
Retrieval validation	Check whether retrieved evidence actually contains required entities	High-risk document QA
Parallel first-query comparison	Run multiple candidate first searches and compare overlap	Expensive but useful for critical tasks
Early divergence alarm	Flag tasks where repeated runs split at the first tool call	Human review, retry, escalation

Notice the difference between this and generic “improve the prompt” advice. The paper points to a specific bottleneck: the first tool-using action. That is where instrumentation should begin.
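One way the parallel first-query comparison and the early divergence alarm might look in practice is a retrieval-overlap check. The function names, the `top_k` cutoff, and the 0.5 threshold are hypothetical and would need calibration against real traffic.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap of two title sets; two empty sets count as full agreement."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def first_query_agreement(results: dict[str, list[str]], top_k: int = 3) -> float:
    """Mean pairwise Jaccard overlap of the top-k titles per candidate query."""
    tops = [set(titles[:top_k]) for titles in results.values()]
    pairs = list(combinations(tops, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def should_flag(results: dict[str, list[str]], threshold: float = 0.5) -> bool:
    """Early divergence alarm: flag when candidate first searches disagree."""
    return first_query_agreement(results) < threshold


# Made-up search results for three candidate first queries on one ticket.
results = {
    "refund rule": ["Refund policy", "Returns FAQ"],
    "money back terms": ["Refund policy", "Warranty terms"],
    "return window": ["Returns FAQ", "Shipping policy"],
}
print(round(first_query_agreement(results), 3))  # 0.222
print(should_flag(results))                      # True
```

Comparing retrieved titles rather than query strings is deliberate in this sketch: two differently worded queries that land on the same documents are not a reliability problem.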

Longer paths are not deeper thinking by default

The paper also reports that path length correlates with outcomes. In the Llama analysis, perfectly consistent tasks average 3.4 steps and reach 85.7% correctness. Highly inconsistent tasks average 7.8 steps and reach 43% correctness.

This result is easy to misread. It does not mean short answers are always better, or that long reasoning is bad. Some tasks genuinely require more steps. A financial due diligence agent that checks five sources is not automatically worse than one that checks two. The lesson is narrower: in this controlled setup, longer paths often look less like careful verification and more like uncertainty accumulating across decisions.

A useful operational distinction is:

Path pattern	Likely interpretation	Practical response
Short, stable path across repeated runs	The agent finds the same evidence route repeatedly	Lower review priority
Long but stable path	The task may be complex but procedurally controlled	Review only final evidence quality
Short but unstable path	The agent jumps among alternative evidence routes quickly	Check query formulation and retrieval
Long and unstable path	The agent is probably wandering	Retry, constrain tools, or escalate

The last category is the expensive one. Every additional action creates another chance for divergence. In a three-tool benchmark, this is already visible. In a real workflow with dozens of tools, user permissions, database calls, code execution, and external APIs, it becomes less charming.
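The four path patterns can be turned into a small triage rule. Both thresholds below are hypothetical placeholders, not values from the paper, and would need tuning per workflow.

```python
from statistics import mean


def classify_path_pattern(step_counts, unique_sequences,
                          long_steps=5.0, unstable_sequences=3):
    """Map repeated-run statistics onto the four path patterns.

    step_counts: number of steps in each repeated run of one task.
    unique_sequences: distinct action paths seen across those runs.
    """
    is_long = mean(step_counts) >= long_steps
    is_unstable = unique_sequences >= unstable_sequences
    responses = {
        (False, False): "lower review priority",
        (True, False): "review final evidence quality",
        (False, True): "check query formulation and retrieval",
        (True, True): "retry, constrain tools, or escalate",
    }
    length = "long" if is_long else "short"
    stability = "unstable" if is_unstable else "stable"
    return f"{length}, {stability}", responses[(is_long, is_unstable)]


# Made-up statistics for two tasks.
print(classify_path_pattern([3, 3, 4, 3], unique_sequences=1))
# ('short, stable', 'lower review priority')
print(classify_path_pattern([7, 9, 8, 6], unique_sequences=6))
# ('long, unstable', 'retry, constrain tools, or escalate')
```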

This is where the business implication becomes serious. Agent observability should not stop at logs for debugging. Logs should be converted into reliability features: path length, tool-call variance, early divergence, retrieval overlap, evidence reuse, and finish timing. A dashboard showing only final accuracy is useful for a demo. A dashboard showing behavioral stability is useful for operations.

The supporting tests are useful, but they are not a second thesis

Two additional analyses in the paper deserve attention because they clarify the boundary of the main claim.

The first is the temperature ablation. For Llama 3.1 70B on a 20-question subset, reducing temperature from 0.7 to 0.0 improves correctness from 77.4% to 82.8% and reduces unique action sequences from 4.2 to 2.2.

Test	Likely purpose	What it supports	What it does not prove
Temperature ablation	Robustness and diagnostic lower-bound test	Sampling temperature contributes to behavioral variance	That temperature 0.0 is always optimal in production
Question-type analysis	Exploratory extension	Different task formats can separate correctness from consistency	That one universal consistency metric fits all tasks
Model comparison	Main evidence plus practical comparison	Consistency and correctness differ across model families	A universal ranking of model reliability
First divergence analysis	Main mechanism evidence	Early tool-choice divergence is a key bottleneck	That step 2 is always the bottleneck in all agent systems
Path-length analysis	Supporting mechanism evidence	Wandering trajectories often signal lower reliability	That all long trajectories are bad

The temperature result is practically relevant. Lower temperature may improve stability. But it should not be treated as magic. The authors themselves frame temperature 0.0 as a diagnostic lower bound, while the main results use non-zero temperature to reflect realistic deployment. In production, lower temperature can reduce variation, but it may also reduce useful exploration in tasks where alternative reasoning paths are valuable.

The second supporting analysis compares bridge questions and comparison questions. Bridge questions show 75.7% correctness, 76.6% answer consistency, and 63% step variance for Llama. Comparison questions show higher correctness at 80.0%, but lower answer consistency at 62.4% and lower step variance at 41%.

This looks odd until the task format is considered. Comparison questions often have constrained answer spaces, such as yes/no. A model can land on the correct final answer even if its explanation path varies. That distinction matters. Answer consistency, explanation consistency, and action consistency are related but not identical.

For business agents, this means consistency metrics must be task-aware. A yes/no eligibility checker and a legal research assistant should not be evaluated in exactly the same way. A short-form classification task may tolerate more explanation variation if the decision is stable and auditable. A research or compliance agent may require evidence-path stability, not just answer agreement.

What this means for enterprise agent design

The paper directly shows three things.

First, ReAct-style agents can produce materially different action sequences across repeated runs, even when the input is identical. Second, higher behavioral consistency is associated with higher correctness in the tested HotpotQA setting. Third, for the most variable model in the study, most first divergences occur early, at the first search query.

Cognaptus would infer a practical design principle from this:

Agent reliability should be monitored at the trajectory level, not only at the answer level.

That principle can be translated into an operational workflow.

Stage	What to monitor	Possible action
Before deployment	Repeated-run path diversity on benchmark tasks	Identify unstable task categories
During low-risk execution	Tool path, step count, retrieval overlap	Store reliability features for later review
During high-risk execution	Parallel or repeated dry-runs	Escalate if early paths diverge
After failure	First divergence point	Fix query generation, retrieval, or tool routing
Model selection	Accuracy plus path stability	Avoid choosing based on one-shot benchmark performance only

This is not glamorous. It is not the kind of thing that produces a beautiful keynote slide with a robot shaking hands with a regional manager. It is plumbing. But reliability usually lives in plumbing.

The most immediate use case is selective escalation. Instead of sending every output to human review, an enterprise system can flag tasks where behavioral signals are weak: early divergence, high path diversity, long wandering trajectories, or low retrieval agreement. This makes review budget more targeted.

A second use case is automated retry. If repeated runs diverge at the first search query, the system may not need a stronger final-answer verifier. It may need a better query generator. A retry can be designed around that: rewrite the query, force entity extraction before search, or retrieve documents through a structured filter instead of free-form search.

A third use case is model evaluation. Vendors and internal teams often compare models by final task accuracy. The paper suggests adding behavioral stability to the model card. A model that is slightly more accurate but much less stable may be harder to operate. Conversely, a model that is stable in tool use may reduce downstream supervision costs even if its one-shot accuracy advantage is modest.

The business value is cheaper diagnosis, not mystical trust

The phrase “trustworthy AI” is usually where precision goes to retire. This paper offers something more practical: cheaper diagnosis.

If an agent fails, there are several possible causes:

Failure source	Observable clue
Bad retrieval query	Divergence at first search; low overlap in retrieved documents
Weak evidence selection	Same query but different retrieved or used documents
Reasoning instability	Same evidence but different intermediate conclusions
Premature finishing	Short path with poor evidence coverage
Wandering	Long path, high step variance, repeated search-retrieve cycles
Model uncertainty	High answer and path variance across repeated runs

This is operationally useful because different causes require different fixes. A retrieval-query problem should not be solved by adding a longer system prompt about “being careful.” A reasoning instability problem should not be solved only by changing the vector database. A premature finishing problem may need tool-use constraints. A wandering problem may need step limits, planning structure, or escalation.
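The first three rows of that table suggest a layered comparison: check the query, then the evidence, then the answer, and report the earliest layer where two runs differ. The `RunSummary` fields and the returned labels are hypothetical, and the remaining rows (premature finishing, wandering, model uncertainty) would need extra signals such as step counts and evidence coverage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical per-run summary extracted from agent logs."""
    first_query: str
    retrieved_titles: frozenset
    answer: str


def diagnose(a: RunSummary, b: RunSummary) -> Optional[str]:
    """Name the earliest layer at which two runs of the same task differ."""
    if a.first_query != b.first_query:
        return "retrieval query"      # divergence at first search
    if a.retrieved_titles != b.retrieved_titles:
        return "evidence selection"   # same query, different documents
    if a.answer != b.answer:
        return "reasoning"            # same evidence, different conclusion
    return None                       # no divergence at these layers


# Made-up summaries echoing the Monday and Tuesday runs from the opening.
monday = RunSummary("refund rule", frozenset({"Refund policy"}), "30 days")
tuesday = RunSummary("return window", frozenset({"Returns FAQ"}), "14 days")
print(diagnose(monday, tuesday))  # retrieval query
```

Stopping at the earliest differing layer matters: once the first query differs, later differences in evidence and answer are expected consequences rather than separate faults.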

The paper’s most valuable business implication is therefore not “run agents multiple times.” That is a tactic. The deeper implication is that repeated trajectories let teams distinguish failure modes that look identical at the final-answer layer.

Two wrong answers may be wrong in different ways. One may follow the same bad path every time, suggesting a systematic retrieval or knowledge-base issue. Another may vary wildly across runs, suggesting uncertainty or underspecified routing. The remediation should differ.

Where the evidence stops

The limitations are not decorative. They define how far the result can travel.

The study uses one benchmark: HotpotQA in the distractor setting. The sample is 100 hard questions. The agent has only three tools. Search is lexical over the provided context. The main experiment uses three models. The temperature ablation uses a smaller 20-question subset. Correctness is measured through fuzzy string matching, which is reasonable for the benchmark but not equivalent to business-grade correctness in legal, financial, medical, or operational settings.
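To see why fuzzy matching falls short of business-grade correctness, consider a minimal sketch of one common approach: normalize both strings, then accept containment or high character similarity. This is an assumed recipe built on Python's standard `difflib`, not the paper's exact rule, and the 0.8 threshold is arbitrary.

```python
import re
import string
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def fuzzy_match(prediction: str, gold: str, threshold: float = 0.8) -> bool:
    """True if the normalized strings contain one another or are near-identical."""
    p, g = normalize(prediction), normalize(gold)
    if not p or not g:
        return p == g
    if g in p or p in g:
        return True
    return SequenceMatcher(None, p, g).ratio() >= threshold


print(fuzzy_match("The Eiffel Tower.", "Eiffel Tower"))  # True
print(fuzzy_match("30 days", "14 days"))                 # False
print(fuzzy_match("no", "Norway"))                       # True
```

The last two lines show the gap. The wrong refund window is rejected only because of the chosen threshold, and a short answer such as "no" passes against an unrelated gold string through containment. Either behavior may be tolerable on a benchmark and unacceptable in a refund or compliance workflow.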

The paper also studies a retrieval-style question-answering environment. That is relevant to many enterprise workflows, but it is not the same as coding agents, spreadsheet agents, autonomous procurement agents, web-navigation agents, or multi-agent planning systems. In those settings, action spaces are larger, states are more persistent, and errors can compound through external side effects.

That said, the limitation cuts both ways. The experiment observes substantial behavioral variance in a small, controlled action space. A real enterprise agent with more tools may not become magically more stable. More tools can mean more opportunities to diverge. Of course, better architecture can also impose structure: planners, typed tool schemas, retrieval validation, constrained decoding, deterministic routers, and human-in-the-loop gates can all reduce variance.

So the correct business conclusion is not universal pessimism. It is measurement discipline. Before claiming that an agent is reliable, measure whether it behaves consistently under repeated execution. Then inspect where it diverges.

The article-worthy lesson: instability is a signal, not just a nuisance

The paper is useful because it changes the diagnostic question.

Instead of asking only:

Did the agent produce the right answer?

we should also ask:

Did the agent reach that answer through a stable and inspectable path?

That second question is not academic housekeeping. It affects deployment. It affects review cost. It affects whether failures can be debugged. It affects whether a model that looks strong in a demo can survive routine business operations.

For Cognaptus-style automation, the natural framework is:

Measure trajectory stability during evaluation, not after incidents.
Identify first divergence points to locate unstable workflow decisions.
Treat early tool-choice variance as a risk signal, especially in retrieval-heavy workflows.
Use repeated runs selectively, where the cost of failure justifies the diagnostic expense.
Separate answer agreement from evidence-path agreement, because some tasks need one more than the other.
Tune architecture, not only temperature, because sampling noise is only part of the problem.

The slightly sarcastic version: if your agent cannot agree with itself on where to look first, perhaps do not let it approve refunds, rewrite contracts, or operate a trading workflow without supervision. It may still be useful. It may even be very useful. But useful and autonomous are not synonyms, no matter how many product pages imply otherwise.

The more serious version: behavioral consistency gives teams a practical early-warning signal. It does not replace correctness evaluation. It does not prove truth. It does not eliminate the need for human judgment in high-stakes contexts. But it gives builders one more observable layer between raw model output and business consequence.

That layer is where reliable agent systems will be built: not in the poetry of “autonomy,” but in the boring, measurable discipline of watching what the agent actually does.

Cognaptus: Automate the Present, Incubate the Future.

¹ Aman Mehta, “When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents,” arXiv:2602.11619, 2026.
