A support ticket arrives. The agent reads the same customer history, sees the same policy document, and has access to the same tools. On Monday, it searches for the refund rule, retrieves the correct clause, and gives a clean answer. On Tuesday, with the same input, it searches for a different phrase, retrieves a less relevant document, wanders through two extra steps, and ends with a confident answer that is only approximately useful.
Nothing crashed. No API failed. No one changed the prompt. The system simply disagreed with itself.
That is the uncomfortable starting point of “When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents,” a paper that studies not whether an LLM agent can solve a task once, but whether it behaves similarly when asked to solve the same task repeatedly [1]. This distinction matters more than it first appears. In ordinary software, repeated execution under the same input is supposed to be boring. In agentic AI, boredom has become an achievement.
The paper’s central claim is not that agents are stochastic. Everyone has learned that lesson, usually after watching a chatbot explain the same thing in three styles and one mild hallucination. The sharper claim is that inconsistency in the intermediate behavior of an agent — its tool calls, action paths, step count, and first point of divergence — can predict whether the final answer is likely to be correct.
That moves reliability analysis away from the final answer alone. The answer is the receipt. The action path is the transaction record.
The real failure mode starts before the final answer
Many business discussions of LLM reliability still orbit around the final output: Was the answer right? Was the recommendation compliant? Did the generated report contain a hallucination? These are necessary questions, but they arrive late.
Agent systems create a more useful trace. A ReAct-style agent does not simply emit an answer. It alternates between reasoning and action: think, search, retrieve, observe, think again, and eventually finish. In this paper, the agent has three tools:
| Tool | Operational meaning |
|---|---|
| Search(query) | Find matching document titles from the HotpotQA context using keyword matching |
| Retrieve(title) | Retrieve the full text of a selected document |
| Finish(answer) | Stop execution and provide the final answer |
This is a deliberately small action space. The point is not to recreate a full enterprise workflow with calendars, CRMs, databases, code execution, and ticketing tools. The point is to isolate a mechanism: given the same task, does the agent follow the same path?
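To make the three-tool action space concrete, here is a minimal sketch in Python. It is not the paper's implementation: the class name `ToyToolEnv`, the word-overlap matching rule, and the two sample documents are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


def _words(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for real keyword matching."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class ToyToolEnv:
    """Hypothetical stand-in for one question's context: title -> full text."""
    docs: dict[str, str]

    def search(self, query: str) -> list[str]:
        """Return titles whose title or text shares at least one word with the query."""
        terms = _words(query)
        return [
            title
            for title, text in self.docs.items()
            if terms & _words(title + " " + text)
        ]

    def retrieve(self, title: str) -> str:
        """Return the full text of a document, or an empty string if unknown."""
        return self.docs.get(title, "")

    def finish(self, answer: str) -> str:
        """Terminal action: the returned string is the run's final answer."""
        return answer


# Invented sample context, only to show the calls.
env = ToyToolEnv({
    "Refund Policy": "Refunds are issued within 30 days.",
    "Shipping": "Orders ship in two days.",
})
print(env.search("refund policy"))
```

Even in this toy version, the wording of the query decides which titles come back, which is the mechanism the rest of the analysis depends on.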
The authors run 3,000 experiments: 100 hard HotpotQA questions, 10 repeated runs per question, across three models — Llama 3.1 70B, GPT-4o, and Claude Sonnet 4.5. Each question is multi-hop, with two gold paragraphs and eight distractors. That is enough complexity for search choices to matter, but still controlled enough that the traces can be compared.
The paper then measures consistency along several dimensions:
| Metric | What it captures | Why it matters |
|---|---|---|
| Answer consistency | Whether repeated runs converge on the same final answer | A familiar output-level reliability signal |
| Action sequence diversity | How many distinct tool-action paths appear across repeated runs | A behavior-level instability signal |
| Step variance ratio | How much trajectory length varies | A proxy for wandering, backtracking, or uncertainty |
| First divergence point | The earliest step where repeated runs split | A diagnostic clue about where reliability breaks |
| Correctness | Whether the final answer matches the gold answer under fuzzy matching | The downstream outcome |
This measurement design is the paper’s first useful contribution. It treats agent behavior as something observable, not just as mysterious “reasoning” hidden behind a final answer. That is exactly the kind of shift enterprise AI needs. A production agent should not be judged only by whether its final message sounds reasonable. It should be judged by whether its path through evidence and tools looks stable enough to trust.
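The first three metrics in the table can be sketched in a few lines of Python. The paper's exact formulas are not reproduced here, so treat these definitions as assumptions: answer consistency is taken as the share of runs agreeing with the most common answer, and the step variance ratio as the standard deviation of path length divided by its mean. The function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def answer_consistency(answers):
    """Share of runs that agree with the most common (normalized) answer."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def unique_action_sequences(paths):
    """Number of distinct tool-action paths across repeated runs."""
    return len({tuple(p) for p in paths})


def step_variance_ratio(paths):
    """Spread of path length relative to its mean (assumed definition)."""
    lengths = [len(p) for p in paths]
    return pstdev(lengths) / mean(lengths)


# Three invented runs of the same question.
paths = [
    ["Search", "Retrieve", "Finish"],
    ["Search", "Retrieve", "Finish"],
    ["Search", "Search", "Retrieve", "Finish"],
]
answers = ["Paris", "paris", "Lyon"]

print(unique_action_sequences(paths))
print(answer_consistency(answers))
print(step_variance_ratio(paths))
```

Each path is a list of strings, so the caller decides whether a step means only the tool name or the tool name plus its argument. That choice changes how strict "same path" is.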
Consistency is not the same as correctness, but it is suspiciously informative
The paper’s headline result is simple: more consistent agents are more likely to be correct.
Across the three models, tasks with consistent behavior (defined in the paper’s comparison as at most two unique action sequences across the ten runs) achieve 80–92% accuracy. Highly inconsistent tasks (six or more unique action sequences) fall to 25–60% accuracy. The gap ranges from 32 to 55 percentage points depending on the model.
| Model | Accuracy on consistent tasks | Accuracy on inconsistent tasks | Gap |
|---|---|---|---|
| Claude Sonnet 4.5 | 84.8% | 43.3% | 41.5 pp |
| GPT-4o | 80.1% | 25.0% | 55.1 pp |
| Llama 3.1 70B | 92.0% | 60.0% | 32.0 pp |
The exact ordering should be read carefully. Llama has the largest behavioral variance overall, yet its consistent-task accuracy is very high. GPT-4o has lower overall accuracy in this setup than Claude and Llama, but its inconsistency penalty is especially severe. Claude is both the most accurate and the most behaviorally stable in the paper’s aggregate results.
That last point is tempting to turn into a model-ranking headline. Resist the temptation. The experiment covers three models, one benchmark, one agent design, and a constrained tool environment. It is useful evidence, not a universal procurement memo.
The better interpretation is structural: when an agent solves the same problem through many different paths, that variation often reflects uncertainty rather than creativity. Sometimes variation is harmless. Sometimes there are multiple valid routes to the same answer. But in this setting, high path diversity is associated with lower correctness.
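A team that wants to reproduce this kind of gap on its own evaluation data could start from something like the sketch below. The function name, the input shape, and the default cut-offs are assumptions for illustration, and the sample numbers are invented rather than taken from the paper.

```python
from statistics import mean


def accuracy_gap(tasks, consistent_max=2, inconsistent_min=6):
    """Compare accuracy on consistent vs. highly inconsistent tasks.

    tasks: iterable of (n_unique_sequences, accuracy) pairs, accuracy in [0, 1].
    The two cut-offs are configurable assumptions, not fixed constants.
    Returns None if either bucket is empty.
    """
    tasks = list(tasks)
    consistent = [acc for n, acc in tasks if n <= consistent_max]
    inconsistent = [acc for n, acc in tasks if n >= inconsistent_min]
    if not consistent or not inconsistent:
        return None
    c, i = mean(consistent), mean(inconsistent)
    return {"consistent": c, "inconsistent": i, "gap_pp": 100 * (c - i)}


# Invented per-task results: (unique action sequences, share of correct runs).
sample = [(1, 1.0), (2, 0.8), (4, 0.5), (6, 0.4), (8, 0.2)]
print(accuracy_gap(sample))
```

Tasks between the two cut-offs are deliberately left out of both buckets, so the comparison is between the clear cases rather than a split at the median.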
For business use, this is the important replacement belief:
| Common belief | Better replacement |
|---|---|
| “If the final answer is confident, it is probably fine.” | Confidence is cheap; stable evidence-seeking behavior is more informative. |
| “Repeated runs are only useful for majority voting.” | Repeated runs can also reveal whether the agent’s tool path is unstable. |
| “Agent reliability is mostly an output-evaluation problem.” | Some reliability signals appear before the final answer. |
| “Inconsistency just means the model is creative.” | In tool-using agents, inconsistency may mean the agent does not know where to look. |
This is not a call to run every enterprise agent ten times forever. That would be a delightful way to multiply cloud bills tenfold and annoy the finance department. The point is more selective: repeated-run consistency can become a diagnostic tool, especially for high-risk tasks, uncertain tasks, or offline evaluation.
The first search query is where the agent quietly chooses its fate
The most interesting result is not merely that inconsistency predicts lower accuracy. It is where the inconsistency begins.
For Llama 3.1 70B, the model with the highest variance in the study, 69% of first divergences (59 of the 86 tasks counted in the table below) occur at step 2. In this setup, step 2 is the first search query after the initial reasoning step.
| First divergence point | Number of tasks | Share | Average correctness |
|---|---|---|---|
| Step 1 | 1 | — | — |
| Step 2 | 59 | 69% | 71.7% |
| Step 3 or later | 26 | 30% | 85.8% |
This is a useful result because it localizes the failure. The agent does not slowly become unreliable after a long chain of reasoning. Much of the variance appears almost immediately, when the agent decides how to translate the question into a search query.
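Locating the first divergence point is simple once trajectories are logged. The sketch below is one possible definition, not the paper's code: it returns the earliest 1-indexed step at which the repeated runs stop agreeing, and it treats a run that ends early as diverging at that step. The function name is hypothetical. The second half recomputes the shares from the task counts in the table above.

```python
from itertools import zip_longest


def first_divergence_step(paths):
    """Earliest 1-indexed step where runs disagree, or None if all runs match.

    Shorter runs are padded with None, so ending early counts as divergence.
    """
    for step, actions in enumerate(zip_longest(*paths), start=1):
        if len(set(actions)) > 1:
            return step
    return None


# Invented runs: identical first thought, different first search query.
runs = [
    ["Think", "Search[refund rule]", "Retrieve[Refund Policy]", "Finish"],
    ["Think", "Search[return window]", "Retrieve[Shipping]", "Finish"],
]
print(first_divergence_step(runs))

# Task counts from the table: divergence at step 1, step 2, step 3 or later.
counts = {"step 1": 1, "step 2": 59, "step 3+": 26}
total = sum(counts.values())
shares = {k: round(100 * v / total) for k, v in counts.items()}
print(total, shares)
```

The step numbering only lines up with the article's "step 2" if the logged path includes the initial reasoning step as step 1.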
That should feel familiar to anyone building retrieval-augmented generation or workflow automation. A system’s first query often determines the evidence universe. A slightly different query can retrieve a different document, which changes the next thought, which changes the next tool call, which changes the final answer. By the time the answer arrives, the “reasoning” may look coherent, but the path has already drifted.
In business terms, the first query is not a small implementation detail. It is a routing decision.
For enterprise agents, this suggests several practical interventions:
| Intervention | Mechanism | When it helps |
|---|---|---|
| Query rewriting | Generate and normalize better initial search queries | Knowledge-base search, policy lookup, research agents |
| Query expansion | Include synonyms, entity variants, and structured filters | Customer support, legal retrieval, compliance search |
| Retrieval validation | Check whether retrieved evidence actually contains required entities | High-risk document QA |
| Parallel first-query comparison | Run multiple candidate first searches and compare overlap | Expensive but useful for critical tasks |
| Early divergence alarm | Flag tasks where repeated runs split at the first tool call | Human review, retry, escalation |
Notice the difference between this and generic “improve the prompt” advice. The paper points to a specific bottleneck: the first tool-using action. That is where instrumentation should begin.
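Two of the interventions in the table, parallel first-query comparison and the early divergence alarm, reduce to measuring how much the retrieved document sets overlap. A minimal sketch, assuming Jaccard similarity as the overlap measure and an arbitrary 0.5 threshold, might look like this. Neither choice comes from the paper, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of document titles."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(result_sets):
    """Average Jaccard similarity over all pairs of first-search results."""
    pairs = list(combinations(result_sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def low_overlap_alarm(result_sets, min_overlap=0.5):
    """True when candidate first searches retrieve mostly different documents."""
    return mean_pairwise_overlap(result_sets) < min_overlap


# Invented results of three candidate first queries for the same question.
results = [["A", "B"], ["A", "B"], ["A", "C"]]
print(mean_pairwise_overlap(results), low_overlap_alarm(results))
```

Comparing retrieved titles rather than query strings is a deliberate choice in this sketch: two differently worded queries that land on the same evidence are not a reliability problem.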
Longer paths are not deeper thinking by default
The paper also reports that path length correlates with outcomes. In the Llama analysis, perfectly consistent tasks average 3.4 steps and reach 85.7% correctness. Highly inconsistent tasks average 7.8 steps and reach 43% correctness.
This result is easy to misread. It does not mean short answers are always better, or that long reasoning is bad. Some tasks genuinely require more steps. A financial due diligence agent that checks five sources is not automatically worse than one that checks two. The lesson is narrower: in this controlled setup, longer paths often look less like careful verification and more like uncertainty accumulating across decisions.
A useful operational distinction is:
| Path pattern | Likely interpretation | Practical response |
|---|---|---|
| Short, stable path across repeated runs | The agent finds the same evidence route repeatedly | Lower review priority |
| Long but stable path | The task may be complex but procedurally controlled | Review only final evidence quality |
| Short but unstable path | The agent jumps among alternative evidence routes quickly | Check query formulation and retrieval |
| Long and unstable path | The agent is probably wandering | Retry, constrain tools, or escalate |
The last category is the expensive one. Every additional action creates another chance for divergence. In a three-tool benchmark, this is already visible. In a real workflow with dozens of tools, user permissions, database calls, code execution, and external APIs, it becomes less charming.
This is where the business implication becomes serious. Agent observability should not stop at logs for debugging. Logs should be converted into reliability features: path length, tool-call variance, early divergence, retrieval overlap, evidence reuse, and finish timing. A dashboard showing only final accuracy is useful for a demo. A dashboard showing behavioral stability is useful for operations.
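As a rough illustration of turning logs into reliability features, the sketch below extracts two of them (mean path length and number of distinct paths) and maps them onto the four path patterns from the table. The cut-offs of six steps and two distinct paths are placeholders that a team would need to calibrate on its own workloads, and all names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PathFeatures:
    mean_steps: float
    unique_paths: int


def extract_features(paths):
    """Summarize repeated runs of one task into simple reliability features."""
    return PathFeatures(
        mean_steps=mean([len(p) for p in paths]),
        unique_paths=len({tuple(p) for p in paths}),
    )


def classify(features, long_steps=6.0, max_stable_paths=2):
    """Place a task in the short/long x stable/unstable grid (assumed cut-offs)."""
    length = "long" if features.mean_steps >= long_steps else "short"
    stability = "stable" if features.unique_paths <= max_stable_paths else "unstable"
    return f"{length}, {stability}"


# Practical responses, paraphrased from the table above.
RESPONSES = {
    "short, stable": "lower review priority",
    "long, stable": "review final evidence quality",
    "short, unstable": "check query formulation and retrieval",
    "long, unstable": "retry, constrain tools, or escalate",
}

# Invented logs for one task: three runs, each wandering differently.
wandering = [["Search"] * 7, ["Retrieve"] * 7, ["Search", "Retrieve"] * 4]
pattern = classify(extract_features(wandering))
print(pattern, "->", RESPONSES[pattern])
```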
The supporting tests are useful, but they are not a second thesis
Two additional analyses in the paper deserve attention because they clarify the boundary of the main claim.
The first is the temperature ablation. For Llama 3.1 70B on a 20-question subset, reducing temperature from 0.7 to 0.0 improves correctness from 77.4% to 82.8% and reduces unique action sequences from 4.2 to 2.2.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Temperature ablation | Robustness and diagnostic lower-bound test | Sampling temperature contributes to behavioral variance | That temperature 0.0 is always optimal in production |
| Question-type analysis | Exploratory extension | Different task formats can separate correctness from consistency | That one universal consistency metric fits all tasks |
| Model comparison | Main evidence plus practical comparison | Consistency and correctness differ across model families | A universal ranking of model reliability |
| First divergence analysis | Main mechanism evidence | Early tool-choice divergence is a key bottleneck | That step 2 is always the bottleneck in all agent systems |
| Path-length analysis | Supporting mechanism evidence | Wandering trajectories often signal lower reliability | That all long trajectories are bad |
The temperature result is practically relevant. Lower temperature may improve stability. But it should not be treated as magic. The authors themselves frame temperature 0.0 as a diagnostic lower bound, while the main results use non-zero temperature to reflect realistic deployment. In production, lower temperature can reduce variation, but it may also reduce useful exploration in tasks where alternative reasoning paths are valuable.
The second supporting analysis compares bridge questions and comparison questions. Bridge questions show 75.7% correctness, 76.6% answer consistency, and 63% step variance for Llama. Comparison questions show higher correctness at 80.0%, but lower answer consistency at 62.4% and lower step variance at 41%.
This looks odd until the task format is considered. Comparison questions often have constrained answer spaces, such as yes/no. A model can land on the correct final answer even if its explanation path varies. That distinction matters. Answer consistency, explanation consistency, and action consistency are related but not identical.
For business agents, this means consistency metrics must be task-aware. A yes/no eligibility checker and a legal research assistant should not be evaluated in exactly the same way. A short-form classification task may tolerate more explanation variation if the decision is stable and auditable. A research or compliance agent may require evidence-path stability, not just answer agreement.
What this means for enterprise agent design
The paper directly shows three things.
First, ReAct-style agents can produce materially different action sequences across repeated runs, even when the input is identical. Second, higher behavioral consistency is associated with higher correctness in the tested HotpotQA setting. Third, early divergence, especially in the first search query, explains a large share of behavioral variance for the most variable model in the study.
Cognaptus would infer a practical design principle from this:
Agent reliability should be monitored at the trajectory level, not only at the answer level.
That principle can be translated into an operational workflow.
| Stage | What to monitor | Possible action |
|---|---|---|
| Before deployment | Repeated-run path diversity on benchmark tasks | Identify unstable task categories |
| During low-risk execution | Tool path, step count, retrieval overlap | Store reliability features for later review |
| During high-risk execution | Parallel or repeated dry-runs | Escalate if early paths diverge |
| After failure | First divergence point | Fix query generation, retrieval, or tool routing |
| Model selection | Accuracy plus path stability | Avoid choosing based on one-shot benchmark performance only |
This is not glamorous. It is not the kind of thing that produces a beautiful keynote slide with a robot shaking hands with a regional manager. It is plumbing. But reliability usually lives in plumbing.
The most immediate use case is selective escalation. Instead of sending every output to human review, an enterprise system can flag tasks where behavioral signals are weak: early divergence, high path diversity, long wandering trajectories, or low retrieval agreement. This makes review budget more targeted.
A second use case is automated retry. If repeated runs diverge at the first search query, the system may not need a stronger final-answer verifier. It may need a better query generator. A retry can be designed around that: rewrite the query, force entity extraction before search, or retrieve documents through a structured filter instead of free-form search.
A third use case is model evaluation. Vendors and internal teams often compare models by final task accuracy. The paper suggests adding behavioral stability to the model card. A model that is slightly more accurate but much less stable may be harder to operate. Conversely, a model that is stable in tool use may reduce downstream supervision costs even if its one-shot accuracy advantage is modest.
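The first two use cases, selective escalation and automated retry, can be expressed as a small triage rule over the behavioral signals. The sketch below is one hypothetical policy, not something the paper prescribes: the action labels, the threshold of five distinct paths, and the single-retry limit are all assumptions.

```python
def triage(first_divergence_step, unique_paths,
           first_search_step=2, max_unique_paths=5, retried=False):
    """Pick a follow-up action from behavioral signals (hypothetical policy).

    first_divergence_step: 1-indexed step where repeated runs first split,
        or None if every run followed the same path.
    unique_paths: number of distinct action sequences across the runs.
    """
    if first_divergence_step is None:
        return "accept"
    if first_divergence_step <= first_search_step:
        # Runs split at or before the first search: fix the query, not the answer.
        return "escalate_to_human" if retried else "retry_with_query_rewrite"
    if unique_paths > max_unique_paths:
        return "escalate_to_human"
    return "accept"


print(triage(None, 1))
print(triage(2, 4))
print(triage(2, 4, retried=True))
print(triage(4, 7))
```

The ordering of the checks encodes the article's argument: early divergence is handled by changing the query before anyone spends review time on the final answer.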
The business value is cheaper diagnosis, not mystical trust
The phrase “trustworthy AI” is usually where precision goes to retire. This paper offers something more practical: cheaper diagnosis.
If an agent fails, there are several possible causes:
| Failure source | Observable clue |
|---|---|
| Bad retrieval query | Divergence at first search; low overlap in retrieved documents |
| Weak evidence selection | Same query but different retrieved or used documents |
| Reasoning instability | Same evidence but different intermediate conclusions |
| Premature finishing | Short path with poor evidence coverage |
| Wandering | Long path, high step variance, repeated search-retrieve cycles |
| Model uncertainty | High answer and path variance across repeated runs |
This is operationally useful because different causes require different fixes. A retrieval-query problem should not be solved by adding a longer system prompt about “being careful.” A reasoning instability problem should not be solved only by changing the vector database. A premature finishing problem may need tool-use constraints. A wandering problem may need step limits, planning structure, or escalation.
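The mapping from observable clue to failure source can be written down as an ordered rule list, which at least forces a team to say which signal it would check first. Everything in the sketch below is an assumption for illustration: the field names, the thresholds, and the rule order. In practice the clues would be computed from logged trajectories rather than filled in by hand.

```python
from dataclasses import dataclass


@dataclass
class Clues:
    diverges_at_first_search: bool
    retrieval_overlap: float            # 0..1, overlap of retrieved documents across runs
    same_query_different_docs: bool
    same_evidence_different_conclusions: bool
    mean_steps: float
    evidence_coverage: float            # 0..1, share of needed evidence actually retrieved
    step_variance_ratio: float


def diagnose(c, low_overlap=0.5, short=3, long=6, low_coverage=0.5, high_variance=0.3):
    """Return the first failure source whose clue pattern matches (assumed thresholds)."""
    if c.diverges_at_first_search and c.retrieval_overlap < low_overlap:
        return "bad retrieval query"
    if c.same_query_different_docs:
        return "weak evidence selection"
    if c.same_evidence_different_conclusions:
        return "reasoning instability"
    if c.mean_steps <= short and c.evidence_coverage < low_coverage:
        return "premature finishing"
    if c.mean_steps >= long and c.step_variance_ratio > high_variance:
        return "wandering"
    return "unclassified"


# Invented example: runs split at the first search and retrieve different documents.
example = Clues(True, 0.2, False, False, 4.0, 0.8, 0.1)
print(diagnose(example))
```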
The paper’s most valuable business implication is therefore not “run agents multiple times.” That is a tactic. The deeper implication is that repeated trajectories let teams distinguish failure modes that look identical at the final-answer layer.
Two wrong answers may be wrong in different ways. One may follow the same bad path every time, suggesting a systematic retrieval or knowledge-base issue. Another may vary wildly across runs, suggesting uncertainty or underspecified routing. The remediation should differ.
Where the evidence stops
The limitations are not decorative. They define how far the result can travel.
The study uses one benchmark: HotpotQA in the distractor setting. The sample is 100 hard questions. The agent has only three tools. Search is lexical over the provided context. The main experiment uses three models. The temperature ablation uses a smaller 20-question subset. Correctness is measured through fuzzy string matching, which is reasonable for the benchmark but not equivalent to business-grade correctness in legal, financial, medical, or operational settings.
The paper also studies a retrieval-style question-answering environment. That is relevant to many enterprise workflows, but it is not the same as coding agents, spreadsheet agents, autonomous procurement agents, web-navigation agents, or multi-agent planning systems. In those settings, action spaces are larger, states are more persistent, and errors can compound through external side effects.
That said, the limitation cuts both ways. The experiment observes substantial behavioral variance in a small, controlled action space. A real enterprise agent with more tools may not become magically more stable. More tools can mean more opportunities to diverge. Of course, better architecture can also impose structure: planners, typed tool schemas, retrieval validation, constrained decoding, deterministic routers, and human-in-the-loop gates can all reduce variance.
So the correct business conclusion is not universal pessimism. It is measurement discipline. Before claiming that an agent is reliable, measure whether it behaves consistently under repeated execution. Then inspect where it diverges.
The article-worthy lesson: instability is a signal, not just a nuisance
The paper is useful because it changes the diagnostic question.
Instead of asking only:
Did the agent produce the right answer?
we should also ask:
Did the agent reach that answer through a stable and inspectable path?
That second question is not academic housekeeping. It affects deployment. It affects review cost. It affects whether failures can be debugged. It affects whether a model that looks strong in a demo can survive routine business operations.
For Cognaptus-style automation, the natural framework is:
- Measure trajectory stability during evaluation, not after incidents.
- Identify first divergence points to locate unstable workflow decisions.
- Treat early tool-choice variance as a risk signal, especially in retrieval-heavy workflows.
- Use repeated runs selectively, where the cost of failure justifies the diagnostic expense.
- Separate answer agreement from evidence-path agreement, because some tasks need one more than the other.
- Tune architecture, not only temperature, because sampling noise is only part of the problem.
The slightly sarcastic version: if your agent cannot agree with itself on where to look first, perhaps do not let it approve refunds, rewrite contracts, or operate a trading workflow without supervision. It may still be useful. It may even be very useful. But useful and autonomous are not synonyms, no matter how many product pages imply otherwise.
The more serious version: behavioral consistency gives teams a practical early-warning signal. It does not replace correctness evaluation. It does not prove truth. It does not eliminate the need for human judgment in high-stakes contexts. But it gives builders one more observable layer between raw model output and business consequence.
That layer is where reliable agent systems will be built: not in the poetry of “autonomy,” but in the boring, measurable discipline of watching what the agent actually does.
[1] Aman Mehta, “When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents,” arXiv:2602.11619, 2026.