Opening — Why This Matters Now

We are entering the age of agentic AI. Not chatbots. Not autocomplete on steroids. Agents that search, retrieve, execute, and decide.

And here is the uncomfortable question:

If you run the same LLM agent on the same task twice — do you get the same behavior?

According to the recent empirical study “When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents” (arXiv:2602.11619v1), the answer is often no.

Across 3,000 runs on HotpotQA using ReAct-style agents, models produced between 2.0 and 4.2 distinct action trajectories per 10 identical executions. In other words, even when inputs are fixed, behavior is not.

More interestingly — and more importantly — that inconsistency strongly predicts failure.

For businesses deploying AI agents into workflows, this is not a philosophical curiosity. It is an operational risk.


Background — From Accuracy to Behavioral Stability

Most evaluation frameworks ask one question: Did the agent get the final answer right?

But LLM agents are not single-shot predictors. They are multi-step systems:

  • Generate reasoning
  • Select a tool
  • Execute
  • Observe
  • Repeat

Every step introduces branching.
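
In code, that loop is small; the branching is not. A minimal ReAct-style sketch (illustrative only; the function names, stopping condition, and prompt format are assumptions, not the paper's implementation):

```python
# Minimal ReAct-style agent loop (a sketch, not the paper's code).
# `llm` and `tools` are placeholders you would wire to a real model and real tools.

def react_agent(question, llm, tools, max_steps=10):
    """Run a reason-act-observe loop and return (answer, trajectory)."""
    trajectory = []                                  # (action, argument) recorded per step
    context = f"Question: {question}"
    for _ in range(max_steps):
        thought, action, arg = llm(context)          # generate reasoning, pick a tool
        trajectory.append((action, arg))
        if action == "Finish":                       # the agent commits to an answer
            return arg, trajectory
        observation = tools[action](arg)             # execute the chosen tool and observe
        context += f"\nThought: {thought}\nAction: {action}[{arg}]\nObservation: {observation}"
    return None, trajectory                          # step budget exhausted
```

Every call to `llm` is a sampling step, and every sampling step is a fork in the road.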

Prior work like τ-bench measured outcome variance across repeated trials (“pass@k”). This paper goes further. Instead of only asking “How often do agents disagree?”, it asks:

  • Where does divergence begin?
  • Does divergence correlate with correctness?
  • Can consistency serve as a runtime reliability signal?

That shift — from output evaluation to trajectory analysis — is subtle but consequential.


Methodology — Measuring Behavioral Consistency

The researchers evaluated three major models in a controlled ReAct setup:

| Model | Provider | Type |
|---|---|---|
| Claude Sonnet 4.5 | Anthropic | Closed-source |
| GPT-4o | OpenAI | Closed-source |
| Llama 3.1 70B | Meta | Open-source |

Experimental Design

  • 100 “hard” HotpotQA questions
  • 10 runs per question per model
  • Temperature = 0.7 (main experiments)
  • Total runs = 3,000

Tools Available to the Agent

  • Search(query)
  • Retrieve(title)
  • Finish(answer)

Minimal toolset. Controlled environment. No web browsing chaos.
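
For concreteness, that tool surface is roughly the following (the signatures come from the list above; the behavior descriptions are assumptions based on typical HotpotQA setups, and the paper's exact wiring may differ):

```python
# Stubbed tool surface matching the three tools above.

def search(query: str) -> str:
    """Return candidate Wikipedia page titles matching the query."""
    ...  # e.g., call a Wikipedia search endpoint

def retrieve(title: str) -> str:
    """Return the opening paragraphs of the page with this exact title."""
    ...  # e.g., fetch and truncate the article text

tools = {
    "Search": search,
    "Retrieve": retrieve,
    # "Finish" is handled by the agent loop itself: it ends the run with an answer.
}
```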

And still — substantial behavioral divergence.


Findings — Consistency Predicts Correctness

1. Behavioral Variance Is Real

Accuracy, average unique action sequences per 10 runs, and step variance by model:

| Model | Accuracy | Unique Seqs (per 10 runs) | Step Variance |
|---|---|---|---|
| Claude Sonnet 4.5 | 81.9% | 2.0 | 18.1% |
| GPT-4o | 74.0% | 2.4 | 28.1% |
| Llama 3.1 70B | 77.4% | 4.2 | 55.0% |

The open-source model exhibited the most trajectory diversity of the three. The most accurate model, Claude Sonnet 4.5, was also the most behaviorally stable.

That is not coincidence.
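
The two consistency columns are cheap to compute from logged trajectories. A hedged sketch (the paper's exact definition of step variance may differ; here it is taken as the coefficient of variation of step counts):

```python
from statistics import mean, pstdev

def consistency_metrics(trajectories):
    """trajectories: one task's repeated runs, each a list of (action, argument) pairs."""
    # Two runs count as the same behavior only if every (action, argument) matches in order.
    unique_seqs = len({tuple(t) for t in trajectories})

    # Step-count dispersion, expressed here as a coefficient of variation
    # (an assumed definition of "step variance", not necessarily the paper's).
    lengths = [len(t) for t in trajectories]
    step_variance = pstdev(lengths) / mean(lengths)

    return unique_seqs, step_variance

# Example: 10 runs, 7 following one path and 3 following another.
runs = [[("Search", "q1"), ("Finish", "a")]] * 7 + \
       [[("Search", "q2"), ("Retrieve", "t"), ("Finish", "a")]] * 3
print(consistency_metrics(runs))   # -> (2, ~0.20)
```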


2. Consistency Gap = 32–55 Percentage Points

Tasks were grouped by trajectory diversity:

  • Consistent tasks: ≤2 unique sequences
  • Highly inconsistent tasks: ≥6 unique sequences

Accuracy comparison:

| Model | Accuracy (consistent tasks) | Accuracy (inconsistent tasks) | Gap |
|---|---|---|---|
| Claude Sonnet 4.5 | 84.8% | 43.3% | 41.5 pp |
| GPT-4o | 80.1% | 25.0% | 55.1 pp |
| Llama 3.1 70B | 92.0% | 60.0% | 32.0 pp |

Across models, inconsistent behavior predicted dramatically lower correctness.

This is the key insight:

Behavioral consistency is not cosmetic — it is predictive.

For production systems, this suggests a simple intervention:

Run multiple parallel executions. If early trajectories diverge, flag the task for retry or human review.
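
A sketch of that gate (not from the paper; k, the prefix length, and the majority-vote fallback are all assumptions to tune for your own workload):

```python
# Run the agent k times on the same task, compare the first few actions,
# and escalate when they disagree. Illustrative only.

def consistency_gate(run_agent, task, k=3, prefix_len=2):
    """run_agent(task) -> (answer, trajectory). Returns (answer, needs_review)."""
    results = [run_agent(task) for _ in range(k)]              # could run in parallel
    prefixes = {tuple(traj[:prefix_len]) for _, traj in results}

    if len(prefixes) > 1:
        return None, True                                      # early divergence: retry or human review

    answers = [ans for ans, _ in results]
    best = max(set(answers), key=answers.count)                # simple majority vote
    return best, False
```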


3. Divergence Happens Early (Step 2 Bottleneck)

69% of divergence occurred at Step 2 — the first search query.

That means the first tool invocation largely determines the trajectory.

If step 2 differs, downstream reasoning branches.

Operational implication:

  • Query formulation quality matters disproportionately.
  • Improving retrieval or constraining early action space may stabilize the entire system.

Early uncertainty compounds.


4. Path Length as a Reliability Signal

Longer trajectories correlated negatively with accuracy:

$$ r = -0.34 $$

Examples (Llama 3.1 70B):

  • 1 unique sequence → 3.4 steps avg → 85.7% correct
  • ≥9 sequences → 7.8 steps avg → 43% correct

Interpretation:

Long paths mean the agent is searching, backtracking, and uncertain.

Every additional step is another opportunity to diverge.

In complex enterprise workflows — with dozens of tools — this combinatorial explosion becomes a real reliability challenge.
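
If you already log step counts and outcomes per run, the same correlation check is easy to reproduce on your own traces. A sketch (field names and values are illustrative, not the paper's schema):

```python
from statistics import correlation   # Python 3.10+

# One record per agent run; numbers here are made up for illustration.
runs = [
    {"steps": 3, "correct": 1},
    {"steps": 4, "correct": 1},
    {"steps": 7, "correct": 0},
    {"steps": 8, "correct": 0},
]

r = correlation([run["steps"] for run in runs],
                [run["correct"] for run in runs])
print(f"Pearson r between path length and correctness: {r:.2f}")
```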


5. Temperature Is a Lever — But Not a Cure

| Temperature | Accuracy | Unique Seqs (per 10 runs) |
|---|---|---|
| 0.7 | 77.4% | 4.2 |
| 0.0 | 82.8% | 2.2 |

Lowering temperature improved both consistency and accuracy.

However, even at temperature 0.0, divergence persisted.

This means:

  • Inconsistency is not just sampling noise.
  • Structural uncertainty in reasoning remains.

Temperature tuning helps — but architecture matters more.


Implications — What This Means for Real-World AI Deployment

1. Introduce Consistency Monitoring

Instead of evaluating only final answers, measure:

  • Number of unique trajectories
  • First divergence point
  • Step count variance

These can serve as runtime health metrics.

Think of it as behavioral observability for AI agents.
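
The first and third metrics are the ones sketched earlier under the findings; the first divergence point takes only a few more lines (again an illustrative sketch, not the paper's code):

```python
def first_divergence_step(trajectories):
    """Return the 1-indexed step at which repeated runs of a task first disagree,
    or None if all recorded trajectories are identical."""
    max_len = max(len(t) for t in trajectories)
    for step in range(max_len):
        # Action taken at this step in each run (None if that run has already finished).
        actions = {t[step] if step < len(t) else None for t in trajectories}
        if len(actions) > 1:
            return step + 1
    return None
```

Tracked over time, a first-divergence histogram that piles up at step 2 points straight at query formulation, which is exactly the bottleneck the paper reports.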


2. Select Models by Stability, Not Just Benchmarks

The most capable model in the study was also the most consistent.

If your application requires high reliability (finance, healthcare, compliance), behavioral stability may be more valuable than raw benchmark accuracy.


3. Invest in Early-Step Quality

Since 69% of divergence happens at step 2:

  • Improve search prompts
  • Use query expansion
  • Integrate learned retrievers
  • Add guardrails around initial tool calls

Small improvements at the beginning have outsized impact.
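
The last item can be as simple as validating and canonicalizing the first tool call before it executes. One possible shape (a design assumption for illustration, not something the paper implements):

```python
# Guardrail for the agent's first tool call: constrain the action type and
# normalize the query so trivially different phrasings collapse to one behavior.

def guard_first_action(action: str, arg: str):
    """Validate and normalize the first (action, argument) pair; raise to force a retry."""
    if action != "Search":
        raise ValueError(f"First action must be Search, got {action!r}")

    query = " ".join(arg.split()).rstrip("?").lower()    # collapse whitespace, drop trailing '?', lowercase
    return action, query[:120]                           # keep queries short and focused
```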


4. Complexity Multiplies Variance

This study used only three tools.

Enterprise agents may have:

  • API access
  • Databases
  • File systems
  • External web browsing
  • Multi-agent communication

Each additional branch multiplies divergence risk.

Consistency challenges grow combinatorially.
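
A rough back-of-envelope bound makes the scaling concrete (an illustration, not a number from the paper): with b candidate actions per step and trajectories of n steps, the space of possible action skeletons is at most

$$ b^{\,n}: \qquad 3^{4} = 81 \quad \text{versus} \quad 10^{8} = 100{,}000{,}000 $$

That is three tools over a short four-step path versus ten tools over an eight-step path. Real trajectories are far more constrained than this worst case, but the direction of the scaling is the point.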

Ignoring that would be optimistic.


Broader Perspective — Capability and Stability

An intriguing pattern emerges:

Higher-performing models exhibited lower behavioral variance.

Is consistency a byproduct of capability? Or does architectural refinement inherently stabilize reasoning trajectories?

The paper cannot answer definitively — but the correlation is suggestive.

If so, the path to more reliable agents may not just be better guardrails.

It may be better models.


Conclusion — Reliability Is a Trajectory Property

We often evaluate agents by their final answers.

But this research shows that reliability is embedded in the path, not just the endpoint.

Agents that agree with themselves tend to be correct. Agents that disagree with themselves tend to fail.

That is a surprisingly human pattern.

For organizations deploying AI agents into real workflows, one principle becomes clear:

Monitor behavior, not just outcomes.

Because consistency is not a coincidence.

It is a signal.


Cognaptus: Automate the Present, Incubate the Future.