Opening — Why This Matters Now
We are entering the age of agentic AI. Not chatbots. Not autocomplete on steroids. Agents that search, retrieve, execute, and decide.
And here is the uncomfortable question:
If you run the same LLM agent on the same task twice — do you get the same behavior?
According to the recent empirical study “When Agents Disagree With Themselves: Measuring Behavioral Consistency in LLM-Based Agents” (arXiv:2602.11619v1), the answer is often no.
Across 3,000 runs on HotpotQA using ReAct-style agents, models produced between 2.0 and 4.2 distinct action trajectories per 10 identical executions. In other words, even when inputs are fixed, behavior is not.
More interestingly — and more importantly — that inconsistency strongly predicts failure.
For businesses deploying AI agents into workflows, this is not a philosophical curiosity. It is an operational risk.
Background — From Accuracy to Behavioral Stability
Most evaluation frameworks ask one question: Did the agent get the final answer right?
But LLM agents are not single-shot predictors. They are multi-step systems:
- Generate reasoning
- Select a tool
- Execute
- Observe
- Repeat
Every step introduces branching.
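To make that branching concrete, here is a minimal sketch of a ReAct-style loop. The `llm` callable and `tools` registry are hypothetical stand-ins, not the paper's implementation:

```python
# Minimal ReAct-style loop (illustrative sketch, not the paper's code).
# `llm` and `tools` are hypothetical: a model client and a tool registry.

def run_agent(question: str, llm, tools: dict, max_steps: int = 10) -> list:
    trajectory = []          # records (thought, action, argument, observation) per step
    context = question
    for _ in range(max_steps):
        thought, action, arg = llm(context)       # generate reasoning and select a tool
        if action == "Finish":                    # terminal action ends the episode
            trajectory.append((thought, action, arg, None))
            break
        observation = tools[action](arg)          # execute the selected tool
        trajectory.append((thought, action, arg, observation))
        context += f"\n{thought}\n{action}({arg}) -> {observation}"  # observe, then repeat
    return trajectory
```

Every pass through the loop samples a fresh thought and action, so each step is a branch point. Two runs on an identical `question` can peel apart at any iteration.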
Prior work like τ-bench measured outcome variance across repeated trials (“pass@k”). This paper goes further. Instead of only asking “How often do agents disagree?”, it asks:
- Where does divergence begin?
- Does divergence correlate with correctness?
- Can consistency serve as a runtime reliability signal?
That shift — from output evaluation to trajectory analysis — is subtle but consequential.
Methodology — Measuring Behavioral Consistency
The researchers evaluated three major models in a controlled ReAct setup:
| Model | Provider | Type |
|---|---|---|
| Claude Sonnet 4.5 | Anthropic | Closed-source |
| GPT-4o | OpenAI | Closed-source |
| Llama 3.1 70B | Meta | Open-source |
Experimental Design
- 100 “hard” HotpotQA questions
- 10 runs per question per model
- Temperature = 0.7 (main experiments)
- Total runs = 3,000
Tools Available to the Agent
- Search(query)
- Retrieve(title)
- Finish(answer)
Minimal toolset. Controlled environment. No web browsing chaos.
And still — substantial behavioral divergence.
Findings — Consistency Predicts Correctness
1. Behavioral Variance Is Real
Average unique action sequences per 10 runs:
| Model | Accuracy | Unique Seqs | Step Variance |
|---|---|---|---|
| Claude Sonnet 4.5 | 81.9% | 2.0 | 18.1% |
| GPT-4o | 74.0% | 2.4 | 28.1% |
| Llama 3.1 70B | 77.4% | 4.2 | 55.0% |
The open-source model exhibited more trajectory diversity. The most accurate model was also the most behaviorally stable.
That is not coincidence.
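A number like "2.0 unique sequences per 10 runs" is easy to reproduce from your own logs. A minimal sketch, assuming each question's repeated runs are stored as lists of (tool, argument) actions; the paper's exact metric definitions may differ:

```python
from statistics import mean

def unique_sequences(runs: list) -> int:
    """Count distinct action trajectories among repeated runs of one question."""
    return len({tuple(run) for run in runs})

def average_trajectory_diversity(runs_by_question: dict) -> float:
    """Average the per-question count over a benchmark (100 questions x 10 runs in the study)."""
    return mean(unique_sequences(runs) for runs in runs_by_question.values())
```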
2. Consistency Gap = 32–55 Percentage Points
Tasks were grouped by trajectory diversity:
- Consistent tasks: ≤2 unique sequences
- Highly inconsistent tasks: ≥6 unique sequences
Accuracy comparison:
| Model | Consistent | Inconsistent | Gap |
|---|---|---|---|
| Claude Sonnet 4.5 | 84.8% | 43.3% | 41.5pp |
| GPT-4o | 80.1% | 25.0% | 55.1pp |
| Llama 3.1 70B | 92.0% | 60.0% | 32.0pp |
Across models, inconsistent behavior predicted dramatically lower correctness.
This is the key insight:
Behavioral consistency is not cosmetic — it is predictive.
For production systems, this suggests a simple intervention:
Run multiple parallel executions. If early trajectories diverge, flag the task for retry or human review.
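Here is a minimal sketch of that intervention, assuming `agent(question)` returns a trajectory as a list of (tool, argument) actions; the sampling count and prefix length are illustrative defaults, not values from the paper:

```python
from concurrent.futures import ThreadPoolExecutor

def run_with_consistency_check(agent, question, k: int = 3, prefix_len: int = 2) -> dict:
    """Run `agent(question)` k times; escalate if the early trajectory prefixes disagree."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        runs = list(pool.map(lambda _: agent(question), range(k)))

    # Compare only the first few actions: the study found most divergence at step 2.
    prefixes = {tuple(run[:prefix_len]) for run in runs}

    if len(prefixes) == 1:
        return {"status": "ok", "runs": runs}
    return {"status": "needs_review", "runs": runs}   # retry, re-prompt, or route to a human
```

A few extra runs per task are not free, but they turn an invisible failure mode into an explicit routing decision.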
3. Divergence Happens Early (Step 2 Bottleneck)
69% of divergence occurred at Step 2 — the first search query.
That means the first tool invocation largely determines the trajectory.
If step 2 differs, downstream reasoning branches.
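Locating that first branch point in your own logs is cheap. A sketch, again assuming trajectories are lists of (tool, argument) actions:

```python
def first_divergence_step(trajectories: list) -> int | None:
    """Return the 1-based step at which repeated runs first disagree, or None if they never do."""
    longest = max(len(t) for t in trajectories)
    for step in range(longest):
        actions_at_step = {t[step] if step < len(t) else None for t in trajectories}
        if len(actions_at_step) > 1:
            return step + 1
    return None

# Illustrative example: two runs that agree on the first action but retrieve different pages.
runs = [
    [("Search", "bridge question entity"), ("Retrieve", "Page A"), ("Finish", "x")],
    [("Search", "bridge question entity"), ("Retrieve", "Page B"), ("Finish", "x")],
]
print(first_divergence_step(runs))  # -> 2
```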
Operational implication:
- Query formulation quality matters disproportionately.
- Improving retrieval or constraining early action space may stabilize the entire system.
Early uncertainty compounds.
4. Path Length as a Reliability Signal
Longer trajectories correlated negatively with accuracy:
$$ r = -0.34 $$
Examples (Llama 3.1 70B):
- 1 unique sequence → 3.4 steps avg → 85.7% correct
- ≥9 sequences → 7.8 steps avg → 43% correct
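The correlation itself takes one line to reproduce on your own run logs. A sketch with made-up numbers, purely for illustration:

```python
import numpy as np

# Per-run records: steps taken and whether the final answer was correct (1/0).
# These values are illustrative, not the paper's data.
steps   = np.array([3, 4, 3, 7, 8, 5, 6, 3, 9, 4])
correct = np.array([1, 1, 1, 0, 0, 1, 0, 1, 0, 1])

r = np.corrcoef(steps, correct)[0, 1]   # Pearson correlation between path length and correctness
print(f"r = {r:.2f}")                   # negative r: longer paths tend to end in wrong answers
```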
Interpretation:
Long paths mean the agent is searching, backtracking, and uncertain.
Every additional step is another opportunity to diverge.
In complex enterprise workflows — with dozens of tools — this combinatorial explosion becomes a real reliability challenge.
5. Temperature Is a Lever — But Not a Cure
Results for Llama 3.1 70B, the most variable model in the study:

| Temperature | Accuracy | Unique Seqs |
|---|---|---|
| 0.7 | 77.4% | 4.2 |
| 0.0 | 82.8% | 2.2 |
Lowering temperature improved both consistency and accuracy.
However, even at temperature 0.0, divergence persisted.
This means:
- Inconsistency is not just sampling noise.
- Structural uncertainty in reasoning remains.
Temperature tuning helps — but architecture matters more.
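In most chat-completion APIs, temperature is a single request parameter. A sketch with the OpenAI Python client; the model name and question are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": "Which magazine was started first, Arthur's Magazine or First for Women?"}],
    temperature=0.0,  # near-greedy decoding: fewer unique trajectories, but not zero
)
print(response.choices[0].message.content)
```

Providers generally do not guarantee determinism even at 0.0, which is consistent with the study's finding that divergence persists.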
Implications — What This Means for Real-World AI Deployment
1. Introduce Consistency Monitoring
Instead of evaluating only final answers, measure:
- Number of unique trajectories
- First divergence point
- Step count variance
These can serve as runtime health metrics.
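One possible shape for those health metrics at runtime is a logged, thresholded report. A sketch with thresholds chosen for illustration (the study's "highly inconsistent" bucket started at 6 unique sequences):

```python
import logging

logger = logging.getLogger("agent.observability")

# Illustrative thresholds; tune them to your own task distribution.
MAX_UNIQUE_TRAJECTORIES = 2
MAX_STEP_VARIANCE = 4.0

def report_trajectory_health(task_id: str, unique_trajectories: int,
                             first_divergence_step: int | None,
                             step_count_variance: float) -> None:
    """Emit behavioral-observability metrics for one task's repeated runs."""
    logger.info(
        "task=%s unique_trajectories=%d first_divergence_step=%s step_count_variance=%.2f",
        task_id, unique_trajectories, first_divergence_step, step_count_variance,
    )
    if unique_trajectories > MAX_UNIQUE_TRAJECTORIES or step_count_variance > MAX_STEP_VARIANCE:
        logger.warning("task=%s flagged: behavioral inconsistency above threshold", task_id)
```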
Think of it as behavioral observability for AI agents.
2. Select Models by Stability, Not Just Benchmarks
The most capable model in the study was also the most consistent.
If your application requires high reliability (finance, healthcare, compliance), behavioral stability may be more valuable than raw benchmark accuracy.
3. Invest in Early-Step Quality
Since 69% of divergence happens at step 2:
- Improve search prompts
- Use query expansion
- Integrate learned retrievers
- Add guardrails around initial tool calls
Small improvements at the beginning have outsized impact.
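As one concrete version of "guardrails around initial tool calls", here is a sketch that validates the agent's first action before executing it. The allowed tools and limits are illustrative, not from the paper:

```python
ALLOWED_FIRST_TOOLS = {"Search"}   # in the study's setup, a sensible run starts with a search
MAX_QUERY_TOKENS = 20              # illustrative bound on query length

def validate_first_action(tool: str, argument: str) -> str | None:
    """Return a rejection reason for the first tool call, or None if it looks sane."""
    if tool not in ALLOWED_FIRST_TOOLS:
        return f"first action must be one of {ALLOWED_FIRST_TOOLS}, got {tool!r}"
    if not argument.strip():
        return "empty search query"
    if len(argument.split()) > MAX_QUERY_TOKENS:
        return "query too long; the agent may have pasted reasoning into the search box"
    return None
```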
4. Complexity Multiplies Variance
This study used only three tools.
Enterprise agents may have:
- API access
- Databases
- File systems
- External web browsing
- Multi-agent communication
Each additional branch multiplies divergence risk.
Consistency challenges grow combinatorially.
Ignoring that would be optimistic.
Broader Perspective — Capability and Stability
An intriguing pattern emerges:
Higher-performing models exhibited lower behavioral variance.
Is consistency a byproduct of capability? Or does architectural refinement inherently stabilize reasoning trajectories?
The paper cannot answer definitively — but the correlation is suggestive.
If so, the path to more reliable agents may not just be better guardrails.
It may be better models.
Conclusion — Reliability Is a Trajectory Property
We often evaluate agents by their final answers.
But this research shows that reliability is embedded in the path, not just the endpoint.
Agents that agree with themselves tend to be correct. Agents that disagree with themselves tend to fail.
That is a surprisingly human pattern.
For organizations deploying AI agents into real workflows, one principle becomes clear:
Monitor behavior, not just outcomes.
Because consistency is not a coincidence.
It is a signal.
Cognaptus: Automate the Present, Incubate the Future.