When Agents Loop: Geometry, Drift, and the Hidden Physics of LLM Behavior

Agents are rarely dangerous because they answer once.

They become interesting, and occasionally annoying, when they loop.

A customer-support agent drafts a reply, critiques it, revises it, checks policy, rewrites the tone, and sends the result back into another reasoning step. A research agent summarizes papers, updates its plan, searches again, and revises its own assumptions. A coding agent edits a file, reads the error, patches the patch, and keeps going until either the tests pass or the repository looks like an archaeological site.

Most evaluation still asks a simple endpoint question: did the final answer work?

That is necessary. It is also incomplete. A loop is not just a final answer with extra steps. It is a trajectory. The important question is not only where the agent ends, but how it moves: does it settle, drift, oscillate, or wander away from the task while sounding increasingly confident about its vacation?

Nicolas Tacheny’s paper, “Geometric Dynamics of Agentic Loops in Large Language Models,” gives this movement a vocabulary and a measurement protocol.¹ The paper treats recursive LLM processes as dynamical systems in semantic space. In plain language: each generated text is embedded as a vector, each iteration becomes a point in a trajectory, and the loop’s behavior can be inspected geometrically rather than judged only by the final output.

The useful part is not the physics metaphor. Metaphors are cheap; dashboards are not. The useful part is the comparison. The same local model, the same initial sentence, the same 50-step recursive setup, and two different prompts produce two sharply different regimes. One prompt makes the loop contract toward a stable semantic region. The other makes it drift through disconnected ideas with no detected attractor.

So the uncomfortable lesson is simple: agent behavior is not just a property of the base model. It is also a property of the loop operator we build around it. The model is the engine. The prompt loop is the steering geometry. Pretending otherwise is how teams end up calling instability “emergence,” which sounds much better in slides than in production logs.

The paper turns agent loops into trajectories, not anecdotes

The paper begins from a gap in how iterative LLM systems are usually evaluated. Self-refinement, chain-of-thought workflows, ReAct-style agents, Reflexion-like improvement loops, and multi-agent systems all involve repeated transformations. An output becomes the next input. Yet much of the evaluation culture focuses on the endpoint: final answer accuracy, task success, reward, or user satisfaction.

That misses the temporal structure.

A recursive loop can improve for five steps and then degrade. It can stabilize around a useful answer. It can stabilize around a wrong answer. It can oscillate between two framings. It can drift farther from the original task while preserving fluent language. Endpoint metrics can see some of this after the damage is done. Trajectory metrics can see the motion while it is happening.

Tacheny formalizes an agentic loop as a recursive transformation:

$$ a_{t+1} = F(a_t) $$

Here, $a_t$ is the textual artifact at step $t$, and $F$ is the transformation induced by the LLM plus the prompt template. Since text itself is discrete and difficult to measure geometrically, each artifact is projected into an embedding space:

$$ e_t = \psi(a_t) $$

The result is a semantic trajectory: a sequence of embedded states $\{e_0, e_1, \dots, e_T\}$. Once the loop becomes a trajectory, we can measure how it moves.
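A minimal sketch of that recursion, assuming toy stand-ins: `toy_operator` plays the role of $F$ (a real one would call an LLM with a prompt template) and `toy_embedding` plays the role of $\psi$ (a real one would call an embedding model). Both names are hypothetical and exist only so the loop runs without any external service.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]

def run_loop(a0: str,
             F: Callable[[str], str],
             psi: Callable[[str], Vector],
             T: int) -> Tuple[List[str], List[Vector]]:
    """Iterate a_{t+1} = F(a_t) for T steps and embed each artifact, e_t = psi(a_t)."""
    artifacts = [a0]
    for _ in range(T):
        artifacts.append(F(artifacts[-1]))
    trajectory = [psi(a) for a in artifacts]
    return artifacts, trajectory

def toy_operator(text: str) -> str:
    """Hypothetical F: appends a word instead of calling an LLM."""
    return text + " indeed"

def toy_embedding(text: str) -> Vector:
    """Hypothetical psi: unit-length letter-frequency vector (26 dimensions)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

artifacts, trajectory = run_loop("Music connects people.", toy_operator, toy_embedding, T=5)
```

The loop returns T + 1 artifacts and T + 1 embedded states, because the initial text $a_0$ is part of the trajectory.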

The paper then defines operational versions of several concepts:

Concept	Meaning in the paper	Operational use
Local similarity	Similarity between consecutive iterations, $\tilde{s}(e_t, e_{t-1})$	Measures whether each step is a small adjustment or a large semantic jump
Global similarity	Similarity between current state and initial state, $\tilde{s}(e_t, e_0)$	Measures whether the loop stays near the starting meaning or drifts away
Dispersion	Maximum deviation from a cluster’s semantic center	Measures how tightly a set of iterations is concentrated
Cluster	A contiguous window of semantically coherent states	Detects stable phases in the loop
Attractor	The center of a stable cluster	Represents the region the loop appears to settle around
Regime	Contractive, oscillatory, or exploratory behavior	Classifies the loop’s overall movement pattern

The paper uses calibrated semantic similarity rather than raw cosine similarity. The calibration is meant to make similarity scores more interpretable and better aligned with human judgments. The high-confidence similarity threshold used in the experiments is $\tilde{\tau}_{HCS} \approx 0.65$. The paper also argues that its qualitative conclusions depend mainly on similarity ordering, so the framework is not supposed to live or die by one exact numeric calibration.

That is a reasonable design choice. In production terms, it means the paper is less about a sacred threshold and more about the shape of the movement: tight, loose, recurring, or unstable.
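To make the two similarity series concrete, here is a small sketch that computes local and global similarity over a trajectory. It uses raw cosine similarity as a stand-in for the paper's calibrated similarity $\tilde{s}$, which is an assumption: the calibration step itself is not reproduced here, so the numbers are not comparable to the paper's thresholds.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Raw cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def local_similarity(trajectory: Sequence[Sequence[float]]) -> List[float]:
    """Similarity between each state and its predecessor: s(e_t, e_{t-1}) for t >= 1."""
    return [cosine(trajectory[t], trajectory[t - 1]) for t in range(1, len(trajectory))]

def global_similarity(trajectory: Sequence[Sequence[float]]) -> List[float]:
    """Similarity between each state and the initial state: s(e_t, e_0)."""
    return [cosine(e, trajectory[0]) for e in trajectory]

# A three-state toy trajectory that rotates away from its starting direction.
trajectory = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
local = local_similarity(trajectory)
glob = global_similarity(trajectory)
```

In this toy case each step is a moderate move (both local values are equal), while the global series falls from 1.0 to 0.0. That is the signature of steady drift: no single step looks alarming, but the endpoint has left the starting meaning entirely.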

Same model, same sentence, different loop physics

The experiment is intentionally minimal. This is not a benchmark of complex reasoning agents. It is closer to a controlled laboratory setup: isolate one recursive operation and observe its trajectory.

The model is deepseek-r1:8b, run locally through Ollama, with temperature set to 0.8. Each loop runs for $T = 50$ iterations. Both experiments start from the same sentence:

“Music has the power to connect people across cultures and generations.”

The two prompts are designed as archetypal opposites.

The contractive prompt asks the model to rewrite the sentence so it sounds slightly more natural and fluent while preserving the meaning exactly. This is a stabilization-friendly instruction: improve style, do not change substance.

The exploratory prompt asks the model to summarize the current text in one sentence and then negate its main idea completely in an abstract way. This is a destabilization-friendly instruction: identify meaning, then invert it.

This experimental design matters because it removes an easy excuse. If one loop stabilizes and the other drifts, we cannot simply say, “That is because the model is unstable.” The model is held constant. The recursive operator changes.

That is the paper’s business-relevant nerve.

Many teams still discuss model behavior as if it were mostly model selection: GPT versus Claude versus Gemini versus an open-source model running in a box under someone’s desk. Model choice matters, obviously. But once a model is placed inside a recursive workflow, the loop design becomes part of the system’s behavior. A prompt is no longer just an instruction. Repeated over time, it becomes a transformation rule.

A single prompt is a sentence. A recursive prompt is a machine.

The paraphrase loop settles into an attractor

The contractive loop behaves as the name suggests. Its local similarity remains high, in the range of roughly 0.82 to 0.95. In ordinary terms, each step changes the sentence only modestly. The loop is not making wild conceptual jumps; it is polishing.

The global similarity to the original sentence stabilizes near 0.75 after the initial iterations. This is important. The trajectory does not remain identical to the starting sentence, but it does remain bounded. It moves away enough to reformulate the sentence, then stops wandering.

The appendix sequence makes the movement intuitive. The initial sentence about music connecting people across cultures and generations becomes a series of increasingly polished variations. Over time, the wording changes, but the semantic core remains: music unifies people across cultural, geographical, and temporal boundaries. By the later iterations, the sentence is mostly circling the same basin of meaning with stylistic variations.

The cluster analysis gives this observation a formal shape. Under the intermediate clustering configuration $(\lambda, \rho, \kappa) = (0.8, 0.2, 2)$, the algorithm detects two persistent clusters:

Cluster	Iteration range	Dispersion	Outliers	Interpretation
Cluster 1	0–18	0.1773	6	Initial transient refinement phase
Cluster 2	21–50	0.1500	9	Final convergence basin

The paper treats this as evidence of a contractive regime. The loop first explores nearby paraphrases, then settles into a more stable region. The second cluster is longer and tighter, suggesting that the recursive operator has found a semantic basin where further iterations do not substantially alter meaning.

The robustness check is also worth reading correctly. Appendix B tests the same contractive trajectory under multiple clustering resolutions. At a finer radius, $\rho = 0.1$, five micro-clusters appear. At a coarser radius, $\rho = 0.3$, a single cluster covers the trajectory. This is not a second thesis. It is a sensitivity test: the number of clusters changes with resolution, but the qualitative conclusion remains stable. The loop shows temporal persistence, high inter-cluster similarity, and decreasing dispersion.

That distinction matters. A careless summary might say: “The loop has five clusters, or two clusters, or one cluster.” A better reading says: the exact granularity depends on the clustering lens, but all lenses show bounded, coherent movement. The result is not “two clusters.” The result is contraction.

The negation loop keeps moving because the prompt keeps kicking it

The exploratory loop is the contrast case. The local similarity between consecutive embeddings fluctuates widely, mostly between about 0.2 and 0.6, sometimes dropping close to zero. The global similarity to the starting sentence remains mostly below 0.2.

This is not polishing. This is semantic ricochet.

The appendix sequence makes the behavior almost comic. The music sentence becomes a claim about the impossibility of transcendent musical connection. Then it becomes a claim about interconnected resonances. Then coherent thought is declared an illusion. Then consciousness appears. Then reality becomes mundane, then elusive, then objective, then subjective, then fixed, then malleable. The loop does not merely negate the initial sentence; it repeatedly resets the conceptual frame.

The clustering result is blunt: no valid cluster is detected under any tested configuration. The loop does not stay in one semantic neighborhood long enough to satisfy the cluster conditions. It has grammatical coherence, but no geometric stability.

This is where the comparison earns its keep.

The contractive loop and exploratory loop are not two models. They are two recursive prompt-defined operators. The paraphrase operator applies a small semantic force: preserve meaning, improve expression. The summarize-then-negate operator applies a disruptive force: compress the current idea, then invert it abstractly. Over 50 iterations, those forces accumulate into distinct regimes.

Test	Likely purpose	What it supports	What it does not prove
Same initial sentence across both loops	Controlled comparison	Differences are not due to starting content	Results generalize to all starting texts
Same model and local inference setup	Operator isolation	Prompt-defined transformation strongly affects regime	Base model never matters
50-step paraphrase loop	Main evidence for contraction	Stable local similarity, bounded drift, detected attractors	Infinite-horizon stability
50-step negation loop	Main evidence for exploration	Low local/global similarity, no clusters	All creative prompts behave this way
Multi-resolution cluster analysis	Robustness/sensitivity test	Contractive conclusion is not an artifact of one $\rho$	Exact cluster count is universal
Discussion of composite loops	Exploratory extension	Suggests future exploration–consolidation designs	Validated production architecture

The paper’s Table 2 summarizes the contrast compactly: the contractive loop has mean local similarity above 0.85, global similarity around 0.75, detected clusters, and a contractive regime; the exploratory loop has local similarity below 0.5, global similarity below 0.2, no clusters, and an exploratory regime.
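A rough sketch of how the three regimes could be told apart once cluster centers are available: no clusters reads as exploratory, a cluster that closely resembles a non-adjacent earlier one reads as oscillatory, and anything else reads as contractive. The function name, the 0.9 recurrence threshold, and the non-adjacency rule are assumptions for illustration, not the paper's classification procedure.

```python
import math
from typing import Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot / (nu * nv)

def classify_regime(cluster_centers: Sequence[Sequence[float]],
                    recurrence_threshold: float = 0.9) -> str:
    """Label a trajectory from its cluster centers, given in temporal order."""
    if not cluster_centers:
        return "exploratory"  # the loop never settled anywhere
    for i in range(len(cluster_centers)):
        # Skip the immediate successor: adjacent clusters that look alike
        # indicate gradual settling, not a return to an earlier position.
        for j in range(i + 2, len(cluster_centers)):
            if cosine(cluster_centers[i], cluster_centers[j]) >= recurrence_threshold:
                return "oscillatory"
    return "contractive"

labels = [
    classify_regime([]),
    classify_regime([[1.0, 0.0], [0.9, 0.1]]),
    classify_regime([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]),
]
```

Two similar adjacent clusters, as in the paraphrase loop, are treated as one basin reached in stages rather than as an oscillation.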

The business translation is not “use paraphrasing prompts.” That would be a very small lesson, and not even a particularly interesting one. The stronger point is that recursive systems should be monitored as moving systems. If the loop is supposed to refine, contraction is healthy. If it is supposed to brainstorm, some expansion is healthy. If it is supposed to investigate while preserving task context, unchecked drift is not creativity. It is a slow-motion escape.

Stability is not always good; drift is not always bad

A bad reading of this paper would turn “contractive” into “good” and “exploratory” into “bad.”

That is too simple. It also sounds like something a dashboard vendor would put in a white paper five minutes before adding a traffic-light widget.

Contraction is useful when the task requires stabilization: finalizing a customer reply, cleaning a legal clause, refining a policy answer, normalizing extracted data, or turning rough notes into a consistent executive summary. In these cases, semantic drift is usually a defect. The agent should converge toward a bounded answer, not keep discovering new philosophical positions about the nature of reality.

Exploration is useful when the task requires novelty: brainstorming product angles, generating hypotheses, searching conceptual alternatives, stress-testing assumptions, or deliberately reframing a strategy. In these cases, excessive contraction may be premature closure. A system that stabilizes too quickly can become fluent, consistent, and boring. In some organizations, this is called “brand safe.” Fine. But do not confuse it with thinking.

The operational question is not “Should agents converge?” The question is: “What regime should this loop be in right now?”

A useful agent may need both regimes at different stages:

Explore alternatives.
Select promising candidates.
Contract around a chosen direction.
Check for hidden drift.
Re-open exploration only when evidence demands it.

The paper itself points toward this idea in its discussion of composite loops. Alternating exploratory and contractive phases could, in principle, support controlled creativity: divergence for novelty, contraction for coherence. The paper is careful here. It does not empirically validate creativity-specific claims; it provides measurement infrastructure. That restraint is refreshing. One almost forgets this is AI research.

What businesses can actually use from this

The direct paper result is narrow: two singular recursive loops, one local model, one embedding setup, one initial sentence, 50 iterations, and two extreme prompts.

The business inference is broader but should remain disciplined: if recursive LLM workflows are becoming part of operations, then trajectory monitoring should become part of AI operations.

Most agent observability today focuses on logs, tool calls, latency, cost, errors, and final task success. Those are necessary. They do not fully answer whether the agent’s reasoning or generated content is semantically stabilizing, drifting, or cycling.

The paper suggests a semantic monitoring layer. For each recursive workflow, store the output at each step, embed it, and track a small set of indicators:

Indicator	Operational question	Possible trigger
Local similarity	Is each step making small refinements or large jumps?	Alert if a refinement loop suddenly makes large semantic moves
Global similarity	Is the agent still near the original task or source material?	Escalate if similarity to the task brief falls below a threshold
Dispersion	Are recent iterations concentrated or scattered?	Stop or reset if dispersion rises during supposed convergence
Cluster persistence	Is the loop settling into a stable basin?	Terminate early once stable convergence is detected
Absence of clusters	Is the loop failing to stabilize?	Switch prompt, reduce temperature, or route to human review
Recurrent clusters	Is the loop oscillating between positions?	Add tie-breaker criteria or change the decision policy
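The first two rows of the table can be sketched as a per-step check for a refinement loop. The function name, alert labels, and both default thresholds are hypothetical; raw cosine similarity again stands in for a calibrated score, so real thresholds would need tuning against a team's own data.

```python
import math
from typing import List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot / (nu * nv)

def check_refinement_step(task_vec: Sequence[float],
                          trajectory: Sequence[Sequence[float]],
                          min_local: float = 0.8,
                          min_global: float = 0.6) -> List[str]:
    """Return alerts for the latest step of a loop that is supposed to refine."""
    alerts = []
    # Local similarity: a refinement step should be a small adjustment.
    if len(trajectory) >= 2 and cosine(trajectory[-1], trajectory[-2]) < min_local:
        alerts.append("large_jump")
    # Global similarity: the latest output should stay near the task brief.
    if cosine(trajectory[-1], task_vec) < min_global:
        alerts.append("task_drift")
    return alerts

task = [1.0, 0.0]
healthy = check_refinement_step(task, [[1.0, 0.0], [0.98, 0.1]])
broken = check_refinement_step(task, [[1.0, 0.0], [0.0, 1.0]])
```

An empty list means the step looks like refinement. What each alert should trigger (stop, reset, escalate to a human) is a workflow decision, not something the geometry can settle.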

This is not a replacement for quality evaluation. A stable loop can converge to a wrong answer. A drifting loop can stumble into a useful idea. Geometry does not know truth. It knows movement.

That boundary is crucial. Semantic trajectory monitoring is best understood as a diagnostic layer, not a judge. It can tell a team that an agent is no longer behaving like a refinement system. It cannot tell whether the answer is legally correct, medically safe, financially compliant, or strategically wise.

For business deployment, the ROI relevance is mostly in cheaper diagnosis and earlier intervention. Instead of waiting for a final bad answer, teams can detect when a loop’s behavior stops matching its intended role.

Workflow type	Desired regime	What trajectory monitoring catches
Customer response refinement	Contractive	The answer starts drifting from the customer issue
Contract clause simplification	Contractive with high source similarity	The agent changes legal meaning while improving readability
Research ideation	Controlled exploratory	The agent explores but loses the research objective
Multi-agent debate	Oscillatory or exploratory followed by contraction	Agents repeat positions without synthesis
Data extraction correction	Strongly contractive	Repeated “fixes” introduce new fields or interpretations
Strategic planning assistant	Alternating exploration and contraction	The system brainstorms forever and never consolidates

The most immediate product implication is not glamorous: build drift checks into agent loops. Before adding another orchestration framework, another agent persona, or another expensive model routing layer, measure whether the loop is moving the way the workflow expects.

Yes, this is less exciting than announcing “autonomous enterprise cognition.” It is also more likely to prevent a support bot from reinventing the customer’s complaint in the style of a motivational poster.

The hidden design variable is the operator, not the prompt text alone

One subtle strength of the paper is that it treats the prompt as part of an operator. That sounds technical, but it clarifies a common mistake.

A prompt used once is an instruction. A prompt used recursively is a behavioral rule.

The difference is compounding. A mild instruction repeated 50 times can become a strong force. “Make this slightly more natural while preserving meaning” repeatedly suppresses semantic movement. “Negate the main idea completely in an abstract way” repeatedly injects semantic reversal. The system’s trajectory is not determined by one answer; it is determined by the repeated transformation.
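As an idealized illustration (a textbook contraction argument, not a result stated in the paper), unrolling the recursion shows where the compounding comes from:

$$ a_t = F(F(\cdots F(a_0))) = F^{t}(a_0) $$

Suppose, hypothetically, that a deterministic operator shrank distances by a fixed factor $c < 1$ under some metric $d$, so that $d(F(x), F(y)) \le c \, d(x, y)$. Then consecutive steps would shrink geometrically:

$$ d(a_{t+1}, a_t) \le c^{t} \, d(a_1, a_0) $$

The symbols $d$ and $c$ are assumptions introduced here. Real loops sampled at temperature 0.8 are stochastic and come with no such guarantee, so the regime has to be measured rather than assumed.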

This matters for agent builders because many workflows are designed step by step rather than dynamically. Teams define a critique prompt, a revision prompt, a planning prompt, a reflection prompt, and then assume that if each step sounds reasonable, the loop will behave reasonably.

The paper suggests a harder question: what is the effective geometry of the composition?

For a single loop, the operator is:

$$ F = \mathrm{LLM} \circ P $$

For a composite workflow, the operator may become a chain of different prompts, models, temperatures, tools, and memory updates. The resulting trajectory could contract, expand, oscillate, or do something less polite.
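Read as code, the operator is plain function composition. The sketch below uses hypothetical stand-ins: `prompt_template` for $P$ and `toy_llm` for the model call. A composite workflow would pass a longer chain of steps to the same `compose` helper.

```python
from typing import Callable

Step = Callable[[str], str]

def compose(*steps: Step) -> Step:
    """Build one operator that applies the given steps left to right."""
    def operator(artifact: str) -> str:
        for step in steps:
            artifact = step(artifact)
        return artifact
    return operator

def prompt_template(artifact: str) -> str:
    """Hypothetical P: wrap the current artifact in an instruction."""
    return "Rewrite more fluently: " + artifact

def toy_llm(prompt: str) -> str:
    """Hypothetical LLM: strips the instruction and capitalizes the text."""
    return prompt.split(": ", 1)[1].strip().capitalize()

# F = LLM ∘ P: the template runs first, then the model.
F = compose(prompt_template, toy_llm)
result = F("music connects people")

# A composite loop chains several operators, for example two passes of F.
F_twice = compose(F, F)
```

The point of writing it this way is that the thing under test is the composed operator, not any single step. Each step can look reasonable in isolation while the composition still drifts.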

That is why trajectory-level testing should happen before production. Not just “run the agent on 20 examples and inspect the final answer,” but “run the agent for multiple iterations and inspect the path.”

A practical test suite could include:

short-horizon convergence tests for refinement loops;
source-faithfulness drift tests for summarization loops;
oscillation detection for debate or critique-revise workflows;
exploration width tests for ideation loops;
early-stop rules based on cluster persistence;
reset rules when global similarity to the task falls too far.
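Two of these checks are easy to sketch on top of an embedded trajectory. The names, window sizes, and thresholds below are assumptions, and the early-stop rule uses consecutive similarity as a simpler proxy for cluster persistence rather than an actual cluster detector.

```python
import math
from typing import Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot / (nu * nv)

def is_period_two(trajectory: Sequence[Sequence[float]],
                  window: int = 4,
                  same: float = 0.95,
                  different: float = 0.8) -> bool:
    """True if each of the last `window` states matches the state two steps
    back but not the state one step back (flipping between two positions)."""
    if len(trajectory) < window + 2:
        return False
    recent = range(len(trajectory) - window, len(trajectory))
    return all(
        cosine(trajectory[t], trajectory[t - 2]) >= same
        and cosine(trajectory[t], trajectory[t - 1]) < different
        for t in recent
    )

def should_stop_early(trajectory: Sequence[Sequence[float]],
                      patience: int = 3,
                      threshold: float = 0.98) -> bool:
    """True if the last `patience` steps each barely moved from their predecessor."""
    if len(trajectory) < patience + 1:
        return False
    recent = range(len(trajectory) - patience, len(trajectory))
    return all(cosine(trajectory[t], trajectory[t - 1]) >= threshold for t in recent)

oscillating = [[1.0, 0.0], [0.0, 1.0]] * 3
settled = [[1.0, 0.0]] * 5
```

In a test suite these would run over recorded trajectories from multi-iteration runs, with assertions such as "a refinement loop stops early within N steps" or "a critique-revise loop never trips the oscillation check".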

None of this requires believing that embedding geometry is a perfect map of meaning. It only requires accepting that a noisy movement sensor is better than driving with the dashboard covered because the road is philosophically complex.

Boundaries: what the paper shows, and what it does not

The paper is useful because it is precise. Its limitations should be handled with the same precision.

First, the empirical evidence uses one LLM: deepseek-r1:8b, with fixed decoding settings and temperature 0.8. Different models and sampling configurations may show different trajectories. A larger model might preserve instruction constraints differently. A lower temperature might reduce exploratory volatility. A different decoding strategy might alter both regimes.

Second, the trajectory is observed through one embedding function and calibrated similarity setup. The paper argues that qualitative conclusions are order-based and robust to monotonic transformations, but alternative embedding models can reshape semantic neighborhoods. In production, teams should validate monitoring metrics against their own data and failure modes.

Third, the horizon is 50 iterations. The contractive loop shows no sign of leaving its attractor within that window, but this is not a proof of infinite stability. The exploratory loop shows no cluster within that window, but the paper cannot formally exclude late convergence.

Fourth, the prompts are deliberately extreme. Pure paraphrase and pure abstract negation are clean experimental probes, not a full map of enterprise agent behavior. Real workflows combine retrieval, tools, memory, user feedback, policy constraints, and sometimes the emotional residue of last quarter’s roadmap.

Fifth, the paper measures semantic movement, not correctness. A contractive trajectory can stabilize around a false claim. An exploratory trajectory can generate valuable alternatives. Geometry is a behavioral diagnostic. It is not epistemology, compliance, or product-market fit.

These boundaries do not weaken the paper’s central contribution. They define its proper use. The paper does not give businesses a turnkey agent safety system. It gives them a way to ask a better monitoring question: “What regime is this loop entering, and is that the regime we intended?”

The practical lesson: agents need motion control

The article-level comparison can now be reduced to one operational principle:

Do not evaluate recursive agents only by the final output. Evaluate their motion.

A loop that should refine should show bounded drift and eventual concentration. A loop that should explore should show controlled expansion without losing the task frame. A loop that should debate should not oscillate forever unless the product requirement is “simulate a committee,” in which case congratulations, you have achieved realism.

The paper’s deeper contribution is not that paraphrasing converges and negation drifts. Most attentive readers could guess that direction. The contribution is making the guess measurable: trajectories, local similarity, global drift, dispersion, cluster persistence, attractors, and regimes.

Once those concepts are measurable, they can become engineering objects. They can be logged. Compared. Thresholded. Used for early stopping. Used for prompt selection. Used for escalation. Used to test whether a supposedly improved agent workflow is actually more stable, or merely more verbose.

The next generation of agent operations will need this kind of layer. Tool-call logs tell us what the agent did. Cost dashboards tell us what it spent. Evaluation suites tell us whether selected outputs passed selected tests. Semantic trajectory monitoring tells us how the agent moved while producing them.

That movement is where many failures begin.

A recursive agent does not suddenly become unreliable at the final step. It usually drifts there, one fluent transformation at a time.

Cognaptus: Automate the Present, Incubate the Future.

¹ Nicolas Tacheny, “Geometric Dynamics of Agentic Loops in Large Language Models: Trajectories, Attractors and Dynamical Regimes in Semantic Space,” arXiv:2512.10350, https://arxiv.org/abs/2512.10350.
