Shift Happens: Detecting Behavioral Drift in Multi‑Agent Systems

Updates are boring until they are not.

A retrieval index changes. A tool permission is adjusted. A base model is silently upgraded. A memory module starts carrying yesterday’s weird interaction into today’s customer support workflow. Nobody sees smoke. The dashboard still says “healthy.” The agent still answers. Then, three weeks later, someone notices that one group of agents has become strangely aggressive, risk-averse, evasive, or just less aligned with the behavior the product team thought it had shipped.

This is the unpleasant part of agent operations: behavior can move without leaving a clean internal audit trail. In a normal software system, you can often inspect the diff. In an agentic system, the “diff” may be distributed across model weights you do not own, tools you only partially control, retrieved context that changes daily, and other agents influencing the environment. Very elegant. Also a governance migraine with nicer branding.

The paper Detecting Perspective Shifts in Multi-Agent Systems introduces the Temporal Data Kernel Perspective Space, or TDKPS, as a way to monitor this problem from the outside.¹ Its core move is simple enough to explain but subtle enough to matter: treat agents as black boxes, repeatedly ask them a stable set of questions over time, embed their responses, place every agent-at-timepoint into one shared geometric space, and then statistically test whether the resulting positions have shifted.

The important word is detect. TDKPS does not magically open the agent and tell you whether the culprit was the base model, the retrieval database, the environment, a tool, or the awkward little society of other agents whispering around it. It detects a change in observable input-output behavior. That boundary is not a weakness to hide in the final paragraph. It is the whole design contract.

The monitoring problem is not “what changed inside?” but “did behavior move?”

The paper defines an agent broadly: a generative system whose behavior can be affected by its base model, tools, storage, external environment, or update mechanism. That definition is useful because modern “agents” are rarely just models wearing a little hat. They are composite systems. A web crawler, a database, an LLM, and a prompt wrapper can together produce behavior that no single component fully explains.

That is also why the black-box setting matters. In many real deployments, the operator may not have access to weights, private tool internals, privileged retrieval data, or the hidden interaction structure among agents. The practical question becomes:

Given only inputs and outputs, can we tell whether an agent, or a group of agents, has changed behavior over time?

The paper’s answer is yes, under a controlled observation design. The operator needs a query set, repeated responses, an embedding function, and a statistical test. Not glamorous. Quite useful.

The current generation of agent monitoring often focuses on local incidents: one bad answer, one failed tool call, one hallucinated policy. TDKPS shifts the unit of analysis. It asks whether the behavioral distribution of an agent has moved relative to itself, to other agents, and to time. That is closer to “fleet monitoring” than ordinary prompt evaluation.

TDKPS makes black-box behavior visible by building a temporal geometry

The mechanism begins with a stable measurement ritual.

For each agent, at each timepoint, the system asks the same query set multiple times. Each answer is embedded into a vector. The paper’s real-data experiment uses 99 digital congressperson agents, 14 timepoints from 2018 to 2024, 100 queries per topic, 25 response replicates per query, and 768-dimensional response embeddings. In tensor form, each topic produces data shaped like:

$$ X \in \mathbb{R}^{14 \times 99 \times 100 \times 25 \times 768} $$

That tensor is not the final object of interest. It is the raw behavioral measurement.

TDKPS then averages embedded responses across replicates for each query and compares every agent-time combination with every other agent-time combination. The result is a large distance matrix: how far Agent A at time 1 appears from Agent B at time 2, and from itself at time 2, and from everyone else across the temporal system.

Classical multidimensional scaling then turns that distance matrix into a low-dimensional Euclidean representation. Each point in the space is not merely an agent. It is an agent at a timepoint.
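The averaging, distance, and scaling steps can be sketched on toy data. This is a minimal illustration assuming NumPy, not the paper's reference implementation: the function names, the tiny tensor shape, and the choice of a Frobenius-style distance between replicate-averaged response matrices are all assumptions made for the example.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical MDS: double-center squared distances, keep the top eigenpairs."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]   # indices of the largest ones
    top_vals = np.clip(eigvals[order], 0.0, None)      # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)

def temporal_embedding(responses, n_components=2):
    """responses has shape (T, A, Q, R, D): time, agent, query, replicate, dim."""
    T, A, Q, R, D = responses.shape
    mean_resp = responses.mean(axis=3)                 # average over replicates
    flat = mean_resp.reshape(T * A, Q * D)             # one row per agent-time pair
    # Assumed distance: Frobenius norm between averaged query-by-dim matrices.
    dist = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    coords = classical_mds(dist, n_components)
    return coords.reshape(T, A, n_components), dist

# Toy stand-in for the real tensor: 3 timepoints, 5 agents, 4 queries,
# 6 replicates, 8-dimensional embeddings (random numbers, not real responses).
rng = np.random.default_rng(0)
responses = rng.normal(size=(3, 5, 4, 6, 8))
coords, dist = temporal_embedding(responses)

# Per-agent movement between the first two timepoints in the shared space.
shift = np.linalg.norm(coords[1] - coords[0], axis=-1)
```

Every row of `dist` is an agent-time pair, so the same agent appears once per timepoint. That is what lets movement be read off as a distance between two rows of `coords`.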

A simple operational diagram looks like this:

Stable query suite
        ↓
Repeated agent responses across time
        ↓
Response embeddings
        ↓
Pairwise distances among agent-time pairs
        ↓
TDKPS shared temporal space
        ↓
Agent-level and group-level drift tests

The elegance is that temporal behavior becomes spatial movement. A drifting agent is no longer an abstract concern; it becomes a point moving in a shared geometry. The geometry is not automatically semantic. Axis 1 does not necessarily mean “vaccination skepticism” or “customer empathy” or “legal caution.” It is relational first. Meaning comes later, if you add domain labels, supervised interpretation, or human judgment.

That distinction matters. TDKPS is not a dashboard of human-readable traits. It is a statistical geometry for detecting movement.

The agent-level test asks whether one point moved farther than chance

Once each agent-time pair has a coordinate, the agent-level question becomes straightforward:

Is this agent’s position at time $t$ meaningfully different from its position at time $t+1$?

The test statistic is the Euclidean distance between the two TDKPS representations of the same agent at two timepoints. A large distance suggests movement. But “large” only matters relative to a null distribution, so the paper uses a permutation strategy.

For a specific agent, the method pools that agent’s response replicates from the two timepoints and randomly redistributes them across time. This simulates what the observed distance would look like if there were no real temporal change. The method then compares the observed distance against the permutation-generated null distribution.
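The pooling-and-redistribution logic can be illustrated with a deliberately simplified sketch, assuming NumPy. To stay short, it measures movement as the distance between replicate-averaged embeddings directly, rather than re-embedding each permutation into the shared TDKPS space as the paper does. The function names and toy data are hypothetical.

```python
import numpy as np

def mean_shift(rep_a, rep_b):
    """Distance between replicate-averaged responses; inputs are (Q, R, D)."""
    return np.linalg.norm(rep_a.mean(axis=1) - rep_b.mean(axis=1))

def agent_permutation_pvalue(rep_t0, rep_t1, n_perm=200, seed=0):
    """Permutation p-value for one agent observed at two timepoints."""
    rng = np.random.default_rng(seed)
    Q, R, _ = rep_t0.shape
    observed = mean_shift(rep_t0, rep_t1)
    pooled = np.concatenate([rep_t0, rep_t1], axis=1)      # (Q, 2R, D)
    rows = np.arange(Q)[:, None]
    exceed = 0
    for _ in range(n_perm):
        # Independently shuffle the 2R pooled replicates within each query.
        perm = np.argsort(rng.random((Q, 2 * R)), axis=1)
        shuffled = pooled[rows, perm]
        if mean_shift(shuffled[:, :R], shuffled[:, R:]) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)

# Toy data: 10 queries, 20 replicates, 16-dimensional embeddings.
rng = np.random.default_rng(1)
t0 = rng.normal(size=(10, 20, 16))
stable_t1 = rng.normal(size=(10, 20, 16))            # same distribution as t0
shifted_t1 = rng.normal(size=(10, 20, 16)) + 1.0     # every coordinate moved

_, p_stable = agent_permutation_pvalue(t0, stable_t1)
_, p_shifted = agent_permutation_pvalue(t0, shifted_t1)
```

With the shifted toy agent, no shuffled split reproduces a gap as large as the observed one, so `p_shifted` lands at the smallest value the permutation count allows.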

The technical wrinkle is important: because all agents are embedded in a shared space, perturbing one agent’s data can affect the geometry. The paper handles this by using a fixed basis when re-embedding permuted data, reducing coordinate instability. This is one of those details that sounds minor until it ruins your inference. Geometry is helpful; geometry is also needy.
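One common way to hold a basis fixed is the standard out-of-sample extension for classical MDS: fit the eigenvectors once, then place a new point using only its distances to the reference points. The sketch below assumes NumPy and uses that textbook construction as a stand-in. Whether it matches the paper's exact procedure is an assumption, and the function names are hypothetical.

```python
import numpy as np

def fit_cmds_basis(dist, n_components=2):
    """Fit classical MDS once and keep what is needed to place new points."""
    sq = dist ** 2
    n = sq.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq @ centering
    vals, vecs = np.linalg.eigh(gram)
    order = np.argsort(vals)[::-1][:n_components]
    return {"sq": sq, "vals": vals[order], "vecs": vecs[:, order]}

def project_with_fixed_basis(basis, new_dist):
    """Coordinates of one new point from its distances to the reference points.

    Assumes the retained eigenvalues are strictly positive.
    """
    sq, vals, vecs = basis["sq"], basis["vals"], basis["vecs"]
    new_sq = new_dist ** 2
    # Double-center the new squared distances against the reference set.
    centered = -0.5 * (new_sq - new_sq.mean() - sq.mean(axis=1) + sq.mean())
    return (vecs.T @ centered) / np.sqrt(vals)

# Toy reference cloud: 12 points in 3 dimensions.
rng = np.random.default_rng(2)
points = rng.normal(size=(12, 3))
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

basis = fit_cmds_basis(dist)
reference_coords = basis["vecs"] * np.sqrt(basis["vals"])

# Sanity check: re-projecting a reference point recovers its own coordinates.
reprojected = project_with_fixed_basis(basis, dist[:, 0])
```

Because the eigenvectors never change, permuted data lands in the same coordinate system as the observed data, so the null distances are comparable with the observed one.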

The simulations show the agent-level TDKPS test achieving near-oracle power and outperforming a distance-correlation baseline across the tested settings. The “oracle” here is not a real competitor; it has privileged knowledge of the simulation’s hidden structure. Its purpose is to show what good performance would look like if the world kindly handed over the secrets it normally hides. TDKPS comes close without that access.

There is a cost. The agent-level test shows mild Type I error inflation, roughly 5–10% for the null class in the relevant simulations. The authors include a sample-splitting variant, TDKPS-C, that improves Type I error control by estimating the embedding basis and test statistic on separate subsets of replicates. That is a calibration fix, not a second thesis. It buys cleaner false-positive behavior at the cost of some power, especially when response replicates are limited.

For business monitoring, this trade-off is familiar. More sensitivity catches problems earlier but cries wolf more often. Better calibration reduces false alarms but may miss weaker shifts. Congratulations: statistical governance has discovered operations.

The group-level test is where fleet monitoring becomes practical

Agent-level testing is useful when you care about one agent: a sales bot in one region, a portfolio agent running one strategy, a compliance assistant assigned to one department. But most organizations will also care about groups.

Did all agents using a new retrieval pipeline shift? Did one party of agents in a simulation move differently from another? Did customer-facing agents change after a model upgrade while internal summarization agents stayed stable?

The paper’s group-level method, called PE TDKPS, uses an energy-distance statistic between the distributions of TDKPS embeddings at two timepoints. The permutation step preserves pairing: each agent’s two timepoint labels are randomly swapped, rather than treating the two time samples as unrelated piles of points.

That pairing is the quiet hero of the method. It respects the fact that Agent A at time 1 and Agent A at time 2 are linked. Ignoring that link throws away structure.

The group-level test is also computationally more attractive. Instead of repeatedly rebuilding embeddings for each individual agent permutation, it reuses a fixed distance matrix and relabels timepoints within agent pairs. In the paper’s terms, this provides major savings for aggregate inference. In business terms: this is the difference between a clever diagnostic and something you might actually schedule overnight without apologizing to the GPU budget.
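A sketch of the paired idea, assuming NumPy: compute one distance matrix over all agent-time points, then swap each agent's two time labels at random and recompute the energy statistic by re-indexing that same matrix. The function names and toy coordinates are hypothetical, and the plain (V-statistic) form of energy distance used here is an assumption about details the text above does not specify.

```python
import numpy as np

def energy_from_matrix(dist, idx_a, idx_b):
    """Energy distance between two index sets of a precomputed distance matrix."""
    cross = dist[np.ix_(idx_a, idx_b)].mean()
    within_a = dist[np.ix_(idx_a, idx_a)].mean()
    within_b = dist[np.ix_(idx_b, idx_b)].mean()
    return 2.0 * cross - within_a - within_b

def paired_energy_test(coords_t0, coords_t1, n_perm=499, seed=0):
    """Row i of both inputs must be the same agent at the two timepoints."""
    rng = np.random.default_rng(seed)
    n = coords_t0.shape[0]
    stacked = np.vstack([coords_t0, coords_t1])
    # The distance matrix is built once and only re-indexed afterwards.
    dist = np.linalg.norm(stacked[:, None] - stacked[None, :], axis=-1)
    agents = np.arange(n)
    observed = energy_from_matrix(dist, agents, agents + n)
    exceed = 0
    for _ in range(n_perm):
        swap = rng.integers(0, 2, size=n)      # 1 means swap this agent's labels
        idx_a = agents + n * swap
        idx_b = agents + n * (1 - swap)
        if energy_from_matrix(dist, idx_a, idx_b) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Toy fleet of 40 agents in a 2-D space with a common shift along axis 1.
rng = np.random.default_rng(42)
before = rng.normal(size=(40, 2))
after = before + np.array([1.5, 0.0]) + 0.1 * rng.normal(size=(40, 2))

stat, p_value = paired_energy_test(before, after)
```

Each permutation costs only a few submatrix means, which is where the savings over per-agent re-embedding come from.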

The simulations test power, calibration, and where the method gets its advantage

The simulation design creates “temporal Gaussian blobs”: two classes of agents, one with a real temporal shift and one without. The latent signal is then obscured by query-specific random transformations and noise, mimicking the difficulty of observing agent behavior through many different prompts and response embeddings.

The key point is not that Gaussian blobs resemble every real agent. They do not. The point is that the simulation gives the authors controlled access to ground truth: which agents changed, how much, under what signal-to-noise conditions.

| Paper component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Agent-level simulations varying effect size | Main evidence | TDKPS detects larger behavioral shifts with higher power and approaches oracle performance | Real-world causal diagnosis |
| Agent-level simulations varying number of agents | Mechanism/sensitivity test | More agents improve embedding stability because TDKPS uses shared relational structure | That bigger systems are always safer or easier |
| Agent-level simulations varying queries and replicates | Sensitivity test | More diverse queries and more replicates improve detection; replicates also improve specificity | That any arbitrary query set is sufficient |
| Sample-splitting TDKPS-C | Calibration/robustness test | Type I error can be brought closer to nominal levels | Free accuracy; calibration costs power |
| Group-level PE TDKPS simulations | Main evidence for aggregate testing | Paired energy testing in TDKPS space maintains strong power and appropriate Type I behavior in most settings | Individual-level explanation |
| Embedding-model sensitivity appendix | Robustness test | The temporal pattern is highly stable across tested embedding models, with pairwise Spearman correlations above 0.99 | Robustness to every embedding model or domain |
| Retrieval-depth sensitivity appendix | Robustness/sensitivity test | Broad temporal patterns persist across retrieving 1, 2, or 3 tweets, with correlations around 0.81–0.87 | That retrieval choices are irrelevant |
| Retrieval-quality appendix | Implementation validation | Retrieved tweets are more relevant than random tweets across topics | That retrieved context fully explains the detected behavior |

Two findings are especially useful for practitioners.

First, the number of agents is not merely scale; it is information. TDKPS embeds agents relative to one another. A richer population can stabilize the geometry, making individual movement easier to detect. This is counterintuitive if you think of multi-agent systems only as complexity monsters. Here, the fleet helps define the measurement frame.

Second, query design is not decoration. Increasing the number of queries improves power, but the appendix also shows that sensitivity depends on total signal, not simply the number of affected queries. A small, well-targeted query set can matter. A large, repetitive, weakly relevant query set can still waste everyone’s time with impressive confidence.

That last point deserves a little neon sign in any enterprise implementation: the monitoring suite is only as meaningful as the questions it asks.

The digital-congressperson experiment uses COVID as an external shock, not a party trick

The paper’s real-data experiment constructs “digital congresspersons.” Each agent uses Ministral-8B-Instruct-2410 with retrieval over a congressperson-specific database of that person’s tweets while in office. The authors include congresspersons continuously in office from 2016 to 2025 with more than 100 tweets by the end of Q1 2018, leaving 99 agents.

The agents are queried twice a year from 2018 to 2024, which yields the 14 timepoints. For each question, the system retrieves the two most relevant tweets available up to that timepoint and uses them as context. The paper tests three query sets:

| Query set | Role in the design | Expected interpretation |
| --- | --- | --- |
| Public health | Main treatment-relevant topic | Should respond most strongly to COVID-era opinion shifts |
| General politics | Partially related control | Captures ordinary political evolution and strategic repositioning |
| Candy & chocolate | Orthogonal control | Captures generic temporal drift, pipeline artifacts, and unrelated stylistic movement |

This is a natural experiment design. COVID-19 is treated as an exogenous event relative to congresspersons’ prior opinions. The method does not manipulate the pandemic, obviously. Even the most ambitious benchmark paper has limits. But the timing, topic specificity, and control query sets help separate public-health-specific movement from generic drift.

The agent-level results are the clearest evidence. Public health queries show the sharpest behavioral shifts during the first two years after COVID onset. The paper formalizes this by ranking average TDKPS embedding differences across consecutive timepoints and comparing those ranks with proximity to the midpoint of the two-year post-COVID window, around April 2021. For public health queries, the relationship is significant: Kendall’s $K = 0.51$, $p = 0.014$. For orthogonal candy queries, it is absent: $K = 0.01$, $p = 0.956$. For general political queries, it is weaker and not conventionally significant: $K = 0.34$, $p = 0.11$.
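For readers who want to see what such a rank comparison computes, here is a small pure-Python sketch. It assumes the reported statistic is the usual Kendall rank correlation (tau-a, no tie correction). The numbers are made-up toy values chosen for illustration, not the paper's data.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall rank correlation (tau-a): concordant minus discordant pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1       # both sequences order this pair the same way
        elif direction < 0:
            discordant += 1       # the two sequences disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical toy values: a proximity rank per interval (higher means closer
# to the event) and the average shift measured in that interval.
proximity_rank = [1, 2, 3, 4, 5]
average_shift = [0.2, 0.1, 0.5, 0.4, 0.9]

tau = kendall_tau(proximity_rank, average_shift)   # 8 concordant, 2 discordant
```

A value near 1 means intervals closer to the event also show larger shifts; a value near 0, as with the candy queries, means proximity tells you nothing about movement.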

That pattern matters more than any single chart. If the method were merely detecting generic temporal drift, the candy queries should move too. If it were only detecting broad political churn, general political queries should look closer to public health. Instead, the strongest signal concentrates where the external event should matter.

The group-level results then act as a computationally efficient confirmation. The group-level test tracks the aggregate agent-level test statistics and identifies elevated public-health movement during the COVID period. Agreement between group-level p-values and combined agent-level p-values is strong: Republicans show Kendall’s $K = 0.69$, $p = 0.001$; Democrats show $K = 0.43$, $p = 0.043$; all agents combined show $K = 0.62$, $p = 0.005$.

This does not mean TDKPS has discovered the psychology of Congress. It means the black-box behavior of the digital agents changed in a way that is temporally aligned, topic-specific, and robust enough to be detected by both individual and group-level tests.

That is already a lot. No need to make it mystical.

What businesses can borrow is the monitoring architecture, not the political example

The digital-congressperson experiment is interesting, but the business relevance is not “simulate politicians with tweets.” Please do not build that unless you enjoy meetings with legal.

The reusable idea is a monitoring architecture:

1. Define a stable query suite that represents operationally important behavior.
2. Run agents against that suite at regular timepoints.
3. Collect multiple response replicates, especially for stochastic systems.
4. Embed responses with a fixed observation pipeline.
5. Build a shared temporal geometry across agents and time.
6. Test for individual and group-level movement.
7. Investigate flagged shifts with controlled follow-up experiments.

The last step is crucial. TDKPS is a smoke detector, not a forensic lab. When it flags a shift, the operator still needs to isolate causes: base model update, prompt change, retrieval corpus drift, tool behavior, environmental data, memory state, or interaction effects among agents.
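The collection half of that architecture (the first four items in the list above) can be sketched in a few lines of standard-library Python. Everything named here is hypothetical: the `Agent` callable interface, the `collect_snapshot` helper, and especially `toy_embed`, which is a hash-based placeholder with no semantic meaning. A real deployment would use a fixed sentence-embedding model instead.

```python
import hashlib
from typing import Callable, Dict, List

Agent = Callable[[str], str]            # hypothetical black-box interface
Embedder = Callable[[str], List[float]]

def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Placeholder embedding: hash bytes scaled to [0, 1]. Not semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]

def collect_snapshot(agents: Dict[str, Agent], queries: List[str],
                     embed: Embedder, n_replicates: int):
    """Return {agent: {query: [one embedding per replicate]}} for one timepoint."""
    snapshot = {}
    for name, agent in agents.items():
        snapshot[name] = {
            query: [embed(agent(query)) for _ in range(n_replicates)]
            for query in queries
        }
    return snapshot

# Hypothetical fleet of two deterministic stand-in agents.
agents = {
    "support_a": lambda q: f"Per policy, here is the answer to: {q}",
    "support_b": lambda q: f"Happy to help with: {q}",
}
queries = ["Can I get a refund after 30 days?", "How do I escalate a complaint?"]

snapshot = collect_snapshot(agents, queries, toy_embed, n_replicates=3)
```

Running the same call at each scheduled timepoint, with the query list and embedder held fixed, produces the raw material for the geometry and the tests. Changing either one mid-stream changes the measuring instrument, not just the measurement.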

A practical enterprise version might look like this:

| Deployment context | What to monitor with TDKPS-like logic | What a detected shift should trigger |
| --- | --- | --- |
| Customer support agents | Tone, refund-policy handling, escalation behavior, safety language | Review recent prompt/model/retrieval changes; sample live conversations |
| Internal copilots | Compliance wording, data-access caution, summarization style | Check policy updates, retrieval freshness, permission changes |
| Trading or research agents | Risk appetite, uncertainty language, source selection, scenario framing | Freeze strategy changes, compare against baseline query suite, audit data feed changes |
| Multi-agent workflow systems | Role consistency, handoff behavior, conflict resolution | Inspect coordination protocol, tool logs, memory propagation |
| Regulated advisory systems | Consistency across protected domains, refusal boundaries, disclosure style | Run compliance review before rollout or rollback |

The return on investment is not that the method replaces evaluation. It is that it gives evaluation a statistical spine. Instead of waiting for anecdotal complaints, an operator can maintain a repeatable behavioral measurement layer and detect when a fleet has moved outside its expected regime.

That is especially valuable when agents are built from vendor models. If you do not control the internals, you need external behavioral tests. “The vendor said nothing changed” is not a monitoring strategy. It is a bedtime story for procurement.

The hard boundaries: query dependence, false positives, stateful agents, and causal localization

TDKPS is useful because it is disciplined. It is also limited for the same reason.

The first boundary is query dependence. The method detects change with respect to the query set used. If your query suite misses a risk domain, the test may not see movement there. If your prompts are too narrow, redundant, or semantically correlated, the geometry may look more stable than the system really is. The paper’s appendix shows that the temporal pattern is robust across embedding models and moderately robust across retrieval depths, but it also reinforces that pipeline choices matter.

The second boundary is false-positive control at the agent level. The uncalibrated agent-level test shows Type I error inflation around 5–10% in simulations. The sample-splitting variant improves calibration but reduces effective sample size. In a business system, this means the monitoring threshold should be chosen with operational cost in mind. A safety-critical workflow may tolerate more false alarms. A low-risk content-tagging pipeline may not.

The third boundary is sequential state. The framework treats query-response pairs as isolated samples. Many real agents maintain memory, conversation history, and prompt-order sensitivity. For those systems, a future version of the method would need to model sequential dependence. Otherwise, it may confuse normal state evolution with anomalous drift.

The fourth boundary is causal localization. The paper is explicit: TDKPS identifies observable behavioral shift, not the responsible component. In the congressperson setup, the base model is fixed and the retrieval context changes over time, so the source of change is narrowed by design. In a production agent stack where model, tools, retrieval, memory, and user environment all move at once, detection alone is not diagnosis.

This is where operational design becomes part of statistical validity. If a company wants TDKPS-like monitoring to support decisions, it should version components, preserve historical query suites, maintain baseline response archives, and run controlled comparisons after major changes. Otherwise it will have a beautiful drift alarm and a very ugly blame meeting.
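One lightweight way to make that versioning concrete is to store a small manifest alongside every measurement snapshot. The sketch below uses only the Python standard library. The `SnapshotManifest` fields, the helper name, and the version strings are hypothetical examples, not a schema from the paper.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class SnapshotManifest:
    """Hypothetical record of what was deployed when a snapshot was measured."""
    timepoint: str
    model_version: str
    prompt_version: str
    retrieval_index_version: str
    embedding_model: str
    query_suite: Tuple[str, ...]

    def suite_hash(self) -> str:
        """Short fingerprint so a silently edited query suite is detectable."""
        payload = json.dumps(list(self.query_suite)).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

def changed_components(old: SnapshotManifest, new: SnapshotManifest) -> List[str]:
    """Names of versioned components that differ between two snapshots."""
    tracked = ("model_version", "prompt_version",
               "retrieval_index_version", "embedding_model")
    return [name for name in tracked if getattr(old, name) != getattr(new, name)]

suite = ("Can I get a refund after 30 days?", "How do I escalate a complaint?")
baseline = SnapshotManifest("2025-01", "model-v1", "prompt-v3",
                            "index-2025-01", "embedder-v1", suite)
current = SnapshotManifest("2025-02", "model-v1", "prompt-v3",
                           "index-2025-02", "embedder-v1", suite)

suspects = changed_components(baseline, current)
```

When a drift test fires, a manifest diff like `suspects` gives the follow-up experiment a short list of candidates. Here only the retrieval index differs between the two snapshots.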

The real contribution is turning agent behavior into something testable

The paper’s central contribution is not that agents drift. Everyone deploying agents already suspects that, usually while staring at a Slack thread titled “small issue.”

The contribution is the measurement mechanism: jointly embedding black-box agents across time, then testing movement at both the agent and group levels. That mechanism matters because multi-agent systems are not just collections of independent chatbots. They are relational systems. Agents define one another’s measurement context. TDKPS uses that fact instead of pretending it is noise.

The evidence is strongest where the paper stays close to what it can support. Simulations show power and calibration behavior under controlled conditions. The digital-congressperson experiment shows that the method can detect a topic-specific, temporally aligned shift around a real external event. The appendices show that some pipeline choices are robust, while others still matter. The limitations are not embarrassing; they are instructions for implementation.

For businesses, the lesson is straightforward: if agent fleets are going to act in changing environments, monitoring must move from anecdotal review to repeated statistical measurement. TDKPS offers one credible template. It will not explain every cause, replace audits, or remove the need for human interpretation. Good. Tools that promise to do all of that usually arrive carrying a demo and leave carrying your budget.

Behavioral drift is going to happen. The question is whether you notice it as a measured shift in a monitoring system or as a surprise in production.

One of those is called governance. The other is called Tuesday.

Cognaptus: Automate the Present, Incubate the Future.

¹ Eric Bridgeford and Hayden Helm, “Detecting Perspective Shifts in Multi-Agent Systems,” arXiv:2512.05013.
