A workflow chart is comforting. It gives everyone boxes, arrows, and the illusion that power follows geometry.
In a multi-agent AI system, that illusion fails rather quickly. The agent in the middle of the diagram may not be the one shaping the final answer. The orchestrator may look important because everything passes through it, but another specialist agent may quietly determine the substance. A router may touch only one decision and still decide the entire path. A late-stage formatter may appear humble and yet rewrite the output enough to matter. The org chart lied. Naturally, the workflow diagram learned from management.
The paper CAIR: Counterfactual-based Agent Influence Ranker for Agentic AI Workflows asks a practical question that many agent builders are already pretending they have answered: which agent actually influences the final output?[1] Not which agent is central in the graph. Not which agent has the grandest system prompt. Not which agent is called “Supervisor”, “Planner”, or, heaven help us, “Chief Strategy Agent”. Which agent changes the result when its behaviour changes?
That distinction is the paper’s useful contribution. CAIR does not treat a multi-agent workflow as a static graph. It treats it as a running system whose internal outputs can be perturbed, observed, and ranked. The method is not magic, and it does not prove philosophical causality. It gives a workmanlike influence ranking under specific assumptions: representative queries, visibility into intermediate agent outputs, and a workflow that can be replayed with injected counterfactual outputs.
That may sound narrow. In business terms, it is exactly the narrowness we need. Most firms do not need another slogan about “trustworthy agents”. They need to know where to put monitoring, guardrails, testing budgets, and debugging attention without tripling latency every time an agent says hello.
CAIR begins by breaking the workflow on purpose
CAIR’s core move is simple: run the workflow, change one agent’s output, rerun the downstream part, and measure what happens.
The method borrows the intuition of feature importance from classical machine learning. In a conventional model, we may perturb a feature and observe whether the prediction changes. In CAIR, the “features” are agents. More precisely, they are agent activations: an agent receives an input, produces an output, and hands something to the rest of the workflow. CAIR asks: if this output had been different, how much would the final workflow output have changed?
That is the right starting point because agent influence is behavioural, not decorative. A graph centrality metric can tell us who sits between other nodes. It cannot tell us whether the content produced by that node matters. In an agentic AI workflow, the text, JSON, tool call, branch selection, or intermediate reasoning artefact passed between agents may reshape everything that follows. The influence is in the execution trace, not the architecture slide.
CAIR has two phases.
First, the offline phase. The user supplies representative queries, ideally one for each workflow functionality. CAIR runs the workflow normally and records the activation flow: which agents were called, in what order, with what inputs and outputs. It then perturbs each activated agent’s output using an LLM prompt designed to create a plausible but meaningfully different output. The workflow is resumed from that perturbation point, and CAIR records how the rest of the system changes.
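The replay loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a strictly sequential workflow in which each agent runs once, and the names `run_from`, `offline_counterfactuals`, and the toy `perturb` function are hypothetical stand-ins (in the paper, the perturbation comes from an LLM prompt).

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types: an agent maps the previous output to a new output.
Agent = Callable[[str], str]
Workflow = List[Tuple[str, Agent]]


def run_from(workflow: Workflow, start: int, payload: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Run workflow[start:] on payload; return the final output and the trace."""
    trace: List[Tuple[str, str]] = []
    for name, agent in workflow[start:]:
        payload = agent(payload)
        trace.append((name, payload))
    return payload, trace


def offline_counterfactuals(
    workflow: Workflow, query: str, perturb: Callable[[str], str]
) -> Dict[str, Dict[str, str]]:
    """For each activated agent, swap in an altered output and replay downstream."""
    original_final, trace = run_from(workflow, 0, query)
    results: Dict[str, Dict[str, str]] = {}
    for index, (name, output) in enumerate(trace):
        altered = perturb(output)  # stand-in for the LLM-generated counterfactual
        perturbed_final, _ = run_from(workflow, index + 1, altered)
        results[name] = {
            "original_output": output,
            "perturbed_output": altered,
            "original_final": original_final,
            "perturbed_final": perturbed_final,
        }
    return results


# Toy agents, assumed purely for illustration.
workflow: Workflow = [
    ("extract", lambda text: text.upper()),
    ("summarise", lambda text: text[:5]),
    ("format", lambda text: f"<{text}>"),
]
results = offline_counterfactuals(workflow, "hello world", perturb=lambda text: text[::-1])
for name, record in results.items():
    print(name, record["original_final"], "->", record["perturbed_final"])
```

The point of the sketch is the shape of the loop: one normal run, then one downstream replay per activated agent. Orchestrator and router workflows would need a replay mechanism that can also change which agents run next.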
Second, the online phase. When a new query arrives, CAIR does not rerun the expensive perturbation process. It embeds the query, retrieves the most similar representative query from the offline set, and reuses the influence ranking already computed for that representative case. The claim is not that online CAIR discovers influence in real time. It predicts the likely ranking from a precomputed behavioural map.
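A rough sketch of that lookup, under stated assumptions: the bag-of-words `embed` function below is a toy substitute for a real sentence-embedding model, and `lookup_ranking`, the representative queries, and the agent names are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a sentence encoder."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lookup_ranking(query: str, offline_rankings: Dict[str, List[str]]) -> List[str]:
    """Reuse the ranking of the most similar representative query."""
    query_vec = embed(query)
    best = max(offline_rankings, key=lambda rep: cosine_similarity(query_vec, embed(rep)))
    return offline_rankings[best]


# Hypothetical offline results: representative query -> agents ranked by influence.
offline_rankings = {
    "suggest a birthday gift for my mother": ["recommender", "profiler", "formatter"],
    "track the delivery of my order": ["order_lookup", "router", "formatter"],
}
ranking = lookup_ranking("what gift should I buy for my father", offline_rankings)
print(ranking)
```

Nothing is perturbed at inference time here. The only online work is one embedding and one similarity search, which is why the online phase stays cheap.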
This offline-online split matters. A full counterfactual analysis at inference time would be absurdly expensive for many workflows. CAIR accepts that cost offline and then spends almost nothing online beyond embedding and similarity lookup. It is the difference between inspecting the bridge before rush hour and stopping every vehicle to ask whether gravity still works.
The score measures two kinds of damage
The important design choice is that CAIR does not measure only final-output difference.
It looks at two effects.
The first is final output change. CAIR embeds the original final output and the perturbed final output, then uses cosine distance to estimate how different they are. This captures the obvious kind of influence: if changing an agent’s output substantially changes the final answer, that agent is probably important.
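For reference, the standard cosine distance between two embedding vectors can be written as below. The symbols are introduced here for illustration and are not necessarily the paper's notation: $\mathbf{e}_o$ is the embedding of the original final output and $\mathbf{e}_p$ is the embedding of the perturbed one.

```latex
d_{\cos}(\mathbf{e}_o, \mathbf{e}_p)
  = 1 - \frac{\mathbf{e}_o \cdot \mathbf{e}_p}
             {\lVert \mathbf{e}_o \rVert \, \lVert \mathbf{e}_p \rVert}
```

A value near 0 means the two outputs point in almost the same semantic direction. Larger values mean the perturbation moved the final answer further.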
But this alone is not enough. Suppose a perturbation is extremely large. A huge change to an agent’s output may naturally produce a huge final difference, even if the agent is not intrinsically influential. CAIR therefore also measures the size of the original perturbation to the agent’s own output and adjusts for it using an amplification factor. That factor reflects where the agent appears in the activation sequence: a perturbation earlier in the workflow has more downstream room to propagate.
The second effect is workflow change. In flexible systems, an agent’s output may alter not only the final text but the path of execution itself. A router may select a different branch. An orchestrator may call a different worker. A workflow may add, remove, or reorder agents. CAIR captures this by comparing the original and perturbed activation flows using edit distance.
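As an illustration of the workflow-change measurement, here is a standard Levenshtein edit distance applied to sequences of agent names. The function name and the example flows are hypothetical, and the paper may use a different edit-distance variant or normalisation.

```python
from typing import Sequence


def flow_edit_distance(original: Sequence[str], perturbed: Sequence[str]) -> int:
    """Levenshtein distance between two activation flows (lists of agent names)."""
    previous = list(range(len(perturbed) + 1))
    for i, agent_a in enumerate(original, start=1):
        current = [i]
        for j, agent_b in enumerate(perturbed, start=1):
            substitution = previous[j - 1] + (0 if agent_a == agent_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# Hypothetical flows: the perturbed router output sends the case down another branch.
original_flow = ["router", "standard_payout", "summariser"]
perturbed_flow = ["router", "fraud_review", "escalation", "summariser"]

distance = flow_edit_distance(original_flow, perturbed_flow)
print(distance)  # one substitution plus one insertion
```

Operating on whole agent names rather than characters means each edit corresponds to an agent being added, removed, or swapped in the execution path.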
So the score is not merely “did the answer text move?” It is closer to: did changing this agent alter the output, alter the route, or both?
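One way to picture how the two effects might be folded into a single ranking is sketched below. This is explicitly not the paper's formula: the weights, the ratio used to discount large perturbations, and every name here (`Effects`, `influence_score`, `rank_agents`) are assumptions made for illustration, and the paper's position-dependent amplification factor is not reproduced.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Effects:
    final_change: float  # distance between original and perturbed final outputs
    agent_change: float  # distance between original and perturbed agent outputs
    flow_change: float   # normalised edit distance between activation flows


def influence_score(e: Effects, w_output: float = 0.5, w_flow: float = 0.5,
                    eps: float = 1e-6) -> float:
    """Illustrative score: perturbation-adjusted output change plus path change."""
    # Assumed adjustment: a large final change counts for less if the
    # perturbation applied to the agent's own output was itself large.
    adjusted_output_change = e.final_change / max(e.agent_change, eps)
    return w_output * adjusted_output_change + w_flow * e.flow_change


def rank_agents(effects: Dict[str, Effects]) -> List[str]:
    """Order agents from most to least influential under the illustrative score."""
    return sorted(effects, key=lambda name: influence_score(effects[name]), reverse=True)


# Hypothetical measurements for three agents.
effects = {
    "formatter": Effects(final_change=0.05, agent_change=0.50, flow_change=0.0),
    "router": Effects(final_change=0.30, agent_change=0.40, flow_change=0.5),
    "drafter": Effects(final_change=0.20, agent_change=0.50, flow_change=0.0),
}
ranking = rank_agents(effects)
print(ranking)
```

In this toy case the router ranks first because it moves both the answer and the route, while the formatter barely moves either.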
That distinction is particularly relevant for business workflows. A legal intake agent that changes which specialist is called may be more consequential than a drafting agent that changes a sentence. A claims-processing router that sends a case to fraud review instead of standard payout may matter more than a summariser downstream. Influence can be semantic, procedural, or both. CAIR at least makes room for that reality.
| CAIR component | What it measures | Operational interpretation |
|---|---|---|
| Agent output perturbation | A plausible alternative output from one activated agent | “What if this agent had said something different?” |
| Final output change | Semantic distance between original and perturbed final answers | “Did the customer-facing or decision-facing result change?” |
| Workflow change | Edit distance between original and perturbed activation flows | “Did the system call different agents or follow a different route?” |
| Offline ranking | Influence scores for representative queries | “Where should we expect risk or leverage for this workflow function?” |
| Online lookup | Nearest representative-query ranking reused at inference | “Which agents should receive attention for this new query?” |
This is why a mechanism-first reading is more useful than a leaderboard reading. CAIR is not merely another evaluation score. It is an operational recipe: perturb, observe, rank, reuse.
Static centrality is the tempting wrong answer
The likely misconception is obvious: if a workflow is a graph, then graph importance should identify agent importance.
The paper tests that temptation directly. It compares CAIR with graph-based baselines: betweenness centrality and eigenvector centrality. These are sensible baselines if one believes structure determines influence. Betweenness asks which nodes sit on many paths. Eigenvector centrality asks which nodes are connected to other important nodes.
Both are useful for networks. Multi-agent workflows, unfortunately, are not just networks. They are executable conversations with branching, tool use, agent roles, intermediate outputs, and sometimes enough autonomy to make the diagram mostly aspirational.
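To make the structural baseline concrete, the sketch below computes unnormalised betweenness centrality on a small directed workflow graph using only the standard library. The graph and agent names are hypothetical, and the paper's baselines may be computed with a different library or normalisation.

```python
from collections import deque
from typing import Dict, List, Tuple

Graph = Dict[str, List[str]]


def shortest_path_counts(graph: Graph, source: str) -> Tuple[Dict[str, int], Dict[str, int]]:
    """BFS from source: distance to each reachable node and number of shortest paths."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                count[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                count[nxt] += count[node]
    return dist, count


def betweenness(graph: Graph) -> Dict[str, float]:
    """Unnormalised betweenness: share of shortest s-t paths passing through each node."""
    nodes = sorted(set(graph) | {n for targets in graph.values() for n in targets})
    info = {n: shortest_path_counts(graph, n) for n in nodes}
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        dist_s, count_s = info[s]
        for t in nodes:
            if t == s or t not in dist_s:
                continue
            for v in nodes:
                if v in (s, t) or v not in dist_s:
                    continue
                dist_v, count_v = info[v]
                if t in dist_v and dist_s[v] + dist_v[t] == dist_s[t]:
                    score[v] += count_s[v] * count_v[t] / count_s[t]
    return score


# Hypothetical router workflow: intake -> router -> one of two branches -> formatter.
workflow_graph = {
    "intake": ["router"],
    "router": ["billing", "tech_support"],
    "billing": ["formatter"],
    "tech_support": ["formatter"],
}
scores = betweenness(workflow_graph)
print(scores)
```

Notice what the function never sees: the query, the agent outputs, or which branch actually ran. The router scores highest purely because of where it sits.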
The authors evaluate CAIR using AAW-Zoo, a generated dataset of 30 agentic AI workflows across three architecture types: sequential, orchestrator, and router. The dataset includes 230 functionalities. Sequential workflows contain chain-like agent calls. Orchestrator workflows use an orchestrator to decide which agent acts next. Router workflows select one of several predefined branches.
Since there is no established ground truth for agent influence, the paper constructs a proxy reference called classical feature importance, or CFI. CFI trains a support vector regression model using 150 examples per functionality, represents agent contributions numerically, embeds final outputs, and applies SHAP-style feature importance. This is not a divine oracle. It is an expensive, behaviour-aware proxy. The paper is careful enough to admit the task has no pre-existing ground truth; the business reader should be equally careful not to treat CFI as gospel wearing a lab coat.
Against that proxy, CAIR performs best overall. In the aggregate results, CAIR reports 29.27% total ranking success, 62.6% precision at three, 80.95% precision at two, 76.1% precision at one, and 62.1% on one-minus-normalised Spearman’s footrule distance. The graph baselines lag substantially overall: betweenness reaches 6.0% total ranking success, 33.6% precision at three, 14.0% precision at two, 43.0% precision at one, and 50.2% on the footrule similarity measure; eigenvector centrality reaches 8.7%, 36.3%, 19.0%, 49.3%, and 41.9%, respectively.
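For readers unfamiliar with these ranking metrics, here is a small sketch of how they are commonly computed. The definitions are assumptions rather than the paper's exact ones: precision at k is taken as top-k set overlap, total ranking success as an exact match of the full ordering, and the footrule distance is normalised by its maximum value of floor(n²/2).

```python
from typing import List


def precision_at_k(predicted: List[str], reference: List[str], k: int) -> float:
    """Share of the reference top-k agents that also appear in the predicted top-k."""
    return len(set(predicted[:k]) & set(reference[:k])) / k


def total_ranking_success(predicted: List[str], reference: List[str]) -> bool:
    """True only when the whole predicted ordering matches the reference."""
    return predicted == reference


def footrule_similarity(predicted: List[str], reference: List[str]) -> float:
    """One minus Spearman's footrule distance, normalised to the range [0, 1]."""
    position = {agent: index for index, agent in enumerate(predicted)}
    distance = sum(abs(index - position[agent]) for index, agent in enumerate(reference))
    max_distance = (len(reference) ** 2) // 2  # largest possible footrule distance
    return 1.0 - distance / max_distance if max_distance else 1.0


# Hypothetical rankings over four agents.
reference = ["router", "drafter", "retriever", "formatter"]
predicted = ["router", "retriever", "drafter", "formatter"]

print(precision_at_k(predicted, reference, 1))       # 1.0
print(precision_at_k(predicted, reference, 2))       # 0.5
print(total_ranking_success(predicted, reference))   # False
print(footrule_similarity(predicted, reference))     # 0.75
```

The example shows why total ranking success is the harshest number in the list: one swapped pair in the middle of the ranking fails it outright while leaving the other measures fairly high.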
The interesting part is not simply that CAIR wins. It is where the graph baselines fail.
In orchestrator and router workflows, CAIR outperforms the graph baselines across the reported metrics. That makes sense. In these systems, the activated path may depend heavily on the input query and intermediate outputs. Static structure cannot see which branch was actually used or how content changed downstream. It knows the plumbing. It does not know what flowed through the pipes.
Sequential workflows are more nuanced. There, eigenvector centrality beats CAIR on some metrics, including total ranking success, precision at one, and footrule similarity. The paper’s explanation is sensible: in a rigid sequential architecture, position in the chain can dominate influence. Later agents often have more direct control over the final answer. When the workflow is less autonomous, structure becomes a better proxy.
That nuance is valuable. CAIR is not saying structure never matters. It is saying structure is insufficient once workflows become flexible, routed, or behaviour-dependent. Which, inconveniently, is exactly where modern agentic systems are heading.
The main evidence is comparative; the appendices are diagnostic
The paper’s evidence has several layers, and they should not be read as equal.
The main evidence is the comparison against baselines on AAW-Zoo. This supports the central claim: a counterfactual, behaviour-aware method aligns better with the CFI proxy than static graph measures, especially in orchestrator and router architectures.
The online-setting test is a feasibility check. CAIR maps runtime queries to representative queries using embeddings and shows, in a gift-suggester example, that queries cluster with the appropriate functionality. This supports the offline-online design, but it is not the same as proving robust online ranking across all messy enterprise traffic.
The guardrail experiment is the business-relevance demonstration. The authors apply toxicity guardrails only to the top-ranked half of agents rather than to every LLM call. Across nine use cases, CAIR-based selective guardrailing achieves an average latency improvement of 27.72% with a 4.76% drop in effectiveness compared with applying guardrails everywhere. CFI gives a similar latency improvement, 26.49%, but with a larger effectiveness drop of 11.12%.
The ablation and sensitivity studies are robustness checks. They test whether CAIR’s components matter and whether the method collapses when parameter values change. The reported finding is that using all components together performs best and that rankings remain relatively stable across a broad parameter range, though extreme parameter choices reduce quality.
The production-ready LangGraph example is an exploratory extension. It uses a publicly available hierarchical multi-agent tutorial with supervisors and worker agents. CAIR’s rankings align with expected rankings across three representative queries, while CFI is not applied because the system’s higher agency makes its activation patterns too unpredictable for that proxy method. This is useful, but it is not a full production validation. It is a bridge example: more realistic than the generated zoo, less convincing than a deployed enterprise audit.
| Evidence item | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| AAW-Zoo comparison | Main evidence | CAIR aligns better than graph baselines with a behaviour-aware proxy across 30 generated workflows | That CAIR identifies true causal responsibility in arbitrary real systems |
| Online query matching | Feasibility check | Representative-query lookup can support low-latency inference-time ranking | That all enterprise queries will map cleanly to known functionalities |
| Toxicity guardrails | Downstream business demonstration | Influence rankings can reduce guardrail latency with limited measured effectiveness loss | That selective guardrails are safe for all harms, policies, or risk thresholds |
| Ablation and sensitivity tests | Robustness/sensitivity analysis | CAIR’s components and parameter settings are not arbitrary decoration | That parameter tuning is irrelevant in production |
| LangGraph hierarchical example | Exploratory extension | CAIR can produce plausible rankings in a more complex public workflow | That the method has been validated on confidential production systems |
This hierarchy matters because otherwise the paper is easy to overclaim. CAIR is promising because it gives a practical ranking mechanism and shows it works better than obvious alternatives in the authors’ testbed. It is not a universal accountability engine. No, the audit committee may not relax now.
Guardrails reveal the economic point
The guardrail experiment is where the paper stops being merely interpretability research and becomes operationally interesting.
The usual approach to workflow safety is brute force: apply guardrails to every LLM call. This is simple, defensible, and expensive. The paper notes that applying guardrails to every call can add substantial latency; the authors’ guardrail suite itself uses LLM calls to detect 11 categories of toxicity and then applies up to three rounds of correction when toxicity is detected.
That design creates a familiar enterprise trade-off. More checks improve coverage but slow the system. Fewer checks preserve responsiveness but may miss harmful outputs. The lazy answer is to say “balance safety and efficiency”, which is what people say when they have no measurement framework.
CAIR gives a more concrete option: apply expensive downstream controls to the agents most likely to influence the final output.
In the experiment, guardrails are applied to the top-ranked half of agents rather than all agents. The result is not perfect preservation of effectiveness. CAIR’s selective deployment shows an average 4.76% effectiveness drop. But it also gives a 27.72% average latency reduction. That is the business pathway: not replacing safety with speed, but using influence ranking to spend safety overhead where it buys the most output control.
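The selection step can be sketched as follows. Everything here is a toy under stated assumptions: the agents, the string-replacing `guardrail`, and the choice to round the half up for odd agent counts are hypothetical, and the real guardrail suite in the paper is LLM-based.

```python
import math
from typing import Callable, List, Set, Tuple

Agent = Callable[[str], str]


def select_guarded_agents(ranking: List[str], fraction: float = 0.5) -> Set[str]:
    """Keep the top-ranked fraction of agents (rounded up, an assumed convention)."""
    keep = math.ceil(len(ranking) * fraction)
    return set(ranking[:keep])


def run_workflow(workflow: List[Tuple[str, Agent]], guarded: Set[str],
                 guardrail: Callable[[str], str], payload: str) -> Tuple[str, int]:
    """Run agents in order, applying the guardrail only to selected agents' outputs."""
    checks = 0
    for name, agent in workflow:
        payload = agent(payload)
        if name in guarded:
            payload = guardrail(payload)
            checks += 1
    return payload, checks


# Toy agents and a toy guardrail, assumed purely for illustration.
workflow = [
    ("greeter", lambda text: text + " | hi"),
    ("planner", lambda text: text + " | plan"),
    ("drafter", lambda text: text + " | stupid draft"),
    ("formatter", lambda text: text + " | formatted"),
]
influence_ranking = ["drafter", "planner", "formatter", "greeter"]  # hypothetical
guarded = select_guarded_agents(influence_ranking)

final, checks = run_workflow(
    workflow, guarded, lambda text: text.replace("stupid", "[redacted]"), "q"
)
print(final)
print(checks, "guardrail calls instead of", len(workflow))
```

The latency saving comes from the skipped calls. The residual risk comes from the same place: anything a low-ranked agent emits goes unchecked.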
This is especially relevant for agentic workflows because the unit of governance changes. In a single LLM app, the obvious control point is the model call. In a multi-agent workflow, the control points multiply. The question becomes: which calls deserve policy checks, logging, human review, red-teaming, or fallback logic?
CAIR suggests that governance should follow influence, not call count.
A practical organisation could use this idea in several ways:
| Business function | How CAIR-style ranking could help | Boundary |
|---|---|---|
| Safety guardrails | Prioritise expensive checks on agents that most affect final outputs | Selective checking still accepts measured residual risk |
| Observability | Focus tracing dashboards on agents that drive downstream changes | Requires access to intermediate outputs |
| Debugging | Investigate high-influence agents first when outputs degrade | Ranking depends on representative workflow functions |
| Compliance review | Identify which agents need stricter documentation and approval | Influence ranking is not legal accountability by itself |
| Cost optimisation | Avoid applying heavyweight controls uniformly across low-impact calls | Savings depend on latency and guardrail architecture |
The business implication is not “install CAIR and your agents are safe”. The implication is better triage. In enterprise AI, triage is underrated because it sounds less heroic than autonomy. It is also where budgets survive contact with reality.
Influence is not the same as responsibility
A central strength of the paper is also its main interpretive trap. CAIR uses counterfactual perturbations, so it is tempting to call the result causal. That needs care.
CAIR ranks influence under a constructed intervention: replace an agent’s observed output with a plausible alternative, then observe changes downstream. This is a useful operational notion of influence. It does not establish moral, legal, or organisational responsibility. Nor does it prove that the same agent caused every bad outcome under every possible query.
The distinction matters. A high CAIR score says: when this agent’s output changes in representative executions, the final output or workflow path changes substantially. That makes the agent a good candidate for monitoring, testing, or guardrailing. It does not say the agent is “to blame” in any broader sense.
This is particularly important in regulated settings. Suppose a credit workflow has an intake agent, a document extraction agent, a policy interpretation agent, and a final recommendation agent. If CAIR ranks the policy agent highest, that may justify deeper audit logs and tighter review for that agent. It does not remove responsibility from the team that designed the workflow, selected the data, approved the policy prompt, or deployed the final decision process. Machines may distribute influence. Organisations still own the system. A disappointing but legally convenient arrangement.
The dataset is useful, but synthetic simplicity matters
AAW-Zoo is a meaningful contribution because the field lacks shared datasets for agentic workflow evaluation. The authors generate 30 workflows across sequential, orchestrator, and router architectures, with 230 functionalities. Each use case includes representative queries, additional generated queries, toxic queries, and metadata.
That is enough to test the mechanism at scale. It is not enough to conclude that CAIR will behave identically inside a bank, insurer, hospital, logistics platform, or procurement workflow.
The paper itself notes that these are simple systems designed for research rather than stand-alone applications. Many enterprise workflows have messier characteristics: persistent memory, tool failures, user profiles, retrieval layers, external APIs, permission boundaries, human approvals, and domain-specific policy constraints. Some agents may call tools with side effects. Some may produce structured outputs whose small changes have large procedural consequences. Some may be hidden behind vendor abstractions where intermediate outputs are inaccessible.
CAIR’s assumptions are therefore not minor implementation details. They define where the method can be used.
First, CAIR needs representative queries. If a functionality is missing from the representative set, the online lookup may reuse a ranking that does not fit the actual query. The authors suggest generating representative queries from a system overview or clustering historical queries. That is reasonable, but it moves part of the method’s quality into workflow taxonomy design.
Second, CAIR needs access to agent outputs. It is not a black-box third-party audit method if the only observable artefact is the final response. For internal builders, that assumption is often acceptable. For buyers evaluating a vendor’s closed agentic system, it may not be.
Third, CAIR has parameters controlling the weighting of output change and workflow change. The sensitivity tests suggest stability across broad ranges, but production use would still need calibration. A router-heavy workflow and a fixed sequential workflow should not necessarily value path change in the same way.
Fourth, the offline cost is real. The paper’s complexity analysis shows that counterfactual replay can be expensive; in an illustrative larger scenario, the estimated offline analysis can stretch to roughly a week. That may be perfectly acceptable for high-value workflows reviewed periodically. It is less attractive for workflows that change daily because someone keeps “improving” the prompt in production, which is to say vandalising it with confidence.
The best use case is governance by influence map
The cleanest business interpretation is to think of CAIR as producing an influence map.
Not an explanation of every token. Not a proof of safety. Not a full audit trail. An influence map tells the operator where changes propagate and which agents deserve disproportionate attention.
That map has obvious uses before deployment. During testing, teams can identify high-influence agents and run deeper evaluations on them. During architecture review, they can ask whether too much influence has concentrated in one brittle component. During policy design, they can decide whether specific agents require stricter prompts, narrower tool permissions, or mandatory output validation.
It also has uses after deployment. If a workflow begins producing poor outputs, influence rankings can guide debugging. If latency becomes unacceptable, the rankings can inform which guardrails are most worth keeping. If compliance asks which parts of the system materially shape final responses, the team has something better than a diagram and a hopeful shrug.
The most mature version of this would combine CAIR-style rankings with other operational signals:
| Signal | What it adds |
|---|---|
| CAIR influence score | Which agents change outputs or workflow paths when perturbed |
| Error frequency | Which agents often produce invalid, unsafe, or low-quality outputs |
| Tool criticality | Which agents can trigger irreversible or high-risk actions |
| User exposure | Which outputs are directly seen by customers or decision-makers |
| Regulatory sensitivity | Which agents touch protected, financial, medical, or legal content |
The resulting governance model would not treat every agent equally. Nor should it. Equal treatment sounds fair until one agent is choosing a compliance path and another is choosing an emoji. Influence-aware governance is simply resource allocation with fewer blindfolds.
What leaders should not take away
The wrong takeaway is that CAIR lets companies apply guardrails only to a few agents and call the workflow safe.
The better takeaway is that selective controls need an evidence basis. CAIR provides one possible basis by ranking agents according to observed counterfactual impact. If the business risk tolerance is low, a 4.76% effectiveness drop may still be unacceptable. If latency is commercially critical and the workflow is low-risk, that trade-off may be attractive. The paper gives a way to measure the trade-off, not a universal answer.
Another wrong takeaway is that graph centrality is useless. In rigid sequential systems, structural position can still be informative. The paper’s own results show that eigenvector centrality performs competitively, and sometimes better, in sequential workflows. The correction is subtler: graph centrality becomes increasingly inadequate as workflows become more autonomous, routed, and behaviour-dependent.
A third wrong takeaway is that CAIR solves explainability for multi-agent systems. It does not explain why an agent produced a particular output. It does not decompose reasoning. It does not verify factuality. It ranks influence on the final workflow result. That is narrower than explainability, but more actionable than much of what currently passes for it.
The workflow boss may not have the biggest title
The paper’s quiet insight is that agentic AI needs operational introspection at the workflow level.
Single-model governance asks whether the model output is safe, accurate, or compliant. Multi-agent governance must ask a harder question: which internal component shaped the result enough to deserve control? Once systems contain routers, planners, specialist workers, critics, retrievers, tool callers, and final synthesis agents, blanket oversight becomes expensive and naive. Static diagrams become polite fiction.
CAIR offers a counterfactual way to rank agent influence. Its evidence is strongest as a research benchmark and an operational prototype: 30 generated workflows, 230 functionalities, clear gains over graph baselines in flexible architectures, and a guardrail demonstration showing meaningful latency reduction with limited effectiveness loss. Its limits are equally clear: representative queries, intermediate-output access, synthetic workflow simplicity, parameter choices, and incomplete production validation.
That balance is exactly why the paper is useful. It does not declare that multi-agent systems are now interpretable. It gives builders a way to ask a sharper question before deploying another beautifully named swarm of semi-supervised interns.
Who really runs the workflow?
Not always the orchestrator. Not always the central node. Not always the agent with the executive title.
The one whose changed output changes everything downstream.
[1] Amit Giloni, Chiara Picardi, Roy Betser, Shamik Bose, Aishvariya Priya Rathina Sabapathy, and Roman Vainshtein, “CAIR: Counterfactual-based Agent Influence Ranker for Agentic AI Workflows,” arXiv:2510.25612, 2025, https://arxiv.org/abs/2510.25612.