Tools are not free.

That sentence sounds too obvious to deserve an article, which is usually a warning that the industry has built several architectures pretending it is false.

A tool-using AI agent can call a search API, query a database, inspect a document, ask another model, trigger a diagnostic pipeline, or run a workflow step. In a clean demo, each call feels like another harmless unit of intelligence. The agent thinks, acts, observes, thinks again, and the audience applauds because the trace looks busy. Busy is often mistaken for capable. Enterprise software has enjoyed this little confusion for decades.

The real world is less sentimental. Every extra query consumes latency. Some tools are expensive. Some tools are congested. Some evidence is ambiguous. Some decisions decay while the agent is still “reasoning.” The question is not whether the agent can collect more information. Of course it can. The question is whether the next unit of information is still worth the cost.

Davide Di Gioia’s paper, “Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents,” makes that question explicit [1]. It proposes the Triadic Cognitive Architecture, or TCA, a controller for tool-using agents that prices three things together: the value of information, the latency of obtaining it, and the congestion cost of using shared tools. The paper’s useful contribution is not another slogan about safer agents. It is a mechanism for deciding when an agent should stop asking and start acting.

That matters because many agent systems still carry a hidden fantasy: more deliberation is better deliberation. TCA says no. More deliberation is only better when its marginal value exceeds its marginal friction. The rest is just expensive hesitation with a nicer trace log.

The mechanism: an agent should buy information, not inhale it

The paper starts from a simple operational problem. A tool-using agent must decide two things at inference time:

  1. Which tool should it query next?
  2. When should it stop querying and execute a decision?

Most agent frameworks answer these questions with heuristics: maximum steps, token budgets, confidence thresholds, or informal prompt rules. These controls are better than nothing, in the same way that a wall clock is better than no brakes. They can limit a runaway loop, but they do not tell the agent whether the next query is economically justified.

TCA replaces that heuristic frame with a state-dependent decision rule. The agent maintains a belief over possible hypotheses and a congestion state representing accumulated load. A tool query is useful if it is expected to reduce uncertainty. It is costly if it adds latency, consumes congested infrastructure, or delays a time-sensitive action.

In the paper’s discrete implementation, the controller estimates the expected entropy reduction from a candidate action, then subtracts a cost term for spatial and temporal friction:

$$ U(a; b_t, t, C_t) = \widehat{\Delta H}(a \mid b_t) - \alpha\left[\lambda_S(C_t + \Omega(a)) + \beta(t + \tau(a))\right]. $$

The notation is more useful than decorative. $\widehat{\Delta H}(a \mid b_t)$ is the estimated value of information: how much uncertainty the tool is expected to remove. $C_t$ is current congestion. $\Omega(a)$ is the load added by action $a$. $\tau(a)$ is the action’s latency. $\lambda_S$ prices spatial or network friction. $\beta$ prices delay. $\alpha$ scales costs into the same utility space as information gain.

The rule is blunt: choose the action with the highest net utility, and stop when even the best available action has non-positive value.
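A minimal sketch of that rule, under loud assumptions: the latencies 45 and 5 are the values quoted in this article for the MRI network and the Hematology Lab, but the entropy-reduction estimates, the loads, and the weights `alpha`, `lam_s`, and `beta` are invented for illustration and are not the paper's settings.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    est_entropy_drop: float  # estimated Delta-H for this tool (hypothetical value)
    load: float              # Omega(a), load added by the call (hypothetical value)
    latency: float           # tau(a), time steps consumed by the call

def net_utility(tool, t, congestion, alpha, lam_s, beta):
    """U = Delta-H estimate - alpha * [lam_s * (C_t + Omega(a)) + beta * (t + tau(a))]."""
    friction = lam_s * (congestion + tool.load) + beta * (t + tool.latency)
    return tool.est_entropy_drop - alpha * friction

def select_or_stop(tools, t, congestion, alpha, lam_s, beta):
    """Return the tool with the highest net utility, or None (stop) if it is non-positive."""
    best = max(tools, key=lambda a: net_utility(a, t, congestion, alpha, lam_s, beta))
    if net_utility(best, t, congestion, alpha, lam_s, beta) <= 0:
        return None
    return best

TOOLS = [
    Tool("mri_network", est_entropy_drop=1.2, load=3.0, latency=45.0),
    Tool("hematology_lab", est_entropy_drop=0.6, load=1.0, latency=5.0),
]
WEIGHTS = dict(alpha=0.02, lam_s=1.0, beta=1.0)  # illustrative weights only

# Early and uncongested: the cheaper tool wins on net utility.
first = select_or_stop(TOOLS, t=0.0, congestion=0.0, **WEIGHTS)
# Later and congested: every option is net-negative, so the rule says stop.
later = select_or_stop(TOOLS, t=40.0, congestion=5.0, **WEIGHTS)
```

With these made-up numbers, a pure information-maximizer would pick the MRI network because its estimated entropy drop is larger, while the priced rule picks the Hematology Lab first and returns `None` once elapsed time and congestion have eaten the remaining value.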

That is the paper’s central business translation. A well-designed agent should not ask, “Can I retrieve more?” It should ask, “Would the expected uncertainty reduction from retrieval justify the latency, congestion, and opportunity cost?” A surprising amount of current agent design still asks the first question, just with better logging.

| TCA component | What it measures | Business analogue | Failure it targets |
| --- | --- | --- | --- |
| Belief state | Distribution over possible answers or diagnoses | Confidence over candidate decisions | Confident hallucination under ambiguity |
| Value of information | Expected entropy reduction from a tool call | Expected decision improvement | Querying tools that do not materially change the decision |
| Temporal friction | Cost of elapsed time or latency | SLA breach, delayed triage, lost opportunity | Infinite or excessive deliberation |
| Spatial friction | Cost of routing through congested tools | API load, queueing, shared-system contention | Tool spam and system congestion |
| Stopping boundary | Point where continuation has no positive value | Act, defer, escalate, or close the case | Arbitrary max-step rules |

Notice the structure. TCA is not merely a “stop early” policy. It is a joint controller for selection and stopping. That distinction becomes important in the experiments.

Why the HJB layer is more than mathematical perfume

The paper presents an idealized continuous-time formulation using nonlinear filtering and Hamilton–Jacobi–Bellman optimal stopping. This may look heavier than the problem deserves. After all, could we not just set a threshold on confidence and move on with our lives?

Not quite.

A confidence threshold sees only uncertainty. It does not know whether the next available evidence source takes five seconds or forty-five. It does not know whether that source is congested. It does not know whether delay itself is destroying the value of a correct decision. TCA’s HJB-inspired framing matters because it treats deliberation as a trajectory through state space, not as a static confidence score.

The paper’s continuous formulation defines a value function over belief, congestion, and time. The stopping region is where continuing has no value. Under the paper’s assumptions, the stopping boundary is monotone in congestion and time: once the agent reaches a state where it should stop, later time or higher congestion cannot magically make continued querying attractive again.

That “absorbing” property is more than a theorem-shaped ornament. Operationally, it prevents a familiar agent pathology: oscillating between “I need more information” and “I should answer now” because internal utility estimates are unstable. Once the cost side has crossed the boundary, the agent should not re-enter exploration mode unless the environment has genuinely changed in a modeled way.
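For the discrete rule, a first-order version of the same monotonicity can be read directly off the utility. This is an illustration on the myopic rule, not the paper's continuous-time argument. It holds the belief $b_t$ fixed, so that $\widehat{\Delta H}$ does not change, and it assumes the weights $\alpha$, $\lambda_S$, and $\beta$ are non-negative, which the article does not state explicitly:

$$ \frac{\partial U}{\partial C_t} = -\alpha \lambda_S \le 0, \qquad \frac{\partial U}{\partial t} = -\alpha \beta \le 0. $$

Under those assumptions, if $\max_a U(a; b_t, t, C_t) \le 0$ at some state, then the same maximum is still non-positive at any $(t', C')$ with $t' \ge t$ and $C' \ge C_t$. Only new evidence that changes the belief can reopen the case for querying.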

The implemented controller is not solving the full HJB equation. The paper is careful here. In the simulation, TCA uses a rollout-based value-of-information approximation and a myopic stopping rule. The theory supplies the control envelope; the implementation uses a computable first-order approximation.

That is a sensible compromise. Solving a full HJB problem for a real enterprise agent with dozens of tools, streaming observations, hidden states, and nonstationary infrastructure would be an efficient way to convert a governance idea into a consulting invoice. The paper instead offers a rule that can actually be implemented: estimate marginal information value, subtract measured costs, stop when the best continuation is not worth it.

The medical simulation shows cheaper decisions, not smarter diagnoses

The main experiment is the Emergency Medical Diagnostic Grid, or EMDG. It is not a high-fidelity hospital benchmark. The paper explicitly designs it as a synthetic environment to isolate three pressures: ambiguity, tool latency, and congestion. The agent begins with a uniform belief over five critical pathologies. It can query hospital subsystems with different information profiles, loads, and latencies. Patient viability decays with elapsed time.

This setup matters because accuracy is intentionally saturated. Both the greedy ReAct-style baseline and TCA reach 100% accuracy. The question is not whether one agent can diagnose and the other cannot. The question is how much outcome value each burns while getting there.

The answer is not subtle.

| Agent | Time | Patient viability | Entropy | Accuracy | Interpretation |
| --- | --- | --- | --- | --- | --- |
| ReAct-style greedy | 114.5 ± 3.1 | 56.76 ± 0.89 | 0.0386 ± 0.0038 | 1.00 | More certainty, but purchased with damaging delay |
| Triadic Control | 14.5 ± 0.4 | 93.03 ± 0.19 | 0.1714 ± 0.0119 | 1.00 | Slightly less certainty, far better preserved outcome |

The greedy agent maximizes expected entropy reduction without pricing latency or congestion. At step zero, it chooses the MRI network in 100% of seeds. MRI is informative, but slow: the paper uses a latency of 45 time steps for MRI versus 5 for the Hematology Lab. TCA chooses Hematology Lab in 100% of seeds because its lower latency and load make it better on net utility.

This is the central lesson of the experiment. The greedy agent obtains lower terminal entropy and higher total information gain. In a traditional benchmark table, that might look like better reasoning. In the simulated environment, it is worse decision-making because the patient’s viability decays while the agent is still improving an already adequate diagnosis.

The practical analogy is easy. A compliance agent that waits for a deep audit when a lightweight check would justify escalation may look thorough and still be operationally harmful. A cybersecurity triage agent that runs full forensics before containment may produce a beautiful report after the attacker has already enjoyed the scenery. A customer-support agent that calls every database before responding may be “grounded” in the same way a stone is grounded.

Useful reasoning is not maximum information. It is enough information before the decision expires.

Stopping alone does not solve the problem

One of the better parts of the paper is the purposive stopping comparison. This test addresses an obvious objection: maybe TCA wins simply because it stops earlier. Perhaps all we need is a confidence threshold or a fixed query budget.

The paper tests two baselines that keep the greedy MRI-first selection policy but add stopping rules calibrated to TCA’s behavior:

| Policy | Selection logic | Stopping logic | Time | Viability | Accuracy | What the test means |
| --- | --- | --- | --- | --- | --- | --- |
| ReAct-style greedy | Highest expected information gain | No friction-aware stop | 114.5 ± 3.1 | 56.76 ± 0.89 | 1.00 | Baseline over-querying |
| Entropy threshold | Greedy MRI-first | Stop at TCA-like entropy | 102.6 ± 2.8 | 60.17 ± 0.80 | 1.00 | Better, but still too slow |
| Fixed-K | Greedy MRI-first | Stop after 3 queries | 135.0 ± 0.0 | 50.92 ± 0.00 | 1.00 | Matching query count can make things worse |
| TCA | Cost-aware selection | Net-utility stopping | 14.5 ± 0.4 | 93.03 ± 0.19 | 1.00 | Selection and stopping work together |

This is the paper’s most useful corrective to a shallow reading. The win does not come from telling the agent to stop. It comes from pricing the tool choice before the agent starts wasting time.

The entropy-threshold baseline improves viability only marginally, from 56.76 to 60.17. It still opens with MRI, so it still pays the large latency cost. Fixed-K performs worse than unconstrained ReAct because forcing exactly three MRI-style queries burns even more time. That result is almost comically useful: a “disciplined” query budget can be disciplined in precisely the wrong direction.
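The Fixed-K time is consistent with simple arithmetic, assuming all three forced queries go to the MRI network at the quoted latency of 45 time steps. The article does not break the 135.0 figure down, so this is a reading of the number, not a reported decomposition:

$$ T_{\text{Fixed-}K} = K \cdot \tau(\text{MRI}) = 3 \times 45 = 135. $$

Under that reading, the zero variance in the Fixed-K row also makes sense: nothing about the run length depends on the seed.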

The difference is that TCA does not separate the questions “what should I query?” and “when should I stop?” It treats them as one control problem. In high-latency environments, that joint framing is not a theoretical luxury. It is the result.

The ablations and sensitivity tests are not a second thesis

The paper includes ablations, parameter sweeps, continuation-value tests, congestion-decay robustness, action-space scaling, and a black-box LLM appendix. These should not be read as independent claims of real-world deployment readiness. They have narrower purposes.

The ablation study removes pieces of the controller: no stop, no spatial term, no temporal term, and no congestion feedback. Its role is diagnostic. It asks whether the full triadic structure matters or whether one piece is doing all the work. The paper reports that the full controller achieves the highest viability while minimizing deliberation time, supporting the claim that space, time, and stopping should be priced jointly.

The parameter sensitivity section asks a different question: is the result brittle to particular hyperparameters? The answer is mostly reassuring within the tested range. Increasing the cost scale $\alpha$ makes the agent stop earlier, reducing time from 44.3 to 11.5 and increasing viability from 80.3 to 94.4. Varying the temporal decay $\beta$ has a non-monotonic effect because it changes both the environment’s physical decay and the temporal friction term. Increasing the spatial weight $\lambda_S$ shortens deliberation from 22.2 to 13.3 and improves viability from 89.7 to 93.6, with diminishing returns once the controller reliably avoids costly routes.

These tests support the mechanism. They do not tell a product team what $\alpha$, $\beta$, or $\lambda_S$ should be in a real workflow. That calibration would depend on SLA penalties, API costs, user tolerance, risk class, and whether the decision is reversible. The paper gives the control grammar; the enterprise still has to fill in the exchange rate between seconds, dollars, uncertainty, and harm.

The security simulation shows transfer across domains, not universal proof

The second environment, the Network Security Triage Grid, shifts from medical diagnosis to security incident response. The agent identifies one of five threat categories and decides on containment. The tool profiles change: QuickScan is low-latency and lower-gain, while FullForensics is high-latency and higher-gain. The temporal urgency is lower than in the medical setting, and the environment includes a congestion shock at step two.

The pattern repeats.

| Environment | Agent | Time | Resource metric | Entropy | Accuracy |
| --- | --- | --- | --- | --- | --- |
| EMDG | ReAct-style greedy | 114.5 ± 3.1 | 56.76 ± 0.89 | 0.0386 ± 0.0038 | 1.00 |
| EMDG | Triadic Control | 14.5 ± 0.4 | 93.03 ± 0.19 | 0.1714 ± 0.0119 | 1.00 |
| NSTG | ReAct-style greedy | 149.7 ± 4.2 | 64.08 ± 0.80 | 0.0376 ± 0.0037 | 1.00 |
| NSTG | Triadic Control | 9.5 ± 0.3 | 97.18 ± 0.08 | 0.3230 ± 0.0213 | 1.00 |

In NSTG, TCA selects QuickScan in 100% of seeds at step zero, while ReAct selects FullForensics in 100%. TCA preserves system integrity by acting roughly 16 times faster, with equal accuracy. The integrity gap is 33.1 points, slightly smaller than the 36.3-point viability gap in EMDG because the security environment is modeled with lower temporal urgency.
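The ratios and gaps quoted here can be rechecked from the table means. The snippet below is only an arithmetic check on the numbers printed in this article; the dictionary layout and function names are hypothetical conveniences.

```python
# Mean (time, resource metric) per environment and agent, copied from the table above.
RESULTS = {
    ("EMDG", "react"): (114.5, 56.76),
    ("EMDG", "tca"): (14.5, 93.03),
    ("NSTG", "react"): (149.7, 64.08),
    ("NSTG", "tca"): (9.5, 97.18),
}

def speedup(env):
    """How many times faster TCA reaches a decision than the greedy baseline."""
    return RESULTS[(env, "react")][0] / RESULTS[(env, "tca")][0]

def resource_gap(env):
    """TCA's advantage in the preserved resource (viability or integrity), in points."""
    return RESULTS[(env, "tca")][1] - RESULTS[(env, "react")][1]

for env in ("EMDG", "NSTG"):
    print(f"{env}: {speedup(env):.1f}x faster, gap {resource_gap(env):.1f} points")
# prints:
# EMDG: 7.9x faster, gap 36.3 points
# NSTG: 15.8x faster, gap 33.1 points
```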

This is useful evidence for domain transfer at the level of mechanism. The same cost-aware logic works in two structurally distinct synthetic environments. It is not proof that TCA will automatically perform well in real SOC operations, where observations are messier, adversaries adapt, and “containment” may have its own false-positive cost. But it does support the paper’s main claim: when tool latency and resource decay matter, a greedy information-maximizer is the wrong baseline to worship.

The appendix quietly tells product teams where the engineering pain will live

Appendix B expands the action space to 5, 10, and 20 tools using randomized configurations. This matters because the main environments are small: two non-trivial tools plus Stop. In the scaling sweep, TCA still outperforms ReAct across all tested action-space sizes. The viability gaps are smaller than in the two-tool EMDG case, roughly 10 to 15 points, because random tool sets sometimes give the greedy baseline a less disastrous high-gain tool. Still, TCA remains ahead.

More interesting is the “non-trivial step-zero” statistic. TCA selects a non-minimum-latency tool at step zero in 76% of configurations for 5 tools, 85% for 10 tools, and 90% for 20 tools. That matters because it shows the controller is not simply choosing the cheapest tool. It chooses a more costly tool when information value justifies the extra latency. In other words, this is cost-aware optimization, not austerity cosplay.

Appendix C gives the discrete controller pseudocode. It is refreshingly practical: clone the belief state, sample observations, estimate entropy drops, compute net utility, select the best action, and stop if its value is non-positive. This is the part a system architect can actually translate into an inference-time middleware layer.
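The steps listed above can be sketched in a few dozen lines. This is an illustrative reconstruction from the description in this article, not the paper's Appendix C code. The noisy-channel tool model, the `hit` probabilities, the loads, the weights, and the absence of congestion decay are all assumptions; only the five-hypothesis uniform prior and the latencies 45 and 5 come from the text.

```python
import math
import random

def entropy(belief):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in belief if p > 0)

def bayes_update(belief, likelihood, obs):
    """Posterior over hypotheses; likelihood[h][o] = P(observation o | hypothesis h)."""
    post = [p * likelihood[h][obs] for h, p in enumerate(belief)]
    z = sum(post)
    return [p / z for p in post] if z > 0 else list(belief)

def rollout_voi(belief, likelihood, rng, n_rollouts=200):
    """Monte Carlo estimate of the expected entropy drop from one query."""
    h0, total = entropy(belief), 0.0
    for _ in range(n_rollouts):
        clone = list(belief)                                  # clone the belief state
        h = rng.choices(range(len(clone)), weights=clone)[0]  # imagined true hypothesis
        obs = rng.choices(range(len(likelihood[h])), weights=likelihood[h])[0]
        total += h0 - entropy(bayes_update(clone, likelihood, obs))
    return total / n_rollouts

def choose_action(belief, tools, t, congestion, alpha, lam_s, beta, rng):
    """Return the best tool name, or None when no tool has positive net utility."""
    scored = {}
    for name, spec in tools.items():
        cost = lam_s * (congestion + spec["load"]) + beta * (t + spec["latency"])
        scored[name] = rollout_voi(belief, spec["likelihood"], rng) - alpha * cost
    best = max(scored, key=scored.get)
    return best if scored[best] > 0 else None

def run_episode(true_h, tools, alpha, lam_s, beta, seed=0, max_steps=50):
    rng = random.Random(seed)
    belief = [0.2] * 5            # uniform belief over five hypotheses
    t, congestion, trace = 0.0, 0.0, []
    for _ in range(max_steps):    # safety cap only; the utility rule decides when to stop
        name = choose_action(belief, tools, t, congestion, alpha, lam_s, beta, rng)
        if name is None:
            break
        spec = tools[name]
        obs = rng.choices(range(5), weights=spec["likelihood"][true_h])[0]
        belief = bayes_update(belief, spec["likelihood"], obs)
        t += spec["latency"]
        congestion += spec["load"]  # assumption: no congestion decay in this sketch
        trace.append(name)
    return belief, t, trace

def noisy_channel(k, hit):
    """Hypothetical tool model: reports the true hypothesis with probability `hit`."""
    return [[hit if o == h else (1 - hit) / (k - 1) for o in range(k)] for h in range(k)]

TOOLS = {
    "mri_network": {"likelihood": noisy_channel(5, 0.9), "load": 3.0, "latency": 45.0},
    "hematology_lab": {"likelihood": noisy_channel(5, 0.6), "load": 1.0, "latency": 5.0},
}
belief, elapsed, trace = run_episode(true_h=2, tools=TOOLS, alpha=0.02, lam_s=1.0, beta=1.0)
```

With these invented parameters the first query goes to the Hematology Lab, because the MRI's larger expected entropy drop does not cover its latency and load. Later steps depend on sampled rollouts, so the rest of the trace varies with the seed. In a real middleware layer, the `likelihood` tables would be replaced by whatever observation model or learned surrogate the team can actually defend.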

Appendix D then tries a black-box LLM instantiation on a fictional closed-world corpus. The paper uses 25 invented documents and 60 questions so GPT-4.1 cannot rely on memorized facts. Belief uncertainty is proxied by self-consistency entropy over five sampled answers; VOI is proxied by how much retrieval reduces that entropy. TCA retrieves selectively, using 63% of AlwaysRetrieve’s retrieval budget, acts on 72% of questions, and defers on 28%.

The results are deliberately modest. AlwaysRetrieve has higher exact match and higher expected utility in that setup: EM 0.817 and utility +5.33 versus TCA’s EM 0.500 and utility +2.20. The paper does not pretend otherwise. The purpose of the appendix is not to show that TCA beats retrieval. It is to show that the stopping principle can be implemented around a black-box LLM using observable proxies.

That distinction is important. The LLM appendix is an implementation demonstration, not a performance benchmark. It also exposes a real limitation: self-consistency can mistake confident ignorance for certainty. Five act-without-retrieve cases had zero proxy entropy because the model gave identical sampled answers, and all five were incorrect. The proxy failed; the abstract stopping rule did not magically save it. A controller is only as good as the belief signal it prices. Shocking, I know: the meter matters.
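The failure mode is easy to reproduce on paper. The sketch below computes a self-consistency entropy over sampled answers and a retrieval VOI proxy as the drop in that entropy. The function names, the string normalization, and the example answers are hypothetical, and the paper's exact proxy may differ in detail.

```python
import math
from collections import Counter

def self_consistency_entropy(answers):
    """Shannon entropy (nats) of the empirical distribution over sampled answers."""
    counts = Counter(a.strip().lower() for a in answers)
    n = len(answers)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

def retrieval_voi_proxy(answers_before, answers_after):
    """Proxy VOI: entropy of samples without retrieval minus entropy with retrieval."""
    return self_consistency_entropy(answers_before) - self_consistency_entropy(answers_after)

# Invented answers to a question about a fictional corpus.
confidently_wrong = ["Port Halvern"] * 5                    # five identical samples
mixed = ["Port Halvern", "Port Halvern", "Kessel Bay", "Port Halvern", "Orlin"]

zero_signal = self_consistency_entropy(confidently_wrong)   # 0: looks "certain"
some_signal = self_consistency_entropy(mixed)               # > 0: disagreement is visible
```

Five identical samples give zero entropy whether or not the shared answer is correct, so a controller that prices this signal sees no reason to retrieve. Agreement measures consistency, not truth.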

How this maps to business systems

The business relevance of TCA is strongest in workflows where information acquisition is costly, heterogeneous, and time-sensitive. That includes RAG pipelines, incident triage, medical-support workflows, fraud investigation, compliance review, procurement analysis, and multi-agent operations over shared tools.

The controller would sit between the agent planner and the tool layer. Instead of allowing the agent to call tools until a prompt rule says “enough,” the controller would score each candidate tool call before execution.

| Deployment layer | TCA-style question | Example measurement |
| --- | --- | --- |
| Belief estimation | How uncertain is the current decision? | Class probabilities, calibrated confidence, ensemble disagreement, retrieval answer entropy |
| VOI estimation | How much would this tool likely reduce uncertainty? | Rollout simulation, historical lift, retrieval uncertainty reduction, learned surrogate |
| Temporal pricing | What does delay cost? | SLA penalties, churn probability, incident spread rate, time-to-treatment loss |
| Spatial pricing | What does tool access cost under current load? | API latency, queue depth, token cost, rate-limit pressure, downstream service congestion |
| Stop/action rule | Is the best next query still worth it? | Act, retrieve, defer, escalate, or ask human |

For business leaders, the key shift is from “agent intelligence” to “agent runtime governance.” The question becomes less mystical and more measurable: what is the marginal return on one more API call?

For product teams, this suggests a practical roadmap:

  1. Define decision states where uncertainty can be represented, even imperfectly.
  2. Instrument real latency, cost, and congestion for each tool.
  3. Estimate whether each tool historically changes decisions, not merely whether it adds context.
  4. Create stop, act, defer, and escalate actions as first-class outcomes.
  5. Calibrate cost weights by workflow risk, not by developer vibes.
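Step 3 is the one most teams skip, and it needs very little machinery to start. A minimal sketch, assuming a hypothetical log schema with `tool`, `decision_before`, and `decision_after` fields and made-up tool names: count how often the leading decision actually changed after each tool call.

```python
from collections import defaultdict

def decision_change_rate(log):
    """Per tool, the fraction of calls after which the leading decision changed."""
    calls, changed = defaultdict(int), defaultdict(int)
    for rec in log:
        calls[rec["tool"]] += 1
        if rec["decision_before"] != rec["decision_after"]:
            changed[rec["tool"]] += 1
    return {tool: changed[tool] / calls[tool] for tool in calls}

# Invented log records; the schema and tool names are assumptions for illustration.
LOG = [
    {"tool": "kb_search", "decision_before": "refund", "decision_after": "escalate"},
    {"tool": "kb_search", "decision_before": "refund", "decision_after": "refund"},
    {"tool": "kb_search", "decision_before": "deny", "decision_after": "refund"},
    {"tool": "kb_search", "decision_before": "deny", "decision_after": "deny"},
    {"tool": "crm_lookup", "decision_before": "refund", "decision_after": "refund"},
    {"tool": "crm_lookup", "decision_before": "deny", "decision_after": "deny"},
    {"tool": "crm_lookup", "decision_before": "escalate", "decision_after": "escalate"},
]

rates = decision_change_rate(LOG)
# rates == {"kb_search": 0.5, "crm_lookup": 0.0}
```

A tool with a change rate near zero may still be worth calling for audit or compliance reasons, but it is hard to argue that it carries much information value for the decision itself. This rate is a crude stand-in for VOI, not an estimate of expected entropy reduction.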

This is not marketing copy for replacing humans. In many enterprise workflows, the correct terminal action will be “defer” or “escalate,” especially when uncertainty remains high and tool value is exhausted. TCA’s value is not that it forces automation. It gives automation a principled way to stop pretending that another query will fix everything.

Where the paper’s evidence stops

The paper is strongest as a framework and controlled demonstration. Its boundaries are clear.

First, the main environments are synthetic. That is not a flaw by itself; synthetic environments are useful when the purpose is to isolate mechanism. But it means the reported 30-plus-point resource gains should not be pasted into a sales deck as expected enterprise ROI. They are controlled evidence that the mechanism can matter, not a deployment forecast.

Second, the belief states are explicit and categorical in the main simulations. Real LLM agents often operate with implicit, poorly calibrated uncertainty. Turning model logits, sampled answers, retriever scores, or ensemble disagreement into a trustworthy belief state is hard. The black-box LLM appendix shows one bridge, but also shows why a weak uncertainty proxy can fail.

Third, the main action spaces are small, although the appendix partially addresses this with randomized action-space scaling. Real enterprises may have hundreds of tools, nested workflows, tool dependencies, permissions, and side effects. Rollout-based VOI estimation can become expensive. The paper suggests learned surrogates, cached rollouts, candidate screening, and analytic approximations. Those are reasonable directions, but they are engineering work, not a solved theorem.

Fourth, the framework models a single agent interacting with shared resources. Congestion hints at multi-agent interference, but strategic interaction among multiple agents drawing from the same tool pool is not fully modeled. In actual organizations, one agent’s “cheap” query can become expensive when 500 sibling agents discover the same clever trick at 9:01 a.m.

These limitations do not weaken the paper’s main point. They locate it. TCA is not a finished enterprise agent platform. It is a useful decision-theoretic skeleton for inference-time control.

The real lesson: friction is a design feature

The industry has spent years rewarding agents that produce longer traces, call more tools, and maintain the theatrical posture of deep thought. TCA pushes in the opposite direction. It says an agent should feel resistance. It should know that time passes, tools clog, uncertainty has structure, and a perfect answer delivered too late is not perfect.

That is a more mature view of autonomy. Not because it sounds philosophical, but because it changes what the system optimizes. The agent is no longer trying to maximize reasoning activity. It is trying to maximize decision value under constraint.

There is a small insult hidden in that idea. Many supposedly advanced agent systems are still managed with the equivalent of “try a few more steps and see what happens.” TCA offers a cleaner alternative: price the next step. If the price is not worth paying, stop.

Friction is not the enemy of intelligence. In deployed systems, friction is often what turns intelligence from performance art into judgment.

Cognaptus: Automate the Present, Incubate the Future.


  1. Davide Di Gioia, “Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents,” arXiv:2603.30031v3, 2026. https://arxiv.org/abs/2603.30031