Bits, Bets, and Budgets: When Agents Should Walk Away

Budget is not an afterthought

Budget is usually treated as the boring part of agent design. The exciting part is the agent: planning, calling tools, trying strategies, revising itself, and occasionally behaving like a junior analyst who has discovered both confidence and the corporate credit card.

But in real automation, budget is not boring. Budget is the boundary between useful autonomy and expensive wandering.

The practical question is not simply whether an agent might solve a task. Given enough retries, enough tools, and enough tolerance for chaos, many tasks remain theoretically approachable. The sharper question is whether the agent should attempt the task under a finite resource constraint. How much uncertainty must be removed? How much information does each action actually provide? How much does each action cost? And at what point should the agent stop pretending that persistence is strategy?

The paper The Agent Capability Problem: Predicting Solvability Through Information-Theoretic Bounds proposes a framework for answering that question before the agent has already burned the budget.¹ It frames problem-solving as information acquisition. A task requires some amount of information to identify an acceptable solution. Each action reveals some information. Each action has a cost. From those three pieces, the framework derives an effective cost estimate and uses it as a solvability test.

That sounds almost too neat. It is not a magic dashboard for certifying LLM agents. The paper’s contribution is more specific and more useful: it gives a language for agent triage. Instead of asking, “Can this agent try harder?”, the Agent Capability Problem (ACP) framework asks, “Is this task structured so that trying is likely to buy enough information per unit cost?”

That is a healthier question. Less heroic, admittedly. But heroes are expensive.

ACP turns agent feasibility into an information budget

The Agent Capability Problem starts with a simple setup. There is a hypothesis space $\Theta$, containing possible candidate solutions. A subset $\Theta_{\text{goal}} \subseteq \Theta$ contains acceptable solutions. An agent takes actions $a \in A$, observes outcomes, and pays a cost $C_s(a)$ for doing so.

The key move is to stop treating task success as a vague property of “intelligence” and instead describe it through uncertainty reduction.

The paper defines the total information requirement as the entropy of the goal indicator:

$$ I_{\text{total}} = H(1(\theta \in \Theta_{\text{goal}})). $$

In plain terms, this measures how much uncertainty the agent must remove to determine whether a candidate belongs to the acceptable goal set. If acceptable solutions are easy to identify or abundant, the information requirement is smaller. If acceptable solutions are rare, hidden, or hard to distinguish from bad candidates, the task becomes more expensive in informational terms.

Each action then has an information yield:

$$ I_s(a) = I(1(\theta \in \Theta_{\text{goal}}); y \mid a), $$

the mutual information between the action’s observed outcome and whether a candidate satisfies the goal. A good action narrows the search. A bad action produces noise with a receipt.
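To make "noise with a receipt" concrete, here is a minimal sketch of the information yield of a single yes/no check. It assumes a deliberately simple noise model that is not taken from the paper: the check reports whether the candidate is in the goal set, but the report is flipped with probability `error_rate`. The function names and the noise model are illustrative assumptions.

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_yield(p_goal: float, error_rate: float) -> float:
    """Mutual information I(G; Y) in bits.

    G ~ Bernoulli(p_goal) is the goal indicator. Y reports G but is flipped
    with probability error_rate (a hypothetical noise model for illustration).
    """
    p_yes = p_goal * (1 - error_rate) + (1 - p_goal) * error_rate
    # I(G; Y) = H(Y) - H(Y | G), and H(Y | G) is the entropy of the flip.
    return binary_entropy(p_yes) - binary_entropy(error_rate)

clean = information_yield(0.5, 0.0)    # a perfectly reliable check
noisy = information_yield(0.5, 0.2)    # a check that is wrong 20% of the time
useless = information_yield(0.5, 0.5)  # a coin flip: pure noise

print(f"clean={clean:.3f} bits, noisy={noisy:.3f} bits, useless={useless:.3f} bits")
```

Under this model a check that is wrong half the time yields zero bits no matter how much it costs, which is the formal version of paying for noise.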

The agent should therefore choose actions by information efficiency:

$$ a^\ast = \arg\max_{a \in A} \frac{\mathbb{E}[I_s(a)]}{C_s(a)}. $$

This is the mechanism at the center of the paper. The agent is not merely searching. It is buying bits. Some actions are cheap and informative. Some are expensive and decorative. ACP says the agent should prefer the former, which is less glamorous than “emergent reasoning” but far more useful for operations.
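The selection rule is easy to sketch once each action carries an estimate of its expected yield and its cost. The action names and numbers below are hypothetical, chosen only to show the ranking; nothing here comes from the paper's experiments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    expected_bits: float  # estimate of E[I_s(a)], in bits
    cost: float           # C_s(a), in any consistent unit

    @property
    def efficiency(self) -> float:
        """Expected bits bought per unit cost."""
        return self.expected_bits / self.cost

def pick_action(actions):
    """Return the action that maximizes expected bits per unit cost."""
    usable = [a for a in actions if a.cost > 0]
    if not usable:
        raise ValueError("no action with positive cost")
    return max(usable, key=lambda a: a.efficiency)

# Hypothetical tool menu for illustration.
actions = [
    Action("database_lookup", expected_bits=0.40, cost=1.0),
    Action("web_search", expected_bits=0.60, cost=4.0),
    Action("human_review", expected_bits=0.90, cost=30.0),
]
best = pick_action(actions)
print(best.name, round(best.efficiency, 3))
```

Note that the most informative action in absolute terms (human review) ranks last here, because its yield does not keep pace with its cost.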

The effective cost is then:

$$ C_{\text{effective}} = \frac{I_{\text{total}}}{\bar{I}_s} \times \bar{C}_s. $$

If $C_{\text{effective}} \leq B$, where $B$ is the available budget, the task is predicted to be feasible under the modeled policy. If not, the agent should reconsider. “Reconsider” is doing a lot of work here: it may mean abandon the task, relax the goal, choose better tools, gather cheaper evidence, or redesign the workflow.
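As a small worked example of the test, the sketch below plugs in made-up numbers: one bit of required information, an average yield of 0.05 bits per action, and an average cost of 2 units per action. The function names and values are assumptions for illustration.

```python
import math

def effective_cost(i_total: float, mean_gain: float, mean_cost: float) -> float:
    """C_effective = (I_total / mean gain per action) * mean cost per action."""
    if mean_gain <= 0:
        return math.inf  # no information per action means no finite budget suffices
    return i_total / mean_gain * mean_cost

def is_feasible(i_total: float, mean_gain: float, mean_cost: float, budget: float) -> bool:
    """Pre-flight check: does the effective cost fit within the budget?"""
    return effective_cost(i_total, mean_gain, mean_cost) <= budget

cost = effective_cost(i_total=1.0, mean_gain=0.05, mean_cost=2.0)
print(f"effective cost: {cost:.1f}")  # about 20 actions at 2 units each
print("budget 50:", is_feasible(1.0, 0.05, 2.0, budget=50.0))
print("budget 30:", is_feasible(1.0, 0.05, 2.0, budget=30.0))
```

The same task flips from feasible to infeasible purely on budget, with no change in the agent.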

This is the first important business translation. ACP is not mainly a solver. It is a pre-flight check.

The lower bound is the discipline, not the promise

The tempting misreading is that $C_{\text{effective}}$ tells you what the agent will spend. That is not quite right. The paper proves a more disciplined claim: under its stochastic model, $C_{\text{effective}}$ lower-bounds the expected search cost.

The proof treats information accumulation as a sequential stopping problem. Each action contributes an information increment $X_i$. The agent stops when accumulated information reaches $I_{\text{total}}$. The paper assumes conditional independence, bounded moments, and diminishing returns: early actions may remove large chunks of uncertainty, while later actions often refine what remains. The expected gain sequence satisfies:

$$ \mu_1 \geq \mu_2 \geq \cdots \geq \mu_{\inf} > 0. $$

The core two-sided result is:

$$ C_{\text{effective}} \leq \mathbb{E}[C] \leq C_s \left( \frac{I_{\text{total}}}{\mu_{\inf}} + \frac{M_2}{\mu_{\inf}^2} \right), $$

where $M_2$ bounds the second moment of information increments.

The lower bound matters because it kills a common management fantasy: “Maybe the agent will be clever and do it much cheaper.” Under the framework’s assumptions, the information requirement cannot be wished away. If the task requires a certain amount of uncertainty reduction, and the best available actions reveal information slowly, there is no free lunch hiding inside a larger prompt.

The upper bound matters for the opposite reason. Even if the initial estimate looks manageable, real search can overshoot. Information gains may deteriorate. Observations may be noisy. Later steps may provide less information than early ones. The paper captures this with the overshoot term, tied to $M_2/\mu_{\inf}^2$.

That term is where the framework becomes operationally interesting. It says the reserve budget should not be a flat percentage. It should grow when information gains are noisy, weak, or likely to decay.
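A rough sketch of how the two bounds move, using invented numbers. It assumes a constant cost per action, uses the average gain for the lower bound and the worst-case gain $\mu_{\inf}$ for the upper bound, and treats `second_moment` as the bound $M_2$; the parameter names are illustrative.

```python
def cost_bounds(i_total, mean_gain, worst_gain, second_moment, cost_per_action):
    """Return (lower, upper) bounds on expected cost.

    lower = (I_total / mean_gain) * cost
    upper = cost * (I_total / worst_gain + M_2 / worst_gain**2)
    Assumes a constant cost per action (a simplification for this sketch).
    """
    lower = i_total / mean_gain * cost_per_action
    upper = cost_per_action * (i_total / worst_gain + second_moment / worst_gain ** 2)
    return lower, upper

# Same task, same average gain, but the second scenario has noisier increments.
lower, upper = cost_bounds(1.0, mean_gain=0.10, worst_gain=0.05,
                           second_moment=0.02, cost_per_action=2.0)
_, noisy_upper = cost_bounds(1.0, mean_gain=0.10, worst_gain=0.05,
                             second_moment=0.05, cost_per_action=2.0)

print(f"lower={lower:.1f}, upper={upper:.1f}, reserve={upper - lower:.1f}")
print(f"noisier increments: upper={noisy_upper:.1f}")
```

With these numbers the reserve is larger than the lower bound itself, and it grows again when the increments get noisier. A flat percentage margin would miss both effects.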

A simple way to read the mechanism is this:

Quantity	What it means	Operational interpretation
$I_{\text{total}}$	Information required to identify an acceptable solution	How hard the task is to specify, distinguish, or verify
$I_s$	Information gained per action	How useful each tool call, query, test, or observation is
$C_s$	Cost per action	Token cost, API cost, latency, human review, compute, or risk exposure
$C_{\text{effective}}$	Optimistic effective cost	The minimum realistic spend before safety margin
Overshoot term	Cost inflation from noisy or decaying information gains	The reserve required when search gets harder over time

This is the paper’s most useful conceptual contribution. It separates task difficulty from agent effort. A workflow can fail because the agent is weak. But it can also fail because the task offers too little information per action. ACP gives us a way to tell the difference.

High-confidence budgets are about reserves, not bravado

The paper also derives a high-probability bound using Hoeffding’s inequality. When information increments are bounded, the framework can estimate a number of steps sufficient to finish with probability at least $1-\delta$. The resulting cost bound contains a reserve term that grows with $\log(1/\delta)$.

That is not just mathematical decoration. It changes how teams should think about agent reliability.

For low-stakes automation, an average-case estimate may be enough. A customer-support summarizer that occasionally asks for human review is not a mission-critical controller. But for workflows where budget overruns or failed attempts are costly—compliance review, fraud investigation, incident response, financial reconciliation—the relevant question is not “What is the expected cost?” It is “What reserve do we need so failure remains acceptably rare?”

ACP does not remove the need for operational judgment. It gives that judgment a variable to tune. Lower $\delta$ means higher confidence and a larger reserve. This is boring in the way seatbelts are boring.
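The paper's exact high-probability bound is not reproduced here. As a stand-in, the sketch below applies the textbook Hoeffding inequality under stated assumptions: information increments are independent, each lies in $[0, b]$, and each has expected value at least `min_gain`. It searches for the smallest step count whose Hoeffding tail is at most $\delta$. All names and numbers are hypothetical.

```python
import math

def steps_for_confidence(i_total, min_gain, max_gain, delta):
    """Smallest n such that n increments reach i_total with probability >= 1 - delta.

    Uses Hoeffding for independent increments in [0, max_gain] with mean >= min_gain:
    P(sum < i_total) <= exp(-2 * (n * min_gain - i_total)**2 / (n * max_gain**2)).
    """
    if not 0 < delta < 1:
        raise ValueError("delta must be in (0, 1)")
    if min_gain <= 0 or max_gain < min_gain:
        raise ValueError("need 0 < min_gain <= max_gain")
    n = max(1, math.ceil(i_total / min_gain))
    while True:
        slack = n * min_gain - i_total
        if slack > 0 and math.exp(-2 * slack ** 2 / (n * max_gain ** 2)) <= delta:
            return n
        n += 1

baseline = math.ceil(1.0 / 0.1)  # steps needed if every action hit its mean
n_95 = steps_for_confidence(1.0, min_gain=0.1, max_gain=0.5, delta=0.05)
n_999 = steps_for_confidence(1.0, min_gain=0.1, max_gain=0.5, delta=0.001)
print(baseline, n_95, n_999)
```

The gap between `baseline` and the high-confidence step counts is the reserve, and tightening $\delta$ widens it.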

The GP estimator is the bridge from theory to workflow design

A framework is only useful if its quantities can be estimated. The paper’s practical route is to use a Gaussian process surrogate. The unknown function over the hypothesis space is modeled with a GP prior. The system estimates the entropy of the goal distribution, simulates possible outcomes for candidate actions, calculates posterior entropy reductions, and then predicts effective cost.

The estimation procedure can be compressed into four steps:

1. Model uncertainty over the hypothesis space with a GP prior.
2. Estimate $I_{\text{total}}$ from the induced distribution over acceptable goals.
3. Estimate expected information gain by simulating action outcomes and posterior entropy.
4. Compute $\hat{C}_{\text{effective}}$ and compare it with budget.
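The four steps can be sketched end to end on a toy problem. To stay dependency-free, this sketch swaps the GP surrogate for a discrete Bayesian posterior over a grid of slopes, so it is a stand-in for the paper's estimator rather than a reproduction of it. The goal ("slope is positive"), the noise level, the query points, and the budget are all hypothetical.

```python
import math
import random

def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def goal_mass(weights, in_goal):
    """Probability that the hypothesis lies in the goal set."""
    return sum(w for w, g in zip(weights, in_goal) if g)

def posterior(prior, slopes, x, y, sigma):
    """Bayes update after observing y = slope * x + Gaussian noise."""
    weights = [w * math.exp(-0.5 * ((y - s * x) / sigma) ** 2)
               for w, s in zip(prior, slopes)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_gain(prior, slopes, in_goal, x, sigma, n_sim, rng):
    """Monte Carlo estimate of prior entropy minus mean posterior entropy."""
    h_prior = binary_entropy(goal_mass(prior, in_goal))
    h_post = 0.0
    for _ in range(n_sim):
        true_slope = rng.choices(slopes, weights=prior)[0]
        y = true_slope * x + rng.gauss(0.0, sigma)
        post = posterior(prior, slopes, x, y, sigma)
        h_post += binary_entropy(goal_mass(post, in_goal))
    return h_prior - h_post / n_sim

rng = random.Random(0)
# Step 1: uncertainty over hypotheses (uniform prior on a slope grid in [-2, 2]).
slopes = [(i - 20) / 10 for i in range(41)]
prior = [1 / len(slopes)] * len(slopes)
# Step 2: information requirement for the hypothetical goal "slope > 0".
in_goal = [s > 0 for s in slopes]
i_total = binary_entropy(goal_mass(prior, in_goal))
# Step 3: expected gain of each candidate query point x, by simulation.
gains = {x: expected_gain(prior, slopes, in_goal, x, 0.5, 2000, rng)
         for x in (0.0, 0.5, 2.0)}
# Step 4: effective cost under the best action, at one cost unit per query.
best_x = max(gains, key=gains.get)
c_effective = i_total / gains[best_x] * 1.0
feasible = c_effective <= 5.0  # hypothetical budget of five queries
print(gains, best_x, round(c_effective, 2), feasible)
```

Querying at $x = 0$ reveals nothing about the slope, so its estimated gain is zero, while the widest query point is the most informative. A real estimator would also need the sampling and surrogate error margins the paper discusses.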

The paper also discusses two estimation errors. One comes from finite sampling when averaging over candidate actions. The other comes from surrogate misspecification: the GP model may not represent the real function well. Under RKHS-style assumptions, the paper gives a surrogate error bound and propagates errors into the effective cost estimate.

The business interpretation is straightforward: never use a point estimate alone. If the estimated cost is close to the budget ceiling, the correct answer is not “green light.” It is “we need margin, a better estimator, or a cheaper action policy.”

This is where ACP becomes a design tool rather than a theorem. It asks teams to define the agent workflow in measurable pieces:

Design question	ACP version
What does success mean?	Define $\Theta_{\text{goal}}$ clearly enough to estimate $I_{\text{total}}$
What actions can the agent take?	Define the action set $A$ and cost $C_s(a)$
Which actions are informative?	Estimate $I_s(a)$, not just whether the action looks plausible
When should the agent stop?	Compare effective cost plus reserve against budget
When should the task be redesigned?	When information gain is too low or too expensive

The annoying part, naturally, is that this requires teams to specify what their agents are doing. That is also the benefit.

The experiments test calibration more than spectacle

The paper validates ACP in two experimental settings: noisy slope identification and random graph coloring. These are not enterprise workflows, and the paper does not pretend they are. Their purpose is narrower: to test whether the theoretical cost estimate behaves like a lower bound and whether it tracks increasing difficulty.

That distinction matters. The experiments are not a grand benchmark of agent intelligence. They are calibration tests for an information-cost model.

Test	Likely purpose	What it supports	What it does not prove
Noisy slope identification with an LLM agent	Main evidence for ACP as a predictor under noisy observations	ACP predictions lower-bound actual steps; the prediction-execution gap widens as noise increases	That arbitrary LLM tool workflows can be accurately priced with the same estimator
Random graph 3-coloring	Main evidence plus comparison against random and greedy search	ACP-guided ordering reduces node expansions versus random and modestly improves over greedy in harder settings	That ACP solves NP-hard search broadly or dominates specialized solvers
GP estimation and error propagation	Implementation detail with theoretical support	The framework can be approximated, but needs margin for sampling and surrogate error	That GP surrogates are always appropriate for business processes
Approximate-goal extension	Exploratory extension	Relaxing the goal can reduce information requirements	That approximation quality is easy to choose in real organizations

In the noisy slope task, the agent tries to identify the slope $a \in [-2,2]$ of a linear function $y(x)=ax+\epsilon$ by querying points. The paper reports that ACP’s predicted step count remains below the LLM agent’s actual average number of steps, while the gap grows as noise increases. That is exactly what the theory suggests: noisier observations reduce effective information gain and increase overshoot.

The graph-coloring experiment is more concrete. The task is 3-coloring random graphs $G(n,p)$, with non-colorable instances discarded after verification by an integer linear program. The paper compares random search, greedy search, and ACP-guided search using backtracking with forward checking. Cost is measured as node expansions.

The reported table is small but useful:

Instance $(n,p)$	Random	Greedy	ACP	ACP prediction
$(8,0.25)$	9.16	8.00	8.00	8.00
$(10,0.30)$	13.64	10.16	10.00	10.00
$(12,0.35)$	27.34	13.10	13.04	12.00
$(15,0.35)$	47.40	18.58	18.34	15.00
$(15,0.41)$	39.46	18.08	16.46	15.00

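The pattern is more important than the absolute size. ACP is not dramatically better than greedy on the easiest cases. On the largest, densest instance, $(15,0.41)$, its advantage becomes more visible: 16.46 node expansions against 18.08. Its predicted cost remains at or below observed cost in every row, consistent with the lower-bound interpretation. The gap between prediction and observed cost is zero on the two smallest instances and opens up on the larger ones, which broadly matches the theory’s warning that deteriorating information gains inflate real cost. The trend is not monotone in density, though: the gap is 3.34 at $(15,0.35)$ and 1.46 at $(15,0.41)$.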
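The reported figures can be checked directly. This short script uses only the numbers from the table above and computes two derived columns: the overshoot (ACP's observed cost minus its prediction) and ACP's edge over greedy. The variable names are illustrative.

```python
# (n, p, random, greedy, acp, acp_prediction), copied from the reported table.
rows = [
    (8, 0.25, 9.16, 8.00, 8.00, 8.00),
    (10, 0.30, 13.64, 10.16, 10.00, 10.00),
    (12, 0.35, 27.34, 13.10, 13.04, 12.00),
    (15, 0.35, 47.40, 18.58, 18.34, 15.00),
    (15, 0.41, 39.46, 18.08, 16.46, 15.00),
]

# Observed ACP cost minus predicted cost: should never be negative for a lower bound.
overshoots = [round(acp - pred, 2) for _, _, _, _, acp, pred in rows]
# Node expansions saved relative to greedy search.
edges = [round(greedy - acp, 2) for _, _, _, greedy, acp, _ in rows]

for (n, p, *_), over, edge in zip(rows, overshoots, edges):
    print(f"({n}, {p:.2f}): overshoot={over:.2f}, edge over greedy={edge:.2f}")
```

The prediction never exceeds the observed cost, and the largest edge over greedy appears on the last row.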

This is good evidence for the mechanism. It is not yet evidence for a production-grade agent budgeting platform. Small random graph instances are not messy procurement workflows, legal reviews, or multi-system enterprise automations. The useful conclusion is narrower: when the task can be represented as structured search and information gain can be estimated, ACP’s cost logic behaves sensibly.

That is enough to be interesting.

The business value is task triage, not agent mysticism

The strongest business use of ACP is not “make agents smarter.” It is “decide which tasks deserve agent effort.”

That shift matters. Many organizations evaluate agents after deployment: run the workflow, observe failures, add guardrails, retry, complain about hallucinations, then add another model call. ACP suggests a pre-deployment discipline: estimate the information economics of the task before building the retry machine.

For business workflows, the practical pathway looks like this:

ACP signal	Operational decision	Business interpretation
Low $I_{\text{total}}$, high $I_s$, manageable $C_s$	Run the agent	The task is well-specified and actions are informative
High $I_{\text{total}}$, high $I_s$	Run with reserve	The task is hard but measurable; budget should reflect difficulty
High $I_{\text{total}}$, low $I_s$	Redesign the workflow	The agent lacks informative actions; add tools, data, tests, or human checkpoints
Low $I_s/C_s$ for key actions	Replace or reorder tools	Expensive actions are not buying enough information
Estimated cost near budget	Require margin or fallback	The workflow is fragile under uncertainty
Exact goal too costly, approximate goal feasible	Relax the success criterion	Approximation is not compromise; it is budget design
Cost remains above budget even after redesign	Walk away	The task is not automation-ready under current constraints
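The table can be compressed into a coarse decision ladder. This is one possible reading, not something specified by the paper: the function name, the four outcome labels, and the idea of passing a precomputed `reserve` and an optional `relaxed_i_total` are all assumptions made for illustration.

```python
import math

def triage(i_total, mean_gain, mean_cost, budget, reserve=0.0, relaxed_i_total=None):
    """Map ACP-style estimates to one of four coarse decisions."""
    def cost(bits):
        # Effective cost for a given information requirement.
        return math.inf if mean_gain <= 0 else bits / mean_gain * mean_cost

    if cost(i_total) + reserve <= budget:
        return "run"
    if cost(i_total) <= budget:
        # Fits only without the reserve: the workflow is fragile.
        return "require margin or fallback"
    if relaxed_i_total is not None and cost(relaxed_i_total) + reserve <= budget:
        return "relax the success criterion"
    return "redesign or walk away"

print(triage(1.0, 0.1, 2.0, budget=100.0, reserve=36.0))
print(triage(1.0, 0.1, 2.0, budget=40.0, reserve=36.0))
print(triage(1.0, 0.1, 2.0, budget=15.0, reserve=2.0, relaxed_i_total=0.5))
print(triage(1.0, 0.1, 2.0, budget=5.0, reserve=2.0, relaxed_i_total=0.5))
```

The ordering of the checks encodes a preference: run as specified if possible, relax the goal before giving up, and treat walking away as a legitimate final branch.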

This framing is especially useful for multi-tool agents. A search query, database lookup, code execution, API call, OCR step, retrieval pass, or human review can all be treated as actions with costs and expected information gains. ACP encourages ranking actions not by habit, but by information yield per unit cost.

That has direct ROI relevance. Expensive workflows often fail because teams optimize the model while ignoring the action economy. They buy a stronger model, but the agent still asks low-value questions, retrieves irrelevant documents, runs unnecessary tools, or loops through ambiguous states. ACP points to a different diagnosis: the bottleneck may be information gain, not model capability.

In that sense, the paper offers a useful corrective to the current agent conversation. Not every failure is a reasoning failure. Some failures are budgeted ignorance.

Approximation is a first-class budget lever

The paper’s extension to approximation algorithms is brief but important. Many real problems do not need the exact optimum. They need an acceptable solution within deadline, risk tolerance, and cost.

ACP handles this by redefining the goal set:

$$ \Theta_{\text{goal}}(\epsilon) = \{\theta \in \Theta : f(\theta) \leq (1+\epsilon)f(\theta^\ast)\}. $$

As $\epsilon$ increases, the acceptable goal set grows. A larger goal set reduces the information requirement $I_{\text{total}}(\epsilon)$. In business language: relaxing precision can make the task cheaper because the agent no longer needs to distinguish between many nearly equivalent options.
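A minimal sketch of the relaxed goal set on a finite candidate list, assuming a minimization objective with positive costs. The supplier names and prices are invented for illustration.

```python
def goal_set(costs, epsilon):
    """Candidates whose cost is within a factor (1 + epsilon) of the best cost.

    Assumes minimization and strictly positive costs.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be non-negative")
    best = min(costs.values())
    return {name for name, c in costs.items() if c <= (1 + epsilon) * best}

# Hypothetical supplier quotes.
supplier_costs = {"A": 100.0, "B": 102.0, "C": 102.5, "D": 120.0}

exact = goal_set(supplier_costs, 0.0)
near = goal_set(supplier_costs, 0.03)
loose = goal_set(supplier_costs, 0.25)
print(sorted(exact), sorted(near), sorted(loose))
```

With $\epsilon = 0$ only the optimum qualifies. At 3% tolerance three of the four suppliers are acceptable, so the agent no longer has to tell them apart.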

This is more than mathematical neatness. It is how automation actually works.

A procurement agent may not need the globally optimal supplier; it may need a compliant supplier within 3% of the best expected cost. A scheduling agent may not need the perfect timetable; it may need a conflict-free timetable that satisfies priority constraints. A research assistant may not need every relevant document; it may need enough evidence to support a decision memo with known uncertainty.

ACP makes that trade-off explicit. Exactness is not free. Approximation is not laziness. It is a way to enlarge $\Theta_{\text{goal}}$ and reduce the information burden.

The boundary is equally important. If approximation below a certain ratio is impossible for a class of problems, ACP treats the information requirement as effectively infinite for that target. Translation: some goals are not merely expensive; they are structurally unrealistic under the chosen constraint. A mature agent should know the difference.

Where ACP applies cleanly, and where it gets slippery

ACP is strongest when four conditions hold.

First, the goal set must be definable. The framework needs some way to say what counts as an acceptable solution. This is natural for optimization, classification, diagnosis, search, scheduling, and structured decision tasks. It is much harder for open-ended strategy, creative work, negotiation, or vague executive requests such as “find insights.” A classic business phrase, very useful for meetings and almost useless for entropy.

Second, actions must have estimable information value. If a tool call, query, experiment, or observation reliably reduces uncertainty, ACP has something to measure. If actions produce ambiguous social signals, shifting preferences, or unstructured narratives, information gain becomes harder to estimate.

Third, costs must be meaningful and comparable. Token spend is easy. Latency is manageable. Human review time is measurable. Compliance exposure, reputational risk, and customer annoyance are measurable only after a meeting with people who use the word “holistic” too often. Still, the framework forces the conversation.

Fourth, the environment should not change too quickly. ACP assumes enough stability for prior uncertainty, action effects, and information gain estimates to remain useful. Dynamic environments—markets, adversarial systems, live cybersecurity incidents, fast-changing customer behavior—need adaptive versions of the framework, not a static pre-flight estimate.

The paper itself acknowledges future work around multi-agent coordination, dynamic environments, and tighter surrogate models. Those are not minor implementation details. They are exactly where enterprise agent systems become interesting and unpleasant.

A fair practical boundary is this: ACP is not a universal capability certificate. It is a structured feasibility model for bounded tasks with measurable uncertainty, action costs, and success criteria.

That is still valuable. Most enterprise automation should be more bounded than people admit.

What Cognaptus would infer for agent operations

The paper directly shows an information-theoretic formulation of agent solvability, proves lower and upper cost bounds under assumptions, gives a GP-based estimation procedure, and validates the prediction logic on noisy slope identification and graph coloring.

From that, Cognaptus would infer a practical operating model for agent deployment:

Before deployment, score workflows by information economics. Do not ask only whether the model can perform the task in a demo. Ask whether the available actions reveal enough information per unit cost to justify autonomous execution.
Separate capability failure from observability failure. If $I_s$ is low, the agent may not need a larger model. It may need better tools, clearer intermediate tests, richer data, or a redesigned workflow.
Use approximation deliberately. Define “good enough” as an explicit goal set. This can reduce information requirements and turn infeasible automations into feasible ones.
Treat budget margins as part of reliability engineering. A workflow whose effective cost barely fits within budget is not robust. The overshoot term is not a nuisance; it is the price of uncertainty.
Make refusal a designed behavior. An agent that walks away from low-information, high-cost tasks is not less autonomous. It is less reckless.

The uncertainty is also clear. The paper’s experiments are limited and synthetic. The GP surrogate is one estimation path, not a universal recipe. The theory depends on assumptions about information increments and search behavior. Real enterprise workflows may require empirical calibration before ACP-style estimates become dependable.

But the direction is right. The next generation of agents should not be judged only by how long they can continue. They should be judged by whether continuing is rational.

The better agent is the one that prices ignorance

The most useful idea in ACP is not the formula itself. It is the attitude behind the formula.

Agent systems should not treat uncertainty as a fog to be charged through. They should price it. They should ask how many bits remain, how expensive those bits are, whether the next action will actually reduce uncertainty, and whether the goal should be relaxed before the budget becomes a memorial.

This is a quieter vision of autonomy than the usual one. No dramatic self-improvement loop. No claim that agents will soon run the company while humans sip coffee and pretend to supervise. Instead, ACP offers something more operationally mature: a way for agents to decide when the problem is worth their effort.

For business adoption, that may be the more important capability. Not the agent that always tries. The agent that knows when trying is a bad bet.

Cognaptus: Automate the Present, Incubate the Future.

¹ Shahar Lutati, “The Agent Capability Problem: Predicting Solvability Through Information-Theoretic Bounds,” arXiv:2512.07631, 2025.
