When Black Boxes Grow Teeth: Mapping What AI Can *Actually* Do

A green block, a yellow block, and a very small number

Green on yellow. That is the task.

A tabletop robot sees a green block, a yellow block, and a few other objects. It has low-level manipulation skills. It receives a high-level instruction: put the green block on top of the yellow block. This sounds like exactly the kind of small benchmark task that modern AI agents should now handle with theatrical confidence.

The agent succeeds about 6% of the time.

Not 60%. Six.

Worse, the failure is not clean. Sometimes it picks up the entire tower. Sometimes it unstacks or knocks things down. Sometimes the visual detector becomes sensitive to the relative positions of the blocks and confuses the yellow block for the green one. The important fact is not merely that the agent fails. We already own enough AI demos to know that failure is part of the subscription package. The important fact is that the failure has structure.

A new paper, Discovering and Learning Probabilistic Models of Black-Box AI Capabilities, proposes a way to discover that structure.¹ The method, called Probabilistic Capability Model Learning, or PCML, does not simply ask whether a black-box AI agent can complete a task. It tries to learn what the agent can do, under which conditions, with what side effects, and with what outcome probabilities.

That distinction is the spine of the paper.

A pass/fail benchmark says: “Can the agent stack the block?” PCML asks a more operational question: “From which starting states does the agent reliably stack the block, what else might it disturb, and how often do those outcomes occur?”

For businesses trying to deploy agents into workflows, robots, simulations, customer operations, or software environments, that second question is the one that actually matters. The first one is mostly useful for conference slides.

Capability testing is not the same as capability modeling

The common instinct is to evaluate an agent by giving it tasks and counting successes. That is understandable. It is also too thin.

An AI agent used for sequential decision-making is not just a classifier producing one output. It acts over time. It may take different routes to the same objective. It may succeed while creating side effects. It may fail only under specific initial conditions. It may prefer one valid plan over another because of internal tie-breaking logic that nobody outside the system can see.

The paper calls these systems black-box AI systems, or BBAIs: systems that accept high-level objectives and attempt to achieve them in an environment. The examples range from language- and vision-driven agents to robot controllers and planning agents.

The authors argue that users often do not merely need a primitive action model. Knowing how a gripper moves, or how a simulator transition works, does not tell a user whether the agent can “clean the kitchen,” “make coffee,” “fetch the key,” or “stack the green block.” Those capabilities depend on the agent’s internal reasoning, planning, policy selection, perception, and execution.

So the paper shifts the target from low-level world modeling to high-level capability modeling.

A capability model in PCML contains three things:

| Capability question | What PCML tries to learn | Why it matters operationally |
| --- | --- | --- |
| What can the agent try to achieve? | The set of discovered intents or capabilities | Avoid assuming the agent can pursue every syntactically valid goal |
| When does it work? | Conditions over an interpretable abstract state | Decide when to allow, block, or route a task |
| What happens when it tries? | Conditional probabilistic effects, including side effects | Estimate reliability, failure modes, and downstream risk |

This is the paper’s useful correction to the usual benchmark mindset. A benchmark score compresses behavior. A capability model expands it into a map.
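
One way to picture such a map as a data structure is sketched below. Every class and field name is hypothetical, the door example and its probabilities are invented for illustration, and the paper's own PDDL-style representation is richer than this.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: an abstract state is a frozenset of ground predicates.


@dataclass(frozen=True)
class Effect:
    add: frozenset        # predicates that become true
    delete: frozenset     # predicates that become false
    probability: float    # how often this outcome occurs under the condition


@dataclass
class ConditionalRule:
    condition: frozenset  # predicates that must hold in the starting state
    effects: list         # list of Effect; probabilities should sum to 1


@dataclass
class Capability:
    intent: str                                # what the agent can be asked to achieve
    rules: list = field(default_factory=list)  # when it works, and what happens

    def rule_for(self, state):
        """Return the first rule whose condition holds in `state`, else None."""
        for rule in self.rules:
            if rule.condition <= state:
                return rule
        return None


# Invented example: with a key in hand the door opens 80% of the time;
# the other 20% of the time the agent drops the key (a side effect).
open_door = Capability(
    intent="open(door)",
    rules=[
        ConditionalRule(
            condition=frozenset({"holding(key)"}),
            effects=[
                Effect(frozenset({"open(door)"}), frozenset(), 0.8),
                Effect(frozenset(), frozenset({"holding(key)"}), 0.2),
            ],
        )
    ],
)

state = frozenset({"holding(key)", "at(door)"})
rule = open_door.rule_for(state)
```

The three fields line up with the three questions: `intent` is what the agent can try, `condition` is when it works, and `effects` is what happens when it tries.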

Three uncomfortable examples make the paper more interesting than its formalism

The paper’s strongest evidence is not its theorems, although it has them. It is the way PCML surfaces behaviors that a normal task-completion view would hide.

The SayCan example is the cleanest. The tabletop agent, controlled with a Llama-3.1-8B-Instruct component and low-level manipulation skills, is asked to put a green block on a yellow block. PCML learns that the intended capability succeeds only around 6% of the time. It also identifies recurring non-intent outcomes: picking up the whole tower, unstacking, knocking down objects, and perception-sensitive mistakes.

A second example comes from MiniGrid. The MiniGrid agent is implemented using GPT-4.1-mini and operates in a small 7×7 world with locked doors, a blue key, a green door, a blue door, and lava. PCML discovers that the agent may reach the northwest quadrant only 10% of the time, while also picking up an unnecessary key and opening an unnecessary door. It also finds that the agent can get the blue key from the northwest with 80% success, yet fails when starting near the key in the southwest.

That last detail is the kind of thing a manager would not find funny after deployment. “It works when far away and fails when near the target” is not an intuitive reliability profile. It is, however, exactly the kind of profile that black-box agents can have.

A third example comes from an LAO* planning agent in Blocksworld. PCML identifies a behavioral preference: when placing block A, the agent always prefers placing it on C if C is clear; otherwise, it places it on B. This is not a failure in the usual sense. It is a hidden policy preference. But hidden preferences matter when multiple outcomes satisfy a user instruction and one of them creates downstream constraints.

These cases justify the paper’s case-first reading. The formal machinery matters because it explains how such behaviors can be discovered systematically. But the business problem appears first in the cases: agents do not merely “work” or “fail.” They work conditionally, fail asymmetrically, and leave fingerprints.

PCML turns agent interrogation into active learning

PCML begins with a practical assumption set. It does not require access to the agent’s internal model. It assumes access to:

an environment or simulator;
the ability to instruct the agent to pursue a high-level intent;
an abstraction function that maps low-level environment states into an interpretable symbolic vocabulary.

That third assumption is crucial. PCML is not learning raw perception from pixels into concepts. It assumes the existence of an abstraction layer: objects, predicates, and abstract states. For example, instead of treating the entire environment as raw pixels or simulator internals, it may represent whether a robot is at a location, whether a key is held, whether a block is on another block, or whether a door is open.

Given that abstraction, PCML collects transitions of the form:

$$ \langle s, c, s' \rangle $$

where $s$ is the abstract starting state, $c$ is the capability or intent the agent is asked to execute, and $s'$ is the abstract resulting state.
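
A minimal sketch of how one such triple might be logged is shown below. The `abstract` function, the `toy_agent` stand-in for the black box, and the predicate names are all assumptions made up for this example, not taken from the paper.

```python
from typing import NamedTuple


class Transition(NamedTuple):
    s: frozenset       # abstract starting state
    c: str             # capability (intent) the agent was asked to execute
    s_next: frozenset  # abstract resulting state


def abstract(env_state):
    """Hypothetical abstraction function: raw state dict -> set of predicates."""
    predicates = {f"at({env_state['room']})"}
    if env_state["holding_key"]:
        predicates.add("holding(key)")
    return frozenset(predicates)


def toy_agent(env_state, capability):
    """Stand-in for the black box: it picks up the key only in room 'a'."""
    new_state = dict(env_state)
    if capability == "get(key)" and env_state["room"] == "a":
        new_state["holding_key"] = True
    return new_state


def record(env_state, capability, agent=toy_agent):
    """Run one query and log it as an abstract <s, c, s'> transition."""
    next_env_state = agent(env_state, capability)
    return Transition(abstract(env_state), capability, abstract(next_env_state))


t_ok = record({"room": "a", "holding_key": False}, "get(key)")
t_fail = record({"room": "b", "holding_key": False}, "get(key)")
```

Note that the learner only ever sees the abstract states on either side of the call; what the agent did in between stays inside the box.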

The method then builds capability models using conditional probabilistic effects. In plain language, a learned rule may say:

Under condition A, executing capability C leads to outcome X with probability 0.5, outcome Y with probability 0.25, and outcome Z with probability 0.25.

This is why PDDL-style representation is not just academic decoration here. The point is to produce a symbolic model that a human can inspect, while still allowing stochastic outcomes and conditional effects. The model is not merely a success-rate table. It is closer to a probabilistic operating manual.
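
The probabilities in such a rule can be read as outcome frequencies. The sketch below shows only that counting step, on invented data chosen to reproduce the 0.5 / 0.25 / 0.25 example; it groups by exact abstract state, which is a simplification, since PCML learns conditions over predicates rather than one rule per state.

```python
from collections import Counter, defaultdict


def estimate_effects(transitions):
    """Group <s, c, s'> triples by (s, c) and turn outcome counts into frequencies."""
    counts = defaultdict(Counter)
    for s, c, s_next in transitions:
        counts[(s, c)][s_next] += 1
    return {
        key: {outcome: n / sum(outcomes.values()) for outcome, n in outcomes.items()}
        for key, outcomes in counts.items()
    }


# Invented data: four executions of capability "C" from the same condition.
A = frozenset({"condition_A"})
X, Y, Z = frozenset({"X"}), frozenset({"Y"}), frozenset({"Z"})
data = [(A, "C", X), (A, "C", X), (A, "C", Y), (A, "C", Z)]

model = estimate_effects(data)
print(model[(A, "C")][X], model[(A, "C")][Y], model[(A, "C")][Z])  # 0.5 0.25 0.25
```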

PCML discovers candidate capabilities from observed abstract state changes. It avoids enumerating every syntactically possible intent and instead derives plausible ones from what the agent actually changes in the environment. That is a small but important design choice. In deployment terms, it means the system is not wasting all its energy asking nonsense questions just because the symbolic language permits them. Very noble. Also necessary.
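
As a rough illustration of deriving candidates from observed change, the sketch below collects every predicate whose truth value flipped in at least one logged transition. Treating each changed predicate as one candidate intent is an assumption made for this example; the paper's discovery procedure may differ in detail.

```python
def candidate_intents(transitions):
    """Collect predicates whose truth value changed in at least one transition."""
    candidates = set()
    for s, _c, s_next in transitions:
        candidates |= s_next - s  # predicates the agent made true
        candidates |= s - s_next  # predicates the agent made false
    return candidates


# Invented log: the agent moves and picks up a key, but never touches the vault.
log = [
    (
        frozenset({"at(a)", "locked(vault)"}),
        "explore",
        frozenset({"at(b)", "locked(vault)"}),
    ),
    (
        frozenset({"at(b)", "locked(vault)"}),
        "explore",
        frozenset({"at(b)", "holding(key)", "locked(vault)"}),
    ),
]

found = candidate_intents(log)
```

Here `locked(vault)` is a perfectly valid predicate in the vocabulary, but because the agent never changes it, no query budget is spent asking about it.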

The clever move: optimistic and pessimistic models argue with each other

The core mechanism is the pair of models PCML maintains during learning.

The pessimistic model accepts only what has been observed. It is safe, narrow, and deliberately reluctant to generalize. If a transition has not appeared in the dataset, the pessimistic model does not rush to believe it.

The optimistic model generalizes over states not already claimed by another observed partition. It is broader and more willing to imagine that the agent can behave similarly in untested states.

The gap between these two models is the search space. When the optimistic and pessimistic models disagree, PCML has found uncertainty worth probing.

This gives the algorithm a simple interview strategy:

| Model pair behavior | What it means | What PCML does |
| --- | --- | --- |
| Pessimistic and optimistic models agree | Current data already constrain the behavior | Stop probing that region or move elsewhere |
| They disagree on possible outcomes | The agent may behave differently in unobserved conditions | Synthesize a query to expose the difference |
| New transition appears | The capability model was incomplete | Update the dataset and rebuild the model pair |
| No new information appears repeatedly | The current query budget is no longer producing useful discoveries | Stop according to the early stopping rule |
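
The disagreement test can be sketched with a deliberately crude pair of models. Here the pessimistic model assumes nothing happens in any state it has not seen, and the optimistic model assumes any effect seen elsewhere could also occur there. Both rules are assumptions invented for this example; the paper builds its models over learned partitions of the abstract state space, not per-state lookups.

```python
def observed_deltas(transitions, capability):
    """Set of (added, removed) predicate changes seen for this capability."""
    return {(s2 - s, s - s2) for s, c, s2 in transitions if c == capability}


def seen_outcomes(transitions, state, capability):
    return {s2 for s, c, s2 in transitions if s == state and c == capability}


def pessimistic_support(transitions, state, capability):
    """Believe only what was observed; elsewhere, predict no change."""
    return seen_outcomes(transitions, state, capability) or {state}


def optimistic_support(transitions, state, capability):
    """Where nothing was observed, assume effects seen elsewhere carry over."""
    seen = seen_outcomes(transitions, state, capability)
    if seen:
        return seen
    deltas = observed_deltas(transitions, capability)
    return {(state - removed) | added for added, removed in deltas} or {state}


def worth_probing(transitions, state, capability):
    """A state is informative when the two models predict different outcomes."""
    return pessimistic_support(transitions, state, capability) != optimistic_support(
        transitions, state, capability
    )


T = [(frozenset({"at(a)"}), "get(key)", frozenset({"at(a)", "holding(key)"}))]
unseen = frozenset({"at(b)"})
seen = frozenset({"at(a)"})
```

With this toy log, `worth_probing(T, unseen, "get(key)")` is true and `worth_probing(T, seen, "get(key)")` is false, which mirrors the first two rows of the table.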

The paper uses Monte Carlo Tree Search to synthesize queries that are likely to distinguish the two models. This is the “active” part of active learning. Instead of randomly throwing tasks at the agent and hoping something interesting happens, PCML searches for policies that drive the agent into parts of the abstract state space where the two models predict different outcomes.

There are two implementations.

PCML-E uses exact distributions over abstract states, represented compactly with bit vectors and bitwise operations. It uses total variation distance to measure how different the predicted distributions are.
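
For two distributions $P$ and $Q$ over abstract states, total variation distance is $\frac{1}{2}\sum_s |P(s) - Q(s)|$. The sketch below computes it on plain dictionaries with invented numbers; it does not attempt to reproduce the bit-vector encoding PCML-E uses.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions given as dicts."""
    states = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)


# Invented predictions from the two models for the same query.
pessimistic = {"X": 0.5, "Y": 0.5}
optimistic = {"X": 0.5, "Y": 0.25, "Z": 0.25}

print(total_variation(pessimistic, optimistic))  # 0.25
```

A distance of 0 means the two models make the same prediction for that query, so running it would teach nothing; larger values mark queries more likely to separate them.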

PCML-S uses a sample-based Set-of-Support formulation. Its key insight is that if a state appears in the support of one model’s prediction but not the other’s, observing that state is immediately diagnostic. So it focuses attention on the symmetric difference between the supports of predicted distributions, rather than carrying around every detail of the full distribution.
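
The support-based idea can be illustrated with ordinary set operations. In this sketch the supports are given directly as sets of invented state labels; how PCML-S obtains them by sampling is not shown, and the helper names are hypothetical.

```python
def diagnostic_states(support_a, support_b):
    """States that exactly one model considers possible (symmetric difference)."""
    return support_a ^ support_b


def refuted_by(observed, support_pessimistic, support_optimistic):
    """Name the model(s) that gave the observed state zero probability."""
    refuted = []
    if observed not in support_pessimistic:
        refuted.append("pessimistic")
    if observed not in support_optimistic:
        refuted.append("optimistic")
    return refuted


# Invented supports for one query.
sup_pess = {"door_closed", "door_open"}
sup_opt = {"door_open", "key_dropped"}

targets = diagnostic_states(sup_pess, sup_opt)
verdict = refuted_by("key_dropped", sup_pess, sup_opt)
```

Observing `door_open` would be consistent with both models, while observing either state in `targets` immediately rules one of them out.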

The distinction matters because exact symbolic probability tracking can become expensive. PCML-S trades exactness for a more scalable sampling strategy, and in most reported experiments it performs better. Not always, though. Overcooked is the useful exception: PCML-E outperforms PCML-S there because the domain has fewer common states across the MCTS tree, so PCML-S’s sampling does not buy much.

That exception is not a weakness in the paper. It is a reminder that algorithmic elegance still has to pay rent in a particular state space.

What the experiments are actually doing

The empirical section answers two main questions. First, does PCML reduce uncertainty about the agent’s capabilities over time? Second, do the learned models qualitatively match meaningful agent behaviors?

The paper evaluates PCML on several environments and agents: Overcooked, MiniGrid, SayCan, LAO*/PDDLGym domains, rendered Blocksworld, Tireworld, Probabilistic Elevators, and Blocksworld. It compares PCML against a random exploration baseline that uses the same model-learning framework but replaces active query selection with random capability sequences.

Here is the cleanest way to read the evidence:

| Evidence component | Likely purpose | What it supports | What it does not prove |
| --- | --- | --- | --- |
| Figure 3 examples: MiniGrid, SayCan, LAO* | Main qualitative evidence | PCML can reveal interpretable capability limits, side effects, and hidden preferences | It does not prove all such agents can be fully characterized cheaply |
| Figure 4 variational distance curves | Main quantitative evidence | PCML reduces learned-model mismatch faster than random exploration across the main evaluated problems | The true model is still approximated through sampled evaluation data |
| Random exploration baseline | Comparison with unguided probing | Active query synthesis is doing useful work beyond merely collecting more data | It does not compare against every possible testing or verification method |
| Appendix Figure 5 additional domains | Robustness and exploratory extension | PCML retains advantages or sensible plateau behavior across more stochastic and difficult domains | It does not remove dependence on abstraction quality or simulator access |
| Theorems 1–5 | Formal guarantee under assumptions | Soundness, completeness, and convergence hold when the abstract model is finite and expressive enough | They do not guarantee easy real-world deployment in open-ended settings |
| Implementation details and hyperparameters | Reproducibility / implementation detail | Shows how compact distributions, state sampling, and MCTS settings were configured | Not a separate empirical claim about business value |

The quantitative results are modest but meaningful. In MiniGrid, PCML-S achieves approximately 60% lower sampled variational distance than the random baseline. In SayCan, the improvement is around 20%, lower because the agent itself is highly stochastic. In Overcooked, PCML-E achieves about 60% lower variational distance than random exploration. In First Responders, random exploration performs particularly poorly because several transitions are reachable only through specific capability sequences.

The appendix extends the story. In Tireworld and Blocksworld, PCML consistently outperforms random exploration and converges to lower variational distance faster. In Blocksworld, random exploration appears competitive early but eventually needs 19× more capability executions to reach the same variational distance as PCML. Rendered Blocksworld and Probabilistic Elevators are less flattering but still informative: high stochasticity or limited ability to intentionally reach distinguishing states causes PCML to plateau after it exhausts the transitions the agent can realistically generate.

That plateau matters. A flat curve is not always failure. Sometimes it means the agent cannot be steered into new diagnostic states. PCML can interrogate an agent; it cannot magically grant the agent agency it does not have.

The business value is capability auditing, not prettier benchmarking

The most direct business interpretation is not “use PCML and your agents become safe.” They do not. The universe remains inconsiderate.

The practical value is cheaper, more structured diagnosis before deployment.

Many organizations are moving from chat-style AI tools toward agentic systems: AI that uses tools, triggers workflows, navigates software, controls robots, or executes multi-step procedures. In such settings, the key question is rarely “Can it succeed on a demo?” The better question is:

Under what conditions can we safely ask this system to do this job, and what are the likely side effects?

PCML points toward an audit workflow:

| Deployment step | PCML-style contribution | Business interpretation |
| --- | --- | --- |
| Define the controlled environment | Simulator or testbed where the agent can be queried | Build a safe pre-deployment arena |
| Define abstraction | Map raw states into business-relevant predicates | Decide what outcomes are visible to humans |
| Interrogate the agent | Generate active queries that expose model uncertainty | Spend testing budget on informative cases |
| Learn capability model | Represent conditions and probabilistic effects | Produce an interpretable capability map |
| Use the model operationally | Permit, block, route, or monitor tasks based on learned conditions | Turn testing into policy, not just a PDF report |

This has obvious relevance for robotics, warehouse automation, software agents, healthcare workflow assistants, and any environment where agent failure creates operational cost. The model does not need to be perfect to be useful. Even a partial capability map can reveal that a task should be restricted to certain states, that a fallback should be required, or that a supposedly harmless instruction creates side effects.
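
A partial map can already drive a simple gating rule. The sketch below is one hypothetical way to do it: the thresholds, the decision labels, and every outcome probability except the roughly 6% stacking rate quoted earlier are invented for illustration.

```python
def decide(outcome_probs, goal, forbidden, min_success=0.9, max_side_effect=0.05):
    """Turn learned outcome probabilities for one starting condition into a decision.

    outcome_probs maps each outcome (a frozenset of predicates) to its probability.
    """
    success = sum(p for outcome, p in outcome_probs.items() if goal in outcome)
    side_effect = sum(p for outcome, p in outcome_probs.items() if outcome & forbidden)
    if side_effect > max_side_effect:
        return "block"
    if success >= min_success:
        return "permit"
    return "route_to_human"


forbidden = frozenset({"tower_knocked_down"})

reliable = {frozenset({"stacked"}): 0.95, frozenset(): 0.05}
flaky = {frozenset({"stacked"}): 0.6, frozenset(): 0.4}
messy = {
    frozenset({"stacked"}): 0.06,
    frozenset({"tower_knocked_down"}): 0.5,
    frozenset(): 0.44,
}

decisions = [decide(m, "stacked", forbidden) for m in (reliable, flaky, messy)]
print(decisions)  # ['permit', 'route_to_human', 'block']
```

The side-effect check runs first on purpose: an agent that usually succeeds but occasionally makes a mess is handled differently from one that simply fails quietly.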

There is also a procurement angle. If two vendors both claim their agent can “handle scheduling,” “prepare reports,” or “operate a machine,” the meaningful comparison is not a demo video. It is a capability map under shared test conditions. Vendor A may succeed more often; Vendor B may fail more gracefully; Vendor C may technically succeed while making a mess elsewhere. Capitalism loves a feature checklist. Operations needs a failure topology.

The paper’s formal guarantees are useful but conditional

The theoretical results are not decorative. They define what it means for the learned model to be sound and complete with respect to observed data, and they show convergence under finite and expressible abstract models.

The most important guarantee is conceptual: PCML’s pessimistic and optimistic models form lower and upper bounds over models consistent with the collected dataset. When they coincide, the remaining uncertainty collapses. With enough coverage and sampling, if the true capability model is expressible in the chosen predicates and objects, the learned model converges toward the true model in variational distance.

That is powerful, but it is not magic.

The guarantee depends on the abstraction being good enough. If the predicates omit the feature that actually determines the agent’s behavior, PCML cannot represent that missing cause. It can only learn within the vocabulary it has been given. A business version of this problem is easy to imagine: if a customer-service agent behaves differently depending on account history, but the abstraction only includes current ticket category, the model will look noisy because the relevant condition is invisible.
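
The customer-service case can be made concrete with a toy log. In the invented data below the agent is fully deterministic given account history, yet an abstraction that only sees ticket category makes it look like a coin flip. The field names and records are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def outcome_frequencies(records, abstraction):
    """Estimate outcome frequencies per abstract state under a given abstraction."""
    counts = defaultdict(Counter)
    for raw_state, outcome in records:
        counts[abstraction(raw_state)][outcome] += 1
    return {
        state: {o: n / sum(ctr.values()) for o, n in ctr.items()}
        for state, ctr in counts.items()
    }


# Invented log: the outcome depends entirely on account history.
records = [
    ({"category": "billing", "history": "clean"}, "resolved"),
    ({"category": "billing", "history": "clean"}, "resolved"),
    ({"category": "billing", "history": "disputed"}, "escalated"),
    ({"category": "billing", "history": "disputed"}, "escalated"),
]

coarse = outcome_frequencies(records, lambda r: (r["category"],))
fine = outcome_frequencies(records, lambda r: (r["category"], r["history"]))

print(coarse[("billing",)])        # {'resolved': 0.5, 'escalated': 0.5}
print(fine[("billing", "clean")])  # {'resolved': 1.0}
```

The agent did not change between the two runs; only the vocabulary did.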

The paper also assumes a finite abstract state space. That is reasonable for the formal setup and many controlled simulations. It becomes harder in open-ended business processes where new entities, documents, tools, and exception types appear over time.

Finally, the paper notes a subtle future-work problem: real-world BBAIs may have implicit or context-dependent preferences between multiple valid plans. Distinguishing a genuine structural constraint from a mere preference can be difficult. If an agent always places block A on C when C is clear, is that because it cannot reason about placing A on B, or because its planner’s tie-breaking rule prefers C? For operations, both may matter. For generalization, the difference matters even more.

Where PCML fits in the AI governance stack

PCML is not a replacement for safety verification, red-teaming, monitoring, or human oversight. It sits in a different layer.

Traditional testing tries to find failures. Verification tries to prove properties. Monitoring tries to detect behavior during use. PCML tries to discover the agent’s capability model: the conditional, probabilistic structure of what the agent can actually execute.

That makes it especially useful before deployment and during redesign. It can help answer questions such as:

Which tasks should this agent be allowed to attempt automatically?
Which starting conditions require human approval?
Which side effects should trigger monitoring?
Which capabilities are unreliable because of perception, planning, or stochastic execution?
Which observed behaviors suggest hidden preferences rather than hard limitations?

The business value is not that every organization will implement PCML exactly as described. Many will not have a clean simulator, a mature symbolic abstraction, or the patience to run two-day experiments. The value is the audit pattern: move from outcome counting to conditional capability modeling.

That pattern is likely to survive even if the implementation changes.

The real lesson: stop asking agents for résumés

Agents are very good at producing confident descriptions of what they can do. So are interns, vendors, and consultants. This is why civilization invented probation periods.

PCML is a probation period with a symbolic notebook. It watches the agent act, creates targeted tests, records what changes, and turns the evidence into an interpretable probabilistic model.

The SayCan robot that stacks green on yellow only about 6% of the time is not just a failed demo. It is a warning about the wrong evaluation question. “Can it stack?” is too blunt. “When does it stack, what does it disturb, and how often?” is the question that makes the black box useful enough to govern.

For Cognaptus readers, the immediate implication is straightforward: as agentic AI moves into operations, capability auditing becomes a business discipline. Not a compliance slogan. Not a benchmark leaderboard. A practical process for mapping where automation can be trusted, where it needs guardrails, and where the system is still improvising with a wrench in its mouth.

The black box is growing teeth. PCML offers one way to count them before they bite the workflow.

Cognaptus: Automate the Present, Incubate the Future.

¹ Daniel Bramblett, Rushang Karia, Adrian Ciotinga, Ruthvick Suresh, Pulkit Verma, YooJung Choi, and Siddharth Srivastava, “Discovering and Learning Probabilistic Models of Black-Box AI Capabilities,” arXiv:2512.16733, 2025. https://arxiv.org/abs/2512.16733
