TL;DR for operators
Enterprise agents fail less like philosophers and more like junior coordinators with access to the wrong dropdown menu.
They propose actions that are not currently possible. They miss actions that are possible. They forget that an action changes the world. They treat impossible future states as if determination will somehow make them available. They add redundant steps, skip mandatory subgoals, or pick a next move that feels plausible but does not reduce the distance to the goal.
That is the practical force of ACPBench Hard, a benchmark from Kokel, Katz, Srinivas, and Sohrabi that tests whether language models can perform the atomic reasoning tasks that automated planners rely on.[1] The benchmark is not asking models to write a pretty plan. It asks them to generate the planner’s working parts: all applicable actions, state changes, unreachable facts, unreachable actions, invalid plan steps, removable actions, landmarks, and goal-improving next actions.
The results are not flattering. Small and medium models are essentially unusable on action applicability and action reachability. Large models improve, but unevenly. Reasoning models help on some tasks, especially progression and next-action selection, but still perform badly on several primitives. The most revealing failure is not that models struggle with “planning” in the abstract. It is that they often cannot reliably identify which actions can be performed now.
For businesses building autonomous agents, this changes the evaluation question. Do not ask only whether the agent completed a polished demo. Ask whether it can pass the mundane internal checks a planner needs before it is allowed to act:
| Operator question | Planner primitive | Business failure if missed |
|---|---|---|
| What can be done now? | Applicability | Tool misuse, invalid workflow steps, bad API calls |
| What changes after this step? | Progression | Stale state, duplicated work, broken handoffs |
| What can never happen from here? | Reachability | Wasted search, impossible promises |
| Which future actions can never become available? | Action reachability | Dead-end workflows, invalid escalation paths |
| Where does this plan first break? | Validation | Hidden infeasibility in generated procedures |
| Which steps are redundant? | Justification | Bloated automation, unnecessary cost |
| What must happen on every solution path? | Landmark | Skipped dependencies, missing approvals |
| Which next step moves us closer to the goal? | Next action | Plausible but directionless agent behaviour |
The business implication is not “throw away LLM agents.” That would be neat, dramatic, and wrong. The implication is more useful: LLMs should not be treated as reliable planners unless their planning primitives are tested, constrained, and validated. Where a symbolic model exists, use it. Where it does not, build validators around the specific action-state logic that matters. Charisma is not a control system.
The failure is not strategy; it is the action interface
A typical agent demo starts with a goal: “Book the trip,” “resolve the ticket,” “prepare the report,” “reconcile the invoice,” “schedule the maintenance visit.” The model decomposes the goal, calls tools, and narrates progress with the serene confidence of a consultant who has not yet met procurement.
This makes planning look like high-level strategy. In practice, planning is mostly governed by lower-level rules. A workflow step is valid only if its preconditions hold. A tool call changes the state. Some facts are impossible from the current state. Some actions will never become available. Some subgoals cannot be skipped. Some plans are not plans; they are wish lists with timestamps.
ACPBench Hard is valuable because it moves attention from the theatrical surface of agent planning to the action interface underneath. The authors build on earlier ACPBench work, but remove the comfort of boolean and multiple-choice answers. Instead of asking whether a specific action is applicable, ACPBench Hard asks the model to generate all applicable actions. Instead of choosing from a short list, the model has to produce the answer a planner would actually consume.
That distinction matters. Multiple-choice tests can hide weakness by narrowing the action space. Real systems rarely offer such mercy. A procurement agent does not get four lovingly curated options, one of which is correct. It must know whether the supplier exists, whether the invoice is approved, whether the amount exceeds a threshold, whether the purchase order has already been matched, and which operation is legal next.
This is where the “LLMs can plan if prompted well enough” story becomes too smooth. Prompting can help with format and decomposition. It does not automatically supply the missing semantics of an action system.
Planning is a stack of small promises
The paper’s strongest editorial lesson is that planning should not be treated as one capability. It is a stack of smaller promises. Each promise can fail independently.
At the base is applicability: given the current state, identify every action that can be performed. This sounds almost insultingly basic. It is not. If a model misses a valid action, the planner may lose completeness: there may be a solution, but the system never explores it. If the model invents an invalid action, the planner loses soundness: it may produce a plan that cannot actually execute. In business language, the first failure is missed opportunity; the second is operational nonsense with an audit trail.
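The completeness and soundness failures can be made concrete with a minimal sketch. It assumes a STRIPS-style encoding in which a state is a set of true facts and each action lists the facts it requires. The invoice workflow, the action names, and the `proposed` answer are hypothetical illustrations, not items from the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset  # facts that must hold before the action can run


def applicable_actions(state, actions):
    """Return the names of every action whose preconditions all hold in `state`."""
    return {a.name for a in actions if a.preconditions <= state}


# Hypothetical invoice workflow (illustrative only).
ACTIONS = [
    Action("match_po", frozenset({"invoice_received", "po_exists"})),
    Action("approve_invoice", frozenset({"po_matched"})),
    Action("reject_invoice", frozenset({"invoice_received"})),
    Action("pay_invoice", frozenset({"invoice_approved"})),
]

state = frozenset({"invoice_received", "po_exists"})
truth = applicable_actions(state, ACTIONS)

# A made-up model answer, compared against the ground-truth set.
proposed = {"match_po", "approve_invoice"}
missed = truth - proposed      # valid actions the answer omitted (completeness risk)
invented = proposed - truth    # actions that cannot run now (soundness risk)

print("applicable:", sorted(truth))
print("missed:", sorted(missed))
print("invented:", sorted(invented))
```

Here the answer misses `reject_invoice` and invents `approve_invoice`, so it fails on both counts even though half of it is right.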
Then comes progression: after an action is performed, which facts become true and which stop being true? This is the state-tracking layer. If the model fails here, the agent operates on a fictional world. It may believe an item remains in inventory after allocation, an approval remains pending after approval, or a customer has not been contacted after a message has already been sent. The machine equivalent of “I thought someone else had done it” is not more charming because it arrived via API.
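As a rough illustration of progression, the sketch below applies one action to a state and reports which facts flip. It assumes the usual STRIPS convention of add and delete effects; the inventory facts and the `allocate_item` action are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset
    add_effects: frozenset     # facts that become true
    delete_effects: frozenset  # facts that stop being true


def progress(state, action):
    """Return the successor state, refusing actions that are not applicable."""
    if not action.preconditions <= state:
        missing = sorted(action.preconditions - state)
        raise ValueError(f"{action.name} is not applicable; missing {missing}")
    return (state - action.delete_effects) | action.add_effects


# Hypothetical inventory allocation step (illustrative only).
allocate = Action(
    name="allocate_item",
    preconditions=frozenset({"item_in_stock", "order_open"}),
    add_effects=frozenset({"item_allocated"}),
    delete_effects=frozenset({"item_in_stock"}),
)

before = frozenset({"item_in_stock", "order_open"})
after = progress(before, allocate)

became_true = after - before
became_false = before - after
print("became true:", sorted(became_true))
print("became false:", sorted(became_false))
```

An agent that still believes `item_in_stock` after this step is working from the stale state described above.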
Above that are reachability and action reachability. Reachability asks which facts can never become true from the current state. Action reachability asks which actions can never become applicable. These are not decorative reasoning tasks. They are pruning mechanisms. A planner that cannot detect impossible states wastes effort pursuing dead ends. An agent that cannot detect unreachable actions may keep trying to unlock a workflow path that the current state can never permit.
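For a toy domain, both questions can be answered by brute force. The sketch below runs a breadth-first search over every state reachable from the start and records which facts ever hold and which actions ever become applicable. Exhaustive search like this only works for very small state spaces, and the ticket workflow is a hypothetical example rather than a benchmark domain.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset = frozenset()


def explore(initial, actions):
    """Breadth-first search over all states reachable from `initial`.

    Returns (facts true in some reachable state,
             names of actions applicable in some reachable state).
    """
    seen = {initial}
    queue = deque([initial])
    facts, usable = set(initial), set()
    while queue:
        state = queue.popleft()
        for a in actions:
            if a.pre <= state:
                usable.add(a.name)
                nxt = (state - a.delete) | a.add
                if nxt not in seen:
                    seen.add(nxt)
                    facts |= nxt
                    queue.append(nxt)
    return facts, usable


# Hypothetical support-ticket workflow: nothing ever makes the customer premium.
ACTIONS = [
    Action("close_ticket", frozenset({"open"}), frozenset({"closed"}), frozenset({"open"})),
    Action("escalate", frozenset({"open", "premium_customer"}), frozenset({"escalated"})),
    Action("refund", frozenset({"escalated"}), frozenset({"refunded"})),
]

all_facts = {"open", "closed", "premium_customer", "escalated", "refunded"}
reachable_facts, usable_actions = explore(frozenset({"open"}), ACTIONS)

unreachable_facts = all_facts - reachable_facts
unreachable_actions = {a.name for a in ACTIONS} - usable_actions
print("unreachable facts:", sorted(unreachable_facts))
print("unreachable actions:", sorted(unreachable_actions))
```

In this example the escalation and refund branches are dead from the start, which is exactly the kind of path an agent should stop trying to unlock.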
At the plan level, validation asks where a proposed plan first breaks. Justification asks whether redundant actions can be removed without destroying the plan. Landmarks ask which facts must become true along every valid solution path. Finally, next action asks for a step that moves the state closer to the goal, not merely a step that sounds productive.
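Validation and justification are easy to sketch once progression exists. The example below assumes the same STRIPS-style encoding, finds the first step whose preconditions fail, and then looks for single actions that can be dropped while the plan still reaches the goal. Checking only single removals is a simplification made here, and the report workflow is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset = frozenset()


def first_invalid_step(state, plan, goal):
    """Index of the first action whose preconditions fail; len(plan) if every
    step runs but the goal is not reached; None if the plan is valid."""
    for i, a in enumerate(plan):
        if not a.pre <= state:
            return i
        state = (state - a.delete) | a.add
    return None if goal <= state else len(plan)


def removable_steps(state, plan, goal):
    """Indices of single actions that can be dropped while the plan stays valid."""
    return [
        i for i in range(len(plan))
        if first_invalid_step(state, plan[:i] + plan[i + 1:], goal) is None
    ]


# Hypothetical report workflow (illustrative only).
draft = Action("draft_report", frozenset(), frozenset({"drafted"}))
review = Action("review_report", frozenset({"drafted"}), frozenset({"reviewed"}))
notify = Action("notify_team", frozenset({"drafted"}), frozenset({"team_notified"}))
publish = Action("publish_report", frozenset({"reviewed"}), frozenset({"published"}))

start = frozenset()
goal = frozenset({"published"})

good_plan = [draft, review, notify, publish]
broken_plan = [draft, publish, review]  # publishes before the review exists

print("good plan breaks at:", first_invalid_step(start, good_plan, goal))
print("broken plan breaks at:", first_invalid_step(start, broken_plan, goal))
print("removable steps:", removable_steps(start, good_plan, goal))
```

The broken plan fails at index 1, and the only removable step in the good plan is `notify_team`, which contributes nothing to the goal.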
The point is not that every enterprise workflow should be expressed in PDDL by lunchtime. The point is that action-state logic exists whether the organisation models it explicitly or not. A customer-service agent still has preconditions. A finance workflow still has state transitions. A logistics system still has impossible moves. Ignoring that structure does not remove it. It only transfers the structure into the model’s vibes department, where good governance goes to die quietly.
ACPBench Hard removes the answer menu
The benchmark contains 1,040 open-ended questions: 10 questions for each of 8 tasks across 13 PDDL domains. The domains include familiar planning settings such as ferry, logistics, blocksworld, grid, rovers, satellite, and AlfWorld-style tasks. The questions are rendered in natural language using templates, while the underlying problems retain their symbolic PDDL structure.
This design is not merely aesthetic. It gives the benchmark two useful properties.
First, the input is readable by language models. The model receives a natural-language description of the domain, state, available actions, and question. That keeps the benchmark close to the way many LLM-agent systems are actually prompted.
Second, the answer can be validated symbolically. The authors do not ask another model to judge whether an answer “looks right.” They build task-specific validators. For some tasks, validation is straightforward: compare the generated set of actions or effects with the stored correct set. For others, validation requires solving planning problems: checking answers for reachability, action reachability, landmarks, and next action can come down to deciding plan existence, which is PSPACE-complete for classical planning in general.
That is a major contribution. Generative benchmarks often collapse at evaluation time because open-ended answers are messy. ACPBench Hard treats evaluation as part of the scientific object. If the benchmark asks for planner-like outputs, the scoring process must be planner-aware. Using an LLM judge for this would be delightfully circular: asking one uncertain planner to grade another uncertain planner. The authors decline that little ceremony.
The paper’s implementation choices also matter. The experiments use two-shot prompting with static examples outside the evaluation set, specify the expected response format, and parse outputs with a lenient grammar. This reduces the chance that poor scores are merely formatting failures. Some parsing issues remain, but the core result is not that models cannot print parentheses properly. The core result is that they often generate the wrong planning content inside the parentheses.
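To give a feel for what lenient parsing means in practice, here is a small sketch that pulls parenthesised actions out of free-form model output. It is not the grammar used in the paper. The regular expression, the `parse_actions` helper, and the sample output are assumptions made for illustration.

```python
import re


def parse_actions(text):
    """Extract parenthesised actions from free-form model output.

    Lenient by design: ignores surrounding prose, letter case, commas,
    and extra whitespace inside the parentheses.
    """
    actions = set()
    for body in re.findall(r"\(([^()]*)\)", text):
        tokens = body.lower().replace(",", " ").split()
        if tokens:
            actions.add(tuple(tokens))
    return actions


# Made-up model output with inconsistent formatting.
raw = "Sure! The applicable actions are:\n( Board car1  loc2 ), (sail loc2, loc3)."
parsed = parse_actions(raw)
print(sorted(parsed))
```

A parser this forgiving removes formatting as an excuse. Whatever is left wrong after parsing is wrong in content.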
The hard parts are not the dramatic parts
The experimental results split the planner primitives into a pattern that should make operators uncomfortable.
For small and medium models, progression is the easiest task, but even the best score among those models reaches only 43%. Applicability and action reachability are disastrous: none of the small or medium models scores above 2% on those two tasks. The remaining tasks sit in a low middle band.
Large models improve, but not into operational reliability. GPT-4o is the strongest non-reasoning model across several tasks, scoring 25% on applicability, 1% on action reachability, 54% on justification, 29% on landmarks, 55% on next action, 78% on progression, 32% on reachability, and 62% on validation. DeepSeek V3 beats GPT-4o on some tasks, including action reachability and justification, but this is not a rescue. Action reachability remains low across the board.
Reasoning models show sharper improvements in certain places. o1-preview reaches 89% on progression and 80% on next action, with 66% on reachability and 56% on landmarks. But it still reaches only 44% on applicability and 12% on action reachability. o1-mini scores 78% on validation, but only 6% on action reachability. DeepSeek R1 scores 77% on progression and 53% on validation, but 5% on applicability and 1% on action reachability.
This is the useful unpleasantness of the paper. The weakest points are not only exotic, multi-step reasoning puzzles. One of the weakest points is action applicability: identify all actions that can be performed now. That is the interface between thought and execution.
The paper also compares the generative task format with the earlier boolean and multiple-choice formats using GPT-4o. The generative format produces significantly higher error rates than boolean and multiple-choice versions, except for validation. That result has a clear interpretation: much of the apparent competence in easier benchmark formats may come from answer scaffolding. Remove the menu, and the model has to generate the planner’s working set itself. The magic thins.
Exact action lists are brutal because planners need soundness
The applicability task is especially instructive because it exposes a trade-off that business teams often blur.
ACPBench Hard scores applicability by exact set match. The model must generate precisely all applicable actions. Missing actions are wrong. Extra actions are wrong. That may sound harsh until one asks what a planner is supposed to do with the output.
If the model returns only a subset of valid actions, the planner may remain sound but lose completeness. It will not execute impossible actions, but it may miss solutions. In an enterprise workflow, that means the agent becomes conservative in a strange, opaque way. It may fail to find a valid route through a process that a human could complete.
If the model returns invalid actions, the planner may lose soundness. This is worse. The system may build a plan that contains illegal tool calls, impossible handoffs, or operations whose preconditions are not met. In regulated or operationally sensitive domains, that is not a “model limitation.” It is a control failure with better branding.
The authors test a less strict Jaccard-style metric for the best-performing model on applicability, o1-preview. The mean score rises from 0.44 under exact match to 0.57 under Jaccard similarity. That is informative, but not exonerating. A partial action list can be useful if the system is designed to tolerate incompleteness. It is not enough if the agent is expected to enumerate legal next moves reliably.
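The gap between the two metrics is easy to see on a toy case. The sketch below uses the standard Jaccard index (intersection over union), which may differ in detail from the paper's Jaccard-style score. The action names and predicted sets are invented for illustration.

```python
def exact_match(predicted, truth):
    """1.0 only when the predicted set equals the true set."""
    return 1.0 if set(predicted) == set(truth) else 0.0


def jaccard(predicted, truth):
    """Size of the intersection divided by size of the union."""
    predicted, truth = set(predicted), set(truth)
    if not predicted and not truth:
        return 1.0  # both empty: treat as a perfect match
    return len(predicted & truth) / len(predicted | truth)


truth = {"match_po", "reject_invoice", "request_info", "hold_invoice"}

incomplete = {"match_po", "reject_invoice", "request_info"}  # misses one
unsound = truth | {"pay_invoice"}                            # invents one

for label, answer in [("incomplete", incomplete), ("unsound", unsound)]:
    print(label, exact_match(answer, truth), round(jaccard(answer, truth), 2))
```

Both answers score zero under exact match and roughly 0.75 to 0.8 under Jaccard. The softer metric rewards overlap, but it does not distinguish a missing action from an invented one, and those carry different operational risk.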
This is where many enterprise AI evaluations become too forgiving. A demo rewards plausible forward motion. A planner requires correct admissible motion. Those are not the same product requirement.
The evidence is not one leaderboard; it is a diagnostic map
The paper includes several experimental components, and they should not be read as equal claims. Their roles differ.
| Paper component | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Main benchmark tables across model sizes | Main evidence | ACPBench Hard is difficult across model families; weaknesses vary by task | That no LLM architecture can ever improve on these primitives |
| GPT-4o domain-wise analysis | Diagnostic evidence | Performance depends on both task and domain; no domain is universally easy | That GPT-4o’s domain pattern generalises to all models |
| Generative vs boolean/multiple-choice comparison | Comparison with prior format | Open-ended planner outputs are much harder than selecting from constrained options | That boolean or MCQ planning tests are useless |
| PDDL vs natural language vs combined representation | Representation sensitivity test | Symbolic representations help performance, especially when combined with natural language | That LLMs should replace symbolic planners when PDDL is available |
| Next-action split by domain and distance | Exploratory diagnostic extension | Next-action performance varies sharply by domain and sometimes behaves unexpectedly | That next-action success alone is enough for reliable planning |
| Less strict applicability scoring | Sensitivity test | Some models produce partially useful action sets | That partial action generation is safe in execution systems |
The representation experiment is particularly revealing. DeepSeek V3 performs better when the prompt includes PDDL, and best when it receives both PDDL and natural language. The average score rises from 0.39 with natural language alone to 0.44 with PDDL and 0.47 with both.
That improvement has a beautifully inconvenient implication. If a formal PDDL model is available, then the organisation may already have the structure needed for a classical planner. At that point, using the LLM as the planner component may be less attractive than using it around the planner: translating user intent, explaining plans, handling exceptions, or mediating between messy language and formal state.
In other words, the more symbolic structure you give the model, the better it does. But the moment you have enough symbolic structure, the case for letting the model improvise the planning logic becomes weaker. There is a small irony here. Naturally, it has excellent posture.
What Cognaptus infers for business systems
The paper directly shows that current tested models perform poorly or inconsistently on open-ended planner primitives in classical planning domains. It does not directly test a bank’s loan-processing bot, a hotel operations assistant, a construction procurement workflow, or a customer-support agent with tool access. Those are Cognaptus-level applications, not paper-level claims.
Still, the translation is straightforward because enterprise workflows also depend on actions, states, preconditions, effects, and goals.
| What the paper directly shows | Cognaptus inference for business use | Boundary |
|---|---|---|
| Open-ended action generation is much harder than constrained answer selection | Agent tests should include free-form generation of valid tool calls, not just multiple-choice routing | Tool APIs may impose schema constraints that reduce the action space |
| Applicability is weak even in large and reasoning models | Do not let models decide legal next actions without runtime checks | Some domains have simpler preconditions and may be safer |
| Progression is easier but still imperfect | Agents need external state stores and post-action reconciliation | The benchmark uses deterministic state transitions |
| Action reachability is consistently difficult | Agents may waste cycles pursuing impossible workflow branches | Business processes may include human overrides not captured in formal models |
| PDDL improves model performance | Formal process models can improve agent reliability | If the formal model exists, a symbolic planner may be the better planner |
| Symbolic validators make open-ended evaluation possible | Build validators for the parts of the workflow that carry operational risk | Some real-world states are ambiguous, incomplete, or probabilistic |
The operational recommendation is not to ban LLMs from planning-like systems. It is to narrow their authority.
Use LLMs to interpret messy requests, propose candidate goals, summarise state, explain plans, and interact with humans. Use symbolic or rule-based components to check preconditions, state transitions, reachability, and compliance constraints wherever those structures are available. Where they are not available, start by building smaller validators around high-risk decisions.
A practical enterprise evaluation suite should ask the agent questions like these:
- Given this current workflow state, list every legal next tool call.
- After this tool call, list exactly which state fields change.
- Identify any requested goal that is impossible without a missing prerequisite.
- Find the first invalid step in this generated plan.
- Remove redundant steps without changing the successful outcome.
- Name the approvals, documents, or state changes that every valid route must include.
- Choose the next action and explain which measurable distance-to-goal it reduces.
That is a more useful test than asking whether the agent “completed the task” in a sandbox built to forgive it. Sandboxes often measure whether a model can muddle through. Operations require knowing where the muddle begins.
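Two of the questions in that list, the mandatory items and the goal-improving next action, can be answered by reference checks when the workflow is small enough to search. The sketch below is one way to build them, assuming a STRIPS-style model and exhaustive breadth-first search. A fact counts as a landmark here if the goal becomes unreachable once every action that adds it is removed. The maintenance workflow and all names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset = frozenset()


def apply(state, a):
    return (state - a.delete) | a.add


def distance_to_goal(state, goal, actions):
    """Length of the shortest plan from `state` to the goal, or None if none exists."""
    seen, queue = {state}, deque([(state, 0)])
    while queue:
        s, d = queue.popleft()
        if goal <= s:
            return d
        for a in actions:
            if a.pre <= s:
                n = apply(s, a)
                if n not in seen:
                    seen.add(n)
                    queue.append((n, d + 1))
    return None


def improving_actions(state, goal, actions):
    """Applicable actions that strictly reduce the distance to the goal."""
    here = distance_to_goal(state, goal, actions)
    if here is None:
        return set()
    result = set()
    for a in actions:
        if a.pre <= state:
            d = distance_to_goal(apply(state, a), goal, actions)
            if d is not None and d < here:
                result.add(a.name)
    return result


def landmarks(state, goal, actions):
    """Facts not yet true that every plan must achieve (vacuous if no plan exists)."""
    candidates = set().union(*(a.add for a in actions)) - state
    return {
        f for f in candidates
        if distance_to_goal(state, goal, [a for a in actions if f not in a.add]) is None
    }


# Hypothetical maintenance-visit workflow (illustrative only).
ACTIONS = [
    Action("assign_technician", frozenset({"ticket_open"}), frozenset({"technician_assigned"})),
    Action("order_part", frozenset({"ticket_open"}), frozenset({"part_ordered"})),
    Action("send_survey", frozenset({"ticket_open"}), frozenset({"survey_sent"})),
    Action("schedule_visit", frozenset({"technician_assigned", "part_ordered"}),
           frozenset({"visit_scheduled"})),
]
state = frozenset({"ticket_open"})
goal = frozenset({"visit_scheduled"})

print("distance:", distance_to_goal(state, goal, ACTIONS))
print("improving:", sorted(improving_actions(state, goal, ACTIONS)))
print("landmarks:", sorted(landmarks(state, goal, ACTIONS)))
```

Sending the survey is applicable and sounds productive, but it leaves the distance at three, so it is not a goal-improving next action.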
The planner should be a system, not a personality trait
A tempting response to ACPBench Hard is to reach for larger reasoning models. The paper gives only partial comfort there. Reasoning models do improve some tasks, but the improvement is uneven and expensive. The authors also note the practical cost issue: larger and reasoning models can be too expensive to use as planner components, especially when planning requires repeated calls.
That cost point matters because planning is iterative. A planner does not ask one question. It asks many: what is applicable, what changes, what is reachable, which action reduces distance, did the plan break, can the plan be simplified? If every internal planner primitive requires an expensive model call, the architecture becomes operationally unattractive before it becomes reliable.
The more sensible architecture is hybrid.
The LLM should not be the only bearer of planning logic. It should sit inside a system that includes:
- typed action schemas;
- explicit preconditions and effects;
- state tracking outside the model context window;
- validators for generated actions and plans;
- fallback symbolic planners where formal models exist;
- audit logs showing why an action was considered legal;
- refusal paths when the model cannot generate a valid next step.
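A minimal sketch of how several of these pieces might fit together follows. It assumes the model only proposes an action name, while a separate gatekeeper owns the state, checks preconditions against typed schemas, writes an audit entry, and refuses anything it cannot justify. The `Gatekeeper` class, the schemas, and the proposed names are all hypothetical, and the proposals are hard-coded where a model call would sit.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ActionSchema:
    name: str
    pre: frozenset
    add: frozenset
    delete: frozenset = frozenset()


@dataclass
class Gatekeeper:
    schemas: dict            # action name -> ActionSchema
    state: frozenset         # tracked outside the model's context window
    audit_log: list = field(default_factory=list)

    def execute(self, proposed_name):
        """Run a proposed action only if its schema exists and its preconditions hold."""
        schema = self.schemas.get(proposed_name)
        if schema is None:
            self.audit_log.append((proposed_name, "refused", "unknown action"))
            return False
        missing = schema.pre - self.state
        if missing:
            self.audit_log.append((proposed_name, "refused", f"missing: {sorted(missing)}"))
            return False
        self.state = (self.state - schema.delete) | schema.add
        self.audit_log.append((proposed_name, "executed", f"held: {sorted(schema.pre)}"))
        return True


# Hypothetical schemas for an invoice workflow (illustrative only).
SCHEMAS = {
    s.name: s for s in [
        ActionSchema("approve_invoice", frozenset({"po_matched"}), frozenset({"invoice_approved"})),
        ActionSchema("pay_invoice", frozenset({"invoice_approved"}), frozenset({"invoice_paid"})),
    ]
}

gate = Gatekeeper(schemas=SCHEMAS, state=frozenset({"po_matched"}))

# Stand-ins for model proposals, in the order they arrive.
results = [gate.execute(name) for name in
           ["pay_invoice", "wire_funds", "approve_invoice", "pay_invoice"]]

print(results)
for entry in gate.audit_log:
    print(entry)
```

The first payment attempt and the invented `wire_funds` call are refused and logged. The same payment succeeds once the approval has actually happened, and the log records which preconditions held at the time.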
This does not make the system less “agentic.” It makes it less silly. Agency without valid action semantics is just confident autocomplete with tools attached.
The boundary: classical planning is a diagnostic, not the whole world
ACPBench Hard uses classical planning domains: closed-world assumptions, deterministic dynamics, full observability, and PDDL-style action descriptions. Many business environments are messier. Information may be incomplete. Human responses are uncertain. Rules change. Tool outputs can be delayed or inconsistent. Some actions are probabilistic. Some goals are negotiated rather than formally specified.
That boundary cuts both ways.
On one hand, the benchmark does not prove that every enterprise LLM agent will fail in the same way. A real system may constrain tool calls, include human review, use retrieval, rely on workflow engines, or operate in a narrow domain where the action space is small.
On the other hand, classical planning is cleaner than most business reality. If models struggle to generate valid planner primitives in a structured deterministic setting, it is not obvious why they should be trusted more in a messier environment without additional controls. The benchmark is not a full simulation of enterprise operations. It is a diagnostic stress test for the logic that enterprise operations also need.
The paper also evaluates final answers, not the internal reasoning process. A model may reason sensibly and output badly, or reason badly and output luckily. The benchmark is intentionally concerned with the output because planners consume outputs. In production, this is the right bias. The warehouse system does not care whether the model’s hidden chain of thought was spiritually on the right track if the resulting action is impossible.
There is also prompt sensitivity. The authors use the same prompt strategy across models, with two-shot examples and temperature zero. Different prompting, domain-specific training, tool feedback, or agent scaffolding could change performance. That is not a weakness of the paper’s claim; it defines the next engineering question. If a team believes its agent scaffolding fixes these primitives, it should test that claim directly.
The useful conclusion is not pessimism; it is instrumentation
The paper’s title says “Hard,” and for once the adjective earns its salary. But the practical conclusion is not that language models are useless for planning. It is that planning competence has to be instrumented.
End-to-end agent success is a lagging indicator. It tells you whether the system survived a particular path. ACPBench Hard points toward leading indicators: whether the model can enumerate legal actions, update state, reject impossible branches, identify mandatory subgoals, and select useful next moves. These are the checks that reveal whether an agent is planning or merely narrating action.
For enterprise AI, that is a useful shift. It moves evaluation away from theatre and toward failure diagnosis. Instead of asking, “Can the model plan?”, ask:
- Which planner primitive does this workflow require?
- Can the model perform that primitive without answer choices?
- Can we validate the output symbolically or procedurally?
- What happens if the model returns an incomplete set?
- What happens if it returns an invalid action?
- Which failures are acceptable, and which create operational or compliance risk?
That last distinction matters. Missing a harmless optional action is one thing. Inventing an invalid financial approval step is another. Treating both as “planning errors” is analytically lazy and operationally dangerous.
The quiet lesson of ACPBench Hard is that autonomous agents need rules before they need swagger. LLMs can help interpret, communicate, and sometimes propose. But when a system must decide what can legally happen next, logic is not an old-fashioned accessory. It is the lock on the machinery.
Cognaptus: Automate the Present, Incubate the Future.
[1] Harsha Kokel, Michael Katz, Kavitha Srinivas, and Shirin Sohrabi, “ACPBench Hard: Unrestrained Reasoning about Action, Change, and Planning,” arXiv:2503.24378.