Stacking the Odds: Why Blocksworld Still Breaks Your Fancy LLM Agent

A robot arm, a few colored blocks, and a table. That is the setup. No messy warehouse, no sensor dust, no tired operator, no forklift reversing into the wrong aisle. Just blocks.

And still, the fancy LLM agent stumbles.

That is the useful discomfort in Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol.¹ The paper does not show a robot revolution. It shows something more valuable for anyone trying to deploy LLM agents in industrial workflows: even in a symbolic world where the rules are explicit, the actions are discrete, the state can be queried, and the tool interface is standardized, reliability degrades as soon as the task stops being politely simple.

The headline result is not that an OpenAI o3-based agent can solve some Blocksworld tasks. Of course it can. Give a strong reasoning model four primitive actions and a tidy environment, and it will often behave impressively enough for a demo video.

The real result is the shape of the degradation.

In the benchmark’s single-agent baseline, success falls from 80% in basic scenarios to 60% under partial observability. Average time rises from 76 seconds in basic tasks to 732 seconds in additional-constraint tasks and 676 seconds under partial observability. Planning attempts rise from 1.1 to 3.1. Average token use moves from roughly 35,100 total tokens in basic tasks to about 192,200 tokens in partial-observability tasks. The agent is not merely failing more often. It is spending more reasoning budget, making more correction attempts, and still becoming less dependable.

That is the pattern businesses should notice. The failure mode is not “the agent cannot think.” It is worse: the agent can think, can call tools, can verify plans, can receive error messages, and can still become operationally fragile when constraints and incomplete information enter the room. In other words, the agent has enough intelligence to look plausible and enough uncertainty to be expensive.

The baseline result is the argument

The paper evaluates a single ReAct-style LangGraph agent using OpenAI’s o3 model snapshot o3-2025-04-16. The agent has access to a Model Context Protocol server that wraps the Blocksworld simulation tools. Its instructed workflow is sensible: get the rules, get the current state, generate a plan, verify the plan, correct errors using feedback, execute the plan, and confirm success.

That setup matters because it rules out the lazy excuse that the agent was “not allowed to use tools.” It was. It had information tools, a plan-verification tool, and execution tools. The environment returned structured state and natural-language explanations for constraint violations. The benchmark is almost generous.

Yet the average category-level results look like this:

Scenario category	Likely purpose in the experiment	Success rate	Avg. time	Avg. planning attempts	Avg. total tokens	Avg. cost
C1 Basic	Main evidence: baseline planning and execution under simple rules	80%	76 s	1.1	35,126	$0.08
C2 Non-constructive actions	Main evidence: tests whether the agent can temporarily move away from the goal to reach it	70%	290 s	1.7	111,721	$0.30
C3 Impossible	Main evidence: tests recognition of unsolvable cases under constraints	100%	125 s	1.8	18,195	$0.08
C4 Additional constraints	Main evidence: tests reasoning under extra stacking rules such as block size	70%	732 s	2.2	143,715	$0.50
C5 Partial observability	Main evidence: tests state inference and querying when only top blocks are visible	60%	676 s	3.1	192,245	$0.63

The numbers are small enough to fit in a table and large enough to spoil a procurement deck.

The basic category is not perfect. An 80% success rate in a toy symbolic world is already a warning. But the more revealing comparisons are across categories. Category 2 requires non-constructive actions: the agent must temporarily move blocks in ways that do not immediately build the final arrangement. This is where planning stops looking like greedy construction. The success rate drops to 70%, time nearly quadruples, and token consumption more than triples.
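To make those ratios concrete, here is a minimal Python sketch that recomputes them from the category averages in the table above. The dictionary layout and the helper name `ratio_to_basic` are illustrative choices for this article, not anything defined by the paper.

```python
# Category averages copied from the results table (one run per scenario).
BASELINE = {
    "C1": {"success": 0.80, "time_s": 76, "attempts": 1.1, "tokens": 35_126},
    "C2": {"success": 0.70, "time_s": 290, "attempts": 1.7, "tokens": 111_721},
    "C3": {"success": 1.00, "time_s": 125, "attempts": 1.8, "tokens": 18_195},
    "C4": {"success": 0.70, "time_s": 732, "attempts": 2.2, "tokens": 143_715},
    "C5": {"success": 0.60, "time_s": 676, "attempts": 3.1, "tokens": 192_245},
}


def ratio_to_basic(category, metric):
    """Return how many times larger a metric is than in the C1 basic category."""
    return BASELINE[category][metric] / BASELINE["C1"][metric]


# Print the multipliers for the three categories that degrade.
for category in ("C2", "C4", "C5"):
    multipliers = {
        metric: round(ratio_to_basic(category, metric), 2)
        for metric in ("time_s", "attempts", "tokens")
    }
    print(category, multipliers)
```

Reading the multipliers side by side is the point: success rate moves by ten or twenty points, while time and tokens move by whole multiples.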

Category 4 adds block-size constraints. The agent still succeeds 70% of the time, but the cost of getting there changes sharply: 732 seconds on average, 2.2 planning attempts, and over 143,000 total tokens. The paper reports systematic difficulty in scenarios requiring extensive non-constructive actions and occasional misclassification of solvable problems as unsolvable.

Category 5 is even more telling. Under partial observability, only the top two blocks in each stack are visible. The agent must query, infer, and act with incomplete state knowledge. Success drops to 60%. Planning attempts reach 3.1. The paper also notes that faulty tool executions appear here for the first time, including tool calls with incorrect block names as arguments.

That is not a small cosmetic failure. In a symbolic planning environment, calling a tool with a non-existent or wrong object name is the baby version of sending an instruction to the wrong device, wrong inventory item, or wrong workcell. The blocks are cute. The failure class is not.
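One cheap defense against this failure class is a guard that checks a tool call before it is sent. The sketch below is a hypothetical pre-flight check, not part of the benchmark: the `ARITY` table and the `validate_call` helper are assumptions made for illustration, using the four action names the article lists.

```python
# Assumed argument counts for the four primitive actions (illustrative only).
ARITY = {"pick_up": 1, "put_down": 1, "stack": 2, "unstack": 2}


def validate_call(name, args, known_blocks):
    """Return a list of problems with a proposed tool call; empty means it may be sent."""
    if name not in ARITY:
        return [f"unknown tool: {name}"]
    problems = []
    if len(args) != ARITY[name]:
        problems.append(f"{name} expects {ARITY[name]} argument(s), got {len(args)}")
    # Flag every argument that does not name a block the agent has actually seen.
    problems += [f"unknown block: {arg}" for arg in args if arg not in known_blocks]
    return problems


print(validate_call("stack", ["A", "Z"], {"A", "B", "C"}))
print(validate_call("pick_up", ["A"], {"A", "B", "C"}))
```

A guard like this does not make the agent smarter. It only turns a silent wrong-object call into an explicit rejection that can be logged and counted.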

Blocksworld is not childish; it is diagnostic

It is tempting to dismiss Blocksworld because it looks like an undergraduate AI exercise. That reaction is understandable and mostly wrong.

Blocksworld is useful because it strips planning down to the skeleton. There are blocks, stacks, a gripper, and four primitive actions: pick_up, put_down, stack, and unstack. Actions have preconditions. Blocks cannot be moved if something sits on top of them. The gripper holds only one block at a time. In this benchmark, table space is also finite, so intermediate moves are not free.
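The rules above fit in a few dozen lines. The following is a toy sketch, not the benchmark's simulation: the class name `BlocksWorld`, the argument order of each action, and the choice to model table space as a fixed number of stack positions are all assumptions made for illustration.

```python
class BlocksWorld:
    """Toy Blocksworld: each stack is a list (bottom first); an empty list is a free table position."""

    def __init__(self, stacks):
        self.stacks = [list(s) for s in stacks]  # fixed number of table positions
        self.holding = None                      # the gripper holds at most one block

    def _find_top(self, block):
        """Return the index of the stack whose top block is `block`, else raise."""
        for i, stack in enumerate(self.stacks):
            if stack and stack[-1] == block:
                return i
        raise ValueError(f"{block} is not clear (missing, covered, or in the gripper)")

    def _require_empty_gripper(self):
        if self.holding is not None:
            raise ValueError(f"gripper already holds {self.holding}")

    def pick_up(self, block):
        """Lift a clear block that sits directly on the table."""
        self._require_empty_gripper()
        i = self._find_top(block)
        if len(self.stacks[i]) != 1:
            raise ValueError(f"{block} is on another block; use unstack")
        self.holding = self.stacks[i].pop()

    def unstack(self, block, below):
        """Lift a clear block off the block directly beneath it."""
        self._require_empty_gripper()
        i = self._find_top(block)
        if len(self.stacks[i]) < 2 or self.stacks[i][-2] != below:
            raise ValueError(f"{block} is not on {below}")
        self.holding = self.stacks[i].pop()

    def put_down(self, block):
        """Place the held block on a free table position; fail if the table is full."""
        if self.holding != block:
            raise ValueError(f"gripper is not holding {block}")
        for stack in self.stacks:
            if not stack:
                stack.append(block)
                self.holding = None
                return
        raise ValueError("no free table position")

    def stack(self, block, target):
        """Place the held block on top of a clear target block."""
        if self.holding != block:
            raise ValueError(f"gripper is not holding {block}")
        i = self._find_top(target)
        self.stacks[i].append(block)
        self.holding = None


world = BlocksWorld([["A", "B"], ["C"], []])
world.unstack("B", "A")
world.stack("B", "C")
print(world.stacks)
```

Every action either succeeds or raises with a reason. That is the property that makes the domain diagnostic: there is no ambiguity about whether a move was legal.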

The simplicity is not a weakness. It is the microscope.

A warehouse robot does not fail only because warehouses are complicated. It also fails because it must maintain state, respect constraints, recover from bad intermediate choices, and distinguish impossible goals from goals that are merely inconvenient. Blocksworld tests these capabilities in a form where mistakes are visible rather than buried under a thousand real-world confounders.

The paper’s benchmark varies complexity along four dimensions:

Complexity dimension	What it tests	Operational analogue
Number of blocks and misplaced blocks	Search-space growth and required movement count	Larger task queues, more inventory locations, more dependencies
Non-constructive actions	Whether the agent can move away from the immediate goal to make future progress	Temporary staging, rework, clearing buffers, moving items before final placement
Additional constraints	Whether the agent respects domain-specific validity rules	Weight, size, compatibility, safety, tool-access, or sequencing restrictions
Partial observability	Whether the agent can operate when state is incomplete	Hidden inventory, sensor limits, stale databases, occluded objects, uncertain machine status

This is where the benchmark becomes business-relevant. Most enterprise agent demos implicitly assume Category 1. Real operations live somewhere between Category 2 and Category 5. They require temporary regressions, rule compliance, partial state, and recovery from failed actions. That is exactly where the baseline begins to look less like magic and more like a junior planner with a corporate card for token spend.

The benchmark’s real contribution is executable evaluation, not another static puzzle set

The paper’s first contribution is architectural: it introduces an executable Blocksworld benchmark connected through MCP. This is more important than it sounds.

Many planning benchmarks ask models to output a plan. The plan may then be checked after the fact. That is useful for measuring static reasoning, but it misses the operational loop: query state, propose actions, verify, execute, observe feedback, revise. A plan that looks coherent in text is not necessarily executable in a changing environment. The difference between “sounds right” and “runs correctly” is where automation budgets go to die.
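The loop is easy to state in code, which is part of why its absence from static benchmarks matters. The sketch below is a hard-coded version under stated assumptions: every callable (`get_state`, `propose_plan`, `verify`, `execute`, `goal_reached`) is a hypothetical stand-in injected by the caller, and the toy counter environment exists only to exercise the loop. It is not the paper's agent.

```python
def run_episode(get_state, propose_plan, verify, execute, goal_reached, max_attempts=3):
    """Query state, propose a plan, verify it, then execute; revise on verification failure."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        state = get_state()
        plan = propose_plan(state, feedback)
        report = verify(state, plan)
        if not report["valid"]:
            feedback = report["reason"]  # fed back into the next planning attempt
            continue
        for action in plan:
            execute(action)
        # Confirm against the real state rather than trusting the plan.
        return {"success": goal_reached(get_state()), "attempts": attempt}
    return {"success": False, "attempts": max_attempts}


# Toy stand-ins: the "environment" is a counter, and the planner's first plan is invalid.
env = {"value": 0}
canned_plans = iter([["bad"], ["inc", "inc"]])

result = run_episode(
    get_state=lambda: dict(env),
    propose_plan=lambda state, feedback: next(canned_plans),
    verify=lambda state, plan: {"valid": "bad" not in plan, "reason": "unknown action 'bad'"},
    execute=lambda action: env.update(value=env["value"] + 1),
    goal_reached=lambda state: state["value"] == 2,
)
print(result)
```

The number of attempts comes out of the loop as data, which is exactly the kind of measurement a text-only plan check cannot produce.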

This benchmark has three layers.

First, the Blocksworld simulation provides the symbolic environment. It tracks the state, validates actions, handles constraints, and exposes the environment through a REST API. The simulation includes scenario definitions with initial configurations, goal configurations, constraint sets, and metadata such as category, known solution length, number of blocks, stack positions, and misplaced blocks.

Second, the REST API is wrapped by an MCP server. This matters because MCP provides a standardized tool interface. Agents can discover and call tools without the benchmark being rewritten for each implementation. In the paper’s architecture, the MCP server exposes tools for information, verification, and execution.

Third, different agent implementations can be connected through an MCP host. The paper demonstrates this with a single ReAct agent, but the architecture is intentionally modular. A hierarchical multi-agent planner, a hybrid symbolic-LLM planner, or a more rigid workflow-controlled agent could be evaluated against the same environment.
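For readers who think in data structures, the scenario definitions described for the simulation layer might look roughly like the record below. This is a sketch under assumptions: the class name `Scenario` and every field name are hypothetical, chosen to mirror the metadata listed above rather than the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record; field names are illustrative, not the benchmark's."""
    name: str
    category: int                                  # 1..5, matching the five categories
    initial: Tuple[Tuple[str, ...], ...]           # stacks, bottom block first
    goal: Tuple[Tuple[str, ...], ...]
    constraints: Tuple[str, ...] = ()              # e.g. ("block_size",)
    known_solution_length: Optional[int] = None    # None when no solution is known

    @property
    def num_blocks(self):
        """Total number of blocks in the initial configuration."""
        return sum(len(stack) for stack in self.initial)

    @property
    def stack_positions(self):
        """Number of table positions, including empty ones."""
        return len(self.initial)


demo = Scenario(
    name="demo",
    category=1,
    initial=(("A", "B"), ("C",), ()),
    goal=(("A",), ("C", "B"), ()),
    known_solution_length=2,
)
print(demo.num_blocks, demo.stack_positions)
```

Keeping scenarios as immutable records is a design choice that suits benchmarking: the same definition can be replayed against different agents without risk of one run mutating it for the next.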

This is why the benchmark is not merely “Blocksworld, but with LLMs.” It is closer to a small operational test rig. The point is not to admire the blocks. The point is to observe whether an agent can survive the full cycle of planning and control.

Verification helps, but it does not make the agent safe

One of the paper’s most practically useful design choices is the verify_plan tool. It checks a complete action sequence without modifying the simulation state. If the plan violates constraints, the tool identifies the first incorrect action and gives a natural-language explanation.
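A dry-run verifier of this kind can be sketched in a few lines. The version below is an assumption-laden illustration, not the benchmark's `verify_plan` implementation: the state layout, the `apply_action` helper, the action tuples, and the shape of the returned report are all hypothetical.

```python
import copy


def apply_action(state, action):
    """Apply one (name, *args) action in place; raise ValueError if a precondition fails."""
    name, *args = action
    stacks, holding = state["stacks"], state["holding"]
    if name in ("pick_up", "unstack"):
        block = args[0]
        if holding is not None:
            raise ValueError(f"gripper already holds {holding}")
        source = next((s for s in stacks if s and s[-1] == block), None)
        if source is None:
            raise ValueError(f"{block} is not clear")
        on_table = len(source) == 1
        if name == "pick_up" and not on_table:
            raise ValueError(f"{block} is on another block; use unstack")
        if name == "unstack" and (on_table or source[-2] != args[1]):
            raise ValueError(f"{block} is not on {args[1]}")
        state["holding"] = source.pop()
    elif name in ("put_down", "stack"):
        block = args[0]
        if holding != block:
            raise ValueError(f"gripper is not holding {block}")
        if name == "put_down":
            target = next((s for s in stacks if not s), None)
            if target is None:
                raise ValueError("no free table position")
        else:
            target = next((s for s in stacks if s and s[-1] == args[1]), None)
            if target is None:
                raise ValueError(f"{args[1]} is not clear")
        target.append(block)
        state["holding"] = None
    else:
        raise ValueError(f"unknown action {name}")


def verify_plan(state, plan):
    """Dry-run the plan on a deep copy; report the first failing step without touching `state`."""
    trial = copy.deepcopy(state)
    for index, action in enumerate(plan):
        try:
            apply_action(trial, action)
        except (ValueError, IndexError) as error:  # IndexError covers missing arguments
            return {"valid": False, "failed_step": index, "reason": str(error)}
    return {"valid": True, "failed_step": None, "reason": None}


state = {"stacks": [["A", "B"], ["C"], []], "holding": None}
plan = [("unstack", "B", "A"), ("stack", "B", "C"), ("pick_up", "C")]
result = verify_plan(state, plan)
print(result)
```

The deep copy is the important detail. Verification that mutates the live state is just execution with extra steps.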

For business readers, this is the right instinct. Do not let the agent go straight from “I have a plan” to “I touched the system.” Insert a verification layer. Check preconditions. Simulate before execution. Return actionable feedback. Make the agent repair the plan before acting.

But the baseline results also show why verification is not a magic amulet. The agent had verification and still failed. It sometimes needed multiple attempts. Under harder categories, it consumed far more time and tokens. Under partial observability, it began making faulty tool calls.

That distinction is essential. Verification reduces one class of failure: executing a plan that is obviously invalid under known rules. It does not guarantee that the agent will search efficiently, infer hidden state correctly, know when a solvable task is solvable, or avoid malformed tool usage under cognitive load.

A practical agent stack should therefore treat verification as one layer, not the safety strategy. The stronger pattern is:

Layer	Role	What it catches	What it may miss
Rule retrieval	Gives the agent current constraints	Outdated assumptions about valid actions	Misinterpretation of rules
State query	Grounds planning in current environment state	Planning from stale or imagined state	Hidden state and incomplete sensing
Plan verification	Tests action sequence before execution	Invalid preconditions and constraint violations	Poor strategy, excessive cost, unsolved hidden-state problems
Stepwise execution	Applies primitive actions with feedback	Runtime action rejection	Cascading bad recovery
Post-execution confirmation	Checks goal achievement	Premature “done” claims	Inadequate goal specification

The paper is useful because it evaluates the loop rather than praising the existence of one component. A weaker article would say “MCP enables tool use.” Fine. A better reading says: standardized tool use makes failures easier to compare, measure, and debug. That is more valuable and less LinkedIn-friendly, so naturally it is the part people may skip.

The five categories expose different failure economics

The evidence-first lesson is that not all difficulty is the same.

Category 2, non-constructive actions, increases the need for lookahead. A naive planner wants each move to visibly build the goal. But many real tasks require temporary disorder: moving a pallet out of the way, staging components, clearing a machine, or undoing a partial arrangement before rebuilding it correctly. The paper’s baseline suggests this kind of “move backward to move forward” reasoning already degrades performance.

Category 4, additional constraints, changes the validity landscape. The block-size rule permits stacking a block only on a block of equal or larger size. This resembles operational constraints such as weight limits, compatibility matrices, safety rules, machine capabilities, or material-handling restrictions. Here, the agent’s success rate does not collapse relative to Category 2, but the resource burden rises sharply. That means the business issue may not only be accuracy. It may be throughput, latency, and cost.
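The size rule itself is a one-line predicate, which is what makes the cost of reasoning around it notable. A minimal sketch, assuming block sizes are given as integers in a lookup table; the function names and the sample sizes are hypothetical.

```python
def can_stack(block, target, sizes):
    """Block-size rule: a block may rest only on a block of equal or larger size."""
    return sizes[block] <= sizes[target]


def first_size_violation(stack, sizes):
    """Return the first (upper, lower) pair in a bottom-first stack that breaks the rule, else None."""
    for lower, upper in zip(stack, stack[1:]):
        if not can_stack(upper, lower, sizes):
            return (upper, lower)
    return None


# Illustrative sizes, not taken from the benchmark.
sizes = {"A": 3, "B": 2, "C": 2, "D": 1}
print(first_size_violation(["A", "B", "C", "D"], sizes))
print(first_size_violation(["D", "A"], sizes))
```

Checking a finished stack is trivial. The hard part for a planner is that the rule also restricts every intermediate placement, so fewer temporary parking spots are legal.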

Category 5, partial observability, attacks the agent’s world model. If only the top two blocks are visible, the agent cannot simply read the whole arrangement and plan once. It must gather information, maintain uncertainty, and avoid hallucinating hidden objects. This is where the benchmark reports the lowest success rate and the first faulty tool executions. That is a different failure class from “the plan was too long.” It is closer to “the agent lost track of what world it is operating in.”
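What the agent is allowed to see can be sketched as a filter over the true state. This is an illustration only: the `observe` function is hypothetical, and reporting a `hidden_below` count is an assumption made here for readability; the article does not say whether the benchmark reveals how many blocks are hidden.

```python
def observe(stacks, visible_depth=2):
    """Return the agent's view: the top `visible_depth` blocks per stack plus a hidden count."""
    view = []
    for stack in stacks:
        hidden = max(len(stack) - visible_depth, 0)
        view.append({"hidden_below": hidden, "visible": list(stack[hidden:])})
    return view


true_state = [["A", "B", "C", "D"], ["E"], []]
for position in observe(true_state):
    print(position)
```

In the first stack, blocks A and B never appear in the view. Any plan that names them is built on inference or on earlier observations, and that is where a stale or imagined world model can creep in.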

Category 3, impossible scenarios, is also interesting. The agent correctly identifies all impossible cases in this evaluation, with relatively low average token use compared with the constrained and partially observable categories. This does not mean the agent has a perfect impossibility detector in general. The paper evaluates 10 scenarios in that category, each executed once. But it does show that the benchmark can include tasks where the right answer is not a plan. For industrial use, that is important. Sometimes the correct operational response is “this cannot be done under current constraints,” not “here is a confident 47-step disaster.”

What the paper directly shows

The paper directly demonstrates three things.

First, it shows that an executable Blocksworld environment can be exposed through MCP so that LLM agents can interact with it using a standardized tool interface. This is an implementation and benchmark-design contribution. It supports comparison across agent architectures because the simulation and tool server can remain fixed while the agent changes.

Second, it shows that the benchmark can measure more than final success. The reported metrics include success rate, execution time, planning attempts, input tokens, output tokens, and dollar cost. That turns agent evaluation from a pass/fail toy into a small performance profile. For business evaluation, that matters because a 70% success rate at 290 seconds and a 70% success rate at 732 seconds are not the same product.

Third, it provides a baseline for one o3-based ReAct agent. The baseline performs reasonably on basic and impossible categories but degrades under non-constructive actions, additional constraints, and partial observability. The strongest practical finding is not one number. It is the multi-metric pattern: more constraints and less visibility create lower reliability, higher latency, more retries, higher token consumption, and new tool-use errors.

That is enough to justify the benchmark as a diagnostic environment. It is not enough to rank all agent architectures, crown a model winner, or claim readiness for factory deployment. Thankfully, the paper does not try to sell that fantasy. The internet has enough unpaid interns doing that already.

What Cognaptus infers for business use

The business implication is not “use Blocksworld before buying robots.” That would be adorable and insufficient.

The better inference is that agent pilots need executable evaluation environments before they touch live operations. If a company wants to deploy LLM agents for scheduling, warehouse orchestration, robotic task planning, maintenance workflows, or production-line assistance, the useful question is not whether the model can produce a plausible plan in a chat window. The useful question is whether it can survive a controlled operational loop.

A practical evaluation pipeline would look like this:

1. Define a simplified but executable simulator of the operational domain.
2. Expose core actions through a standardized tool interface.
3. Include information tools, verification tools, execution tools, and post-action state queries.
4. Create scenario categories that isolate real complexity dimensions: constraints, hidden state, impossible goals, non-constructive actions, long-horizon dependencies.
5. Measure success, time, retries, token use, cost, invalid tool calls, premature termination, and false impossibility claims.
6. Compare agent architectures under identical scenarios before introducing real assets.
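The measurement step is the one most often skipped, so here is a minimal sketch of what recording it could look like. The `RunRecord` fields and the `summarize` helper are hypothetical, and the sample records are synthetic values invented for the example, not benchmark data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One agent run on one scenario (field names are illustrative)."""
    category: str
    success: bool
    seconds: float
    planning_attempts: int
    total_tokens: int
    cost_usd: float
    invalid_tool_calls: int = 0
    false_impossible: bool = False  # agent declared a solvable task unsolvable


def summarize(records):
    """Group runs by category and report averages plus failure-class counts."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.category].append(record)

    def avg(runs, attr):
        return sum(getattr(r, attr) for r in runs) / len(runs)

    return {
        category: {
            "runs": len(runs),
            "success_rate": avg(runs, "success"),
            "avg_seconds": avg(runs, "seconds"),
            "avg_attempts": avg(runs, "planning_attempts"),
            "avg_tokens": avg(runs, "total_tokens"),
            "avg_cost_usd": avg(runs, "cost_usd"),
            "invalid_tool_calls": sum(r.invalid_tool_calls for r in runs),
            "false_impossible": sum(r.false_impossible for r in runs),
        }
        for category, runs in sorted(grouped.items())
    }


# Synthetic example values, not results from the paper.
records = [
    RunRecord("C1", True, 60.0, 1, 30_000, 0.07),
    RunRecord("C1", False, 90.0, 2, 40_000, 0.09),
    RunRecord("C5", True, 700.0, 3, 190_000, 0.60, invalid_tool_calls=1),
]
summary = summarize(records)
print(summary["C1"])
```

Counting invalid tool calls and false impossibility claims separately from plain failures is deliberate. They are different failure classes and usually call for different fixes.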

This is not glamorous. It is the agentic version of test benches and dry runs. Which is exactly why it is useful.

For ROI analysis, the benchmark’s cost and token metrics are especially relevant. A pilot that only reports success rate hides operational economics. If harder tasks require roughly three times the planning attempts and more than five times the token budget, as the partial-observability category does relative to the basic one, the system may still be useful, but only in workflows where latency is tolerable and automation value exceeds reasoning cost. In a high-frequency control loop, that is fatal. In a low-frequency planning assistant, it may be acceptable.
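One way to fold success rate and spend into a single number is cost per successful run: average cost divided by success rate. The sketch below applies that to two rows of the results table. It assumes the reported average cost covers all runs in a category, failures included, and that a failed run delivers no value; the helper name `cost_per_success` is hypothetical.

```python
def cost_per_success(avg_cost_usd, success_rate):
    """Average spend needed to obtain one successful run, counting failed runs as overhead."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return avg_cost_usd / success_rate


# Inputs taken from the results table: C1 Basic and C5 Partial observability.
basic = cost_per_success(0.08, 0.80)
partial = cost_per_success(0.63, 0.60)
print(f"C1: ${basic:.2f} per success")
print(f"C5: ${partial:.2f} per success")
print(f"ratio: {partial / basic:.1f}x")
```

Under those assumptions the gap between categories is wider than either the cost column or the success column suggests on its own, because the two effects compound.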

The paper also nudges agent design away from the “one smart model with tools” approach. The authors explicitly identify future comparisons with single-agent, hierarchical multi-agent, and hybrid LLM-symbolic approaches. That direction makes sense. When constraints are formal and actions are discrete, symbolic planners are not obsolete relics. They are useful machinery. The LLM may be best used to translate vague goals, interpret feedback, negotiate ambiguity, or coordinate high-level reasoning, while formal components handle search, validation, and guarantees.

In plain English: do not make the poet operate the crane alone.

A small benchmark with large design consequences

The benchmark’s scenario set is modest: 50 predefined scenarios, 10 per category. Scenarios range from 3 to 20 blocks, known solution lengths from about 4 to 80 steps, up to 60 non-constructive actions, up to 10 misplaced blocks, and 3 to 6 stack positions. Each scenario is executed once.

That design is enough to demonstrate applicability and reveal failure modes. It is not enough for statistical certainty about model performance. A serious enterprise evaluation should repeat runs, vary prompts, include multiple model snapshots, compare architectures, and test distributions beyond the original scenarios. Agent performance can be sensitive to phrasing, tool descriptions, retry logic, and orchestration design. One run per scenario is a baseline, not a verdict.
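How much uncertainty does one run per scenario leave? A rough answer comes from a Wilson score interval on ten pass/fail outcomes. This is a sketch under a strong assumption, namely that the ten scenarios in a category behave like independent draws with a common success probability, which they are not designed to be; the function name is illustrative.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# 6 of 10 corresponds to the 60% category, 8 of 10 to the 80% category.
for successes in (6, 8):
    low, high = wilson_interval(successes, 10)
    print(f"{successes}/10 -> {low:.2f} to {high:.2f}")
```

Six successes out of ten gives an interval of roughly 31% to 83%, which overlaps heavily with the interval for eight out of ten. The category ordering is suggestive, but ten single runs cannot establish it.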

The paper’s own future-work section points in the right direction: compare different architectures, including hierarchical multi-agent and hybrid approaches; add dynamic events such as runtime errors, changing goals, and physical disturbances; include richer constraints such as weight, material properties, and stacking heights; and extend toward multi-robot coordination.

Those extensions matter because real operations are not static Blocksworld. Machines fail. Goals change. Sensors lie by omission. Two robots want the same aisle. A part exists in the ERP system but not on the shelf, because reality has a dark sense of humor.

But the current benchmark does not need to solve all of that to be useful. Its value is in showing how to move from text-only planning evaluation toward executable agent evaluation with measurable control-loop behavior.

The misconception to retire: plausible plans are not operational competence

The likely reader misconception is simple: once a reasoning-capable LLM can use tools and generate a plausible plan, it is close to operational autonomy.

The paper’s baseline says: not so fast.

A plausible plan is only one artifact inside a larger control process. Operational competence requires at least five abilities: understanding current rules, maintaining state, generating valid plans, repairing invalid plans, and executing actions without drifting from the real environment. When partial observability appears, the agent also needs disciplined information gathering and uncertainty handling. When constraints appear, it needs formal compliance. When goals are impossible, it needs refusal rather than improvisation.

This is where demos mislead. Demos often show a successful run. Benchmarks show the shape of failure across categories. The difference is not academic pedantry. It is the difference between “our agent worked once” and “we know where the agent breaks, how expensively it breaks, and which architecture might reduce the breakage.”

For executives, the practical lesson is equally blunt: do not evaluate agent systems only by final task completion. Evaluate degradation curves. Ask what happens when the task requires temporary backtracking, when the state is partially hidden, when constraints invalidate natural plans, and when the correct answer is that no valid plan exists.

If the evaluation does not include those cases, it is not an evaluation. It is a dress rehearsal for embarrassment.

Where this applies, and where it does not

The paper supports benchmark design and risk framing for LLM agents in planning-and-control contexts. It does not prove that LLM agents are ready for industrial automation, nor does it prove that they are unsuitable. It shows that a strong single-agent baseline can be systematically evaluated in an executable environment and that performance degrades in interpretable ways as complexity increases.

The evidence is symbolic, not physical. Blocks do not slip. Sensors do not drift. Actuators do not wear out. Human supervisors do not interrupt with contradictory instructions. The benchmark’s partial observability is a controlled abstraction, not a full perception stack. Its additional constraints are simple compared with industrial safety and engineering constraints.

Also, the baseline is only one agent architecture. The agent is prompt-guided rather than governed by hard architectural flow control. A more structured planner, a hybrid symbolic system, or a multi-agent architecture might behave differently. That is precisely why the benchmark is useful: it gives future systems something consistent to run against.

So the boundary is clear. Do not read the paper as a production-readiness certificate. Read it as a measurement template.

The useful future is boring in the best way

The most promising direction suggested by this work is not an all-knowing autonomous agent improvising its way through the factory. It is a boring, layered, measurable system:

* LLMs interpret goals and handle ambiguity.
* Formal planners search constrained action spaces.
* Simulators verify before execution.
* Tool servers expose actions through standard interfaces.
* Runtime monitors catch invalid calls and state drift.
* Evaluation suites measure not only success, but cost, retries, latency, and failure class.

That is not as cinematic as “AI agent runs the factory.” It is also much closer to something one could responsibly deploy.

Blocksworld still breaks your fancy LLM agent because planning is not the same as speaking, tool use is not the same as control, and verification is not the same as reliability. The paper’s contribution is to make those distinctions executable.

The blocks are small. The lesson is not.

**Cognaptus: Automate the Present, Incubate the Future.**

¹ Niklas Jobs, Luis Miguel Vieira da Silva, Jayanth Somashekaraiah, Maximilian Weigand, David Kube, and Felix Gehlhoff, “Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol,” arXiv:2512.03955, 2025. https://arxiv.org/abs/2512.03955
