Opening — Why this matters now
Industrial AI is undergoing a personality crisis. On one hand, we have factories that desperately want adaptable decision-making. On the other, we have Large Language Models—brilliant at essays, somewhat less convincing at not toppling virtual block towers. As vendors race to bolt LLMs into automation stacks, a familiar problem resurfaces: everyone claims to have an “agent,” yet no one can compare them meaningfully.
Enter Blocksworld—a symbolic sandbox old enough to predate some programming languages—reborn as a rigorous, MCP-enabled benchmark. The irony is delightful: to evaluate cutting-edge agents, we return to a children’s puzzle about neatly stacking blocks.
Background — Context and prior art
LLM-based agents are supposed to plan, correct themselves, and execute tasks while interacting with structured environments. The problem is that existing benchmarks rarely demand all three at once. Most test either pure planning (static datasets) or pure tool-use (MCP servers without rich dynamics).
Classic Blocksworld benchmarks—PlanBench, Sys2Bench—evaluate planning ability but limit adaptability. Meanwhile, MCP-based testbeds evaluate tool invocation but typically avoid deeper reasoning about state, constraints, and consequences.
The paper “Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol” bridges these worlds by:
- Providing a live executable simulation rather than static samples.
- Adding five structured complexity categories.
- Integrating the Model Context Protocol (MCP) so agents can discover and invoke tools dynamically.
- Introducing constraints, partial observability, and non-constructive actions that push LLMs well beyond pattern-matching.
It’s a testbed where your agent must think, act, fail, rethink, and act again—much like an engineer on a night shift.
Analysis — What the paper actually does
The benchmark introduces:
1. A modular architecture
The environment exposes all capabilities through a REST API, which is then wrapped as MCP tools in three groups:
- Information tools (rules, state)
- Verification tools (check action sequences without execution)
- Execution tools (pick up, stack, unstack, put down)
This allows different agent architectures—ReAct, multi-agent orchestrators, symbolic hybrids—to plug in cleanly without custom glue code.
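To make the shape of that surface concrete, here is a minimal sketch of a client wrapping the three tool groups over a REST backend. The base URL, endpoint paths, and payload shapes are assumptions for illustration, not the benchmark's documented API.

```python
# Illustrative only: the base URL, endpoint paths, and payload shapes are
# assumptions for this sketch, not the benchmark's documented interface.
import requests

class BlocksworldClient:
    """Thin wrapper mirroring the three MCP tool groups over a REST backend."""

    def __init__(self, base_url: str = "http://localhost:8000"):
        self.base_url = base_url

    # --- Information tools ---
    def get_rules(self) -> dict:
        return requests.get(f"{self.base_url}/rules").json()

    def get_state(self) -> dict:
        return requests.get(f"{self.base_url}/state").json()

    # --- Verification tools ---
    def verify_plan(self, actions: list[dict]) -> dict:
        # Check a full action sequence against the rules without executing it.
        return requests.post(f"{self.base_url}/verify", json={"actions": actions}).json()

    # --- Execution tools ---
    def execute(self, action: dict) -> dict:
        # e.g. {"op": "unstack", "block": "C", "from": "A"}
        return requests.post(f"{self.base_url}/execute", json=action).json()
```

Because the same surface is exposed as MCP tools, a ReAct agent, a multi-agent orchestrator, or a symbolic planner can all consume it through the same schema.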
2. Five scenario categories
Complexity increases along four dimensions: number of blocks, non-constructive actions, domain constraints, and observability limits.
A summary:
| Category | Description | Main Difficulty |
|---|---|---|
| 1 | Basic tasks | Straight-line constructive planning |
| 2 | Includes non-constructive actions | Must temporarily worsen state |
| 3 | Impossible scenarios | Identify unsatisfiable goals |
| 4 | Extra constraints (e.g., block size rules) | Reduced action set; more dead ends |
| 5 | Partial observability | Must reason about hidden information |
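One way to picture how these dimensions could be encoded is the sketch below. The field names and the size-constraint rule are assumptions for illustration, not the benchmark's actual scenario schema.

```python
# A minimal, hypothetical encoding of the four difficulty dimensions.
# Field names and the size rule are assumptions, not the paper's schema.
from dataclasses import dataclass, field

@dataclass
class Scenario:
    blocks: dict[str, int]            # block name -> size (input to the Category 4 constraint)
    stacks: list[list[str]]           # bottom-to-top towers currently on the table
    goal: list[list[str]]             # desired towers
    hidden: set[str] = field(default_factory=set)  # blocks not directly observable (Category 5)

    def can_stack(self, top: str, bottom: str) -> bool:
        # Category 4 rule: a block may only rest on an equal or larger block.
        return self.blocks[top] <= self.blocks[bottom]

# Category 4 example: C (size 3) may not sit on A (size 1).
scn = Scenario(
    blocks={"A": 1, "B": 2, "C": 3},
    stacks=[["C", "B"], ["A"]],
    goal=[["C", "B", "A"]],
    hidden={"B"},
)
print(scn.can_stack("C", "A"))  # False -> this action is pruned from the search space
print(scn.can_stack("A", "B"))  # True
```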
3. A single-agent demonstration using OpenAI o3
The authors built a ReAct-style agent that:
- Queries rules
- Inspects current state
- Generates a plan
- Verifies the plan
- Repairs and re-verifies
- Executes actions via MCP
- Confirms goal satisfaction
Think of it as an LLM wrapped inside a flowchart taped together with hope.
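Stripped of the LLM itself, the control flow looks roughly like the sketch below. `BlocksworldClient` is the hypothetical wrapper from the earlier sketch and `propose_plan` stands in for the o3 call; the response field names are likewise assumptions, not the paper's code.

```python
# Control-flow sketch of the plan -> verify -> repair -> execute -> confirm loop.
# `propose_plan` is a placeholder for the LLM call; response fields are assumptions.

def propose_plan(rules: dict, state: dict, goal: dict, feedback: str | None = None) -> list[dict]:
    """Placeholder for an LLM call that returns a list of action dicts."""
    raise NotImplementedError

def run_agent(client, goal: dict, max_attempts: int = 3) -> bool:
    rules = client.get_rules()                          # 1. query the rules once
    feedback = None
    for _ in range(max_attempts):
        state = client.get_state()                      # 2. inspect the current state
        plan = propose_plan(rules, state, goal, feedback)  # 3. generate a plan
        report = client.verify_plan(plan)               # 4. verify without executing
        if not report.get("valid"):
            feedback = report.get("reason")             # 5. repair on the next attempt
            continue
        for action in plan:                             # 6. execute step by step via MCP
            client.execute(action)
        if client.get_state().get("goal_satisfied"):    # 7. confirm goal satisfaction
            return True
        feedback = "plan executed but goal not reached"
    return False
```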
The evaluation covered 50 scenarios—10 per category—with metrics including success rate, tokens consumed, execution time, and cost.
Findings — What actually happened
LLM agents absolutely thrive… until they don’t.
Performance Overview
| Metric | C1 | C2 | C3 | C4 | C5 |
|---|---|---|---|---|---|
| Success Rate | 80% | 70% | 100% | 70% | 60% |
| Avg Time (s) | 76 | 290 | 125 | 732 | 676 |
| Avg Attempts | 1.1 | 1.7 | 1.8 | 2.2 | 3.1 |
| Input Tokens | 33k | 99k | 10k | 109k | 151k |
| Output Tokens | 1.7k | 12k | 7.9k | 35k | 41k |
| Avg Cost | $0.08 | $0.30 | $0.08 | $0.50 | $0.63 |
Interpreting the numbers
- Category 1 is trivial: the agent is essentially rearranging digital Lego blocks.
- Category 2 exposes fragility: non-constructive actions (e.g., moving a correctly placed block to free a buried one) substantially confuse LLMs because they contradict local optimality heuristics (a short worked example follows this list).
- Category 3 (impossible) is oddly where the agent is most decisive—unsolvable problems are easier to recognize than solvable-but-messy ones.
- Categories 4 and 5 become brutal:
  - Block size restrictions prune many valid actions, requiring deeper combinatorial reasoning.
  - Partial observability introduces hidden-state inference—a skill LLMs tend to improvise with varying success.
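To see why non-constructive actions hurt, consider a toy instance invented here for illustration (not taken from the benchmark): B already sits on C, exactly as the goal requires, yet B must come off so C can reach the table.

```python
# Toy instance: start with A on table, C on A, B on C.
# Goal: C on table, B on C, A on B.
plan = [
    {"op": "unstack", "block": "B", "from": "C"},   # undo a correct placement (non-constructive)
    {"op": "put_down", "block": "B"},
    {"op": "unstack", "block": "C", "from": "A"},
    {"op": "put_down", "block": "C"},               # C on table: goal condition reached
    {"op": "pick_up", "block": "B"},
    {"op": "stack", "block": "B", "on": "C"},       # restore B on C
    {"op": "pick_up", "block": "A"},
    {"op": "stack", "block": "A", "on": "B"},       # finish: A on B
]
```

A greedy, locally optimal policy never chooses the first move, which is exactly where Category 2 agents stumble.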
Common failure modes
- Misidentifying solvable problems as unsolvable.
- Generating valid plans but executing them incorrectly.
- Forgetting block names or referencing nonexistent ones.
- Producing locally reasonable but globally incoherent sequences.
In other words, LLM agents are not planners—they are reasoning performers whose confidence sometimes exceeds their competence.
Implications — Why this matters for business and industry
1. LLM agents are not ready to be left unsupervised
If an agent struggles to stack blocks consistently, it should not yet orchestrate multi-robot manufacturing lines.
2. MCP-based tool discovery is promising
The benchmark shows that LLMs can learn environment rules dynamically rather than via brittle, hard-coded assumptions—critical for real, evolving factories.
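As a sense of what that looks like in practice, here is a minimal discovery sketch using the MCP Python SDK's stdio client. The server launch command and tool names are placeholders, not the benchmark's actual interface.

```python
# Minimal sketch of dynamic tool discovery with the MCP Python SDK.
# The server command and tool names below are assumptions for illustration.
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="python", args=["blocksworld_mcp_server.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()            # discover tools at runtime
            for tool in tools.tools:
                print(tool.name, "-", tool.description)   # e.g. get_rules, verify_plan, ...
            state = await session.call_tool("get_state", arguments={})
            print(state.content)

asyncio.run(main())
```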
3. Benchmarks must involve execution, not just planning
Static tasks reward memorization. Real automation rewards adaptability, validation, and recovery.
4. Hybrid architectures will win
The results quietly endorse a design principle: combine LLM flexibility with symbolic or algorithmic stability. LLMs alone remain unreliable for long-horizon, constraint-heavy planning.
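The symbolic half of such a hybrid can be as plain as a deterministic precondition checker that vets an LLM-proposed plan before anything touches the environment. The sketch below assumes the state and action formats from the earlier sketches; it is not the paper's implementation.

```python
# Deterministic precondition check over an LLM-proposed plan (illustrative sketch;
# state and action formats follow the earlier sketches, not the paper's code).

def validate_plan(stacks: list[list[str]], plan: list[dict]) -> str | None:
    """Return None if the plan is executable from `stacks`, else a reason string."""
    towers = [list(t) for t in stacks]            # bottom-to-top copies of each tower
    known = {b for t in towers for b in t}
    holding: str | None = None

    def clear(block: str) -> bool:
        return any(t and t[-1] == block for t in towers)

    for i, a in enumerate(plan):
        op, b = a.get("op"), a.get("block")
        if b not in known:
            return f"step {i}: unknown block {b!r}"        # hallucinated-name failure mode
        if op in ("pick_up", "unstack"):
            if holding is not None:
                return f"step {i}: hand is not empty"
            if op == "pick_up" and not any(t == [b] for t in towers):
                return f"step {i}: {b} is not alone on the table"
            if not clear(b):
                return f"step {i}: {b} is not clear"
            for t in towers:
                if t and t[-1] == b:
                    t.pop()
                    break
            holding = b
        elif op == "stack":
            dest = a.get("on")
            if holding != b:
                return f"step {i}: not holding {b}"
            if dest not in known or not clear(dest):
                return f"step {i}: cannot stack on {dest!r}"
            next(t for t in towers if t and t[-1] == dest).append(b)
            holding = None
        elif op == "put_down":
            if holding != b:
                return f"step {i}: not holding {b}"
            towers.append([b])
            holding = None
        else:
            return f"step {i}: unknown operation {op!r}"
    return None

# Catches a hallucinated block name before anything executes:
print(validate_plan([["A", "C"], ["B"]], [{"op": "pick_up", "block": "Z"}]))
# -> step 0: unknown block 'Z'
```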
5. Cost matters
A single scenario in the hardest category averages $0.63, and even mid-tier categories run $0.30–$0.50. Multiply that across hundreds of tasks in a real system (over $300 for 500 runs at the high end, before retries) and you get a compelling reason for model distillation, caching, and hybrid reasoning.
Conclusion
Blocksworld’s revival is not nostalgia—it’s necessity. By forcing LLM agents to plan and execute inside a controlled symbolic domain, this benchmark exposes the exact fault lines that industrial adopters must address.
The future lies in agents that:
- Sense uncertainty
- Validate their own reasoning
- Adapt plans dynamically
- Use LLMs where helpful, not where fragile
Until then, even the humble block tower remains a worthy adversary.
Cognaptus: Automate the Present, Incubate the Future.