Planning

Lost in the Links: When World Knowledge Isn’t Enough

Links look harmless. One click from one Wikipedia page to another. Then another. Then another. No robotics. No messy browser UI. No customer database. No procurement workflow with three inconsistent Excel files and one person named Mike who “usually knows where that form is.” Just hyperlinks. That is why LLM-WikiRace is useful. It strips agentic AI down to a small, irritating question: when a model knows a lot about the world, can it use that knowledge step by step without getting lost? ...

Metric Time Without the Clock: Making ASP Scale Again

Calendars are harmless until a computer has to reason about them. A human can say, “Ram has a dentist appointment in one hour, must pick up his insurance card from home, needs cash from the ATM, and travel takes 15, 20, 30, or 40 minutes depending on the route.” We see a small planning problem. A logic system sees actions, states, deadlines, durations, inertia, and a very annoying question: should every possible minute become a Boolean object? ...

SokoBench: When Reasoning Models Lose the Plot

A corridor is not supposed to be hard. There is one player. One box. One goal. No maze. No clever trap. No branching strategy tree with a thousand tempting wrong turns. The player stands at one end, the goal sits at the other, and the box is between them. Push the box along the corridor until it reaches the goal. That is the task. ...

Houston, We Have a Benchmark: When Agentic AI Meets Orbital Reality

Space is not impressed by fluent reasoning. A satellite does not care that an AI agent has produced a confident plan. A ground station cannot magically see through the Earth because the prompt says “ensure connectivity.” A sensor cannot keep collecting images after its onboard storage is full. Orbital mechanics, power budgets, slew angles, data buffers, and line-of-sight geometry are not stakeholder preferences. They are constraints. Reality, annoyingly, still has root access. ...

Deployed, Retrained, Repeated: When LLMs Learn From Being Used

Acceptance is a reward, even when nobody writes reward = 1. Imagine an enterprise deploys an AI agent to generate code, reconcile invoices, or prepare operational plans. Some outputs pass automated checks and enter production. Others fail, disappear into logs, and are never seen again. Months later, the accepted outputs are collected and used to fine-tune the next model. ...

Model First, Think Later: Why LLMs Fail Before They Reason

The schedule looked reasonable. That was the problem. Imagine asking an AI agent to build a weekly medical schedule. It produces a neat plan. The steps are numbered. The tone is confident. The explanation is calm enough to sedate a committee. Then someone checks the details. A medication interval is violated. A resource is assigned twice. A prerequisite appears after the action that depends on it. Nothing looks absurd sentence by sentence, but the plan is broken as a system. ...

Stacking the Odds: Why Blocksworld Still Breaks Your Fancy LLM Agent

A robot arm, a few colored blocks, and a table. That is the setup. No messy warehouse, no sensor dust, no tired operator, no forklift reversing into the wrong aisle. Just blocks. And still, the fancy LLM agent stumbles. That is the useful discomfort in Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol. The paper does not show a robot revolution. It shows something more valuable for anyone trying to deploy LLM agents in industrial workflows: even in a symbolic world where the rules are explicit, the actions are discrete, the state can be queried, and the tool interface is standardized, reliability degrades as soon as the task stops being politely simple. ...

Tile by Tile: Why LLMs Still Can't Plan Their Way Out of a 3×3 Box

A board game should not embarrass a frontier model. That is the uncomfortable charm of the 8-puzzle. It has no hidden information, no vague user intent, no messy database schema, no ambiguous policy exception, and no client saying “just make it pop.” It is a 3×3 grid with eight tiles and one blank space. Slide adjacent tiles into the blank. Reach the goal state. Done. ...

Thresholds, Trade-offs, and the Art of Not Overthinking Your Robot

A robot pauses in front of a table. There is a block, a can, a box, and something that is either on top of something else or merely enjoying a close and misleading friendship. A camera sends pixels. A perception model sends predictions. A planner wants a symbolic fact: On(A, B) or not. The expensive mistake is pretending that this last step is clean. ...

Replan, Rethink, Repeat: Why Vision-Language Models Make Better Closed‑Loop Planners

Robots are very good at making small mistakes expensive. A misplaced cup is not just a misplaced cup. It can block the next object. A wrong order can violate a task constraint. A slightly bad coordinate can turn an elegant plan into a collision check failure. In software, you can often patch around the mistake and pretend this was always the architecture. In robotics, physics has a less forgiving product-management style. ...