Robots are very good at making small mistakes expensive.
A misplaced cup is not just a misplaced cup. It can block the next object. A wrong order can violate a task constraint. A slightly bad coordinate can turn an elegant plan into a collision check failure. In software, you can often patch around the mistake and pretend this was always the architecture. In robotics, physics has a less forgiving product-management style.
That is why high-level robot planning with language and vision-language models has always had an awkward gap between promise and operational use. A model may produce a plausible task plan: pick this, place that, move the fruit before the can, avoid blocking the basket. But plausibility is not execution. The real question is not whether a VLM can generate a plan once. It is whether the system can notice when the plan has become wrong and recover before the error becomes a tiny warehouse drama.
The paper behind this article, “Using Language Models as Closed-Loop High-Level Planners for Robotics Applications: A Brief Overview and Benchmarks,” studies exactly that problem [1]. Its useful contribution is not another ritual announcement that language models are “promising” for robotics. We have enough promising things. The paper asks a more operational question: if a VLM is used as a high-level planner, should it plan once, replan often, or replan with memory of what it already tried?
The answer comes from three direct comparisons, and conveniently so. Open-loop loses to closed-loop. Shorter control horizons are not automatically better. Warm-starting is usually the difference between controlled replanning and repeatedly asking a forgetful intern to start over from scratch.
Subtle? Yes. Useful? Also yes. Annoying for anyone hoping that “just query the model more often” was a robotics strategy? Very.
The paper treats VLM planning as MPC, not magic
The authors frame closed-loop high-level planning through the lens of model predictive control. In traditional MPC, a controller repeatedly solves an optimisation problem, executes part of the plan, observes the updated state, then solves again. The paper maps this pattern onto language-model-based robot planning: the prompt provides the task objective and constraints, the model’s internal world knowledge acts as a rough dynamics prior, and the robot executes one or more proposed actions before the model replans.
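The plan, execute, observe, replan pattern described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `plan_fn`, `execute_fn`, `is_done_fn`, and the `max_calls` budget are hypothetical names, and a toy function stands in for the VLM query.

```python
from typing import Callable, FrozenSet, List, Tuple

Action = Tuple[str, ...]   # e.g. ("pick", "lemon") or ("place", "lemon", "basket")
State = FrozenSet[str]     # toy state: names of objects already placed


def closed_loop_plan(
    plan_fn: Callable[[State], List[Action]],
    execute_fn: Callable[[State, Action], Tuple[State, bool]],
    is_done_fn: Callable[[State], bool],
    state: State,
    horizon: int,
    max_calls: int = 10,
) -> Tuple[State, int]:
    """Receding-horizon loop: plan, execute up to `horizon` actions, observe, replan."""
    calls = 0
    while not is_done_fn(state) and calls < max_calls:
        plan = plan_fn(state)          # in a real system this would be the VLM call
        calls += 1
        for action in plan[:horizon]:
            state, ok = execute_fn(state, action)
            if not ok:                 # an execution failure triggers an early replan
                break
    return state, calls


# Toy stand-ins, assumptions for illustration only.
OBJECTS = ("lemon", "apple", "can")


def toy_plan(state: State) -> List[Action]:
    plan: List[Action] = []
    for obj in OBJECTS:
        if obj not in state:
            plan += [("pick", obj), ("place", obj, "basket")]
    return plan


def toy_execute(state: State, action: Action) -> Tuple[State, bool]:
    if action[0] == "place":
        return state | {action[1]}, True
    return state, True


def toy_done(state: State) -> bool:
    return state == frozenset(OBJECTS)


final_state, n_calls = closed_loop_plan(
    toy_plan, toy_execute, toy_done, frozenset(), horizon=2
)
```

In this toy run every action succeeds, so `horizon` alone decides how many times the planner is queried. A real deployment would replace `toy_plan` with a prompt-and-parse step and `toy_execute` with the motion planner and collision checker.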
That framing matters because it turns an airy “LLM agent” story into a set of design variables. The paper focuses on three:
| Design question | Experimental comparison | Why it matters operationally |
|---|---|---|
| Should the planner be open-loop or closed-loop? | Generate one full plan versus allow replanning after execution feedback | Tests whether feedback helps even when the environment itself is static |
| How often should the planner replan? | Short, half, and full control horizons | Tests whether more frequent model calls actually improve reliability |
| Should replanning include the previous plan and execution status? | Warm-started versus non-warm-started closed-loop planners | Tests whether continuity helps or merely anchors the model to earlier mistakes |
This is the right structure for the article because the paper itself is not one big claim. It is a sequence of contrasts. Each contrast removes one lazy assumption from the robotics-with-VLMs discussion.
The experimental setting is deliberately controlled. The planners operate in four tabletop manipulation environments: Cube-Easy, YCB-Easy, YCB-Medium, and YCB-Hard. The tasks involve rearranging objects using two primitives: pick(object) and place(object, location). The harder settings require not only putting objects in the right physical region, but also satisfying logical ordering constraints. In YCB-Hard, for example, the robot is making a fruit salad while reasoning about descriptors such as sour fruit and about which ingredients should be used or put away.
The authors evaluate three VLMs: GPT-4.1-mini, Gemini-2.5-flash, and Llama-4-Maverick-17B. Each planner is tested over 50 randomised initial conditions per environment. The metrics separate geometry from task success: Goal Achieved Rate measures whether all objects are placed successfully regardless of logical constraints, while Task Completion Rate requires both successful placement and satisfaction of logical constraints. This distinction becomes important because the paper repeatedly shows that fixing physical execution errors is easier than fixing bad task logic.
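The gap between the two metrics is easier to see in code. The sketch below assumes a hypothetical per-trial record with two booleans; the field names and the four example outcomes are invented for illustration and are not data from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Trial:
    all_objects_placed: bool   # geometric outcome
    logic_satisfied: bool      # ordering and selection constraints respected


def rates(trials: Iterable[Trial]) -> Tuple[float, float]:
    """Return (goal_achieved_rate, task_completion_rate) as fractions in [0, 1]."""
    trials = list(trials)
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    gar = sum(t.all_objects_placed for t in trials) / n
    tcr = sum(t.all_objects_placed and t.logic_satisfied for t in trials) / n
    return gar, tcr


# Made-up outcomes for four trials, purely to show the two metrics diverging.
example = [
    Trial(True, True),
    Trial(True, False),   # everything placed, but a logical constraint was broken
    Trial(False, True),
    Trial(True, True),
]
gar, tcr = rates(example)
```

By construction Task Completion Rate can never exceed Goal Achieved Rate, so the difference between them is a rough reading of how many trials were lost to logic and not to geometry.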
Closed-loop helps even when the world is not changing
The first comparison is the most intuitive: open-loop versus closed-loop.
An open-loop planner generates a full plan once and then lives with it. A closed-loop planner can regenerate a plan after some actions have executed or after an execution failure. In many robotics contexts, closed-loop planning is obviously useful because the environment changes. People move objects. Sensors misread scenes. Low-level execution goes sideways. Reality, being rude, refuses to remain a benchmark.
The interesting part is that this paper uses static environments. Objects remain stationary unless the robot directly manipulates them. No new objects appear. This setup intentionally removes the easiest argument for closed-loop planning. If closed-loop still helps here, the reason is not environmental chaos. It is model fallibility.
That is exactly what the paper finds. Across the 12 scenarios formed by four environments and three VLMs, the open-loop planner never reaches a 100% Goal Achieved Rate. The closed-loop-full planner improves Goal Achieved Rate in all 12 scenarios, with an average improvement of 21.7%; seven of the 12 improvements are statistically significant.
This is the paper’s first business-relevant lesson: feedback is not only for unpredictable environments. Feedback is also a remedy for the planner’s own bad first draft.
That distinction matters. A robotics team might assume that closed-loop orchestration becomes essential only in messy environments: warehouses, kitchens, hospitals, construction sites, or anywhere humans insist on existing. But the paper shows a narrower and more fundamental point. Even when the scene is static, the VLM may choose a poor placement coordinate, generate an invalid sequence, or create a plan that looks semantically fine but fails under collision checks. Closed-loop planning gives the system another opportunity to recover.
But the improvement is uneven. Closed-loop-full improves geometric success more clearly than logical success. The closed-loop planner has higher Correct Final Logical Plan Rate in only one of 12 scenarios, while open-loop is better in three, and none of those logical comparisons are statistically significant. In several cases, the closed-loop planner improves Goal Achieved Rate while Task Completion Rate remains similar.
That is not a contradiction. It is the point.
The closed-loop planner can often recover from a failed placement: move the object somewhere else, try a corrected coordinate, avoid a collision. But if the plan violates an ordering constraint, one replanning opportunity may not be enough. Once a logically wrong step has already been executed, the system may not have an easy undo path. The robot can move objects. It cannot casually unmake the fact that it placed the wrong object first, unless the domain has explicit recovery actions. This is where physical and logical error recovery part ways.
The practical reading is simple: closed-loop planning is a reliability layer, not a morality upgrade for the model. It helps the system correct some errors. It does not automatically make the model understand the task better.
More replanning is not the same as more reliability
The obvious next assumption is that if closed-loop helps, more closed-loop must help more. Replan after every action. Ask the model constantly. Keep the system maximally reactive. Surely that is safer.
The paper politely ruins this instinct.
The authors compare three control-horizon settings for closed-loop planning. In the short-horizon condition, the planner replans every two steps. In the half-horizon condition, it replans every $k/2$ steps, where $k$ is the minimum number of primitive actions required for the task. In the full-horizon condition, it replans every $k$ steps, so in practice it replans only after a full task-length segment has executed or after an execution failure.
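The three settings reduce to a small lookup on $k$. In this sketch the function names are hypothetical, and two details are our assumptions, not something the article specifies: rounding $k/2$ up when $k$ is odd, and capping the short horizon at $k$ for very short tasks.

```python
import math


def replan_interval(setting: str, k: int) -> int:
    """Number of primitive actions executed between scheduled replans."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if setting == "short":
        return min(2, k)                  # assumption: cap at k for tiny tasks
    if setting == "half":
        return max(1, math.ceil(k / 2))   # assumption: round up for odd k
    if setting == "full":
        return k
    raise ValueError(f"unknown setting: {setting!r}")


def scheduled_calls(setting: str, k: int) -> int:
    """Planner calls in a failure-free run of exactly k actions."""
    return math.ceil(k / replan_interval(setting, k))


# For a task with k = 8 primitive actions:
intervals = {s: replan_interval(s, 8) for s in ("short", "half", "full")}
calls = {s: scheduled_calls(s, 8) for s in ("short", "half", "full")}
```

For $k = 8$ this gives four, two, and one scheduled planner calls respectively, before counting any extra replans triggered by execution failures.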
This is best read as a sensitivity test on replanning frequency. It asks whether control horizon is a dominant performance driver once the system is already closed-loop.
The result: shorter is often numerically better, but not reliably enough to treat as a rule. CL-Short achieves the best Task Completion Rate in 10 of 12 scenarios, but only two of those results are statistically significant. For Goal Achieved Rate, CL-Short is best in six scenarios, again with only two statistically significant results. For Correct Final Logical Plan Rate, CL-Short is best in nine of 12 scenarios, but only one is statistically significant. Positive and negative logical correction metrics show no consistent winner.
The paper’s explanation is more useful than the leaderboard. A shorter control horizon gives the model more chances to correct the plan. It also gives the model more chances to introduce a new error. This is the central misconception the article needs to kill: replanning frequency is not a pure good. Every model call is both an opportunity and a liability.
That matters for business deployment because VLM calls are not free, and not merely in the token-cost sense. More calls mean more latency, more parsing exposure, more plan variation, more monitoring burden, and more surface area for brittle behaviour. If a team designs a robotic system around “replan whenever possible,” it may buy reactivity while accidentally selling off consistency.
The better rule is task-dependent reactivity. A dynamic environment may need short horizons because the world is moving faster than the plan. A static or semi-structured workflow may not. The paper’s static setting is important here: because dynamic environmental changes are excluded, the control-horizon result isolates model behaviour rather than environmental volatility. In a moving warehouse or shared kitchen, shorter horizons may become more attractive. But that is an argument from deployment context, not from the paper’s evidence alone.
The responsible inference is therefore not “use long horizons.” It is “do not confuse maximal replanning with optimal replanning.” Robotics systems need a control policy for model calls, not a nervous habit.
Warm-starting turns replanning into continuity instead of amnesia
The strongest operational result in the paper concerns warm-starting.
In this setup, warm-starting means that when the closed-loop planner replans, it receives the previously generated plan and the execution statuses of primitive actions from that plan. The non-warm-started version receives the current state and task description but does not get that continuity information. It starts again.
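In prompt terms, the difference between the two variants is one extra block of context. The sketch below is a hypothetical prompt builder: the wording, field labels, and status strings are invented for illustration and are not the prompts listed in the paper.

```python
from typing import Optional, Sequence, Tuple


def build_replan_prompt(
    task: str,
    state: str,
    previous: Optional[Sequence[Tuple[str, str]]] = None,
) -> str:
    """Build a replanning prompt.

    `previous` holds (action, status) pairs from the last plan.
    Passing None yields the non-warm-started prompt.
    """
    parts = [f"Task: {task}", f"Current state: {state}"]
    if previous:
        parts.append("Previous plan and execution status:")
        parts += [
            f"  {i}. {action} -> {status}"
            for i, (action, status) in enumerate(previous, start=1)
        ]
    parts.append("Return the remaining plan as a list of pick/place actions.")
    return "\n".join(parts)


task = "Put the lemon in the bowl before the can."
state = "lemon at (0.2, 0.1); can at (0.4, 0.3)"

warm = build_replan_prompt(
    task,
    state,
    previous=[("pick(lemon)", "succeeded"), ("place(lemon, bowl)", "failed")],
)
cold = build_replan_prompt(task, state)
```

The `cold` prompt describes the same scene but gives the model no way to know that a placement was already attempted and failed.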
This is an ablation, and it is the most business-relevant one. It tests whether the value of closed-loop planning comes merely from re-querying the VLM with a fresh state, or from preserving plan history and execution feedback.
Warm-starting wins clearly. Across 24 scenarios (four environments, three VLMs, and two closed-loop horizon settings), the warm-started variants achieve better Task Completion Rate in 21 scenarios. Thirteen of those 21 improvements are statistically significant, and the average improvement is 28.2%. For Goal Achieved Rate, warm-starting is better in 20 scenarios, with 14 statistically significant results and an average improvement of 31.7%.
That is not a small implementation detail. It is architecture.
Without warm-starting, every replanning step behaves more like an independent attempt. The model may forget why a previous action failed. It may propose an incompatible continuation. It may restart the task logic from the wrong implicit state. As the authors note, the chance of stringing together successful primitive actions can decrease geometrically when each replanning attempt is independent. The longer the task and the more frequent the replanning, the worse that problem becomes. In some scenarios, non-warm-starting collapses entirely.
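The geometric decay is easy to see under a simplifying assumption that is ours, not the paper's: each independent replanning attempt produces a successful segment with the same probability $p$, and the task needs $n$ such segments in a row.

$$
P(\text{all } n \text{ segments succeed}) = p^{n}, \qquad p = 0.9:\quad 0.9^{4} \approx 0.66, \quad 0.9^{8} \approx 0.43 .
$$

These numbers are illustrative, not measurements from the benchmark. They only show why doubling the number of independent attempts hurts more than intuition suggests.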
Warm-starting does something more mundane and more valuable: it gives the planner continuity. Not intelligence in the grand theatrical sense. Just continuity. A list of what was attempted. Which actions executed. Which failed. What the old plan looked like. This is the robotics equivalent of not making every shift worker rediscover the warehouse layout at 9 a.m.
The logical results are more nuanced. Warm-starting reduces negative logical corrections in 21 of 24 scenarios, with an average decrease of 7.0%; 14 of those results are statistically significant. That means warm-starting helps prevent the model from making the plan logically worse as it replans. Good. Adults in the room appreciate “less worse.”
But warm-starting also makes the planner less likely to make positive logical corrections. In 16 of 24 scenarios, non-warm-started planners have higher Positive Logical Correction Rate, with six statistically significant results. The mechanism is plausible: warm-starting biases the model toward the previous plan, including the previous plan’s mistakes. It stabilises the planner, but stability can become inertia.
That trade-off is the paper’s most useful design tension. Warm-starting is not simply “memory good.” It is memory with anchoring. It reduces drift, but it can preserve wrong assumptions. For enterprise robotics teams, this suggests that warm-starting should probably be paired with explicit critique, constraint checking, or verifier modules. The model should see the old plan, yes. It should also be forced to identify where that old plan may be invalid. Otherwise, the system risks becoming impressively consistent about being wrong.
A familiar corporate trait, unfortunately.
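One way to pair warm-starting with an external check is to verify ordering constraints outside the model before a proposed plan is executed. The sketch below is an assumption about what such a verifier could look like; the function name and the `(first, second)` constraint encoding are hypothetical, and the paper does not describe this module.

```python
from typing import List, Sequence, Tuple


def ordering_violations(
    placement_order: Sequence[str],
    must_precede: Sequence[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return every (first, second) constraint broken by the placement order.

    `placement_order` is the sequence of objects already placed followed by
    the objects the new plan proposes to place.
    """
    position = {obj: i for i, obj in enumerate(placement_order)}
    violations = []
    for first, second in must_precede:
        if second not in position:
            continue                      # constraint not yet relevant
        if first not in position or position[first] > position[second]:
            violations.append((first, second))
    return violations


constraints = [("lemon", "can")]          # the lemon must be placed before the can

bad = ordering_violations(["apple", "can", "lemon"], constraints)
good = ordering_violations(["lemon", "apple", "can"], constraints)
```

A non-empty result can be fed back into the next prompt as an explicit critique, so the planner sees the old plan and also where that plan is invalid.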
What each experiment supports, and what it does not
The paper’s evidence is strongest when read by test purpose rather than as one undifferentiated benchmark.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Open-loop vs closed-loop-full | Main evidence | Feedback improves geometric goal achievement even in static scenes | Closed-loop automatically fixes logical planning failures |
| Short vs half vs full control horizon | Sensitivity test on replanning frequency | More frequent replanning is not reliably superior | That long horizons are generally better, or that the pattern carries over to dynamic settings |
| Warm-start vs no warm-start | Ablation on continuity and execution feedback | Replanning works better when the model sees prior plans and execution status | Warm-starting always improves logical correction |
| Appendix tables across models and environments | Robustness/detail layer | Effects vary by VLM, environment, and metric | One model ranking that will stay valid as VLMs change |
| Prompt listings | Implementation detail | Necessary task, constraint, embodiment, and coordinate information were supplied | VLMs can infer spatial state robustly from images alone |
This separation matters because robotics papers often tempt readers into flattening all results into one headline. Here, the headline should not be “VLMs are good robot planners.” The evidence is more precise: VLM-based high-level planners become more useful when wrapped in a closed-loop structure, especially when replanning is warm-started with execution feedback.
That is an orchestration result, not a foundation-model victory lap.
The business value is planner governance, not robot autonomy theatre
For businesses considering VLMs in robotics, the paper points toward a practical architecture pattern:
- Use the VLM as a high-level planner, not as the whole robot brain.
- Keep low-level motion planning and collision checking separate.
- Replan from execution feedback rather than trusting the first plan.
- Preserve prior plan context and action status when replanning.
- Tune the control horizon to the task’s need for reactivity.
- Add verification around logical constraints, because replanning alone may not repair them.
The paper directly shows the first four points in its benchmark setting. The fifth follows from the control-horizon results: shorter horizons are not reliably better in static scenes, so the right horizon is a systems decision. The sixth is a Cognaptus inference from the logical-metric pattern: when Task Completion Rate fails to rise with Goal Achieved Rate, the bottleneck is not simply object placement. It is constraint satisfaction.
This has procurement and product implications. If a vendor claims that a VLM-based robot system is reliable because it “replans continuously,” the correct response is not applause. It is a questionnaire.
Does the replanning include execution history? Are previous failed actions shown to the model? Are logical constraints checked outside the model? What happens when the model’s new plan contradicts an already executed step? How is the control horizon selected? Is the planner allowed to generate fresh errors at every cycle, or is there a stabilising mechanism?
Those are not academic niceties. They are the difference between a demo that recovers gracefully and an operation that creates a support ticket every time the tomato soup can enters the basket at the wrong moment.
The boundaries are narrow, and that is useful
The paper’s limitations are not decorative. They materially define how far the result should travel.
First, the experiments use zero-shot VLM planners. Fine-tuned, task-specialised, or tool-augmented models may behave differently. Second, the environments are static tabletop manipulation tasks. This was a deliberate experimental choice, but it means the control-horizon findings should not be overextended to highly dynamic environments. Third, the action set is limited to pick and place. The authors expect the observations to generalise to other primitives because the VLM planner treats actions textually, but that remains an expectation, not a demonstrated deployment guarantee.
Another boundary is especially important: the benchmarked VLMs were given explicit object location information. The authors state that the tested VLMs did not demonstrate the necessary capability for object position estimation and reasoning from images alone. That is a quiet but important warning. The “vision-language” in VLM does not mean the model can safely replace perception, state estimation, geometric reasoning, and motion planning. The system still needs structured spatial information and classical machinery around the model.
This is not a weakness of the paper. It is what makes the paper practically useful. It avoids pretending that VLMs are general-purpose robot minds and instead studies how to use them as one component inside a planning loop.
Replanning is a system design problem
The best lesson from this paper is that reliability does not come from asking a smarter model to produce a perfect plan. It comes from designing the loop around an imperfect planner.
Open-loop planning asks the model to be right once and stay right. Closed-loop planning admits that this is a bad bargain. Shorter horizons create more opportunities for recovery, but also more opportunities for fresh mistakes. Warm-starting preserves continuity, improves completion, and reduces harmful logical drift, but can anchor the model to earlier errors.
So the operational thesis is not “VLMs make robots autonomous.” That is the kind of sentence that should be handled with tongs.
The better thesis is narrower and stronger: VLMs can be useful high-level planners when they are embedded in closed-loop systems that feed back execution status, preserve prior planning context, and treat replanning frequency as a controlled parameter rather than a religious preference.
For robotics businesses, that is the difference between buying intelligence and engineering reliability. One is a procurement fantasy. The other has a chance of surviving contact with a basket, a plate, and one stubborn lemon.
Cognaptus: Automate the Present, Incubate the Future.
[1] Hao Wang, Sathwik Karnik, Beatrice Lim, and Somil Bansal, “Using Language Models as Closed-Loop High-Level Planners for Robotics Applications: A Brief Overview and Benchmarks,” arXiv:2511.07410v2, 2026, https://arxiv.org/abs/2511.07410.