When AI Knows the Map but Gets Lost on the Journey

Workflow demos are usually polite. They show the agent reading a request, calling a tool, checking a result, and producing an answer before anything embarrassing has time to happen.

The real test begins later. Not at step three. At step twenty-seven, when a previous decision constrains the next one, a small drift compounds, and the system must still remember what “done correctly” means. This is where many AI products discover that knowing the rule is not the same as applying it repeatedly without wobbling. A charming discovery, preferably not made inside a production accounting workflow.

A recent paper, “Generalization in LLM Problem Solving: The Case of the Shortest Path,” studies this problem with a useful level of discipline.¹ Instead of asking whether a model performs well on another messy benchmark, the authors construct a controlled shortest-path environment. A model is trained to output an optimal route between two nodes on a map. The task is simple enough to measure cleanly, but structured enough to expose a deeper issue in language-model problem solving: models can transfer a learned rule to new environments, yet still fail when the same rule must be applied for a longer horizon.

That is the commercial lesson. AI systems may generalize across where a task happens before they generalize across how long the task runs.

The useful contrast is not “can it reason?” but “which kind of generalization failed?”

Most business discussions about AI agents use a single word—“reasoning”—to cover several different abilities. That saves time in meetings and destroys precision everywhere else.

The paper separates two axes that are often blended together:

Capability	What it asks	Business analogue	Why it matters
Spatial transfer	Can the model solve the same type of problem on a new map?	Can an automation trained on one department, document style, or account structure work in another?	Tests whether the model learned a reusable rule rather than memorizing local examples.
Length scaling	Can the model solve paths longer than those seen in training?	Can an agent handle a longer workflow, more dependencies, or more sequential decisions?	Tests whether local competence remains stable when recursively applied.

This distinction looks technical, but it changes how we should evaluate automation. A system can pass the first test and fail the second. That is exactly what the paper finds.

In the main experiment, models achieve strong spatial transfer: when evaluated on disjoint maps within the training-length regime, success rates are above 90%. The models have not merely memorized node sequences from one map. They appear to learn reusable navigation behavior that can be applied elsewhere.

Then the path gets longer.

Once the shortest path exceeds the maximum length seen during training, success deteriorates sharply. The deterioration appears both on the original map and on unseen maps. So the bottleneck is not simply “new environment.” The model can handle new environments. It struggles with longer recursive execution.

This is the first useful correction for product thinking: cross-domain transfer and long-horizon reliability are separate tests. Passing one does not grant a certificate for the other. The market enjoys certificates. The model does not care.
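One way to keep the two tests apart in practice is to report them as separate cells rather than one blended score. The sketch below is illustrative only: the record fields (`map_seen`, `path_len`, `success`), the `max_train_len` cutoff, and the sample records are assumptions invented here, not the paper's evaluation code or data.

```python
from collections import defaultdict

def success_by_axis(records, max_train_len):
    """Group binary outcomes into four cells:
    (seen / unseen map) x (within / beyond the training length)."""
    cells = defaultdict(list)
    for r in records:
        map_axis = "seen_map" if r["map_seen"] else "unseen_map"
        len_axis = "within_len" if r["path_len"] <= max_train_len else "beyond_len"
        cells[(map_axis, len_axis)].append(1 if r["success"] else 0)
    return {cell: sum(v) / len(v) for cell, v in cells.items()}

# Hypothetical evaluation records, invented for illustration only.
records = [
    {"map_seen": True, "path_len": 5, "success": True},
    {"map_seen": True, "path_len": 12, "success": False},
    {"map_seen": False, "path_len": 4, "success": True},
    {"map_seen": False, "path_len": 6, "success": True},
    {"map_seen": False, "path_len": 14, "success": False},
    {"map_seen": False, "path_len": 11, "success": True},
]

rates = success_by_axis(records, max_train_len=8)
for cell in sorted(rates):
    print(cell, round(rates[cell], 2))
```

Read across the cells, not down a single average: a strong "unseen_map, within_len" cell next to a weak "beyond_len" row is the pattern the paper describes.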

The failure is recursive instability, not merely harder subproblems

A natural objection is that longer paths are just harder. If the model is imperfect on short segments, then a longer path contains more chances to fail. That would be boring but understandable: more steps, more error.

The paper tests this directly. For long paths, the authors split the target route into shorter subpaths that fall within the training-length regime. Then they ask whether the model can solve those subparts and whether it can solve the full path when the subparts are individually solvable.

This decomposition matters because it separates two mechanisms:

Possible mechanism	Meaning	Practical interpretation
Hardness accumulation	Long tasks fail because they contain more local subproblems, and each local step has some failure probability.	Improve local accuracy and reduce per-step error.
Recursive instability	Even when the smaller pieces are solvable, the model fails to compose them into the longer solution.	Add checkpoints, decomposition, state tracking, or curriculum exposure to longer horizons.

The evidence points mainly to recursive instability. The model’s subpath performance remains high, but full-path success drops much more. In the paper’s composition analysis, the recursive-stability term falls substantially across longer path groups, while the residual error term remains small. The meaning is simple: the model has many of the local moves, but it cannot reliably keep the whole journey coherent as the journey extends.
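A rough way to see the difference between the two mechanisms is to ask how often the full path succeeds when every subpath was individually solved. The sketch below uses that conditional rate as a stand-in for the paper's recursive-stability term; the paper's exact definition may differ, and the trial records are invented for illustration.

```python
def composition_report(trials):
    """Compare local (subpath) success with full-path success.

    Each trial has:
      sub_ok  : list of booleans, one per subpath within the training length
      full_ok : boolean, whether the full long path was solved
    """
    n = len(trials)
    sub_total = sum(len(t["sub_ok"]) for t in trials)
    sub_solved = sum(sum(t["sub_ok"]) for t in trials)
    # Trials where every piece was individually solvable.
    solvable = [t for t in trials if all(t["sub_ok"])]
    composed = sum(1 for t in solvable if t["full_ok"])
    return {
        "subpath_success": sub_solved / sub_total,
        "full_success": sum(1 for t in trials if t["full_ok"]) / n,
        # Assumed proxy for recursive stability, not the paper's formula.
        "full_given_all_subpaths_ok": composed / len(solvable) if solvable else float("nan"),
    }

# Hypothetical trials, invented for illustration only.
trials = [
    {"sub_ok": [True, True, True], "full_ok": True},
    {"sub_ok": [True, True, True], "full_ok": False},
    {"sub_ok": [True, True], "full_ok": False},
    {"sub_ok": [True, False, True], "full_ok": False},
]

report = composition_report(trials)
for name, value in report.items():
    print(name, round(value, 3))
```

If hardness accumulation were the whole story, the conditional rate would sit near 1, because the pieces were all solved. A low conditional rate alongside high subpath success points at composition instead.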

This distinction is not academic ornament. In enterprise automation, the wrong diagnosis leads to the wrong spending. If the failure is local competence, buy better examples for the weak local operation. If the failure is recursive instability, more short examples may be a slow and expensive way to avoid admitting that the system needs better decomposition, verification, and horizon-aware training.

A customer-support agent that handles one refund correctly may still mishandle a refund that depends on a prior shipment correction, a warranty exception, and a region-specific approval rule. Each subtask may be familiar. The combined chain may still drift.

The paper gives us a cleaner vocabulary for that failure. Failing a long task does not necessarily mean the model “does not understand.” It may understand locally and unravel sequentially. That is less comforting than it sounds.

Coverage beats theatrical diversity, but only after minimal diversity exists

The paper then sharpens the data question by separating coverage from diversity.

In the map setting, coverage means how many unique nodes—the primitive elements of the local training world—appear in the training questions. Diversity means how richly those primitives are paired and recombined into different start–end relationships.

This gives a more useful data-design framework than the usual vague appeal to “diverse data.” Diversity is not one thing. It has types. Some types expand the primitive base. Others recombine the same primitive base. These are not equivalent.
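To make the two properties concrete, here is a minimal sketch that scores a set of training questions on both. The formulas are assumptions chosen for illustration: coverage as the share of map nodes that appear in any question, and diversity as distinct start-end pairs per covered node. The paper's own measures may be defined differently.

```python
def coverage_and_diversity(questions, all_nodes):
    """questions: list of (start, end) pairs; all_nodes: every node on the map."""
    covered = {node for pair in questions for node in pair}
    distinct_pairs = set(questions)
    coverage = len(covered) / len(all_nodes)
    # Assumed diversity proxy: how many distinct pairings each covered node supports.
    diversity = len(distinct_pairs) / len(covered) if covered else 0.0
    return coverage, diversity

all_nodes = list("abcdefghij")  # a hypothetical 10-node map

# Six questions that exhaustively remix three nodes.
narrow = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("b", "c"), ("c", "b")]
# Six questions that touch every node but pair them sparsely.
broad = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h"), ("i", "j"), ("b", "c")]

print(coverage_and_diversity(narrow, all_nodes))  # (0.3, 2.0)
print(coverage_and_diversity(broad, all_nodes))   # (1.0, 0.6)
```

Both sets contain six questions, so a size count cannot tell them apart. The first is rich in recombination and poor in primitives; the second is the reverse.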

The paper’s finding is structured:

Data property	What the paper finds	Business translation
Low coverage	Cannot be rescued even by very high diversity.	Recombining a narrow set of cases does not teach a broad skill.
Minimal diversity	Required to unlock the value of coverage.	The model needs some variation in how concepts connect.
Excessive diversity at low coverage	Can hurt transfer.	Exhaustively remixing a tiny case set may encourage memorization.
Mid-to-high coverage with modest diversity	Best efficiency–performance trade-off.	Cover more task types first; add enough variation, not infinite theater.

The appendix extends this point with a sensitivity analysis. At very low coverage, even exponentially high diversity raises success only weakly. At higher coverage, diversity amplifies performance much more effectively. The operational lesson is not “ignore diversity.” It is “do not ask diversity to compensate for missing coverage.”

This is a common enterprise mistake. A company may generate many synthetic variations of the same few workflow examples and call the dataset diverse. It is diverse in language surface, not in operational coverage. The model sees many costumes, one plot.

The paper’s MathQA case study supports the same direction outside the synthetic map world. The authors fine-tune Qwen2.5-7B-Instruct on three MathQA categories—probability, gain, and physics—using data regimes designed to compare more questions, higher operation-set coverage, higher structural diversity, and more solutions per question. In the reported table, high-coverage “more questions” reaches 0.792 on probability, 0.82 on gain, and 0.77 on physics; the “more solutions” setting is lower at 0.771, 0.72, and 0.70 respectively. High diversity matches high coverage on probability but falls behind on gain and physics.

The authors are careful not to claim that the math case reproduces the clean spatial-transfer and length-scaling axes of the synthetic setup. It cannot. Natural-language math problems are messier. But as a practical extension, the case study supports the data-design principle: under tight budgets, exposing the model to more distinct conceptual problem types tends to beat collecting many reasoning traces for fewer problems.

For an AI team, this suggests a dataset audit that separates three questions:

Audit question	Bad answer	Better answer
What primitives are covered?	“We have 50,000 examples.”	“We cover these 120 task families and these 35 exception types.”
How are primitives recombined?	“We paraphrased each example 20 ways.”	“Each task family appears in several materially different dependency structures.”
How many solutions per question?	“We collected ten traces for every case.”	“We add extra traces only where solution ambiguity teaches useful variation.”

This is not glamorous. It is dataset plumbing. Unfortunately, many AI failures are plumbing failures wearing a philosophical hat.

Longer tasks need neighboring longer examples, not just more short practice

The paper’s data story changes when the target is length scaling rather than spatial transfer.

For spatial transfer, broad coverage and distinct questions do most of the work. For longer paths, that is not enough. Even the strongest spatial-transfer model fails once the evaluation path length moves beyond the training maximum.

The authors then test whether length scaling can be rescued by adding training paths of different lengths. The result is specific and practical: adding a very small fraction of paths at or slightly above the target length can substantially improve performance, while adding shorter paths gives little benefit and adding much longer paths can degrade performance.

This resembles curriculum design. The model does not merely need “more data.” It needs exposure near the extrapolation boundary. Slightly longer examples teach the model how to extend the learned rule into the next horizon. Much shorter examples repeat what it already knows. Much longer examples may be too far from the current learned regime and introduce confusion rather than capability.
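A training mix built on this idea might look like the sketch below. Everything numeric here is an assumption: the paper reports that a very small fraction of near-target paths helps, but the 5% default, the `margin` of two steps, and the example pool are placeholders invented for illustration.

```python
import random

def build_length_curriculum(pool, max_train_len, target_len,
                            margin=2, boundary_fraction=0.05, seed=0):
    """Keep the short base set and add a small sample of examples whose length
    is at or slightly above the target. Much longer examples are left out."""
    rng = random.Random(seed)
    base = [ex for ex in pool if ex["length"] <= max_train_len]
    boundary = [ex for ex in pool
                if target_len <= ex["length"] <= target_len + margin]
    # A small fraction of the base size, capped by what is available.
    k = min(len(boundary), max(1, round(boundary_fraction * len(base))))
    return base + rng.sample(boundary, k)

# Hypothetical pool: ten examples at each length from 1 to 30.
pool = [{"id": f"{length}-{i}", "length": length}
        for length in range(1, 31) for i in range(10)]

mix = build_length_curriculum(pool, max_train_len=10, target_len=15)
added = [ex for ex in mix if ex["length"] > 10]
print(len(mix), sorted(ex["length"] for ex in added))
```

The design choice worth noticing is what the function refuses to add: nothing between the base and the target, and nothing far beyond the target.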

For business automation, this maps cleanly onto staged rollout design.

Do not validate an agent only on short workflows and then deploy it into long ones. Also do not jump from five-step examples to fifty-step examples and hope scale will perform a small miracle. Build a horizon curriculum:

Deployment stage	Evaluation focus	Data implication
Short workflow	Local correctness and tool-use accuracy	Cover core task families.
Medium workflow	Dependency tracking and state consistency	Add cases slightly longer than the current comfort zone.
Long workflow	Recursive stability under accumulated constraints	Use checkpoints, decomposition, and targeted long-horizon examples.
Production monitoring	Drift, contradiction, and stale intermediate state	Track error rate by step depth, not only by final task type.

The paper does not prove this exact recipe for enterprise agents. It does something more modest and more useful: it shows why the recipe is plausible. If length failure is a distinct axis, then length-aware evaluation and training are not optional extras. They are the test of whether the workflow demo survives contact with reality.

Reinforcement learning stabilizes training; it does not automatically expand capability

The paper then compares supervised fine-tuning and reinforcement learning. This is where the result becomes unfashionable in a productive way.

Reinforcement learning is often discussed as the thing that turns language models into better reasoners. Sometimes it helps substantially, especially when good answers are easy to verify but hard to generate. The shortest-path setting is a useful test because the reward is clean: a generated path either forms a valid shortest path or it does not.
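A reward this clean fits in a few lines. The sketch below assumes an unweighted graph stored as an adjacency dictionary and uses breadth-first search to find the optimal length; the graph, the function names, and the unweighted assumption are illustrative and not taken from the paper's code.

```python
from collections import deque

def shortest_distance(graph, start, goal):
    """Breadth-first search distance in an unweighted graph; None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def binary_reward(graph, start, goal, path):
    """Return 1 only for a valid path of optimal length, otherwise 0."""
    if not path or path[0] != start or path[-1] != goal:
        return 0  # wrong start, or the target is never reached
    for a, b in zip(path, path[1:]):
        if b not in graph.get(a, ()):
            return 0  # invalid move: no such edge
    return 1 if len(path) - 1 == shortest_distance(graph, start, goal) else 0

# Hypothetical five-node map.
graph = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
    "D": ["B", "C", "E"], "E": ["D"],
}

print(binary_reward(graph, "A", "E", ["A", "B", "D", "E"]))            # 1
print(binary_reward(graph, "A", "E", ["A", "B", "A", "C", "D", "E"]))  # 0, valid but not shortest
print(binary_reward(graph, "A", "E", ["A", "D", "E"]))                 # 0, invalid move
print(binary_reward(graph, "A", "E", ["A", "B", "D"]))                 # 0, target not reached
```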

The authors train with a GRPO-style reinforcement-learning setup using binary rewards and compare it against SFT under several conditions. The result is not that RL is useless. The result is more precise:

Training paradigm	What improves	What does not improve
SFT	Can reach high peak performance with sufficient, well-designed data.	Can overfit with extended training; does not solve length scaling by itself.
RL	Stabilizes training and reduces degradation under prolonged optimization.	Does not surpass the best SFT ceiling; error patterns remain similar.

In spatial transfer, RL does not exceed the best fully trained SFT model. In length scaling, extended RL training remains stable, while SFT can improve early and then overfit. But RL still does not break through the best SFT bound. The appendix’s qualitative failure analysis reinforces this: SFT and GRPO show nearly identical error categories across length groups, including valid but non-shortest paths, failure to reach the target, and invalid moves.

The business interpretation should be disciplined. RL can be an operations tool. It can stabilize behavior when supervision is noisy, when sequence-level reward matters, or when the training process risks overfitting. But this paper does not support the magical version of RL in which reward optimization automatically creates a longer-horizon planning capability that the supervised model lacked.

If the data do not expose the needed horizon, and the base model has not learned stable recursive composition, RL may polish the surface rather than move the frontier. The surface may look better. It is still the same room.

Inference-time search lifts the curve but does not change its shape

The final major comparison concerns inference-time strategies. Perhaps the model has the capability, but greedy decoding fails to reveal it. Generate multiple paths, select the best one, and maybe the length-scaling problem goes away.

The paper tests this with majority-of-10 sampling and shortest-of-10 selection. The latter is especially favorable because the task has an objective: the shortest valid path is preferred.

The result is again measured rather than theatrical. Inference-time search improves success rates for both SFT and RL models. But the degradation trend remains. Search shifts the curve upward; it does not remove the length-scaling failure. Moreover, RL models remain below their SFT counterparts under the same inference strategies, and even strong objective-guided sampling for RL reaches only roughly the level of SFT greedy decoding.
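The two selection rules are easy to confuse, so a small sketch may help. It assumes majority selection means picking the most frequent sampled path and shortest selection means picking the shortest candidate that passes a validity check; the paper's exact procedures may differ, and the graph and the ten candidates are hand-written stand-ins for model samples.

```python
from collections import Counter

def is_valid(graph, start, goal, path):
    """A path is valid if it starts and ends correctly and follows real edges."""
    return (bool(path) and path[0] == start and path[-1] == goal
            and all(b in graph.get(a, ()) for a, b in zip(path, path[1:])))

def majority_vote(candidates):
    """Return the most frequently sampled path, with no check on quality."""
    counts = Counter(tuple(c) for c in candidates)
    return list(counts.most_common(1)[0][0])

def shortest_valid(graph, start, goal, candidates):
    """Return the shortest candidate that passes validation, or None."""
    valid = [c for c in candidates if is_valid(graph, start, goal, c)]
    return min(valid, key=len) if valid else None

# Hypothetical five-node map and ten hand-written "samples".
graph = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"],
    "D": ["B", "C", "E"], "E": ["D"],
}
candidates = (
    [["A", "B", "A", "C", "D", "E"]] * 4   # valid detour
    + [["A", "B", "D", "E"]] * 3           # valid and shortest
    + [["A", "D", "E"]] * 2                # invalid move
    + [["A", "C", "D"]]                    # never reaches the target
)

print(majority_vote(candidates))                    # ['A', 'B', 'A', 'C', 'D', 'E']
print(shortest_valid(graph, "A", "E", candidates))  # ['A', 'B', 'D', 'E']
```

Objective-guided selection wins here only because a shortest path was among the samples. When no candidate is correct, neither rule can manufacture one, which is the point about curve shape.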

For product architecture, this matters because many agent systems rely on orchestration layers, self-consistency, best-of-N sampling, tool retries, or verifier loops to raise reliability. Those methods can help. They are not fake. But if the base model’s long-horizon composition is weak, inference-time search becomes an expensive compensator, not a cure.

This is the difference between:

Strategy	Good use	Bad use
Best-of-N sampling	Improve outputs when good candidates are already in the model’s reachable solution space.	Pretend the model has learned a missing capability.
Verifier selection	Filter among candidate outputs when verification is reliable.	Replace task understanding with post-hoc gambling.
Tool retries	Handle stochastic failures and transient tool errors.	Hide systematic drift across long workflows.
Workflow decomposition	Reduce horizon length and isolate state updates.	Create many subtasks without checking cross-step consistency.

There is a small irony here. Test-time compute is often sold as “more thinking.” In this setting, it is closer to “more attempts.” More attempts are useful when success is nearby. They are less useful when the system cannot maintain the underlying structure over longer recursion.

Casinos also understand the emotional appeal of more attempts. This is not a compliment.

How to read the paper’s experiments without overclaiming them

The paper is valuable partly because different experiments serve different roles. Treating all of them as one undifferentiated pile of “results” would lose the point.

Paper component	Likely purpose	What it supports	What it does not prove
Spatial transfer on disjoint maps	Main evidence	Models can apply learned shortest-path behavior to unseen map worlds.	That they will transfer equally well in natural enterprise domains.
Length scaling tests	Main evidence	Long-horizon failure is distinct from spatial-transfer failure.	That all LLMs fail all long-horizon tasks in the same way.
Subpath decomposition	Mechanism test	Recursive instability is a major driver of length failure.	A complete causal account of every long-horizon failure mode.
More questions vs more answers	Data-selection ablation	Distinct questions provide more transfer value than multiple solutions per question under fixed budget.	That solution diversity is never useful.
Coverage vs diversity	Data-property ablation and sensitivity analysis	Coverage sets the ceiling; modest diversity helps; excessive diversity at low coverage can hurt.	A universal numeric threshold for real datasets.
Slightly longer path augmentation	Length-scaling intervention	Exposure near the target length can rescue performance better than shorter examples.	A general curriculum law for every reasoning domain.
SFT vs RL	Training-paradigm comparison	RL stabilizes but does not surpass the best SFT ceiling in this setup.	That RL is commercially irrelevant.
Majority-of-10 and shortest-of-10	Inference-time strategy comparison	Search improves success but does not fix length scaling.	That inference-time compute is not worth using.
MathQA case study	Practical extension	The data-allocation principle has support beyond synthetic maps.	Full equivalence between map navigation and natural-language math reasoning.

This table is not a limitation dump. It is a user manual. The paper is strongest when used as a diagnostic framework for generalization, not as a direct benchmark for enterprise agents.

What Cognaptus would infer for business AI systems

The paper directly shows behavior in a controlled synthetic shortest-path setup, with supporting evidence from MathQA. From that, Cognaptus can infer several practical design principles for business AI systems. These are inferences, not claims that the paper ran an enterprise procurement workflow on your ERP stack. Sadly, science remains inconsiderate like that.

1. Evaluate transfer and horizon separately

An AI workflow should not receive a single pass/fail score. It should have at least two reliability curves:

Evaluation axis	Example question	Useful metric
Domain transfer	Can the agent handle a new business unit, supplier format, or customer category?	Accuracy by domain shift.
Horizon scaling	Can it handle more steps, dependencies, or decision depth?	Success rate by step count or dependency length.

A model that transfers across departments may still fail on longer multi-step processes inside one department. Conversely, a model that handles a long familiar workflow may fail when the same workflow appears in a new operational context. These are different risks.

2. Spend data budget on coverage before repetition

For internal fine-tuning, retrieval libraries, evaluation sets, and agent memory design, coverage should be explicit. A useful dataset inventory should list task families, exception classes, document types, decision constraints, and dependency structures.

Collecting many solutions for the same narrow set of cases may improve fluency and local robustness, but this paper suggests it is a weaker first priority than broad exposure to distinct questions. The first budget should buy coverage. The second budget can buy variation.

3. Build length curricula for autonomous workflows

If a company wants agents to run longer workflows, it should not rely on short-task validation. Create examples and tests at increasing horizon lengths: three steps, seven steps, fifteen steps, and beyond. Add examples slightly beyond the current performance boundary. Track where failure begins.

This is more informative than a single benchmark average. Averages hide cliffs. Long-horizon agents tend to live near cliffs.
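A small sketch of how an average can hide the cliff. The run log, the step-count buckets, and the 0.7 threshold are all assumptions invented for illustration; a real team would pick buckets and thresholds from its own workflows.

```python
def success_by_horizon(runs, buckets):
    """buckets: list of (low, high) inclusive step-count ranges.
    Returns (bucket, success_rate, sample_count) per bucket."""
    curve = []
    for low, high in buckets:
        hits = [r["success"] for r in runs if low <= r["steps"] <= high]
        rate = sum(hits) / len(hits) if hits else None
        curve.append(((low, high), rate, len(hits)))
    return curve

def first_cliff(curve, threshold):
    """Return the first bucket whose success rate falls below the threshold."""
    for bucket, rate, _count in curve:
        if rate is not None and rate < threshold:
            return bucket
    return None

# Hypothetical run log: 4 short, 4 medium, and 4 long workflows.
runs = (
    [{"steps": 3, "success": True}] * 4
    + [{"steps": 7, "success": True}] * 3 + [{"steps": 7, "success": False}]
    + [{"steps": 15, "success": True}] + [{"steps": 15, "success": False}] * 3
)

overall = sum(r["success"] for r in runs) / len(runs)
curve = success_by_horizon(runs, [(1, 5), (6, 10), (11, 20)])

print(round(overall, 3))
for bucket, rate, count in curve:
    print(bucket, rate, count)
print(first_cliff(curve, threshold=0.7))
```

The blended figure looks tolerable; the per-bucket curve shows that most of the failures live in the longest bucket.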

4. Use RL as a stabilizer, not as mythology

RL can still be valuable. In noisy business environments, sequence-level reward and verifier feedback can make systems more robust. But the paper warns against treating RL as a guaranteed capability expander. If the underlying data and task exposure do not support the needed behavior, RL may only make the model more consistently limited.

Consistency is good. Consistently wrong is less good.

5. Treat inference-time search as a budgeted reliability layer

Sampling, verification, and best-of-N selection should be evaluated by marginal value. They can raise success rates when better candidates are present in the generated set. But if longer tasks systematically fail because the model cannot maintain recursive stability, search will not solve the root problem.

In production terms: test-time compute should have an ROI curve. More retries should not be a confession booth for weak design.
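A back-of-envelope version of that ROI curve, under two loud assumptions: attempts are independent, and a perfect verifier picks out any correct attempt. Under those assumptions, success after N attempts is 1 - (1 - p)^N. The per-attempt rates below are invented, and neither assumption comes from the paper.

```python
def best_of_n_success(p, n):
    """Chance that at least one of n independent attempts succeeds,
    assuming a perfect verifier can recognize the good one."""
    return 1 - (1 - p) ** n

def marginal_gain(p, n):
    """Extra success probability bought by the n-th attempt."""
    return best_of_n_success(p, n) - best_of_n_success(p, n - 1)

# Hypothetical per-attempt success rates: a short workflow and a long one.
for p in (0.5, 0.02):
    for n in (1, 5, 10):
        print(p, n, round(best_of_n_success(p, n), 3), round(marginal_gain(p, n), 4))
```

Even this optimistic model shows the asymmetry: retries pay off quickly when single-attempt success is already decent and buy little when it is near zero. Systematic drift makes attempts correlated rather than independent, so real returns on long workflows are likely to be worse than this curve suggests.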

The boundary: synthetic clarity is not enterprise realism

The main limitation is obvious and important. The paper’s cleanest evidence comes from a controlled synthetic environment with relatively small LLaMA-style transformer models trained from scratch on map data. That control is exactly why the mechanism is visible. It is also why we should not mechanically transfer every numeric result into business operations.

Real enterprise workflows include ambiguous objectives, shifting policies, missing documents, tool latency, human approvals, and politics. Shortest path has one delightful advantage over corporate life: the objective is actually well defined.

The MathQA case study helps bridge the gap, but it is still a case study. It supports the data-allocation principle—more distinct questions and broader operation-set coverage can outperform more solutions per question—but it does not turn the synthetic map findings into a universal law for all reasoning domains.

So the safest business use of this paper is diagnostic:

separate spatial/domain transfer from horizon scaling;
measure failure by task length, not only task category;
audit data coverage before celebrating dataset size;
treat RL and inference-time search as stabilizers and amplifiers, not miracle engines;
design workflows with decomposition and checkpoints when sequential depth grows.

That is already enough. Not every paper needs to be a deployment recipe. Some papers are more valuable because they clean the lens.

The quiet lesson: the map is not the journey

The paper’s core message is not that language models cannot generalize. In fact, one of its more interesting findings is positive: models can show strong spatial transfer in a carefully controlled compositional task. They can learn reusable structure.

The warning is narrower and sharper. Reusable structure does not automatically imply stable recursive execution. A model can know how to move toward the destination and still lose the route when the path gets long.

For AI agents, that distinction may define the next phase of serious evaluation. The question is no longer just whether the system can solve the task. It is whether the system can keep solving the task as the context stretches, dependencies accumulate, and earlier decisions become the ground on which later decisions stand.

In business language: do not ask only whether the agent knows the map. Ask whether it can finish the journey.

Cognaptus: Automate the Present, Incubate the Future.

¹ Yao Tong, Jiayuan Ye, Anastasia Borovykh, and Reza Shokri, “Generalization in LLM Problem Solving: The Case of the Shortest Path,” arXiv:2604.15306, 2026. https://arxiv.org/abs/2604.15306
