Grid.
It looks like the friendliest possible structure. Rows, columns, symbols, rules. No blurry photos, no social nuance, no awkward customer email written at 1:13 a.m. Just a small board and a set of constraints.
Naturally, this is where modern reasoning models still manage to embarrass themselves.
The paper introducing TopoBench studies a deceptively simple question: can frontier large language models solve topology-heavy grid puzzles where the answer depends on connectivity, loop closure, symmetry, visibility, and state consistency? [1] The answer is not “never.” That would be too easy. The answer is more annoying: models often understand enough to start correctly, reason long enough to sound competent, and then lose the structure that makes the solution valid.
That distinction matters. A model that cannot start is easy to diagnose. A model that starts well and then violates a hidden global constraint is the one you accidentally put inside an operations workflow.
TopoBench is useful because it does not merely report that models fail. It asks where the failure enters the system. The paper’s most business-relevant result is not that hard puzzles are hard. We had suspicions. The sharper lesson is that many failures come from a brittle conversion layer: spatial layout becomes text tokens, text tokens become an internal board state, the board state becomes constraints, and only then does reasoning begin.
When that conversion layer is unreliable, a stronger model, a longer prompt, or a picture of the grid may not solve the problem. The model is not necessarily missing “intelligence” in the abstract. It is missing a dependable constraint compiler.
A grid becomes hard when local moves must preserve the whole system
TopoBench evaluates six puzzle families. Each family forces the model to preserve a different kind of global spatial invariant.
| Puzzle family | Core constraint | Why it stresses LLM reasoning |
|---|---|---|
| Flow Free | Path connectivity | Connect matching endpoints without crossing paths while filling the board. |
| Bridges | Network connectivity | Satisfy island degree counts while keeping the network connected and avoiding crossings. |
| Loopy | Loop closure | Draw one continuous closed loop satisfying local edge counts. |
| Galaxies | Rotational symmetry | Partition regions so each region is symmetric around a center. |
| Undead | Reflection and visibility | Place monsters while tracking line-of-sight through mirrors. |
| Pattern | Contiguity across axes | Fill binary grids to match row and column run-length clues. |
These tasks are not primarily about recalling facts. They are also not ordinary arithmetic dressed up in a puzzle costume. A single move can satisfy a local rule while damaging the entire board. In Bridges, for example, an island may still have remaining capacity, but a bridge can create a crossing or isolate a component. In Loopy, local edge counts can look plausible while failing to form one continuous loop. In Undead, visibility through mirrors turns a small grid into a state-tracking trap with decorative monsters. Cute, in the way a tax audit is cute.
The benchmark contains 900 instances: six puzzle families, three difficulty tiers, and 50 puzzles per family-tier combination. The authors use puzzle-specific verifiers, which is important because many of these puzzles can have multiple valid solutions. The model is not graded for matching a single answer key. It is graded for satisfying the actual constraints.
That design choice makes TopoBench more than a leaderboard. It tests whether models can maintain a structured world state under repeated local updates.
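To make "graded for satisfying the actual constraints" concrete, here is a minimal sketch of what a Bridges verifier might look like. It is not the paper's verifier. The function name `verify_bridges` and the data layout (islands as a coordinate-to-count mapping, bridges as endpoint pairs with a count) are assumptions for illustration, and the cap of two bridges per island pair comes from the standard rules of the puzzle rather than from anything stated above.

```python
from collections import defaultdict

def verify_bridges(islands, bridges):
    """islands: {(row, col): required_count}; bridges: [((r1, c1), (r2, c2), count)]."""
    if not islands:
        return False
    degree = defaultdict(int)
    adjacency = defaultdict(set)
    segments = []          # (orientation, fixed coordinate, low, high)
    seen_pairs = set()
    for a, b, count in bridges:
        pair = frozenset((a, b))
        if a not in islands or b not in islands or a == b:
            return False
        if count not in (1, 2) or pair in seen_pairs:
            return False   # assumed rule: at most two bridges per pair
        seen_pairs.add(pair)
        if a[0] == b[0]:   # horizontal bridge along one row
            lo, hi = sorted((a[1], b[1]))
            segments.append(("h", a[0], lo, hi))
            between = [(a[0], c) for c in range(lo + 1, hi)]
        elif a[1] == b[1]: # vertical bridge along one column
            lo, hi = sorted((a[0], b[0]))
            segments.append(("v", a[1], lo, hi))
            between = [(r, a[1]) for r in range(lo + 1, hi)]
        else:
            return False   # diagonal bridges are not allowed
        if any(cell in islands for cell in between):
            return False   # a bridge may not pass over an island
        degree[a] += count
        degree[b] += count
        adjacency[a].add(b)
        adjacency[b].add(a)
    # Global check 1: no horizontal bridge crosses a vertical one.
    for o1, row, c_lo, c_hi in segments:
        for o2, col, r_lo, r_hi in segments:
            if o1 == "h" and o2 == "v" and r_lo < row < r_hi and c_lo < col < c_hi:
                return False
    # Local check: every island has exactly its required number of bridges.
    if any(degree[pos] != need for pos, need in islands.items()):
        return False
    # Global check 2: all islands form one connected network.
    start = next(iter(islands))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(islands)
```

A board with four islands joined as two separate pairs passes every degree count and still fails the final connectivity check. That is the local-versus-global gap in miniature.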
The score table is not the story; the collapse pattern is
The headline result is severe enough. On hard instances, the strongest tested model, GPT-5 Mini, averages 0.24 accuracy. Gemini 3 Flash averages 0.09. DeepSeek V3.2, the strongest open-weight model in the main hard-tier table, averages 0.10.
But the average hides a more useful pattern.
On hard Loopy and Galaxies, the reported accuracies are essentially zero across the frontier models in the main result table. By contrast, Bridges and Pattern retain some measurable performance. GPT-5 Mini reaches 0.44 on hard Bridges and 0.44 on hard Pattern; DeepSeek V3.2 reaches 0.40 on hard Bridges but only 0.12 on hard Pattern. Undead is especially uneven: GPT-5 Mini reaches 0.52 on hard Undead, while Gemini 3 Flash reaches 0.00 and DeepSeek V3.2 reaches 0.10.
This is not a smooth scaling story where larger or newer models simply climb the same hill. It is a jagged map of capability boundaries.
| Observation | Interpretation | Boundary |
|---|---|---|
| Frontier models separate clearly from weaker baselines on easy tiers. | Model capability matters. TopoBench failures are not just random noise. | The benchmark is still limited to generated puzzle families. |
| Accuracy collapses sharply on hard tiers. | Deduction depth and state maintenance remain fragile. | Hard-tier difficulty differs by puzzle family, so averages should not be overread. |
| Loopy and Galaxies remain nearly unsolved beyond easy cases. | Single global invariants such as loop closure and symmetry are especially difficult. | The paper does not prove these are universally the hardest spatial constraints. |
| Bridges remains partially solvable for frontier models. | Some topology-heavy reasoning exists once the model has usable constraints. | Bridges also becomes the main site for later tool experiments, so its conclusions travel best to similar tasks. |
The old, comfortable interpretation would be: “LLMs are bad at spatial reasoning.” True, but too blunt. The paper pushes toward a more precise claim: models can often reason over constraints once those constraints are available, but they struggle to extract and maintain the constraints from spatial representations.
That is the mechanism worth following.
The real failure chain starts before reasoning looks wrong
A common reader misconception is that spatial reasoning failure should be fixed by one of three things: a better reasoning model, a longer chain of thought, or a multimodal input. Give the model a picture, tell it to think harder, and perhaps the grid will behave.
TopoBench makes that view look optimistic. Not impossible. Just optimistic in the way “we will fix the process later” is optimistic.
The paper’s mechanism can be summarized as a four-step failure chain:
1. A two-dimensional puzzle is serialized into a text input.
2. Tokenization breaks the board into uneven pieces that do not always align with cells.
3. The model reconstructs a board state and derives constraints from that imperfect representation.
4. Multi-step reasoning then proceeds on a state that may already be incomplete, drifted, or semantically invalid.
The uncomfortable part is step 3. The model must convert layout into algebra: cells, islands, paths, regions, visibility lines, remaining counts, legal moves, connected components. Humans do this visually and often silently. A language model has to rebuild it from tokens.
This is why the paper’s input-format intervention matters. Plain ASCII grids preserve layout for humans, but not necessarily for tokenizers. Adjacent characters can merge into tokens in ways that straddle cell boundaries. Rows with the same number of cells can become rows with different token structures. The board is visually rectangular but computationally ragged.
The authors test alternatives such as comma-separated integer encodings and JSON-like integer grids. These formats use more tokens, but they preserve cell alignment more consistently. The effect is not uniform, which is exactly the point. On several puzzle families, integer formats produce large gains. For example, Gemini 3 Flash gains roughly 37–39 percentage points on Bridges and roughly 29–32 points on Galaxies under integer formats. GPT-5 Mini gains 12–15 points on Bridges and 15–17 points on Galaxies. DeepSeek V3.2 gains 17 points on Bridges and 23–28 points on Galaxies.
Then the caveat: Undead and Pattern often get worse under the same formats. DeepSeek V3.2 drops 26–37 points on Undead under integer encodings, and GPT-5 Mini drops 13–22 points on Pattern.
That is not a contradiction. It is a clue. Representation is not a cosmetic wrapper around reasoning. It changes the actual task the model performs. A format that clarifies cell boundaries may damage other cues that a model had learned to exploit. The safe business conclusion is therefore not “always convert grids to integers.” It is “treat representation as a tested interface, not a neutral input choice.”
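One way to treat representation as a tested interface is to keep every encoding behind a small function and check that it round-trips. The sketch below is illustrative only. The exact formats used in the paper are not reproduced here, and the function names and the convention that `0` means an empty cell are assumptions.

```python
import json

def to_ascii(grid, empty="."):
    """Compact layout: one character per cell, no separators."""
    return "\n".join(
        "".join(empty if v == 0 else str(v) for v in row) for row in grid
    )

def to_csv_ints(grid):
    """Comma-separated integers: every cell is delimited explicitly."""
    return "\n".join(",".join(str(v) for v in row) for row in grid)

def to_json_ints(grid):
    """JSON-like nested integer lists."""
    return json.dumps(grid)

def from_csv_ints(text):
    """Inverse of to_csv_ints, used to test the interface."""
    return [[int(v) for v in line.split(",")] for line in text.splitlines()]

def round_trips(grid):
    """True if both integer encodings decode back to the original grid."""
    csv_ok = from_csv_ints(to_csv_ints(grid)) == grid
    json_ok = json.loads(to_json_ints(grid)) == grid
    return csv_ok and json_ok

grid = [[2, 0, 3], [0, 0, 0], [1, 0, 2]]
print(to_ascii(grid))
print(to_csv_ints(grid))
print(to_json_ints(grid))
```

Round-tripping only proves that the encoding is lossless. Whether a given model performs better on it is an empirical question, which is why each format still needs to be evaluated per task rather than assumed.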
Frequent errors are not always causal errors
The paper’s diagnostic section is better than the usual “the model made mistakes” taxonomy because it separates observed error frequency from causal damage.
The authors analyze 750 chain-of-thought traces from DeepSeek V3.2 across five puzzle families and three difficulty tiers, excluding Loopy because performance is near zero beyond easy cases. They then focus on 455 incorrect traces. A GPT-5 Mini judge labels reasoning traces with an error taxonomy, which includes repeated reasoning, state-tracking failure, constraint forgetting, premature commitment, explicit surrender, incomplete output, and representation drift, along with lower-frequency categories in the appendix.
If the analysis stopped there, the obvious headline would be explicit surrender. It appears in 76% of failed traces. Repeated reasoning, premature commitment, and representation drift also appear frequently.
But surrender is usually not the cause. It is the smoke after the kitchen has already burned down. The authors note that explicit-surrender traces are long and cluster near the token ceiling. The model has already tried extensively before giving up.
The causal tests are more revealing. The authors inject specific error types into partial gold solution prefixes and measure whether downstream accuracy falls when the model continues solving. This is the paper’s most important methodological move.
| Test | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| Observational trace labeling | Identify visible failure patterns in model reasoning. | Shows which errors appear in failed reasoning traces. | Does not establish that frequent errors caused the failure. |
| Premature commitment injection | Test whether an early wrong branch prevents recovery. | Shows wrong early trajectories are highly damaging, especially in larger search spaces. | Does not isolate all forms of search or planning failure. |
| Constraint forgetting injection | Test whether a single internally consistent rule violation poisons later reasoning. | Shows semantic constraint violations are disproportionately harmful. | Does not prove all constraint errors are equally damaging. |
| State-tracking failure injection | Test whether corrupted board state harms continuation. | Shows richer-state puzzles such as Undead are sensitive to state inconsistency. | Effects are weaker or borderline in Bridges. |
| Repeated-reasoning injection and controls | Test whether cycles or extra context length cause failure. | Suggests repeated reasoning is more symptom than cause. | Does not mean inefficient search is never costly in other settings. |
The result is wonderfully inconvenient for dashboard-driven diagnosis. Repeated reasoning is common but not clearly causal. Constraint forgetting is rare, but when injected, it causes large accuracy drops. On Bridges, premature commitment produces a 20.8 percentage-point drop, while constraint forgetting produces a 10.6-point drop. On Undead, premature commitment, constraint forgetting, and state-tracking failure each land around an 11-point drop.
The business lesson is direct: do not rank AI failure modes only by how often they appear in logs. Some failures are frequent because the system is already lost. Others are rare because they occur early and silently, but once they appear, the rest of the process builds on a corrupted state.
That second class is the expensive one.
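The injection idea described above can be sketched as a small harness: continue once from a clean gold prefix, once from a corrupted one, and report the accuracy gap in percentage points. Everything here is a toy stand-in. The function names, the puzzle layout, and the counting "solver" are assumptions made for illustration, not the paper's code, and the numbers it produces say nothing about the paper's results.

```python
def accuracy(outcomes):
    """Fraction of True values in a list of pass/fail outcomes."""
    return sum(outcomes) / len(outcomes)

def injection_drop_pp(puzzles, prefix_len, inject, continue_from, verify):
    """Accuracy drop, in percentage points, caused by corrupting the prefix."""
    clean, corrupted = [], []
    for puzzle in puzzles:
        prefix = list(puzzle["gold"][:prefix_len])
        clean.append(verify(puzzle, continue_from(puzzle, prefix)))
        bad_prefix = inject(puzzle, list(prefix))
        corrupted.append(verify(puzzle, continue_from(puzzle, bad_prefix)))
    return 100.0 * (accuracy(clean) - accuracy(corrupted))

# Toy stand-ins (hypothetical): the "solution" is a run of consecutive
# integers, and the "solver" trusts its prefix and keeps counting upward.
def toy_continue(puzzle, prefix):
    steps = list(prefix)
    while len(steps) < len(puzzle["gold"]):
        steps.append(steps[-1] + 1)
    return steps

def toy_verify(puzzle, steps):
    return steps == puzzle["gold"]

def wrong_last_step(puzzle, prefix):
    """Simulates an early wrong commitment: the last prefix step is off by one."""
    return prefix[:-1] + [prefix[-1] + 1]

def no_injection(puzzle, prefix):
    """Control condition: the prefix is left untouched."""
    return prefix

puzzles = [{"gold": [1, 2, 3, 4]}, {"gold": [5, 6, 7]}]
print(injection_drop_pp(puzzles, 2, wrong_last_step, toy_continue, toy_verify))
print(injection_drop_pp(puzzles, 2, no_injection, toy_continue, toy_verify))
```

The useful part is the shape of the comparison. The same continuation procedure runs on both prefixes, so any gap is attributable to the injected error rather than to how often that error shows up in logs.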
Constraint forgetting is dangerous because it looks internally consistent
The difference between state-tracking failure and constraint forgetting is subtle but operationally important.
A state-tracking failure creates inconsistency between representations. The model says one thing in the action log, while the board state shows another. This is bad, but at least there is a mismatch to notice. The model may recover by cross-checking one representation against the other.
Constraint forgetting is nastier. It creates a state that is internally consistent but illegal. In Bridges, the action text and board may agree that a bridge was placed. The problem is that the move violated a rule: overcounting an island, crossing a bridge, or modifying a count. The representation is coherent. It is just wrong.
That is precisely the kind of error a fluent model can carry forward with confidence. There is no obvious contradiction in the text. The only way to catch the problem is active constraint verification.
This is the point at which “reasoning” becomes the wrong abstraction. The model does not merely need to think. It needs an authoritative way to ask: is the current state legal?
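A minimal sketch of the difference, assuming a simplified Bridges state where islands are named and each move adds one bridge. The helper names are hypothetical. The first check compares the action log with the board, which is what catches a state-tracking failure. The second asks whether the state is legal, which is the only thing that catches constraint forgetting.

```python
from collections import Counter

def board_from_log(log):
    """Rebuild bridge counts from an action log of (island, island) moves."""
    board = Counter()
    for a, b in log:
        board[frozenset((a, b))] += 1
    return dict(board)

def is_consistent(log, board):
    """State-tracking check: does the board match what the log says happened?"""
    return board_from_log(log) == dict(board)

def overcounted_islands(islands, board):
    """Constraint check: islands carrying more bridges than their clue allows."""
    used = Counter()
    for pair, count in board.items():
        for island in pair:
            used[island] += count
    return {name: used[name] for name, need in islands.items() if used[name] > need}

# Island A may carry one bridge, but the model has attached two.
islands = {"A": 1, "B": 2, "C": 1}
log = [("A", "B"), ("A", "C")]
board = {frozenset(("A", "B")): 1, frozenset(("A", "C")): 1}

print(is_consistent(log, board))            # log and board agree
print(overcounted_islands(islands, board))  # yet island A is over its limit
```

The log and the board agree with each other, so a cross-check between representations finds nothing. Only the rule check reports that island A is carrying more bridges than it is allowed.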
In business systems, the equivalent appears everywhere:
| AI workflow | Hidden constraint | Failure pattern |
|---|---|---|
| Logistics planning | Vehicle capacity, route continuity, delivery windows | A route looks plausible but violates a hard operational constraint. |
| Compliance review | Mandatory clauses, jurisdiction-specific rules, filing dependencies | A document sounds complete but misses a required condition. |
| Process automation | State transitions, approval gates, exception rules | An agent advances a case from an invalid state. |
| Data transformation | Schema constraints, referential integrity, unit consistency | Output is readable but semantically invalid. |
| Scheduling and resource allocation | Availability, precedence, capacity, conflict rules | A plan is locally sensible but globally impossible. |
This is why a verifier is not an optional luxury. It is the part of the system that tells the model reality did not sign off on its paragraph.
Pictures do not fix a broken constraint compiler
One of the more useful results in the paper is negative: adding images does not reliably help.
That matters because multimodality is often treated as the obvious fix for spatial reasoning. If the model struggles with text grids, show it the grid. Problem solved. The grid is now visible. Please proceed to intelligence.
TopoBench does not support that simple view. The image-plus-ASCII condition produces mixed or negative results. GPT-5 Mini, for example, loses 18 points on Pattern and 5.3 points on Undead with image input, while gaining only small amounts on some other families. Gemini 3 Flash gains on Bridges and Pattern but loses on Loopy and Undead.
The interpretation is not that images are useless. It is that raw spatial renderings do not automatically become structured constraints. A puzzle image still has to be parsed into cells, edges, counts, regions, paths, and legal moves. If the bottleneck is constraint extraction, then another sensory representation may only add another parsing burden.
For enterprise AI, the analogy is obvious. A scanned form, workflow diagram, floor plan, network map, or spreadsheet screenshot may be visually clear to a human and still operationally poor as an agent input. If the agent needs constraints, give it constraints. Do not merely give it prettier pixels and hope the latent space has had a productive morning.
Tools help when they return state, not vibes
The strongest evidence for the mechanism comes from the tool-augmented Bridges experiment.
The setup is carefully limited. The model receives the original ASCII puzzle, but can interact with an external puzzle engine. The tools do not solve the puzzle or suggest the best move. They maintain authoritative board state and expose information in different forms.
The available tools fall into two broad types:
| Tool type | Example | What it gives the model |
|---|---|---|
| State mutation and rendering | make_move, render_board | Applies moves and returns the current board as an ASCII grid. |
| Structured constraint queries | state_summary, neighbors, components | Returns remaining counts, legal moves, connectivity information, and derived constraint data. |
On hard Bridges, the no-tool baseline is 40% accuracy with only 50% board-valid outputs. With structured-only tools, accuracy rises to 46% and board validity reaches 100%. With the full structured suite, accuracy reaches 50%. Medium Bridges also improves under the full tool suite, from 80% to 92%, while token use falls by about 11%.
The ablation is even more interesting. If the model receives only make_move and render_board, accuracy falls to 26%, below the no-tool baseline, even though board validity rises to 96%. The model is interacting with a valid board, but the repeated rendered grid does not give it the derived constraints it needs. It calls render_board frequently and still struggles to extract bridge counts, legal moves, and connectivity.
Adding state_summary creates the largest gain in the tool ablation: from 26% to 42%. Adding neighbors and components gives smaller additional improvements. Removing the spatial grid once structured tools are available sometimes helps and, at minimum, does not hurt.
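To see why a rendered board and a state summary are different kinds of answer, consider a toy engine. The tool names `make_move`, `render_board`, and `state_summary` come from the paper's setup as described above, but the signatures, return shapes, and rendering symbols below are assumptions, and crossing checks are left out to keep the sketch short.

```python
class BridgesEngine:
    """Toy authoritative state for Bridges (hypothetical API, no crossing checks)."""

    def __init__(self, islands):
        self.islands = dict(islands)  # (row, col) -> required bridge count
        self.bridges = {}             # (a, b) with a < b -> number of bridges

    def _used(self, pos):
        return sum(n for pair, n in self.bridges.items() if pos in pair)

    def make_move(self, a, b):
        """Add one bridge between islands a and b; return False if illegal."""
        known = a in self.islands and b in self.islands
        aligned = a != b and (a[0] == b[0] or a[1] == b[1])
        key = tuple(sorted((a, b)))
        if not (known and aligned) or self.bridges.get(key, 0) >= 2:
            return False
        if any(self._used(p) >= self.islands[p] for p in (a, b)):
            return False              # an endpoint has no capacity left
        self.bridges[key] = self.bridges.get(key, 0) + 1
        return True

    def render_board(self):
        """Spatial view: the model must still parse counts out of this text."""
        rows = 1 + max(r for r, _ in self.islands)
        cols = 1 + max(c for _, c in self.islands)
        grid = [["."] * cols for _ in range(rows)]
        for (a, b), n in self.bridges.items():
            if a[0] == b[0]:
                for c in range(a[1] + 1, b[1]):
                    grid[a[0]][c] = "-" if n == 1 else "="
            else:
                for r in range(a[0] + 1, b[0]):
                    grid[r][a[1]] = "|" if n == 1 else "H"
        for (r, c), need in self.islands.items():
            grid[r][c] = str(need)
        return "\n".join("".join(row) for row in grid)

    def state_summary(self):
        """Structured view: derived quantities, ready to reason over."""
        remaining = {p: need - self._used(p) for p, need in self.islands.items()}
        return {
            "remaining": remaining,
            "all_counts_met": all(v == 0 for v in remaining.values()),
        }

engine = BridgesEngine({(0, 0): 1, (0, 2): 2, (2, 2): 1})
engine.make_move((0, 0), (0, 2))
print(engine.render_board())
print(engine.state_summary())
```

The rendering shows the island clues, not what is left to place, so the model has to count bridge characters to recover remaining capacity. The summary hands over that number directly.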
The paper’s conclusion is precise: structured state information helps more than repeated spatial renderings. The model can reason over algebraic constraints better than it can reliably extract those constraints from a grid.
That is the architecture lesson.
A model should not always be asked to infer the business state from a messy representation. In many workflows, the better design is:
- Keep an authoritative state outside the model.
- Expose that state through structured queries.
- Let the model propose actions.
- Validate each action against hard constraints.
- Feed back legal state changes, not just another visual or textual dump.
This is less glamorous than “an autonomous agent that sees everything.” It is also less likely to quietly destroy your workflow while writing a confident explanation of why it was correct.
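A minimal sketch of that loop, using a vehicle-loading example rather than a puzzle. The state layout, the three rules, and the function names are all assumptions chosen for illustration. The point is only the ordering: validate first, mutate second, and return structured feedback either way.

```python
def validate(state, proposal):
    """Return a list of violated hard constraints (empty means legal)."""
    errors = []
    if proposal["weight"] <= 0:
        errors.append("weight must be positive")
    if state["load"] + proposal["weight"] > state["capacity"]:
        errors.append("exceeds vehicle capacity")
    if proposal["stop"] in state["stops"]:
        errors.append("stop already on route")
    return errors

def apply_action(state, proposal):
    """Build the next state without mutating the current one."""
    return {
        **state,
        "load": state["load"] + proposal["weight"],
        "stops": state["stops"] + [proposal["stop"]],
    }

def step(state, proposal):
    """Validate, then commit. Rejected proposals leave the state untouched."""
    errors = validate(state, proposal)
    if errors:
        return state, {"accepted": False, "errors": errors}
    new_state = apply_action(state, proposal)
    feedback = {
        "accepted": True,
        "remaining_capacity": new_state["capacity"] - new_state["load"],
    }
    return new_state, feedback

state = {"capacity": 10, "load": 0, "stops": []}
state, feedback = step(state, {"stop": "A", "weight": 6})
print(feedback)
state, feedback = step(state, {"stop": "B", "weight": 6})
print(feedback)
```

In a real deployment the proposals would come from the model and the feedback would go back into its context. The authoritative `state` never passes through the model's memory.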
Prompting is the weakest lever when the interface is wrong
The authors also test prompt-level interventions aimed at premature commitment: backtracking instructions, planning instructions, self-correction instructions, worked examples, gold-path examples, and recovery demonstrations.
None significantly improves hard Bridges performance. Many hurt. The baseline reaches 60% across medium and hard Bridges in the prompt-intervention setting, with 40% on hard puzzles. Several prompt additions reduce hard accuracy to the 24–30% range. The concise premature-commitment recovery example roughly matches baseline overall, but does not clearly improve it.
This result should be read with discipline. It does not mean prompts never matter. It means that, for this model and this setting, adding strategic advice did not reliably create better search, backtracking, or verification behavior. The model’s extended reasoning process appears to dominate over the prompt’s polite suggestions.
That is familiar in business deployments. Telling an agent “be careful,” “verify your work,” or “do not violate constraints” often improves the transcript more than the outcome. The model may mention verification without having an actual verifier. A well-worded warning label is not a control system.
Prompting is useful when the model has the information and capability but needs task framing. It is weak when the system lacks a reliable state representation or hard constraint checker.
What Cognaptus should infer for business AI systems
The paper directly shows results on puzzle families, not enterprise workflows. So the business interpretation should be an architectural inference, not a KPI claim. TopoBench does not prove that a logistics agent will fail at a specific rate, or that a compliance assistant will miss a specific percentage of clauses.
What it does show is a failure pattern highly relevant to business automation: models can lose global constraints while maintaining fluent local reasoning.
That pattern appears whenever an AI system must manipulate a structured state over time. The state may be a grid, a route, a contract, a database record, a workflow ticket, a resource schedule, a process map, or a set of compliance obligations. The common feature is not geometry. The common feature is invariant preservation.
For practical system design, the lesson is not “avoid LLMs.” That would be boring, and also wrong. The lesson is to stop pretending that the model’s context window is a database, a constraint solver, and a state machine wearing a trench coat.
A better enterprise pattern looks like this:
| Design layer | What it should do | Why TopoBench makes it important |
|---|---|---|
| Canonical representation | Convert messy inputs into stable structured objects. | Input format changes materially alter performance. |
| Authoritative state tracker | Maintain the current state outside the model. | Models drift or corrupt state over long reasoning chains. |
| Constraint-query interface | Return derived quantities such as remaining capacity, legal actions, dependencies, and connectivity. | Structured tools outperform repeated spatial renderings. |
| Action validator | Reject illegal moves before they become part of the state. | Constraint forgetting is rare but highly damaging. |
| Recovery mechanism | Support backtracking, rollback, and branch comparison. | Premature commitment causes large downstream accuracy drops. |
| Evaluation harness | Test by constraint satisfaction, not by plausible explanation. | Verifier-checked scoring reveals failures that text review may miss. |
This is also where ROI enters the picture, but not in the lazy “AI will automate everything” way. The economic value is cheaper diagnosis and safer throughput. If a workflow fails because the model cannot reliably extract constraints from a document, the fix may be a parser, schema, verifier, or tool API, not a larger model subscription. If a workflow fails because illegal states are accepted, the fix is validation before state mutation. If a workflow fails because early wrong choices poison the rest of the process, the fix is search and rollback, not a longer inspirational prompt.
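For the recovery layer, "search and rollback" can be as plain as a backtracking assignment. The sketch below is a generic depth-first search, not a method from the paper, and the task names, slot numbers, and conflict rules are hypothetical.

```python
def assign_slots(tasks, allowed, conflicts, assignment=None):
    """Depth-first search that undoes a choice when it leads to a dead end.

    tasks:     ordered list of task names
    allowed:   task -> list of slots the task may use
    conflicts: task -> tasks that must not share its slot
    """
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(tasks):
        return dict(assignment)            # every task placed legally
    task = tasks[len(assignment)]          # next unassigned task, in order
    for slot in allowed[task]:
        clash = any(
            assignment.get(other) == slot for other in conflicts.get(task, ())
        )
        if clash:
            continue                       # validator says no; try next slot
        assignment[task] = slot            # tentative commitment
        result = assign_slots(tasks, allowed, conflicts, assignment)
        if result is not None:
            return result
        del assignment[task]               # roll back the commitment
    return None                            # no legal option at this level

tasks = ["review", "sign_off"]
allowed = {"review": [1, 2], "sign_off": [1]}
conflicts = {"review": ["sign_off"], "sign_off": ["review"]}
print(assign_slots(tasks, allowed, conflicts))
```

The first choice, putting `review` in slot 1, looks fine locally and makes `sign_off` impossible. The search recovers because the commitment is explicit and reversible, which is what a single forward pass of fluent reasoning does not give you.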
A constraint-aware architecture may look less magical. Good. Magic has terrible audit logs.
Where the evidence stops
The paper is careful about its own boundaries, and the business reading should be just as careful.
First, the causal intervention analysis focuses on DeepSeek V3.2 and primarily on Bridges and Undead. That is enough to reveal mechanisms, but not enough to claim universal causal rankings across all models and all spatial tasks.
Second, tool augmentation is evaluated most deeply on Bridges. Bridges is useful because it has explicit degree, legality, and connectivity constraints, and because baseline performance leaves room for improvement. But results may differ for tasks where the key constraint is visual symmetry, free-form geometry, or domain-specific semantics.
Third, the chain-of-thought analysis depends on visible reasoning traces. Such traces are useful for diagnosis, but they are not guaranteed to faithfully expose the model’s internal computation. A trace can reveal behavior without being a perfect causal record.
Fourth, all mitigation results are inference-time and pass@1. The paper does not settle whether training, reinforcement learning, best-of sampling, voting, or specialized search would change the picture. It also does not prove that tool-augmented reasoning is always superior to training models to internalize constraint verification.
These boundaries do not weaken the central lesson. They specify where to apply it. Use TopoBench as evidence for an architecture principle: when tasks require persistent structured constraints, the interface between representation, state, and verification is part of the intelligence system.
The grid was never the simple part
TopoBench is easy to underestimate because puzzles look small. That is the trap. The grid is not difficult because it is large. It is difficult because every local decision must remain compatible with a global structure.
That is also why the paper travels beyond puzzles. Business processes are full of small grids disguised as normal work: tickets moving through approval states, contracts accumulating obligations, shipments crossing capacity limits, schedules preserving dependencies, databases maintaining referential integrity, compliance workflows requiring every required field and condition to remain valid.
A model can describe these systems fluently and still fail to maintain their invariants.
The practical conclusion is therefore not that frontier LLMs are useless at structured reasoning. TopoBench shows something more specific and more useful: the model may be able to reason once the right constraints are available, but it is unreliable at extracting and preserving those constraints from messy spatial or textual representations.
So the next time an AI agent gets lost in a grid, the question should not be only “which model did we use?”
Ask what state it was trusted to remember. Ask what constraints it had to infer. Ask what verifier was allowed to say no.
The grid did not defeat the model by being mysterious. It defeated the system by making structure implicit.
And implicit structure is where confident automation goes to develop expensive hobbies.
Cognaptus: Automate the Present, Incubate the Future.
[1] Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe, Anthony Ventresque, Noel O’Connor, and Fergal Reid, “TopoBench: Benchmarking LLMs on Hard Topological Reasoning,” arXiv:2603.12133, 2026, https://arxiv.org/abs/2603.12133.