The city does not answer literal questions
A person says, “I’m thirsty.”
A human does not usually reply, “Please specify whether you require a vending machine, café, convenience store, supermarket, juice shop, water fountain, or bubble tea store.” That would be technically attentive and socially catastrophic. A human looks around, remembers what cities usually contain, infers which places can satisfy the need, and starts walking toward a plausible target.
That small act is the real intelligence problem behind CitySeeker, a new benchmark for embodied urban navigation with implicit human needs.[1] It is not asking whether a vision-language model can follow “walk forward, turn left, stop at McDonald’s.” It is asking whether a model can hear “I’m thirsty,” convert that into a search space, interpret street-level visual cues, choose directions over multiple steps, avoid loops, and stop at a place that actually solves the need.
The answer, for now, is: not very well.
CitySeeker contains 6,440 trajectories across eight cities and evaluates 27 vision-language models. The top reported model, Qwen2.5-VL-32B, reaches only 21.1% task completion under the paper’s 50-meter proximity metric (reported as TCP in the rest of this article). Exact endpoint success is much lower, at 2.6%. Even the human baseline is not heroic, reaching 30.1% proximity completion and 5.7% exact completion, which tells us something important: this task is hard by design. But the model failures are not merely “humans are better.” They reveal where current multimodal agents still lack the kind of mundane urban intelligence that people use without naming it.
The tempting misconception is that a strong VLM plus a map should basically solve navigation. CitySeeker is useful because it quietly ruins that assumption.
The paper’s best lesson is not “models score low on a new benchmark.” We have plenty of those. The more useful lesson is the failure chain: implicit need parsing, affordance reasoning, visual grounding, waypoint decisions, memory, correction, and map-view alignment. Break any one of these and the agent becomes the urban equivalent of a very expensive tourist staring at a blue line.
The task begins before navigation starts
Most navigation benchmarks give the agent a destination or a step-by-step route. CitySeeker removes that comfort. The user’s request may directly name a place, but it may also name a need, an attribute, or a social preference.
The benchmark organizes tasks into seven categories:
| Task category | What the model must do | Example type |
|---|---|---|
| Basic POI | Find a directly named point of interest | nearest restaurant |
| Brand-specific | Recognize a named brand or chain | Starbucks |
| Transit hub | Locate mobility infrastructure | subway station |
| Latent POI | Infer a target that may be inside or attached to another place | restroom inside a mall or fast-food store |
| Abstract demand | Convert a human need into possible POIs | “I’m thirsty” |
| Inclusive infrastructure | Find a place with an accessibility attribute | accessible entrance |
| Semantic preference | Interpret subjective criteria | upscale or family-friendly restaurant |
This design matters because implicit-need navigation is not a single problem. It is a stack of conversions.
“I want to work with Wi-Fi” is not a destination. It is a constraint bundle. The model must infer that cafés, coffee shops, libraries, bookstores, and some fast-food restaurants may be valid. It must then identify which candidates exist nearby, which are visible from street view, and whether the current route is moving toward one. A map label alone may not be enough. A storefront sign alone may not be enough. A common-sense category alone may not be enough.
This is why the paper’s benchmark construction is more than dataset bookkeeping. CitySeeker associates instructions with POI categories, builds route graphs from street-view panoramas, links visible POIs to nearby graph nodes, and manually validates trajectories. For abstract demand and latent POI tasks, the authors also run a cross-cultural consensus survey. Their pre-defined need-to-POI mappings receive an 83.39% global average consensus, while unrelated POI categories receive only 1.90%.
That survey is not the main result, but it protects the benchmark from an easy objection: “Maybe the authors invented strange mappings.” For many common needs, people broadly agree on the candidate places. If the model cannot infer them, the issue is not just cultural taste. It is a missing layer of urban affordance reasoning.
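The conversion stack described in this section, from a stated need to candidate places to what is actually nearby and visible, can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the lookup table, the `Poi` type, and the `candidate_pois` function are hypothetical names, and the category lists simply echo the examples quoted in this article.

```python
from dataclasses import dataclass

# Hypothetical need-to-category table. The category lists reuse the
# article's own examples; a real system would need a far richer mapping.
NEED_TO_CATEGORIES = {
    "work with wifi": ["cafe", "coffee shop", "library", "bookstore", "fast food"],
    "thirsty": ["vending machine", "cafe", "convenience store", "supermarket"],
}


@dataclass(frozen=True)
class Poi:
    name: str
    category: str
    visible_from_street: bool


def candidate_pois(need: str, nearby: list[Poi]) -> list[Poi]:
    """Return nearby POIs whose category may satisfy the need.

    POIs visible from the street come first, because they can be
    confirmed visually; unknown needs yield an empty list.
    """
    categories = NEED_TO_CATEGORIES.get(need, [])
    matches = [p for p in nearby if p.category in categories]
    # sorted() is stable and False sorts before True, so negate visibility.
    return sorted(matches, key=lambda p: not p.visible_from_street)


NEARBY = [
    Poi("City Library", "library", visible_from_street=False),
    Poi("Corner Cafe", "cafe", visible_from_street=True),
    Poi("Shoe Shop", "shoe store", visible_from_street=True),
]
wifi_candidates = candidate_pois("work with wifi", NEARBY)
```

In this toy neighborhood the shoe shop is dropped, and the visible café is ranked ahead of the library that the map knows about but the camera cannot confirm. A static table like this is exactly the part that breaks on unfamiliar needs, which is where the benchmark's abstract-demand tasks apply pressure.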
Seeing a sign is not the same as understanding what it affords
CitySeeker separates direct recognition from deeper inference, and the results behave accordingly.
Models do relatively better when the task gives them a strong lexical or visual anchor. Brand-specific navigation is easier because “Starbucks” is both a word and a visual target. The best Qwen2.5-VL-32B result in that category is 30.4% TCP, higher than its overall 21.1%.
Latent POI tasks are much harder. A restroom may not be visible as a sign on the street. It may be inside a McDonald’s, a KFC, a shopping mall, a subway station, or a public facility. That requires the model to reason through secondary functions. The paper’s error analysis gives the useful phrase here: underthinking and overthinking. A model may recognize Starbucks but fail to treat it as a place with Wi-Fi. Or it may see CVS Pharmacy and overread it as a convenience store.
This is a familiar pattern in business AI deployments. Systems often perform impressively when the target is named in the input and appears cleanly in the data. They struggle when the user names the problem rather than the solution.
That distinction matters for city agents, customer service agents, procurement assistants, internal enterprise copilots, and almost every “AI assistant” pitch deck currently wandering around the internet in a blazer. Users do not always ask for entities. They ask for outcomes. The agent must translate outcomes into candidate actions without collapsing into either literalism or fantasy.
CitySeeker makes that translation measurable in a physical environment.
The main benchmark result is a bottleneck map, not a leaderboard
The leaderboard is easy to quote but easy to misread.
Qwen2.5-VL-32B is the top overall model in the main table, with 21.1% TCP. GPT-4o reaches 18.3%; o4-mini reaches 17.9%; Gemini-2.5-Pro reaches 17.3%; InternVL3-38B reaches 19.3%. Some models underperform random choice on certain metrics or categories. Humans reach 30.1% TCP.
At first glance, this looks like another “models are bad at benchmark X” story. That is the least interesting version.
The better interpretation is that CitySeeker exposes several different failure modes that happen to appear in one task.
| Evidence in the paper | Likely purpose | What it supports | What it does not prove |
|---|---|---|---|
| 27-model evaluation on the full test set | Main evidence | Current VLMs struggle with implicit-need urban navigation | That every future VLM will fail similarly |
| Subcategory performance | Diagnostic evidence | Direct recognition is easier than latent or affordance-heavy reasoning | That category labels alone explain all failures |
| City-level variation | Robustness and bias probe | Geography and visual environment matter | That language localization is the main cause |
| Cross-lingual Beijing/New York test | Sensitivity test | Prompt language does not consistently explain city gaps | That cultural and visual bias are fully solved |
| Map-augmented ablation | Ablation | More map information can improve path alignment while hurting task completion | That maps are useless in deployed systems |
| BCR strategies on 650 samples | Exploratory extension | Memory, backtracking, and topology-aware context can improve performance | That these methods are production-ready or fully optimized |
| Error analysis on 300 samples | Mechanism diagnosis | Cognitive, visual, waypoint, and parsing errors differ by model | That percentages generalize exactly to all settings |
This table is important because not all results carry the same weight. The full-model evaluation establishes the baseline difficulty. The appendix tests explain why simple explanations fail. The BCR studies suggest engineering directions, but they are exploratory and conducted on a smaller subset. A serious reader should not flatten all of this into one big “CitySeeker proves X” blob. We are not making soup.
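The TCP figures quoted in this section rest on a proximity test: an episode counts as complete when the agent stops within 50 meters of a valid target. A rough sketch of such a metric follows. Only the 50-meter threshold comes from the source; the use of great-circle (haversine) distance, the function names, and the episode tuple layout are assumptions made for illustration, and the paper may measure distance differently.

```python
import math

EARTH_RADIUS_M = 6_371_000.0
PROXIMITY_THRESHOLD_M = 50.0  # the 50-meter threshold cited in the article


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def proximity_completion_rate(episodes):
    """Fraction of episodes whose final position is within the threshold.

    episodes: list of (final_lat, final_lon, target_lat, target_lon).
    """
    if not episodes:
        return 0.0
    hits = sum(haversine_m(*e) <= PROXIMITY_THRESHOLD_M for e in episodes)
    return hits / len(episodes)


# Two toy episodes: one stops about 33 m from the target, one about 111 m.
toy_rate = proximity_completion_rate([
    (0.0, 0.0, 0.0003, 0.0),
    (0.0, 0.0, 0.001, 0.0),
])
```

A threshold metric like this is forgiving about the last few meters, which is why exact endpoint success sits so far below the proximity figure.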
Longer routes turn small errors into urban drift
In the paper’s navigation setup, the agent receives panoramic street views split into perspective views, reasons in a ReAct-style loop, chooses an action, and continues until it stops or reaches the 35-step limit. The main evaluation deliberately keeps each step independent: the model does not carry persistent memory or previous internal state across decisions.
This design choice may sound artificial, but it isolates the model’s intrinsic spatial reasoning. It also produces an important failure mechanism: error accumulation.
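A minimal sketch of that stateless setup, assuming a toy three-node graph in place of street-view panoramas. The names `run_episode`, `first_option_policy`, and `TOY_GRAPH` are illustrative and not from the paper; only the 35-step limit comes from the source. The key property is that the policy receives the current node and its options and nothing about where it has already been.

```python
MAX_STEPS = 35  # step limit cited in the article
STOP = "stop"   # hypothetical sentinel for the stop action


def run_episode(start, neighbors, choose_action, max_steps=MAX_STEPS):
    """Walk a graph with a memoryless policy and return the visited path.

    choose_action(node, options) sees only the current node and its
    outgoing options, mirroring the independent-step evaluation.
    """
    node = start
    path = [node]
    for _ in range(max_steps):
        action = choose_action(node, neighbors[node])
        if action == STOP:
            break
        node = action
        path.append(node)
    return path


# Toy street: A - B - C, with the target at C.
TOY_GRAPH = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}


def first_option_policy(node, options):
    """A deliberately naive policy: always take the first listed option."""
    return options[0]


looping_path = run_episode("A", TOY_GRAPH, first_option_policy)
```

With this policy the agent bounces between A and B until the step budget runs out and never reaches C. Nothing in the loop can notice the repetition, because nothing in the loop remembers it.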
The paper reports that performance degrades as route length increases. On routes under 20 steps, trajectories tend to stay manageable. Around 35 steps, path-alignment scores become highly scattered. The model is not merely making one bad choice. It is losing the thread.
That loss appears in several forms:
- Trajectory deviation: one wrong turn compounds into a path that no longer samples useful visual evidence.
- Oscillatory detours: the agent loops or backtracks without a stable plan.
- Premature stopping or overshooting: the agent stops before the target or walks past it.
- Observation–reasoning mismatch: the model observes a useful cue but chooses an action inconsistent with that observation.
- Malformed action output: the model fails to produce executable action fields.
This is where embodied navigation differs from answering a static image question. A single visual mistake is bad. A single wrong action in a sequential environment changes the future information available to the agent. The model does not just answer incorrectly; it walks itself into a worse epistemic position. Elegant, in a tragic little way.
For business use, this is the difference between a chatbot making one weak recommendation and an autonomous system taking a series of actions that progressively reduce recoverability. Delivery robots, mobility assistants, AR city guides, tourism agents, and field-service copilots need active correction, not just stronger first-pass perception.
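The compounding described in this section can be made concrete with a deliberately crude model. Assume, purely for illustration, that every step is an independent decision with the same chance of being right and that the agent never recovers from a wrong turn. Neither assumption comes from the paper, and the 95% per-step accuracy below is a made-up number.

```python
def prob_no_wrong_turn(per_step_accuracy: float, steps: int) -> float:
    """Chance of zero wrong turns if each step is an independent trial.

    Toy model only: real agents can recover, and real errors correlate.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


for steps in (10, 20, 35):
    print(steps, round(prob_no_wrong_turn(0.95, steps), 3))
```

Under these assumptions, an agent that is right 95% of the time per step gets through 20 steps cleanly only about 36% of the time, and through 35 steps about 17% of the time. The exact numbers mean little; the shape of the curve is the point, and it is why correction matters more as routes get longer.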
The map ablation is the paper’s most useful surprise
The most business-relevant result may be the map ablation, because it attacks the obvious product-manager instinct: add a map.
In Appendix D, the authors test a map-augmented setting for GPT-4o and Qwen2.5-VL-32B. The model receives both first-person street view and a dynamically updated 2D map showing the planned route and current heading. It is instructed to analyze the map first, determine the geometrically optimal next step, and then choose the matching street-view perspective.
The result is counterintuitive. Path alignment improves, but task completion collapses.
| Model | Map-free TCP | Map-augmented TCP | Map-free nDTW | Map-augmented nDTW |
|---|---|---|---|---|
| GPT-4o | 18.3% | 11.7% | 136.9 | 75.7 |
| Qwen2.5-VL-32B | 21.1% | 7.6% | 147.0 | 54.4 |
The map helps the model follow a geometric route: nDTW, the path-alignment score in the table, drops sharply, and lower means closer alignment here. It hurts the model’s ability to complete the semantic discovery task.
That distinction is the paper’s sharpest correction to naive deployment thinking. Navigation is not only geometry. If the user says “I’m thirsty,” the agent must keep searching for a satisfying place. A blue line can distract the model from visual exploration. The paper identifies three failure modes: weak 2D map cognition, poor alignment between map directions and first-person perspectives, and trivialization of the task into path-following.
The business implication is not “do not use maps.” That would be a silly conclusion, and silliness has enough market share already. The implication is that map integration must be treated as a representation-alignment problem. The agent must know when the map is helping route geometry and when street-level perception must override or refine it. A deployed system should not simply stuff a map screenshot into a VLM prompt and call the product “spatially aware.”
A map can improve movement while degrading purpose. That is the kind of failure customers notice immediately.
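The size of that trade-off is easier to see as relative change. The sketch below only does arithmetic on the four table rows quoted in this section; the dictionary layout and function names are illustrative.

```python
# (map-free, map-augmented) values copied from the ablation table above.
ABLATION = {
    "GPT-4o": {"tcp": (18.3, 11.7), "ndtw": (136.9, 75.7)},
    "Qwen2.5-VL-32B": {"tcp": (21.1, 7.6), "ndtw": (147.0, 54.4)},
}


def relative_change(before: float, after: float) -> float:
    """Percent change from before to after."""
    return (after - before) / before * 100.0


def summarize(ablation):
    """Per-model TCP point change and relative changes in TCP and nDTW."""
    summary = {}
    for model, metrics in ablation.items():
        tcp_before, tcp_after = metrics["tcp"]
        ndtw_before, ndtw_after = metrics["ndtw"]
        summary[model] = {
            "tcp_point_change": round(tcp_after - tcp_before, 1),
            "tcp_relative_pct": round(relative_change(tcp_before, tcp_after), 1),
            "ndtw_relative_pct": round(relative_change(ndtw_before, ndtw_after), 1),
        }
    return summary


ablation_summary = summarize(ABLATION)
```

GPT-4o gives up 6.6 TCP points, roughly 36% of its map-free completion, while its nDTW falls by about 45%. Qwen2.5-VL-32B gives up 13.5 points, roughly 64% of its completion, for an nDTW drop of about 63%. The model that follows the line most faithfully is also the one that forgets the errand most thoroughly.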
Memory is not decoration; it is part of the navigation system
The authors test three families of exploratory strategies under the acronym BCR: Backtracking, Cognitive-map enrichment, and Retrieval-augmented memory. These experiments run on a 650-sample subset, so they should be read as exploratory engineering evidence rather than full benchmark proof.
Still, the pattern is instructive.
Backtracking mechanisms attempt to correct drift. Basic backtracking relies on low internal confidence. Step-reward backtracking uses objective topological distance. Human-guided backtracking adds a corrective hint after reverting. These strategies generally improve task completion, but simple confidence-based backtracking can hurt weaker models, suggesting that self-assessment is itself a capability. A model that is confused may also be confused about whether it is confused. Delightful.
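The difference between the first two triggers can be sketched as follows. This is an assumption-laden illustration rather than the paper's implementation: the 0.4 confidence threshold, the function names, and the use of breadth-first hop count as "topological distance" are all choices made here for the example.

```python
from collections import deque


def hops(graph, start, goal):
    """Breadth-first hop count from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def should_backtrack_confidence(confidence, threshold=0.4):
    """Basic trigger: revert when the model's own confidence is low."""
    return confidence < threshold


def should_backtrack_step_reward(graph, prev_node, new_node, goal):
    """Step-reward trigger: revert when the move increased graph distance."""
    before = hops(graph, prev_node, goal)
    after = hops(graph, new_node, goal)
    if before is None or after is None:
        # Backtrack only if the new node cannot reach the goal at all.
        return after is None
    return after > before


# Toy street graph: A - B - C - D, with a dead-end branch B - E.
STREETS = {
    "A": ["B"], "B": ["A", "C", "E"], "C": ["B", "D"], "D": ["C"], "E": ["B"],
}
```

With the goal at D, stepping from B into the dead end E raises the hop count from 2 to 3 and triggers a revert, while stepping from B to C does not. The confidence trigger needs no knowledge of the goal, which makes it cheaper to deploy and, as the results suggest, only as good as the model's self-assessment.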
Cognitive-map enrichment, the spatial cognition family, provides external spatial cues. The topology cognitive graph gives explicit connectivity between nodes and actions. The relative position map describes approximate directions and distances. The topology graph is more reliable for task success; relative-position cues can improve path efficiency but sometimes hurt completion.
Memory-based retrieval is the strongest family overall. The paper tests topology-based retrieval, spatial-based retrieval, and historical trajectory lookup. The top reported result is R1 pushing Qwen2.5-VL-32B to 26.9% TCP. GPT-4o-mini also benefits: R3 reaches 19.4% TCP, while R1 sharply improves nDTW from 337.1 to 136.6.
What does this mean mechanically?
Memory changes the task from isolated local guessing into situated exploration. It gives the agent a record of visited nodes, prior decisions, confidence scores, transition histories, and recent rationales. That helps with two problems at once: avoiding repeated mistakes and reusing successful partial paths.
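A record like that does not need to be exotic. The sketch below is one hypothetical shape for it, with field names taken from the list above (visited nodes, decisions, confidence, rationales); the class names and methods are invented for illustration and are not the paper's retrieval strategies.

```python
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    node: str
    action: str
    confidence: float
    rationale: str


@dataclass
class TrajectoryMemory:
    records: list[StepRecord] = field(default_factory=list)

    def remember(self, node, action, confidence, rationale):
        """Append one decision to the trajectory history."""
        self.records.append(StepRecord(node, action, confidence, rationale))

    def visited(self) -> set[str]:
        """Nodes where a decision has already been made."""
        return {r.node for r in self.records}

    def untried(self, node, options) -> list[str]:
        """Options at this node that have not been taken before."""
        tried = {r.action for r in self.records if r.node == node}
        return [o for o in options if o not in tried]

    def recent_rationales(self, k=3) -> list[str]:
        """The last k rationales, oldest first."""
        return [r.rationale for r in self.records[-k:]] if k > 0 else []


memory = TrajectoryMemory()
memory.remember("A", "north", 0.8, "storefront sign ahead")
memory.remember("B", "east", 0.3, "no cues, guessing")
memory.remember("A", "north", 0.6, "returned after dead end")
```

Asked what is left to try at node A, this memory returns only the options other than north, which is the difference between exploring and pacing. Feeding `untried` and `recent_rationales` back into the prompt is one plausible way to use it, though how much context helps is an empirical question the BCR experiments only begin to answer.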
For businesses designing embodied or location-aware AI systems, the lesson is straightforward: do not treat memory as a personalization feature added after launch. In sequential physical tasks, memory is infrastructure. Without it, the agent keeps rediscovering the same sidewalk.
Human failure and model failure are not the same failure
The human baseline is only 30.1% TCP, so the paper avoids a simplistic “humans good, models bad” framing. Humans also struggle with unfamiliar streets, poor signage, time pressure, and the 35-step budget.
But the error analysis shows that humans and VLMs fail differently.
For humans, the dominant failure category is strategic and navigational: 60.7%. Participants often understand the request but explore inefficiently, forget paths, loop, or overshoot valid targets. For Qwen2.5-VL-32B, strategic and navigational failures are still the largest share at 40.5%, but cognitive failures weigh far more heavily than they do for humans: 32.9% for the model versus 19.1% for humans. The model struggles with non-obvious POI functions and flexible commonsense leaps. Visual and execution errors also remain material: 26.6% for the model versus 20.2% for humans.
This comparison gives a better target for product design than a single success rate.
Humans need planning aids. Models need commonsense, grounding, action consistency, and memory. A useful city assistant may eventually combine both: machine memory and routing discipline, plus human-like affordance reasoning. Current VLMs have pieces of that package, not the package.
What CitySeeker directly shows
CitySeeker directly supports four claims.
First, implicit-need urban navigation is meaningfully harder than explicit route following. It requires semantic inference, visual grounding, and long-horizon action selection.
Second, current VLMs remain weak on this task. The best overall TCP in the main benchmark is 21.1%, and exact completion is extremely low across models.
Third, adding map information naively does not solve the problem. In the tested map-augmented condition, path alignment improves while completion falls, especially for Qwen2.5-VL-32B.
Fourth, memory and correction mechanisms help. Backtracking, topology-aware context, and retrieval-based memory improve performance in exploratory subset studies, with memory-based methods showing particularly strong gains.
These are the paper’s direct contributions. They are enough.
What Cognaptus infers for business use
The business interpretation begins with a shift in product framing.
The commercially valuable task is not “navigate to POI X.” Existing maps already do a decent job there. The more interesting task is “convert vague human intent into grounded urban action.” That is relevant to at least five product categories:
| Product area | CitySeeker-relevant capability | Practical design implication |
|---|---|---|
| Delivery and service robots | Navigate from task intent to useful physical target | Add correction loops and short-term route memory |
| AR city assistants | Interpret vague user needs in street context | Combine visual grounding with affordance databases |
| Tourism and mobility apps | Suggest nearby places from human preferences | Separate candidate generation from route execution |
| Accessibility tools | Find infrastructure attributes, not just place names | Treat attributes as visual and semantic evidence |
| Smart-city interfaces | Support dynamic, local discovery | Use topology and recent observations, not static maps alone |
The inference is not that CitySeeker is a production benchmark for every deployment. It is a diagnostic prototype for a missing capability layer. In business systems, that layer would likely include:
- Need-to-candidate translation: “I need Wi-Fi” becomes libraries, cafés, bookstores, fast-food restaurants, co-working spaces, and possibly malls.
- Affordance scoring: each candidate gets scored by how plausibly it satisfies the user’s actual need.
- Street-level evidence checking: visual cues confirm or reject candidates.
- Topology-aware route memory: the agent records where it has been and which branches failed.
- Correction policy: the system decides when to backtrack, ask the user, consult a map, or switch candidate categories.
- Personalization layer: once general commonsense works, user preferences can rank valid options.
Notice the order. Personalization comes after general spatial commonsense. If the model cannot infer that a café may provide Wi-Fi, learning that Oliver prefers quiet cafés is not enough. A preference layer on top of weak affordance reasoning is just a more tailored mistake.
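That ordering can be written down directly: filter on affordance first, and let preference rank only what survives. The sketch below is a hypothetical illustration; the scores, the 0.5 cutoff, and the function name are invented, and nothing here reflects a scoring method from the paper.

```python
def rank_candidates(candidates, affordance, preference, min_affordance=0.5):
    """Keep candidates that plausibly satisfy the need, then rank by taste.

    affordance: name -> how plausibly the place satisfies the need (0 to 1).
    preference: name -> how much this user likes the place (0 to 1).
    Preference never rescues a candidate that fails the affordance cutoff.
    """
    valid = [c for c in candidates if affordance.get(c, 0.0) >= min_affordance]
    return sorted(
        valid,
        key=lambda c: (preference.get(c, 0.0), affordance.get(c, 0.0)),
        reverse=True,
    )


# Need: somewhere to work with Wi-Fi. All scores are made up.
CANDIDATES = ["quiet cafe", "library", "shoe store"]
AFFORDANCE = {"quiet cafe": 0.9, "library": 0.8, "shoe store": 0.1}
PREFERENCE = {"shoe store": 1.0, "quiet cafe": 0.7, "library": 0.4}

ranked = rank_candidates(CANDIDATES, AFFORDANCE, PREFERENCE)
```

The user's favorite place in this example is the shoe store, and it is correctly nowhere in the result. Reverse the two stages and the assistant walks a person who needs Wi-Fi to a very well-liked rack of sneakers.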
The deployment boundary is still wide
CitySeeker also defines several practical boundaries.
The benchmark does not distribute raw street-view images; it releases metadata, graph data, panorama IDs, and scripts to re-fetch imagery through official APIs. That is a sensible licensing approach, but it means reproduction depends on API access and external image availability.
The BCR strategies are promising but tested on a 650-sample subset. Combined BCR strategies improve Qwen2.5-VL-32B from 19.9% to 27.38% TCP in a preliminary experiment, but the effect is not strictly additive. Strategy interaction is still an open design problem, not a solved recipe.
Latency remains a serious issue. Real-time navigation requires repeated visual processing and sequential decisions. A city assistant that thinks beautifully but pauses like a committee at every corner will not feel intelligent to users. It will feel like a device considering retirement.
Personalization is also outside the main benchmark. The authors focus on universal common-sense mappings, not individual user history. That is appropriate for a benchmark, but deployed assistants will need behavioral priors, local culture, accessibility needs, budget, safety preferences, and time constraints.
Finally, map fusion is unresolved. The ablation does not mean maps are bad. It means current VLMs can fail to align 2D geometric guidance with first-person visual action. That is a representation problem, and probably a major one.
The useful lesson is the pipeline of failure
CitySeeker is valuable because it turns a fuzzy product aspiration—“AI agents that understand what people need in the city”—into a sequence of measurable failure points.
The model must parse the need. It must infer possible places. It must recognize visual evidence. It must choose a direction. It must remember previous choices. It must correct drift. It must know when to stop. It must use a map without becoming hypnotized by it.
That mechanism-first reading is more useful than a leaderboard summary. The benchmark does not merely say that models are weak. It shows why “more VLM” and “add a map” are incomplete answers.
For Cognaptus readers, the broader pattern should feel familiar. In many AI products, the hard part is not generating fluent text or recognizing obvious objects. The hard part is converting messy human intent into operational decisions under uncertainty. CitySeeker happens to stage that problem on sidewalks, intersections, storefronts, and malls. The same pattern appears in enterprise workflow automation, field operations, customer support, procurement, and financial research.
The city is just a brutally honest interface.
A model that can answer “Where is Starbucks?” is useful. A model that can solve “I need somewhere quiet to work, near me, with Wi-Fi, without walking in circles” is closer to an agent. CitySeeker shows that current VLMs are not there yet. But it also shows where to build: memory, affordance reasoning, topology, correction, and better map-view grounding.
That is less glamorous than saying “spatial intelligence has arrived.”
It is also much more useful.
Cognaptus: Automate the Present, Incubate the Future.
[1] Siqi Wang et al., “CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?”, arXiv:2512.16755, https://arxiv.org/html/2512.16755.