CitySeeker: Lost in Translation, Found in the City

The city does not answer literal questions

A person says, “I’m thirsty.”

A human does not usually reply, “Please specify whether you require a vending machine, café, convenience store, supermarket, juice shop, water fountain, or bubble tea store.” That would be technically attentive and socially catastrophic. A human looks around, remembers what cities usually contain, infers which places can satisfy the need, and starts walking toward a plausible target.

That small act is the real intelligence problem behind CitySeeker, a new benchmark for embodied urban navigation with implicit human needs.¹ It is not asking whether a vision-language model can follow “walk forward, turn left, stop at McDonald’s.” It is asking whether a model can hear “I’m thirsty,” convert that into a search space, interpret street-level visual cues, choose directions over multiple steps, avoid loops, and stop at a place that actually solves the need.

The answer, for now, is: not very well.

CitySeeker contains 6,440 trajectories across eight cities and evaluates 27 vision-language models. The top reported model, Qwen2.5-VL-32B, reaches only 21.1% task completion under the paper’s 50-meter proximity metric (the figure quoted as TCP in the rest of this article). Exact endpoint success is much lower, at 2.6%. Even the human baseline is not heroic, reaching 30.1% proximity completion and 5.7% exact completion, which tells us something important: this task is hard by design. But the model failures are not merely “humans are better.” They reveal where current multimodal agents still lack the kind of mundane urban intelligence that people use without naming it.
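To make the 50-meter proximity idea concrete, here is a minimal sketch of how such a check could be computed. It assumes straight-line (haversine) distance between the agent's final position and the target; the paper may measure distance differently, and the names `haversine_m`, `proximity_success`, and `completion_rate` are illustrative, not taken from the benchmark code.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def proximity_success(endpoint, target, radius_m=50.0):
    """True if the endpoint lies within radius_m of the target (assumed metric)."""
    return haversine_m(*endpoint, *target) <= radius_m


def completion_rate(pairs, radius_m=50.0):
    """Share of (endpoint, target) pairs that count as proximity successes."""
    if not pairs:
        return 0.0
    hits = sum(proximity_success(e, t, radius_m) for e, t in pairs)
    return hits / len(pairs)
```

The gap between proximity and exact completion in the reported numbers is a reminder that the choice of radius changes the story considerably.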

The tempting misconception is that a strong VLM plus a map should basically solve navigation. CitySeeker is useful because it quietly ruins that assumption.

The paper’s best lesson is not “models score low on a new benchmark.” We have plenty of those. The more useful lesson is the failure chain: implicit need parsing, affordance reasoning, visual grounding, waypoint decisions, memory, correction, and map-view alignment. Break any one of these and the agent becomes the urban equivalent of a very expensive tourist staring at a blue line.

The task begins before navigation starts

Most navigation benchmarks give the agent a destination or a step-by-step route. CitySeeker removes that comfort. The user’s request may directly name a place, but it may also name a need, an attribute, or a social preference.

The benchmark organizes tasks into seven categories:

Task category	What the model must do	Example type
Basic POI	Find a directly named point of interest	nearest restaurant
Brand-specific	Recognize a named brand or chain	Starbucks
Transit hub	Locate mobility infrastructure	subway station
Latent POI	Infer a target that may be inside or attached to another place	restroom inside a mall or fast-food store
Abstract demand	Convert a human need into possible POIs	“I’m thirsty”
Inclusive infrastructure	Find a place with an accessibility attribute	accessible entrance
Semantic preference	Interpret subjective criteria	upscale or family-friendly restaurant

This design matters because implicit-need navigation is not a single problem. It is a stack of conversions.

“I want to work with Wi-Fi” is not a destination. It is a constraint bundle. The model must infer that cafés, coffee shops, libraries, bookstores, and some fast-food restaurants may be valid. It must then identify which candidates exist nearby, which are visible from street view, and whether the current route is moving toward one. A map label alone may not be enough. A storefront sign alone may not be enough. A common-sense category alone may not be enough.
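The first conversion in that stack, from a stated need to candidate place types, can be sketched as a simple lookup plus filter. The category lists below are the ones named in this article; the dictionary keys, the `NEED_TO_POI` table, and the `candidate_pois` helper are hypothetical and are not the benchmark's actual mapping.

```python
# Hypothetical need-to-POI table; categories are the examples named in the article.
NEED_TO_POI = {
    "wifi_workspace": {"cafe", "coffee_shop", "library", "bookstore", "fast_food"},
    "thirsty": {
        "vending_machine", "cafe", "convenience_store", "supermarket",
        "juice_shop", "water_fountain", "bubble_tea",
    },
}


def candidate_pois(need, nearby):
    """Keep the nearby (name, category) pairs whose category can satisfy the need."""
    valid = NEED_TO_POI.get(need, set())
    return [(name, category) for name, category in nearby if category in valid]


nearby = [
    ("Starbucks", "coffee_shop"),
    ("City Bank", "bank"),
    ("Public Library", "library"),
]
print(candidate_pois("wifi_workspace", nearby))
```

A static table like this is the easy part. The benchmark's difficulty comes from doing the same inference from street-level imagery, without a clean list of nearby categories handed over in advance.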

This is why the paper’s benchmark construction is more than dataset bookkeeping. CitySeeker associates instructions with POI categories, builds route graphs from street-view panoramas, links visible POIs to nearby graph nodes, and manually validates trajectories. For abstract demand and latent POI tasks, the authors also run a cross-cultural consensus survey. Their pre-defined need-to-POI mappings receive an 83.39% global average consensus, while unrelated POI categories receive only 1.90%.

That survey is not the main result, but it protects the benchmark from an easy objection: “Maybe the authors invented strange mappings.” For many common needs, people broadly agree on the candidate places. If the model cannot infer them, the issue is not just cultural taste. It is a missing layer of urban affordance reasoning.

Seeing a sign is not the same as understanding what it affords

CitySeeker separates direct recognition from deeper inference, and the results behave accordingly.

Models do relatively better when the task gives them a strong lexical or visual anchor. Brand-specific navigation is easier because “Starbucks” is both a word and a visual target. Qwen2.5-VL-32B reaches 30.4% TCP in that category, well above its overall 21.1%.

Latent POI tasks are much harder. A restroom may not be visible as a sign on the street. It may be inside a McDonald’s, a KFC, a shopping mall, a subway station, or a public facility. That requires the model to reason through secondary functions. The paper’s error analysis gives the useful phrase here: underthinking and overthinking. A model may recognize Starbucks but fail to treat it as a place with Wi-Fi. Or it may see CVS Pharmacy and overread it as a convenience store.

This is a familiar pattern in business AI deployments. Systems often perform impressively when the target is named in the input and appears cleanly in the data. They struggle when the user names the problem rather than the solution.

That distinction matters for city agents, customer service agents, procurement assistants, internal enterprise copilots, and almost every “AI assistant” pitch deck currently wandering around the internet in a blazer. Users do not always ask for entities. They ask for outcomes. The agent must translate outcomes into candidate actions without collapsing into either literalism or fantasy.

CitySeeker makes that translation measurable in a physical environment.

The main benchmark result is a bottleneck map, not a leaderboard

The leaderboard is easy to quote but easy to misread.

Qwen2.5-VL-32B is the top overall model in the main table, with 21.1% TCP. GPT-4o reaches 18.3%; o4-mini reaches 17.9%; Gemini-2.5-Pro reaches 17.3%; InternVL3-38B reaches 19.3%. Some models underperform random choice on certain metrics or categories. Humans reach 30.1% TCP.

At first glance, this looks like another “models are bad at benchmark X” story. That is the least interesting version.

The better interpretation is that CitySeeker exposes several different failure modes that happen to appear in one task.

Evidence in the paper	Likely purpose	What it supports	What it does not prove
27-model evaluation on the full test set	Main evidence	Current VLMs struggle with implicit-need urban navigation	That every future VLM will fail similarly
Subcategory performance	Diagnostic evidence	Direct recognition is easier than latent or affordance-heavy reasoning	That category labels alone explain all failures
City-level variation	Robustness and bias probe	Geography and visual environment matter	That language localization is the main cause
Cross-lingual Beijing/New York test	Sensitivity test	Prompt language does not consistently explain city gaps	That cultural and visual bias are fully solved
Map-augmented ablation	Ablation	More map information can improve path alignment while hurting task completion	That maps are useless in deployed systems
BCR strategies on 650 samples	Exploratory extension	Memory, backtracking, and topology-aware context can improve performance	That these methods are production-ready or fully optimized
Error analysis on 300 samples	Mechanism diagnosis	Cognitive, visual, waypoint, and parsing errors differ by model	That percentages generalize exactly to all settings

This table is important because not all results carry the same weight. The full-model evaluation establishes the baseline difficulty. The appendix tests explain why simple explanations fail. The BCR studies suggest engineering directions, but they are exploratory and conducted on a smaller subset. A serious reader should not flatten all of this into one big “CitySeeker proves X” blob. We are not making soup.

Longer routes turn small errors into urban drift

In the paper’s navigation setup, the agent receives panoramic street views split into perspective views, reasons in a ReAct-style loop, chooses an action, and continues until it stops or reaches the 35-step limit. The main evaluation deliberately keeps each step independent: the model does not carry persistent memory or previous internal state across decisions.
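A minimal sketch of that control flow, under stated assumptions: the street network is reduced to a toy dictionary graph, the VLM is replaced by a `policy` callable, and the policy sees only the current node and its available actions (no history), which mirrors the stateless setup described above. `run_episode`, `Graph`, and `Policy` are illustrative names, not the paper's code.

```python
from typing import Callable, Dict, List

MAX_STEPS = 35  # step budget described in the article
STOP = "stop"

Graph = Dict[str, Dict[str, str]]  # node -> {action label -> next node}
Policy = Callable[[str, str, List[str]], str]  # (instruction, node, actions) -> action


def run_episode(graph: Graph, start: str, instruction: str,
                policy: Policy, max_steps: int = MAX_STEPS) -> List[str]:
    """Roll out one episode; each decision sees only the current node."""
    node = start
    path = [node]
    for _ in range(max_steps):
        actions = sorted(graph[node]) + [STOP]
        action = policy(instruction, node, actions)  # no memory is passed in
        # A stop request or a malformed/unavailable action ends the episode here.
        if action == STOP or action not in graph[node]:
            break
        node = graph[node][action]
        path.append(node)
    return path


def always_forward(instruction: str, node: str, actions: List[str]) -> str:
    """Toy policy: keep walking forward while possible."""
    return "forward" if "forward" in actions else STOP


toy = {"A": {"forward": "B"}, "B": {"forward": "C"}, "C": {}}
print(run_episode(toy, "A", "I'm thirsty", always_forward))
```

Because nothing carries over between steps, a policy like this has no way to notice that it is walking in a circle, which is the failure the next paragraphs describe.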

This design choice may sound artificial, but it isolates the model’s intrinsic spatial reasoning. It also produces an important failure mechanism: error accumulation.

The paper reports that performance degrades as route length increases. On routes under 20 steps, trajectories tend to stay manageable. As routes approach 35 steps, path-alignment scores become highly scattered. The model is not merely making one bad choice. It is losing the thread.

That loss appears in several forms:

Trajectory deviation: one wrong turn compounds into a path that no longer samples useful visual evidence.
Oscillatory detours: the agent loops or backtracks without a stable plan.
Premature stopping or overshooting: the agent stops before the target or walks past it.
Observation–reasoning mismatch: the model observes a useful cue but chooses an action inconsistent with that observation.
Malformed action output: the model fails to produce executable action fields.
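Two of these patterns, loops and back-and-forth detours, can be flagged from the node sequence alone. The sketch below is an illustrative diagnostic, not the paper's error taxonomy: `count_revisits` and `longest_ping_pong` are hypothetical helpers that operate on a list of visited node IDs.

```python
def count_revisits(path):
    """Number of steps that land on a node the agent has already visited."""
    seen = set()
    revisits = 0
    for node in path:
        if node in seen:
            revisits += 1
        seen.add(node)
    return revisits


def longest_ping_pong(path):
    """Longest run of consecutive steps that return to the node from two steps earlier."""
    best = run = 0
    for i in range(2, len(path)):
        if path[i] == path[i - 2] and path[i] != path[i - 1]:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


trajectory = ["A", "B", "A", "B", "C"]
print(count_revisits(trajectory), longest_ping_pong(trajectory))
```

A threshold on either number is a cheap trigger for a correction step, though deciding what to do after the trigger fires is the harder design question.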

This is where embodied navigation differs from answering a static image question. A single visual mistake is bad. A single wrong action in a sequential environment changes the future information available to the agent. The model does not just answer incorrectly; it walks itself into a worse epistemic position. Elegant, in a tragic little way.

For business use, this is the difference between a chatbot making one weak recommendation and an autonomous system taking a series of actions that progressively reduce recoverability. Delivery robots, mobility assistants, AR city guides, tourism agents, and field-service copilots need active correction, not just stronger first-pass perception.

The map ablation is the paper’s most useful surprise

The most business-relevant result may be the map ablation, because it attacks the obvious product-manager instinct: add a map.

In Appendix D, the authors test a map-augmented setting for GPT-4o and Qwen2.5-VL-32B. The model receives both first-person street view and a dynamically updated 2D map showing the planned route and current heading. It is instructed to analyze the map first, determine the geometrically optimal next step, and then choose the matching street-view perspective.

The result is counterintuitive. Path alignment improves (nDTW falls, and lower is better on this metric as reported), but task completion collapses.

Model	Map-free TCP	Map-augmented TCP	Map-free nDTW	Map-augmented nDTW
GPT-4o	18.3%	11.7%	136.9	75.7
Qwen2.5-VL-32B	21.1%	7.6%	147.0	54.4

The map helps the model follow a geometric route. It hurts the model’s ability to complete the semantic discovery task.
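The size of the trade-off is easier to see as relative change. The snippet below only re-expresses the four table values above; `relative_change` is an illustrative helper.

```python
# Values copied from the map ablation table: (map-free, map-augmented).
RESULTS = {
    "GPT-4o": {"tcp": (18.3, 11.7), "ndtw": (136.9, 75.7)},
    "Qwen2.5-VL-32B": {"tcp": (21.1, 7.6), "ndtw": (147.0, 54.4)},
}


def relative_change(before, after):
    """Percentage change from before to after (negative means a decrease)."""
    return (after - before) / before * 100.0


for model, metrics in RESULTS.items():
    tcp = relative_change(*metrics["tcp"])
    ndtw = relative_change(*metrics["ndtw"])
    print(f"{model}: TCP {tcp:+.1f}%, nDTW {ndtw:+.1f}%")
```

On these figures, TCP falls by roughly 36% in relative terms for GPT-4o and roughly 64% for Qwen2.5-VL-32B, while nDTW falls by roughly 45% and 63%. The stronger map-free model loses the most completion.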

That distinction is the paper’s sharpest correction to naive deployment thinking. Navigation is not only geometry. If the user says “I’m thirsty,” the agent must keep searching for a satisfying place. A blue line can distract the model from visual exploration. The paper identifies three failure modes: weak 2D map cognition, poor alignment between map directions and first-person perspectives, and trivialization of the task into path-following.

The business implication is not “do not use maps.” That would be a silly conclusion, and silliness has enough market share already. The implication is that map integration must be treated as a representation-alignment problem. The agent must know when the map is helping route geometry and when street-level perception must override or refine it. A deployed system should not simply stuff a map screenshot into a VLM prompt and call the product “spatially aware.”

A map can improve movement while degrading purpose. That is the kind of failure customers notice immediately.

Memory is not decoration; it is part of the navigation system

The authors test three families of exploratory strategies under the acronym BCR: Backtracking, Cognitive-map enrichment, and Retrieval-augmented memory. These experiments run on a 650-sample subset, so they should be read as exploratory engineering evidence rather than full benchmark proof.

Still, the pattern is instructive.

Backtracking mechanisms attempt to correct drift. Basic backtracking relies on low internal confidence. Step-reward backtracking uses objective topological distance. Human-guided backtracking adds a corrective hint after reverting. These strategies generally improve task completion, but simple confidence-based backtracking can hurt weaker models, suggesting that self-assessment is itself a capability. A model that is confused may also be confused about whether it is confused. Delightful.
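As a rough illustration of the step-reward idea, the sketch below reverts a move whenever it increases hop distance to the target on the route graph. This is an assumption-heavy simplification: it uses breadth-first hop counts as the "objective topological distance" and assumes the target node is known to the evaluator, and the names `hop_distance` and `step_with_backtrack` are hypothetical rather than the paper's implementation.

```python
from collections import deque


def hop_distance(graph, source, goal):
    """Breadth-first hop count from source to goal; None if unreachable."""
    if source == goal:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def step_with_backtrack(graph, current, proposed, goal):
    """Return (node after the step, whether the move was reverted)."""
    before = hop_distance(graph, current, goal)
    after = hop_distance(graph, proposed, goal)
    # Revert when the move leads away from the goal or distance is undefined.
    if before is None or after is None or after > before:
        return current, True
    return proposed, False


street = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(step_with_backtrack(street, "B", "A", "D"))
print(step_with_backtrack(street, "B", "C", "D"))
```

The contrast with confidence-based backtracking is the source of the signal: here the revert decision comes from the graph, not from the model's opinion of itself.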

Spatial cognition enrichment provides external spatial cues. The topology cognitive graph gives explicit connectivity between nodes and actions. The relative position map describes approximate directions and distances. The topology graph is more reliable for task success; relative-position cues can improve path efficiency but sometimes hurt completion.

Memory-based retrieval is the strongest family overall. The paper tests topology-based retrieval, spatial-based retrieval, and historical trajectory lookup. The top reported result is the R1 variant pushing Qwen2.5-VL-32B to 26.9% TCP. GPT-4o-mini also benefits: R3 reaches 19.4% TCP, while R1 sharply improves nDTW from 337.1 to 136.6.

What does this mean mechanically?

Memory changes the task from isolated local guessing into situated exploration. It gives the agent a record of visited nodes, prior decisions, confidence scores, transition histories, and recent rationales. That helps with two problems at once: avoiding repeated mistakes and reusing successful partial paths.
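A minimal sketch of what such a record could look like as a data structure. The fields follow the items listed above (visited nodes, decisions, confidence, rationales); the class names `Step` and `RouteMemory`, their methods, and the dead-end bookkeeping are assumptions for illustration, not the paper's retrieval design.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Step:
    node: str
    action: str
    confidence: float
    rationale: str


@dataclass
class RouteMemory:
    steps: List[Step] = field(default_factory=list)
    dead_ends: Set[Tuple[str, str]] = field(default_factory=set)

    def record(self, step: Step) -> None:
        """Append one decision to the trajectory history."""
        self.steps.append(step)

    def mark_dead_end(self, node: str, action: str) -> None:
        """Remember that taking this action at this node did not help."""
        self.dead_ends.add((node, action))

    def visited(self) -> Set[str]:
        return {s.node for s in self.steps}

    def untried_actions(self, node: str, actions: List[str]) -> List[str]:
        """Actions at this node that were neither taken before nor marked as dead ends."""
        tried = {s.action for s in self.steps if s.node == node}
        return [a for a in actions if a not in tried and (node, a) not in self.dead_ends]

    def recent_context(self, k: int = 3) -> List[str]:
        """Short text summaries of the last k decisions, e.g. for a prompt."""
        return [f"{s.node}:{s.action} ({s.rationale})" for s in self.steps[-k:]]


memory = RouteMemory()
memory.record(Step("n1", "left", 0.4, "saw a cafe sign"))
memory.mark_dead_end("n1", "right")
print(memory.untried_actions("n1", ["left", "right", "forward"]))
```

Filtering out already-tried actions addresses repeated mistakes; the recent-context summaries are the kind of material that could be fed back to the model on the next step.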

For businesses designing embodied or location-aware AI systems, the lesson is straightforward: do not treat memory as a personalization feature added after launch. In sequential physical tasks, memory is infrastructure. Without it, the agent keeps rediscovering the same sidewalk.

Human failure and model failure are not the same failure

The human baseline is only 30.1% TCP, so the paper avoids a simplistic “humans good, models bad” framing. Humans also struggle with unfamiliar streets, poor signage, time pressure, and the 35-step budget.

But the error analysis shows that humans and VLMs fail differently.

For humans, the dominant failure category is strategic and navigational: 60.7%. Participants often understand the request but explore inefficiently, forget paths, loop, or overshoot valid targets. For Qwen2.5-VL-32B, strategic and navigational failures are also large at 40.5%, but cognitive failures are more prominent: 32.9% for the model versus 19.1% for humans. The model struggles with non-obvious POI functions and flexible commonsense leaps. Visual and execution errors also remain material: 26.6% for the model versus 20.2% for humans.
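The three categories quoted above account for all analyzed failures on each side, so the gaps can be read directly as percentage points. The snippet below just tabulates the figures from this paragraph; the dictionary keys and the `gap` helper are illustrative labels.

```python
# Failure shares (percent) as quoted in the article's error analysis.
ERRORS = {
    "human": {"strategic": 60.7, "cognitive": 19.1, "visual_execution": 20.2},
    "Qwen2.5-VL-32B": {"strategic": 40.5, "cognitive": 32.9, "visual_execution": 26.6},
}


def gap(category):
    """Model share minus human share, in percentage points."""
    return ERRORS["Qwen2.5-VL-32B"][category] - ERRORS["human"][category]


for category in ("strategic", "cognitive", "visual_execution"):
    print(f"{category}: {gap(category):+.1f} points")
```

The model's share of strategic failures is about 20 points lower than the human share, while its cognitive share is about 14 points higher, which is the asymmetry the rest of this section builds on.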

This comparison gives a better target for product design than a single success rate.

Humans need planning aids. Models need commonsense, grounding, action consistency, and memory. A useful city assistant may eventually combine both: machine memory and routing discipline, plus human-like affordance reasoning. Current VLMs have pieces of that package, not the package.

What CitySeeker directly shows

CitySeeker directly supports four claims.

First, implicit-need urban navigation is meaningfully harder than explicit route following. It requires semantic inference, visual grounding, and long-horizon action selection.

Second, current VLMs remain weak on this task. The best overall TCP in the main benchmark is 21.1%, and exact completion is extremely low across models.

Third, adding map information naively does not solve the problem. In the tested map-augmented condition, path alignment improves while completion falls, especially for Qwen2.5-VL-32B.

Fourth, memory and correction mechanisms help. Backtracking, topology-aware context, and retrieval-based memory improve performance in exploratory subset studies, with memory-based methods showing particularly strong gains.

These are the paper’s direct contributions. They are enough.

What Cognaptus infers for business use

The business interpretation begins with a shift in product framing.

The commercially valuable task is not “navigate to POI X.” Existing maps already do a decent job there. The more interesting task is “convert vague human intent into grounded urban action.” That is relevant to at least five product categories:

Product area	CitySeeker-relevant capability	Practical design implication
Delivery and service robots	Navigate from task intent to useful physical target	Add correction loops and short-term route memory
AR city assistants	Interpret vague user needs in street context	Combine visual grounding with affordance databases
Tourism and mobility apps	Suggest nearby places from human preferences	Separate candidate generation from route execution
Accessibility tools	Find infrastructure attributes, not just place names	Treat attributes as visual and semantic evidence
Smart-city interfaces	Support dynamic, local discovery	Use topology and recent observations, not static maps alone

The inference is not that CitySeeker is a production benchmark for every deployment. It is a diagnostic prototype for a missing capability layer. In business systems, that layer would likely include:

Need-to-candidate translation: “I need Wi-Fi” becomes libraries, cafés, bookstores, fast-food restaurants, co-working spaces, and possibly malls.
Affordance scoring: each candidate gets scored by how plausibly it satisfies the user’s actual need.
Street-level evidence checking: visual cues confirm or reject candidates.
Topology-aware route memory: the agent records where it has been and which branches failed.
Correction policy: the system decides when to backtrack, ask the user, consult a map, or switch candidate categories.
Personalization layer: once general commonsense works, user preferences can rank valid options.
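The second and third layers in that list, affordance scoring and street-level evidence checking, might be combined along these lines. Everything here is hypothetical: the prior values in `AFFORDANCE` are made up for illustration, and a real system would need learned or curated scores and an actual visual check in place of the `visible` flag.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    category: str
    visible: bool  # stand-in for a street-level visual confirmation


# Hypothetical priors: how plausibly a category satisfies a need (0 to 1).
AFFORDANCE = {
    ("wifi", "library"): 0.9,
    ("wifi", "cafe"): 0.8,
    ("wifi", "fast_food"): 0.5,
    ("wifi", "mall"): 0.4,
}


def score(need, candidate, evidence_bonus=0.2):
    """Prior affordance plus a bonus when visual evidence confirms the candidate."""
    prior = AFFORDANCE.get((need, candidate.category), 0.0)
    if prior == 0.0:
        return 0.0  # category cannot satisfy the need; evidence does not rescue it
    bonus = evidence_bonus if candidate.visible else 0.0
    return min(1.0, prior + bonus)


def rank(need, candidates):
    """Names of plausible candidates, best score first."""
    scored = [(score(need, c), c.name) for c in candidates]
    return [name for s, name in sorted(scored, key=lambda item: -item[0]) if s > 0]


options = [
    Candidate("Central Library", "library", False),
    Candidate("Corner Cafe", "cafe", True),
    Candidate("City Bank", "bank", True),
]
print(rank("wifi", options))
```

Keeping the prior and the evidence as separate terms makes the two failure modes from the error analysis visible in the design: a missing prior is underthinking, and an inflated one is overthinking.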

Notice the order. Personalization comes after general spatial commonsense. If the model cannot infer that a café may provide Wi-Fi, learning that Oliver prefers quiet cafés is not enough. A preference layer on top of weak affordance reasoning is just a more tailored mistake.

The deployment boundary is still wide

CitySeeker also defines several practical boundaries.

The benchmark does not distribute raw street-view images; it releases metadata, graph data, panorama IDs, and scripts to re-fetch imagery through official APIs. That is a sensible licensing approach, but it means reproduction depends on API access and external image availability.

The BCR strategies are promising but tested on a 650-sample subset. Combined BCR strategies improve Qwen2.5-VL-32B from 19.9% to 27.38% TCP in a preliminary experiment, but the effect is not strictly additive. Strategy interaction is still an open design problem, not a solved recipe.

Latency remains a serious issue. Real-time navigation requires repeated visual processing and sequential decisions. A city assistant that thinks beautifully but pauses like a committee at every corner will not feel intelligent to users. It will feel like a device considering retirement.

Personalization is also outside the main benchmark. The authors focus on universal common-sense mappings, not individual user history. That is appropriate for a benchmark, but deployed assistants will need behavioral priors, local culture, accessibility needs, budget, safety preferences, and time constraints.

Finally, map fusion is unresolved. The ablation does not mean maps are bad. It means current VLMs can fail to align 2D geometric guidance with first-person visual action. That is a representation problem, and probably a major one.

The useful lesson is the pipeline of failure

CitySeeker is valuable because it turns a fuzzy product aspiration—“AI agents that understand what people need in the city”—into a sequence of measurable failure points.

The model must parse the need. It must infer possible places. It must recognize visual evidence. It must choose a direction. It must remember previous choices. It must correct drift. It must know when to stop. It must use a map without becoming hypnotized by it.

That mechanism-first reading is more useful than a leaderboard summary. The benchmark does not merely say that models are weak. It shows why “more VLM” and “add a map” are incomplete answers.

For Cognaptus readers, the broader pattern should feel familiar. In many AI products, the hard part is not generating fluent text or recognizing obvious objects. The hard part is converting messy human intent into operational decisions under uncertainty. CitySeeker happens to stage that problem on sidewalks, intersections, storefronts, and malls. The same pattern appears in enterprise workflow automation, field operations, customer support, procurement, and financial research.

The city is just a brutally honest interface.

A model that can answer “Where is Starbucks?” is useful. A model that can solve “I need somewhere quiet to work, near me, with Wi-Fi, without walking in circles” is closer to an agent. CitySeeker shows that current VLMs are not there yet. But it also shows where to build: memory, affordance reasoning, topology, correction, and better map-view grounding.

That is less glamorous than saying “spatial intelligence has arrived.”

It is also much more useful.

Cognaptus: Automate the Present, Incubate the Future.

¹ Siqi Wang et al., “CitySeeker: How Do VLMs Explore Embodied Urban Navigation with Implicit Human Needs?”, arXiv:2512.16755, https://arxiv.org/html/2512.16755.
