Cities That Think: Reasoning AI for the Urban Century

Zoning is where optimism goes to meet the municipal code.

A proposed housing site may look perfect on a dashboard: good transport access, strong demand, reasonable land cost, favourable development projections. Then the real planning work begins. Height restrictions appear. Environmental buffers interfere. Community priorities conflict. A flood-risk layer changes the cost-benefit story. A transport engineer likes the site. A housing officer likes the urgency. A neighbourhood group likes neither the density nor the traffic. The question is no longer “what is likely to happen?” It is “what should be allowed, under which constraints, with what trade-offs, and who can justify that decision in public?”

That is the gap Sijie Yang, Jiatong Li, and Filip Biljecki target in Reasoning Is All You Need for Urban Planning AI.¹ Despite the title’s heroic simplicity — every AI paper eventually discovers the marketing department — the argument is not that cities need one magical reasoning model. It is more sober and more useful: urban-planning AI has been strong at analysis, but decision support requires a different machinery. Prediction is not enough when the output must be legal, explainable, value-sensitive, and open to human challenge.

The paper is a position and framework paper, not an empirical benchmark report. There are no performance tables showing that a deployed planning agent beats human planners, traditional optimisation, or existing planning-support systems. Its contribution is architectural: it proposes an Agentic Urban Planning AI Framework with three cognitive layers, six logic components, a human-AI multi-agent collaboration pipeline, and evaluation metrics for future systems. Read properly, the paper is less a product demo than a wiring diagram for the next generation of planning copilots.

That distinction matters. A wiring diagram does not prove the machine works. But it does tell us where the machine would have to be wired if anyone wants it to survive contact with planning law, stakeholder politics, and the joyful spreadsheet swamp of municipal governance.

The planning problem is not prediction with nicer charts

Urban AI has already done plenty of useful work in analytics. The paper lists familiar examples: traffic prediction, land-use classification, building-carbon-emission forecasting, urban heat-island assessment, thermal comfort evaluation, morphology analysis, street activity detection, and liveability indices. These are largely pattern-recognition and prediction tasks. They ask systems to learn from observed data and estimate likely conditions.

Planning decisions ask for something nastier.

A planning authority does not merely want to know whether a site is likely to generate congestion. It must decide whether the proposal is permissible, whether mitigation is adequate, whether the distribution of benefits and harms is defensible, and whether the reasoning can be inspected by people who did not train the model and may not trust it. This is not a minor user-interface problem. It changes the nature of the AI system.

The authors frame the difference as a shift from statistical learning to reasoning agents. Statistical learning systems learn patterns from historical data. They can recommend, classify, forecast, and detect likely violations. Reasoning agents, by contrast, are expected to generate explicit reasoning traces, use external tools, verify constraints, deliberate over values, and explain why one proposal is preferable to another.

The paper’s Table 1 is a conceptual comparison, not experimental evidence. Its purpose is to separate decision tasks into three categories where reasoning matters:

Planning requirement	What statistical learning can often do	What reasoning agents are meant to add
Value-based judgement	Learn historical allocations or stakeholder preferences	Apply explicit principles, challenge biased precedent, reason from first principles
Rule-grounded compliance	Detect likely violations or find feasible-looking options	Verify hard constraints and resolve rule conflicts
Explainability	Produce recommendations and predictions	Provide readable rationales, causal chains, and counterfactual explanations

This is the paper’s central correction to a common misconception. Reasoning agents are not proposed as replacements for planners. Nor do they automatically make planning lawful or fair. They are proposed as decision-support machinery for situations where a black-box recommendation is institutionally useless, even if it is statistically impressive.

A city cannot approve a controversial project by saying, “the model liked it.” Well, it can try. Then it can enjoy the litigation.

The proposed system has three layers before it has opinions

The strongest part of the paper is its mechanism. The authors do not present “AI for planning” as a single chatbot sitting on top of a GIS map. They decompose the system into three cognitive layers: Perception, Foundation, and Reasoning.

This layering is important because it stops the discussion from collapsing into model fandom. The planning agent does not begin by “thinking.” It begins by seeing, retrieving, structuring, and modelling.

The Perception Layer handles the city as observable data. Satellite imagery, street-view photos, spatial layers, 3D reconstructions, and multimodal representations become machine-interpretable inputs. The authors map technologies such as SAM and ViT to visual extraction, CLIP and BLIP-2 to vision-language alignment, and NeRF or 3D Gaussian Splatting to 3D urban representation. The point is not that these specific models are sacred. They are examples of the machinery required to convert messy urban reality into data that downstream systems can reason over.

The Foundation Layer organises knowledge. Here the paper places statistical learning, LLMs, retrieval-augmented generation, simulation, and reinforcement learning. XGBoost and SHAP represent interpretable prediction. LLMs parse planning documents and policy language. RAG retrieves relevant regulations, guidelines, precedents, and cases. Simulation and reinforcement learning provide environment models and policy-learning components.

The Reasoning Layer performs explicit decision work. This is where Chain-of-Thought and Tree-of-Thought decompose problems, ReAct-style agents call tools, symbolic solvers check compliance, multi-agent systems coordinate specialised roles, and value-aligned methods try to keep normative criteria visible.

The paper summarises the architecture with a useful formula-like slogan:

Agentic AI = LLM (Core) + RAG (Memory) + Tools (Action) + RL (Feedback) + Values (Constraint)

As slogans go, this one earns its coffee. It makes clear that the LLM is not the whole system. In a serious planning application, the model is an orchestration engine sitting among knowledge retrieval, verification tools, feedback loops, and value constraints. A chatbot alone is not urban intelligence. It is at best a fluent intern with zoning anxiety.

The six logic components turn an agent into a planning workflow

The three layers explain what the system is built from. The six logic components explain what the system does.

The authors define six components: Analysis, Generation, Verification, Evaluation, Collaboration, and Decision. This is where the framework becomes operational.

Analysis extracts features, interprets the planning context, retrieves comparable cases, diagnoses problems, and identifies opportunities. This is not yet a recommendation. It is the system assembling the conditions under which recommendations can be meaningful.

Generation creates planning alternatives. The important word is alternatives. A planning copilot should not simply produce one plausible scheme and ask for applause. It should explore a solution space: different sites, densities, mitigation packages, public-space allocations, phasing strategies, or infrastructure options.

Verification checks hard constraints. This is the most commercially underappreciated part of the framework. For many planning and real-estate workflows, the valuable AI product is not the one that writes the most elegant policy memo. It is the one that says “this proposal fails setback rule X, environmental buffer Y, and parking/loading requirement Z” before a team spends three weeks decorating a doomed PowerPoint.
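As a minimal sketch of that kind of early rejection, assuming a proposal can be reduced to a handful of numeric attributes, a verifier might look like the Python below. The rule ids, thresholds, and field names are hypothetical; none of them come from the paper or from any real planning code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Proposal:
    name: str
    setback_m: float       # distance from the lot line
    buffer_m: float        # distance from a protected environmental zone
    parking_spaces: int
    units: int


@dataclass(frozen=True)
class HardConstraint:
    rule_id: str
    check: Callable[[Proposal], bool]


# Hypothetical rules with made-up thresholds, for illustration only.
RULES = [
    HardConstraint("setback-X", lambda p: p.setback_m >= 6.0),
    HardConstraint("buffer-Y", lambda p: p.buffer_m >= 30.0),
    HardConstraint("parking-Z", lambda p: p.parking_spaces >= 0.5 * p.units),
]


def violations(proposal: Proposal, rules=RULES) -> list[str]:
    """Return the ids of every hard constraint the proposal fails."""
    return [rule.rule_id for rule in rules if not rule.check(proposal)]


tower = Proposal("tower", setback_m=4.0, buffer_m=50.0, parking_spaces=40, units=120)
midrise = Proposal("midrise", setback_m=8.0, buffer_m=35.0, parking_spaces=45, units=80)

print(violations(tower))    # ['setback-X', 'parking-Z']
print(violations(midrise))  # []
```

Returning the list of failed rule ids, rather than a bare pass or fail, is what makes the rejection explainable to the team whose scheme was rejected.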

Evaluation scores proposals against soft objectives such as equity, sustainability, resilience, liveability, and economic impact. These are not “soft” because they are unimportant. They are soft because they require weighting, judgement, and contestable priorities. The paper formalises this as multi-objective optimisation, where hard constraints must be satisfied and soft objectives are weighted by stakeholder preferences.

Collaboration structures human and multi-agent review. The system can support linear individual review, where specialists evaluate a proposal sequentially, or group discussion, where stakeholders deliberate collectively. The framework includes planners, scientists, citizens, analysts, and policymakers as possible reviewing roles.

Decision synthesises the reasoning chain and presents accept-or-revise options. This final component is explicitly not full automation. The human-AI interface remains central, because planning legitimacy does not come from computational fluency. It comes from accountable decision-making.

The paper’s Figure 2 is therefore best read as the main architecture diagram. Figure 3 is the collaboration workflow. Algorithm 1 is an implementation sketch, showing how requirements, retrieved knowledge, generated proposals, symbolic verification, impact scoring, role-based review, conflict detection, and final refinement could be chained. These are not ablations or robustness tests. They are design artefacts: useful for understanding system intent, but not evidence that the design has been validated in production.
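One way to picture how those steps could chain together is the loop below. It is a simplified sketch, not a transcription of Algorithm 1: conflict detection is collapsed into a list of reviewer objections, and every function name, data field, and threshold is a hypothetical stand-in.

```python
def run_pipeline(requirements, retrieve, generate, verify, score, reviewers, max_rounds=3):
    """Chain retrieval, generation, verification, scoring, and review.

    Returns (recommended_proposal_or_None, log). The recommendation feeds a
    human accept-or-revise decision; it is not the decision itself.
    """
    knowledge = retrieve(requirements)
    feedback, log = [], []
    for round_no in range(1, max_rounds + 1):
        candidates = generate(requirements, knowledge, feedback)
        feasible = [p for p in candidates if verify(p, knowledge)]
        log.append(f"round {round_no}: {len(feasible)}/{len(candidates)} passed verification")
        if not feasible:
            feedback = ["no feasible proposal"]
            continue
        best = max(feasible, key=score)
        # Each reviewer returns an objection string, or None if satisfied.
        objections = [note for note in (review(best) for review in reviewers) if note]
        if not objections:
            log.append(f"round {round_no}: no objections, forwarded for human decision")
            return best, log
        log.append(f"round {round_no}: revise, objections: {objections}")
        feedback = objections
    return None, log


# Toy stand-ins for each stage (all hypothetical).
def retrieve(requirements):
    return {"max_floors": 12}

def generate(requirements, knowledge, feedback):
    green = 0.3 if "too little green space" in feedback else 0.1
    return [{"floors": f, "green_ratio": green} for f in (8, 12, 16)]

def verify(proposal, knowledge):
    return proposal["floors"] <= knowledge["max_floors"]

def score(proposal):
    return proposal["floors"]  # crude proxy: more floors, more housing

def resident_review(proposal):
    return "too little green space" if proposal["green_ratio"] < 0.2 else None


best, log = run_pipeline({"site": "hypothetical"}, retrieve, generate, verify,
                         score, [resident_review])
print(best)  # {'floors': 12, 'green_ratio': 0.3}
for line in log:
    print(line)
```

The log is the point of the exercise: each round records what was rejected by the rules and what was sent back by reviewers, which is the raw material for a traceable justification.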

The formalisation says: optimise, but only after the law says yes

The paper formalises urban planning as a constrained multi-objective optimisation problem with explicit reasoning requirements. A planning context $C = \langle D, K, S \rangle$ contains spatial data $D$, planning knowledge $K$, and stakeholder input $S$. A set of hard constraints $H = \{h_1, h_2, \ldots, h_m\}$ captures regulatory requirements. A set of soft objectives $O = \{o_1, o_2, \ldots, o_n\}$ captures normative criteria such as equity, sustainability, and liveability.

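The system seeks a proposal $p$ from the space of candidate proposals $P$ and a reasoning chain $r$ from the space of reasoning chains $R$ that maximise the weighted objectives, where each weight $w_i$ reflects stakeholder preferences: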

$$ p^\ast, r^\ast = \arg\max_{p \in P, r \in R} \left[\sum_{i=1}^{n} w_i \cdot o_i(p)\right] $$

subject to every hard constraint being satisfied:

$$ \forall h_j \in H: h_j(p) = \text{True} $$

The reasoning chain must also be:

$$ \text{Valid}(r) \land \text{Complete}(r) \land \text{Traceable}(r, p, C) $$

This is the conceptual heart of the framework. The system is not supposed to optimise first and explain later. It is supposed to optimise under constraints while maintaining a reasoning chain that humans can inspect.
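Read as code, the formalisation is a filter followed by an argmax, with a trace kept along the way. The sketch below assumes objectives are already scored on a 0 to 1 scale. The proposal names, scores, weights, and height cap are invented for illustration, and the trace is a simple log rather than the validity, completeness, and traceability checks the paper calls for.

```python
def select_proposal(proposals, hard_constraints, weights):
    """Return (best_feasible_name_or_None, trace).

    Hard constraints are checked first; only feasible proposals are scored.
    """
    trace = []
    best_name, best_score = None, float("-inf")
    for name, p in proposals.items():
        failed = [cid for cid, h in hard_constraints.items() if not h(p)]
        if failed:
            trace.append(f"{name}: rejected, violates {', '.join(failed)}")
            continue
        # Weighted sum of soft objectives: sum_i w_i * o_i(p)
        score = sum(w * p[objective] for objective, w in weights.items())
        trace.append(f"{name}: feasible, weighted score {score:.2f}")
        if score > best_score:
            best_name, best_score = name, score
    return best_name, trace


# Hypothetical inputs.
proposals = {
    "A": {"height_m": 60.0, "equity": 0.9, "sustainability": 0.8},
    "B": {"height_m": 40.0, "equity": 0.6, "sustainability": 0.7},
    "C": {"height_m": 30.0, "equity": 0.8, "sustainability": 0.4},
}
hard_constraints = {"height-cap": lambda p: p["height_m"] <= 45.0}
weights = {"equity": 0.5, "sustainability": 0.5}

best, trace = select_proposal(proposals, hard_constraints, weights)
print(best)  # B
for line in trace:
    print(line)
```

Proposal A has the highest objective scores of the three, but it never reaches the scoring step because it fails the height cap. That ordering, law first and preference second, is the behaviour the constraint in the formalisation demands.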

In business terms, that changes the product category. A traditional analytics product says: “Here is the predicted outcome.” A reasoning-based planning product says: “Here are the feasible options, here are the constraints each one satisfies, here are the trade-offs, here is the rationale, and here is where human judgement must decide.” The second product is harder to build. It is also closer to how planning decisions are actually made.

The evaluation appendix is a measurement proposal, not proof

The appendix proposes evaluation metrics for reasoning-capable planning agents. It is tempting to treat metrics as results because they come with equations. That would be a mistake. The metrics are a benchmark scaffold for future work, not an empirical validation of the framework.

The proposed metrics map onto the six pipeline stages:

Evaluation dimension	Example metric	Likely purpose	What it does not prove
Analysis	Feature extraction accuracy; contextual understanding	Tests whether the system correctly interprets urban context	Does not prove good planning judgement
Generation	Proposal diversity; reasoning-chain coherence; generation time	Measures whether useful alternatives and explanations are produced	Does not prove generated schemes are politically acceptable
Verification	Constraint satisfaction rate; violation rate; verification latency	Tests regulatory compliance and speed	Does not prove all real-world legal ambiguity is resolved
Evaluation	Value alignment score; equity impact; principle adherence	Measures alignment with stated objectives	Does not settle contested values
Collaboration	Collaboration efficiency; feedback incorporation; stakeholder comprehension	Tests human-AI interaction quality	Does not guarantee democratic legitimacy
Decision	Decision quality score; explanation completeness; agreement with planners	Summarises final output quality	Does not prove the AI should be the decision-maker

The formulas are sensible as a starting point. Constraint Satisfaction Rate measures the proportion of hard constraints satisfied. Reasoning Chain Quality combines coherence, completeness, and traceability. Value Alignment Score normalises objective performance. Human-AI Collaboration Efficiency penalises excessive interaction cycles and time. Decision Quality Score combines constraint satisfaction, reasoning quality, and value alignment.
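To make the shape of these metrics concrete, here is a hedged Python sketch. Only the definition of Constraint Satisfaction Rate as a proportion is taken from the description above. The unweighted mean used for Reasoning Chain Quality and the weighted average used for Decision Quality Score are assumptions made for illustration; the paper's exact formulas are not reproduced here.

```python
def constraint_satisfaction_rate(results):
    """Proportion of hard constraints satisfied; results is a list of booleans."""
    if not results:
        raise ValueError("at least one constraint result is required")
    return sum(results) / len(results)


def reasoning_chain_quality(coherence, completeness, traceability):
    """ASSUMPTION: unweighted mean of three sub-scores, each in [0, 1]."""
    return (coherence + completeness + traceability) / 3


def decision_quality_score(csr, rcq, vas, weights=(1 / 3, 1 / 3, 1 / 3)):
    """ASSUMPTION: weighted average of the three component scores."""
    w_c, w_r, w_v = weights
    return (w_c * csr + w_r * rcq + w_v * vas) / (w_c + w_r + w_v)


# Invented example values.
csr = constraint_satisfaction_rate([True, True, True, False])
rcq = reasoning_chain_quality(0.9, 0.6, 0.6)
dqs = decision_quality_score(csr, rcq, vas=0.8)

print(csr)            # 0.75
print(round(dqs, 2))  # 0.75
```

Note what the composite hides: a constraint satisfaction rate of 0.75 means one hard constraint failed, so under the formalisation this proposal is infeasible no matter how respectable the blended score looks.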

But these metrics also expose the hardest unresolved issue: who defines the scoring function?

A value alignment score is only as legitimate as the value-elicitation process behind it. A proposal can score well against “sustainability” while worsening affordability. It can improve accessibility while accelerating displacement. It can satisfy formal zoning while violating community expectations. Metrics help make trade-offs explicit; they do not make trade-offs disappear. The spreadsheet may be very democratic-looking. That does not make it democracy.
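A small worked example, with invented scores, shows how much rides on the weights. Suppose two feasible proposals are scored on sustainability $s$ and affordability $a$, each on a 0 to 1 scale: proposal A scores $(s, a) = (0.9, 0.3)$ and proposal B scores $(s, a) = (0.5, 0.8)$. With weights $(w_s, w_a) = (0.7, 0.3)$:

$$ \text{A}: 0.7 \cdot 0.9 + 0.3 \cdot 0.3 = 0.72, \qquad \text{B}: 0.7 \cdot 0.5 + 0.3 \cdot 0.8 = 0.59 $$

With the weights swapped to $(w_s, w_a) = (0.3, 0.7)$:

$$ \text{A}: 0.3 \cdot 0.9 + 0.7 \cdot 0.3 = 0.48, \qquad \text{B}: 0.3 \cdot 0.5 + 0.7 \cdot 0.8 = 0.71 $$

The data did not change. Only the weights did, and the preferred proposal flipped from A to B.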

The business value is auditable decision support, not automated city hall

For govtech vendors, planning consultancies, real-estate developers, infrastructure firms, and digital-twin platforms, the paper points to a practical product shift: from predictive dashboards to auditable planning copilots.

The near-term commercial value is likely to sit in five workflows.

First, machine-readable planning knowledge. Zoning codes, design guidelines, environmental requirements, transport standards, heritage constraints, and local ordinances need to be retrieved, parsed, and represented in forms that agents and solvers can use. This is boring infrastructure. Naturally, it is also where much of the value hides.

Second, proposal generation under constraints. Developers and public agencies often need to compare multiple feasible options quickly: different densities, site layouts, phasing choices, or mitigation strategies. An agent that can generate alternatives while respecting constraints has more value than a model that simply produces one attractive concept image. Pretty renders are abundant. Legally viable options are less so.

Third, compliance checking and early rejection. The Verification component could reduce wasted design, legal, and consultant time by filtering infeasible proposals early. This is where reasoning AI may produce measurable ROI before anyone lets it near a final public decision.

Fourth, explainable trade-off reporting. Planning decisions require narratives: why this option, why not that one, which rules applied, which objectives were prioritised, and what harms remain. A system that can assemble traceable justifications from regulations, spatial data, impact models, and stakeholder feedback could improve internal review and public communication.

Fifth, structured stakeholder feedback. The collaboration framework suggests AI can capture comments, ratings, objections, and revision requests, then map them back into constraints or objectives. That could help agencies manage deliberation at scale. It could also create a beautiful new way to launder decisions through “consultation analytics” if governance is weak. Tools rarely arrive with ethics pre-installed.

The business inference is clear but bounded. The paper directly proposes a framework. Cognaptus infers that the strongest commercial applications will be compliance-aware planning copilots and decision-audit systems. What remains uncertain is whether these systems can handle local legal ambiguity, incomplete data, adversarial stakeholder dynamics, and liability-sensitive deployment.

The uncomfortable dependency is formalised regulation

The framework’s most important dependency is not model size. It is formalisation.

For the Verification component to work, planning rules must be encoded in machine-interpretable form. That is hard. Zoning codes contain exceptions, discretionary clauses, vague standards, cross-references, outdated language, and local interpretations. Environmental rules may depend on external studies. Heritage decisions may rely on judgement. Community-benefit requirements may be negotiated rather than mechanically computed.

The paper names constraint knowledge formalisation as its first open research challenge, and rightly so. Without formalised constraints, the reasoning agent becomes a confident summariser of messy documents. Useful, perhaps. Verifiable, no.
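A rough illustration of what formalisation involves, and where it runs out: the Python sketch below stores rules as plain records and reports, for each one, whether it passed, failed, could not be checked for lack of data, or needs a human because the rule is discretionary. The rule ids, fields, thresholds, and status labels are all hypothetical.

```python
import operator

OPS = {"<=": operator.le, ">=": operator.ge}

# Hypothetical rule records; ids, fields, and thresholds are invented.
RULES = [
    {"id": "height-cap", "field": "height_m", "op": "<=", "value": 45.0},
    {"id": "flood-buffer", "field": "flood_buffer_m", "op": ">=", "value": 30.0},
    {"id": "heritage-character", "discretionary": True,
     "text": "development should respect the character of the area"},
]


def evaluate(rule, proposal):
    """Return 'pass', 'fail', 'missing_data', or 'needs_human' for one rule."""
    if rule.get("discretionary"):
        return "needs_human"      # judgement call: route to a planner
    if rule["field"] not in proposal:
        return "missing_data"     # cannot verify without the input
    ok = OPS[rule["op"]](proposal[rule["field"]], rule["value"])
    return "pass" if ok else "fail"


proposal = {"height_m": 40.0}
report = {rule["id"]: evaluate(rule, proposal) for rule in RULES}
print(report)
# {'height-cap': 'pass', 'flood-buffer': 'missing_data', 'heritage-character': 'needs_human'}
```

Only one of the three rules comes back with a machine-checkable answer. The other two statuses are the honest output for incomplete data and discretionary language, and a verifier that silently treated them as passes would be exactly the confident summariser described above.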

The second dependency is reasoning quality. LLMs can produce plausible chains of thought that are incomplete, circular, or simply wrong. The paper’s insistence on verification is therefore not a decorative safety feature. It is the difference between a planning assistant and a litigation generator with a friendly interface.

The third dependency is scalability. Real planning problems can involve thousands of constraints and many rounds of revision. A prototype that reasons beautifully over a toy zoning example may collapse when asked to process a metropolitan redevelopment plan, multiple agencies, and a public consultation record thick enough to stun livestock.

The fourth dependency is value alignment. The authors correctly argue that reasoning agents should be able to challenge historical bias rather than merely reproduce it. But this depends on explicit value elicitation, stakeholder representation, auditing of hidden assumptions, and governance rules about who gets to set the weights. Otherwise, the system may simply encode politics in a more technical accent.

The paper is strongest as an agenda, not as evidence

The paper’s evidence base is primarily conceptual and architectural. Figure 1 distinguishes analytics from decision support. Table 1 compares statistical learning and reasoning agents across planning tasks. Table 2 maps AI methods to planning functions. Figure 2 presents the three-layer, six-component architecture. Figure 3 explains human-AI collaboration modes. Algorithm 1 sketches the pipeline. The appendix proposes evaluation metrics.

None of these is an experiment. There is no deployed system, no user study, no benchmark dataset, no ablation comparing agents with and without symbolic verification, and no field evidence showing improved planning outcomes. That does not make the paper weak. It makes it a position paper with the obligations of a position paper: define the problem, propose a framework, and make the research agenda sharper.

On that standard, the paper is useful. It gives the urban AI conversation a better unit of analysis. Instead of asking whether “AI can plan cities” — a question so broad it should be fined for loitering — it asks what capabilities planning decision support requires: perception, knowledge retrieval, generation, verification, evaluation, collaboration, and decision justification.

The next research step is obvious: build benchmark tasks where hard constraints are annotated, expert reasoning chains are available, stakeholder trade-offs are explicit, and proposed systems can be compared under realistic planning scenarios. Until then, the paper should be read as an architecture for evaluation, not an evaluated architecture.

What serious adopters should take from it

The safe managerial takeaway is not “replace planners with agents.” That would be a category error wearing a strategy badge.

The sharper takeaway is this: planning AI must become auditable before it becomes powerful. A system that predicts demand but cannot explain legal compliance is incomplete. A system that generates beautiful alternatives but cannot verify constraints is risky. A system that optimises weighted values without showing who chose the weights is politically fragile. A system that invites stakeholder feedback but cannot trace how that feedback changed the proposal is consultation theatre with GPU bills.

For public agencies, the immediate opportunity is to invest in machine-readable planning rules, structured geospatial data, and audit protocols. For consultancies, it is to build reasoning workflows that reduce review time and improve the defensibility of recommendations. For developers, it is early-stage feasibility analysis that catches regulatory failure before design costs compound. For infrastructure platforms and digital-twin vendors, it is integrating predictive models with rule engines, retrieval systems, and human review loops.

The paper’s title says reasoning is all you need. The paper itself quietly says something better: reasoning is the missing layer between urban analytics and accountable decision support.

Cities do not need AI that merely predicts them. They need systems that can show their work, respect constraints, expose trade-offs, and leave room for human judgement. That is less glamorous than autonomous urban intelligence. It is also far more likely to survive a planning committee.

Cognaptus: Automate the Present, Incubate the Future.

¹ Sijie Yang, Jiatong Li, and Filip Biljecki, “Reasoning Is All You Need for Urban Planning AI,” arXiv:2511.05375, 2025, https://arxiv.org/abs/2511.05375.
