Introduction: The Retail Map Is Not the Territory

In fast-growing cities like Nairobi, Jakarta, or Lagos, deciding where to plant the next store is less about gut feeling and more about navigating an entangled network of demand, accessibility, cost, and government regulations. At Cognaptus, we developed a multi-layered AI-driven framework that not only mimics real-world logistics but also learns and forecasts future retail viability.

This article explores how we combined predictive analytics, geospatial clustering, graph theory, and multi-objective optimization to determine where new retail nodes should thrive — balancing today’s needs with tomorrow’s complexities.

Before diving into our proposed framework, we compared several location optimization strategies:

| Approach | Pros | Cons |
|---|---|---|
| Heuristic Rules + GIS | Simple, easy to interpret | Static, lacks future forecasting |
| Deep Learning Heatmaps | Can find hidden demand drivers | Often opaque, expensive, lacks interpretability |
| Optimization-Only (MIP/ILP) | Precise constraint modeling | Often inflexible, sensitive to assumptions |
| Hybrid: Predictive + Heuristic | Balances data-driven logic with domain expertise; scalable, adaptable, explainable | |

We opted for the hybrid approach, combining unsupervised pattern recognition (via DBSCAN), human-centric filtering (via POIs and zoning), and forecasting with neural nets (LSTM). This approach avoids the rigidity of optimization-only methods and the opaqueness of deep end-to-end systems. It also allows for direct human input and validation during intermediate steps — which is especially valuable in emerging markets with limited data reliability.

This strategy draws inspiration from the EV charging study in NSW, Australia by Li et al. (2025), where DBSCAN was used alongside environmental and infrastructure data to identify safe and accessible deployment zones. We adapt that logic here for the retail context.

Why R and Python Together? We deliberately chose R for spatial operations and interactive mapping (via sf and leaflet) plus multi-objective optimization (via mco), and Python for deep learning (via Keras/LSTM). R offers elegant, direct geospatial handling and policy-friendly visual interfaces, while Python offers broader support and GPU acceleration for neural forecasting. This hybrid stack combines the best of both ecosystems without overengineering a full-stack solution.
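To make the split concrete, here is a minimal sketch of how the two stacks can hand off work, assuming a simple file-based exchange. The paths and column layout below are illustrative, not the production interface.

# R side of the handoff: export per-cluster demand history for the Python LSTM,
# then pull the forecasts that the Python script writes back.
export_for_forecasting <- function(demand_history, path = "exchange/demand_inputs.csv") {
  dir.create(dirname(path), showWarnings = FALSE, recursive = TRUE)
  write.csv(demand_history, path, row.names = FALSE)  # e.g., one row per (cluster_id, date, demand)
}

read_forecasts <- function(path = "exchange/demand_forecasts.csv") {
  read.csv(path)  # written by the Python LSTM script after training and backtesting
}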

The Problem: From Static Maps to Dynamic Demand

Traditional retail expansion strategies often overlook:

  • Zoning regulations and title clearance issues
  • Seasonality of foot traffic (e.g., school calendars, holidays)
  • Flood or fire-prone zones
  • Unserved but high-growth pockets due to informal settlements
  • Future demand spikes driven by urban migration and population growth

A smarter approach needs more than heatmaps — it needs predictive vision and adaptive decisions.

Framework Overview

We propose a hybrid system structured as follows:

Module 1: Data Loading

This module ingests heterogeneous spatial, mobility, and socio-economic data. Accurate harmonization of coordinate systems and geospatial referencing is critical.

load_data <- function() {
  poi        <- read.csv("data/nairobi_pois.csv")           # OpenStreetMap POIs (lon/lat columns)
  gps        <- readRDS("data/trip_logs.rds")               # anonymized delivery trip logs
  road_net   <- sf::st_read("data/road_network.geojson")    # road network geometries
  pop_growth <- read.csv("data/population_forecast.csv")    # KNBS population forecasts
  zoning     <- sf::read_sf("data/zoning_boundaries.shp")   # zoning polygons
  list(poi = poi, gps = gps, roads = road_net, pop = pop_growth, zone = zoning)
}
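One way to enforce that harmonization immediately after loading is a small helper like the sketch below. The helper name, the assumption of lon/lat columns in the POI table, and the choice of EPSG:3857 as the working CRS are ours; the modules later in this article also reproject defensively.

library(sf)

# Bring every spatial layer into one metric CRS so later buffers and distance
# joins are all expressed in meters.
harmonize_layers <- function(layers, target_crs = 3857) {
  layers$poi   <- st_transform(st_as_sf(layers$poi, coords = c("lon", "lat"), crs = 4326), target_crs)
  layers$roads <- st_transform(layers$roads, target_crs)
  layers$zone  <- st_transform(layers$zone, target_crs)
  layers
}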

We collected real city data from Nairobi’s open sources, including:

  • Nairobi POIs from OpenStreetMap
  • Trip data logs aggregated by a local delivery service (anonymized)
  • Road networks from Kenya Urban Roads Authority
  • Population forecasts from the Kenya National Bureau of Statistics
  • Zoning overlays from the Nairobi County Development Authority

Module 2: DBSCAN Clustering with Spatial Constraints

Theoretical Background: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that identifies dense areas in spatial data. Instead of assuming a fixed number of clusters (as K-means does), DBSCAN expands clusters based on point density.

  • In academic terms, DBSCAN identifies core points (with sufficient neighbors), border points (reachable from core), and noise.

  • In everyday terms, it groups together areas where people or vehicles congregate repeatedly — ignoring scattered outliers; the toy sketch below makes this concrete.
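Here is a tiny standalone illustration on synthetic points (not our trip data): two dense blobs plus scattered noise, with DBSCAN assigning the noise to cluster 0.

library(dbscan)

# Two dense blobs plus scattered noise
set.seed(42)
blob1 <- cbind(rnorm(50, mean = 0, sd = 0.2), rnorm(50, mean = 0, sd = 0.2))
blob2 <- cbind(rnorm(50, mean = 3, sd = 0.2), rnorm(50, mean = 3, sd = 0.2))
noise <- cbind(runif(15, -1, 4), runif(15, -1, 4))
pts   <- rbind(blob1, blob2, noise)

fit <- dbscan(pts, eps = 0.5, minPts = 5)
table(fit$cluster)  # cluster 0 collects the noise points; 1 and 2 are the dense blobs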

Why DBSCAN here?

  • It handles irregularly shaped clusters, ideal for urban trip paths.

  • It’s robust to noise, which is important when trip data includes detours or deliveries.

  • It outperforms techniques like K-means for spatial event detection, and unlike hierarchical clustering, it scales well to large datasets.

We enhance DBSCAN by layering zoning filters over the raw clusters and scoring each cluster centroid by POI proximity, with eps and minPts tuned to local POI and road density.

dbscan_enhanced <- function(gps_data, eps = 500, minPts = 10, poi, road_net, zoning) {
  library(sf)
  library(dbscan)
  library(dplyr)

  # Prepare spatial data: project to a metric CRS so eps can be given in meters
  gps_sf <- st_as_sf(gps_data, coords = c("lon", "lat"), crs = 4326)
  gps_transformed <- st_transform(gps_sf, crs = 3857)
  coords <- st_coordinates(gps_transformed)

  # Run DBSCAN clustering (eps is the neighborhood radius in meters)
  clust_result <- dbscan(coords, eps = eps, minPts = minPts)
  gps_transformed$cluster_id <- clust_result$cluster

  # Filter out noise (cluster_id = 0)
  clustered <- gps_transformed %>% filter(cluster_id > 0)

  # Join with the zoning layer (reprojected to match) to enforce policy constraints
  zoning_3857 <- st_transform(zoning, crs = 3857)
  clustered <- st_join(clustered, zoning_3857, join = st_within) %>%
    filter(!is.na(zoning_code))  # Drop points that fall outside any mapped zone

  # Calculate cluster centroids
  centroids <- clustered %>%
    group_by(cluster_id) %>%
    summarise(geometry = st_centroid(st_union(geometry)), .groups = "drop")

  # Score each centroid by the number of POIs within 500 m (poi is assumed to carry lon/lat columns)
  poi_sf <- st_transform(st_as_sf(poi, coords = c("lon", "lat"), crs = 4326), crs = 3857)
  centroids$poi_score <- lengths(st_is_within_distance(centroids, poi_sf, dist = 500))

  # road_net is reserved for tuning eps/minPts by local road density
  return(centroids %>% arrange(desc(poi_score)))
}

Explanation

The dbscan_enhanced function performs the following steps:

  1. Preprocessing: Converts raw longitude/latitude trip logs into an sf spatial object and reprojects it into a projected CRS (EPSG:3857) so that distances are expressed in meters.
  2. DBSCAN Clustering: Applies DBSCAN to the projected coordinates, with eps given directly in meters (500 m by default).
  3. Filtering: Removes noise points, then intersects the remaining points with the zoning layer and drops anything outside a mapped zone; the stricter commercial/mixed-use filter comes in Module 3.
  4. Relevance Scoring: For each cluster, it calculates a geometric centroid, then counts Points of Interest within 500 meters to produce a basic POI score.
  5. Ranking: Outputs a ranked list of cluster centroids based on how closely they align with known urban POIs.

This process simulates how a city planner might scan GPS logs, check compliance with policy constraints, and prioritize zones near activity hubs — but does so algorithmically, scalably, and reproducibly.
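As a quick illustration of how Modules 1 and 2 chain together (parameter values here are placeholders, not tuned settings):

# Illustrative run of Modules 1-2; eps and minPts would be tuned per city
data <- load_data()
clusters <- dbscan_enhanced(
  gps_data = data$gps,
  eps      = 500,        # neighborhood radius in meters
  minPts   = 10,         # minimum trips to form a dense core
  poi      = data$poi,
  road_net = data$roads,
  zoning   = data$zone
)
head(clusters)  # ranked cluster centroids with a basic POI score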

Module 3: POI-Constrained Filtering

Why POIs? Points of Interest (markets, schools, bus terminals) reflect areas of human activity. Filtering clusters based on POI density ensures candidate store locations are practical and walkable.

filter_clusters <- function(clust, poi, roads, zoning) {
  library(sf)
  library(dplyr)

  # clust is the sf object of cluster centroids from Module 2; work in a metric CRS
  clust_sf <- st_transform(clust, crs = 3857)
  zoning_m <- st_transform(zoning, crs = 3857)
  roads_m  <- st_transform(roads, crs = 3857)
  poi_m    <- st_transform(st_as_sf(poi, coords = c("lon", "lat"), crs = 4326), crs = 3857)

  # Join with zoning info and retain only commercial or mixed-use zones
  clust_zoned <- st_join(clust_sf, zoning_m, join = st_within) %>%
    filter(zoning_code %in% c("COMMERCIAL", "MIXED_USE"))

  # Keep only clusters within 150 m of the road network (accessibility constraint)
  near_roads <- st_is_within_distance(clust_zoned, roads_m, dist = 150)
  clust_zoned <- clust_zoned[lengths(near_roads) > 0, ]

  # Count POIs within 300 m of each cluster centroid
  clust_zoned$poi_count <- lengths(st_is_within_distance(clust_zoned, poi_m, dist = 300))

  # Rank candidates by POI density
  return(clust_zoned %>% arrange(desc(poi_count)))
}

Explanation

This filter_clusters function refines DBSCAN outputs by layering spatial and urban relevance constraints:

  1. Zoning Filtering ensures recommended points comply with local planning rules (commercial or mixed-use).
  2. Road Proximity Filtering excludes clusters isolated from the road network, which could hinder accessibility or increase setup costs.
  3. POI Proximity Scoring counts nearby Points of Interest (within 300 meters), estimating local footfall potential.
  4. Ranking clusters by POI scores prioritizes those embedded in dense urban ecosystems, making them more viable for retail deployment.

This filter mimics a planner’s checklist for site feasibility but implements it at scale using geospatial logic.
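Continuing the illustration from Module 2, Module 3 then narrows those clusters to policy-compliant, road-accessible candidates:

# Illustrative continuation of the pipeline sketch
candidates <- filter_clusters(
  clust  = clusters,     # centroids returned by dbscan_enhanced()
  poi    = data$poi,
  roads  = data$roads,
  zoning = data$zone
)
head(candidates)  # already ranked by nearby POI count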

Module 4: Demand Forecasting with LSTM

Long Short-Term Memory (LSTM) networks are well suited to learning from temporal data. They tend to outperform simpler models such as ARIMA at capturing nonlinear patterns, seasonality, and structural shifts.

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.optimizers import Adam

def build_lstm_model(input_shape):
    """Stacked LSTM for per-site demand forecasting; input_shape = (timesteps, features)."""
    model = Sequential()
    model.add(LSTM(64, input_shape=input_shape, return_sequences=True))  # first recurrent layer keeps the full sequence
    model.add(Dropout(0.2))                                              # regularization against overfitting
    model.add(LSTM(32))                                                  # second layer condenses the sequence to one state
    model.add(Dropout(0.2))
    model.add(Dense(1))                                                  # single-step demand prediction
    model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')
    return model

Explanation

This build_lstm_model function defines a deep learning architecture for time-series forecasting of retail demand at candidate locations. Here’s how it works:

  1. Model Structure:

    • Two LSTM layers capture long-term and short-term patterns in past data (like weekly, holiday, or seasonal variations).
    • Dropout layers act as regularization to prevent overfitting.
    • A final dense layer outputs the predicted demand volume for each site.
  2. Why LSTM?

    • Unlike linear models or ARIMA, LSTMs can handle nonlinear and irregular temporal signals — such as sudden spikes due to a nearby school opening or a market fair.
    • They maintain a memory of sequential context, ideal for modeling changes in urban foot traffic or logistics flow.
  3. Inputs might include:

    • Historical sales or delivery counts
    • Temporal features (e.g., day of week, holidays)
    • Event-based tags or local disruptions (e.g., flooding, road closures)

In production, we train this model separately for high-interest candidate zones, using backtesting to validate forecast accuracy before feeding demand projections into the multi-objective optimizer.
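The shape of those inputs matters: the LSTM expects fixed-length windows of past observations per candidate zone. Here is a minimal R-side sketch of that windowing step before export to the Python model; the lookback length and column names are illustrative.

# Build fixed-length training windows (lookback steps -> next-step target) for one zone.
# Column names (cluster_id, date, demand) are illustrative.
make_windows <- function(series, lookback = 28) {
  n <- length(series)
  if (n <= lookback) return(NULL)
  X <- t(sapply(1:(n - lookback), function(i) series[i:(i + lookback - 1)]))
  y <- series[(lookback + 1):n]
  list(X = X, y = y)  # X: (samples x lookback), y: next-step demand
}

# Example usage for a single candidate zone, ordered by date:
# zone <- demand_df[demand_df$cluster_id == 12, ]
# win  <- make_windows(zone[order(zone$date), "demand"], lookback = 28)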

Module 5: Multi-objective Optimization

Theoretical Background: NSGA-II (Non-dominated Sorting Genetic Algorithm II) is a multi-objective evolutionary algorithm that optimizes several competing goals simultaneously — producing a Pareto front instead of a single solution.

  • In academic terms, NSGA-II evolves a population of candidate solutions through selection, crossover, and mutation. It ranks them using Pareto dominance and diversity preservation (crowding distance).

  • In practical terms, it generates a range of trade-off solutions, e.g., cheap but far vs. expensive but central.

Why NSGA-II here?

  • Traditional solvers require combining all objectives into a single function — but that’s hard when trade-offs matter (e.g., cost vs. coverage).

  • NSGA-II gives decision-makers flexibility to choose a solution based on context (e.g., political preference for underserved zones).

We use it to balance:

  • Max demand coverage (from LSTM forecasts)

  • Min risk (flood/fire/crime exposure)

  • Max accessibility (road and POI reach)

  • Min cost (zoning premium, construction)

The outcome is a frontier of “best possible” options, not a single recommendation.

library(mco)
library(sf)

# Define multi-objective evaluation function
# Each candidate represents a flattened vector of lon/lat pairs for n sites
# For simplicity, mock scores are used below; in practice, you'd load forecasted demand and geospatial risk layers

evaluate_site_set <- function(candidate_vector) {
  n <- length(candidate_vector) / 2
  candidate_coords <- matrix(candidate_vector, ncol = 2, byrow = TRUE)  # column 1 = lon, column 2 = lat

  # Convert to a spatial object (used once the mock objectives below are replaced with real spatial joins)
  candidate_sf <- st_as_sf(as.data.frame(candidate_coords), coords = c(1, 2), crs = 4326)

  # Objective 1: Maximize forecast demand (use pre-trained LSTM output or an interpolated demand grid)
  demand_score <- runif(n, 0.5, 1.0)   # Replace with model predictions per site
  obj1 <- -sum(demand_score)           # Minimize negative demand to maximize it

  # Objective 2: Minimize risk score (e.g., flood or fire exposure)
  risk_score <- runif(n, 0.1, 0.6)     # Replace with a spatial join to raster or polygonal risk data
  obj2 <- mean(risk_score)

  # Objective 3: Minimize average distance to the nearest major road
  obj3 <- runif(1, 100, 500)           # Replace with st_distance() to the major-road layer

  # Objective 4: Minimize total deployment cost (e.g., based on zoning or land prices)
  obj4 <- runif(1, 500000, 1200000)    # Replace with land valuation data or a zoning penalty

  c(obj1, obj2, obj3, obj4)
}

# Configure optimizer
num_sites <- 5
result <- nsga2(
  fn = evaluate_site_set,
  idim = num_sites * 2, odim = 4,
  lower.bounds = rep(c(36.8, -1.4), num_sites),   # Nairobi bounding box: lon, lat per site
  upper.bounds = rep(c(37.1, -1.2), num_sites),
  popsize = 100, generations = 80
)

Explanation

This NSGA-II implementation searches for optimal combinations of retail site coordinates that balance four objectives:

  1. Forecasted Demand Coverage – Based on prior demand predictions from LSTM or a demand surface.
  2. Disaster or Environmental Risk – Based on flood, fire, or crime exposure at each site.
  3. Accessibility to Infrastructure – Measured as proximity to major roads or logistics networks.
  4. Deployment Cost – Based on zoning, land cost, or other development barriers.

Each candidate is a flattened vector of longitude/latitude pairs representing proposed store locations. NSGA-II evolves a population of these candidates, exploring trade-offs between cost, coverage, and safety — and yielding a Pareto front of optimal solutions for policymakers or retail strategists to choose from.
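After the run, the Pareto front can be read straight off the object returned by mco::nsga2. A short sketch; the column labels are ours:

# Inspect the Pareto-optimal candidates returned by nsga2()
pareto_values <- result$value[result$pareto.optimal, , drop = FALSE]  # objective values per surviving candidate
pareto_sites  <- result$par[result$pareto.optimal, , drop = FALSE]    # flattened lon/lat coordinates per candidate

colnames(pareto_values) <- c("demand", "risk", "road_dist_m", "cost")
pareto_values[, "demand"] <- -pareto_values[, "demand"]               # undo the sign trick used in objective 1

head(pareto_values[order(-pareto_values[, "demand"]), ])              # strongest-coverage options first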

Module 6: Visual Output (Updated for Shiny)

We use a Shiny dashboard to render interactive maps, KPIs, and scenario toggles. Users can:

  • Switch between current vs. forecasted demand layers
  • Toggle constraints (e.g., zoning strictness, disaster overlays)
  • Compare Pareto-front solutions visually

library(shiny)
library(leaflet)
library(dplyr)

# Assumes a cluster_scores data frame with lon, lat, and risk_score columns for each candidate site
ui <- fluidPage(
  titlePanel("Smart Store Location Explorer - Nairobi"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("scoreThreshold", "Risk Score Limit:", min = 0, max = 1, value = 0.3),
      checkboxInput("showForecast", "Show LSTM Forecast Layer", value = TRUE)
    ),
    mainPanel(
      leafletOutput("map"),
      tableOutput("scoreTable")
    )
  )
)

server <- function(input, output) {
  # Keep only candidate sites under the user-selected risk threshold
  filtered <- reactive({
    cluster_scores %>% filter(risk_score <= input$scoreThreshold)
  })

  output$map <- renderLeaflet({
    leaflet(filtered()) %>% addTiles() %>% addCircleMarkers(lng = ~lon, lat = ~lat)
  })

  # The showForecast toggle would add the LSTM demand layer here (e.g., via addPolygons)
  output$scoreTable <- renderTable({ head(filtered()) })
}

shinyApp(ui, server)

Final Thoughts

Choosing a retail site isn’t about planting a flag — it’s about planting roots where demand can grow, access is viable, and risks are manageable. AI doesn’t replace human judgment here — it augments it, especially when data is complex, fragmented, or future-facing.

This framework, though tested on Nairobi, is scalable to any city where growth outpaces infrastructure — from Bangalore to Bogotá.


Cognaptus: Automate the Present, Incubate the Future.