In the paper "Deep Reinforcement Learning for Urban Air Quality Management: Multi-Objective Optimization of Pollution Mitigation Booth Placement in Metropolitan Environments," Kirtan Rajesh and Suvidha Rupesh Kumar tackle an intricate urban challenge using AI: where to place air pollution mitigation booths across a city to optimize overall air quality under multiple, conflicting objectives.¹

The proposed solution uses Proximal Policy Optimization (PPO), a modern deep reinforcement learning algorithm, and a multi-dimensional reward function to model this real-world spatial optimization. But beneath the urban context lies a mathematical and algorithmic structure that holds powerful potential for business decision-making—especially where trade-offs between objectives are crucial.

The AI Under the Hood: PPO and Multi-Dimensional Reward Functions

PPO: Stabilized Policy Gradient

The PPO method used in the paper is grounded in the trust-region idea: each policy update is kept close to the previous policy. The central loss function combines the clipped surrogate objective with value and entropy terms:

$$ L(\theta) = \mathop{\mathbb{E}}\limits_{t}\bigl[ L_{\mathrm{CLIP}}(\theta) - c_1 L_{\mathrm{value}} + c_2 L_{\mathrm{entropy}} \bigr] $$

Where:

  • $L_{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]$
  • $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$
  • $L_{\text{value}} = \frac{1}{N} \sum_t (V_\theta(s_t) - R_t)^2$
  • $L_{\text{entropy}} = -\sum_a \pi_\theta(a|s) \log \pi_\theta(a|s)$

In the original paper, the policy network $\pi_\theta$ is implemented as a CNN that takes spatial inputs (like pollution heatmaps) and transforms them into action probabilities for booth placement. Thus, $\theta$ includes all CNN weights.

Training occurs end-to-end: PPO gradients flow from the reward loss back through the CNN layers to adapt feature maps to real-world outcomes.
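
As a concrete illustration of the clipped term, the minimal PyTorch sketch below evaluates $L_{\mathrm{CLIP}}$ on a toy batch of probability ratios and advantage estimates; the numbers are invented purely for illustration.

import torch

eps = 0.2  # clip range

# Toy batch: ratios r_t(theta) and advantage estimates A_hat_t (made-up values)
ratios = torch.tensor([0.7, 1.0, 1.5])
advantages = torch.tensor([1.0, -0.5, 2.0])

# min(r * A, clip(r, 1 - eps, 1 + eps) * A), averaged over the batch
unclipped = ratios * advantages
clipped = torch.clamp(ratios, 1 - eps, 1 + eps) * advantages
l_clip = torch.min(unclipped, clipped).mean()

print(l_clip)  # ~0.8667: the third ratio is clipped to 1.2, capping its term at 2.4

Because of the outer min, the clipped value is used only when it is the smaller of the two, which removes the incentive to push the ratio far outside $[1 - \epsilon, 1 + \epsilon]$.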


Generalized PPO for Business Decision-Making

The same loss function carries over to business settings unchanged: the network architecture becomes an implementation choice (CNN, MLP, Transformer), and $s_t$ becomes a general representation of the business environment (customer profile, market state, etc.).
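
As a minimal sketch of such an abstracted policy, the PyTorch module below maps a generic business-state vector to action probabilities; the layer sizes and the state/action dimensions are illustrative assumptions, and a network of this shape could stand in for the PolicyNetwork placeholder used in the training code later in this post.

import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a business-state vector s_t to a probability distribution over actions."""
    def __init__(self, state_dim=16, n_actions=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)  # pi_theta(a | s)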

Multi-Dimensional Rewards (with Constraints)

In the original paper:

$$ R(s_t, a_t) = w_1 R_{\text{local}} + w_2 R_{\text{global}} + w_3 R_{\text{pop}} + w_4 R_{\text{traffic}} + w_5 R_{\text{ind}} $$

Constraint violations are not treated as additive penalties but enforced via feasibility masks:

$$ \mathbb{I}_{\text{valid}}(a_t) = \begin{cases} 1 & \text{if the action is feasible} \\ 0 & \text{if a constraint is violated} \end{cases} $$

Final reward becomes:

$$ R^*(s_t, a_t) = R(s_t, a_t) \cdot \mathbb{I}_{\text{valid}}(a_t) $$
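
A minimal sketch of this masked reward, with illustrative weights and component values (not taken from the paper):

import numpy as np

def masked_reward(component_rewards, weights, action_is_feasible):
    """Weighted multi-objective reward, zeroed out when the action violates a constraint."""
    r = float(np.dot(weights, component_rewards))   # sum_i w_i * R_i
    return r if action_is_feasible else 0.0         # R* = R * I_valid

weights = [0.3, 0.25, 0.2, 0.15, 0.1]               # w_1 .. w_5 (illustrative)
components = [0.8, 0.6, 0.9, 0.4, 0.7]              # R_local, R_global, R_pop, R_traffic, R_ind
print(masked_reward(components, weights, action_is_feasible=True))   # ~0.70
print(masked_reward(components, weights, action_is_feasible=False))  # 0.0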

Generalized Reward Function for Business

For richer business modeling, the weighted sum can be wrapped in a nonlinear transform and gated by a constraint indicator:

$$ R(s_t, a_t) = \Phi\left(\sum_{i=1}^n w_i R_i(s_t, a_t)\right) \cdot \mathbb{I}_{\text{valid}}(a_t) $$

Where:

  • $R_i$: Component reward (e.g., sales, brand sentiment, operational efficiency)
  • $w_i$: Dynamic or state-conditioned weight
  • $\Phi(\cdot)$: Nonlinear transformation (log, softplus, sigmoid)
  • $\mathbb{I}_{\text{valid}}$: Constraint gate (e.g., budget, regulation)
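
A minimal sketch of this generalized reward, using softplus as an illustrative choice for $\Phi$ and a simple budget check standing in for the constraint gate (all names and numbers are assumptions):

import math

def softplus(x):
    return math.log(1.0 + math.exp(x))               # Phi(.): smooth, non-negative transform

def business_reward(component_rewards, weights, spend, budget):
    """R = Phi(sum_i w_i * R_i) * I_valid, where I_valid = 1 only if spend stays within budget."""
    weighted_sum = sum(w * r for w, r in zip(weights, component_rewards))
    i_valid = 1.0 if spend <= budget else 0.0
    return softplus(weighted_sum) * i_valid

# Illustrative components: sales, brand sentiment, operational efficiency
print(business_reward([1.2, 0.4, 0.8], [0.5, 0.3, 0.2], spend=90_000, budget=100_000))  # ~1.23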

Model Variants for Business Use

1. Generalized Advantage Estimation (GAE)

Purpose: Smooth advantage estimation to reduce variance without increasing bias significantly.

Mechanism: Instead of calculating the return-to-go directly, GAE computes the advantage as an exponentially weighted sum of temporal-difference residuals:

$$ \hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l} $$

Where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The discount $\gamma \in [0, 1]$ and decay $\lambda \in [0, 1]$ control the bias-variance trade-off.
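
To make the recursion concrete, the sketch below runs it on a made-up three-step trajectory with toy value estimates, purely to show how the $(\gamma \lambda)$ factor discounts later TD residuals:

gamma, lam = 0.99, 0.95

rewards = [1.0, 0.0, 2.0]          # toy rewards r_t
values = [0.5, 0.5, 0.5, 0.0]      # toy V(s_t), with V = 0 after the terminal step

# TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
deltas = [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]

# Backward recursion: A_hat_t = delta_t + gamma * lam * A_hat_{t+1}
advantages = []
adv = 0.0
for delta in reversed(deltas):
    adv = delta + gamma * lam * adv
    advantages.insert(0, adv)

print([round(a, 3) for a in advantages])  # ~[2.317, 1.406, 1.5]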

Pros:

  • Provides smoother gradients
  • Improves training stability

Cons:

  • Sensitive to $\lambda$ and $\gamma$ tuning

2. Multi-Agent PPO

Purpose: Coordinate decisions from different actors (e.g., departments, retail locations).

Mechanism: Each agent $i$ has its own policy $\pi_\theta^i$, and receives its own reward $R^i$, possibly augmented by shared global objectives:

$$ R^i = R^i_{\text{local}} + \eta R_{\text{global}} $$
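
A minimal sketch of this reward shaping for a handful of retail locations; the agent names, rewards, and $\eta$ are illustrative.

def agent_rewards(local_rewards, global_reward, eta=0.5):
    """Per-agent reward R_i = R_i_local + eta * R_global."""
    return {agent: r_local + eta * global_reward
            for agent, r_local in local_rewards.items()}

# Three stores sharing a company-wide revenue signal (made-up numbers)
local = {"store_a": 1.2, "store_b": -0.3, "store_c": 0.8}
print(agent_rewards(local, global_reward=0.6))  # store_a ~1.5, store_b ~0.0, store_c ~1.1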

Pros:

  • Models distributed decision-makers
  • Encourages competition/cooperation among agents

Cons:

  • Credit assignment becomes difficult
  • Computational complexity increases significantly

3. Hierarchical PPO

Purpose: Separate strategic decisions (long-term) and tactical actions (short-term).

Mechanism: High-level policy selects sub-goals; low-level policy selects atomic actions to fulfill them:

$$ \pi_{\text{high}}(g|s), \quad \pi_{\text{low}}(a|s, g) $$
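
A minimal sketch of this two-level decomposition in PyTorch: a high-level head samples a sub-goal, and a low-level head conditions on both the state and that sub-goal. The dimensions and layer sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalPolicy(nn.Module):
    """pi_high picks a sub-goal g from the state; pi_low picks an action conditioned on (s, g)."""
    def __init__(self, state_dim=16, n_goals=4, n_actions=8):
        super().__init__()
        self.pi_high = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_goals))
        self.pi_low = nn.Sequential(nn.Linear(state_dim + n_goals, 64), nn.ReLU(), nn.Linear(64, n_actions))

    def forward(self, state):
        goal_probs = F.softmax(self.pi_high(state), dim=-1)          # pi_high(g | s)
        goal = torch.multinomial(goal_probs, 1)                      # sample a sub-goal
        goal_onehot = F.one_hot(goal.squeeze(-1), goal_probs.shape[-1]).float()
        low_in = torch.cat([state, goal_onehot], dim=-1)
        action_probs = F.softmax(self.pi_low(low_in), dim=-1)        # pi_low(a | s, g)
        return goal, action_probs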

Pros:

  • Modular design fits multi-scale business needs
  • Enhances interpretability

Cons:

  • Requires reward design at both levels
  • Harder to train and debug

Full PPO + Multi-Reward Training Code (Python and R)

Python Implementation

import torch

# Define environment and policy.
# BusinessEnv, PolicyNetwork, ValueNetwork, Trajectory, and compute_returns are
# assumed to be defined elsewhere; Trajectory fields are assumed to be tensors.
env = BusinessEnv()
policy = PolicyNetwork()        # maps a state to action probabilities
value_fn = ValueNetwork()       # maps a state to a scalar value estimate
optimizer = torch.optim.Adam(
    list(policy.parameters()) + list(value_fn.parameters()), lr=3e-4
)

# PPO hyperparameters
EPISODES = 500
update_epochs = 10
c1, c2, eps = 0.5, 0.01, 0.2    # value coef, entropy coef, clip range

# Generalized Advantage Estimation (GAE)
def compute_GAE(traj, value_fn, gamma=0.99, lam=0.95):
    with torch.no_grad():
        values = value_fn(traj.states)                                # V(s_t)
        next_values = torch.cat([values[1:], values.new_zeros(1)])    # V(s_{t+1}), 0 after terminal step
        deltas = traj.rewards + gamma * next_values - values          # TD residuals delta_t
    advantages = []
    adv = 0.0
    for delta in reversed(deltas):
        adv = delta + gamma * lam * adv                               # A_t = delta_t + gamma*lam*A_{t+1}
        advantages.insert(0, adv)
    return torch.stack(advantages)

# Collect one episode of experience under the current policy
def collect_trajectory(policy, env):
    states, actions, rewards, old_probs = [], [], [], []
    state = env.reset()
    done = False
    while not done:
        with torch.no_grad():
            prob = policy(state)                                      # pi_old(. | s_t)
        action = torch.multinomial(prob, 1).item()
        next_state, reward, done = env.step(action)

        states.append(state)
        actions.append(action)
        rewards.append(reward)
        old_probs.append(prob[action])                                # pi_old(a_t | s_t)

        state = next_state
    return Trajectory(states, actions, rewards, old_probs)

# Training loop
for episode in range(EPISODES):
    traj = collect_trajectory(policy, env)
    advantages = compute_GAE(traj, value_fn)
    returns = compute_returns(traj.rewards)                           # discounted returns R_t

    for _ in range(update_epochs):
        probs = policy(traj.states)                                   # pi_theta(. | s_t), shape [T, n_actions]
        new_probs = probs.gather(1, traj.actions.view(-1, 1)).squeeze(1)
        ratio = new_probs / traj.old_probs                            # r_t(theta)
        clip_obj = torch.min(ratio * advantages,
                             torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)
        value_loss = (value_fn(traj.states) - returns).pow(2).mean()
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()

        total_loss = -clip_obj.mean() + c1 * value_loss - c2 * entropy
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()

R Implementation

ppo_train <- function(policy, value_fn, env, episodes = 500) {
  for (ep in 1:episodes) {
    traj <- collect_trajectory(policy, env)
    adv <- compute_gae(traj, value_fn)
    returns <- compute_returns(traj$rewards)            # discounted returns R_t (helper assumed)

    for (epoch in 1:10) {
      # r_t(theta): ratio of current to old action probabilities
      ratio <- policy$prob(traj$states, traj$actions) / traj$old_probs
      # clipped surrogate: clamp the ratio to [0.8, 1.2] before weighting by the advantage
      clipped <- pmin(ratio * adv, pmax(1 - 0.2, pmin(ratio, 1 + 0.2)) * adv)
      value_loss <- mean((value_fn$predict(traj$states) - returns)^2)
      entropy <- mean(policy$entropy(traj$states))

      total_loss <- -mean(clipped) + 0.5 * value_loss - 0.01 * entropy
      policy$update(total_loss)
    }
  }
}

collect_trajectory <- function(policy, env) {
  states <- list()
  actions <- rewards <- old_probs <- numeric(0)
  state <- env$reset()
  done <- FALSE
  while (!done) {
    prob <- policy$predict(state)                       # pi_old(. | s_t)
    action <- sample(seq_along(prob), 1, prob = prob)
    res <- env$step(action)
    next_state <- res$state
    reward <- res$reward
    done <- res$done

    states <- append(states, list(state))
    actions <- c(actions, action)
    rewards <- c(rewards, reward)
    old_probs <- c(old_probs, prob[action])             # pi_old(a_t | s_t)

    state <- next_state
  }
  list(states = states, actions = actions, rewards = rewards, old_probs = old_probs)
}

compute_gae <- function(traj, value_fn, gamma = 0.99, lambda = 0.95) {
  values <- value_fn$predict(traj$states)
  next_values <- c(values[-1], 0)                       # V(s_{t+1}), 0 after the terminal step
  deltas <- traj$rewards + gamma * next_values - values # TD residuals delta_t
  adv <- numeric(length(deltas))
  adv[length(deltas)] <- deltas[length(deltas)]
  for (t in rev(seq_len(length(deltas) - 1))) {
    adv[t] <- deltas[t] + gamma * lambda * adv[t + 1]   # A_t = delta_t + gamma*lambda*A_{t+1}
  }
  adv
}

PPO in Business Setup #1: Retail Pricing Strategy

  • State ($s$): previous sales volume, stock level, seasonality
  • Action ($a$): price point for a given SKU
  • Reward ($R$): based on margin, volume, and complaints
  • Policy: neural network outputs price probability distribution

Steps:

  1. Simulate customer response to different prices
  2. Compute reward via multi-dimensional function
  3. Estimate $\hat{A}_t$ via GAE
  4. Apply PPO update with clipped objective
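
A minimal sketch of step 2, the multi-dimensional reward for a single pricing decision; the weights and the complaint penalty are illustrative assumptions.

def pricing_reward(margin, units_sold, complaints,
                   w_margin=0.6, w_volume=0.3, w_complaints=0.1):
    """Combine unit margin, sales volume, and customer complaints into one scalar reward."""
    return w_margin * margin + w_volume * units_sold - w_complaints * complaints

# Illustrative example: $4.50 unit margin, 120 units sold, 3 complaints
print(pricing_reward(margin=4.50, units_sold=120, complaints=3))  # 0.6*4.5 + 0.3*120 - 0.1*3 ~ 38.4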

PPO in Business Setup #2: Logistics Routing

  • State ($s$): includes current vehicle location, time of day, weather conditions, and package urgency level
  • Action ($a$): select the next route segment or delivery sequence to follow
  • Policy: neural network policy $\pi_\theta(a|s)$ maps state to route decision probabilities

Mathematical Reward Expression

We define the reward as a composite of delivery performance:

$$ R(s_t, a_t) = -\alpha_1 \times \text{Time}(a_t) - \alpha_2 \times \text{FuelCost}(a_t) + \alpha_3 \times \text{Reliability}(a_t) - \lambda_1 \times I_{\text{late}}(a_t) $$

Where:

  • Time($a_t$): estimated delivery time in hours
  • FuelCost($a_t$): cost in dollars
  • Reliability($a_t$): historical success rate for that path
  • $I_{\text{late}}(a_t)$: binary indicator (1 if delivery is late, 0 otherwise)
  • Example parameters: $\alpha_1 = 0.5$, $\alpha_2 = 0.3$, $\alpha_3 = 0.2$, $\lambda_1 = 50$

Example Calculation

Suppose:

  • Time = 6 hours
  • FuelCost = 120 dollars
  • Reliability = 0.95
  • $I_{\text{late}} = 0$ (on time)

Then:

$$ R = -0.5 \times 6 - 0.3 \times 120 + 0.2 \times 0.95 = -3 - 36 + 0.19 = -38.81 $$
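
The same calculation as a minimal sketch, with the example parameter values above baked in as defaults:

def routing_reward(time_hours, fuel_cost, reliability, is_late,
                   a1=0.5, a2=0.3, a3=0.2, late_penalty=50):
    """R = -a1*Time - a2*FuelCost + a3*Reliability - late_penalty*I_late."""
    return -a1 * time_hours - a2 * fuel_cost + a3 * reliability - late_penalty * is_late

print(routing_reward(time_hours=6, fuel_cost=120, reliability=0.95, is_late=0))  # -38.81

On this raw scale the fuel-cost term dominates, so in practice the components would typically be normalized to comparable ranges before weighting.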


PPO in Business Setup #3: Campaign Allocation

  • State: market segment, time of day, platform saturation, remaining budget
  • Action: choose advertising channel (e.g., TikTok, Facebook, Email)
  • Policy: neural network $\pi_\theta(a|s)$ outputs distribution over ad channels

Mathematical Reward Expression

We define the reward based on conversion efficiency, cost, and sentiment uplift:

$$ R(s_t, a_t) = \alpha_1 \times \text{ConvRate}(a_t) - \alpha_2 \times \text{CPA}(a_t) + \alpha_3 \times \text{SentimentLift}(a_t) - \lambda_1 \times I_{\text{overbudget}}(a_t) $$

Where:

  • ConvRate($a_t$): observed conversion rate after placing ads on channel $a_t$
  • CPA($a_t$): cost per acquisition (dollars)
  • SentimentLift($a_t$): increase in brand sentiment from channel $a_t$
  • $I_{\text{overbudget}}(a_t)$: indicator for exceeding budget (1 if exceeded, else 0)
  • Typical values: $\alpha_1 = 0.5$, $\alpha_2 = 0.4$, $\alpha_3 = 0.1$, $\lambda_1 = 100$

Example Calculation

Suppose:

  • ConvRate = 0.04 (4%)
  • CPA = 25 USD
  • SentimentLift = 3.2 (on a scale of -5 to +5)
  • No budget overrun ($I_{\text{overbudget}} = 0$)

Then:

$$ R = 0.5 \times 0.04 - 0.4 \times 25 + 0.1 \times 3.2 = 0.02 - 10 + 0.32 = -9.66 $$
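
And the same calculation as a minimal sketch, again with the stated parameters as defaults:

def campaign_reward(conv_rate, cpa, sentiment_lift, over_budget,
                    a1=0.5, a2=0.4, a3=0.1, budget_penalty=100):
    """R = a1*ConvRate - a2*CPA + a3*SentimentLift - budget_penalty*I_overbudget."""
    return a1 * conv_rate - a2 * cpa + a3 * sentiment_lift - budget_penalty * over_budget

print(campaign_reward(conv_rate=0.04, cpa=25, sentiment_lift=3.2, over_budget=0))  # -9.66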


Final Thoughts

By borrowing mathematical structures from urban optimization—especially PPO with constraint-aware, multi-dimensional rewards—businesses can model decisions in richly nuanced ways. This approach balances multiple goals, adapts through learning, and ultimately leads to better, data-driven strategy formation.


  1. Kirtan Rajesh and Suvidha Rupesh Kumar, "Deep Reinforcement Learning for Urban Air Quality Management: Multi-Objective Optimization of Pollution Mitigation Booth Placement in Metropolitan Environments," arXiv:2505.00668. ↩︎