Green Lights, Smarter Cities: How Multi‑Agent Reinforcement Learning Is Rewiring Urban Traffic

Opening — Why this matters now

Every modern city has the same quiet enemy: the traffic light.

Not the hardware itself, of course, but the logic behind it. Most intersections still run on pre‑programmed schedules designed by traffic engineers years earlier. Rush hour arrives, a lane unexpectedly fills, and the light calmly continues its fixed cycle—green for empty roads, red for congested ones.

With urban congestion costing economies billions annually and drivers losing tens of hours per year to delays, the promise of AI‑controlled traffic systems has become increasingly attractive. Reinforcement learning (RL) has been studied for this purpose for more than two decades, with deep RL approaches dominating roughly the last one.

Yet there is a catch: most RL traffic systems work beautifully in simulations—and then collapse in the real world.

The research behind this article proposes a Multi‑Agent Reinforcement Learning (MARL) framework designed specifically to close that gap. Instead of training traffic lights to memorize patterns, it teaches them to handle uncertainty, coordinate with neighbors, and adapt continuously.

In other words: traffic lights that actually think.

Background — Why existing AI traffic systems struggle

Reinforcement learning treats each traffic signal as an agent interacting with its environment. The agent observes traffic conditions, chooses signal actions, and receives rewards based on metrics such as travel time or congestion.

Conceptually, this is elegant. Practically, it is messy.

Three structural problems have limited deployment.

1. Overfitting to static traffic patterns

Many RL systems train using fixed traffic flows and turning patterns. The result is a policy that memorizes timing patterns rather than learning traffic dynamics.

When real‑world traffic changes—morning peaks, accidents, weather disruptions—the trained model fails to adapt.

2. Unsafe or unstable action spaces

Traffic signals must follow cyclic phase sequences for safety reasons. Some RL models allow arbitrary phase switching, which may violate real‑world operational constraints or cause oscillating signals.
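
One common way to rule out arbitrary switching is to fix the phase order and let the agent change only how long each phase lasts. The following is a minimal sketch of that idea, not the paper's controller: the phase names, the starting durations, and the 10 s and 60 s green-time bounds are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical safety bounds on green time, in seconds (not from the paper).
MIN_GREEN_S = 10
MAX_GREEN_S = 60


@dataclass
class CyclicSignal:
    """Signal whose phase order is fixed; only durations can change."""
    phases: list
    durations: dict
    index: int = 0

    def current_phase(self):
        return self.phases[self.index]

    def adjust(self, delta_s):
        """Change the current phase's green time, clamped to the bounds."""
        phase = self.current_phase()
        new_duration = self.durations[phase] + delta_s
        self.durations[phase] = max(MIN_GREEN_S, min(MAX_GREEN_S, new_duration))
        return self.durations[phase]

    def advance(self):
        """Move to the next phase in the fixed cycle; skipping is impossible."""
        self.index = (self.index + 1) % len(self.phases)
        return self.current_phase()


signal = CyclicSignal(
    phases=["NS_through", "NS_left", "EW_through", "EW_left"],
    durations={"NS_through": 30, "NS_left": 15, "EW_through": 30, "EW_left": 15},
)
signal.adjust(+8)    # NS_through green becomes 38 s
signal.adjust(+40)   # request exceeds the cap, so it is clamped to 60 s
signal.advance()     # the only legal successor is NS_left
```

Because the agent's action is a duration change rather than a phase choice, no sequence of actions can produce an out-of-order or rapidly flipping signal.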

3. Scalability problems

Large traffic networks require coordination across many intersections.

Two naive observation strategies exist:

Observation scope	Advantage	Problem
Local	Scalable	Cannot coordinate traffic waves
Global	Optimal information	Computationally infeasible

What cities actually need is a middle ground.

Analysis — The proposed MARL architecture

The framework introduces three technical ideas that together produce a robust traffic‑control system.

Component	Purpose	Business / system implication
Turning Ratio Randomization	Prevent overfitting to training flows	Improves real‑world adaptability
Exponential Phase Adjustment	Stabilize signal timing	Smooth response to congestion
Neighbor‑based CTDE coordination	Enable scalable cooperation	Works across city‑scale networks

Each piece addresses one of the deployment barriers.

1. Turning Ratio Randomization — training for uncertainty

Traffic is fundamentally stochastic. Vehicles turn left, right, or continue straight in proportions that fluctuate throughout the day.

Instead of training agents on a single turning pattern, the framework randomizes turning ratios in every training episode.

The perturbation follows three steps:

Sample multiplicative noise
Apply scaling to each turning ratio
Renormalize the probabilities

Conceptually:

Original ratio → Random noise → Rescaled ratio → Normalized probability
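
The three steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: the article does not say which noise distribution or magnitude is used, so the uniform noise and the `noise_scale` parameter below are hypothetical, as are the example ratios.

```python
import random


def randomize_turning_ratios(ratios, noise_scale=0.3, rng=random):
    """Perturb turning ratios with multiplicative noise, then renormalize.

    The uniform distribution and `noise_scale` are assumptions made for
    illustration; the article does not specify them.
    """
    # Step 1: sample one multiplicative factor per movement.
    factors = [rng.uniform(1.0 - noise_scale, 1.0 + noise_scale) for _ in ratios]
    # Step 2: scale each original ratio by its factor.
    scaled = [r * f for r, f in zip(ratios.values(), factors)]
    # Step 3: renormalize so the ratios sum to 1 again.
    total = sum(scaled)
    return {move: s / total for move, s in zip(ratios, scaled)}


base = {"left": 0.2, "through": 0.6, "right": 0.2}
rng = random.Random(0)  # seeded so the sampled episodes are reproducible
for episode in range(3):
    print(episode, randomize_turning_ratios(base, rng=rng))
```

Each call yields a different but still valid probability distribution over movements, so no two training episodes present exactly the same turning behavior.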

This simple change forces the agent to interpret state observations instead of memorizing timing schedules.

In machine‑learning terms, it acts as a domain randomization technique, similar to those used in robotics to bridge the sim‑to‑real gap.

2. Exponential phase duration adjustment

RL traffic controllers that adjust signal durations typically use linearly spaced step sizes.

Example action set:

{0, ±2, ±4, ±6, ±8}

This creates a trade‑off:

Small steps → precise but slow response
Large steps → fast but unstable

The paper proposes exponential step sizes instead:

{0, ±1, ±2, ±4, ±8}

This creates a coarse‑to‑fine control system:

Adjustment magnitude	Role
±1	fine tuning during stable traffic
±2 / ±4	moderate correction
±8	rapid response to congestion

The result resembles multi‑scale control systems used in robotics and power grids—small adjustments most of the time, large corrections only when needed.
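
To make the coarse-to-fine idea concrete, here is a toy comparison of the two action sets. It is not the learned policy: it assumes a hypothetical greedy controller that already knows the required correction and always picks the largest step that does not overshoot.

```python
# Step magnitudes taken from the two action sets quoted in the article.
LINEAR_STEPS = [0, 2, 4, 6, 8]
EXPONENTIAL_STEPS = [0, 1, 2, 4, 8]


def greedy_adjust(error_s, magnitudes):
    """Apply the largest non-overshooting step until no step fits.

    Returns the signed steps taken and the leftover error.
    The greedy rule is an illustrative stand-in for a trained agent.
    """
    taken = []
    usable = sorted((m for m in magnitudes if m > 0), reverse=True)
    while True:
        step = next((m for m in usable if m <= abs(error_s)), None)
        if step is None:
            break
        signed = step if error_s > 0 else -step
        taken.append(signed)
        error_s -= signed
    return taken, error_s


print(greedy_adjust(13, EXPONENTIAL_STEPS))  # ([8, 4, 1], 0)
print(greedy_adjust(13, LINEAR_STEPS))       # ([8, 4], 1)
```

Both sets have the same number of actions and the same maximum step, but only the exponential set can close the final one-unit gap, because the smallest nonzero linear step is 2.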

3. Neighbor‑level coordination via CTDE

The system models traffic networks as a Decentralized Partially Observable Markov Decision Process (Dec‑POMDP) where each intersection is an agent.

To coordinate agents efficiently, the framework uses Centralized Training with Decentralized Execution (CTDE).

During training:

A centralized critic observes the entire network
Agents learn cooperative strategies

During deployment:

Each signal observes only itself and its neighbors
Decisions remain decentralized and scalable

This design allows the system to approximate global coordination while avoiding the computational explosion of full network observation.
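
The difference between the two observation scopes can be sketched as follows. This is a simplified illustration, not the paper's code: the line-shaped five-intersection topology, the two features per intersection, and the zero-padding for edge intersections are all assumptions.

```python
# Hypothetical 5-intersection corridor: each node lists its direct neighbors.
NEIGHBORS = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}


def actor_observation(agent, local_obs, neighbors=NEIGHBORS, max_neighbors=2):
    """Decentralized execution input: own features plus direct neighbors'."""
    feature_len = len(local_obs[agent])
    obs = list(local_obs[agent])
    for n in neighbors[agent]:
        obs.extend(local_obs[n])
    # Pad edge intersections so every actor sees a fixed-size vector.
    missing = max_neighbors - len(neighbors[agent])
    obs.extend([0.0] * (missing * feature_len))
    return obs


def critic_observation(local_obs):
    """Centralized training input: every intersection's features, in id order."""
    return [x for agent in sorted(local_obs) for x in local_obs[agent]]


# Toy features per intersection: [queue length, current phase index].
local_obs = {i: [float(3 * i), float(i % 2)] for i in range(5)}

print(actor_observation(2, local_obs))     # own features, then neighbors 1 and 3
print(len(critic_observation(local_obs)))  # grows with the whole network
```

The actor's input size depends only on the number of neighbors, so it stays constant as the network grows, while the critic's input grows with every added intersection. That cost is paid only during training.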

Findings — Experimental results

The framework was evaluated using PTV Vissim, a high‑fidelity microscopic traffic simulator widely used in transportation engineering.

The simulated network reproduced a real corridor in Taoyuan City, Taiwan, with five closely spaced intersections.

Training used peak‑hour traffic only, while testing included both peak and off‑peak scenarios to measure generalization.

Performance comparison

Method	Peak Travel Time	Off‑Peak Travel Time	Observation
Fixed timing	383.9 s	129.2 s	Baseline control
MaxPressure heuristic	265.8 s	126.6 s	Classic adaptive method
Standard RL	~249–266 s	~130–135 s	Overfits to training conditions
Robust MARL (proposed)	230.6 s	124.4 s	Best scalable model

Key takeaway:

Average waiting time dropped by more than 10% compared with leading RL baselines.

Even more interesting: the model trained under randomized conditions generalized effectively to traffic patterns it had never seen.
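
For readers who want relative numbers, the peak-hour column of the table can be turned into percentage reductions. Note that the table reports travel time, so these figures are travel-time reductions and are not the waiting-time figure quoted above. The Standard RL entries use the approximate endpoints of the range given in the table.

```python
def reduction_pct(baseline_s, proposed_s):
    """Relative travel-time reduction of the proposed method, in percent."""
    return 100.0 * (baseline_s - proposed_s) / baseline_s


# Peak travel times in seconds, copied from the table above.
PEAK_BASELINES = {
    "Fixed timing": 383.9,
    "MaxPressure heuristic": 265.8,
    "Standard RL (low end of range)": 249.0,   # approximate in the source
    "Standard RL (high end of range)": 266.0,  # approximate in the source
}
PROPOSED_PEAK = 230.6

for name, baseline in PEAK_BASELINES.items():
    pct = reduction_pct(baseline, PROPOSED_PEAK)
    print(f"{name}: {pct:.1f}% lower peak travel time")
```

Peak travel time comes out roughly 39.9% lower than fixed timing, 13.2% lower than MaxPressure, and between about 7.4% and 13.3% lower than the standard RL range.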

CTDE vs decentralized training

Another experiment compared the CTDE architecture with a fully decentralized RL algorithm. ATT in the remaining tables stands for average travel time.

Algorithm	Peak ATT	Off‑Peak ATT
Independent PPO (IPPO)	298.4 s	134.2 s
MAPPO with CTDE	230.6 s	124.4 s

The centralized critic significantly improves cooperation among intersections.

Without it, each agent tends to optimize its own intersection rather than the corridor as a whole.

Action space comparison

The exponential adjustment scheme also outperformed linear alternatives.

Action design	Peak ATT	Off‑Peak ATT
Linear adjustments	263–283 s	145–158 s
Exponential adjustments	230–234 s	124–126 s

The exponential structure prevents oscillation while maintaining responsiveness during congestion spikes.

Implications — Why this matters beyond traffic

At first glance, this research seems purely about transportation.

It is not.

It illustrates a broader pattern emerging across AI systems deployed in physical infrastructure.

1. Robust training beats perfect simulation

Randomized environments force models to learn general principles rather than memorized patterns.

This technique is increasingly used in robotics, autonomous driving, and industrial control systems.

2. Multi‑agent systems are becoming the default architecture

Urban infrastructure—from traffic networks to power grids—cannot be controlled by a single centralized brain.

Coordinated agents with limited local observations are the scalable solution.

3. AI control systems must respect real‑world constraints

Designing safe action spaces (like cyclic signal phases) is as important as designing good learning algorithms.

This is a reminder that AI engineering is partly machine learning and partly systems engineering.

Conclusion — Toward self‑organizing cities

Smart cities have long promised intelligent infrastructure, yet many of today’s urban control systems remain rule‑based relics.

The MARL framework described here moves the field closer to something different:

A network of traffic signals that learn, coordinate, and adapt continuously.

If deployed at scale, such systems could reduce congestion, emissions, and travel times without building a single new road.

Which is quietly revolutionary.

Cities rarely expand faster than their traffic problems.

But with AI, the streets themselves may finally start thinking.

Cognaptus: Automate the Present, Incubate the Future.
