Opening — Why this matters now

Pedestrian fatalities are rising, mid-block crossings dominate risk exposure, and yet most models tasked with predicting pedestrian behavior remain stubbornly local. They perform well—until they don’t. Move them to a new street, a wider arterial, or a different land-use mix, and accuracy quietly collapses.

This is not a data problem. It’s a reasoning problem.

The paper behind today’s discussion proposes something unfashionable but effective: instead of forcing pedestrian behavior into ever-more elaborate numerical fits, let models reason about why people cross where they do. The result is PedX‑LLM, a vision‑ and knowledge‑enhanced large language model that treats crossing decisions less like a regression target and more like a human judgment call.

Background — What existed before (and why it stalled)

Pedestrian crossing inference has followed a predictable arc:

| Paradigm | Strength | Structural Limitation |
| --- | --- | --- |
| Logistic & hierarchical regression | Interpretable coefficients | Linear assumptions, weak interactions |
| Tree / boosting models | Nonlinear fits | Site‑specific pattern learning |
| Deep tabular models | Higher capacity | Overfitting, poor transfer |

Across all three, the failure mode is the same: they do not generalize. Models learn where people crossed in the past, not why they chose to cross.

LLMs, in theory, should help. They encode priors about human decision-making. In practice, most prior attempts stop at prompt engineering—text in, text out—without grounding those priors in transportation science or the physical environment. Worse, many rely on cloud APIs that are unusable for agencies constrained by privacy and data governance.

Analysis — What PedX‑LLM actually does

PedX‑LLM reframes the task entirely. Crossing choice (intersection vs. mid‑block) becomes a language reasoning problem, not a pure classification exercise.

1. Vision is used for context, not detection

Satellite imagery is processed by a vision‑language model (LLaVA) to produce textual descriptions of the built environment: road width, land‑use density, spatial organization. These descriptions—150–200 words per site—are fed downstream as contextual evidence, not as raw pixels.

This avoids brittle feature extraction while preserving macro‑scale spatial reasoning.
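To make this step concrete, here is a minimal sketch of how a satellite image might be turned into a downstream text description, assuming the open `llava-hf/llava-1.5-7b-hf` checkpoint in Hugging Face Transformers; the checkpoint, prompt wording, and generation settings are illustrative choices, not the paper's exact configuration.

```python
# Sketch: turn a satellite image into a textual site description with LLaVA.
# Checkpoint, prompt wording, and generation settings are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed open checkpoint, not necessarily the paper's
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def describe_site(image_path: str) -> str:
    """Return a roughly 150-200 word description of the built environment around a site."""
    image = Image.open(image_path)
    prompt = (
        "USER: <image>\n"
        "Describe this satellite view for a pedestrian-safety analyst: road width, "
        "number of lanes, land-use mix and density, and how buildings and open space "
        "are organized around the roadway. Keep it to roughly 150-200 words.\n"
        "ASSISTANT:"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=300, do_sample=False)
    text = processor.decode(output_ids[0], skip_special_tokens=True)
    # Keep only the assistant's answer; this is the text that flows downstream as context.
    return text.split("ASSISTANT:")[-1].strip()
```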

2. Domain knowledge is injected explicitly

Rather than hoping the model “figures it out,” PedX‑LLM encodes empirically validated behavioral rules directly into the prompt:

  • Older pedestrians exhibit higher delay tolerance
  • Solo walkers accept higher risk than groups
  • Lighting reduces perceived mid‑block risk
  • Wide arterials deter crossings nonlinearly

This transforms the LLM from a correlation engine into a constrained behavioral reasoner.
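A rough sketch of what this injection can look like in practice follows; the rule phrasing, field names, and `build_prompt` helper are illustrative assumptions rather than the paper's exact prompt template.

```python
# Sketch: assemble a reasoning prompt that pairs site and pedestrian features with
# explicit behavioral rules. Field names and rule wording are illustrative only.
BEHAVIORAL_RULES = [
    "Older pedestrians exhibit higher delay tolerance and favor controlled crossings.",
    "Solo walkers accept higher risk than people walking in groups.",
    "Good lighting reduces the perceived risk of mid-block crossings.",
    "Wide arterials deter mid-block crossings, and the deterrent grows nonlinearly with width.",
]

def build_prompt(site_description: str, pedestrian: dict, traffic: dict) -> str:
    rules = "\n".join(f"- {r}" for r in BEHAVIORAL_RULES)
    return (
        "You are a transportation behavior analyst.\n\n"
        f"Known behavioral findings:\n{rules}\n\n"
        f"Site context (from satellite imagery):\n{site_description}\n\n"
        f"Pedestrian: age group {pedestrian['age_group']}, "
        f"group size {pedestrian['group_size']}.\n"
        f"Traffic control: {traffic['control_type']}, lighting: {traffic['lighting']}.\n\n"
        "Question: will this pedestrian cross at the intersection or mid-block? "
        "Reason step by step, then answer with exactly one word: "
        "'intersection' or 'midblock'."
    )
```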

3. Fine‑tuning is surgical, not destructive

Using LoRA, only ~0.46% of parameters are trained. The base linguistic model remains intact; the adapter learns transportation‑specific reasoning. Combined with 4‑bit quantization, the entire system runs locally—no data leakage, no cloud dependencies.
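As a rough sketch of how that combination fits together with standard tooling (Hugging Face Transformers, `peft`, `bitsandbytes`), assuming a placeholder base checkpoint, rank, and target modules rather than the paper's actual recipe:

```python
# Sketch: 4-bit base model plus a LoRA adapter so only a small fraction of weights train.
# Base checkpoint, rank, and target modules are assumptions, not the paper's recipe.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; the paper's base model may differ

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```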

Findings — Results that actually generalize

In‑sample performance

| Model | Balanced Accuracy |
| --- | --- |
| Hierarchical Logistic Regression | 74.1% |
| CatBoost (best ML baseline) | 79.0% |
| PedX‑LLM (text‑only) | 75.0% |
| PedX‑LLM + vision | 77.9% |
| PedX‑LLM + vision + knowledge | 82.0% |

Vision adds context. Knowledge adds meaning. Together, they win.

Cross‑site generalization (the real test)

| Model | Balanced Accuracy (Unseen Sites) |
| --- | --- |
| CatBoost | 48.3% |
| TabNet | 43.6% |
| PedX‑LLM (zero‑shot) | 66.9% |
| PedX‑LLM (few‑shot, 5 examples) | 72.2% |

An eighteen-point gain over CatBoost on unseen sites, with no retraining at all, is not incremental; it is categorical.
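In the few-shot condition, transfer to a new corridor amounts to prepending a handful of labeled observations from that site to the query prompt. A minimal sketch, reusing the hypothetical `build_prompt` helper from the earlier sketch:

```python
# Sketch: few-shot transfer to an unseen site by prepending five labeled examples.
# Relies on the illustrative build_prompt() helper defined earlier.
def build_few_shot_prompt(site_description, examples, target_pedestrian, target_traffic):
    shots = []
    for ex in examples[:5]:  # five observations from the new site, as in the few-shot setting
        shots.append(
            build_prompt(site_description, ex["pedestrian"], ex["traffic"])
            + f"\nAnswer: {ex['label']}"  # observed choice: 'intersection' or 'midblock'
        )
    query = build_prompt(site_description, target_pedestrian, target_traffic)
    return "\n\n---\n\n".join(shots + [query])
```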

Interpretability — Reasoning you can audit

PedX‑LLM does not hide behind attention weights. Shapley‑based attribution decomposes each decision across seven prompt components; the table below lists five of them.

| Component | Contribution |
| --- | --- |
| Pedestrian demographics | 25.8% |
| Traffic control | 21.8% |
| Domain knowledge | 12.7% |
| Road geometry | 12.5% |
| Vision‑derived environment | 12.1% |

Crucially, the same infrastructure produces different effects for different pedestrians—exactly what conventional models fail to capture.
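For readers who want the mechanics, here is a minimal sketch of exact Shapley attribution over prompt components via ablation; `predict_prob` is a hypothetical scorer returning the model's probability of a mid-block crossing for a prompt built from a given subset of components, and the last two component names are placeholders, since only five of the seven are listed above.

```python
# Sketch: exact Shapley attribution over prompt components by ablating subsets.
# predict_prob(subset) is a hypothetical scorer; the last two component names are placeholders.
from itertools import combinations
from math import factorial

COMPONENTS = [
    "demographics", "traffic_control", "domain_knowledge",
    "road_geometry", "vision_environment", "trip_context", "temporal",
]

def shapley_values(predict_prob, components=COMPONENTS):
    n = len(components)
    values = {}
    for i, comp in enumerate(components):
        others = components[:i] + components[i + 1:]
        phi = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of this component on top of the given subset.
                phi += weight * (
                    predict_prob(set(subset) | {comp}) - predict_prob(set(subset))
                )
        values[comp] = phi
    return values
```

With seven components this is only 128 subset evaluations per decision, so exact enumeration is cheap enough to run per prediction rather than approximated by sampling.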

Implications — Why this matters beyond crossings

PedX‑LLM quietly demonstrates a broader pattern:

  • Multimodal LLMs perform best when vision is used descriptively, not predictively
  • Domain knowledge must be encoded, not assumed
  • Generalization emerges from reasoning constraints, not bigger datasets

For agencies, this means:

  • Viable zero‑shot evaluation of new sites
  • Privacy‑preserving local deployment
  • Models that explain why an intervention works

For AI practitioners, the lesson is sharper: foundation models do not replace domain science—they amplify it.

Conclusion — From fitting curves to understanding people

PedX‑LLM does not win by being larger or flashier. It wins by being disciplined: vision for context, knowledge for structure, language for reasoning.

In pedestrian safety—as in most applied AI problems—the future does not belong to models that memorize yesterday’s streets. It belongs to those that understand why people cross the line.

Cognaptus: Automate the Present, Incubate the Future.