Opening — Why this matters now
Pedestrian fatalities are rising, mid-block crossings dominate risk exposure, and yet most models tasked with predicting pedestrian behavior remain stubbornly local. They perform well—until they don’t. Move them to a new street, a wider arterial, or a different land-use mix, and accuracy quietly collapses.
This is not a data problem. It’s a reasoning problem.
The paper behind today’s discussion proposes something unfashionable but effective: instead of forcing pedestrian behavior into ever-more elaborate numerical fits, let models reason about why people cross where they do. The result is PedX‑LLM, a vision‑ and knowledge‑enhanced large language model that treats crossing decisions less like a regression target and more like a human judgment call.
Background — What existed before (and why it stalled)
Pedestrian crossing inference has followed a predictable arc:
| Paradigm | Strength | Structural Limitation |
|---|---|---|
| Logistic & hierarchical regression | Interpretable coefficients | Linear assumptions, weak interactions |
| Tree / boosting models | Nonlinear fits | Site‑specific pattern learning |
| Deep tabular models | Higher capacity | Overfitting, poor transfer |
Across all three, the failure mode is the same: generalization. Models learn where people crossed in the past, not why they chose to cross.
LLMs, in theory, should help. They encode priors about human decision-making. In practice, most prior attempts stop at prompt engineering—text in, text out—without grounding those priors in transportation science or the physical environment. Worse, many rely on cloud APIs that are unusable for agencies constrained by privacy and data governance.
Analysis — What PedX‑LLM actually does
PedX‑LLM reframes the task entirely. Crossing choice (intersection vs. mid‑block) becomes a language reasoning problem, not a pure classification exercise.
1. Vision is used for context, not detection
Satellite imagery is processed by a vision‑language model (LLaVA) to produce textual descriptions of the built environment: road width, land‑use density, spatial organization. These descriptions—150–200 words per site—are fed downstream as contextual evidence, not as raw pixels.
This avoids brittle feature extraction while preserving macro‑scale spatial reasoning.
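To make this concrete, here is a minimal sketch of the vision-to-text step, assuming a locally hosted LLaVA checkpoint served through Hugging Face transformers. The checkpoint ID, prompt wording, and file name are illustrative assumptions, not the paper's exact pipeline.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint, not the paper's
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Ask the VLM for an analyst-style description, not object detections.
prompt = (
    "USER: <image>\n"
    "Describe this satellite view for a pedestrian-safety analyst: road width, "
    "number of lanes, land-use mix, and how buildings and open space are "
    "organized around the roadway. Keep it to roughly 150 words.\n"
    "ASSISTANT:"
)

image = Image.open("site_042_satellite.png")  # hypothetical site image
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=300)
site_description = processor.decode(output_ids[0], skip_special_tokens=True)
```

The output is a short paragraph of text, which is exactly what the downstream language model consumes as evidence.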
2. Domain knowledge is injected explicitly
Rather than hoping the model “figures it out,” PedX‑LLM encodes empirically validated behavioral rules directly into the prompt:
- Older pedestrians exhibit higher delay tolerance
- Solo walkers accept higher risk than groups
- Lighting reduces perceived mid‑block risk
- Wide arterials deter crossings nonlinearly
This transforms the LLM from a correlation engine into a constrained behavioral reasoner.
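Below is a minimal sketch of what that injection can look like in practice. The rule phrasing, field names, and helper function are illustrative assumptions rather than the paper's actual template; the point is that the priors live in the prompt, where they can be read and audited.

```python
# Minimal sketch of knowledge-injected prompt assembly (illustrative only).
DOMAIN_RULES = [
    "Older pedestrians tolerate longer delays before crossing.",
    "Pedestrians walking alone accept higher crossing risk than groups do.",
    "Good lighting lowers the perceived risk of a mid-block crossing.",
    "Very wide arterials deter mid-block crossings, nonlinearly in width.",
]

def build_prompt(pedestrian: dict, site: dict, vision_description: str) -> str:
    """Combine behavioral priors, vision context, and case attributes."""
    rules = "\n".join(f"- {r}" for r in DOMAIN_RULES)
    return (
        "You are a transportation-behavior analyst.\n"
        f"Known behavioral findings:\n{rules}\n\n"
        f"Site description (from satellite imagery):\n{vision_description}\n\n"
        f"Site attributes: {site}\n"
        f"Pedestrian attributes: {pedestrian}\n\n"
        "Question: will this pedestrian cross at the intersection or mid-block? "
        "Answer with one word, then a one-sentence justification."
    )

prompt = build_prompt(
    pedestrian={"age_group": "65+", "group_size": 1},
    site={"road_width_m": 18, "signal_present": True, "lighting": "good"},
    vision_description="Wide four-lane arterial flanked by strip retail and surface parking.",
)
print(prompt)
```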
3. Fine‑tuning is surgical, not destructive
Using LoRA, only ~0.46% of parameters are trained. The base linguistic model remains intact; the adapter learns transportation‑specific reasoning. Combined with 4‑bit quantization, the entire system runs locally—no data leakage, no cloud dependencies.
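For readers who want the mechanics, here is a sketch of what that setup commonly looks like with Hugging Face peft and bitsandbytes. The base checkpoint, LoRA rank, and target modules are assumptions for illustration; the paper reports only the ~0.46% trainable-parameter share and the use of 4-bit quantization.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base_model_id = "meta-llama/Llama-2-7b-hf"  # illustrative base model

# 4-bit quantization so the frozen base model fits on a single local GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapter: only a small fraction of parameters becomes trainable,
# leaving the base linguistic model untouched.
lora_config = LoraConfig(
    r=16,                                  # rank chosen for illustration
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: adapter is a sliver of the model
```

The last line is the quick check that the adapter really is a fraction of a percent of the full model, which is what keeps fine-tuning cheap and the deployment local.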
Findings — Results that actually generalize
In‑sample performance
| Model | Balanced Accuracy |
|---|---|
| Hierarchical Logistic Regression | 74.1% |
| CatBoost (best ML baseline) | 79.0% |
| PedX‑LLM (text‑only) | 75.0% |
| PedX‑LLM + vision | 77.9% |
| PedX‑LLM + vision + knowledge | 82.0% |
Vision adds context. Knowledge adds meaning. Together, they win.
Cross‑site generalization (the real test)
| Model | Balanced Accuracy (Unseen Sites) |
|---|---|
| CatBoost | 48.3% |
| TabNet | 43.6% |
| PedX‑LLM (zero‑shot) | 66.9% |
| PedX‑LLM (few‑shot, 5 examples) | 72.2% |
Eighteen percentage points of improvement without retraining is not incremental—it’s categorical.
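The few-shot numbers come from nothing more exotic than prepending a handful of labeled cases from the unseen site to the query. A sketch, reusing the hypothetical `build_prompt` helper from the earlier snippet:

```python
def build_few_shot_prompt(examples, pedestrian, site, vision_description):
    """Prepend labeled cases from the new site before the query case.

    `examples` is a list of (pedestrian, site, vision_description, label)
    tuples; with five of them this mirrors the few-shot setting above.
    """
    shots = [
        f"Example {i}:\n{build_prompt(ped, st, desc)}\nAnswer: {label}\n"
        for i, (ped, st, desc, label) in enumerate(examples, start=1)
    ]
    return "\n".join(shots) + "\nNow the new case:\n" + build_prompt(
        pedestrian, site, vision_description
    )
```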
Interpretability — Reasoning you can audit
PedX‑LLM does not hide behind attention weights. Shapley‑based attribution decomposes each decision across seven prompt components; five of them are listed below.
| Component | Contribution |
|---|---|
| Pedestrian demographics | 25.8% |
| Traffic control | 21.8% |
| Domain knowledge | 12.7% |
| Road geometry | 12.5% |
| Vision‑derived environment | 12.1% |
Crucially, the same infrastructure produces different effects for different pedestrians—exactly what conventional models fail to capture.
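The attribution itself is simple enough to sketch: ablate prompt components, rerun the model, and weight the marginal effects. In the sketch below the value function and the last two component names are placeholders (the table lists five of the seven); this is exact Shapley, which with seven components means at most 2^7 = 128 prompt variants per decision.

```python
from itertools import combinations
from math import factorial

COMPONENTS = [
    "demographics", "traffic_control", "domain_knowledge", "road_geometry",
    "vision_environment", "trip_context", "crossing_history",  # last two are placeholder names
]

def shapley_attribution(value_fn, components):
    """Exact Shapley values over prompt components.

    `value_fn` takes a frozenset of included components, rebuilds the prompt
    with only those parts, reruns the model, and returns the probability it
    assigns to the observed crossing choice.
    """
    n = len(components)
    phi = {c: 0.0 for c in components}
    for c in components:
        rest = [x for x in components if x != c]
        for k in range(n):
            for subset in combinations(rest, k):
                s = frozenset(subset)
                weight = factorial(k) * factorial(n - 1 - k) / factorial(n)
                phi[c] += weight * (value_fn(s | {c}) - value_fn(s))
    return phi
```

In practice you would memoize `value_fn`, since the same subsets recur across components; the payoff is an auditable percentage breakdown like the table above.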
Implications — Why this matters beyond crossings
PedX‑LLM quietly demonstrates a broader pattern:
- Multimodal LLMs outperform when vision is descriptive, not predictive
- Domain knowledge must be encoded, not assumed
- Generalization emerges from reasoning constraints, not bigger datasets
For agencies, this means:
- Viable zero‑shot evaluation of new sites
- Privacy‑preserving local deployment
- Models that explain why an intervention works
For AI practitioners, the lesson is sharper: foundation models do not replace domain science—they amplify it.
Conclusion — From fitting curves to understanding people
PedX‑LLM does not win by being larger or flashier. It wins by being disciplined: vision for context, knowledge for structure, language for reasoning.
In pedestrian safety—as in most applied AI problems—the future does not belong to models that memorize yesterday’s streets. It belongs to those that understand why people cross the line.
Cognaptus: Automate the Present, Incubate the Future.