The delivery route is not a sentence

A delivery van does not move like a sentence.

It stops. It waits. It turns left because a road exists, not because grammar allows it. Its next point depends on geography, time of day, congestion, driver behavior, business constraints, and occasionally the small civic miracle of a loading bay being available. A language model sees the world as tokens arranged in sequence. A trajectory model sees movement as a sequence too, but the symbols are less polite: latitude, longitude, timestamp, region, point of interest, dwell time, elapsed time, and missing segments.

That is why the paper “Building a Foundation Model for Trajectory from Scratch” is useful, even though it is not trying to win the usual leaderboard beauty contest. [1] Its value is not “here is the next gigantic model for mobility.” Its value is “here is what has to be changed before a GPT-style model can even pretend to understand movement.”

That distinction matters. The AI industry has become very good at treating “foundation model” as a magic adjective. Add it to medicine, finance, law, manufacturing, or urban systems, and suddenly the slide deck starts glowing. Mobility data is less impressed. You cannot simply throw GPS traces into a text model and hope the Transformer develops a sense of roads, time, and distance. The model first needs a new way to read the input.

This paper is best read as a code-driven educational bridge: start with GPT-2, remove the parts that only make sense for words, replace them with spatiotemporal machinery, and then compare the simplified prototype with more advanced trajectory foundation models such as TrajFM and TrajGPT. The point is architectural literacy. Slightly less glamorous than “AI predicts the city,” yes. Also far more useful.

The first substitution: words become points, and tokenization breaks

GPT-2 begins with a familiar move: text is split into tokens, each token maps to an embedding, positional information is added, and Transformer blocks learn relationships across the sequence. That pipeline works because language tokens belong to a vocabulary. “Truck,” “arrived,” and “late” can be represented as discrete items. The model predicts the next token from a finite set.

A trajectory is not like that. Latitude and longitude are continuous. Time is continuous but also cyclic. A timestamp can be split into hour, day of week, minute, elapsed time, and other features. A location may also have semantic meaning: warehouse, office tower, highway segment, port gate, airport terminal, mall entrance. Treating all this as if it were a sentence is the first mistake.

The paper therefore begins its adaptation by removing the normal tokenizer and replacing it with a custom trajectory parser. Each trajectory point becomes a structured spatiotemporal record rather than a word-like symbol. Coordinates are normalized. Time is decomposed into useful features. The authors also suggest using small pretext models, such as autoencoders, to check whether the encoding still preserves the original position.
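To make the parsing step concrete, here is a minimal sketch in plain Python. The bounding box, the choice of features, and the names `encode_point` and `decode_position` are illustrative assumptions, not the paper's code. The paper suggests a small autoencoder as a sanity check; the direct inverse below is a cheaper stand-in for the same question: can the original position still be recovered from the encoding?

```python
import math
from datetime import datetime, timezone

# Hypothetical bounding box used for normalization; chosen only for illustration.
LAT_MIN, LAT_MAX = 50.76, 50.92
LON_MIN, LON_MAX = 4.24, 4.48


def encode_point(lat, lon, unix_ts, start_ts):
    """Turn one raw GPS fix into a flat list of numeric features."""
    t = datetime.fromtimestamp(unix_ts, tz=timezone.utc)
    hour_angle = 2 * math.pi * (t.hour + t.minute / 60) / 24
    dow_angle = 2 * math.pi * t.weekday() / 7
    return [
        (lat - LAT_MIN) / (LAT_MAX - LAT_MIN),       # latitude scaled to [0, 1]
        (lon - LON_MIN) / (LON_MAX - LON_MIN),       # longitude scaled to [0, 1]
        math.sin(hour_angle), math.cos(hour_angle),  # cyclic time of day
        math.sin(dow_angle), math.cos(dow_angle),    # cyclic day of week
        (unix_ts - start_ts) / 3600.0,               # elapsed hours since trip start
    ]


def decode_position(features):
    """Invert the spatial part, as a cheap check that position survives encoding."""
    lat = features[0] * (LAT_MAX - LAT_MIN) + LAT_MIN
    lon = features[1] * (LON_MAX - LON_MIN) + LON_MIN
    return lat, lon


# One fix, recorded 30 minutes after the trip started.
feats = encode_point(50.84, 4.36, 1_700_000_000, 1_700_000_000 - 1800)
lat, lon = decode_position(feats)
```

The sine and cosine pairs keep 23:59 close to 00:00, which a raw hour number would not. Whether these are the right features is exactly the kind of decision that depends on the business problem.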

That small detail deserves more attention than it usually gets. Encoding is not administrative plumbing. It is the place where the business problem quietly becomes the machine-learning problem. If a logistics firm encodes vehicle movement badly, the model may learn artifacts of the coordinate system rather than movement patterns. If a retail analytics team ignores time structure, weekend shopping trips and weekday commutes may collapse into the same behavioral soup. If an insurer encodes location too coarsely, “risky driving” may become indistinguishable from “the driver lives near a badly designed intersection.”

A language tokenizer asks, “Which word is this?” A trajectory encoder must ask several questions at once:

| Input element | Why text-style GPT cannot use it directly | Mobility-model replacement |
| --- | --- | --- |
| Latitude and longitude | Continuous values, not vocabulary items | Normalized spatial coordinates projected into vectors |
| Timestamp | Continuous, cyclic, and context-dependent | Split temporal features such as hour, minute, day, and elapsed time |
| Sequence order | Position is not merely word order | Positional or spatiotemporal encoding suited to trajectories |
| Missing route segments | Not equivalent to missing words | Masking strategies for points, dimensions, or whole segments |
| Movement target | Next point may be easier as a change than an absolute coordinate | Delta prediction between consecutive points |

This is the paper’s first useful correction to the popular view of foundation models. The foundation is not the Transformer block alone. The foundation is the full interface between messy domain data and a sequence learner. The Transformer is the visible architecture. The encoder is where the domain enters the room.

Projection layers are where raw movement becomes model language

Once trajectory points are encoded as structured features, they still need to be converted into a representation the Transformer can process. GPT-2 uses an embedding matrix for tokens. The educational trajectory model replaces that with a learnable projection layer.

Mechanically, this is simple: map the input trajectory features into a higher-dimensional vector space. Conceptually, it is the moment where continuous movement becomes “model language.” Latitude, longitude, time components, and other trajectory features are no longer treated as separate spreadsheet columns. They become a dense vector representation that downstream Transformer blocks can manipulate.
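As a rough illustration of that mapping, the sketch below implements a linear projection in plain Python. The class name `LinearProjection`, the 7-feature input, and the 16-dimensional output are assumptions chosen for readability. The weights here are random and untrained; in a real model they would be learned, typically with a layer such as PyTorch's `torch.nn.Linear`.

```python
import random


class LinearProjection:
    """y = W x + b, mapping in_dim input features to d_model dimensions."""

    def __init__(self, in_dim, d_model, seed=0):
        rng = random.Random(seed)
        scale = in_dim ** -0.5  # small random initial weights
        self.weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
                       for _ in range(d_model)]
        self.bias = [0.0] * d_model

    def __call__(self, features):
        # One dot product per output dimension.
        return [sum(w * x for w, x in zip(row, features)) + b
                for row, b in zip(self.weight, self.bias)]


proj = LinearProjection(in_dim=7, d_model=16)
# A made-up encoded point: two normalized coordinates plus five time features.
point = [0.5, 0.5, 0.0, 1.0, 0.0, 1.0, 0.25]
embedding = proj(point)  # a 16-dimensional vector the Transformer blocks can consume
```

The operation is only a matrix multiplication. The design decision is which features go into `point` in the first place.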

The paper’s tutorial approach is useful because it makes this replacement explicit. In many AI papers, projection layers appear as a few lines in the architecture diagram and then politely vanish. Here they are central. Without this projection step, a Transformer inherited from text modeling has no clean way to process continuous mobility data.

For business readers, the projection layer is not just a technical layer. It is the boundary between operational reality and reusable intelligence. A fleet-management model, a city-planning model, and a retail-footfall model may all use trajectory sequences, but they do not necessarily care about the same features. One may emphasize travel time; another dwell time; another cross-region transferability; another anomalous route deviation. The projection layer is where those features are made available to the model.

This also explains why “foundation model for mobility” should not be interpreted too casually. A model trained on taxi traces in one city may learn useful movement priors, but whether those priors transfer to delivery vans, bus routes, warehouse forklifts, port trucks, or mall visitors is not automatic. The input representation has to carry the right invariances. It must preserve what generalizes and avoid overfitting to what merely happens to be true in one dataset.

The paper does not solve that transfer problem empirically. It gives readers the machinery needed to understand where the problem lives.

Delta prediction: sometimes the next move is easier than the next place

A major design choice in the tutorial model is delta encoding. Instead of predicting an absolute next coordinate and timestamp, the model predicts changes between consecutive trajectory points.

That sounds like a small mathematical convenience. It is more than that.

Absolute coordinates vary widely. The same physical movement can have very different coordinate values depending on the city, coordinate system, data collection process, and route length. Predicting “the next latitude and longitude” asks the model to learn global position and movement at the same time. Predicting a delta narrows the task: given the current point, estimate the next movement step.

For autoregressive modeling, this is natural. The model moves one step at a time. It asks not “where in the universe are we going?” but “from here, what is the likely change?” That is closer to how routing, tracking, and trajectory completion often operate in practice.
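A minimal sketch of the idea, assuming each point is a `(lat, lon, t)` tuple. The helper names `to_deltas` and `from_deltas` are hypothetical, not taken from the paper.

```python
def to_deltas(points):
    """Return the first point and the per-step changes between consecutive points."""
    deltas = [tuple(b - a for a, b in zip(prev, cur))
              for prev, cur in zip(points, points[1:])]
    return points[0], deltas


def from_deltas(start, deltas):
    """Rebuild absolute points by accumulating deltas from the start point."""
    points = [start]
    for d in deltas:
        points.append(tuple(p + dp for p, dp in zip(points[-1], d)))
    return points


trajectory = [(50.0, 4.0, 0), (50.5, 4.25, 60), (50.5, 4.25, 180)]
start, deltas = to_deltas(trajectory)
# deltas == [(0.5, 0.25, 60), (0.0, 0.0, 120)]
rebuilt = from_deltas(start, deltas)
```

The second delta shows why the representation is informative: zero spatial change with a 120-second time change is a stop, and it is visible without knowing where on the map the vehicle is. The round trip also shows the cost. Absolute positions come back only by accumulation, so prediction errors add up along the route.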

For business interpretation, delta prediction is especially relevant in operational systems where relative movement matters more than absolute identity. In last-mile delivery, the next turn, next delay, or next deviation may matter more than the raw coordinate. In vehicle anomaly detection, unusual changes can reveal detours, route violations, or sensor problems. In ETA estimation, time deltas may be more useful than raw timestamps.
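To show how step-level changes can surface anomalies even without a learned model, here is a hedged sketch that flags steps whose implied speed is implausible. The haversine formula is standard; the function names, the 40 m/s threshold, and the sample points are illustrative assumptions, and a real system would tune the threshold per vehicle type and sampling rate.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def flag_fast_steps(points, max_speed_mps=40.0):
    """Return indices of steps whose implied speed exceeds max_speed_mps."""
    flagged = []
    for i, ((la1, lo1, t1), (la2, lo2, t2)) in enumerate(zip(points, points[1:])):
        dt = t2 - t1
        if dt <= 0:
            flagged.append(i)  # non-increasing timestamps are suspicious too
            continue
        if haversine_m(la1, lo1, la2, lo2) / dt > max_speed_mps:
            flagged.append(i)
    return flagged


# (lat, lon, seconds); the second step jumps roughly 11 km in 10 seconds.
points = [(50.0, 4.0, 0), (50.001, 4.0, 10), (50.1, 4.0, 20), (50.1005, 4.0, 30)]
suspicious = flag_fast_steps(points)
```

A rule like this catches sensor glitches. It does not catch a detour driven at a perfectly legal speed, which is where a learned model of typical deltas would have to earn its keep.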

But the boundary is equally important. Delta prediction does not magically solve geography. Roads constrain possible movement. Traffic changes movement. Business rules constrain stops. A model that predicts deltas from traces still needs enough data and structure to distinguish a legal movement pattern from a geometrically possible but operationally absurd one. The paper’s tutorial model makes the target more learnable; it does not replace routing engines, map matching, or domain validation.

That is a recurring theme here: the model becomes plausible only after several careful substitutions, not because GPT-style architecture alone has spatial common sense. It does not. It has matrix multiplication with ambition.

Masking turns route prediction into gap repair

The tutorial also introduces a masked variant of the trajectory model. This allows the model to fill missing segments of a trajectory, not only predict the next point.

This is where the mobility analogy to language becomes more productive. In language modeling, masking can train a model to recover missing words or reason from surrounding context. In trajectory modeling, masking can train a model to infer missing points, missing dimensions, or missing route segments. That maps naturally onto real mobility data, where gaps are common.

GPS signals drop. Mobile devices sleep. Logistics systems record stops but not the path between them. Ride-hailing traces may be partially anonymized. Warehouse or port movement data may be irregular. In other words, missingness is not a rare data problem. It is part of the operating environment.
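The three kinds of gap mentioned above (missing points, missing dimensions, missing segments) can each be simulated with a few lines. This is a sketch under assumptions: the function names are hypothetical, points are `(lat, lon, t)` tuples, and `None` stands in for whatever mask token or mask embedding a real model would substitute.

```python
import random

MASK = None  # placeholder for a hidden value


def mask_points(traj, ratio, rng):
    """Hide whole points chosen at random."""
    k = max(1, int(len(traj) * ratio))
    hidden = set(rng.sample(range(len(traj)), k))
    return [MASK if i in hidden else p for i, p in enumerate(traj)]


def mask_dimension(traj, dim):
    """Hide one field (for example time) in every point, keeping the rest."""
    return [tuple(MASK if j == dim else v for j, v in enumerate(p)) for p in traj]


def mask_segment(traj, start, length):
    """Hide a contiguous run of points, like a GPS dropout in a tunnel."""
    return [MASK if start <= i < start + length else p for i, p in enumerate(traj)]


traj = [(float(i), float(i), i * 10) for i in range(6)]
scattered = mask_points(traj, 0.5, random.Random(0))
no_time = mask_dimension(traj, dim=2)
tunnel = mask_segment(traj, start=2, length=3)
```

During training, the model sees the masked version and is scored on how well it recovers the hidden values. Segment masking is the hardest of the three, because the model cannot interpolate from immediate neighbors.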

The paper later contrasts its simpler masked setup with TrajFM’s more sophisticated masking strategies: dimension-specific masking and segment masking. The difference is useful. A simple next-step model learns continuation. A richer masking model can support a broader set of tasks: path inference, travel-time estimation, trajectory completion, and possibly robustness to partially observed movement.

That makes masking one of the clearest bridges from tutorial code to business use.

| Modeling task | What the model learns | Business-facing interpretation | Boundary |
| --- | --- | --- | --- |
| Next-step prediction | Continue a trajectory from prior points | Route continuation, short-term movement forecasting | Does not guarantee map-valid routes |
| Delta prediction | Estimate movement changes | ETA refinement, deviation detection, smoother learning targets | Still depends on data quality and sampling frequency |
| Segment masking | Fill missing chunks of movement | Gap repair, trip reconstruction, privacy-aware partial traces | Can hallucinate plausible but false paths |
| Dimension masking | Recover missing time or spatial dimensions | Cleaning incomplete mobility records | Requires validation against operational ground truth |

The dangerous word here is “plausible.” A model may generate a route segment that looks statistically reasonable but never happened. For synthetic data, that may be acceptable if privacy and distributional fidelity are the goal. For compliance, billing, insurance, or incident investigation, plausible reconstruction is not evidence. The warning label practically writes itself.

The paper’s evidence is implementation evidence, not benchmark evidence

A reader expecting tables of accuracy results may feel underfed. That reaction is understandable, but it misses the genre of the paper.

This is a tutorial paper for SIGSPATIAL-style researchers and practitioners. It explains how to build a minimal trajectory foundation model from GPT-2 concepts, provides open-source material, and situates the educational model against more advanced approaches. Its evidence is not “our model beats all prior methods.” Its evidence is “these are the implementation steps needed to make the idea concrete.”

That makes the paper more useful for one audience and less useful for another.

If you are choosing a production vendor, the paper does not give you a leaderboard. If you are deciding whether to approve a multimillion-dollar mobility-AI deployment, it does not give you ROI evidence. If you are benchmarking transfer learning across cities, vehicle types, and sampling rates, it does not settle the matter.

But if you are trying to understand what a trajectory foundation model actually contains, the paper is valuable. It shows the substitutions:

  1. Replace text tokenization with trajectory parsing.
  2. Replace word embeddings with learnable projection layers over spatiotemporal features.
  3. Replace simple text positional assumptions with trajectory-aware positional handling.
  4. Use streaming data loaders because mobility datasets can be large.
  5. Use synthetic data generators for controlled testing.
  6. Reduce the GPT-2 architecture for educational and small-data settings.
  7. Predict deltas rather than only absolute coordinates.
  8. Add masking to move beyond next-step prediction.
  9. Compare this minimal model with TrajFM, TrajGPT, TimesFM-style patching, and generative approaches.
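Items 4 and 5 in the list above fit together naturally, and a short sketch shows how. The generator names, the random-walk step sizes, and the batch size are assumptions for illustration; the paper's own loaders and generators may look quite different.

```python
import itertools
import random


def synthetic_trajectories(n_points, seed=0):
    """Endless stream of random-walk trajectories as lists of (x, y, t)."""
    rng = random.Random(seed)
    while True:
        x, y, t = rng.random(), rng.random(), 0.0
        traj = []
        for _ in range(n_points):
            traj.append((x, y, t))
            x += rng.gauss(0, 0.01)
            y += rng.gauss(0, 0.01)
            t += rng.uniform(5, 60)  # seconds between samples
        yield traj


def batches(stream, batch_size):
    """Group a (possibly unbounded) stream into fixed-size batches, lazily."""
    it = iter(stream)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch


# Pull only two batches; nothing beyond them is ever generated or held in memory.
first_two = list(itertools.islice(batches(synthetic_trajectories(20), 4), 2))
```

The same `batches` function works unchanged if the stream reads from a file or a database cursor instead of a generator, which is the point of streaming: the training loop never needs the whole dataset at once.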

That list is not a benchmark result. It is a checklist for architectural due diligence.

For Cognaptus readers, that is probably the more useful artifact. Many organizations do not first fail because they chose the wrong SOTA model. They fail earlier, by not understanding whether the model’s representation matches the operational data. Very elegant nonsense is still nonsense. It just has better typography.

TrajFM and TrajGPT show two different answers to the same problem

The paper’s comparison with TrajFM and TrajGPT is not a side tour. It helps clarify the design space.

TrajFM stays closer to continuous spatial prediction. It combines coordinates, temporal features, and point-of-interest (POI) semantics, uses richer encoding methods such as learnable Fourier encodings for time, and adapts positional embeddings through rotary position embeddings (RoPE) to capture spatiotemporal relationships. It also uses more sophisticated masking. In the paper’s framing, TrajFM is closer to the idea of a transferable mobility foundation model across geographic contexts.

TrajGPT takes a different route. It reframes trajectory prediction as region and time prediction. Instead of directly producing coordinates, it predicts regions or POIs and uses specialized modules for temporal attributes such as travel time and duration. That makes it conceptually closer to token prediction, because regions can behave more like discrete items than raw coordinates do.
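To see why regions behave more like tokens, consider the simplest possible discretization: a uniform grid. This is an illustrative assumption, not TrajGPT's actual region or POI scheme, and the function name and bounding box are hypothetical.

```python
def region_token(lat, lon, bbox, rows, cols):
    """Map a coordinate to the integer id of its grid cell inside bbox."""
    lat_min, lat_max, lon_min, lon_max = bbox
    # Clamp so that points on or beyond the boundary fall into an edge cell.
    r = min(rows - 1, max(0, int((lat - lat_min) / (lat_max - lat_min) * rows)))
    c = min(cols - 1, max(0, int((lon - lon_min) / (lon_max - lon_min) * cols)))
    return r * cols + c


bbox = (50.0, 51.0, 4.0, 5.0)  # lat_min, lat_max, lon_min, lon_max
tokens = [
    region_token(50.125, 4.125, bbox, 4, 4),  # cell 0
    region_token(50.625, 4.375, bbox, 4, 4),  # cell 9
    region_token(50.7, 4.45, bbox, 4, 4),     # also cell 9
]
```

With a 4 by 4 grid the vocabulary has 16 entries, and next-region prediction becomes a classification problem much like next-token prediction. The last two points also show the price: they are several kilometers apart but receive the same token.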

Neither choice is universally superior. They answer different operational questions.

| Design choice | Better fit when… | Trade-off |
| --- | --- | --- |
| Continuous-coordinate prediction | Fine-grained spatial movement matters | Harder learning target; more sensitive to coordinate scaling and geography |
| Region or POI prediction | Business units care about zones, stops, or destinations | Loses fine-grained movement detail |
| Rich temporal encoders | Time patterns are central to the task | Requires careful feature design and validation |
| Multi-head outputs | The task combines destination, travel time, duration, or uncertainty | More complex training and evaluation |
| Segment masking | Missing data and route completion are important | Plausible completion may be mistaken for factual reconstruction |

This comparison is where the tutorial becomes more than code. It teaches a decision habit: do not ask whether a model is “trajectory-aware” in general. Ask what kind of trajectory target it predicts, what it treats as discrete or continuous, how it represents time, and what kind of missingness it can tolerate.

A city-planning use case may prefer zones and aggregate flows. A fleet-safety use case may need fine-grained route deviations. A retail-site analysis may care about dwell time and transitions between POIs. A synthetic-data use case may care less about exact individual paths and more about distributional realism. The same Transformer family can support these tasks only after the architecture is shaped around the target.

That shaping is the whole story.

Patching and synthetic trajectories point toward scale, but not shortcuts

The paper also references other ideas from related domains, including TimesFM-style patching and TrajGDM-style synthetic trajectory generation.

Patching groups consecutive values into larger units, reducing sequence length and computational overhead. In time-series modeling, this can make long sequences easier to process. For mobility, the analogy is tempting: instead of handling every raw GPS point, group segments into patches. Less sequence length, less compute, possibly more scalable modeling.

But mobility patches are not neutral. A patch can hide the very behavior the business cares about. If a model compresses a sequence too aggressively, it may smooth away detours, stops, sharp turns, unsafe maneuvers, or dwell-time anomalies. Patching is not simply a cheaper version of full trajectory modeling. It is a lossy abstraction, and the loss must be aligned with the use case.
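A toy example makes the loss visible. The sketch below patches a sequence of speed samples and summarizes each patch by its mean. Averaging is a deliberate simplification assumed here for clarity; real patch-based models usually embed each patch with a learned layer, but the sequence-length arithmetic is the same.

```python
def patch(values, size):
    """Split a sequence into consecutive patches of `size`; the last may be shorter."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def mean_pool(patches):
    """One crude way to summarize each patch: average its values."""
    return [sum(p) / len(p) for p in patches]


# Speed samples in km/h; the two zeros are a short stop.
speeds = [40, 40, 0, 0, 40, 40, 40, 40]
patches = patch(speeds, 4)   # [[40, 40, 0, 0], [40, 40, 40, 40]]
pooled = mean_pool(patches)  # [20.0, 40.0]
```

Eight steps become two, which is the compute saving. But the pooled sequence now says “slow, then normal,” and the stop has disappeared. If dwell events are what the business cares about, this patch size has quietly deleted the signal.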

Synthetic trajectory generation has a different appeal. Privacy, data scarcity, simulation, and augmentation all make synthetic movement data attractive. A model that can generate realistic traces could support urban planning experiments, logistics stress tests, or privacy-preserving data sharing.

Again, the boundary matters. Synthetic trajectories are useful when they preserve relevant distributions without exposing sensitive individuals. They are dangerous when treated as real observations. For privacy-sensitive industries, the hard question is not whether generated paths look realistic in a demo. It is whether they leak individual behavior, distort rare events, or create false confidence in downstream systems.

The paper does not resolve these issues. It points to the methods that make them visible. That is still progress.

The business value is architectural literacy before automation

The immediate business relevance of this paper is not that a company should download the tutorial code on Monday and automate a national logistics network by Friday. Please do not do that. The trucks have suffered enough.

The relevance is that mobility-heavy organizations need a better way to evaluate trajectory AI claims. The paper provides a practical vocabulary for doing that.

A logistics company can ask whether a vendor predicts absolute positions, movement deltas, regions, or travel-time distributions. A city agency can ask how the model handles missing segments and whether generated trajectories are map-valid. An insurer can ask whether driver behavior is encoded at a useful level or blurred into route geography. A retail operator can ask whether POI semantics are part of the representation or added later as dashboard seasoning.

The practical pathway looks like this:

| Paper mechanism | Business question it enables | What remains uncertain |
| --- | --- | --- |
| Custom trajectory encoder | Does the model represent movement, time, and location meaningfully? | Whether the chosen encoding transfers across regions and business contexts |
| Projection layer | Are heterogeneous trajectory features integrated before attention? | Whether important operational variables are missing |
| Delta prediction | Does the model learn movement dynamics rather than only absolute coordinates? | Whether predictions remain valid under unusual routes or rare events |
| Masking | Can the model handle incomplete trajectories? | Whether filled gaps are accurate enough for the use case |
| Streaming data loader | Can large mobility datasets be processed practically? | Whether training cost and latency fit production constraints |
| Comparison with TrajFM and TrajGPT | Which architecture family fits the task? | Whether benchmark results hold on the firm’s own data |

This is the boring-but-important layer of AI adoption. Before buying or building a mobility foundation model, a firm needs to know whether the model’s assumptions match the business process. In many cases, that question is more decisive than the brand name of the base architecture.

Where the paper stops, and where deployment begins

The paper is disciplined about its purpose. It describes an educational prototype, not a production-grade system. That should shape how we interpret it.

First, it does not provide a full empirical benchmark proving that the simplified model performs competitively against TrajFM, TrajGPT, or other specialized systems. Its comparison with those models is explanatory and architectural, not a final ranking.

Second, it does not settle the data governance problem. Trajectory data is often sensitive. It can reveal homes, workplaces, routines, health visits, religious attendance, labor patterns, and commercial secrets. A mobility foundation model may increase analytical power, but it also increases the need for privacy controls, access governance, retention policies, and careful synthetic-data evaluation.

Third, it does not prove geographic or domain transferability. A model trained on one city, vehicle type, sampling rate, or business process may not generalize cleanly to another. Transferability must be tested, not assumed because the phrase “foundation model” appears in the abstract.

Fourth, it does not replace classical mobility infrastructure. Map matching, routing engines, simulation tools, database systems, and domain-specific rules remain relevant. The point is not to delete the existing stack and let a Transformer improvise urban planning. The point is to understand where sequence learning can complement it.

These boundaries do not weaken the paper. They clarify its use. It is a tutorial map, not the destination.

The real lesson: GPT can walk only after someone teaches it geography

The clever part of this paper is not that it attaches GPT-style machinery to trajectory data. Many papers now attach Transformers to whatever data format happens to be nearby. The clever part is that it slows down the translation.

It asks: What happens to tokenization? What replaces embeddings? How should time be represented? Should the target be absolute position or movement delta? What does masking mean when the missing object is not a word but a route segment? How do more advanced models make different choices?

That is the right order of thinking. Mechanism first. Results later. Deployment much later.

For businesses working with mobility data, the article-level takeaway is simple: trajectory foundation models are not just LLMs with coordinates stapled to the input. They require new encoders, new targets, new masking logic, and new validation habits. Once those pieces are understood, the commercial possibilities become more concrete: route completion, ETA improvement, fleet anomaly detection, location intelligence, synthetic mobility data, cross-region transfer learning, and mobility simulation.

But the responsible interpretation is equally concrete. This paper does not prove that a tutorial model is ready for production. It teaches what production claims should be interrogated for.

GPT can go for a walk. But first, someone has to teach it that a walk is not a sentence, a road is not a token, and “next” depends on more than grammar.

Cognaptus: Automate the Present, Incubate the Future.


  1. Gaspard Merten, Mahmoud Sakr, and Gilles Dejaegere, “Building a Foundation Model for Trajectory from Scratch,” arXiv:2511.20610, 2025, https://arxiv.org/abs/2511.20610