When Trains Meet Snowstorms: Turning Weather Chaos into Predictable Rail Operations

A delayed train is easy to complain about and surprisingly hard to explain.

The passenger sees one number: five minutes late, twelve minutes late, cancelled, chaos. The operator sees a messier object. Was the train already late when it entered the station? Did the station itself add delay? Was the delay caused by snow, low visibility, wind, passenger boarding, a single-track bottleneck, equipment failure, or simply the accumulated sins of every previous station on the route?

For AI, that difference matters. A model cannot predict “weather disruption” from weather data unless the operational record tells it where and when the disruption actually appeared. Otherwise, the model is not learning railway reliability. It is learning a statistical soup with a timetable floating in it. Very Nordic. Very cold. Not very useful.

The paper “Integrating Meteorological and Operational Data: A Novel Approach to Understanding Railway Delays in Finland” builds a public dataset that combines Finnish railway operational records from Digitraffic with meteorological observations from the Finnish Meteorological Institute (FMI) across 2018–2024.¹ The headline numbers are respectable: roughly 38.5 million observations in the reported integrated data, 28 final engineered features, 209 weather stations, and a baseline XGBoost model that reaches a mean absolute error of 2.73 minutes for station-specific delay prediction at Oulu.

But the model score is not the main story.

The more important contribution is the machinery underneath: how raw railway events, weather observations, missing sensors, duplicated records, cyclical time, and delay propagation are converted into a dataset that can support actual predictive work. In rail operations, the glamorous part is the prediction. The expensive part is making sure the prediction refers to the right physical event.

The paper is about building usable railway AI infrastructure

The authors position their dataset as a response to a familiar gap in railway AI research: many railway datasets cover operations, safety, inspection, or traffic planning, but far fewer integrate detailed meteorological observations with train-level operational records. That gap matters especially in Nordic rail systems, where snow, low temperature, visibility, icing, wind, and seasonal variation are not decorative variables. They are part of the operating environment.

The paper combines two open Finnish data sources:

Source	What it contributes	Why it matters for delay prediction
Digitraffic railway data	Train numbers, departure dates, station-level timetable rows, scheduled and actual times, station codes, train categories, cancellation flags, stop indicators, and delay in minutes	Defines the operational event: which train, which station or section, what time, and how late
Finnish Meteorological Institute observations	Temperature, wind, gusts, humidity, dew point, precipitation intensity, snow depth, pressure, visibility, cloud amount, and weather-code fields from environmental monitoring stations	Defines the local external condition around the operational event

The paper’s practical value is not that it says “bad weather causes delays.” That would be both obvious and under-proven. Its value is that it makes weather and train movement comparable at the same spatial-temporal grain.

That is the unglamorous foundation of operational AI. Before a model can predict anything meaningful, a row in the dataset must answer a basic question:

At this specific train event, near this specific railway location, at this specific time, what was the surrounding weather, and what delay target are we asking the model to learn?

Most enterprise AI projects fail not because the algorithm is too stupid, but because this row-level question is never cleanly answered. The paper is useful because it answers it with engineering discipline rather than “we merged some data” hand-waving.

Spatial matching turns weather stations into railway context

The first mechanism is spatial matching.

A train station is not a weather station. Finland’s rail network has its own geography; meteorological monitoring infrastructure has another. To integrate them, the authors assign railway locations to nearby environmental monitoring stations using Haversine distance, which calculates great-circle distance from latitude and longitude.

In plain business language: the dataset asks, “Which weather station is geographically closest to this railway point?” Then it uses that station as the local weather proxy.
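To make the closest-station question concrete, here is a minimal sketch of Haversine matching in plain Python. The station names and coordinates are hypothetical, the Earth radius is an assumed constant, and the helper names are illustrative rather than taken from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(rail_point, weather_stations):
    """Return (name, distance_km) of the weather station closest to rail_point."""
    lat, lon = rail_point
    return min(
        ((name, haversine_km(lat, lon, wlat, wlon))
         for name, (wlat, wlon) in weather_stations.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical coordinates, for illustration only.
weather_stations = {"WX_A": (65.0, 25.5), "WX_B": (60.2, 24.9), "WX_C": (61.5, 23.8)}
match_name, match_km = nearest_station((65.01, 25.48), weather_stations)
```

Returning the distance alongside the name is a deliberate choice: it lets a later step decide whether the proxy is close enough to trust.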

This sounds simple until one remembers that infrastructure data is rarely aligned by design. Railway systems were built to move trains. Weather stations were built to observe weather. Their coordinates do not politely line up just because a machine-learning researcher has arrived with a laptop and enthusiasm.

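The paper’s two-stage integration process, spatial alignment followed by temporal alignment, is therefore central, together with the join choice that keeps every train record: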

Step	Mechanism	Operational consequence
Spatial alignment	Match each train station or section to the closest environmental monitoring station using geographic coordinates	Weather becomes location-specific rather than national-average noise
Temporal alignment	Join weather observations to train records by timestamp, using nearest-neighbor temporal matching when exact timestamps are unavailable	Weather becomes event-specific rather than daily-background context
Train-record preservation	Use the train dataset as the base in the left join	Operational records are not lost merely because weather observations are incomplete

That last point is quietly important. If the merged dataset only retained train events with perfectly matched weather, it would bias the railway sample toward locations and times with better sensor coverage. That would produce a cleaner dataset and a less honest one. The authors instead preserve train records and then handle missing weather systematically.
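The temporal step and the left-join behaviour can be sketched together. This is a standard-library illustration, not the authors' code: the 15-minute tolerance, the field names, and the sample values are assumptions made for the example.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_observation(obs_times, obs_values, event_time, tolerance):
    """Value observed closest in time to event_time, or None if outside tolerance.

    obs_times must be sorted ascending and aligned with obs_values.
    """
    if not obs_times:
        return None
    i = bisect_left(obs_times, event_time)
    candidates = [j for j in (i - 1, i) if 0 <= j < len(obs_times)]
    best = min(candidates, key=lambda j: abs(obs_times[j] - event_time))
    if abs(obs_times[best] - event_time) > tolerance:
        return None
    return obs_values[best]


def left_join_weather(train_events, obs_times, obs_values, tolerance):
    """Attach weather to every train event; unmatched events keep None."""
    return [
        {**event,
         "temperature": nearest_observation(obs_times, obs_values, event["time"], tolerance)}
        for event in train_events
    ]


# Hypothetical 10-minute observations and two train events.
t0 = datetime(2024, 1, 15, 8, 0)
obs_times = [t0 + timedelta(minutes=10 * k) for k in range(3)]  # 08:00, 08:10, 08:20
obs_values = [-12.0, -12.4, -13.1]
train_events = [
    {"train": 27, "time": t0 + timedelta(minutes=7)},  # closest observation is 08:10
    {"train": 27, "time": t0 + timedelta(hours=3)},    # nothing within tolerance
]
joined = left_join_weather(train_events, obs_times, obs_values, timedelta(minutes=15))
```

Both train events survive the join; the second simply carries a missing weather value. In a dataframe workflow, `pandas.merge_asof` with `direction="nearest"` expresses the same idea.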

This is the kind of design decision that rarely appears in executive AI demos, because nobody wants to put “left join discipline” on a slide. Naturally, it is also exactly where many predictive systems begin to rot.

Weather data is rich, but sensor coverage is uneven

The FMI side of the dataset is not a single neat table where every station measures everything at every minute. The paper reports 209 weather stations, with most recording at 10-minute intervals and a smaller group recording at 1-minute intervals. Coverage also varies by weather feature.

Basic thermodynamic variables such as air temperature, relative humidity, and dew-point temperature are available at nearly all stations. Wind features are available at a lower but still substantial share. Hydrometeorological and visibility-related features are more uneven: precipitation, snow depth, cloud amount, and horizontal visibility are measured by only around half of stations before mitigation.

That unevenness creates the main data engineering problem: the weather features most interesting for railway disruption are often the ones with weaker coverage. Snow depth is operationally meaningful. Visibility is operationally meaningful. Precipitation intensity is operationally meaningful. Unfortunately, data availability has never cared about what analysts find meaningful. Rude, but consistent.

The authors respond with a spatial fallback strategy. If the nearest weather station does not measure a needed weather parameter, the algorithm searches within a 50 km radius for the nearest station that does measure it. If no station within that radius has the measurement, the value remains missing.
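A rough sketch of that fallback rule follows. It assumes the 50 km radius is measured from the railway point, which the description above does not spell out, and it uses hypothetical stations and parameter names.

```python
import math

FALLBACK_RADIUS_KM = 50.0  # radius stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def station_for_parameter(rail_point, stations, parameter, radius_km=FALLBACK_RADIUS_KM):
    """Use the nearest station if it measures `parameter`; otherwise fall back to the
    nearest station within radius_km that does. Returns None when there is none."""
    lat, lon = rail_point
    ranked = sorted(
        ((haversine_km(lat, lon, s["lat"], s["lon"]), s) for s in stations),
        key=lambda pair: pair[0],
    )
    if not ranked:
        return None
    nearest = ranked[0][1]
    if parameter in nearest["measures"]:
        return nearest["name"]
    for distance, station in ranked[1:]:
        if distance > radius_km:
            break  # stations are sorted, so everything further is out of range too
        if parameter in station["measures"]:
            return station["name"]
    return None  # the value stays missing


# Hypothetical stations: roughly 1 km, 21 km and 110 km from the railway point.
stations = [
    {"name": "WX_NEAR", "lat": 65.00, "lon": 25.50, "measures": {"temperature"}},
    {"name": "WX_MID", "lat": 65.20, "lon": 25.50, "measures": {"temperature", "snow_depth"}},
    {"name": "WX_FAR", "lat": 66.00, "lon": 25.50, "measures": {"visibility"}},
]
rail_point = (65.01, 25.50)
snow_source = station_for_parameter(rail_point, stations, "snow_depth")
visibility_source = station_for_parameter(rail_point, stations, "visibility")
```

Here snow depth is borrowed from the mid-distance station, while visibility stays missing because the only station measuring it lies beyond the radius.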

This is not a perfect representation of local weather. A snow-depth reading 40 km away may not describe the exact condition at the railway section. But the alternative is systematic absence. The authors make a pragmatic trade-off: for regional weather phenomena, a nearby valid measurement can be more informative than a blank cell.

That decision has a clear business reading. In infrastructure analytics, the question is often not “Can we measure everything perfectly?” It is “Can we recover enough trustworthy signal to support operational triage?” The fallback strategy is a triage mechanism for the data itself.

Missing-data policy is not cleanup; it is part of the model design

The paper’s missing-data handling deserves more attention than the model section, because it defines what the model is allowed to learn.

The authors use a hierarchy:

Missing-data issue	Treatment	Why it matters
Missing timestamps	Delete the observation	Time integrity is non-negotiable for delay prediction
Missing target delay values	Delete the observation	Imputing the target would teach the model invented delays
All weather features missing	Delete the observation	The row lacks the environmental context needed for this dataset’s purpose
Weather columns above 70% missingness	Drop the feature	Sparse features risk adding noise and instability
Missing boolean stop indicators	Impute as false	Domain assumption: absent flags often indicate negation
Remaining weather gaps	Month-specific median imputation	Preserves seasonal structure while resisting outliers

This is not generic “data preprocessing.” It is an operating philosophy.

For example, the paper reports that precipitation amount remains 86.91% missing after mitigation and is therefore excluded under the >70% rule. That is an uncomfortable but sensible choice. Precipitation amount sounds useful, especially for a weather-delay dataset. But a seductive feature with massive missingness can quietly damage a model more than a boring feature with stable coverage.

The monthly median imputation is also more meaningful than it first appears. If missing temperature values were imputed with one global median, winter and summer would collapse toward the same statistical center. A month-specific median at least respects seasonality, which is central to the paper’s domain. It is still an imputation, not a miracle. But it is an imputation that knows January is not July.
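The two weather rules, dropping features above 70% missingness and filling the rest with month-specific medians, can be sketched in a few lines. The rows, field names, and helper names below are hypothetical; only the 70% threshold comes from the text.

```python
from collections import defaultdict
from statistics import median

MAX_MISSING_SHARE = 0.70  # threshold stated in the text


def keep_dense_features(rows, features, max_missing=MAX_MISSING_SHARE):
    """Return the features whose share of missing (None) values is at most max_missing."""
    kept = []
    for feature in features:
        missing = sum(1 for row in rows if row.get(feature) is None)
        if missing / len(rows) <= max_missing:
            kept.append(feature)
    return kept


def impute_month_median(rows, feature):
    """Fill None values of `feature` with the median observed in the same calendar month."""
    by_month = defaultdict(list)
    for row in rows:
        if row[feature] is not None:
            by_month[row["month"]].append(row[feature])
    medians = {month: median(values) for month, values in by_month.items()}
    # A month with no observed values has no median, so its gaps remain None.
    return [
        {**row, feature: medians.get(row["month"])} if row[feature] is None else row
        for row in rows
    ]


# Hypothetical rows: January and July temperatures, and a very sparse precipitation column.
rows = [
    {"month": 1, "temperature": -20.0, "precip_amount": None},
    {"month": 1, "temperature": -10.0, "precip_amount": None},
    {"month": 1, "temperature": None, "precip_amount": None},
    {"month": 1, "temperature": -14.0, "precip_amount": 0.2},
    {"month": 7, "temperature": 18.0, "precip_amount": None},
    {"month": 7, "temperature": None, "precip_amount": None},
    {"month": 7, "temperature": 22.0, "precip_amount": None},
]
kept = keep_dense_features(rows, ["temperature", "precip_amount"])
filled = impute_month_median(rows, "temperature")
```

The January gap is filled with a January median and the July gap with a July median, which is the whole point: a single global median would have pulled both toward the same value.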

The authors also remove exact duplicate records, reporting that duplicates represented 23.24% of the dataset at that stage. Again, this is not cosmetic. Duplicates can distort model training by overweighting repeated events, especially in operational logs where nested timetable rows and repeated station events may appear in dense form.
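Exact-duplicate removal is simple to state in code. The sketch below assumes each record is a flat dictionary with hashable values; the sample records are hypothetical.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical log with one repeated timetable row.
log = [
    {"train": 27, "station": "OL", "differenceInMinutes": 3},
    {"train": 27, "station": "OL", "differenceInMinutes": 3},
    {"train": 27, "station": "OL", "differenceInMinutes": 4},
    {"train": 45, "station": "OL", "differenceInMinutes": 3},
]
deduplicated = drop_exact_duplicates(log)
removed_share = 1 - len(deduplicated) / len(log)
```

Only rows that match on every field are dropped; a row that differs in a single value is kept as a distinct event.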

Finally, the paper applies robust scaling to weather features only after the train/test split, using training-set statistics only. That is a small sentence with large implications. Scaling before splitting would leak information from the test set into the training process. It would make the model look more generalizable than it is. The paper avoids that particular sin. A modest achievement, perhaps, but in machine learning evaluation, modest sins are how large procurement mistakes are born.
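The leakage point is easier to see in code. Robust scaling centers on the median and divides by the interquartile range; the sketch below fits those statistics on the training portion only and then reuses them for the test portion. The values and the split position are hypothetical, and the quartile method is an assumption of this example.

```python
from statistics import median, quantiles


def fit_robust_scaler(train_values):
    """Learn the median and interquartile range from training data only."""
    q1, _, q3 = quantiles(train_values, n=4, method="inclusive")
    iqr = q3 - q1
    return {"center": median(train_values), "scale": iqr if iqr != 0 else 1.0}


def apply_robust_scaler(values, params):
    """Scale values with previously fitted statistics; nothing is re-estimated here."""
    return [(value - params["center"]) / params["scale"] for value in values]


# Hypothetical temperatures; the last three play the role of the held-out test set.
temperatures = [-20.0, -15.0, -10.0, -5.0, 0.0, 5.0, 10.0, 40.0]
train, test = temperatures[:5], temperatures[5:]

params = fit_robust_scaler(train)                # statistics come from train only
train_scaled = apply_robust_scaler(train, params)
test_scaled = apply_robust_scaler(test, params)  # test data never influences params
```

Had the scaler been fitted on all eight values, the extreme test reading of 40.0 would have shifted the statistics used for training. Libraries such as scikit-learn offer the same median-and-IQR idea as `RobustScaler`.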

Time is cyclical, not a staircase

The paper also encodes hour, month, and day of week using sine and cosine transformations. The reason is simple: time wraps around.

Without cyclical encoding, 23:00 and 00:00 look far apart numerically, although they are one hour apart in reality. December and January can look maximally distant, although operationally they may share winter conditions. Sine-cosine encoding lets the model see circular proximity.
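A small sketch shows the effect. The function name is illustrative, and the choice of periods (24 for hours, 12 for months) is the usual convention rather than a detail confirmed by the paper.

```python
import math


def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Hours: 23:00 and 00:00 are neighbours on the circle, 00:00 and 12:00 are opposite.
late_evening = cyclical_encode(23, 24)
midnight = cyclical_encode(0, 24)
noon = cyclical_encode(12, 24)
gap_evening_midnight = math.dist(late_evening, midnight)
gap_midnight_noon = math.dist(midnight, noon)

# Months: December to January is one step, just like January to February.
december, january, february = (cyclical_encode(m, 12) for m in (12, 1, 2))
gap_dec_jan = math.dist(december, january)
gap_jan_feb = math.dist(january, february)
```

As raw integers, 23 and 0 are 23 units apart; on the circle they sit closer together than any pair of hours two or more apart.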

This is a small modeling detail, but it captures a broader lesson. Infrastructure operations are full of cyclicality: peak hours, weekdays, weekends, winter seasons, holiday effects, maintenance windows. Treating cyclical time as ordinary integers can inject fake distance into the model.

The authors retain raw month and day-of-week fields alongside encoded versions, which is sensible because different model families use features differently. Tree-based models often handle raw integer-coded month and weekday fields reasonably well; linear or distance-sensitive models may benefit more from continuous cyclical encodings.

That flexibility is part of the dataset’s design. The paper is not building one locked prediction pipeline. It is building a resource that can support multiple downstream modeling choices.

The target variable decides the business question

The most important conceptual move in the paper is not the weather merge. It is the target design.

A railway delay observed at a station can mean at least two different things. It can mean the train is currently late. Or it can mean this station or segment added new delay. Those are not the same problem.

The authors include multiple delay formulations:

Target	What it measures	Best business use	Main caution
`differenceInMinutes`	Raw cumulative delay: actual time minus scheduled time at a point	Passenger-facing ETA, service reliability reporting, disruption communication	Mixes local delay with delay inherited from earlier stations
`differenceInMinutes_offset`	Delay after removing the initial delay from the first station	Understanding delay accumulated after departure	Still may contain route-level propagation effects
`differenceInMinutes_eachStation_offset`	Incremental station- or segment-specific delay after removing delay inherited from previous stops	Local vulnerability analysis, operational diagnosis, weather-impact modeling	More useful for diagnosis, but not the same as final-arrival delay prediction
`trainDelayed`	Binary delayed/not delayed indicator using a 5-minute threshold	Classification tasks and service-level monitoring	Threshold choice affects class balance and interpretation
`cancelled`	Whether the train was cancelled	Severe disruption analysis	Cancellation causes still need external explanation
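The relationship between the first four targets can be illustrated from one train's sequence of cumulative delays. This is a sketch of the definitions in the table above, not the authors' implementation. Three details are assumptions: the first stop is given an incremental delay of zero, the binary flag is computed from cumulative delay, and the threshold is applied as "at least 5 minutes".

```python
DELAY_THRESHOLD_MIN = 5  # threshold stated in the table above


def derive_targets(cumulative_delays):
    """Derive delay targets from raw cumulative delays (minutes) listed in route order."""
    first = cumulative_delays[0]
    previous = None
    rows = []
    for delay in cumulative_delays:
        rows.append({
            # Raw lateness at this stop, including everything inherited.
            "differenceInMinutes": delay,
            # Lateness relative to the delay the train already had at its first stop.
            "differenceInMinutes_offset": delay - first,
            # Delay added since the previous stop (assumed 0 at the first stop).
            "differenceInMinutes_eachStation_offset": 0 if previous is None else delay - previous,
            # Assumed rule: delayed when cumulative delay reaches the threshold.
            "trainDelayed": delay >= DELAY_THRESHOLD_MIN,
        })
        previous = delay
    return rows


# Hypothetical train: starts 2 minutes late, loses 4 more, holds, then loses 6 more.
targets = derive_targets([2, 6, 6, 12])
local_delays = [row["differenceInMinutes_eachStation_offset"] for row in targets]
```

At the third stop the train is six minutes late, yet the stop itself added nothing. That single row is the difference between a passenger-facing number and a diagnostic one.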

This is where the paper becomes useful for business readers.

A passenger information system cares about cumulative delay. If the train will reach Helsinki twelve minutes late, the passenger does not care whether those twelve minutes were born in Oulu, inherited from Tampere, or lovingly assembled across half the country.

An operations team cares very much. If a route repeatedly accumulates delay at a specific segment during low-visibility winter mornings, that is a different managerial problem from a train that merely carries inherited delay through the network. One asks for passenger communication. The other asks for local diagnosis.

The paper’s Oulu distribution comparison illustrates the point. For long-distance trains passing through Oulu across the seven-year period, the cumulative delay target shows 72.6% of observations with some delay and 27.1% exceeding five minutes. The station-specific offset target shows 51% with some delay and 14.2% exceeding five minutes.

The gap is the business story. Much of what passengers observe as “delay at this point” is actually propagated delay. If a model is trained on cumulative delay and then interpreted as identifying local causes, the organization will misread the system. It may blame the wrong station, the wrong weather, or the wrong operating process. Very efficient, very automated, very wrong.

The exploratory figures show patterns, not causes

The paper reports seasonal delay patterns: winter months show delay rates exceeding 25%, while late spring and early autumn are often lower. It also notes weekday variation, with Fridays showing high delay rates in parts of the year and Saturdays generally lower. Geographic clustering appears in central and northern Finland.

These figures are useful as exploratory evidence. Their likely purpose is to show that the dataset captures realistic operational patterns and that railway delay variation is not random noise.

They do not prove causality.

The authors are careful enough to note that the underlying causes of delays are not identified in the dataset. A winter delay pattern may be weather-related, but it can also reflect operational stress, infrastructure constraints, passenger volume, maintenance conditions, equipment vulnerability, or interactions among all of them. June’s relatively elevated delay percentage, for example, may relate to higher summer traffic and holiday passenger volumes, not snow suddenly developing a sense of humor.

This distinction matters. The dataset enables weather-aware modeling. It does not automatically identify weather as the cause of each delay.

Here is a disciplined reading of the evidence:

Evidence in the paper	Likely purpose	What it supports	What it does not prove
Seasonal delay figures	Exploratory validation	Delay patterns vary by season and weekday	Specific weather variables caused the delays
Station-to-weather matching map	Implementation evidence	Railway locations can be paired with nearby weather stations	The nearest station perfectly captures local track conditions
Weather-feature missingness table	Data-quality assessment	Sensor coverage is uneven and requires systematic handling	Dropped or imputed features are operationally irrelevant
Correlation heatmap	Feature-inspection aid	Some weather variables are redundant or strongly related	Multicollinearity alone determines model performance
Oulu XGBoost baseline	Dataset utility validation	The dataset can support predictive modeling	The model is production-ready or nationally generalizable
Comparison with prior studies	Contextual benchmark	Results are within a plausible delay-prediction range	Finnish rail is directly comparable with Chinese or Dutch systems

The useful reading is not “weather explains Finnish rail delays.” The useful reading is “we now have a public, structured way to study how operational delays and meteorological conditions coexist at event level.”

That is a smaller claim. It is also much more valuable.

The Oulu experiment validates the dataset, not the whole railway system

For baseline validation, the authors train XGBoost regression models on long-distance trains passing through Oulu station during 2018–2024, using 101,146 observations. The feature set includes operational variables, cyclical time encodings, train ID, and ten weather features. The target is `differenceInMinutes_eachStation_offset`, the station-specific delay contribution.

The reported result is 2.73 minutes MAE on the Oulu test set. This outperforms the two cumulative-delay targets tested in the paper: `differenceInMinutes` at 4.21 minutes MAE and `differenceInMinutes_offset` at 4.81 minutes MAE.

The interpretation is straightforward. Station-specific incremental delay is easier to predict than cumulative delay because it removes some propagation noise. A train arriving late because of earlier disruptions carries history from the network. A station-specific target narrows the prediction problem to local delay contribution. Less inherited mess, better model behavior. Shocking, I know.
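For readers less familiar with the metric, mean absolute error is the average size of the miss, in minutes. The sketch below computes it and also scores a naive predict-the-median baseline on two hypothetical targets. The numbers are invented for illustration and are not the paper's data.

```python
from statistics import median


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def median_baseline_mae(values):
    """MAE of a model that always predicts the median of the target."""
    guess = median(values)
    return mean_absolute_error(values, [guess] * len(values))


# Hypothetical delays in minutes for the same five events.
cumulative = [0, 4, 9, 15, 2]  # inherited plus local delay
local = [0, 2, 1, 3, 0]        # delay added at this station only

example_mae = mean_absolute_error([0, 3, 10, 1], [1, 2, 6, 1])
baseline_cumulative = median_baseline_mae(cumulative)
baseline_local = median_baseline_mae(local)
```

Because MAE is expressed in the target's own units, a target with a narrower spread tends to produce a smaller MAE even for a naive model. That is worth keeping in mind when comparing scores across differently defined delay targets.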

Still, the experiment should be read as baseline validation, not a final product claim.

First, it uses one station context: Oulu. Oulu is important, but Finland is not Oulu wearing a larger coat. Second, the baseline uses historical observed weather, not necessarily forecast weather. A real operating system would need predictions before the train event, not beautifully cleaned observations afterward. Third, the model is XGBoost, not a sequence model designed to capture route dynamics over time. The authors themselves point toward LSTM, Transformer, and graph neural network approaches as possible next steps.

There is also a minor textual inconsistency worth noting: the experimental setup describes randomized search with 30 iterations and 5-fold cross-validation, while the result discussion describes performance across 50 random-search iterations. This does not invalidate the dataset contribution, but it reinforces the correct reading: the experiment is a baseline demonstration, not the paper’s main thesis.

The correct business takeaway is therefore not “buy XGBoost and fix rail delays.” Please do not put that in a board memo.

The better takeaway is this:

Once delay propagation is separated from local delay contribution, local delay becomes a more tractable prediction target.

That is operationally meaningful. It changes how a railway operator might structure dashboards, alerts, and intervention planning.

The business value is diagnosis before prediction

For rail operators, transport agencies, infrastructure planners, and mobility platforms, the paper suggests three practical pathways.

Passenger information systems need cumulative delay

Passenger-facing systems need to estimate arrival and departure times in terms users understand. For that purpose, cumulative delay targets remain useful. A passenger wants to know whether the train will arrive late, not whether the lateness was locally generated.

This is the natural domain for ETA models, disruption notifications, missed-connection warnings, and service-level communication. Weather variables can improve these systems if they help anticipate delay changes along the route.

But cumulative targets are blunt. They are good for informing users. They are weaker for diagnosing where the system is failing.

Operations teams need station-specific delay

The station-specific offset target is more useful for operational triage. If a station or section repeatedly adds delay under certain weather conditions, it becomes a candidate for closer inspection: switches, platform processes, local track conditions, clearing procedures, signaling resilience, or staff coordination.

This does not prove weather causality. It narrows the search area.

That narrowing matters financially. Infrastructure organizations do not have infinite maintenance budgets. A model that identifies local delay contribution can help prioritize investigation. The ROI is not just fewer late trains. It is cheaper diagnosis: fewer blind inspections, better corridor prioritization, and earlier identification of fragile operating points.

Planners need seasonal and geographic reliability maps

The exploratory seasonal and geographic patterns can support strategic planning, especially in winter resilience. If certain corridors show repeated high-delay patterns during winter months, agencies can compare them with snow depth, visibility, precipitation intensity, pressure shifts, and wind conditions.

Again, the dataset does not by itself identify causes. But it gives planners a structured evidence base for asking better questions.

A practical planning workflow might look like this:

1. Use cumulative delay to identify passenger-impact hotspots.
2. Use station-specific offset delay to separate local delay generation from inherited propagation.
3. Overlay meteorological features to detect weather-sensitive operating conditions.
4. Add maintenance, rolling-stock, and passenger-flow data before making causal claims.
5. Test interventions through operational pilots, not PowerPoint optimism.

The last step is frequently skipped. It should not be.
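The first two steps of that workflow amount to ranking stations twice, once by cumulative delay and once by station-specific delay, and comparing the rankings. A minimal sketch, with hypothetical station codes and values:

```python
from collections import defaultdict
from statistics import mean


def station_summary(events):
    """Average cumulative and station-specific delay per station."""
    grouped = defaultdict(lambda: {"cumulative": [], "local": []})
    for event in events:
        grouped[event["station"]]["cumulative"].append(event["differenceInMinutes"])
        grouped[event["station"]]["local"].append(
            event["differenceInMinutes_eachStation_offset"]
        )
    return {
        station: {
            "mean_cumulative": mean(values["cumulative"]),
            "mean_local": mean(values["local"]),
        }
        for station, values in grouped.items()
    }


def rank_by(summary, field):
    """Station codes ordered from highest to lowest average of `field`."""
    return sorted(summary, key=lambda station: summary[station][field], reverse=True)


# Hypothetical events: STN_A mostly inherits delay, STN_B generates it.
events = [
    {"station": "STN_A", "differenceInMinutes": 10, "differenceInMinutes_eachStation_offset": 1},
    {"station": "STN_A", "differenceInMinutes": 12, "differenceInMinutes_eachStation_offset": 0},
    {"station": "STN_B", "differenceInMinutes": 6, "differenceInMinutes_eachStation_offset": 5},
    {"station": "STN_B", "differenceInMinutes": 8, "differenceInMinutes_eachStation_offset": 6},
]
summary = station_summary(events)
passenger_hotspots = rank_by(summary, "mean_cumulative")
local_hotspots = rank_by(summary, "mean_local")
```

In this toy case the passenger-impact ranking puts STN_A first, while the local-contribution ranking puts STN_B first. Stations where the two rankings disagree are the ones worth a closer look in the later steps.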

The dataset is a foundation for agents, but not yet an agentic railway brain

It is tempting to frame this paper as a step toward autonomous railway operations. In a narrow sense, yes: better integrated data can support automated prediction, alerting, and decision support. But that does not make the dataset an operating agent.

The paper builds a structured historical dataset. It does not implement real-time decision-making. It does not optimize train dispatching. It does not prescribe interventions. It does not estimate the causal effect of snow depth on a specific equipment failure. It does not integrate maintenance logs, rolling-stock condition, crew schedules, passenger crowding, or incident reports.

That boundary is not a weakness. It is the boundary between a dataset paper and an operational control system.

For Cognaptus readers thinking about enterprise AI, this distinction is familiar. A good dataset does not automate a business process by itself. It makes automation less stupid.

In a railway setting, the dataset could become part of a larger decision architecture:

Layer	What the paper provides	What still needs to be added
Data foundation	Historical train-weather event records	Streaming ingestion, validation, versioning
Prediction	Baseline feasibility using XGBoost	Forecast weather, route-sequence models, national validation
Diagnosis	Station-specific delay target	Causal inference, incident labels, maintenance records
Decision support	Potential local vulnerability signals	Cost models, intervention rules, dispatch constraints
Automation	Structured features for downstream systems	Human-in-the-loop workflows, safety governance, escalation logic

This is where the paper’s mechanism-first contribution becomes strategically interesting. The authors are not merely showing that a model can make a prediction. They are showing how to build the data substrate on which more serious operational intelligence can later sit.

That is less flashy than “AI dispatcher.” It is also where real systems usually begin.

Where the evidence stops

The paper’s boundaries are clear and should be kept clear.

First, the baseline experiment is narrow. It focuses on long-distance trains passing through Oulu. That makes sense for validation, but it does not establish performance across every Finnish route, commuter service, freight context, or extreme weather scenario.

Second, the dataset uses observed historical weather. For live operations, forecast weather would be required to generate actionable predictions before delays occur. The authors mention this as a future direction.

Third, the dataset lacks explicit causal labels for delay sources. Digitraffic contains cause-related fields in the raw structure, but the dataset and analysis discussed in the paper do not resolve delay causality. Without incident, maintenance, rolling-stock, crew, passenger-flow, and infrastructure-condition data, causal interpretation remains limited.

Fourth, the spatial fallback strategy is pragmatic, not perfect. A 50 km weather proxy can be reasonable for regional conditions but weaker for localized snowfall, microclimates, wind exposure, or station-specific infrastructure vulnerability.

Fifth, the target design is powerful but requires careful use. Station-specific offset delay is better for local diagnosis, but passenger-facing applications still need cumulative delay. Choosing the wrong target can make a technically accurate model operationally misleading.

These limitations do not reduce the paper’s value. They define its proper use. A map is more useful when you remember it is not the territory. Especially when the territory is covered in snow.

The larger lesson: operational AI starts before modeling

The paper’s strongest lesson applies beyond Finland and beyond railways.

Many organizations want predictive AI for infrastructure: ports, airports, warehouses, energy grids, logistics fleets, public transport, hospitals. The pattern is the same. They have operational logs. They have environmental or external data. They have timestamps that do not quite match, locations that do not quite align, missing fields, duplicated records, inherited effects, and target variables that quietly answer different business questions.

Then someone asks for a model.

This paper is a useful reminder that the model is late in the story. The real work starts earlier:

define the physical event;
align external context to that event;
preserve operational records without pretending missing context does not exist;
decide what missingness means;
separate accumulated outcomes from local contributions;
prevent leakage;
validate with a baseline;
resist turning predictive association into causal explanation.

That sequence is not glamorous. It is also the difference between an AI system that helps operations and one that produces confident numerology.

For railway operators, the immediate implication is practical: weather-aware delay prediction should not begin with model selection. It should begin with target design and data alignment. Passenger ETA, local vulnerability diagnosis, and infrastructure planning are different problems. They may share data, but they should not share the same target blindly.

For business leaders, the lesson is even simpler. Before asking whether AI can predict disruption, ask whether your data can distinguish disruption from propagation. If it cannot, your model may still predict something. It just may not be the thing you think you bought.

The Finnish railway-delay dataset is valuable because it makes that distinction visible. It turns snowstorms, timetables, station events, and sensor gaps into a structured analytical object. Not a complete operating system. Not causal truth. Not magic.

Just the kind of boring, careful data infrastructure without which “smart operations” remains a slogan with a dashboard attached.

Cognaptus: Automate the Present, Incubate the Future.

¹ Integrating Meteorological and Operational Data: A Novel Approach to Understanding Railway Delays in Finland, arXiv:2601.16592, https://arxiv.org/html/2601.16592.
