How to learn to predict late trains

Rail transport, both freight and passenger - one of the most popular modes of transport in our country. The fact that trains only travel on rails, on the one hand, simplifies and reduces the variability of the model, on the other - adds a lot of dependencies. If any unforeseen situation occurs on the tracks, this can have significant consequences for the entire network. Deviation from the schedule of one train or accident on the rails can affect the movement of the whole direction. This affects both companies that do not receive cargo in time, and passengers who may be late for work, miss the right train, get on the wrong train, or mumble the extra half hour (or even more) on the platform.

My name is Alexander Podlevskikh, I’m the lead developer at Tutu.ru, the team leader in the Electric Team, and in this article I’ll tell you how we predict the deviations of the electric trains from the schedule - lateness and advance. And also about what MCC Railways is, how the suburban rail transport system is technically arranged, and how we tell passengers about delays.

What is the actual following trains and trains

The Main Computing Center of the Russian Railways (MCC Russian Railways) is an organization whose main task is the information support of Russian Railways. In essence, the MCC is the IT Railways.
')
The MCC system contains and processes colossal amounts of data. In the previous article, I described the interaction of Tutu.ru with the MCC, working with the data of the basic schedule of electric trains and with the options for their movement. Soon after that, we also included the use of data on actual commuter trains.

The data on the actual route are a set of records in the database, each of which is the fact that the train was passing through a control point in the route. These facts are recorded by sensors located at the entrance and at the exit from the stations. There are four types of sensors:

ASOUP (automated system of operational traffic management). In essence, this is a system that aggregates data from the tracks and allows you to monitor all shipments. In particular, the observance of plans for the formation, mass and length of freight trains, forecasting the arrival of goods, taking into account the transfer of trains, cars and containers through the junction points of roads and offices ( ASOUP ). From here to the MCC there are data on whether the train has driven the station or not.
(signaling, centralization and blocking devices) - a set of technical means used to regulate and ensure train traffic safety ( Signaling devices, centralization and blocking devices ). In fact, these devices are sensors on the railway arrows, the data from which are fed to the MCC.
SAIPS (Automation System of Rolling Stock Identification) - devices standing along the track and fixing passing cars.
GIS (Geographic Information System of Russian Railways) is a geolocation system based on GLONASS, which allows you to monitor the performance of railway transport, speed at sites, traffic safety violations and more.

For each station, data in the MCC may come from several sources, depending on the equipment installed on it. In addition, different sources may have different accuracy. For example, we analyzed that the signaling probability of SBS is significantly lower than that of SAIPS. Therefore, if there is data from both sources for the same event, you give priority to the SCB. And at the moment, the signaling system is the main source of data. In addition, we provide a source of data for each time on the route pages. This is a story, first of all, for advanced users who, having this data, can understand the approximate accuracy of the forecast.

All this data is collected into the DB2 database in the MCC system, and from there the partners take it for their own needs.

Interaction with the MCC

Rules for ensuring antiterrorist security require the use of data of actual train traffic not less than 10 minutes old. In addition, since the system is heavily loaded, partners (such as Tutu.ru ) can access it no more than once every few minutes. Because of this, users are not entirely available to actually follow the composition, but how he followed 10 minutes ago. Moreover, the data comes only from sensors at the stations, and at shallow stops (where even the platform may not be) or just along the way, there are practically none.

And there are also situations when the station data is linked to the wrong composition. As a rule, such situations occur infrequently and are corrected by the MCC employees rather quickly. However, in order to minimize such cases in our local data warehouse, it is necessary to insure each time and request a greater depth of data from the MCC than is necessary. And even in this case, we can not be sure of the data at 100% - too many variables and all but. But the error rate is negligible (about 1 error per 100,000 entries).

Using

To obtain and store actual follow-up data, a connection mechanism to the MCC was used, similar to obtaining the data of planned schedules (I described the mechanism in the previous article ). And for their use on our site, previously constructed links of our local data and the MCC database, as well as links of stations were useful in many ways (the algorithm is also described in the previous article). With the help of this correspondence, it was possible to link approximately 90% of our entities with actual data. For the rest of the trains, we found new connections by searching by train and station numbers.

So, what was the problem:

We have a timetable page on the site (for example, Moscow - Petushki ), to display which the “timetable” service is polled. It takes as input the identifiers of the stations of departure and destination and the date of the trip, and in response provides a set of objects containing data of the electric trains passing between these stations. It includes train identifiers - number, starting / ending station, schedule, and information about a particular train ride from one station to another - arrival / departure time, platform. The complete data of the train route (time of passage of each intermediate station) does not contain the answer - such a volume of information would be too large, the calculations were made longer and simply this is unnecessary.

We had to calculate the mark of possible delay / advance on the basis of actual follow-up data, which are compared with the schedule data from our local database. Sometimes the calculation is not based on the last record of the passage of the checkpoint by train, but on several. This is what happens when different sensors work with varying degrees of accuracy. For example, at some station there is a strong difference between the actual train and the planned one. That's when we look at the time of the passage of the previous station.

That is, we were faced with the task of showing users information about a possible deviation of trains from the schedule.

We would not have enough actual data from the Main Computer Center to solve this problem. And we made a microservice, the task of which is to return the full route using the train identifier, which includes not only the data of the planned schedule, but also the actual data of the stations.

Work algorithm

The algorithm of the service operation is as follows: at the entrance we get the train identifier and by it we make a request to the service of the schedule for obtaining the actual route. Then another request is sent to a service that stores connections between our local trains and trains in the MCC system. If there is a connection, then from the actual data storage service we get all the information about the passing of control points by the train. If there are no connections in our repository, or there are no actual data for these connections, then we make a request for obtaining data by train number and by passing stations. If in the second case the data were found, then we establish a new connection between the train found and ours, for which the search was performed.

In order to correctly mix the actual route data to the route, it is necessary to understand which fact relates to which station. In the actual route data, the station identifies the ESR code — the station number in the Express-3 system; in our route data, the station has its own identifiers.

For station matching, requests are made for another 3 services: the communication service of our stations and stations in the MCC. For these identifiers, a request is made to the local data storage service of the MCC for complete information about the stations (with ESR identifiers). In addition, a request is made to the service of local stations, from which Express-3 codes are obtained. Knowing the connection stations, we build the route of the train on the current date with the planned passage of the stations and the data of the actual passage of these stations.

In the first version, the service worked exactly like this - when opening pages, requests were made for each train in the schedule, and there was no caching. At the same time, the overwhelming majority of requests we have for Moscow destinations are “for today” (more than 100 electric trains run here every day).

In general, when the mechanism was turned on for production by 5% of users, we received a load of approximately 6000 rps in total for all the services used. For such an infrastructure was not ready. It was necessary to optimize the work, reduce the total number of requests. We solved this problem as follows:

Limit the number of trains for which requests are made. They now do not go to all the trains on the current day, but only to the nearest ones: which will go within a few hours or have already left. This reduced the requests by about 2 times. But this was not enough.
Next, we did not download data for each individual train, but for a pack of trains. There have been attempts to vary the number of elements, but another problem has surfaced - if you increase the pack size, this significantly slows down the loading of pages, since the data is not generated in parallel, but in series.
The next way is to cache static data (train and station connections). This improvement has already greatly facilitated the overall picture - we were able to turn on the issue to 100% of users. However, during peak hours (Monday morning and Friday evening) there were still problems and periodically resources allocated for servers were running out.
Then we added a cache with summary data: the train route + the actual route with all the computed matches. This greatly accelerated the average page load. But, since we receive the latest actual data every few minutes, and the service that makes the final calculations does not have an idea about the trains for which new data were obtained, and for which not, we had to check the time since the last data load. If this time is more than what was previously, the cache is reset completely. This had an effect, and the average page load time was greatly reduced. But at the moments of receiving data and, accordingly, flushing the cache, due to the huge number of requests, the load time still jumped. At peak hours, the chart looked like a cardiogram.
And, finally, the final cache was added to the data storage service - and now it completely leveled the situation. That is, now the result of the query to the database is cached, and when new data is received, it is updated only for those trains for which these data were obtained.

As a result, the scheme of the calculation of the route of the train with the actual data is as follows:

It works on every page of the schedule for the current date for 30-50% of trains and gives an overhead per page load of up to 100 ms at 90 percentile.

Data analysis

Based on the data obtained, it can be concluded that the train is late or is ahead of schedule.

First of all, we analyze whether there are any actual train data for a particular train, when it should already be. If they are not there, then we write that a delay is possible, because the train should have started from the initial station more than 5 minutes ago, but this did not happen.

If there is data, then data from the last or last two stations are analyzed (for reliability), for which there is actual tracking data. If there are deviations from the schedule, the train is marked "possibly late / very late / ahead." In addition, the reason is written - for example, that the train arrived at one of the previous stations with a delay of several minutes.

Another small but useful improvement is this. Suppose we have a train Semenov - Nizhny Novgorod, which passes through 2 stations - Tarasiha and Kiselikha. From Semenov the train starts at 9:20, arrives at Tarasikhi at 9:43, at Kiselikha at 10:11, and arrives in Nizhny Novgorod at 10:32, or rather should arrive. But according to actual data, we see that the train reached Tarasikhi only at 10:02 (that is, he is already 19 minutes late). The frozen passenger on a deserted platform in Kiselikha looks at the timetable at 10:25, and if we only showed the timetable data, then the person would have decided that the train had already left. But, since it’s impossible to catch up with the schedule in 19 minutes in practice, we show this train as not yet gone from Kiseliha.

If according to the latest data received, the train is on schedule, then the following stations are analyzed, for which actual traffic data usually arrive. If there is no data on them in time, then, most likely, the train stopped somewhere on the stretch and we show a warning that it has not arrived at the intermediate station, although it should.
But that's not all. It happens that a person is watching a schedule from Moscow, standing on the platform of Chukhlink. Here the story of the actual following does not work, since Moscow is the starting point of the path. There is either no information here or it is there, but this means that the train has already left. In this case, we analyze the movement speed of the trains.

Turnovers

In most cases, one electric train makes not one trip per day, but runs along the route all day. At the same time for each trip he has a number change, a crew of operators may change. In all systems with a schedule (both ours and the MCC) these are different objects. However, in many cases, the systems have information on which flight the electric train will be sent to in the future. The composition itself may be different every day, but it is the bundle of flights that is fixed. And, if the train travels late on the first trip, then the second trip will also start late. There are exceptions to these rules, for example, if the delay is very strong, then another compound may be allowed on the second trip, but in practice this happens quite rarely.

Having data on such connections, we also analyze data on the actual train traffic, which are served by the same train, and show a warning of possible lateness. But this is done only if the delay of the previous flight can affect the one that the user is watching.

In general, the next time you gather from Tarasiha to Kiseliha - be sure that we will let you know that the train is late.

Source: https://habr.com/ru/post/353710/

All Articles