Where is the money on the road (an algorithm that allows one and a half times to reduce costs in a taxi)

Probably, now there are very few people who have never had to use the services of a taxi, especially since recently the popularity of this type of transport is growing, and the price is steadily falling. On personal experience, I can say that this is a convenient way, for example, to go with a small child to the clinic.

Have you ever wondered what we pay for using a taxi?

Undoubtedly, the main part is the driver’s charge and costs for the car, but it would be rather rash to assume that this fee applies only while you are on the road. When the taxi driver has to wait for the next order for a long time, who will pay for his time and idle time? - ultimately we are with you. A taxi driver by right will not agree to a reward for a day below its value in the labor market, while he performs an average of five orders or twenty-five. It became interesting to me, how much idle time is not profitable for anyone, and how it can be reduced. Below I would like to share with you the results of my own research on this issue.

How does a modern taxi driver work?

Receives an order from the dispatcher or through the application for taxi drivers, takes the passenger to the desired place, and most often waits for the next order near the landing site.

What is bad such a policy?

It can lead to a situation where in some areas of the city there are too many taxi drivers who have to queue for the next order for a long time, while
in other areas there is a shortage of free cars and part of the orders are therefore lost. In the most acute form, the described problem arises as a result of daily migrations: in the morning, taxi drivers, along with the flow of working people, “flow away” from residential areas, where at that time there is also an increased demand for their services, and accumulate in the center, where and especially few who wish to return home.

Are previous conclusions plausible reasoning or are they somehow tested in practice ?

When I thought about this for the first time, I relied solely on common sense and little experience in dealing with taxi drivers that brought me up. However, at that moment, when I had the first sketch of the algorithm that optimizes the work of the taxi service, I certainly wanted to test my hypotheses and evaluate the effectiveness of the solution obtained. For obvious reasons, none of the major market players, such as Uber, Gett, or the “domestic” Yandex.Taxi, would like to reveal the information I need. Fortunately, the administration of a distant Chicago has long been collecting and making publicly available data on all trips of a significant part of the city taxi services. (For more, see "Notes on data").

What was the state of affairs in Chicago in 2016?

According to my estimates (for details on their methodology, see Notes on Estimates of Service Time), of the time spent by all taxi drivers at the workplace, at best, only 46% of it was order fulfillment. As for the negative manifestation of the effect of daily migration, the average number of machines available within administrative districts is 30% more or less than the number of incoming orders at this time.

What can be improved and how?

Two obvious lines of action suggest themselves:

1) To organize the transfer of cars from areas where there are too many of them, to areas where there is a shortage observed or expected.
2) Maintain the optimal number of machines awaiting order in each area.

Of course, you can set a more general goal of maximizing profits, and in the course of solving it, you can even refuse to serve some of the “unprofitable” routes, but this strategy is fraught with a company for losing adherence to generally beneficial users if the competing player serves them through the “unprofitable” routes . Given this possibility, it seems reasonable enough to attempt to serve all the available orders, but with the least cost.

What requirements should be submitted to a good traffic redistribution algorithm?

Here, in my opinion, the following conditions are desirable:

a) The total time spent on the transfer, since it is not paid for by passengers, it is advisable to make as little as possible (In the program I wrote, for example, special minimizing algorithms on graphs are used for this, but about them later);
b) Since the prosperity of drivers most often directly depends on the volume of orders fulfilled by them, it is desirable that the overhead costs of implementing the algorithm fall on each of them equally. It should also be noted that a driver who has received an order to move to another area will rather fulfill it if it takes 15 minutes, and rather ignore it if it is 50.
c) The degree of deficiency in the number of free cars in each area changes during the day, so the algorithm must correct the imbalance with sufficient efficiency and, when planning transfer flows, calculate them taking into account the situation in the future.
d) The computational complexity of the algorithm should make it possible to put it into practice.

A little trick, greatly simplifying things.

Imagine twenty soldiers standing in front of you in a line. How to ensure that the place in front of the first became occupied, and the place of the latter was freed? One way is to ask the one who is standing last to move forward. Another way to do the same, but twenty times faster is to ask all soldiers to move only one position. This idea, applied to taxi cars, means the following: instead of transferring them from the surplus area A directly to the deficient area B remote from it, you can “move” cars by just one position along the chain of areas leading from A to B. hand, allows almost instantly, compared with the rate of change of the deficit on the map, to eliminate the imbalance, on the other, requires drivers to spend on the "shifts" every time no more than 15 minutes.

What is the mathematical method of implementing the traffic redistribution algorithm?

(Here I will describe in detail the technical details of the algorithm I developed, the reader who is interested only in results can skip this part of the article.)

Let us try to reduce the problem to a well-known problem on the graphs of finding in a production-consumption network a supply stream of minimal cost. Since the demand for taxi cars and their distribution in the city depend on the time of day and day of the week, we will build a separate graph in the manner described below for each day of the week and the time interval per hour. The top of the graph will be administrative districts. Those of them in which the expected number I released after order fulfillment exceeds the estimated number of orders U will occupy the role of production centers with power S = I - U, the rest - the role of consumption centers with power - S = U - I. We shall call the value S imbalance peaks (district). We assume that there is an edge from area A to area B, if (and only if) from A to B can be reached on average in time t, not exceeding 15 minutes, while t is the cost of this edge (for details, see "Notes about the estimated travel time between districts "). In the classical problem of a production-consumption network, restrictions on the skipped flow value have graph edges, but in the case under consideration it is difficult to assume that taxi drivers alone can cause a traffic jam, but twice in the same interval between orders it would have been a bad tone from our side. By virtue of the last remark, the "shear" flow from the vertex should not exceed I.

Is there a way to transform a graph in such a way that it is possible to apply algorithms to the standard task of finding a stream of minimal cost?

Let's select any vertex and replace it with two new ones so that all the edges that belonged to the original one came into the first one, and all that went out of the original one came out of the second. It remained to connect these two new vertices with a single edge with zero cost and throughput I and attribute imbalance S to the first, and zero to the second. Having done this transformation with each vertex of the original graph, you will translate it to the form that is standard for the minimum cost flow problem, but this is not all.

In order for the task of searching for a flow that satisfies the needs of all "consumption centers" and realizes the capacity of all "production centers" in general, the total imbalance of all vertices must obviously be zero. Can the last property be fulfilled for taxi service counts? It looks very plausible that in the graphs built for the morning hours, when the need for taxi cars is growing, the total imbalance will be negative, and for the evening - the opposite. Let's imagine how a taxi service can cope with this problem: it would be reasonable to introduce such a schedule for taxi drivers, in which the number of cars involved would always adequately correspond to the demand. To take this into account, in my algorithm I added an additional vertex “house” to the graph with high cost edges (40 minutes) going to each district, and a minor one going back.

By the way, if you ever need to numerically solve the task of finding a stream of minimal cost, I strongly advise you not to resort to dubious sources like Wikipedia or various forums that advise the outdated Ford-Bellman algorithms or the potentially looping network simplex method. Algorithms good in terms of operating time can be found in this article.

“Under the hood” of the programs attached to the article works my version of the Edmonds – Karp algorithm.

What are the relative costs of implementation of the redistribution algorithm in the city of Chicago?

On average, for a week, taxi drivers spend 275 million seconds on fulfilling orders in Chicago, idle waiting for 315 million seconds, while the redistribution algorithm would take them only 30 million seconds.

Is it possible to assume that all the time costs in the ideal organization of transportation is the time to fulfill either orders or the prescriptions of the traffic redistribution algorithm?

Unfortunately not! Suppose, for example, in a certain area, on Monday between 25th and 10th am, 25 orders are expected, 15 vacant cars released from the disembarkation of passengers and 10 more, additionally transferred from surplus areas. It would seem that the number of free and required machines is in the balance for this period, is it worth it to expect to be able to complete all the orders? Even if the expected numbers coincide with the real ones, the order in which the free machines will arrive and the applications for them are, generally speaking, arbitrary. To guarantee the availability of a free machine at the time of order receipt, the latter will always have to keep a small margin. The stock of machines, which is not likely to run out during the delay in their redistribution (about 15 minutes), is (Poisson distribution) approximately two square roots of the expected number of orders for this period.

It is rather unexpected that the time lost in such queues is much longer than the time spent on the execution of the redistribution algorithm and is approximately 114 million seconds per week. Is it possible to somehow reduce it?

You can think of something. One of the ideas is the following: if taxi cars that deliver passengers to a certain area arrive at an average interval of a minute, then you should not give up the next order, even when there are no free cars at the time of their arrival: after three minutes at most one such surely would appear. True, this wait-and-see strategy has one drawback - the car’s “filing time” increases. In my algorithm, I reduced the queue size in each area by the average number of cars arriving there in 5 minutes. Such a measure should only in rare cases increase the “filing time” by five minutes, but the total loss of idle time in queues according to the calculations should be reduced by more than two, reducing them to 50 million seconds per week.

What is the final result?

For Chicago, in accordance with the calculations (you can find their programs at the links below), using the redistribution algorithm for free cars in combination with controlling the size of their queues in each particular area would increase the proportion of time spent on passenger transportation from 46% to 77%. If the calculations are correct, then their introduction in Chicago would save about $ 15 million per year, and in cities the size of Moscow, about $ 100 million - money that now just remains on the road and does not benefit anyone.

Code projects related to this task are available here .

Data notes

In the calculations, I used the data for the entire 2016. Two months ago there were more fields in the dataset and among them there was a lot of information about the exact coordinates and payment for the order. Apparently, some of them were hidden in the commercial interests of taxi services, and some because of problems with personal life. I did not need all the fields available at that time, and the file with which I finally started working can be found on the link . One entry in it consists of:

order number cipher
machine number cipher
the date and time of embarkation and disembarkation, rounded to the nearest marks on the clock, multiples of 15 minutes,
the length of the way is in miles
lead time in seconds
administrative areas of Chicago, in which the landing and landing, respectively,
approximate geo-coordinates of the landing site and the landing site.

This file has a size of almost eight gigabytes, and a good two-thirds of them are occupied by inappropriate long ciphers. To make it easier for me to work with the data, I replaced the ciphers with their ordinal number during dictionary ordering, translated some numbers from text format to numeric format and saved the records as objects of the user class "date unit". Along the way, I threw out a small number of records with completely unrealistic data. Compact packaging has reduced the space for the same amount of information to 750 megabytes. The latest file can also be found by reference , and the storage format for example in this program .

Note on the estimated service time

The average lead time for orders in a week I received, summing up the lead times for orders indicated in the record of each trip and divided by 52. As you can see, there is no magic. To find the idle time, I first formed an order-based list of completed orders for each machine and subtracting from the start time of the next order in the list the completion time of the previous one, I received new lists of so-called "timeouts". There were two difficulties. As I mentioned in the notes about the data, the entries about the embarkation and disembarkation times are not real times, but their rounding to the nearest marks on the dial that are multiples of 15 minutes. The second difficulty was not to take breaks between work shifts as idle. The second one I decided to unload the distribution of "timeouts" of random 30 cars, from which it was clear that waiting times of the order of 15-45 minutes were typical, and often there were hourly delays. From this I made a conclusion to include in the calculation of idle time only “timeouts” for less than an hour and a half, and ignore the rest. I bypassed the first difficulty simply by summing up all the timeouts taken into account. Why can this be done? This could be an interesting problem in probability theory — to show that if a pencil is long L randomly arranged on a ruler and rounded the coordinates of its ends to the nearest integer, then the mathematical expectation of the difference of these rounded values will be equal to the length of the pencil.

Notes on estimating travel time between districts

The first idea of such an assessment was to average the time over all available records of trips from one area to another for a given time of day and day of the week. However, it seemed to me that such a value would be slightly overestimated due to all sorts of baggage laying, disembarking and embarking passengers, as well as traveling to the house porch through narrow streets, where there is absolutely no need to send cars, realizing just their transfer. Therefore, I simply assumed that the order time is a certain delay constant plus a value proportional to the distance specified in the order. Constant I cut-off using the method of least squares (see this program), and then averaged only the second term. If the proportion of the constant in the time of the order was more than a third, I deducted only a third.

Source: https://habr.com/ru/post/328522/

All Articles