We want every taxi order to bring the user a clean, well-maintained car of the same make, color and license plate number as shown in the app. To achieve this, we use remote quality control (DCC).
Today I will tell Habr readers how we used machine learning to cut the cost of quality control in a rapidly growing service with hundreds of thousands of cars, without letting cars that break the service rules onto the road.
DCC process diagram
In a DCC check, we review photos of the car and decide whether it can be used to fulfil orders or whether, for example, it should be washed first. It all starts when we call the driver for a DCC check through Taximeter, the driver's app. This usually happens once every 10 days, but it can be more or less frequent depending on how well the driver passed previous checks. Right after the call, the driver receives a message inviting them to pass photo control. Once the driver accepts the invitation, they photograph the exterior and interior of the car from different angles in the same app and send the photos to Yandex.Taxi. The driver can keep taking orders while the DCC check is in progress.
DCC start screen in the Taximeter app
Car photo screen in the Taximeter app
The photos then go to Yandex.Toloka, a service for quickly completing simple but high-volume tasks through crowdsourcing. We have written on our blog about how it works and why Yandex.Toloka is needed.
In Yandex.Toloka, at least three performers answer questions about the condition of the car within a single check. If the performers reach a consensus, the decision on whether the driver can take orders is made from their answers. A check in Yandex.Toloka has two possible outcomes:
If the performers do not reach a consensus, the photos are sent to assessors, Yandex.Taxi employees who check the car more carefully and make the final decision. Assessors complete a special training program and have more experience.
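The routing between performers and assessors can be illustrated with a minimal sketch. It assumes that "consensus" means all performers give the same answer; the article does not spell out the exact rule, and the function name and the answer labels `"allow"` and `"restrict"` are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def aggregate_answers(answers: Sequence[str], min_votes: int = 3) -> Optional[str]:
    """Return the shared answer if every performer agrees, else None.

    None means "no consensus": the check should be escalated to an assessor.
    Treating consensus as unanimity is an assumption made for illustration.
    """
    if len(answers) < min_votes:
        raise ValueError(f"need at least {min_votes} answers, got {len(answers)}")
    counts = Counter(answers)
    if len(counts) == 1:
        # Every performer gave the same answer.
        return answers[0]
    return None


if __name__ == "__main__":
    print(aggregate_answers(["allow", "allow", "allow"]))     # consensus
    print(aggregate_answers(["allow", "restrict", "allow"]))  # escalate
```

A looser rule, such as a majority vote, would only change the condition inside the function; the escalation path stays the same.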
How a DCC check looks to Yandex.Toloka performers
As Yandex.Taxi grows, so does the number of DCC checks, which means the costs for Toloka performers and assessors rise. In addition, cars take longer to check. While a DCC check is in progress, we can either allow the driver to take orders or not. Both options have drawbacks: in the first case, an unscrupulous driver has time to complete several orders in a car that does not meet the standards; in the second, every driver called for photo control cannot work until the check is completed. It is therefore important to check cars quickly so that neither users nor drivers are inconvenienced.
Watching the cost and average check time charts climb, we realized that we wanted to reduce Toloka costs, take load off the assessors and shorten the average check time; in other words, to automate some of the checks. Naturally, we did not want to sacrifice service quality by letting more cars that fall short of the quality standards onto the road, nor did we want to restrict conscientious drivers from taking orders. We needed to automate DCC without increasing the share of errors in the overall flow of checks.
DCC process diagram with ML inside
First, we settled on the problem statement: automate as many checks as possible without increasing the error rate in the overall flow.
Let's clarify what counts as an error in our problem. Errors come in two kinds: false positives and false negatives. In our terminology, negative is a check result that lets the driver keep working, and positive is a result that leads to a temporary restriction on taking orders. A false negative is then a case where we mistakenly allowed a driver with a "bad" car to take orders, and a false positive is the opposite: we restricted a driver whose car is fine. Accordingly, False Negative Rate (FNR) is the share of drivers with “bad” cars whom we allowed to take orders, and False Positive Rate (FPR) is the share of drivers with good cars whom we restricted. So what we wanted from introducing machine learning was this: automate as many checks as possible without increasing FPR and FNR compared to the system without machine learning.
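To make the two rates concrete, here is a small sketch that computes them from labeled checks, following the convention above that positive means "restrict the driver". The function name and the toy data are illustrative assumptions, not taken from the production system.

```python
from typing import Sequence, Tuple


def error_rates(y_true: Sequence[bool], y_pred: Sequence[bool]) -> Tuple[float, float]:
    """Return (FPR, FNR). True means positive, i.e. "the car is bad, restrict".

    FPR = good cars that were restricted / all good cars
    FNR = bad cars that were allowed     / all bad cars
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    negatives = sum(1 for t in y_true if not t)
    positives = len(y_true) - negatives
    fpr = false_pos / negatives if negatives else 0.0
    fnr = false_neg / positives if positives else 0.0
    return fpr, fnr


if __name__ == "__main__":
    # Toy data: two bad cars (one missed), four good cars (one wrongly restricted).
    truth = [True, True, False, False, False, False]
    decisions = [True, False, False, False, True, False]
    print(error_rates(truth, decisions))  # (0.25, 0.5)
```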
Next, we needed to decide which metrics to rely on when choosing models and the decision thresholds for their predictions. From the problem statement, it is clear that we care about three quantities:
We maximize the first while keeping the second and third within their limits.
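Based on the problem statement given earlier (automate as many checks as possible without increasing FPR and FNR), this can be written as a constrained optimization problem. The notation is ours and only illustrative: $\theta$ stands for the choice of models and thresholds, $A(\theta)$ for the share of automated checks, and $\mathrm{FPR}_0$, $\mathrm{FNR}_0$ for the error rates of the system without machine learning.

```latex
\max_{\theta} \; A(\theta)
\quad \text{subject to} \quad
\mathrm{FPR}(\theta) \le \mathrm{FPR}_0,
\qquad
\mathrm{FNR}(\theta) \le \mathrm{FNR}_0
```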
A question may arise: why not maximize the money saved or minimize the average check time directly, rather than through the share of automated checks? Optimizing for money is a very attractive idea, but it is usually hard to implement. In our case, the savings have two components. The first is the saving from each automated check, because every check by assessors or in Yandex.Toloka costs money. The second is the saving from fewer errors, because every error costs Yandex.Taxi money. Objectively estimating how much errors cost us is very difficult, so we limit ourselves to the first component. That value increases monotonically with the share of automated checks, so we can maximize the share instead of the savings. The same reasoning applies to the average DCC time, which decreases monotonically as the share of automated checks grows.
A DCC check essentially comes down to choosing answers to a series of questions about the car's condition from its photos, which sounds like an image classification problem. Such problems belong to computer vision, and today they are solved with a specific tool: convolutional neural networks. We decided to use them to automate DCC.
Now that we understood what to optimize and why, it was time to collect data and train models on it. Collecting data was easy, because all DCC checks are logged and kept in storage in a convenient form. In the first version of the solution, the features were photos of the exterior and interior of the car from four angles, the make, model and color of the car, and the results of the 10 previous DCC checks. As target variables, we took the answers to all the check questions, for example: “Is the car damaged?” or “Does the color of the car match the one specified in the driver's card?”. The main target variable was the answer to the main question: “Should the driver's ability to take orders be restricted?”. We trained one large model, very similar to VGG with SENet attention, to answer all the questions at once, and ran into several problems.
“All at once” approach
Problems of the “all at once” approach:
Together, these four problems kept us from using the first solution in practice, but we did not lose heart and came up with a second one.
We decided to focus on checks of the car's exterior, because they make up about 70% of the total flow. In addition, we decided to split the overall task into subtasks and learn to answer each DCC question separately.
“Everything but gradually” approach
Our service had already worked on DCC automation before and had deployed a model that filters out dark and irrelevant photos. We kept using this model to answer the question: “Are real photos of the car present for each of the following views: front, left side, right side, rear?”.
Work on the second solution began with license plate recognition: we took a model from Yandex.Search's computer vision service (the same people who made DeepHD) to recognize license plates on cars. This let us answer the question: “Do the car's license plate number and region code fully match the driver's card?” In more detail, we compared the recognition result with the number in the driver's card and, depending on the Levenshtein distance between them, chose one of the answer options: “the number matches”, “the number does not match”, or “the question cannot be answered reliably”.
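A rough sketch of this comparison step is shown below. The Levenshtein distance is the standard dynamic-programming version; the distance cut-offs `match_max` and `unsure_max`, the answer labels and the sample plates are hypothetical, since the article does not state the actual values.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def compare_plate(recognized: str, expected: str,
                  match_max: int = 0, unsure_max: int = 1) -> str:
    """Map the distance to one of three answers (thresholds are assumptions)."""
    distance = levenshtein(recognized.upper(), expected.upper())
    if distance <= match_max:
        return "match"
    if distance <= unsure_max:
        return "unsure"  # close enough that a recognition error is plausible
    return "mismatch"


if __name__ == "__main__":
    print(compare_plate("A123BC77", "A123BC77"))  # match
    print(compare_plate("A123BC77", "A128BC77"))  # unsure
    print(compare_plate("A123BC77", "X999YZ50"))  # mismatch
```

The middle band exists because a small distance may come from a recognition mistake rather than a real mismatch, so such checks are better left to a human.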
Next, we trained classifiers to recognize car makes and models, as well as colors. From then on, we could answer the question: “Do the make, model and color of the car match those in the driver's card?”
Finally, we trained classifiers to detect damaged and dirty cars. This let us cover the questions: “Are there any damages or defects on the car body?” and “How dirty is the car body?”
The “everything but gradually” approach let us solve the problem of checking the license plate number. We also got rid of the incompleteness and noisiness of the target variable, because we now had a sample in which negative-class objects were checks passed without any issues and positive-class objects were checks in which the assessor or all three Yandex.Toloka performers found a specific defect, for example, body damage. With the first two problems solved, our models became interpretable, and we could explain the reason for a restriction to the driver so that they could fix the shortcomings before the next check. The overall quality of the answers also improved considerably, and for some combinations of model thresholds FPR and FNR dropped to the level of Yandex.Toloka, which made it possible to put the models into production.
We had a choice: launch a regular batch process that applies the models to checks accumulated in the queue, or build a separate service that can be called through an API and returns model responses in real time. Since it is important for us to find "bad" cars quickly, we chose the second option. As soon as the core of the service was written and supported the necessary functionality, we began adding models to it.
To fully approve a check, you need to be able to answer every question in the instructions, but to restrict an unscrupulous driver's access to the service, it is sometimes enough to answer just one question. So we decided not to wait until all the models were ready, but to add each one as soon as it was ready. In general, adding a model to the service looks like this:
This approach let us not only catch more and more “bad” cars instantly as new models were introduced, but also measure quality online at no extra time cost while the models ran in the background.
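The asymmetry between approving and restricting can be sketched as follows. This is an illustration under assumptions, not the service's actual logic: the question names, the answer labels `"ok"` and `"defect"`, and the convention that a missing or unsure answer is `None` are all hypothetical.

```python
from typing import Dict, Iterable, Optional


def decide(required_questions: Iterable[str],
           model_answers: Dict[str, Optional[str]]) -> str:
    """Combine per-question model answers into a decision for one check.

    model_answers maps a question to "ok", "defect", or None
    (no model deployed yet, or the model is not confident enough).
    """
    # One confident defect is enough to restrict the driver.
    if any(answer == "defect" for answer in model_answers.values()):
        return "restrict"
    # Approval requires a confident "ok" for every question in the instructions.
    if all(model_answers.get(question) == "ok" for question in required_questions):
        return "approve"
    # Otherwise people still have to look at the check.
    return "manual"


if __name__ == "__main__":
    questions = ["plate", "make_model_color", "damage", "dirt"]
    # Only the plate model is deployed so far: it can restrict but not approve.
    print(decide(questions, {"plate": "ok"}))       # manual
    print(decide(questions, {"plate": "defect"}))   # restrict
    print(decide(questions, {q: "ok" for q in questions}))  # approve
```

With this shape, each newly added model immediately widens the set of checks that can be restricted automatically, while automatic approval only becomes possible once the last question is covered.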
Eventually we added the last model to the service and tested it. Now we could answer every question in a check and therefore approve checks automatically. Since there are far more “good” cars than “bad” ones in Yandex.Taxi, automatic approval sharply increased our main metric, the share of checks that are automated. All that remained was to choose thresholds that maximize the share of automated checks while keeping the overall FPR and FNR of the whole system at the same level. To select the thresholds, we used a sample labeled independently by Yandex.Toloka performers, assessors and the Yandex.Taxi employee who trained the assessors to check cars. We used that employee's labels as the ground-truth values of the target variable.
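One way to picture this threshold selection is a grid search over a pair of thresholds for a single model score. The sketch below is a simplification under stated assumptions: one score per check (higher means "more likely bad"), checks below `low` are auto-approved, checks at or above `high` are auto-restricted, everything in between keeps the human decision, and the baseline is the human-only system. All names and the toy data are hypothetical.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def rates(truth: Sequence[bool], decisions: Sequence[bool]) -> Tuple[float, float]:
    """Return (FPR, FNR); True means "restrict" / "the car is bad"."""
    false_pos = sum(1 for t, d in zip(truth, decisions) if not t and d)
    false_neg = sum(1 for t, d in zip(truth, decisions) if t and not d)
    negatives = sum(1 for t in truth if not t)
    positives = len(truth) - negatives
    return (false_pos / negatives if negatives else 0.0,
            false_neg / positives if positives else 0.0)


def pick_thresholds(scores: Sequence[float], human: Sequence[bool],
                    truth: Sequence[bool], grid: Sequence[float]
                    ) -> Optional[Tuple[float, float, float]]:
    """Return (automated_share, low, high) with the largest share whose
    system-wide FPR and FNR do not exceed the human-only baseline."""
    base_fpr, base_fnr = rates(truth, human)
    best = None
    for low, high in product(grid, repeat=2):
        if low > high:
            continue
        decisions, automated = [], 0
        for score, human_decision in zip(scores, human):
            if score < low:        # confident "good": approve automatically
                decisions.append(False)
                automated += 1
            elif score >= high:    # confident "bad": restrict automatically
                decisions.append(True)
                automated += 1
            else:                  # uncertain: keep the human decision
                decisions.append(human_decision)
        fpr, fnr = rates(truth, decisions)
        share = automated / len(scores)
        if fpr <= base_fpr and fnr <= base_fnr and (best is None or share > best[0]):
            best = (share, low, high)
    return best


if __name__ == "__main__":
    scores = [0.05, 0.1, 0.2, 0.5, 0.8, 0.95]
    truth = [False, False, False, True, False, True]
    human = list(truth)  # toy case: human labels agree with the ground truth
    print(pick_thresholds(scores, human, truth, grid=[0.0, 0.3, 0.6, 0.9]))
```

In the toy data, the two ambiguous scores (0.5 and 0.8) fall between the chosen thresholds and stay with humans, while the other four checks are automated.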
Once the models were in production, we needed to measure the online quality of the decisions made from their answers. Here are the numbers we saw:
Thus, introducing machine learning not only helped save money, but also made the service safer and more comfortable for users. However, this is far from the end of the story. Our rapidly growing team will keep working to automate even more checks and make Yandex.Taxi even more convenient, comfortable and secure.
While working on DCC automation in Yandex.Taxi, we ran into many problems, found several successful solutions and drew six important conclusions:
Source: https://habr.com/ru/post/433386/