I have already shared a story about our experience of using artificial intelligence in search on hh.ru, and today I would like to dwell in more detail on how we measure the quality of that search.

A system of metrics (local metrics, A/B tests, queue monitoring in production, and so on) is vital for the normal operation of search, and this system requires dedicated attention and resources. It is wrong to think that it is enough to bolt on some cool ML and attach all these metrics with duct tape. Nor is it enough to measure only the quality of an already running system, and it does not really matter whether that system uses ML or Lucene out of the box.
We abandoned the old search solutions not because they seemed outdated to us, or because ML is stylish, fashionable, and trendy. The old search lacked local quality metrics that could measure the benefit of a change before launching it into lengthy experiments; worse, it was not clear what to change and measure in order to set up a process of continuous improvement.
When we started building a search system on ML, we designed a system of local metrics into it from the start. During development, we compared the quality of the new ML search, which scores results with models predicting the probability of a response, against the quality of the old keyword search, which used only textual match scores between the query and the vacancy. For this we used the usual local metrics: MAP, NDCG, and ROC-AUC. Along the way, we expanded the set of metrics and cohorts in A/B tests and covered the new search with autotests. In this article I will talk about how we monitor the quality of our recommendation models. The HeadHunter experience may well be useful to you because, again, it is not that important whether your search is based on ML or not.
Statistical tests
First of all, we began to measure model quality with the local metrics MAP, NDCG, and ROC-AUC, and saw a significant improvement from switching from keyword search to ML-based search. This is explained by the fact that a traditional search based on Lucene or Sphinx cannot predict the probabilities of targeted actions and rank by them. For example, it does not know how to take into account the salary specified in the vacancy and in the applicant's resume; it does not correlate the key skills in a resume with the requirements of a vacancy; it does not account for semantic relations when comparing words. This shows up in the search quality metrics when we compare Lucene's text matching scores with the scores from ML models that rank and filter by the probability of a response and an invitation:
| Metric | Keyword search | ML search |
| --- | --- | --- |
| Area under the ROC curve | 0.608 | 0.717 |
| Mean Average Precision | 0.327 | 0.454 |
| NDCG | 0.525 | 0.577 |
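For reference, here is a minimal sketch of how such local metrics can be computed per search session and then averaged. The data layout and helper names are illustrative, not our production code:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, ndcg_score

def offline_metrics(sessions):
    """sessions: list of (labels, scores) pairs, one per search session;
    labels: 1 if the user responded to the vacancy, else 0;
    scores: the ranker's scores for the same vacancies."""
    aucs, aps, ndcgs = [], [], []
    for labels, scores in sessions:
        labels, scores = np.asarray(labels), np.asarray(scores)
        if labels.min() == labels.max():
            continue  # ROC-AUC and AP are undefined without both classes
        aucs.append(roc_auc_score(labels, scores))
        aps.append(average_precision_score(labels, scores))
        ndcgs.append(ndcg_score(labels[None, :], scores[None, :]))
    return {"ROC-AUC": float(np.mean(aucs)),
            "MAP": float(np.mean(aps)),
            "NDCG": float(np.mean(ndcgs))}

# two toy sessions with binary response labels and model scores
print(offline_metrics([([1, 0, 0, 1], [0.9, 0.2, 0.4, 0.7]),
                       ([0, 1, 0], [0.1, 0.8, 0.3])]))
```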
Local metric values predict product metrics only as well as those local metrics are measured correctly. For example, when we switched to splitting the data by time and by user during cross-validation, the metric values dropped, but they began to predict future changes in A/B tests better.
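The idea of such a split, as a hedged sketch with made-up column names: train on earlier events and validate on later events from users the model has never seen.

```python
import pandas as pd

def time_and_user_split(events: pd.DataFrame, time_quantile: float = 0.8):
    """events needs 'user_id' and 'timestamp' columns (illustrative names)."""
    cutoff = events["timestamp"].quantile(time_quantile)
    train = events[events["timestamp"] < cutoff]
    test = events[events["timestamp"] >= cutoff]
    # keep only test users the model has never seen during training
    test = test[~test["user_id"].isin(train["user_id"].unique())]
    return train, test
```

Metrics computed on such a split are usually lower, but they are closer to what the model will face in production: new users and future behavior.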
Over the past year, by improving the quality of search and recommendations, we have increased the success rate of search sessions in the app, on the mobile website, and on the desktop by 22% on average (the dip on the chart is the New Year holidays).

Autotests
After that, we expanded coverage with unit and smoke autotests. For example, our smoke autotests check high-frequency queries ([accountant], [driver], [administrator], [manager]) and run the model against reference user resumes from a reference database, so that on every release we can see that the search has not broken: the query [sales manager] returns matching vacancies, and the first pages do not contain, for example, vacancies for project managers.
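To give an idea, here is a sketch of what such smoke tests might look like. The search client here is a hypothetical stand-in, not our actual test harness:

```python
import pytest

# hypothetical stand-in for a thin client around the search API;
# in real tests it would query a staging instance
class SearchClient:
    def search(self, query, page=0):
        raise NotImplementedError("replace with a real API call")

@pytest.fixture
def search_client():
    return SearchClient()

HIGH_FREQUENCY_QUERIES = ["accountant", "driver", "administrator", "manager"]

@pytest.mark.parametrize("query", HIGH_FREQUENCY_QUERIES)
def test_high_frequency_queries_return_vacancies(search_client, query):
    assert len(search_client.search(query)) > 0

def test_sales_manager_first_page_is_relevant(search_client):
    titles = [v["title"].lower() for v in search_client.search("sales manager")]
    # the first page should contain sales vacancies, not project-manager ones
    assert any("sales" in t for t in titles)
    assert not any("project manager" in t for t in titles)
```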
A/B
The main purpose of the A/B testing system for us is control and decision making (whether to roll out a new model, a new interface, and so on). For control, that is, quality control of an already working model, we run reverse tests in which the old model is switched on as the experiment. This way we can be sure that the current model is still better than the old one.
We have been using our own A/B testing system for quite some time. For example, right after the very first launch of the alpha version of ML-based recommendations, it let us see that the success of recommendations had grown by 30%. By the way, we covered the quality of the A/B testing system and the metrics we use in a separate article.
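For illustration, here is a minimal sketch of how the outcome of such a test (including a reverse test) can be checked for significance on session success rates with a two-proportion z-test; the numbers are made up:

```python
from statsmodels.stats.proportion import proportions_ztest

successes = [41_250, 39_800]    # successful sessions: new model, old model
sessions = [500_000, 500_000]   # total sessions per variant

stat, p_value = proportions_ztest(successes, sessions)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# a small p-value suggests the difference in success rate is not noise
```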
Performance
But a “victory” of the new model in local metrics or in an A/B test does not yet mean that the model can stay in production: it may be too resource-hungry, which would be completely unacceptable for a highly loaded site like hh.ru. To measure resource consumption, we set up monitoring of every stage of document scoring.
The graph shows the time spent in each stage of the search. It is clear that the new model turned out to be too heavy: we had to roll it back, optimize the features, and roll out a computationally lighter version.
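For illustration, a sketch of per-stage latency monitoring; the stage names are illustrative, and in a real setup the timings would be shipped to a metrics backend rather than kept in memory:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# accumulated per-stage timings; in production these would go to a
# metrics backend instead of an in-process dict
STAGE_TIMINGS = defaultdict(list)

@contextmanager
def stage_timer(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        STAGE_TIMINGS[stage].append(time.perf_counter() - start)

# usage: wrap each stage of scoring a search request
with stage_timer("feature_extraction"):
    time.sleep(0.01)  # stand-in for building model features
with stage_timer("model_scoring"):
    time.sleep(0.02)  # stand-in for running the ranking model

for stage, samples in STAGE_TIMINGS.items():
    print(f"{stage}: {1000 * sum(samples) / len(samples):.1f} ms avg")
```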

And other indicators
The most important task of the search and recommendation system is to select the vacancies that the user is most likely to respond to. We want the number of responses to vacancies to grow and people to find jobs faster. Therefore, in addition to CTR and the number of successful search sessions, the most important indicator of search performance became the absolute number of responses to vacancies. When the new model was enabled, the number of responses started to grow sharply: now users on hh.ru make on average more than 600,000 responses to vacancies per day. This indicator fluctuates: there are days when we record more than a million responses. We can also count as a success a candidate adding a vacancy to their favorites or, for example, viewing the contacts in a suggested vacancy.
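As a toy sketch, counting a session as successful if it contains any of these target actions might look like this (the event names are illustrative):

```python
SUCCESS_EVENTS = {"response", "add_to_favorites", "view_contacts"}

def success_rate(sessions):
    """sessions: list of event-name lists, one list per search session."""
    successful = sum(1 for events in sessions
                     if SUCCESS_EVENTS & set(events))
    return successful / len(sessions)

print(success_rate([["view", "response"],
                    ["view"],
                    ["view", "add_to_favorites"]]))  # -> 0.666...
```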
At the end of this story I would like to step aside a little and share one more conclusion we reached while building the new search: it is not enough to measure quality, it has to be built into the product from the start. Besides clear metrics, this is helped by correctly formulated tasks (so that work does not have to be redone), proper planning, calm work without rush jobs, and respect for the team, its ideas, and its time. It is under these conditions that there will be something to measure.