
An application built on the hh.ru API: recommending vacancies for your resume


I recently published a post explaining how easy it is to start using our API. I wanted to play around with the data it returns myself, so I decided to write an application that recommends current vacancies based on the information in a resume. At the end of the article there is a link to the result, where anyone can get a list of recommendations for their resume.

On hh.ru you can already search for vacancies that suit a resume, but that goes through our general search system, and I wanted to build a more personalized selection.

From the user's side, this is a regular website where you can log in via the HeadHunter API, get a list of your resumes and view recommended vacancies. From the developer's side, there is a module that downloads current vacancies via the API and a module that builds recommendations from the collected vacancies.

Vacancy collector


The algorithm is simple. Every N minutes we request the list of vacancies published during the last N minutes. A single request returns no more than 500 vacancies, so the requests are made page by page. The API path for vacancies looks roughly like this: /vacancies?per_page={}&date_from={}&date_to={}&page={}. The lists do not contain the full vacancy data, which means each vacancy has to be requested separately. It should also be noted that sometimes a large number of vacancies is published in a short period of time, and the program cannot download everything within the allotted N minutes, so vacancies are downloaded in several threads. For those who are curious, here is a link to the downloader code on GitHub. I apologize in advance for the quality of the code: I am not a programmer by trade. The script is run periodically by cron.
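To give an idea of the downloading loop, here is a minimal single-threaded sketch (the real collector works in several threads). The endpoint and parameters are the ones named above; the time window, per_page value and helper names are illustrative assumptions, not the author's actual code.

# Minimal single-threaded sketch of the collector (illustrative).
import datetime
import requests

API_URL = 'https://api.hh.ru/vacancies'
N_MINUTES = 5  # polling interval, illustrative

def fetch_recent_vacancy_ids():
    date_to = datetime.datetime.now()
    date_from = date_to - datetime.timedelta(minutes=N_MINUTES)
    fmt = '%Y-%m-%dT%H:%M:%S'
    ids, page, pages = [], 0, 1
    while page < pages:
        resp = requests.get(API_URL, params={
            'per_page': 100,
            'date_from': date_from.strftime(fmt),
            'date_to': date_to.strftime(fmt),
            'page': page,
        })
        data = resp.json()
        ids.extend(item['id'] for item in data['items'])
        pages = data['pages']   # the list endpoint is paginated
        page += 1
    return ids

def fetch_vacancy(vacancy_id):
    # the list contains only short descriptions, so each vacancy is fetched separately
    return requests.get('%s/%s' % (API_URL, vacancy_id)).json()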
Separately, I want to note that at first I tried to store vacancies in MySQL. But my server has a very limited amount of resources, there is no way to keep everything in memory while building recommendations, and unloading everything from MySQL is not fast. So I had to fetch the data in parts, and building recommendations took about an hour. Then I decided to look for an in-memory store for the vacancies. The choice fell on Redis because of its Python support, ease of installation and use, support for data structures, and state persistence across restarts.

Vacancy vectorization


Clearly, there is no point in keeping all of the vacancy data. So, while downloading, we convert each vacancy into a vector of parameters. By parameters I mean the words that occur across different vacancies. Each document (vacancy) corresponds to a vector of the same length. Each element of the vector corresponds to a specific word and holds a value: the weight that word has in the current document.

If you take all the words contained in all the vacancies, the list is far too long. What matters is to find a limited list of words that characterizes all the documents as well as possible. These words will make up the dictionary.

Vacancies belong to certain professional areas. I figured it makes sense to split the vacancies by professional area and extract the most important words for each one. To build the dictionary, I downloaded about 113,000 vacancies. The same word can occur in several word forms, and it would be good to represent them as a single word. For this we use stemming: finding the stem of a word. In Python there is a good implementation (PyStemmer) that supports Russian.
import Stemmer

stemmer = Stemmer.Stemmer('russian')
# illustrative word forms (not the original examples): two forms of the same
# Russian word reduce to one stem
print stemmer.stemWord('вакансия')
print stemmer.stemWord('вакансии')

After stemming, I split all the documents into groups corresponding to professional areas. If a vacancy belongs to several professional areas, it naturally ends up in several groups. Each document within each group is converted into a vector. sklearn's CountVectorizer helps us here. It takes a list of documents as input, collects all the words from the list, and counts how many times each word occurs in a particular document. That is the vector.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['aa bb cc', 'bb bb dd']
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
print X.toarray()
# [[1 1 1 0]
#  [0 2 0 1]]

Some words occur in too many documents and carry little meaning. Others, on the contrary, are less common but describe one document or a few documents well. To compensate for this, TF-IDF is computed for each group of vectors. With this weighting, the weight of a word is proportional to how often it is used in the document and inversely proportional to how often it is used in the other documents of the collection. sklearn has a TfidfTransformer for computing this measure. It takes the vectors produced by CountVectorizer as input and returns recalculated vectors of the same dimension.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

corpus = ['aa bb cc', 'bb bb dd']
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)

transformer = TfidfTransformer()
X_tfidf = transformer.fit_transform(X)
print X_tfidf.toarray()
# [[ 0.6316672   0.44943642  0.6316672   0.        ]
#  [ 0.          0.81818021  0.          0.57496187]]

After computing TF-IDF for the documents of each group, we take, within each group, the arithmetic mean of every parameter across the vectors. We then find a certain number of parameters with the highest values and keep the words corresponding to them. These are the most important words for a given specialization. I kept 350 words per specialization in order to end up with a dictionary of roughly 10,000 words. Each vacancy will be characterized by a vector of this length. Here is the complete code for building the dictionary. The document describing a vacancy was composed of the words from its title, main description and key skills.
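The full code is linked above; roughly, the key step could look like this sketch. It is illustrative, not the author's code: only the 350-word figure and the mean-TF-IDF idea come from the description.

# Sketch: pick the most important words of one professional area by mean TF-IDF.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

TOP_WORDS = 350

def top_words_for_group(group_documents):
    vectorizer = CountVectorizer(min_df=1)
    counts = vectorizer.fit_transform(group_documents)
    tfidf = TfidfTransformer().fit_transform(counts)
    mean_weights = np.asarray(tfidf.mean(axis=0)).ravel()   # average weight of each word
    top_indices = mean_weights.argsort()[::-1][:TOP_WORDS]
    words = np.array(vectorizer.get_feature_names())        # get_feature_names_out() in newer sklearn
    return set(words[top_indices])

# the dictionary is the union of the top words over all professional areas:
# dictionary = sorted(set.union(*(top_words_for_group(docs) for docs in groups)))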

Now we have a dictionary, and with it every vacancy can be turned, at save time, into a vector over exactly these words. To do this, we create a new CountVectorizer parameterized with the dictionary.
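A sketch of what this looks like; the tiny dictionary and vacancy text here are purely illustrative.

from sklearn.feature_extraction.text import CountVectorizer

# 'dictionary' stands in for the ~10,000 words selected above
dictionary = ['python', 'sql', 'linux']
vectorizer = CountVectorizer(vocabulary=dictionary)

vacancy_text = 'senior python developer, sql, python'
counts = vectorizer.transform([vacancy_text])   # no fit() needed: the vocabulary is fixed
print(counts.toarray())                         # [[2 1 0]]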

Keeping the data up to date


For each vacancy whose data we want to keep, we stem the text, run it through the CountVectorizer and TfidfTransformer, and save the result to Redis. When I started saving the vacancy vectors in Redis, I ran into the problem of not having enough RAM. I use vacancies from the last 5 days for recommendations, which is about 130,000, and for each of them a vector of 10,000 elements has to be stored. That came out to about 7.5 GB, and my server does not have that much RAM. Then I realized that, since I store the data as JSON and the vectors are very sparse, they should compress extremely well. So before saving I compress them with zlib. As a result, the same data takes up about 250 MB.
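A quick illustration of why the compression helps so much: a mostly-zero vector serialized to JSON is extremely repetitive. The numbers below are just an example on synthetic data.

import json
import zlib

# a mostly-zero 10,000-element vector, like a vacancy vector
vector = [0.0] * 10000
vector[42] = 0.7312

raw = json.dumps(vector)
packed = zlib.compress(raw.encode('utf-8'))
print('raw: %d bytes, compressed: %d bytes' % (len(raw), len(packed)))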

Separately, I want to mention a couple of nice Redis features:
  1. Saved entries can be given a TTL, after which they are deleted automatically, so there is no need to worry about freeing up space.
  2. A single key can be mapped to a hash. So, alongside the vector, I store the vacancy's region and salary.

In "Python" saving in Redis looks like this:
import redis
import json

r = redis.StrictRedis(host='localhost', port=6379, db=0)
timeout = 5 * 24 * 60 * 60          # keep a vacancy for 5 days

data = {}
data['features'] = json.dumps(vector).encode("zlib")   # compressed vector (Python 2 zlib codec)
data['salary'] = salary
data['area'] = area_id
r.hmset(vacancy_id, data)           # one hash per vacancy
r.expire(vacancy_id, timeout)       # TTL: the entry is removed automatically

For those who want to try Redis, there are installation instructions; it is up and running in a couple of minutes.

Recommendation system


When a user logs in to the site, their resume is saved in the application. To compare a resume with vacancies, it also has to be converted into a vector of the same dimension and with the same order of parameters. To build the vector, I took the text from the headline, the key skills and the "About me" field. I had also saved the CountVectorizer with the dictionary and the TfidfTransformer trained on the vacancy data downloaded at the very beginning. With them, it is easy to obtain vectors for resumes.
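In code this is just the transform path of the already-fitted objects. The file names below are an assumption about how they were persisted, and the resume text is illustrative.

# Sketch: vectorize a resume with the CountVectorizer (fixed dictionary) and the
# TfidfTransformer fitted on the downloaded vacancies. File names are assumed.
import pickle

with open('count_vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
with open('tfidf_transformer.pkl', 'rb') as f:
    transformer = pickle.load(f)

# the text is built from the headline, key skills and the "About me" field
resume_text = 'python developer machine learning sklearn redis'
resume_vector = transformer.transform(vectorizer.transform([resume_text]))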

To build recommendations, we find the vacancy vectors most similar to the resume vector. I used cosine distance as the similarity measure; sklearn has a ready-made implementation.
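A minimal sketch of that step; random matrices stand in for the real resume and vacancy vectors, and the top-50 cutoff is purely illustrative.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# resume_vector: (1, n_words); vacancy_matrix: (n_vacancies, n_words).
n_words = 10000
resume_vector = np.random.rand(1, n_words)
vacancy_matrix = np.random.rand(500, n_words)

similarities = cosine_similarity(resume_vector, vacancy_matrix).ravel()
top = similarities.argsort()[::-1][:50]   # indices of the 50 most similar vacancies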

For each resume we keep a list of the most similar vacancies.

We still have to take into account things like salary and region. So, for each resume, we exclude vacancies from unsuitable regions and vacancies whose salary does not fit the desired range. Often a vacancy does not state the salary at all: 31% of the 113,000 saved vacancies had no salary. I decided that such vacancies should still be recommended.
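The filtering can be sketched roughly like this; the field names mirror the Redis record above, and the salary comparison is deliberately simplified.

def is_acceptable(vacancy, resume_area, resume_salary):
    # rough sketch of the filter; field names mirror the Redis record above
    if vacancy['area'] != resume_area:
        return False                               # wrong region
    if vacancy['salary'] is None:
        return True                                # no salary given: still recommend it
    return vacancy['salary'] >= resume_salary      # simplified salary check

# illustrative data
candidates = [
    {'id': 1, 'area': 1, 'salary': 120000},
    {'id': 2, 'area': 2, 'salary': 150000},   # wrong region
    {'id': 3, 'area': 1, 'salary': None},     # salary not specified
]
print([v['id'] for v in candidates if is_acceptable(v, 1, 100000)])   # [1, 3]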

The script that builds the recommendations runs periodically via cron, which means you have to wait a bit before the recommendations for your resume appear.

Site


And here is the result of what came out of all this. Try it. If anyone is interested in the source code, it is here.

What are the weak points of my approach? First, the lack of data in the resume, either because it is too sparse or too specific. The probability of a good recommendation is also lower for the regions. The quality could be improved by using professional areas and previous work experience. If this were a popular system, one could collect users' feedback on the quality of the recommendations and use it for further suggestions. If resumes have responses and invitations, that data could also improve relevance. It would also be great to match not only identical words but related ones as well; word2vec would help with that task. But in any case, this is only a pilot version.

So, I wrote a system that recommends vacancies based on information taken from a resume. All the data was obtained through the HeadHunter API. Use the API if you feel like building your own service or mobile application around HR topics. If you run into problems or missing functionality, contact us in the issues.

UPD: Looking at the finished recommendations, I realized that results with a low similarity coefficient should not have been included. It also makes sense to exclude results where the vector has too few parameters.

UPD2: I added data from the most recent work experience to the resume vector. The results should now be better for those who do not have enough data in the other fields. But I don't like this approach very much, because the previous experience may be something the applicant is no longer interested in.

Source: https://habr.com/ru/post/303710/

