
The company has more similar vacancies

On March 2, I gave a talk at a Data Science Meetup held in our office. I talked about our experience of creating an algorithm for collapsing similar vacancies in search results. At the link you can find a report on the last meetup; recordings of the talks and links to the presentations will also be available there. For those who prefer to take in information in text form, I wrote this article.


We ran into a problem: job-search results were getting filled with identical vacancies from the same employer. For example, for the query "driver", a visitor could get 30–40 copies of the same vacancy for the same position.




Where do such vacancies come from? From large chain companies, of which there are many on Superjob. As a rule, they hire a lot of staff for the same positions. Recruiters also "multiply" vacancies deliberately, hoping that by covering more search queries they will find the right candidate faster.


As a result, the job seeker has to flip through entire pages of similar vacancies, while vacancies that might actually interest them are quickly pushed off the first pages of the results (which, incidentally, also causes dissatisfaction among the employers who posted them).


The solution is to remove all duplicates from the search results and leave only one "driver" vacancy.


This would solve the job seeker's problem and save us from customer complaints. How can this be achieved?



Let us examine each item in more detail.


What is a vacancy?



A vacancy is semi-structured text whose parts include a title, requirements, duties, and so on, plus attributes such as salary, work address, and work schedule.


Superjob had previously implemented algorithms for identifying duplicates based on full or partial matches of attributes or other parts of a vacancy, but none of them gave a good enough result.


This time we decided to try slightly more advanced algorithms to evaluate the lexical similarity of vacancies.


LSH


Locality-sensitive hashing (LSH) functions are hash functions designed to maximize the probability of collisions for similar arguments. A minor change in the argument (for example, in the vacancy text) should lead to only a minor change in the hash value.


Two of the most frequently used LSH functions are SimHash and MinHash. These hashing algorithms convert text into a list of values that together form the signature of the text.


In the case of SimHash, the list of values is simply a list of bits (0 or 1).


In the case of MinHash, it is also a list of values, but each value is the minimum, taken over all words of the document, of a given hash function. The number of hash functions is chosen based on the acceptable duplicate-detection error.


The main difference between the two algorithms is the probability of collisions: for SimHash it equals the cosine similarity of the word-frequency vectors of the documents, and for MinHash it equals their Jaccard similarity.


I have also come across the opinion that MinHash is better at detecting copy-paste, while SimHash is better at detecting plagiarism.


We decided that cosine similarity better fits our requirements for measuring the similarity of vacancies, and opted for the SimHash algorithm.


A description of the SimHash algorithm:


  1. Choose the simhash size (the hash length in bits).
  2. Create an array of integers filled with zeros, with length equal to the hash length in bits.
     hash = [0,0,0,0,0,0,0,0] 
  3. Split the source document into words and compute the hash of each word with any hash function (md5, sha1) that returns a result of the same length.
     doc = ["","",""] → word hashes: 01111001, 00110101, 00101110 
  4. For each bit of each word hash, increase the corresponding array element by one if that bit is 1, and decrease it otherwise.
     01111001, 00110101, 00101110 → hash = [-3,-1,3,1,1,1,-1,1] 
  5. Build the result from this array: if an array element is greater than zero, the corresponding simhash bit is set to one, otherwise to zero.
     hash = [-3,-1,3,1,1,1,-1,1] → bits [0,0,1,1,1,1,0,1] → simhash(doc) = 00111101 
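These steps translate almost directly into code. A minimal sketch in Python; the 8-bit size matches the toy example above (real implementations typically use 64 bits), and md5 stands in for "any hash function":

```python
import hashlib


def simhash(text: str, bits: int = 8) -> int:
    """Compute a SimHash of the given size, following the five steps above."""
    counts = [0] * bits  # step 2: array of zeros, one counter per bit
    for word in text.split():  # step 3: split the document into words
        # hash each word and keep only the lowest `bits` bits of the digest
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):  # step 4: +1 for a 1-bit, -1 for a 0-bit
            counts[i] += 1 if (h >> i) & 1 else -1
    # step 5: positive counter -> simhash bit is 1, otherwise 0
    result = 0
    for i, c in enumerate(counts):
        if c > 0:
            result |= 1 << i
    return result
```

Since the counters aggregate all words, changing one word in a long text flips only the few bits whose counters were near zero, which is exactly the locality-sensitive property we need.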

What can we do now? We can estimate how similar two documents are using only their signatures.


To do this, it is enough to XOR them: the number of set bits in the result is the number of positions in which the signatures differ.


  00111101 XOR 00111000 = 00000101 

A numerical measure of the similarity of two vacancies is the ratio of matching bits to the simhash length. For the simhashes above it is:


Similarity index = 6/8 = 0.75 (75%)
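This calculation is a one-liner. A sketch with the hashes from the example above; the function name is mine:

```python
def similarity(h1: int, h2: int, bits: int = 8) -> float:
    """Share of matching bit positions between two simhashes."""
    differing = bin(h1 ^ h2).count("1")  # XOR marks the differing positions
    return (bits - differing) / bits


print(similarity(0b00111101, 0b00111000))  # 0.75, matching the figure above
```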


Thus, we have learned to evaluate the similarity of two vacancies. Now we can compute hashes for all vacancies and, for each one, find similar hashes, say, differing by no more than 2 bits.


For example, suppose we find the following hashes similar to h1:


 h1 = 00111101 h2 = 00111000 h3 = 00001101 

then we will see that h2 differs from h3 by more than 2 bits.


That is, hashes within a group may differ from each other by more than we would like. It also means a hash can fall into several groups, so the groups will overlap.
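This non-transitivity is easy to check directly with the hashes from the example:

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


h1, h2, h3 = 0b00111101, 0b00111000, 0b00001101

print(hamming(h1, h2))  # 2: h2 is "similar" to h1
print(hamming(h1, h3))  # 2: h3 is "similar" to h1 as well
print(hamming(h2, h3))  # 4: but h2 and h3 are not similar to each other
```

So a naive "within 2 bits of some hash" grouping produces overlapping groups, which is why we need proper clustering.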


Therefore, we define what properties groups of similar vacancies should have:


  1. all vacancies in a group must differ from each other by no more than a given amount;
  2. adding, changing, or deleting a vacancy must not lead to a complete change in the composition of the groups (there is no "main" vacancy);
  3. the number of groups is not known in advance.

Clustering


Hierarchical clustering with complete linkage (the "farthest neighbor" method) satisfies all three requirements.



First, we do not need to worry about picking the optimal number of clusters in advance. Second, the complete-linkage method forms spherical clusters of bounded radius, all of whose points lie within a given distance of each other. Third, adding or deleting a point does not cause clusters to merge or fall apart.


There is one drawback: the O(n^3) complexity of hierarchical clustering. But since we cluster the vacancies of a single client, we can afford to ignore it, hoping that the number of vacancies per client stays within reasonable limits.
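The clustering step can be sketched with SciPy. The complete-linkage choice and the Hamming distance between signatures follow the text; the toy hash values and the specific SciPy calls are my own illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

BITS = 8
hashes = [0b00111101, 0b00111000, 0b00001101, 0b11000010]

# Unpack each simhash into a 0/1 vector so that pdist can compute
# the normalized Hamming distance between signatures.
vectors = np.array([[(h >> i) & 1 for i in range(BITS)] for h in hashes])
distances = pdist(vectors, metric="hamming")

# Complete linkage ("farthest neighbor") bounds the diameter of every
# cluster, which is exactly property 1 from the list above.
Z = linkage(distances, method="complete")

# Cut at 30% dissimilarity: any two vacancies in one cluster differ
# in at most ~30% of signature bits.
labels = fcluster(Z, t=0.3, criterion="distance")
print(labels)
```

With these four hashes the cut yields three clusters: two of the "similar" hashes group together, while the distant one stays alone.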


The output is a dendrogram, that is, a tree of nested clusters without intersections.



How can we obtain clusters of similar vacancies with a given degree of similarity from it?



By cutting the dendrogram, we get N flat clusters that satisfy our requirements. We only need to determine where it is best to cut.


To do this, we can apply the standard technique of maximizing the second derivative of the function relating the number of clusters to the cut height of the dendrogram.


The maximum of the second derivative points to the optimal cut height, since at that point the rate of change of the number of clusters slows down the fastest.



If we decide to cluster vacancies that are 80% ± 5% similar to each other, it is enough to find the maximum of the second derivative in this range and split the dendrogram into clusters at that point.
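A sketch of this cut-selection step. The helper name and the toy hashes are mine; 80% ± 5% similarity corresponds to cut heights between 0.15 and 0.25 in normalized Hamming distance:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def best_cut(Z, low, high, steps=50):
    """Cut height in [low, high] where the second derivative of the
    cluster-count curve is largest, i.e. where the decrease in the
    number of clusters slows down the fastest."""
    heights = np.linspace(low, high, steps)
    # fcluster labels clusters 1..k, so max(labels) = number of clusters
    counts = np.array(
        [fcluster(Z, t=h, criterion="distance").max() for h in heights],
        dtype=float,
    )
    second = np.gradient(np.gradient(counts, heights), heights)
    return heights[np.argmax(second)]


BITS = 8
hashes = [0b00111101, 0b00111000, 0b00001101, 0b11110000, 0b11110010]
vectors = np.array([[(h >> i) & 1 for i in range(BITS)] for h in hashes])
Z = linkage(pdist(vectors, metric="hamming"), method="complete")

cut = best_cut(Z, 0.15, 0.25)  # search within the 80% ± 5% window
```

The discrete second derivative via `np.gradient` is one simple way to locate the knee of the curve; any knee-detection method would do.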


Implementation


When a vacancy is created or changed, we send a task to a RabbitMQ queue for simhash calculation. After that, in another queue, we run the task of clustering the client's vacancies. Each vacancy is assigned the identifier of the cluster it falls into. A full cycle of computing simhashes and clustering the vacancies of all clients takes 5 minutes from start to finish.


The cluster ID is passed to Sphinx as a vacancy attribute, by which the search results are grouped. Only the vacancy with the best ranking-function value in its cluster makes it into the results. If there is more than one vacancy in the cluster, we place a link next to it that shows the rest of the vacancies from the same cluster.


Ultimately, we achieved the desired result.



Results


The resulting algorithm is, of course, not perfect: similar vacancies do not always end up in their groups. But we ran A/B tests on many metrics and got good results:


- job seekers began to respond more often, to vacancies from a larger number of clients, and more often from the first page of the search results;
- the average and average-maximum depth of viewing the search results decreased.


We also found that in 75% of searches similar vacancies can be grouped, and in 66% of searches users actually saw grouped similar vacancies.



Source: https://habr.com/ru/post/325014/

