
Analysis of the relationships between skills using graphs in R

Interestingly, an area such as professional development has stayed somewhat away from the hype around data science. HRtech startups are only starting to gain momentum and grow their market share, replacing the traditional approach to working with professionals and with those who want to become one.


The scope of HRtech is very diverse: it includes automation of hiring, development and coaching, automation of internal HR procedures, tracking of market salaries, tracking of candidates and employees, and much more. This study uses data analysis methods to answer the questions of how skills are interrelated, what specializations exist, which skills are more popular, and which skills are worth learning next.


Problem statement and input data


Initially, I did not want to divide the skills according to some well-known classification. For example, skills could be split into "expensive" and "cheap" ones by average salary. Instead, we wanted to identify "specializations" mathematically and statistically, based on market requirements, i.e. on what employers ask for. So in this study the task was one of unsupervised learning: to group skills together. As the first profession we chose the programmer.


For the analysis we took data from the Work in Russia portal, available on data.gov.ru. It contains all the vacancies published on the portal, with description, salary, region and other details. We then parsed the descriptions and extracted the skills; that is a separate study which this article does not cover. However, pre-labeled data can also be obtained from the hh.ru API.


Thus, the input data is a matrix of 0/1 values, in which the features (X) are skills and the objects are vacancies: just 164 features and 841 objects.
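The snippets below assume this matrix is stored in a data frame named skills_clust (the name is taken from the code further on; the skill names here are made-up placeholders):

 # assumed shape of skills_clust: 841 vacancy rows x 164 skill columns of 0/1
 #          SQL  Git  R   ...
 # vac_1      1    1  0
 # vac_2      0    1  1
 # ...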


Choosing a method for finding skill groups


When choosing a method, we started from the assumption that a single vacancy can contain several specializations. We also assumed that a given skill can belong to only one specialization. Looking ahead: this assumption was required by the algorithms that use the results of this study.


Attacking the problem head-on, one might assume that if one group of skills occurs in one group of vacancies, and another group of skills in another group of vacancies, then each such group of skills is a specialization, and the skills can be separated with metric methods (k-means and its modifications). The problem, however, was that one vacancy can contain several specializations. In the end, no matter how we tuned the algorithm, it assigned 90% of the skills to a single cluster, plus about a dozen clusters of 1-2 skills each. After some thought, we started rewriting k-means so that instead of the classical Euclidean distance it would use a measure of skill adjacency, that is, the frequency with which two skills occur together:


 library(data.table)

 # all ordered pairs of distinct skills
 grid <- as.data.table(expand.grid(skill_1 = names(skills_clust),
                                   skill_2 = names(skills_clust)))
 grid <- grid[grid$skill_1 != grid$skill_2, ]

 # for each pair, count the vacancies in which both skills occur
 for (i in c(1:nrow(grid))) {
   grid$value[i] <- sum(skills_clust[, grid$skill_1[i]] * skills_clust[, grid$skill_2[i]])
 }
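As an aside (our own observation, not part of the original study), the same co-occurrence counts can be obtained in one step, without the loop, via a matrix cross-product:

 # for a 0/1 matrix M, cell (i, j) of t(M) %*% M counts the vacancies
 # in which skills i and j occur together
 cooc <- crossprod(as.matrix(skills_clust))
 diag(cooc) <- 0  # ignore skill-with-itself pairs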

But, fortunately, the right idea arrived in time: to frame the task as one of finding communities in a graph. It was time to recall graph theory, safely forgotten after the second year of university.


Construction and analysis of the skills graph


To build the graph we will use the igraph package (it also exists for Python and C/C++). First we create an adjacency matrix from the grid table we computed for k-means above, and then normalize the skill adjacencies to the range from 0 to 1:


 grid_clean <- grid[grid$value > 1, ]  # drop pairs that occur together <= 1 time
 grid_cast <- dcast(grid_clean, skill_1 ~ skill_2)
 grid_cast[, skill_1 := NULL]

 # normalize each skill's adjacencies by that skill's total occurrence
 grid_cast_norm <- grid_cast / colSums(grid_cast, na.rm = T)
 grid_cast_norm[is.na(grid_cast_norm)] <- 0
 grid_cast_norm <- as.matrix(grid_cast_norm)
 grid_cast_norm[grid_cast_norm <= 0.02] <- 0  # drop links <= 2% of the skill's total occurrence

The adjacency of skill i and skill j is normalized as a fraction of the total occurrence of skill i. Initially we normalized the matrix by the maximum occurrence across all skills, but then switched to this formula. The idea: suppose skill i co-occurs with skill j 10 times and with no other skill. Such a relationship should be treated as highly significant (say, 100%), whereas normalizing by the maximum occurrence in the matrix (say, 100) would rate it at only 10%.
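In formula form (the notation is ours, not from the original code): if n_ij is the number of vacancies in which skills i and j occur together, the edge weight is

 w_ij = n_ij / Σ_k n_ik,

so the weights of all edges attached to skill i sum to one (before the cleaning steps described below).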


Also, to clear the adjacency matrix of random relationships, we removed pairs that occurred together fewer than 2 times or made up less than 2% of the skill's total occurrence. Unfortunately, this reduced the skill set from 164 to 87; on the other hand, it made the segments more logical and understandable.


Then we create an undirected weighted graph from the adjacency matrix:


 library(igraph)
 library(RColorBrewer)

 skills_graph <- graph_from_adjacency_matrix(grid_cast_norm,
                                             mode = "undirected", weighted = T)
 # edge width in the plot is proportional to the normalized adjacency
 E(skills_graph)$width <- E(skills_graph)$weight

 plot(skills_graph, vertex.size = 7, vertex.label.cex = 0.8,
      layout = layout.auto, vertex.label.color = "black",
      vertex.color = brewer.pal(8, "Set3")[1])  # one color for now: communities are not assigned yet

[Figure: the skills graph before community detection (skills_no_cluster)]

igraph also allows us to calculate the main statistics for the vertices:


 closeness(skills_graph)    # how close a vertex is to all the others
 betweenness(skills_graph)  # how often a vertex lies on shortest paths between other vertices
 degree(skills_graph)       # the number of edges incident to a vertex
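For example (our own illustration), the ten most connected skills can be listed by sorting the vertex degrees:

 sort(degree(skills_graph), decreasing = TRUE)[1:10]  # ten highest-degree skills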

Next, we can output an adjacency list for each skill. This list can later become the basis of a recommendation system for choosing which skills to learn next:


 get.adjlist(skills_graph) 
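As a minimal sketch of such a recommender (our own illustration; the function and the skill names in the example call are hypothetical): take the skills a person already has, collect their neighbors in the graph, and rank the candidates by how many known skills they are connected to:

 recommend_skills <- function(known, n = 5) {
   # names of all skills adjacent to the known ones
   cand <- unlist(lapply(known, function(s) neighbors(skills_graph, s)$name))
   cand <- cand[!cand %in% known]  # drop skills already known
   head(names(sort(table(cand), decreasing = TRUE)), n)
 }
 recommend_skills(c("SQL", "Git"))  # hypothetical skill names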

Community detection with the Multilevel method


On the topic of finding communities in graphs there is an excellent work by Konstantin Slavnov. It lays out the main quality metrics for community detection, the detection methods themselves, and ways to aggregate the results of those methods.


When the true division into communities is unknown, the value of the modularity functional is used to assess quality. It is the most popular and generally accepted quality measure for this task. The functional was proposed by Newman and Girvan while developing a graph clustering algorithm. In simple terms, this metric compares the density of connections within communities to the density of connections between them. Its main drawback is that it does not see small communities. For the task of identifying specializations, where a community may consist of 2-3 vertices, this problem can become critical; however, it can be overcome by adding an extra scale parameter to the optimized functional.
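For reference, the Newman-Girvan modularity of a partition is

 Q = (1 / 2m) * Σ_ij [ A_ij − k_i k_j / 2m ] * δ(c_i, c_j),

where A is the (weighted) adjacency matrix, k_i is the degree of vertex i, m is the total number (or weight) of edges, and δ(c_i, c_j) equals 1 when vertices i and j fall into the same community and 0 otherwise.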


To optimize the modularity functional, the Multilevel (Louvain) algorithm is most often used: firstly, because of the good quality of its optimization; secondly, because of its speed (not required for this task, but still); thirdly, because the algorithm is quite intuitive.


On our data, this algorithm also showed one of the best results:


 N   Algorithm          Modularity   Number of communities
 1   Betweenness        0.22         36
 2   Fastgreedy         0.314        8
 3   Multilevel         0.331        8
 4   LabelPropogation   0.257        15
 5   Walktrap           0.316        10
 6   Infomap            0.315        13
 7   Eigenvector        0.348        8
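For the curious, here is a sketch (our own, not from the original article) of how such a comparison can be reproduced with igraph's built-in community detection functions; exact numbers will vary with the igraph version and the randomness of some methods:

 # map the table rows to igraph's community detection functions
 methods <- list(
   Betweenness      = cluster_edge_betweenness,
   Fastgreedy       = cluster_fast_greedy,
   Multilevel       = cluster_louvain,
   LabelPropogation = cluster_label_prop,
   Walktrap         = cluster_walktrap,
   Infomap          = cluster_infomap,
   Eigenvector      = cluster_leading_eigen
 )
 # fit each method and collect its modularity and community count
 comparison <- t(sapply(methods, function(f) {
   fit <- f(skills_graph)
   c(modularity = modularity(fit), communities = length(fit))
 }))
 comparison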

In the igraph package, the Multilevel algorithm is implemented by the cluster_louvain() function:


 fit_cluster <- cluster_louvain(skills_graph)
 V(skills_graph)$group <- fit_cluster$membership  # community label for each skill
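The clustered picture below can then be drawn by coloring each vertex with its community's color (a sketch along the lines of the earlier plot call):

 plot(skills_graph, vertex.size = 7, vertex.label.cex = 0.8,
      layout = layout.auto, vertex.label.color = "black",
      vertex.color = brewer.pal(max(V(skills_graph)$group), "Set3")[V(skills_graph)$group])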

Results


[Figure: the skills graph with vertices colored by community (skills_cluster)]


As we can see, we managed to identify 8 specializations (the names were assigned subjectively by the author of the article).



The specializations are closely interrelated. This follows from our basic assumption that a single vacancy may contain several specializations. For example, the "Mobile Application Developer" specialization requires "Knowledge of network protocols" (71), which in turn is connected with "Local Area Network Administration" (32) from the "Server and Network Maintenance" specialization.


You should also keep in mind that the data source is the Work in Russia portal, whose sample of vacancies differs from hh.ru or superjob.ru: it is biased towards lower-qualification vacancies. In addition, the sample is limited to 841 vacancies (of which only 585 had any skills labeled), so a large number of skills were not analyzed and did not make it into any specialization.


Nevertheless, on the whole the proposed approach gives fairly logical results and, for professions that can be quantified (naturally, the skills of a top manager cannot be separated into specializations), it answers the questions of how skills are interrelated, what specializations exist, which skills are more popular, and which skills are worth learning next.


And a bonus for everyone who read to the end: at the link you can play with an interactive visualization of the graph and download the summary table with the results of the study.



Source: https://habr.com/ru/post/328868/

