
How we participated in an HR hackathon. Our graduates share their solutions and impressions of taking part.

Hello!


On November 23-24, a digital hackathon on data analysis in the HR sphere took place at Digital October, and a team of graduates of our Big Data Specialist program won it. Kirill Danilyuk, Igor Parfenov, Egor Andreev, and Alexander Ivanovochkin share their solutions and impressions of taking part.



Context


On November 23-24, a digital hackathon on HR data analysis took place at Digital October. We organized ourselves into a team (ai.at.work) and decided to take on the HR tasks. None of us had ever been to a hackathon before, so the interest was twofold.


The post turned out to be long, so let's not get distracted by extra details and get right down to business :) By the way, we won the competition. So, we were offered the following situation: there is a company, "Chinikoff", that sells home appliances. Chinikoff is a large retail chain present in many regions, with more than 190 retail stores. The company is well known, has been on the market for a long time (more than 20 years), has more than 2000 employees, and has a well-developed system of personnel training. The company trains its employees both in person and online: it has a corporate portal with a library, self-assessment tests, and webinars.


Obviously, one of the company's competitive advantages is qualified employees who are motivated to develop, including the salespeople in its retail stores. Therefore, we have two tasks:


  1. Given the available data, find the best way to train employees.
  2. Find some interesting patterns in the data, i.e., just “figure out” something useful from the dataset.

Data


Unfortunately, the company did not allow us to publish the data that was provided to us, so we will have to describe the entire structure in words.


The “dataset” provided to us is just two Excel files, each containing about a dozen tabs describing employees' personal data, the courses they have taken, their success in terms of sales and plan fulfillment, employee survey data, and so on. The main characteristic of the source data is their low quality. For Kaggle competitors the situation is unusual: there is no preprocessing, and the quality of the source tables is low, sometimes beyond all limits. We got data with just about every possible quality problem.



However, one should not be afraid of such “dirty” and noisy data; it has to be worked with, preprocessed, and transformed. The main task we set for ourselves (and which took up most of the hackathon time) was to assemble a real dataset from what we had and use it for further work. This, in fact, was our advantage over the other teams: when the data are clean and well aggregated, further analysis becomes much easier.
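Purely as an illustration, a minimal sketch of that assembly step might look like the following (the file name, sheet names, and the employee_id key are hypothetical, since the real workbooks cannot be published):

import pandas as pd

# Load every tab of one of the (hypothetical) source workbooks.
xls = pd.ExcelFile("hr_data_part1.xlsx")
sheets = {name: pd.read_excel(xls, sheet_name=name) for name in xls.sheet_names}

# Normalize column names so the tabs can be joined reliably.
for df in sheets.values():
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]

# Assemble one flat employee table (sheet and key names are assumptions).
profiles = (
    sheets["employees"]
    .merge(sheets["courses"], on="employee_id", how="left")
    .fillna(0)
)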


The second difficulty was that the source data were very modest in size. We treated this as a challenge too, because the quality of the final solution strongly depends on the amount of data that can be used. On the other hand, we were not required to deliver a production-ready solution; we only had to demonstrate a working prototype.


As a result, we collected two datasets:



Solving problem #1


We approached the problem (choosing the best way to train each employee) from a practical point of view: our solution should use existing materials, work across the entire company (not just for salespeople), and not require manual coding of categories.


Obviously, a recommender system based on collaborative filtering fits all of this well, and we implemented a prototype of one. Such systems are ubiquitous now; we didn't do anything new, but it was interesting to assemble one from the available data.


All data processing was carried out using standard Python tools: Jupyter, numpy, pandas, sklearn. A few important, though obvious, things:




The next task is to define a measure of similarity between employees: we recommend similar courses to similar employees. In recommender-system terms, this is user-user collaborative filtering. The easiest way is to use the cosine measure. We represent each employee as a vector in a 173-dimensional space and, for every pair of employees, compute the cosine of the angle between their vectors:
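For reference, the cosine similarity between two employee vectors $u$ and $v$ is

$$\cos(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} = \frac{\sum_{i=1}^{173} u_i v_i}{\sqrt{\sum_{i=1}^{173} u_i^2}\,\sqrt{\sum_{i=1}^{173} v_i^2}}$$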



In fact, knowing the scalar product and the norms of the vectors, we can calculate how “similar” two employees are. In sklearn this boils down to a single call:


from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(df_profiles.iloc[:, 1:])


We get an N x N matrix, where N is the number of employees, and each value in the matrix is a measure of similarity between two employees. We wanted to add a non-linearity to this measure in order to emphasize the similarities and differences between employees more strongly:


def norm_cos(x):
    if x < 0.25:
        return x / 10
    if x < 0.5:
        return x / 5
    if x < 0.75:
        return x / 2
    return x
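Applied element-wise to the similarity matrix, this could look as follows (a sketch; `similarity` is the matrix computed by cosine_similarity above, and np.vectorize is just one convenient way to apply the function):

import numpy as np

# Apply the non-linearity to every entry of the N x N similarity matrix.
norm_similarity = np.vectorize(norm_cos)(similarity)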

Next we took up the dataset with courses. Here we needed to normalize the course ratings using baseline predictors for employees (users) and courses (items). The “rating” of a course is simply the number of times each employee completed the course (or attended a webinar, or viewed an electronic document on the portal). By computing the baseline predictors, we can remove the rating bias. For example, if one employee takes a lot of courses, his ratings will dominate and strongly distort the recommendations. Such distortions are well captured and reduced by baseline predictors.
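A minimal sketch of this normalization with baseline predictors (the `ratings` matrix below is a random stand-in for the real employees x courses matrix of completion counts):

import numpy as np

# Stand-in for the real employees x courses matrix of completion counts.
ratings = np.random.poisson(1.0, size=(2000, 120)).astype(float)

mu = ratings.mean()                                 # global mean rating
b_user = ratings.mean(axis=1, keepdims=True) - mu   # per-employee bias
b_item = ratings.mean(axis=0, keepdims=True) - mu   # per-course bias

# Subtracting the baseline keeps prolific course-takers from dominating.
ratings_norm = ratings - (mu + b_user + b_item)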


Having obtained the two matrices (the employee similarity matrix and the course rating matrix normalized by baseline predictors), we can multiply them and thus get a score for each course (item) for each employee (user). To recommend courses, we take the required number of courses with the highest scores:
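A sketch of this final step, reusing the names from the snippets above:

import numpy as np

# Predicted course scores: weight other employees' normalized ratings by
# how similar they are to each employee (user-user collaborative filtering).
scores = norm_similarity @ ratings_norm        # employees x courses

# Do not re-recommend courses an employee has already taken.
scores[ratings > 0] = -np.inf

# Indices of the top-5 recommended courses for every employee.
top5 = np.argsort(-scores, axis=1)[:, :5]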



As we can see, there is in fact nothing to recommend to the storekeeper, but we can suggest something to the manager.


Solving problem #2


Having assembled the employee dataset, we wanted to try to solve the second problem: dig up an interesting pattern in the data and, ideally, also turn it into a product. We formulated a hypothesis: if an employee, judging by the totality of their characteristics, behaves like an employee of a higher position (for example, a salesperson who looks like a senior salesperson), it would be interesting to detect this and signal it to their manager.


We decided to use a dimensionality reduction technique, principal component analysis (PCA), keeping two components. The PCA results are easier to visualize and can be shown to management.
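A sketch of this step (X is a random stand-in here for the numeric employee feature matrix from our assembled dataset):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: numeric employee features (a random stand-in here).
X = np.random.rand(2000, 50)

# Scale the features, then project them onto two principal components.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=2).fit_transform(X_scaled)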


After the transformation, we clustered the PCA results with K-Means. The number of clusters was chosen using the mean silhouette score of the clusters: the higher this value, the better the clusters are separated from each other.
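A sketch of the cluster-count selection (X_pca is the two-component projection from the previous step):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Try a range of cluster counts and keep the one with the best mean silhouette.
best_k, best_score = None, -1.0
for k in range(2, 16):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    score = silhouette_score(X_pca, labels)
    if score > best_score:
        best_k, best_score = k, score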



K-Means identified 10 clusters in this dataset. As you can see, cluster #3 is the farthest from all the others and clearly separable from the rest, which is why its silhouette score is higher than the others'. The red vertical line in the left chart is the average.


What does such clustering give us? It turned out that the clusters in fact correspond to job positions, plus “a bit of something else”. For example, cluster #0 is mostly sales cashiers, and cluster #5 is senior sales cashiers. But sometimes in cluster #5 we also see ordinary cashiers.


In the original dataset we had a lot of information about employees' productivity and training. So, after dimensionality reduction and clustering, this information could place two people with different positions into the same cluster.




Instead of a conclusion


What conclusions did we personally draw after the hackathon?


First, do not underestimate the time needed for data processing. Converting, cleaning, generating features, normalizing, and handling attributes with missing values took us a tremendous amount of time. Without preprocessing, a full-fledged analysis would have been problematic. We considered complaining about the source data to be unprofessional and squeezed everything we could out of it through preprocessing.


Second, we did not invent anything new. PCA, K-Means, item-item and user-user collaborative filtering: all of these techniques have long been known. PCA and K-Means work practically out of the box in sklearn. However, applied to a concrete task, they can surprise people who have not seen anything like this before.


Third, it is very important to be able to present your project. Remember that a working prototype hacked together and trained on patchy data is better than any theoretically correct solution or analysis that comes without a demonstration of a real, living product that can be touched.



Source: https://habr.com/ru/post/316616/

