Data Science Week 2017: review of the second and third days

Hi, Habr! We continue our coverage of the Data Science Week 2017 forum, held on September 12-14. Next up are the second and third days, which covered building recommender systems, analyzing Bitcoin data, and building a successful career in data.

Second day

Sberbank

The second day of Data Science Week was opened by Alexander Ulyanov, Head of Model Development at Sberbank and a graduate of the sixth cohort of the Big Data Specialist program. Alexander spoke about using the libFM library to build recommender systems for cable and Internet TV. This task is one of the lab assignments in the program, and Alexander took first place on it.

It is worth noting right away that this library is applicable not only to recommender systems but also to time series analysis. Interestingly, there is very little material about it in Russian, even though it has helped win several Kaggle competitions.
This was a classic recommendation task: given a user and a film, we want to predict the rating the user would give each film and then recommend the films with the highest predicted rating, or, alternatively, estimate the probability that the user will buy a particular film.

The main problem in building such systems is designing the feature space so that a huge amount of information about both users (personal account, social network) and films (genre, year of release, actors) fits into it. The solution is to represent each event (a user rating a film) as a single row vector.

Stacking all the vectors gives a very sparse matrix with more than 150,000 columns, which serves as the feature space. The target variable is either the rating or the final outcome (whether the film was bought or not).
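As a rough illustration of such a sparse representation, the sketch below one-hot encodes a few made-up events into the text format that libFM reads (`target index:value ...`). The user IDs, film IDs, the genre feature, and the `to_libfm_rows` helper are all hypothetical; this is not the speaker's data or code.

```python
def to_libfm_rows(events):
    """Turn (user, film, genre, target) events into libFM text rows."""
    index = {}  # maps a feature name such as "user=u1" to a column number

    def column(name):
        # Assign the next free column the first time a feature is seen.
        return index.setdefault(name, len(index))

    rows = []
    for user, film, genre, target in events:
        features = (f"user={user}", f"film={film}", f"genre={genre}")
        cols = sorted(column(name) for name in features)
        # Only non-zero columns are written, which keeps the rows short
        # even when the full matrix has hundreds of thousands of columns.
        rows.append(f"{target} " + " ".join(f"{c}:1" for c in cols))
    return rows, index


# Hypothetical events: (user id, film id, genre, bought = 1 / not bought = 0)
events = [
    ("u1", "f1", "comedy", 1),
    ("u1", "f2", "drama", 0),
    ("u2", "f1", "comedy", 1),
]
rows, index = to_libfm_rows(events)
for row in rows:
    print(row)
```

Each row lists only the columns that are set, so adding a new user, film, or attribute simply adds new column numbers without touching existing rows.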

Now to the model itself, which has two parts: a classical linear regression and factorized pairwise interactions between all features. The factorization, whose dimension is set by the parameter k, is what allows the algorithm to work with such a sparse matrix:

$\hat{y}(x) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i} + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_{i}, v_{j} \rangle x_{i} x_{j}$

where

$\langle v_{i}, v_{j} \rangle = \sum_{f=1}^{k} v_{i,f} v_{j,f}$.

Now that the model has been formalized, consider its initial version, which uses only the subscriber ID, the program ID, and whether the program was purchased. ROC-AUC was used as the metric:

fm_train.to_csv('train.libfm', header=None, index=False, sep=' ')
fm_test.to_csv('test.libfm', header=None, index=False, sep=' ')
# Remove the double quotes from the written files:
!sed -i 's/"//g' train.libfm
!sed -i 's/"//g' test.libfm
!./libFM -task c -train train.libfm -test test.libfm -method als -dim '1,1,8' -iter 200 \
    -regular '0,0,15' -init_stdev 0.1 -out prob9.txt

As a result, without any additional data about clients or TV programs, the model reached a ROC-AUC of 0.923. Adding information about films (genre and year of release) raised the metric to 0.935. Finally, using all the available information (customer data was added: the viewing time interval and whether the viewing was live or recorded) gave 0.936.

Library features:

The ability to add an unlimited number of features, thanks to the sparse vector format.
A non-linear model: it accounts not only for the features themselves but also for their interactions with each other.
Relatively fast computation: O(kn), where n is the number of features and k is a model hyperparameter that sets the dimension of the interaction vectors (about 10).
Requires the data to be specially prepared in a sparse format.
An extensive set of optimization methods, from MCMC (Markov chain Monte Carlo) to ALS (Alternating Least Squares).
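The O(kn) cost follows from a standard rewriting of the pairwise term of a factorization machine: the double sum over feature pairs equals one half of the sum over f of (sum_i v_{i,f} x_i)^2 minus sum_i (v_{i,f} x_i)^2. The sketch below, in plain Python with made-up weights, compares a direct evaluation of the model formula with this faster form. The function names and all numbers are illustrative assumptions, not libFM internals.

```python
def fm_predict_naive(x, w0, w, v):
    """Direct evaluation of the model formula: loops over every feature pair."""
    n, k = len(x), len(v[0])
    y = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(v[i][f] * v[j][f] for f in range(k))
            y += dot * x[i] * x[j]
    return y


def fm_predict_fast(x, w0, w, v):
    """Same value in O(kn); zero entries of x are skipped entirely."""
    k = len(v[0])
    nonzero = [i for i, xi in enumerate(x) if xi != 0]
    y = w0 + sum(w[i] * x[i] for i in nonzero)
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in nonzero)            # sum of v_if * x_i
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in nonzero)  # sum of squares
        y += 0.5 * (s * s - s_sq)
    return y


# Made-up parameters: n = 5 features, k = 2 latent dimensions.
x = [1.0, 0.0, 1.0, 0.0, 1.0]
w0 = 0.1
w = [0.2, -0.1, 0.05, 0.3, -0.2]
v = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.0, 0.5], [0.25, -0.3]]

naive = fm_predict_naive(x, w0, w, v)
fast = fm_predict_fast(x, w0, w, v)
print(naive, fast)
```

Because only non-zero features enter the sums, the cost per row depends on how many features are set, not on the total number of columns.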

VISA

Next, Alexander Filatov from the Analytics Department of Visa in Russia talked about how to establish a dialogue between analysts and the business, so that the business really understands what the model and the resulting recommendations are based on.

For example, a bank has a portfolio of credit cards and wants to maximize the profit from it. From a business point of view, there are three approaches to this task, and each of them can be matched with an appropriate mathematical model.

At this stage the analysts step in: they build models, test them, get results, and issue a report saying that R-squared is 90%, ROC-AUC is 0.92, ROI is 110%, and the project will pay off in less than a year. They send the report to the business, which speaks a completely different language: it sees the solution but cannot recognize it. How do you convey the problem and its solution to the business?

The easiest way is to write a story and tell it. There are three main stages in creating a story:

Introducing the characters. Any story begins by introducing its characters, so in our case we first divide all clients into groups: advanced card users (clients of more than three banks, active users), international users (travelers, foreign citizens), borrowers (holders of mortgages or car loans), and others.
Developing the characters. Once all the characters are known, we start to reveal their personalities and compare them on shared characteristics: how much income they bring, how quickly they grow, and so on. For example, in the group of active card users the share of women is 80%, so it makes sense to offer them women's goods or increased cashback. At this stage the business starts to understand which clients to work with and in which direction to move.
Analyzing the characters' influence on the final result. Finally, once the business has absorbed the story, it is time to recall that we built a model, have a forecast, and can estimate the potential profit. The easiest way to show the model's result is on a two-dimensional graph.

According to the model, it makes sense to work with the clients in the upper right part of the graph: they have the greatest value now and will bring even more income in the future. Once the story has been told, the business can see this for itself. Having got to know the clients and their behavior, and having assessed how much each group affects the company's financial result, the business accepts the analysts' models and recommendations much more readily.

Riftman

Andrei Manolov from Riftman spoke about using Apache Spark to analyze Bitcoin data. The main problem is that Bitcoin does not directly store who the senders and recipients of transfers are or what their wallet balances are. Each Bitcoin transaction has outputs, which carry values showing how many coins were sent to someone, and inputs, which carry no values: they contain only a reference to a previous transaction and an output number.

To obtain the necessary information, we brought up a Bitcoin node on a hosting service within a single network and configured it so that all data was loaded into an Apache Spark cluster. We then performed the inverse operation, unwinding the whole process backwards: from outputs with values to inputs with references to previous transactions, then back to outputs, and so on, until it is clear how many bitcoins belong to each address. Three jobs were run on Spark: extracting all transactions of interest (in our case, outputs over 10 BTC), a broadcast join, and a self join. The result is three data tables in a format convenient for analysis.
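The idea of resolving inputs against earlier outputs can be sketched in a few lines of plain Python. This is not the Spark pipeline from the talk: the transaction records, addresses, and amounts below are made up, and the dictionary lookup only stands in for what a join between inputs and outputs would do on a cluster.

```python
from collections import defaultdict

# Each transaction: id, inputs as (previous tx id, output number),
# outputs as (address, value in BTC). All values are hypothetical.
transactions = [
    {"txid": "t1", "inputs": [], "outputs": [("addr_A", 50.0)]},
    {"txid": "t2", "inputs": [("t1", 0)],
     "outputs": [("addr_B", 30.0), ("addr_A", 20.0)]},
    {"txid": "t3", "inputs": [("t2", 0)],
     "outputs": [("addr_C", 12.0), ("addr_B", 18.0)]},
]


def balances(transactions):
    """Sum the unspent outputs per address."""
    # Index every output by (txid, output number): the key an input refers to.
    outputs = {
        (tx["txid"], n): (address, value)
        for tx in transactions
        for n, (address, value) in enumerate(tx["outputs"])
    }
    # An output counts as spent once some input references it.
    spent = {ref for tx in transactions for ref in tx["inputs"]}
    result = defaultdict(float)
    for key, (address, value) in outputs.items():
        if key not in spent:
            result[address] += value
    return dict(result)


print(balances(transactions))
```

Inputs never carry an amount themselves; the amount is always recovered from the output they point to, which is why the references have to be followed backwards.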

Possible applications and further development:

Generating trading signals. The project makes it possible to detect massive inflows of funds to an exchange, an event that correlates positively with a rise in the Bitcoin exchange rate (demand for the cryptocurrency grows while supply stays constant), which can be used in trading.
Tracking the origin of funds in a wallet being legalized. The solution can also help with legalizing wallets, since Bitcoin is often used today as a means of payment for illegal purposes. For example, knowing the source of funds in the wallet of interest and having a register of "bad" wallets, you can compute the distance on the graph from the wallet being legalized to the "bad" ones.

If the wallet in question is 1-2 steps away from the “bad”, then most likely it is somehow connected with illegal transactions and cannot be legalized.
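The distance check described above can be sketched as a breadth-first search over a graph of wallets. The graph, the wallet names, and the `distance_to_bad` helper are hypothetical; the talk did not show how the distance was actually computed.

```python
from collections import deque


def distance_to_bad(graph, start, bad_wallets):
    """Breadth-first search: steps from start to the nearest bad wallet."""
    if start in bad_wallets:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        wallet, steps = queue.popleft()
        for neighbour in graph.get(wallet, ()):
            if neighbour in seen:
                continue
            if neighbour in bad_wallets:
                return steps + 1
            seen.add(neighbour)
            queue.append((neighbour, steps + 1))
    return None  # no path to any bad wallet


# Hypothetical graph: an edge w -> s means wallet w received funds from s.
graph = {
    "w_check": ["w1", "w2"],
    "w1": ["w3"],
    "w2": [],
    "w3": ["w_bad"],
    "w_clean": ["w2"],
}
bad = {"w_bad"}

for wallet in ("w3", "w_check", "w_clean"):
    print(wallet, distance_to_bad(graph, wallet, bad))
```

A result of 1 or 2 would flag the wallet under the rule above, while `None` means no chain of transfers leads to a known "bad" wallet.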

RnD Lab

The second day was closed by Cyril Danilyuk, Data Scientist at RnD Lab, who presented his deep learning pipeline for recognizing road signs, which we have already covered in earlier posts.

Third day

BSSL

The third day of DSW began with Alexander Larionov, CEO of BSSL, a company that does business sociometry using the Azimuth method: an employee evaluation system for analyzing the interactions and competencies of employees in an organization.

Before the analysis itself, data must be collected through a survey. Employees answer questions about working relationships within the team: which colleagues contributed to the tasks you worked on over the past six months? Whose attention, assistance, or help did you need?

Several analyses are then built on this data:

Social network and citation index. To measure how much each employee is in demand, we can build a social graph that clearly shows who is "needed" by the largest number of people: the person with the most incoming arrows is the most in demand in the team. An alternative measure of demand is PageRank (PR), which Google uses to rank pages. The logic of the algorithm is that if a well-cited page links to my page, my PR grows faster than if many uncited pages linked to it.

In the example graph, employee D is "needed" by the largest share of the team, but from the PageRank point of view it is employee C, not D, who is most in demand, because C is needed by a person to whom many people turn.
Egocentric network. Social graphs work well for a small number of participants (10 or 20), but with many more it becomes impossible to draw all the links. An alternative is the egocentric network: a way of visualizing one particular person's relationships with colleagues, where a circle's proximity to the center shows the intensity of interaction with the central person, the circle's size shows how much that colleague is in demand, and so on (about 10 parameters in total).

Compatibility matrix. Besides how healthy the relationships in the team are, we would like to know how interchangeable employees are (in case of illness, vacation, and so on). Several metrics are used for this: the substitute's soft skills, familiarity with the range of tasks, and good relations both with the replaced person's circle of contacts and with the replaced person. Also, if a person systematically names someone in answers to "negative questions" ("was unavailable", "was busy"), they most likely do not think highly of that colleague, and vice versa. From the employees' answers a compatibility matrix is then built, with employees in the rows and columns and, at each intersection, the correlation coefficient of their answers multiplied by 1000.

In the example matrix, William has an excellent relationship with Sylvia and Alice, but not with John.
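Two of the techniques above lend themselves to a short sketch: the PageRank-style citation index and the compatibility matrix. The code below is a minimal illustration in plain Python, not the Azimuth implementation. The survey graph, the answer vectors, and the function names are all hypothetical, and Pearson correlation is assumed as the correlation coefficient.

```python
from math import sqrt


def pagerank(links, damping=0.85, iterations=100):
    """Power iteration; links[a] lists the colleagues that a names as needed."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # Someone who names nobody spreads their weight evenly.
            targets = links.get(node) or nodes
            for target in targets:
                new[target] += damping * rank[node] / len(targets)
        rank = new
    return rank


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant answer vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def compatibility_matrix(answers):
    """Pairwise correlations of answers, multiplied by 1000 and rounded."""
    names = sorted(answers)
    return {p: {q: round(1000 * pearson(answers[p], answers[q])) for q in names}
            for p in names}


# Hypothetical survey graph: D is named by three colleagues, C by two,
# but one of the two who name C is D.
links = {"A": ["D"], "B": ["D"], "E": ["D"], "D": ["C"], "F": ["C"]}
rank = pagerank(links)

# Hypothetical numeric answers of three employees to the same four questions.
answers = {
    "William": [1, 2, 3, 4],
    "Sylvia": [2, 4, 6, 8],
    "John": [4, 3, 2, 1],
}
matrix = compatibility_matrix(answers)
print(max(rank, key=rank.get), matrix["William"])
```

In this toy graph D has more incoming arrows than C, yet C ends up with the higher rank because D's weight flows on to C, which mirrors the C-versus-D example in the text.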

This was followed by a panel discussion on "Selecting teams for working with data and evaluating their effectiveness". Olga Filatova, Vice President for Personnel and Educational Projects at Mail.ru Group, moderated, and the participants were Victor Kantor (Yandex), Andrey Uvarov (MegaFon), Pavel Klemenkov (Rambler & Co), and Alexander Yerofeyev (Sberbank). We will write about this session separately, because there is a lot to tell.

Buran HR

The conversation about teamwork was continued by Anahit Antonyan, CEO of Buran HR, a company that recruits people for startup teams. She spoke about how a young IT specialist can choose between working at a startup and at a large corporation, and who ends up satisfied with the choice.

“From my many years of experience in HR, I can say that some competencies are better suited to working in a startup and can, at the same time, make working in a large company harder.

As for satisfaction with the choice, a young startup employee (under 27) is, on average, 40% more satisfied with the job than a corporate employee of the same age, while for people over 32 the opposite is true: an older corporate employee is on average 36% more satisfied.

Looking not at the IT industry as a whole but at Data Science specifically, specialists in this field feel happier than other developers.”

Best Brains Consultancy

Natalya Tikhomirova, executive coach and head of BestBrains Consultancy, closed the busy third day and the whole conference with a talk on how to prepare yourself for new career turns.

To be ready for a job change, a career break, or other changes in your professional life, you first need to clearly understand your own contribution to the company's work and start working through your own fears.

However trite it sounds, fear is normal. To overcome it, I suggest the following method: before any career turn, when you do not know how to reach the cherished point B, model two situations that push you forward. First, the most terrible, most hellish thing that could happen to you on the way to the goal; then, without fail, the most desirable one, the thing you want most. Unwinding your path in this way from the end point back to the start makes it easier to see what holds you back and what moves you forward.

The same goes for mistakes. The approach to mistakes in Russia and in the West is fundamentally different. Here it is not customary to talk about them openly, and every career history is polished to a shine, while in the West people speak openly about their career failures, because this makes it clear that their victories are not accidental but the result of lessons learned and many attempts.

At the next stage, you need to understand yourself and answer the following questions: “What do I know and what can I do?”, “What is important to me?”, “Where am I going and why?”. Accurate knowledge of yourself and your professional environment supports adequate self-esteem, self-confidence, and effectiveness. Another important question is “What can stop me?”. It may be low motivation, lack of support, isolation from communication, or other factors that need to be identified and managed.

Finally, once a person clearly understands, before a career turn, who they are, where they want to go, and who they want to become, it is time to act.

The partner of Data Science Week 2017 is MegaFon, and the info partner is Pressfeed.

Pressfeed is a way to get free publications about your company: a subscription service that delivers journalists' requests to business representatives and PR specialists. A journalist leaves a request, and you answer it.

Source: https://habr.com/ru/post/339956/
