Habr, hello! We interviewed Sergey Chekansky, a graduate of the Big Data Specialist program and a machine learning and big data project manager at QIWI. Sergey spoke about his experience developing and implementing a big data cluster, described a typical Data Scientist's day, and gave practical advice to novice analysts.
- Hello, Sergey! It's great to see your career in Data Science developing so successfully. Tell us, why did you decide to move into data analysis, and how did you end up at QIWI?
- It all started with my admission to the MIT master's program, where I decided to add a few courses related to data analysis and algorithms, in particular Data Science in marketing, finance, and healthcare. I got my first practical experience of data analysis while writing my thesis: I studied the effect of tweets about the economies of Greece, Portugal, Italy, and Spain on credit default swap quotes, that is, on the probability of a given country's default.
With access to a huge number of tweets through MIT's database, I managed to build a model that performed sentiment analysis of tweets, assigning each a value from -5 to 5 depending on its emotional tone. After finishing the master's program, I realized that Data Science was the area that interested me. I went through most of the popular online courses in Data Science, for example the classic course by Andrew Ng. A year later, I received an invitation from a former classmate to join Pay Later, a startup developing an online payment system through which users could make purchases with deferred payment. The main difference from a regular loan was that we did not require passport data, only a phone number and confirmation of authorization in a social network. So we had to develop a scoring system that used data from the social network, the phone number, and indirect signals that can be obtained via the IP address and from localStorage, that is, everything that can be "pulled" from a computer.
Before that I had only practiced modeling on Kaggle, but at Pay Later I had to put a Data Science scoring system into production. The peculiarity of the project was that the scoring had to run online: the main requirement was a maximum of 5 seconds per application, the time within which we had to not only collect data from all sources but also score the client. For the first year I wrote the system on my own, then the team began to expand, and we worked on improving scoring accuracy and connected new services.
In parallel, we tried to develop other projects, in particular we entered the e-commerce market. A problem arises there: if a user orders physical goods and wants to pay in cash, he may ultimately refuse the goods. So we decided to apply our scoring skills to estimate the probability of non-payment at the moment the user places the order. Having signed a partnership agreement with one of the delivery-service aggregators, we integrated with their interface to collect the necessary data: how the person formed the order, what he did on the website, and so on. A new scoring model was built on this data and put into production.
Soon QIWI took notice of us. We first cooperated with them on several projects of their R&D department, developing the e-commerce direction, and then, in the summer of 2016, they decided to acquire our startup.
- How does work at QIWI differ from Pay Later? At what stage of development was the big data infrastructure when you joined the company?
- Pay Later had many products, but at QIWI we decided to focus on two areas: the e-commerce solutions we had developed as a startup, and building and implementing data-driven systems directly, for example bringing our credit scoring solutions into the company. For several months we studied the company's analytical systems and databases and realized that we needed to build a big data cluster, since there was no single storage for large-scale computations. Most analytics was performed in relational Oracle databases, and for some purposes Vertica was used, a capable DBMS, but storing dozens of terabytes of data in it is expensive. So we started testing various technologies, and in the end our choice was Apache Spark.
- Why did you decide on Spark? What alternatives did you consider?
- Just at that time I went to study at the New Professions Lab and was inspired by how well Spark works and what you can do with it. As an alternative we considered Vertica: it works quickly with columnar data and has many extensions, so in principle it can be called a light version of Spark. However, comparing what can be done in Vertica and in Spark, we realized that Spark can do the same things, only cheaper. Moreover, with Spark we are not limited in the data storage format: we can use various options, from text and sequence files to a dedicated storage engine. We also looked at Apache Kudu and decided to try it as a storage engine for part of the data. As a result, the big data architecture looks like this: we store part of the data as Parquet in HDFS, part in Kudu, and process all of it with Spark.
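A minimal PySpark sketch of the layout Sergey describes: historical data in Parquet on HDFS, a mutable slice in Kudu, both processed with Spark. The host names, paths, table names, and columns below are hypothetical placeholders, and the kudu-spark connector package is assumed to be available on the cluster.

```python
# Sketch only: paths, hosts, tables, and columns are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("qiwi-analytics-sketch")  # hypothetical application name
    .getOrCreate()
)

# Immutable historical data stored as Parquet in HDFS.
transactions = spark.read.parquet("hdfs:///data/transactions/year=2017")

# Frequently updated data stored in Apache Kudu, read via the kudu-spark connector.
wallets = (
    spark.read.format("org.apache.kudu.spark.kudu")
    .option("kudu.master", "kudu-master:7051")
    .option("kudu.table", "wallets")
    .load()
)

# Join the two sources and run an aggregate entirely in Spark.
daily_volume = (
    transactions.join(wallets, "wallet_id")
    .groupBy("txn_date")
    .sum("amount")
)
daily_volume.show()
```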
- What data and which sources does the company use in analytics? What tasks besides scoring come up?
- The basis is transaction data, data from social networks, and clickstream from three sources: the website, the mobile applications, and the QIWI wallet together with the terminals.
As for the tasks, besides scoring these include identifying user preferences to increase click-through rate (CTR), finding weak points on the site, evaluating the effectiveness of communications, behavioral analytics, social network analysis, and, in our case, analysis of the transactional networks formed by money transfers between users in order to identify fraud.
- What algorithms do you use to solve these tasks? Do you have favorite methods, or do you have to search for an approach for each task?
- First of all, we try to apply new technologies, but the main direction of our experiments is not in building models but in data processing, which is where the idea of using Spark came from. Of course, we use the traditionally favored methods, the same as in most Kaggle tasks not related to sound or image processing: gradient boosting and decision trees. A random forest is mainly used to get an initial result and find out which variables are key and which should be discarded; then we usually look at how gradient boosting handles the task and try to combine several models, though everything is, of course, limited by the model's execution speed. We cannot afford, as on Kaggle, to build a crazy ensemble of 130 different models; the more important task is searching for new features and new dependencies between features.
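A hedged scikit-learn sketch of the two-step workflow described here: a random forest gives a quick baseline and a feature ranking, then gradient boosting is fitted on the retained features. The data is synthetic; real scoring features, thresholds, and parameters would differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a scoring dataset.
X, y = make_classification(n_samples=5000, n_features=40, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: random forest as a quick baseline and feature screen.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
keep = np.argsort(rf.feature_importances_)[-15:]  # keep the 15 most important features

# Step 2: gradient boosting on the reduced feature set.
gb = GradientBoostingClassifier(random_state=0).fit(X_train[:, keep], y_train)
print("ROC-AUC:", roc_auc_score(y_test, gb.predict_proba(X_test[:, keep])[:, 1]))
```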
In turn, when solving the user segmentation problem, we first do manual feature selection and then apply, for example, isolation forests, and see how simple k-means works. Recently we also had the task of identifying users who perform actions that are not fraud in the purest sense but are potentially something unacceptable. We needed to segment the several thousand users who fit the criteria and understand what goals their transactions pursue. To solve this problem we used matrix factorization to extract the main factors and then segmented on them, and we also tried a new algorithm, BigARTM, which is used to determine the topics of texts. We represented each transaction as a word and a set of transactions as a sentence, and used the model to determine the topic of each user's set of sentences. The resulting model showed quite a good result.
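A minimal sketch of the "transactions as words" idea: each user's history is a document whose words are transaction types, and a topic model groups users by the kinds of transfers they make. scikit-learn's LatentDirichletAllocation is used here purely as an illustrative stand-in for the BigARTM model mentioned above, and the transaction codes are invented.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# One "document" per user: a space-separated sequence of hypothetical transaction codes.
user_histories = [
    "p2p_transfer p2p_transfer mobile_topup p2p_transfer",
    "utility_bill utility_bill mobile_topup",
    "p2p_transfer casino_payout p2p_transfer casino_payout",
]

# Count transaction "words" and fit a topic model over the user "documents".
counts = CountVectorizer().fit_transform(user_histories)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Topic mixture per user: users with similar transaction "themes" end up close together.
print(lda.transform(counts))
```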
- What benefit has the company received from implementing the big data cluster?
- The cluster actually went into production only at the end of March, so it is hard to evaluate the business effect yet, but the result is already noticeable in terms of productivity: we have sped up feature calculation by a factor of several hundred, we can afford to process several years of transactional data, and we can instantly analyze the terabytes of clickstream that arrive every second from the site and the applications. In other words, the value of implementing big data is a significant saving of time and, consequently, of cost.
- Tell us, what does a typical Data Scientist's day look like?
- If we talk about my usual day, 50% of the time goes to solving technical problems: checking data cleanliness and accuracy, loading and unloading data, deciding where to store the data marts, and so on. Another 30% goes to various internal corporate matters, and only the remaining 20% is actual analytics: building models, testing hypotheses, running experiments. Data preparation also includes, first of all, solving other employees' problems: why some module does not work, why a data type is converted incorrectly, and so on. To handle this quickly and painlessly, you have to step out of the "I am a mathematician and analyst" mindset and be a bit of a hacker-programmer.
- Which programming language do you think is best suited for data analysis?
- For data analysis there are several languages that are currently the most popular: R and Python, and some would say Scala or Julia. Exotic options are fine, but only if you are already well-versed in them. If a person is starting from scratch, it is better to choose the language with the most information available on the Internet, and at the moment that is Python. Since a newcomer's first year in Data Science will be spent googling how to do things, it is easiest to do that in Python. Later, after mastering Python, I would pay attention to Scala, because it lets you do many things differently, for example use Spark's more advanced API.
- What skills should every Data Scientist have?
- As for tools, I have settled on Python, doing some tasks in Scala. Naturally, I can't do without libraries like Pandas and Scikit-Learn. Pandas in particular is a must: some people try to ignore it and use loops instead, but that approach wastes time, and hand-written functions are often much slower than what has already been written for you and tested many times. So you should use Pandas and read through its extensive documentation, because it can do almost everything, from data transformation to writing to a database. Another advantage of the library is that the DataFrame API in recent versions of Spark closely mirrors the Pandas API, so the transition from Pandas to Spark is almost painless. Also, if a person plans not only to solve Kaggle tasks but also to develop Data Science products in a production environment, they need to learn how to use database modules, from pyodbc to SQLAlchemy.
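A small self-contained sketch of two points from this answer: vectorized Pandas operations instead of hand-written loops, and writing a result straight to a database through SQLAlchemy. An in-memory SQLite database and made-up columns keep the example runnable.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical payment data.
df = pd.DataFrame({"amount": [100, 250, 80], "fee_rate": [0.02, 0.015, 0.02]})

# Vectorized transformation -- no explicit Python loop over rows.
df["fee"] = df["amount"] * df["fee_rate"]

# Write the frame directly to a database table via a SQLAlchemy engine.
engine = create_engine("sqlite://")  # in-memory database for the example
df.to_sql("transactions", engine, index=False, if_exists="replace")

# Read it back to confirm the round trip.
print(pd.read_sql("SELECT * FROM transactions", engine))
```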
- You said you went through a lot of machine learning courses. Which one would you recommend to newcomers?
- I would recommend starting with Andrew Ng's course: it is well structured and explains the logic of the algorithms in simple language, which is very important. You need to understand why one algorithm works well on one task but poorly on another, why you need a lot of RAM to apply SVD, what a local maximum is, and so on. These things are explained very well in that course and create a solid base; after that you should choose courses on the topic that interests you most, be it text analysis or social network analysis, to gain specialized knowledge. There are also now many courses in Russian, for example from Yandex and MIPT; unlike Andrew's course, they go deeper into the mathematics and less into the logic of the algorithms, but they cover model evaluation methods very well, including a good explanation of ROC-AUC curves. It is worth noting that most courses do not spend much time on feature engineering, because this topic depends heavily on the specific data, and that is where many people run into problems. I would advise taking several different courses to develop creativity and learn to apply different approaches to data transformation.
- What did the Big Data Specialist program from New Professions Lab give you?
- First of all, the program introduced me to a huge range of tools and broadened my technical understanding of how they work. I also came to understand that with little effort you can achieve a great deal if you have the right tools. Thanks to the program, I was able to choose the right technologies and the best way to implement and configure the system; in particular, I realized what Spark is capable of, whereas before I had the misconception that implementing Spark would be harder and more expensive than doing the same work with the old tools. In the end I realized that big data is clear and simple, and I learned how to use these technologies to extract maximum benefit at minimal cost.
- And finally: in what direction will machine learning and big data technologies develop in the coming years?
- I think the shift will be toward neural networks, and on the technology side, tools for distributed training of neural networks are already appearing in the Hadoop stack. There is more and more data, and as a result the effectiveness of neural networks keeps growing. There are also more and more courses and a huge number of tools for getting acquainted with neural networks; TensorFlow, in particular, has excellent tutorials. So neural networks are becoming increasingly accessible, and computation and hardware are getting much cheaper.