Starting a big data project: 6 important questions

That data should be used in business has long been obvious to many, and the potential benefits are clear. What is often unclear is where to start and how to move into this future, which in some places has already arrived.

When launching a big data initiative, or even a single project, a manager has many questions and wants answers to them.

1. I seem to have some data. What can I do with it?

Of course, you should start from the project's goals, which in turn derive from current business tasks. Broadly speaking, using data can help increase revenue or reduce costs through optimization. For example, you can hire people more effectively, spending less money and time on it and reducing routine work for employees. Or you can introduce a recommendation system that raises the average customer receipt while offering customers something they really need. For example, the largest telecom companies in Russia (MegaFon, Beeline, MTS) take an individual approach to setting tariffs. By analyzing their own subscriber data across several dozen parameters, the companies offer customers individual tariff plans, and as a result the profit per subscriber increases.

2. I understand what I want. What data do I need?

There are two types of data: internal and external. Strategically, it is better to start with the data you already have. Using external data sources is the next step, needed mainly to enrich the existing data, which improves the quality of the models built on it. External data sources include social networks, where you can find important and relevant information about a client, and the Internet of Things (IoT): by 2020, millions of devices will be interconnected through IoT, improving all spheres of life, from smart homes to traffic light regulation. Companies already benefit greatly from the Internet of Things today: Apple constantly collects data from all its devices, from the iPhone to the Apple Watch, and receives valuable information (whether the design and interface are convenient, how often people use the devices, and so on) to keep improving its products.

At the same time, exactly which data you need becomes clear only after you formulate hypotheses. If the goal is to improve HR efficiency, you need to think about what could, in theory, influence it. For example, if we want to predict employee attrition, we can collect hypotheses such as "people who plan to leave start arriving late more often" or "people who plan to leave spend less time at the computer". A similar case was recently described on Habr.
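To make the link between hypotheses and data concrete, here is a minimal sketch of how the two attrition hypotheses above could be turned into per-employee features. Everything in it is an assumption made for illustration: the record layout, the employee IDs, the 9:15 lateness cut-off and the feature names are hypothetical, not taken from any real HR system.

```python
from datetime import time
from statistics import mean

# Hypothetical raw records: (employee_id, arrival_time, hours_at_computer).
attendance_log = [
    ("emp_1", time(9, 0), 7.5),
    ("emp_1", time(9, 5), 7.0),
    ("emp_1", time(8, 55), 7.8),
    ("emp_2", time(9, 40), 4.0),
    ("emp_2", time(10, 10), 3.5),
    ("emp_2", time(9, 0), 5.0),
]

# Assumed cut-off after which an arrival counts as late.
LATE_AFTER = time(9, 15)


def build_features(log):
    """Turn raw daily records into one feature row per employee."""
    per_employee = {}
    for employee_id, arrival, hours in log:
        per_employee.setdefault(employee_id, []).append((arrival, hours))

    features = {}
    for employee_id, records in per_employee.items():
        late_days = sum(1 for arrival, _ in records if arrival > LATE_AFTER)
        features[employee_id] = {
            # Hypothesis 1: people who plan to leave are late more often.
            "late_share": late_days / len(records),
            # Hypothesis 2: they spend less time at the computer.
            "avg_hours_at_computer": mean(hours for _, hours in records),
        }
    return features


features = build_features(attendance_log)
for employee_id, row in features.items():
    print(employee_id, row)
```

Each hypothesis becomes one column, so the list of hypotheses directly tells you which raw data you have to collect.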

Retail banking is another vivid example. In credit scoring, banks want to know as much as possible about their customers, and age, income and credit history may not be enough to assess the likelihood of default well. Therefore, over the past few years the country's largest banks (Sberbank, VTB24, Alfa-Bank, Tinkoff) have begun to use external data sources in scoring, in particular customer profiles in social networks.

3. The data is more or less clear. What algorithms are there for working with it?

There are several types of analysis: descriptive, exploratory, predictive, and so on. Each solves its own task and can bring value to the organization. Perhaps the most interesting is predictive analysis, which often comes down to using machine learning algorithms.

The essence of machine learning is as follows. We have data on a number of objects for which we know the outcome we want to predict. We also have data on other objects for which the outcome is unknown, and we ask the algorithm to make a forecast for them using what it learned from the first dataset.

Broadly, there are two supervised machine learning tasks: classification and regression. In classification, we predict a categorical variable: gender, age category, whether a purchase was made, and so on. In regression, we predict a quantitative variable: the cost of an apartment, wages, sales, and so on.
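The difference between the two tasks can be shown with a small sketch in plain Python. It deliberately uses the simplest possible methods (a one-nearest-neighbour rule for classification and a least-squares line for regression), chosen only for illustration. The customers, apartment areas and prices are made-up numbers, and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_class(labeled_objects, new_object):
    """Classification: copy the label of the closest object with a known outcome."""
    _, label = min(
        labeled_objects,
        key=lambda row: squared_distance(row[0], new_object),
    )
    return label


def fit_line(xs, ys):
    """Regression: least-squares line through the points, returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Classification: (age, visits per month) -> fact of purchase (made-up data).
customers = [
    ((25, 10), "bought"),
    ((30, 12), "bought"),
    ((50, 1), "did_not_buy"),
    ((45, 2), "did_not_buy"),
]
predicted_label = predict_class(customers, (28, 9))

# Regression: apartment area in square metres -> price in millions (made-up data).
areas = [30, 45, 60, 75]
prices = [3.0, 4.5, 6.0, 7.5]
slope, intercept = fit_line(areas, prices)
predicted_price = slope * 50 + intercept

print(predicted_label)
print(round(predicted_price, 2))
```

In both cases the pattern is the same: learn from objects with a known outcome, then forecast for a new object. Only the type of the predicted variable differs.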

Today, the most popular regression and classification algorithms are gradient boosting, random forest, logistic regression, and neural networks. We have already mentioned credit scoring, which is a prime example of a classification problem.

As for regression, one example of such a task is demand forecasting. In the first quarter of 2016, Yandex Data Factory developed and successfully tested a predictive demand model for discounted (promotional) goods for Pyaterochka, an X5 Retail Group brand. Promotional goods account for about a third of the company's total turnover, so more accurate planning will reduce the costs of storing excess inventory or, conversely, of running short of stock.

In addition to learning from labeled data, there is another type of task: unsupervised learning, that is, building a model from data that has no target variable (unlabeled data). An example is clustering, which divides a population into groups of similar objects. The business analogue of this task is user segmentation for creating individual offers, which we have already mentioned.

However, there are two caveats. First, without labeled data (data where the outcome is known) you cannot make a forecast, and even when such data exists, it can be hard to decide what the target variable should be for the business task at hand. For example, you want to determine the ideal location for new outlets. What will be the most important criterion of "ideal": revenue, proximity to the metro, the number of visitors per day?

Second, the quality of models depends more on the amount of data than on the complexity of the algorithms.
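As an illustration of clustering, below is a minimal k-means sketch in plain Python. K-means is only one of many clustering algorithms and is used here as an assumption, since the text does not name a specific method. The customer data, the choice of two segments and the naive initialisation from the first points are likewise illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=10):
    """Plain k-means: alternate between assigning points and moving centroids."""
    # Naive initialisation (an assumption): use the first k points as centroids.
    centroids = list(points[:k])
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignment, centroids


# Made-up customers: (average receipt, visits per month). No target variable.
# Real features would normally be scaled first; here the receipt dominates.
customers = [(200, 12), (1500, 2), (220, 10), (1600, 1), (210, 11), (1550, 3)]

segments, centers = k_means(customers, k=2)
for customer, segment in zip(customers, segments):
    print(customer, "-> segment", segment)
```

Note that the algorithm only returns group numbers. Interpreting a segment (for example, "frequent small purchases" versus "rare large purchases") and designing an offer for it remains a business decision.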

4. OK. What software do I need?

First, you need to decide whether you are ready to use open-source solutions or want an enterprise one. Open-source solutions are attractive because they are free, but if something breaks, there is no support to turn to. Enterprise solutions can be customized for you, and they are supported by professionals in the field. For example, QIWI, Tinkoff and Sberbank built their big data clusters on their own, while many other companies turn to third-party experts who can develop a ready-made solution for the business.

Second, the choice of software depends on the amount of data. If there is a lot of data, the current standard is the Hadoop ecosystem, which includes the distributed HDFS data store, the HBase column-oriented database, the Hive and Spark analytics tools, and much more. Sberbank, for example, actively uses these tools. If there is little data, that would be like shooting sparrows with a cannon, so it is quite possible to get by with ordinary relational databases and, for example, Jupyter Notebook, the environment where a data scientist performs most calculations, builds models and preprocesses data.

Third, the choice of software is influenced by the type of data processing that best suits the company's needs. There are two main types of big data processing: streaming and batch. Streaming processing analyzes data at intervals of up to several seconds, which suits companies working with continuous data: e-commerce, SMM, retail. More than 140 million users post more than 350 million tweets on Twitter daily, so the company uses a streaming approach based on Apache Storm to handle such a huge stream of data. Batch processing performs a comprehensive analysis of all available data; calculations take more than a minute, and thoroughness of the calculations is prioritized over speed.
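The contrast between the two processing styles can be sketched in a few lines of plain Python. This is a conceptual illustration only: it does not use Apache Storm or any other real streaming framework, and the order amounts and the names `batch_average` and `RunningAverage` are hypothetical.

```python
def batch_average(all_events):
    """Batch: wait until all data is available, then compute over the whole set."""
    return sum(all_events) / len(all_events)


class RunningAverage:
    """Streaming: update the result as each event arrives, without storing history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: shift the current mean towards the new value.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# Made-up order amounts arriving one by one.
order_amounts = [120.0, 80.0, 100.0, 140.0]

# Streaming: an up-to-date answer exists after every single event.
stream = RunningAverage()
intermediate = [stream.update(amount) for amount in order_amounts]
print(intermediate)

# Batch: one answer, available only after everything has been collected.
print(batch_average(order_amounts))
```

Both approaches arrive at the same final average here. The difference is when the answer becomes available and how much data has to be kept around to compute it.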

5. And what about the hardware?

The hardware also depends on the amount of data you are going to work with and the software you plan to use. The idea behind current big data solutions is to use so-called commodity hardware. This means that supercomputers are not needed; ordinary servers will do, although of course the more powerful they are, the better.

Another point to think about is whether to buy servers, rent them or use the cloud. If you work with personal data, the answer is almost always the same: buy your own servers. If there is no personal data, other options may also make economic sense. A potential advantage of the cloud is that you can very quickly test an idea or run a pilot on it, for example, before deciding how to proceed. Clouds also recover quickly after failures and scale easily with a few keystrokes, while physical servers require months of planning. In addition, cloud providers have grant programs for startups, which is also a plus.

6. Suppose I have it all. What kind of people do I need?

It is customary to single out three roles: data scientist, data engineer, data manager. The first, as a rule, can program, understands mathematics and builds the machine learning models. The second usually handles data collection and preprocessing, as well as software configuration. The third understands the business very well, knows how to monetize data and can set tasks for the other two in their own language. Keeping all three may seem a luxury at first, and since the company does not yet have this expertise, it is not clear how to hire them. One life hack is to send a trusted employee to training, where they can fully immerse themselves in this new and complex topic. Clearly, they will be more of a generalist, but at the initial stage this is a plus. As a result, expertise will appear inside the company, along with a network of contacts to turn to if something comes up.

Our Big Data for Executives program is built around these six questions, among others. At the end of the program, participants fill out a specially designed template for evaluating a big data project and come away with a kind of roadmap. At the project presentations, our expert gives feedback and useful advice on strategy.

Source: https://habr.com/ru/post/329544/
