In this article I would like to discuss the basic principles of building a practical project in (so-called "intelligent") data analysis, and along the way to pin down the necessary terminology, including its Russian-language equivalents.
According to Wikipedia,
Data analysis is a field of mathematics and computer science concerned with constructing and studying the most general mathematical methods and computational algorithms for extracting knowledge from experimental (in the broad sense) data; it is the process of inspecting, filtering, transforming, and modeling data with the goal of extracting useful information and supporting decision-making.
Putting it a bit more simply, I would suggest understanding data analysis as the set of methods and applications built around data-processing algorithms that do not have a single, clearly fixed answer for every input object. This distinguishes them from classical algorithms, such as those implementing sorting or a dictionary. The running time and memory footprint of a classical algorithm depend on its specific implementation, but the expected result of applying it is strictly fixed. In contrast, we expect a digit-recognizing neural network to answer "8" when shown a picture of a handwritten eight, but we cannot demand that result. Moreover, any (reasonably sized) neural network will sometimes make mistakes even on perfectly valid input data. We will call such problem statements, together with the methods and algorithms used to solve them, non-deterministic (or fuzzy), as opposed to classical (deterministic, exact) ones.
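The contrast can be made concrete with a toy sketch (the "classifier" here is a deliberately crude stand-in, not a real recognizer):

```python
# A deterministic algorithm: the result of sorting is strictly fixed
# by the specification, regardless of implementation details.
def sort_numbers(xs):
    return sorted(xs)

# A fuzzy-style classifier: a hand-crafted rule that guesses whether
# a "digit image" (represented here only by its ink pixel count) is
# an 8. We hope it answers correctly, but cannot demand it: some
# valid inputs will inevitably be misclassified.
def looks_like_eight(ink_pixels):
    # hypothetical rule: eights tend to use a lot of ink
    return ink_pixels > 120

print(sort_numbers([3, 1, 2]))  # always [1, 2, 3]
print(looks_like_eight(150))    # True, but only a guess
```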
Algorithms and Heuristics
The digit-recognition problem described above could be attacked by trying to hand-craft a function that implements the corresponding mapping. Most likely, this would go neither quickly nor well. Alternatively, one can resort to machine learning methods, that is, use a manually labeled sample (or, in other cases, some kind of historical data) to select the decision function automatically. Hereinafter, I will call a (generalized) machine learning algorithm any algorithm that, one way or another, uses data to build a non-deterministic algorithm solving a given problem. (The non-determinism of the resulting algorithm is required so that the definition does not also cover, say, a lookup table built from preloaded data or an external API.)
Thus, machine learning is the most common and powerful (though by no means the only) method of data analysis. Unfortunately, no one has yet invented machine learning algorithms that handle data of more or less arbitrary nature well, so the practitioner has to preprocess the data to bring it into a form suitable for the algorithm. In most cases this preprocessing is called feature selection, or simply preprocessing. The point is that most machine learning algorithms take as input fixed-length tuples of numbers (for mathematicians, points in ℝⁿ). Nowadays, however, various neural-network-based algorithms are also widely used that can accept as input not only tuples of numbers but also objects with additional, mostly geometric, structure, such as images (the algorithm accounts not only for pixel values but also for their relative positions), audio, video, and text. Some preprocessing usually still occurs in these cases, so one can say that for them feature selection is replaced by the selection of a successful preprocessing.
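A minimal sketch of what such preprocessing looks like in practice (field names and features here are illustrative): raw records of arbitrary structure are turned into fixed-length numeric vectors, i.e. points in ℝⁿ, which is the input format most learning algorithms expect.

```python
def extract_features(user):
    """Map a raw record to a fixed-length tuple of numbers."""
    return (
        float(user["age"]),
        float(len(user["name"])),                   # crude textual feature
        1.0 if user["city"] == "Moscow" else 0.0,   # one-hot-style flag
    )

raw = [
    {"name": "Alice", "age": 30, "city": "Moscow"},
    {"name": "Bob", "age": 25, "city": "Riga"},
]
X = [extract_features(u) for u in raw]
print(X)  # [(30.0, 5.0, 1.0), (25.0, 3.0, 0.0)]
```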
A supervised machine learning algorithm (in the narrow sense of the word; in Russian literature, "learning with a teacher") can be described as an algorithm (for mathematicians, a mapping) that takes as input a set of points in ℝⁿ (also called examples or samples) together with labels (the values we are trying to predict), and outputs an algorithm (i.e., a function) that assigns a specific predicted value to any input from the example space. For example, in the case of the digit-recognizing neural network mentioned above, a special training procedure uses the sample to set the weights of the connections between neurons, and at the application stage these weights are used to compute a prediction for each new example. By the way, the set of examples together with their labels is called the training set.
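In code, such an algorithm is literally a function that takes examples and labels and returns a function. A 1-nearest-neighbor learner is one of the simplest fully self-contained illustrations (a sketch, not a recommendation):

```python
def train_1nn(examples, labels):
    """Takes a training set, returns a prediction function."""
    def predict(x):
        # answer with the label of the closest stored example
        def dist(e):
            return sum((a - b) ** 2 for a, b in zip(e, x))
        best = min(range(len(examples)), key=lambda i: dist(examples[i]))
        return labels[best]
    return predict

X = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
y = ["small", "small", "big"]
model = train_1nn(X, y)   # learning: training set -> function
print(model((4.5, 4.8)))  # big
```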
The list of effective supervised learning algorithms (in the narrow sense) is rather short and is rarely extended, despite active research in this area. Applying these algorithms correctly, however, requires experience and training. Effectively reducing a practical task to a data analysis task, choosing the feature list or preprocessing, choosing the model and its parameters, and implementing it all competently are not easy problems on their own, let alone in combination.
The general scheme for solving a data analysis problem using machine learning methods is as follows:
It is convenient to treat the chain "preprocessing - machine learning model - postprocessing" as a single entity. Often such a chain stays unchanged and is merely retrained regularly on new data. In some cases, especially in the early stages of a project's development, its contents are replaced by a more or less complex heuristic that does not depend directly on the data. There are trickier cases as well. Let us coin a separate term for such a chain (and its possible variants) and call it a meta-model. In the case of a heuristic, it reduces to the following scheme:
A heuristic is simply a hand-crafted function that does not use any advanced methods and, as a rule, does not produce great results, but is acceptable in certain cases, for example in the early stages of a project's development.
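The meta-model idea can be sketched as a composition of three callables; at an early stage the middle step is a heuristic rather than a trained model. All names and the toy spam rule below are illustrative:

```python
def make_meta_model(preprocess, model, postprocess):
    """Compose preprocessing -> model -> postprocessing into one callable."""
    def meta_model(raw_object):
        return postprocess(model(preprocess(raw_object)))
    return meta_model

# Early-stage variant: the "model" slot holds a hand-written heuristic.
preprocess = lambda text: len(text.split())   # feature: word count
heuristic = lambda n_words: n_words > 3       # crude spam rule
postprocess = lambda flag: "spam" if flag else "ham"

classify = make_meta_model(preprocess, heuristic, postprocess)
print(classify("buy cheap pills online now"))  # spam
print(classify("hello there"))                 # ham
```

Later, the heuristic can be swapped for a trained model without touching the rest of the chain, which is exactly why the chain is worth naming as a single entity.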
Supervised Machine Learning Tasks
Depending on the problem statement, supervised machine learning tasks are divided into classification, regression, and logistic regression tasks.
Classification is a problem statement in which one must determine which class, from some well-defined list, an incoming object belongs to. A typical and popular example is the digit recognition mentioned above, where each image must be assigned to one of 10 classes corresponding to the digit depicted.
Regression is a problem statement in which one must predict some quantitative characteristic of an object, such as price or age.
Logistic regression combines properties of the two statements above. Here we are given events that occurred on objects, and we must predict the probabilities of these events on new objects. A typical example is predicting the probability that a user clicks on a recommended link or an advertisement.
Metric selection and validation procedure
A prediction quality metric for a (fuzzy) algorithm is a way to evaluate the quality of its work by comparing the results of its application with the actual answers. More mathematically, it is a function that takes as input a list of predictions and a list of correct answers, and returns a number reflecting the quality of the prediction. For example, in the case of a classification problem, the simplest and most popular option is the number (or fraction) of mismatches, and in the case of a regression problem, the standard deviation of the errors. In some cases, however, practical considerations force the use of less standard quality metrics.
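Minimal implementations of the two metrics just mentioned, the fraction of mismatches for classification and the root-mean-square error for regression:

```python
import math

def mismatch_rate(predictions, answers):
    """Fraction of positions where prediction != answer."""
    return sum(p != a for p, a in zip(predictions, answers)) / len(answers)

def rmse(predictions, answers):
    """Root-mean-square error of the predictions."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predictions, answers)) / len(answers)
    )

print(mismatch_rate([8, 3, 8, 1], [8, 5, 8, 1]))  # 0.25
print(rmse([2.0, 3.0], [1.0, 5.0]))               # ~1.58
```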
Before deploying an algorithm into a product that works with real users (or handing it over to the customer), it is worth estimating how well it performs. This is done with the following mechanism, called the validation procedure. The available labeled sample is split into two parts: a training set and a validation set. The algorithm is trained on the training set, and its quality is evaluated (validated) on the validation set. If we are not yet using a machine learning algorithm but are tuning a heuristic, we can consider the entire labeled sample on which we evaluate quality to be the validation set, and the training set to be empty, i.e., to consist of 0 elements.
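A sketch of the split itself; the 80/20 ratio is a common convention rather than a rule, and the fixed seed is only there to make the example reproducible:

```python
import random

def train_validation_split(examples, labels, val_fraction=0.2, seed=0):
    """Shuffle indices, then carve off a validation part."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    n_val = int(len(idx) * val_fraction)
    val, train = idx[:n_val], idx[n_val:]
    return ([examples[i] for i in train], [labels[i] for i in train],
            [examples[i] for i in val], [labels[i] for i in val])

X = [[float(i)] for i in range(10)]
y = [i % 2 for i in range(10)]
X_tr, y_tr, X_val, y_val = train_validation_split(X, y)
print(len(X_tr), len(X_val))  # 8 2
```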
Typical project development cycle
In the most general terms, the development cycle of a data analysis project is as follows.
Studying the problem and the possible data sources.
Reformulating it in mathematical language; choosing a prediction quality metric.
Writing a pipeline for training and (at least trial) use in a real environment.
Writing a heuristic that solves the problem, or a simple machine learning algorithm.
Improving the quality of the algorithm as needed, possibly refining the metric and bringing in additional data.
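The cycle above can be compressed into a toy end-to-end run (every piece here is illustrative): pick a metric, start with a heuristic baseline, then replace it with a simple "learned" model and compare the two on held-out data.

```python
def accuracy(pred, truth):
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

# Steps 1-2. Toy problem and metric: predict whether x exceeds an
# unknown threshold; quality is measured by accuracy.
train = [(x, x > 5) for x in range(0, 10)]
valid = [(x, x > 5) for x in [2, 4, 6, 8]]

# Step 4, first pass: heuristic baseline that always predicts False.
baseline = [False for x, _ in valid]

# Step 4, second pass: a trivially "learned" model whose threshold
# is chosen from the training data.
threshold = max(x for x, label in train if not label)
learned = [x > threshold for x, _ in valid]

# Step 5: compare the two on the validation set.
truth = [label for _, label in valid]
print(accuracy(baseline, truth), accuracy(learned, truth))  # 0.5 1.0
```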
Conclusion
That's all for now. Next time we will discuss which specific algorithms are used to solve classification, regression, and logistic regression problems; how to carry out a basic study of the problem and prepare its results for use by an application programmer can already be read here.
P.S. In a recent topic I argued a bit with people who hold a more academic view of machine learning than mine, which had a somewhat negative impact on my habrakarma. So if you would like to speed up the appearance of the next article and have the appropriate privileges, please upvote me a little; this will help me write and publish the sequel sooner. Thanks.