1.1 Introduction
Thanks to machine learning, a programmer is not required to write instructions that anticipate every possible problem and contain every solution. Instead, the computer (or a separate program) builds its own algorithm for finding solutions by making integrated use of statistical data, deriving regularities from them and making predictions on that basis.
Machine learning as a technology based on data analysis dates back to the 1950s, when the first programs for playing checkers were developed. Over the past decades the general principle has not changed, but thanks to the explosive growth of computing power, the regularities and forecasts produced by machines have become far more complex, and the range of problems and tasks solved with machine learning has expanded.
To start the machine learning process, you first need to load a dataset (some amount of input data) into the computer, on which the algorithm will learn to process requests. For example, these may be photos of dogs and cats that already carry tags indicating which animal they show. After the learning process, the program will be able to recognize dogs and cats in new, untagged images on its own. The learning process continues even after predictions have been issued: the more data the program has analyzed, the more accurately it recognizes the desired images.
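A minimal sketch of this train-then-predict cycle, assuming the scikit-learn library and a tiny made-up dataset of pre-extracted image features in place of real tagged photos:

```python
# A sketch of the "learn from tagged data, then recognize new data" cycle,
# assuming scikit-learn; the feature vectors and labels below are invented.
from sklearn.neighbors import KNeighborsClassifier

# Toy "images": each row is a vector of pre-extracted features,
# each label says whether the picture shows a cat (0) or a dog (1).
X_train = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y_train = [0, 0, 1, 1]

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)          # the "learning" step on tagged data

X_new = [[0.15, 0.85]]               # a new, untagged image
print(model.predict(X_new))          # -> [0], i.e. "cat"
```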
Thanks to machine learning, computers learn to recognize not only faces, but also landscapes, objects, text and numbers in photographs and drawings. As for text, machine learning is indispensable here: a grammar-checking function is now present in every text editor and even in phones. It takes into account not only the spelling of words but also the context, shades of meaning and other subtle linguistic aspects. Moreover, software already exists that can write news articles (on economics or, say, sports) without human intervention.
1.2 Types of machine learning tasks
All tasks solved using ML fall into one of the following categories.
1) The task of regression is to make a forecast based on a sample of objects with different characteristics. The output should be a real number (2, 35, 76.454, etc.): for example, the price of an apartment, the value of a security after half a year, the expected store income for the next month, or the quality of a wine in blind testing.
2) The task of classification is to obtain a categorical answer based on a set of features. It has a finite number of possible answers (often simply "yes" or "no"): is there a cat in the photo, is the image a human face, is the patient ill with cancer.
3) The task of clustering is the distribution of data into groups: the division of all clients of a mobile operator according to the level of solvency, the assignment of space objects to a particular category (planet, star, black hole, etc.).
4) The task of dimensionality reduction is to reduce a large number of features to a smaller number (usually 2-3) for the convenience of their subsequent visualization (for example, data compression).
5) The task of identifying anomalies is to separate anomalies from standard cases. At first glance it coincides with the classification task, but there is one significant difference: anomalies are rare, and there are either vanishingly few or no training examples on which a machine learning model could be trained to identify such objects. In practice, such a task is, for example, identifying fraudulent actions with bank cards (a minimal sketch follows this list).
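A minimal sketch of the anomaly-detection task, assuming scikit-learn; the transaction amounts below are invented for illustration:

```python
# Flagging unusual card transactions with an isolation forest, assuming
# scikit-learn; the data are made up.
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly ordinary card transactions plus a couple of outliers.
amounts = np.array([[12.0], [15.5], [14.2], [13.8], [900.0], [11.9], [850.0]])

detector = IsolationForest(contamination=0.3, random_state=0)
labels = detector.fit_predict(amounts)   # 1 = normal, -1 = anomaly
print(labels)
```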
1.3 Basic types of machine learning
The bulk of the problems solved with machine learning methods belong to one of two different types: learning with a teacher (supervised learning) and learning without a teacher (unsupervised learning). However, the "teacher" is not necessarily the programmer himself, standing over the computer and controlling its every action. In machine learning terms, the "teacher" is human intervention in the processing of information. In both types of learning, the machine is provided with initial data that it has to analyze to find patterns. The only difference is that when learning with a teacher, there is a set of hypotheses that have to be refuted or confirmed. This difference is easiest to understand with examples.
Machine learning with a teacher
Suppose we have at our disposal information about ten thousand Moscow apartments: area, floor, district, presence or absence of parking at the house, distance from the metro, apartment price, etc. We need to create a model that predicts the market value of an apartment from its parameters. This is an ideal example of machine learning with a teacher: we have the initial data (the apartments and their properties, which are called features) and a ready answer for each apartment: its price. The program has to solve a regression problem.
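A minimal sketch of such an apartment-price regression, assuming scikit-learn; the feature values and prices are invented:

```python
# Supervised regression: predict an apartment price from its parameters,
# assuming scikit-learn; all numbers are illustrative.
from sklearn.linear_model import LinearRegression

# Features: [area in m^2, floor, distance to metro in km]
X = [[45, 3, 1.2], [60, 7, 0.5], [33, 1, 2.0], [80, 12, 0.3]]
y = [9.5, 14.0, 7.0, 19.5]            # known prices (million rubles)

model = LinearRegression().fit(X, y)
print(model.predict([[55, 5, 0.8]]))  # predicted price of a new apartment
```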
Another example from practice: confirming or refuting the presence of cancer in a patient, knowing all of his medical indicators. Or finding out whether an incoming letter is spam by analyzing its text. These are all classification tasks.
Machine learning without a teacher
In the case of learning without a teacher, when the system is not provided with ready-made "correct answers", everything is even more interesting. For example, we have information about the weight and height of a certain number of people, and these data must be divided into three groups, each of which will be sewn shirts of a suitable size. This is a clustering task. In this case, all the data should be divided into 3 clusters (although, as a rule, there is no such strict and only possible division).
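A minimal sketch of this shirt-size clustering, assuming scikit-learn; the heights and weights are invented:

```python
# Unsupervised clustering of people into 3 size groups by height (cm) and
# weight (kg), assuming scikit-learn; the data are made up.
from sklearn.cluster import KMeans

people = [[160, 55], [165, 60], [175, 75], [178, 80], [190, 95], [193, 100]]

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(people)
print(kmeans.labels_)    # which of the 3 size groups each person falls into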
If we take a different situation, where each object in the sample has a hundred different characteristics, the main difficulty will be the graphical display of such a sample. Therefore, the number of features is reduced to two or three, and it becomes possible to visualize them on a plane or in 3D. This is the task of dimensionality reduction.
1.4 Basic algorithms for machine learning models
1. Decision Tree
This is a decision-support method based on a tree graph: a decision-making model that takes into account the potential consequences of decisions (including the likelihood of an event), their efficiency and resource-intensiveness.
For business processes, this tree is made up of a minimum number of questions that admit an unambiguous answer: "yes" or "no". By answering all these questions in sequence, we arrive at the right choice. The methodological advantages of the decision tree are that it structures and systematizes the problem, and the final decision is made on the basis of logical conclusions.
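A minimal sketch of a decision tree learning such yes/no questions from data, assuming scikit-learn; the loan-style features and labels are invented:

```python
# A shallow decision tree and a printout of the questions it learned,
# assuming scikit-learn; the data are illustrative.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, monthly income]; target: 1 = approve, 0 = decline
X = [[25, 30], [40, 80], [35, 20], [50, 120], [23, 15]]
y = [0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "income"]))  # the learned questions
```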
2. Naive Bayes Classification
Naive Bayes classifiers belong to the family of simple probabilistic classifiers and originate from Bayes' theorem, which in this case treats the features as independent (this is called the strict, or naive, assumption). In practice, it is used in the following areas of machine learning (a spam-detection sketch follows the list):
- detection of spam arriving by e-mail;
- automatic linking of news articles to thematic headings;
- revealing the emotional coloring of the text;
- face recognition and other patterns in images.
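A minimal sketch of the first application, spam detection with naive Bayes, assuming scikit-learn; the example messages are invented:

```python
# Naive Bayes spam classifier over word counts, assuming scikit-learn;
# the texts and labels are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting moved to monday",
         "free money click here", "lunch at noon tomorrow"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)        # word counts as features
model = MultinomialNB().fit(X, labels)

print(model.predict(vectorizer.transform(["free prize inside"])))  # -> [1]
```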
3. Least squares method
Anyone who has studied at least a little statistics is familiar with the concept of linear regression; the least squares method is one way of implementing it. Usually, linear regression solves the problem of fitting a straight line through many points. Here is how it is done with least squares: draw a straight line, measure the distance from it to each of the points (the points and the line are connected by vertical segments), and add these distances up. The line for which the sum of the distances is smallest is the desired one (this line passes through points with a normally distributed deviation from the true value).
A linear function is usually used when fitting data in machine learning, and the least squares method is used to minimize the error through an error metric.
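A minimal sketch of fitting a straight line by least squares, using only NumPy; the points are invented and scattered around y = 2x + 1:

```python
# Least-squares fit of a line to noisy points with NumPy.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# polyfit minimizes the sum of squared vertical distances to the line
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)      # close to 2 and 1
```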
4. Logistic regression
Logistic regression is a way of determining the dependence between variables, one of which is categorical and dependent while the others are independent. For this, the logistic function (the cumulative logistic distribution) is used. The practical significance of logistic regression is that it is a powerful statistical method for predicting events from one or more independent variables. It is applied in the following situations (a short sketch follows the list):
- credit scoring;
- measurement of the success of ongoing advertising campaigns;
- profit forecast for a certain product;
- estimate of the probability of an earthquake on a specific date.
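A minimal sketch of logistic regression as a probability-of-event model, assuming scikit-learn; the credit-style data are invented:

```python
# Logistic regression predicting the probability of loan repayment from
# income, assuming scikit-learn; all values are illustrative.
from sklearn.linear_model import LogisticRegression

# Feature: [monthly income]; target: 1 = loan repaid, 0 = defaulted
X = [[20], [35], [50], [65], [80], [95]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[55]]))   # probability of each outcome for income 55
```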
5. Support Vector Machine (SVM)
This is a whole set of algorithms for solving classification and regression problems. Assuming that an object located in an N-dimensional space belongs to one of two classes, the support vector machine builds a hyperplane of dimension (N - 1) so that all objects fall into one of the two groups. On paper, this can be pictured as follows: there are points of two different types that can be linearly separated. Besides separating the points, this method chooses the hyperplane so that it is as far as possible from the nearest point of each group.
SVM and its modifications help to solve such complex machine learning tasks as DNA splicing, determining a person's sex from a photo, and displaying advertising banners on websites.
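A minimal sketch of a linear SVM separating two classes of points, assuming scikit-learn; the 2-D points are invented:

```python
# A linear support vector machine on two linearly separable groups of points,
# assuming scikit-learn; the data are made up.
from sklearn.svm import SVC

X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

model = SVC(kernel="linear").fit(X, y)
print(model.support_vectors_)          # the points closest to the separating hyperplane
print(model.predict([[3, 3], [7, 5]]))
```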
6. Ensemble method
It is based on machine learning algorithms that generate many classifiers and separate all objects in newly received data based on averaging or voting over their results. Initially, the ensemble method was a special case of Bayesian averaging, but it then became more complex and acquired additional algorithms:
- boosting - converts weak models into strong ones by forming an ensemble of classifiers (from a mathematical point of view, an improving intersection);
- bagging - builds complex classifiers while training the basic ones in parallel (an improving union);
- error-correcting output coding.
The ensemble method is a more powerful tool than stand-alone forecasting models (see the sketch after this list) because:
- it minimizes the effect of randomness, averaging the errors of each basic classifier;
- reduces variance, since several different models coming from different hypotheses are more likely to arrive at the correct result than one taken separately;
- it prevents going beyond the hypothesis set: if the aggregated hypothesis lies outside the set of basic hypotheses, then at the stage of forming the combined hypothesis that set is expanded by one method or another, and the hypothesis falls within it.
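A minimal sketch of the ensemble idea: many weak trees combined by bagging (a random forest) and by boosting, assuming scikit-learn and a synthetic dataset:

```python
# Two ensembles of decision trees - bagged and boosted - on synthetic data,
# assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

bagged = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
boosted = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

print(bagged.score(X, y), boosted.score(X, y))   # training accuracy of each ensemble
```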
7. Clustering Algorithms
Clustering is the distribution of a set of objects into categories so that within each category - cluster - the elements are as similar to each other as possible.
You can cluster objects using different algorithms. The most commonly used are the following:
- based on the center of gravity (centroid-based);
- based on connectivity;
- based on dimensionality reduction;
- density-based (spatial clustering);
- probabilistic;
- based on machine learning, including neural networks.
Clustering algorithms are used in biology (a study of the interaction of genes in the genome numbering up to several thousand elements), sociology (processing the results of sociological research by Ward’s method, giving clusters with a minimum dispersion and approximately the same size) and information technologies.
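A minimal sketch of one of the families listed above, density-based clustering (DBSCAN), assuming scikit-learn; the points are invented:

```python
# Density-based clustering: dense groups become clusters, isolated points
# are marked as noise; assumes scikit-learn, data are made up.
from sklearn.cluster import DBSCAN

points = [[1.0, 1.1], [1.1, 0.9], [0.9, 1.0],     # a dense group
          [5.0, 5.1], [5.1, 4.9], [4.9, 5.0],     # another dense group
          [9.0, 0.0]]                              # an isolated point

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)    # cluster indices; -1 marks the isolated point as noise
```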
8. Principal Component Analysis (PCA)
Principal component analysis, or PCA, is a statistical orthogonal transformation whose aim is to translate observations of variables that may be somehow interrelated into a set of principal components: values that are not linearly correlated.
Practical tasks in which PCA is applied include visualization and most procedures for compressing, simplifying and minimizing data in order to ease the learning process. However, the principal component method is not suitable for situations where the initial data are poorly ordered (that is, all components of the method are characterized by high dispersion), so its applicability is determined by how well the subject area has been studied and described.
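A minimal sketch of PCA compressing correlated features down to two principal components for visualization, assuming scikit-learn and NumPy with synthetic data:

```python
# Reduce 10 partially correlated features to 2 principal components,
# assuming scikit-learn; the data are randomly generated.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                    # 100 objects, 10 features
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)    # make two features correlated

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)           # now each object is a point on a plane
print(X_2d.shape, pca.explained_variance_ratio_)
```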
9. Singular value decomposition
In linear algebra, the singular value decomposition, or SVD, is defined as the decomposition of a rectangular matrix consisting of complex or real numbers. Thus, a matrix M of dimension m × n can be decomposed in such a way that M = UΣV*, where U and V are unitary matrices and Σ is a diagonal matrix.
One particular case of singular value decomposition is the principal component method. The very first computer vision technologies were developed on the basis of SVD and PCA and worked as follows: first, faces (or other patterns to be found) were represented as a sum of basic components, then their dimensionality was reduced, and then they were compared with the images from the sample. Modern singular value decomposition algorithms in machine learning are, of course, much more complicated and sophisticated than their predecessors, but in general their essence has not changed.
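A minimal sketch of the decomposition M = UΣV* with NumPy, plus the rank-1 approximation that underlies such compression; the matrix is invented:

```python
# Singular value decomposition of a small real matrix and its best rank-1
# approximation, using NumPy.
import numpy as np

M = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])     # a 3 x 2 matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(np.allclose(M, U @ np.diag(s) @ Vt))             # True: decomposition is exact

M_rank1 = s[0] * np.outer(U[:, 0], Vt[0, :])           # best rank-1 approximation
print(M_rank1)
```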
10. Independent Component Analysis (ICA)
This is one of the statistical methods that reveals hidden factors influencing random variables, signals, etc. ICA builds a generative model for multifactor data. The variables in the model include some hidden variables, and there is no information about the rules for mixing them. These hidden variables are the independent components of the sample and are assumed to be non-Gaussian signals.
In contrast to principal component analysis, which is related to this method, independent component analysis is more effective, especially in cases where classical approaches are powerless. It discovers the hidden causes of phenomena, and thanks to this it has found wide application in various fields, from astronomy and medicine to speech recognition, automatic testing and the analysis of the dynamics of financial indicators.
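A minimal sketch of ICA unmixing two synthetic independent signals, assuming scikit-learn and NumPy:

```python
# Recover two independent (non-Gaussian) source signals from their mixtures,
# assuming scikit-learn; the signals and mixing matrix are synthetic.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                         # first hidden source
s2 = np.sign(np.sin(3 * t))                # second hidden source (non-Gaussian)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5], [0.5, 2.0]])     # unknown mixing matrix
X = S @ A.T                                # observed mixed signals

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)               # recovered independent components
print(S_est.shape)                         # (2000, 2)
```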
1.5 Examples of real-life applications
Example 1. Diagnosis of diseases
The patients in this case are the objects, and the features are all the symptoms observed in them, the anamnesis, test results and therapeutic measures already taken (in fact, the entire case history, formalized and broken down into separate criteria). Some features, such as sex or the presence or absence of a headache, cough or rash, are binary. The assessment of the severity of the condition (extremely severe, moderate, etc.) is an ordinal feature, and many others are quantitative: the dose of a drug, the level of hemoglobin in the blood, blood pressure and pulse, age, weight. Having collected information about the patient's condition containing many such features (a small sketch of such a feature table follows the list below), you can load it into a computer and use a program capable of machine learning to solve the following tasks:
- carry out differential diagnostics (determination of the type of disease);
- choose the optimal treatment strategy;
- predict the development of the disease, its duration and outcome;
- calculate the risk of possible complications;
- identify syndromes - sets of symptoms associated with the disease or disorder.
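A minimal sketch of turning binary, ordinal and quantitative features into one numeric table, assuming the pandas library; the patient records are invented:

```python
# Encode mixed-type medical features as numbers, assuming pandas;
# the three patient records are made up.
import pandas as pd

patients = pd.DataFrame({
    "sex":        ["m", "f", "f"],                      # binary feature
    "severity":   ["moderate", "severe", "moderate"],   # ordinal feature
    "hemoglobin": [135, 98, 120],                        # quantitative feature
})

patients["sex"] = (patients["sex"] == "m").astype(int)
patients["severity"] = patients["severity"].map({"moderate": 1, "severe": 2})
print(patients)        # now ready to be fed to any of the models above
```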
No doctor is able to instantly process the entire array of information on each patient, compare it with a large number of other similar case histories and immediately deliver a clear result. Machine learning therefore becomes an indispensable assistant for doctors.
Example 2. Search for mineral deposits
Here the features are data obtained by geological exploration: the presence of particular rocks in the territory (a binary feature) and their physical and chemical properties (which break down into a number of quantitative and qualitative features).
For the training sample, two types of precedents are taken: areas where mineral deposits are definitely present, and areas with similar characteristics where these minerals have not been found. But the extraction of rare minerals has its own specifics: in many cases the number of features significantly exceeds the number of objects, and the methods of traditional statistics are poorly suited to such situations. Therefore, in machine learning the emphasis is on finding patterns in an already collected data set. To do this, small and maximally informative sets of features are determined that are most indicative of the answer to the research question: is a particular mineral present in this area or not. One can draw an analogy with medicine: deposits, too, can reveal their own syndromes. The value of machine learning in this area lies in the fact that the results obtained are not only of practical use but also of great scientific interest to geologists and geophysicists.
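A minimal sketch of selecting a few informative features when there are far more features than objects, assuming scikit-learn with synthetic data:

```python
# Pick the 5 most informative of 100 features for only 30 "survey areas",
# assuming scikit-learn; the data are synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=30, n_features=100, n_informative=5,
                           random_state=0)

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
print(selector.get_support(indices=True))   # indices of the chosen features
```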
Example 3. Evaluation of the reliability and solvency of loan applicants
Every bank involved in issuing loans faces this task daily. The need to automate this process became apparent long ago, back in the 1960s-1970s, when the credit card boom began in the United States and other countries.
The persons requesting a loan from the bank are the objects, but the features will differ depending on whether the applicant is a natural person or a legal entity. The feature description of a private individual applying for a loan is formed from the data of the questionnaire he fills out. The questionnaire is then supplemented with other information about the potential client that the bank obtains through its own channels. Some of these are binary features (sex, presence of a phone number), others are ordinal (education, position), but most are quantitative (loan amount, total debt to other banks, age, number of family members, income, length of employment) or nominal (name, name of the employing company, profession, address).
For machine learning, a sample is compiled that includes borrowers whose credit history is known. All borrowers are divided into classes; in the simplest case there are two of them, "good" borrowers and "bad" ones, and a positive decision on granting a loan is made only in favor of the "good" ones.
A more complex machine learning algorithm, called credit scoring, involves assigning conditional points to each borrower for each attribute, and the decision to grant a loan depends on the number of points gained. During machine learning, the credit scoring system first assigns a certain number of points to each attribute and then determines the conditions for issuing the loan (term, interest rate and other parameters that are reflected in the loan agreement). There is also another system learning algorithm, based on precedents.
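A minimal sketch of the points-per-attribute idea: a logistic regression whose coefficients play the role of points for each feature, assuming scikit-learn; the borrower data are invented:

```python
# A toy "scoring" model: coefficients act as points per attribute and the
# fitted model decides on a new applicant; assumes scikit-learn, data made up.
from sklearn.linear_model import LogisticRegression

# Features: [income, total debt, work experience in years]
X = [[40, 5, 10], [25, 20, 2], [60, 2, 15], [30, 25, 1], [55, 8, 8]]
y = [1, 0, 1, 0, 1]                      # 1 = "good" borrower, 0 = "bad"

model = LogisticRegression().fit(X, y)
print(model.coef_)                        # the "points" assigned to each attribute
print(model.predict([[45, 10, 5]]))       # decision for a new applicant
```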
PS In the following articles we will look more closely at the algorithms for creating machine learning models, including the mathematical part and the implementation in Python.