An update for those who find the article useful and add it to favorites: there is a decent chance the post will be voted into the negative and I will be forced to move it back to drafts. Keep a copy!
A brief and simple piece for non-specialists, showing in visual form various methods of finding regression dependencies. None of this is rigorous or academic, but I hope it is understandable. It grew out of a mini-manual on data processing for students of natural-science specialties who do not know mathematics well but, like the author, want to use it. Calculations are in Matlab, data preparation in Excel; that is just how things are done in our field.

Introduction
Why is this needed at all? In science and around it, the problem very often arises of predicting some unknown parameter of an object based on the known parameters of that object (the predictors) and a large set of similar objects, the so-called training set. An example: we are choosing an apple at the market. It can be described by predictors such as redness, weight, and number of worms. But as consumers we are interested in taste, measured in parrots on a five-point scale. From life experience we know that taste is, with decent accuracy, equal to 5*redness + 2*weight - 7*number_of_worms. It is the search for exactly this kind of dependency that we will talk about. To make the training more fun, let's try to predict a girl's weight from her 90/60/90 measurements and her height.
Initial data
As the object of study I will take data on the figure parameters of Playboy Playmates of the Month. Source:
www.wired.com/special_multimedia/2009/st_infoporn_1702 , slightly cleaned up and converted from inches to centimeters. I recall the joke that 34 inches is like two seventeen-inch monitors. I also separated out records with incomplete information. When working with real objects they can be used, but here they would only get in our way; they can, however, be used to check the adequacy of the results. All our data is continuous, that is, roughly speaking, of type float. It is rounded to integers only to avoid cluttering the screen. There are ways to work with discrete data; in our example that could be, say, skin color or nationality, which takes one of a fixed set of values. But that belongs more to classification and decision-making methods, which deserve a separate manual.
Data.xls: the file has two sheets. The first holds the data itself; the second holds the screened-out incomplete records and a set for testing our model.
Legend
W - real weight
W_p - weight predicted by our model
S - bust
T - waist
B - hips
L - height
E - model error
How to evaluate the quality of the model?
The goal of our exercise is to obtain some model that describes an object. How a particular model is obtained and how it works does not concern us yet. It is simply a function f(S, T, B, L) that returns a girl's weight. How do we tell which function is good and which is not? For this a so-called fitness function is used. The most classic and frequently used one is the sum of the squares of the differences between the predicted and real values. In our case that is sum((W_p - W)^2) over all points. Hence the name "least squares method". The criterion is neither the best nor the only one, but it is quite acceptable as a default. Its peculiarity is that it is sensitive to outliers and so rates such models as lower quality. There are also all sorts of least-absolute-deviations methods and the like, but we do not need them yet.
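As a minimal sketch of this criterion in Matlab (assuming the real weights W and the model predictions W_p are already column vectors in the workspace):
E = (W_p - W).^2;   % squared error for every girl
SSE = sum(E)        % total error of the model: the smaller, the better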
Simple linear regression
The simplest case. We have one predictor variable and one dependent variable; in our case those can be, for example, height and weight. We need to construct the equation W_p = a * L + b, that is, find the coefficients a and b, such that W_p matches W as closely as possible over all samples. For each girl we will then have:
W_p_i = a * L_i + b
E_i = (W_p_i - W_i) ^ 2
The total error is then sum(E_i). For the optimal values of a and b, sum(E_i) will be minimal. How do we find these coefficients?
Matlab
To simplify things, I strongly recommend installing the Excel plugin called Exlink. It lives in the matlab/toolbox/exlink folder and makes it very easy to move data between the programs. After installing the plugin, another menu with an obvious name appears in Excel, and Matlab starts automatically. Transfer of data from Excel to Matlab is initiated by the command "Send data to MATLAB", and back, accordingly, by "Get data from MATLAB". We send the numbers from columns L and W to Matlab separately, without headers; the variables get the same names. The linear regression function is polyfit(x, y, 1). The 1 indicates the degree of the approximating polynomial; ours is linear, hence 1. We finally get the regression coefficients:
regr=polyfit(L,W,1)
. a can then be read out as regr(1), and b as regr(2). That is, we can compute our values of W_p:
W_p=L*regr(1)+regr(2)
. Now let's send them back to Excel.
Graph

Hmm, not great. This is a graph of W_p versus W; the formula on it shows the relationship between W_p and W. Ideally it would be W_p = W * 1 + 0. The discreteness of the initial data shows up as a grid-like cloud of points. The correlation coefficient is no good either: the data are weakly correlated, i.e. our model describes the relationship between weight and height poorly. On the graph this appears as points forming a cloud only weakly stretched along a line. A good model gives a cloud stretched into a narrow band; a worse one, just a chaotic set of points or a round cloud. The model must be supplemented. The correlation coefficient deserves a separate discussion, because it is often used completely wrongly.
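This diagnostic is easy to reproduce yourself; a sketch, assuming W and W_p are in the workspace (scatter and corrcoef are standard Matlab functions):
scatter(W, W_p)              % cloud of predicted versus real weights
xlabel('W, kg'), ylabel('W_p, kg')
R = corrcoef(W, W_p);        % 2x2 matrix; R(1,2) is the correlation coefficient
disp(R(1,2))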
Matrix calculation
It is possible to manage the regression without any polyfits if we augment the column of height values with one more column filled with ones:
L(:,2)=1
. The 2 indicates the index of the column into which the ones are written. The regression coefficients can then be found with the following formula:
regr=inv(L'*L)*L'*W
And back, find W_p:
W_p=L*regr
. Once you feel the magic of matrices, using canned functions becomes much less appealing. The column of ones is needed to compute the free term of the regression, that is, the term that is not multiplied by any parameter. If we do not add it, the regression will have only one term: W_p = a * L, which will obviously be worse in quality than a regression with two terms. In general, the free term should be dropped only if it is definitely not needed; by default it is present.
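To see what the free term buys us, here is a sketch comparing both variants (assuming fresh column vectors L and W in the workspace; Matlab's backslash operator solves the same least-squares problem as inv(X'*X)*X'*W, only more accurately):
X1 = L;                          % no free term: W_p = a*L
X2 = [L ones(size(L))];          % with free term: W_p = a*L + b
b1 = X1 \ W;
b2 = X2 \ W;
sum((X1*b1 - W).^2)              % total error without the free term
sum((X2*b2 - W).^2)              % total error with it: should be smaller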
Multiple linear regression
In the Russian-language literature of past years this is referred to as MMNK, the method of multiple least squares. It is an extension of the least squares method to several predictors: that is, not only height but also all the other, so to speak, horizontal dimensions now come into play. The data preparation is exactly the same: both matrices into Matlab, add a column of ones, compute with the same formula. For lovers of functions there is
b = regress(y,X)
. This function also requires adding the column of ones. We repeat the calculation with the formula from the matrix section, send the result to Excel, and take a look.
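For our data, the whole multiple-regression step looks like this (a sketch, assuming the column vectors S, T, B, L and W were freshly sent over from Excel):
X = [S T B L];                 % all four predictors
X(:,5) = 1;                    % the column of ones, as before
regr = inv(X'*X)*X'*W;         % same normal-equations formula
% regr = regress(W, X);        % equivalent, with the Statistics Toolbox
W_p = X*regr;                  % predicted weights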
Attempt number two

That's better, but still not great. As you can see, the grid pattern remains only along the horizontal axis. There is no way around it: the original weights were whole numbers of pounds, so after conversion to kilograms they land on a grid with a step of about 0.5. The final form of our model:
W_p = 0.2271*S + 0.1851*T + 0.3125*B + 0.3949*L - 72.9132
Girths are in centimeters, weight in kilograms. Since all the values except height are in the same units and of roughly the same order of magnitude (the waist aside), we can estimate their contributions to the total weight. The reasoning goes roughly like this: the coefficient on the waist is the smallest, as are the waist values themselves in centimeters, so the contribution of this parameter to the weight is minimal. For the bust, and especially the hips, it is larger: a centimeter on the waist adds less weight than one on the chest. And the hip girth affects the weight most of all. Any interested man knows this anyway, so at the very least our model does not contradict real life.
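This hand-waving can be checked numerically; a sketch, assuming regr = [aS; aT; aB; aL; b] from the fit above is still in the workspace:
m = [mean(S) mean(T) mean(B) mean(L)];   % typical sizes, cm
contrib = regr(1:4)' .* m                % average contribution of each term, kg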
Model validation
The name is grand, but we will just try to obtain at least approximate weights for those girls for whom there is a full set of sizes but no weight. There are 7 of them: May and June 1956, July 1957, March 1987, August 1988. We find the weights predicted by the model:
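Note that X must be assembled exactly as in the fitting step; a sketch, assuming the test sizes arrived from Excel as column vectors S, T, B, L:
X = [S T B L];         % sizes of the girls with unknown weight
X(:,5) = 1;            % do not forget the column of ones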
W_p=X*regr

Well, at least in text form it looks plausible. How closely it corresponds to reality is for you to decide.
Applicability
In short, the resulting model is suitable for objects similar to our data set. That is, one should not use the obtained correlations to estimate the figure parameters of women weighing 80+ kg, of an age very different from the sample average, and so on. In real applications we can assume the model is suitable if the parameters of the object under study do not differ too much from the averages of the same parameters over the initial data set.

Problems can (and will) arise if the predictors are strongly correlated with each other, for example height and leg length. Then the coefficients of the corresponding terms in the regression equation will be determined with low accuracy. In that case one should throw out one of the parameters, or use the principal component method to reduce the number of predictors.

If we have a small sample and/or many predictors, we risk overfitting the model. For example, if we took 604 parameters for our sample (and there are only 604 girls in the table), we could analytically obtain an equation with 604 + 1 terms that exactly reproduces whatever we threw into it, but its predictive power would be very small.

Finally, not every object can be described by a multiple linear dependence. There are logarithmic, power-law, and all sorts of more complex dependencies; finding those is a separate question.
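The multicollinearity warning is easy to act on; a sketch of a quick check, assuming the predictor matrix X (without the column of ones) is in the workspace:
R = corrcoef(X)    % pairwise correlations of the predictor columns
% off-diagonal entries close to +1 or -1 flag strongly correlated pairs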
Future plans
If this goes well, I will try, in the same style, to describe the principal component method for reducing data dimensionality, regression on principal components, the PLS method, the beginnings of cluster analysis, and methods for classifying objects. If the Habr public does not receive it well, I will try to take the comments into account. If it does not work out at all, I will give up enlightening the broad masses; my own students are enough for me. See you!