
Correlation, covariance and deviation (part 3)



In the first part, we talked about the essence of the deviation transformation and its application to the matrix of squared distances. In the second, we took a look at the spectra of simple geometric sets.

In this article, we will try to uncover the meaning of the deviation transformation by turning to applied problems of data processing and analysis. We will show how the deviation transformation of a distance matrix is related to the statistical notions of variance, correlation, and covariance.

7. Centering and normalization of one-dimensional coordinates


Let's warm up with something simple and familiar: centering and normalizing data. Suppose we have a series of numbers $x_i$, $i = 1..n$. The centering operation then reduces to finding their average (the centroid of the set):

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad (7.1)$$

and to constructing a new set as the difference between the original numbers and their centroid (average):

$$a_i = x_i - \bar{x} \qquad (7.2)$$
Centering is the first step toward the own coordinate system (OCS) of the original set, since the sum of the centered coordinates is 0. The second step is normalizing the squares of the centered coordinates so that their average is 1. To do this, we need to calculate that average:

$$S = \frac{1}{n} \sum_{i=1}^{n} a_i^2 \qquad (7.3)$$

Now we can construct the OCS of the original set as the pair of the eigenvalue S and the normalized numbers (coordinates):

$$e_i = \frac{a_i}{\sqrt{S}} \qquad (7.4)$$
The squared distances between points of the original set are then the squares of the differences of the eigenvector components, multiplied by the eigenvalue: $d_{ij}^2 = S\,(e_i - e_j)^2$. Note that the eigenvalue S turned out to be equal to the variance of the original set (7.3).

So, for any set of numbers we can determine its own coordinate system, that is, pick out the value of the eigenvalue (also known as the variance) and compute the coordinates of the eigenvector by centering and normalizing the original numbers. Cool.

An exercise for those who like to "feel it with their hands": build the OCS for the set {1, 2, 3, 4}.
Answer.
Eigenvalue (variance): 1.25.
Eigenvector: {-1.342, -0.447, 0.447, 1.342}.
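
For those who prefer to check with code rather than by hand, here is a minimal numpy sketch of the same computation (the variable names are ours, not the article's notation):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

a = x - x.mean()          # centering: subtract the centroid (7.2)
S = np.mean(a ** 2)       # the eigenvalue, i.e. the variance (7.3)
e = a / np.sqrt(S)        # the own (normalized) coordinates (7.4)

print(S)                  # 1.25
print(e)                  # approx. [-1.342 -0.447  0.447  1.342]

# squared distances are recovered as S * (e_i - e_j)^2
d2 = S * (e[:, None] - e[None, :]) ** 2
print(d2[0, 3])           # (1 - 4)^2 = 9.0
```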

8. Centering and orthonormalization of multidimensional coordinates


What if, instead of a set of numbers, we are given a set of vectors: pairs, triples, and tuples of other dimensions? That is, a point (node) is defined not by one coordinate but by several. How do we construct the OCS in this case?

Yes, we could build the matrix of squared distances, then determine the deviation matrix from it and calculate its spectrum. But we learned about that not so long ago. Usually it was (and is) done differently.

Let us introduce notation for the component set. We are given $n$ points (nodes, variables, vectors, tuples), and each point is characterized by $m$ numerical components $x_{ij}$. Note that the second index $j$ is the number of the component (a column of the matrix), and the first index $i$ is the number of the point (node) of the set (a row of the matrix).

What do we do next? That's right, we center the components. That is, for each column (component) we find the centroid (average) and subtract it from the values of that component:

$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad a_{ij} = x_{ij} - \bar{x}_j$$
We have obtained the matrix of centered data (MCD) $A = (a_{ij})$.
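
Here is a small numpy sketch of this step; the data matrix X below is made up purely for illustration:

```python
import numpy as np

# n = 5 points (rows), m = 2 components (columns); the numbers are arbitrary
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0],
              [5.0, 4.0]])

A = X - X.mean(axis=0)    # subtract the column centroid from each component
print(A.sum(axis=0))      # every column of the MCD now sums to 0
```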
As the next step, it would seem that we should calculate the variance of each component and normalize it. But we will not do that. Although this way we would indeed obtain normalized vectors, we also need to make these vectors independent, that is, orthonormal. Normalization does not rotate the vectors (it only changes their length), whereas we need to turn the vectors perpendicular to each other. How do we do that?

The correct (but so far useless) answer is to calculate the eigenvectors and eigenvalues (the spectrum). Useless, because we have not yet built a matrix whose spectrum could be computed. Our matrix of centered data (MCD) is not square, so eigenvalues cannot be calculated for it. Accordingly, we need to build a square matrix on the basis of the MCD. This can be done by multiplying the MCD by itself (squaring it).

But here, attention! A non-square matrix can be squared in two ways: by multiplying the original by its transpose, or, vice versa, by multiplying the transpose by the original. The dimensions and the meaning of the two resulting matrices are different.

Multiplying the MCD by its transpose, we get the correlation matrix:

$$C = A\,A^{T}, \qquad c_{ij} = \sum_{k=1}^{m} a_{ik}\, a_{jk} \qquad (8.1)$$
From this definition (there are others) it follows that the elements of the correlation matrix are scalar products of centered vectors. Accordingly, the elements of the main diagonal reflect the squared lengths of these vectors.
The values of the matrix here are not normalized (usually they are, but for our purposes this is not necessary). The dimension of the correlation matrix equals the number of initial points (vectors).
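
Continuing the sketch above (with the centered matrix A), the correlation matrix in the sense of (8.1) is a single matrix product; C here is just our name for it:

```python
C = A @ A.T               # (8.1): n x n matrix of scalar products of the centered rows
print(C.shape)            # (5, 5) — one row/column per point
print(np.diag(C))         # the squared lengths of the centered vectors
```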

Now let us swap the order of the matrices multiplied in (8.1) and obtain the covariance matrix (again omitting the factor 1/(n-1) by which covariance values are usually normalized):

$$Cov = A^{T} A, \qquad cov_{jk} = \sum_{i=1}^{n} a_{ij}\, a_{ik} \qquad (8.2)$$
Here it is the components that are multiplied (not the vectors). Accordingly, the dimension of the covariance matrix equals the number of initial components. For pairs of numbers the covariance matrix is 2x2, for triples it is 3x3, and so on.

Why is the dimension of the correlation and covariance matrices important? The point is that, since both matrices are built from products of the same centered data matrix, they have the same set of nonzero eigenvalues and the same rank (the number of independent dimensions). As a rule, the number of vectors (points) far exceeds the number of components, so the rank of both matrices is judged by the dimension of the covariance matrix.
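
This is easy to check numerically, continuing the same sketch (Cov is our name for the matrix from (8.2)):

```python
Cov = A.T @ A             # (8.2): m x m covariance matrix (the 1/(n-1) factor is omitted)
print(Cov.shape)          # (2, 2) — one row/column per component

# the nonzero eigenvalues of A A^T and A^T A coincide
print(np.sort(np.linalg.eigvalsh(C))[::-1][:2])   # two nonzero eigenvalues of the 5x5 matrix...
print(np.sort(np.linalg.eigvalsh(Cov))[::-1])     # ...equal the eigenvalues of the 2x2 matrix
print(np.linalg.matrix_rank(C), np.linalg.matrix_rank(Cov))  # both ranks equal 2
```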

The diagonal elements of the covariance matrix reflect the variances of the components. As we saw above, variance and eigenvalues are closely related. Therefore, to a first approximation, the eigenvalues of the covariance matrix (and hence of the correlation matrix) are equal to its diagonal elements (and if there is no inter-component covariance, they are equal exactly).

If the task is simply to find the spectrum (the eigenvalues), it is more convenient to solve it for the covariance matrix, since its dimension is usually small. But if we also need the eigenvectors (to define the own coordinate system of the original set), then we have to work with the correlation matrix, since it is the one that reflects the products of the vectors. Perhaps the optimal algorithm is a combination of diagonalizing both matrices: first find the eigenvalues of the covariance matrix, and then use them to determine the eigenvectors of the correlation matrix.

Well, since we have come this far, let us mention that the well-known principal component method consists in calculating the spectrum of the covariance/correlation matrix of a given set of vector data. The spectral components found lie along the principal axes of the data ellipsoid. In our framing this follows because the principal axes are exactly those axes along which the dispersion (scatter) of the data, and therefore the value in the spectrum, is maximal.
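
To illustrate the idea on the same toy data (a sketch of ours, not the article's code), the principal components can be read off the covariance spectrum:

```python
# spectrum of the covariance matrix from the sketch above
eigvals, eigvecs = np.linalg.eigh(Cov)         # ascending eigenvalues of a symmetric matrix
order = np.argsort(eigvals)[::-1]              # sort the axes by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = A @ eigvecs                           # coordinates of the points along the principal axes
print(eigvals)                                 # [16. 4.] — scatter along the principal axes
print((scores ** 2).sum(axis=0))               # the same values: the scatter is maximal along the first axis
```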

True, there may be negative variances, and then the analogy with an ellipsoid (a pseudo-ellipsoid?) is no longer obvious.

9. The deviation matrix of distances is the correlation matrix of the vectors


This is all fine, but where is the deviation transformation here?

Consider the situation where we know not the set of numbers (vectors) characterizing certain points (nodes), but the set of distances between the points (between all of them). Is this information sufficient to determine the OCS (own coordinate system) of the set?

We gave the answer in the first part: yes, it is entirely sufficient. Here we will show that the deviation matrix of squared distances constructed by formula (1.3') and the correlation matrix of centered vectors defined above (8.1) are one and the same matrix.

How did that happen? We are shocked ourselves. To verify it, we substitute the expression for an element of the matrix of squared distances

$$d_{ij}^2 = \sum_{k=1}^{m} (a_{ik} - a_{jk})^2 = |\mathbf{a}_i|^2 + |\mathbf{a}_j|^2 - 2\,\mathbf{a}_i \cdot \mathbf{a}_j \qquad (9.1)$$

into the deviation transformation formula (1.3'):

$$c_{ij} = -\tfrac{1}{2}\left( d_{ij}^2 - \overline{d_{i\cdot}^2} - \overline{d_{\cdot j}^2} + \overline{d_{\cdot\cdot}^2} \right) \qquad (9.2)$$

where the bars denote the row average, the column average, and the overall average of the matrix of squared distances.
Note that the overall average of the matrix of squared distances reflects the variance of the original set (provided the squared distances are sums of squared component differences, i.e. the distances are Euclidean):

$$\overline{d_{\cdot\cdot}^2} = \frac{1}{n^2}\sum_{i,j} d_{ij}^2 = \frac{2}{n}\sum_{i=1}^{n} |\mathbf{a}_i|^2 \qquad (9.3)$$

and the row (and column) averages are $\overline{d_{i\cdot}^2} = |\mathbf{a}_i|^2 + \frac{1}{n}\sum_{k} |\mathbf{a}_k|^2$.
Substituting (9.1) and (9.3) into (9.2), after simple cancellations we arrive at the expression for the correlation matrix (8.1):

$$c_{ij} = \mathbf{a}_i \cdot \mathbf{a}_j = \sum_{k=1}^{m} a_{ik}\, a_{jk} \qquad (9.4)$$
So, we have seen that applying the deviation operation to a matrix of Euclidean distances gives us the familiar correlation matrix. The rank of the correlation matrix coincides with the rank of the covariance matrix (the number of components of the Euclidean space). This circumstance allows us to build the spectrum and the own coordinate system of the original points from the distance matrix alone.
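
As a final numeric sanity check, here is a self-contained sketch (the random data and the names D2, J, G are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))            # made-up data: 6 points with 3 components
A = X - X.mean(axis=0)                 # matrix of centered data

# matrix of squared Euclidean distances between the points
D2 = np.square(A[:, None, :] - A[None, :, :]).sum(axis=-1)

# deviation transformation (double centering), elementwise the same as (9.2)
n = len(A)
J = np.eye(n) - np.ones((n, n)) / n
G = -0.5 * J @ D2 @ J

print(np.allclose(G, A @ A.T))         # True: the deviation of the distance matrix is the correlation matrix (8.1)
```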

For an arbitrary distance matrix (not necessarily Euclidean), the potential rank (number of dimensions) is one less than the number of source vectors. Computing the spectrum (the own coordinate system) allows us to determine the principal components that influence the distances between the points (vectors).

The matrix of distances between cities, for example, is obviously not Euclidean in this sense: no components (characteristics of the cities) are specified at all. The deviation transformation nevertheless allows us to determine the spectrum of such a matrix and the own coordinates of the cities.

But not in this article. That's all for now, thanks for your time.

Source: https://habr.com/ru/post/263907/

