
Correlation Analysis or Why Strange Correlations Exist

This piece was prompted by the publication “Money, goods, and some statistics. Part Two”, in which the author investigated the relationship between the prices of various goods. What bothered me somewhat was that, despite the masterful handling of MATLAB, the author never mentioned the significance level of the correlations obtained. A relationship between two quantities may well exist, but if it is not statistically significant, we can discuss it only as conjecture and speculation.

For a long time I could not get my hands on the data, but then I had a free hour and, armed with R, I set off.

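    # Read the semicolon-separated price table
    d <- read.csv("data.csv", sep = ";")

    # Give the columns readable names
    names(d) <- c("time", "oil", "gold", "iron", "logs", "maize", "beef",
                  "chicken", "gas", "liquid_gas", "tea", "tobacco", "wheat",
                  "sugar", "soy", "silver", "rice", "platinum", "cotton",
                  "copper", "coffee", "coal", "aluminum")

    # Normalize each price series by its geometric mean.
    # R has no built-in geometric mean, so define one:
    gm_mean <- function(x, na.rm = TRUE) {
      exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x))
    }
    d.gm <- apply(d[, 2:23], 2, gm_mean)

    # Divide every column by its own geometric mean. sweep() is used because
    # a plain d[, 2:23] / d.gm would recycle d.gm down the rows instead.
    dt <- sweep(d[, 2:23], 2, d.gm, "/")

    # Shapiro-Wilk normality test for each normalized series
    apply(dt, 2, shapiro.test)

    # Spearman correlation matrix
    cor.m <- cor(dt, method = "spearman")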


An important point: the distribution of normalized prices differed from normal for every commodity (the p-value of the Shapiro-Wilk test is well below 0.001), which means that the good old Pearson correlation coefficient cannot be used to look for relationships. Fortunately, it has a non-parametric counterpart: Spearman's rank correlation coefficient.
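To see why the rank-based coefficient is the safer choice here, consider a small sketch on synthetic data. The lognormal series `x` and `y` below are made up for illustration and have nothing to do with the commodity prices. The Shapiro-Wilk test should reject normality for such skewed data, and Spearman's rho should stay the same under a monotone transform such as a logarithm, while Pearson's r generally does not:

```r
set.seed(42)                         # fixed seed so the sketch is reproducible
x <- rlnorm(200)                     # synthetic, right-skewed "prices" (assumption)
y <- x * rlnorm(200, sdlog = 0.5)    # a second series loosely tied to the first

# Shapiro-Wilk: a small p-value rejects the hypothesis of normality
shapiro.test(x)$p.value

# Pearson's r changes when both series are log-transformed
cor(x, y, method = "pearson")
cor(log(x), log(y), method = "pearson")

# Spearman's rho works on ranks, so a monotone transform leaves it unchanged
cor(x, y, method = "spearman")
cor(log(x), log(y), method = "spearman")
```

The same property is why dividing a series by a positive constant, such as its geometric mean, does not affect Spearman's rho for that series.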
So, the correlation matrix is obtained. Take a look at it:

[Figure: Spearman correlation matrix of the normalized prices]

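Okay, there are correlations, although the rho values are now smaller. Let's pull out a set of pairs (the code below takes the entries just below the main diagonal, that is, each commodity paired with its neighbor in column order) and check their significance: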

    # Collect the pairs just below the main diagonal of the matrix
    out <- data.frame(
      X1 = rownames(cor.m)[-1],
      X2 = head(colnames(cor.m), -1),
      Value = cor.m[row(cor.m) == col(cor.m) + 1]
    )

    # Print the p-value of the Spearman test for every pair
    for (i in seq_len(nrow(out))) {
      print(cor.test(dt[[as.character(out$X1[i])]],
                     dt[[as.character(out$X2[i])]],
                     method = "spearman")$p.value)
    }


To save space, I will just say that for all of these correlations the p-value was less than 0.0001, which indicates a statistically significant effect. The pairs and their rho values are listed below:

    X1          X2          Value
 1  gold        oil         0.2451402
 2  iron        gold        0.2503873
 3  logs        iron        0.2446200
 4  maize       logs        0.2547667
 5  beef        maize       0.2398418
 6  chicken     beef        0.2385301
 7  gas         chicken     0.2481030
 8  liquid_gas  gas         0.2544752
 9  tea         liquid_gas  0.2367907
10  tobacco     tea         0.2416664
11  wheat       tobacco     0.2553935
12  sugar       wheat       0.2505641
13  soy         sugar       0.2440920
14  silver      soy         0.2589974
15  rice        silver      0.2403048
16  platinum    rice        0.2418105
17  cotton      platinum    0.2343923
18  copper      cotton      0.2498545
19  coffee      copper      0.2321891
20  coal        coffee      0.2482226
21  aluminum    coal        0.2423581
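
A tiny p-value by itself says nothing about the strength of a relationship, because it also depends on the number of observations. As a rough illustration, the hypothetical helper `p_from_rho` below applies the asymptotic t approximation (n - 2 degrees of freedom) to a fixed rho of 0.25 at several made-up sample sizes. The actual number of rows in data.csv is not used here:

```r
# Hypothetical helper: two-sided p-value for a correlation r computed from
# n pairs, using the asymptotic t approximation with n - 2 degrees of freedom.
p_from_rho <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

# Same rho, different (made-up) sample sizes
for (n in c(20, 50, 200, 1000)) {
  cat(sprintf("n = %4d  rho = 0.25  p = %.2g\n", n, p_from_rho(0.25, n)))
}
```

With a few dozen observations a rho of this size would not pass the usual 0.05 threshold, while with around a thousand it passes easily, even though the relationship is equally weak in both cases.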


As you can see, the rho values obtained do not exceed 0.3, which indicates a weak relationship (on the Chaddock scale). Such data can still be used, but you should always keep in mind that fluctuations in the price of one commodity account for no more than about 10% of the variation in the price of its correlation “partner” (rho squared is below 0.09).
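
The 10% figure comes from squaring the coefficient. A quick check, using the smallest and largest rho from the table together with the 0.3 bound (for a rank correlation the squared value describes shared variation in ranks, so treat it as a rough guide only):

```r
# Smallest and largest rho from the table above, plus the 0.3 upper bound
rho <- c(min = 0.2321891, max = 0.2589974, bound = 0.3)

# Squared coefficient, expressed as a percentage
round(100 * rho^2, 1)
```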

I would like to note that the same line of reasoning should be applied when analyzing other strange correlations. Numbers can play cruel tricks on us.

Thanks to jatx for giving me a reason to play with numbers!

Source: https://habr.com/ru/post/241967/

