
Introduction to R-project

Only a couple of articles on this topic could be found on Habr, and yet the subject is fertile. Besides, last Wednesday the course "Introduction to Computational Finance and Financial Econometrics" came to an end. This post grew out of its fifth week, "Descriptive statistics". Course participants will find nothing new here, but those who want to get acquainted with the basic techniques of data analysis in R are welcome under the cut.

Preliminary agreements


About terms

The author's entire background in statistics is one semester of probability theory N years ago. Therefore, questionable translations of terms are followed by the original English term (in italics, in brackets). Experts, please send better terminology variants via private message. Thanks.

About installation

Software installation is deliberately not covered, given how trivial it is. On Windows, at least, it all comes down to the standard "Next -> Next -> ... -> Finish". The only package required for this article, PerformanceAnalytics, is installed via the "Packages / Install Package(s)..." menu: select the mirror nearest to you, then pick the desired package from the list.
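Alternatively, the same package can be installed straight from the R console on any platform (a sketch; the URL is one of the official CRAN mirrors, and without the repos argument R will simply prompt for a mirror):

```r
# install the only third-party package this article needs
install.packages("PerformanceAnalytics", repos = "https://cloud.r-project.org")
library(PerformanceAnalytics)  # load it to verify the installation succeeded
```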

Data set


I wanted to avoid the typical examples: sales, apartments, simple returns; how much more of those can one take? So the subject area of our sample is a theme that is eternal both in the context of Habr and outside it. Not long ago a survey "What is your breast size?" was published on the SamiKy blog. Given that it included two answer choices meant to filter out the irrelevant audience, there is some confidence in the plausibility of the sample. For convenience, the results are reproduced here:
The results of the survey at the topic

Purpose


As part of our mini-study, we compare two data sets against the normal distribution:

To experienced statisticians it is obvious that the changes in the second variant move the distribution away from normal. By the end of the article we will have enough information to justify this formally.

Research progress


First, let's put our data sets into variables:
data = c(rep(0, 184), rep(1, 510), rep(2, 996), rep(3, 763), rep(4, 327), rep(5, 147), rep(6, 60))
data_ol = c(data, rep(0, 51), rep(7, 65))
x.txt = " "  # x-axis label

The function c "glues" its arguments together into a single vector; rep(x, y) returns a vector of y copies of the value x. For example, rep(0, 184) returns a vector of one hundred eighty-four zeros. Google's style recommendations and several other sources hold that the equals sign should not be used for assignment, and that "<-" is preferable. Knowledgeable people, please give a sufficiently strong justification in the comments for writing two characters instead of one. To the author personally, this alternative recalls the inconvenience of the ":=" operator from Pascal.
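A quick sanity check in the console (nothing here beyond base R, with `data` defined as above) confirms how c and rep behave and that the vector reproduces the survey counts:

```r
data = c(rep(0, 184), rep(1, 510), rep(2, 996), rep(3, 763),
         rep(4, 327), rep(5, 147), rep(6, 60))

length(rep(0, 184))  # 184: rep builds a vector of repeated values
length(data)         # 2987: c() glued all seven pieces together
table(data)          # frequencies 184 510 996 763 327 147 60, as in the survey
```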

Now you can build histograms:
par(mfrow=c(1, 2))
hist(data, breaks=0:7, right=F, col="seagreen", main=" 1", xlab=x.txt, ylab=" ")
hist(data_ol, breaks=0:8, right=F, col="slateblue1", main=" 2", xlab=x.txt, ylab=" ")

The first line is needed to display the histograms side by side. Without it, the second histogram will overwrite the first. Here is what happened:
Sample Histograms
Looks like the survey results, doesn't it? True, a peculiarity of our study is that the data were generated from that histogram in the first place. Still, this step is not meaningless, because:
  1. LJ uses a non-linear scale (most likely because of the number of votes for the first answer);
  2. both histograms are drawn on the same scale and oriented vertically, which allows a comparison with the probability density function of the normal distribution.

The next step makes little sense for a dataset as discretized as ours. It is shown here only to introduce the density function, which builds a "smoothed" (read: averaged) histogram from the sample.
plot(density(data), type="l", col="seagreen", lwd=2, main="  1")
plot(density(data_ol), type="l", col="slateblue1", lwd=2, main="  2")

Result:
Smooth probability density

Calculate the sample parameters of the distributions.
mu = mean(data)
mu
mu_ol = mean(data_ol)
mu_ol
var(data)
var(data_ol)
sig = sd(data)
sig
sig_ol = sd(data_ol)
sig_ol
library(PerformanceAnalytics)
skewness(data)
skewness(data_ol)
kurtosis(data)  # excess kurtosis (kurtosis - 3)
kurtosis(data_ol)

Results:
ND | Mean | Variance | Std. deviation | Skewness | Excess kurtosis
1 | 2.408437 | 1.708542 | 1.307112 | 0.4124443 | 0.1001578
2 | 2.465034 | 2.17858 | 1.476001 | 0.7198767 | 0.7943986

As can be seen from the table, the changes in the second data set increased the variance, the skewness, and the excess kurtosis (for a normal distribution the latter two are zero).
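For reference, the two higher moments can also be computed without PerformanceAnalytics. A minimal sketch of the textbook moment-based estimators (PerformanceAnalytics supports several estimator variants via its method argument, so its output may differ slightly):

```r
# moment-based skewness: third central moment over (second central moment)^(3/2)
skew = function(x) {
  m = mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# moment-based excess kurtosis: fourth central moment over (second)^2, minus 3
ex_kurt = function(x) {
  m = mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

skew(c(1, 2, 3))     # 0: a symmetric sample has zero skewness
ex_kurt(c(1, 2, 3))  # -1.5 for this flat three-point sample
```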

Let us compare the empirical cumulative distribution functions (ECDF) with the cumulative distribution functions of the corresponding normal distributions, N(2.408437, 1.307112^2) and N(2.465034, 1.476001^2).
n1 = length(data)
plot(sort(data), (1:n1)/n1, type="S", col="seagreen", main=" 1", xlab=x.txt, ylab="")
x = seq(0, 6, by=0.25)
lines(x, pnorm(x, mean=mu, sd=sig), type="l", col="orange", lwd=2)
n2 = length(data_ol)
plot(sort(data_ol), (1:n2)/n2, type="S", col="slateblue1", main=" 2", xlab=x.txt, ylab="")
x2 = seq(0, 7, by=0.25)
lines(x2, pnorm(x2, mean=mu_ol, sd=sig_ol), type="l", col="orange", lwd=2)

Output:
Empirical distribution functions
From the distribution functions we move on to quantiles ( quantile ), which are the inverse of the distribution functions.
quantile(data)
quantile(data_ol)
qnorm(p=c(0, .25, .5, .75, 1), mean=mu, sd=sig)
qnorm(p=c(0, .25, .5, .75, 1), mean=mu_ol, sd=sig_ol)

In our particular case this stage is rather dull, because the samples differ only in the hundredth percentile:
Distribution | q_0 | q_.25 | q_.5 | q_.75 | q_1
ND1 | 0 | 2 | 2 | 3 | 6
N(2.408437, 1.307112^2) | -Inf | 1.526803 | 2.408437 | 3.290070 | Inf
ND2 | 0 | 2 | 2 | 3 | 7
N(2.465034, 1.476001^2) | -Inf | 1.469486 | 2.465034 | 3.460582 | Inf

And while the ND1 quartiles resemble those of the normal distribution, at least after rounding, for ND2 even that does not help.

The normal QQ plot is not very useful for samples as discretized as ours. It is mentioned here in order to showcase the qqnorm function.
qqnorm((data-mu)/sig, col="seagreen")
abline(0, 1, col="orange", lwd=2)
qqnorm((data_ol-mu_ol)/sig_ol, col="slateblue1")
abline(0, 1, col="orange", lwd=2)

The result is not exciting, but cheerful:
Quantile scheme

And the list of visualizations is completed by the box-and-whisker plot ( boxplot ).
boxplot(data, outchar=T, main="   1", ylab=x.txt)
boxplot(data_ol, outchar=T, main="   2", ylab=x.txt)

Graphics:
Box plots
The plot clearly shows the robust characteristics of the sample (those resistant to the presence of outliers ):

In this case the whisker length is computed approximately as the distance of 1.5 interquartile ranges from the first/third quartile. For details, see ?boxplot.
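The fences can also be computed by hand (a sketch using only base R; boxplot.stats is the helper that R itself uses to position the whiskers and flag outliers):

```r
x = c(rep(0, 184), rep(1, 510), rep(2, 996), rep(3, 763),
      rep(4, 327), rep(5, 147), rep(6, 60))

q = quantile(x, c(0.25, 0.75))    # first and third quartiles: 2 and 3
fences = c(q[1] - 1.5 * IQR(x),   # lower fence: 2 - 1.5*1 = 0.5
           q[2] + 1.5 * IQR(x))   # upper fence: 3 + 1.5*1 = 4.5
unique(boxplot.stats(x)$out)      # values outside the fences: 0 5 6
```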

Conclusion


ND1 deviates from the normal distribution less than ND2 does, in view of:


Additional Information


Alternative introductory materials on R (in English):

The second and third links are part of the official documentation. If you know of links to good introductory articles in Russian, write in and I will add them.

The main purpose of this article is to draw public attention to R as an analysis tool. If someone more knowledgeable publishes more in-depth material, I will be sincerely glad to get acquainted with it.

Corrections, please, via private message. Everything else is welcome in the comments.
Thank you all for your attention.

Source: https://habr.com/ru/post/160373/

