
Statistical Computing Environment R: Experience in Teaching

I want to talk about using the free statistical analysis environment R. I consider it an alternative to statistical packages such as SPSS Statistics. To my deep regret, it is almost unknown in our country, and undeservedly so. I believe that the ability to write additional statistical analysis procedures in the R language (a dialect of S) makes the R system a useful tool for data analysis.

In the spring semester of 2010, I had the opportunity to give lectures and conduct practical classes on the course “Statistical Data Analysis” for students of the department of intelligent systems at RSUH.

My students had previously taken a one-semester course in probability theory, covering the basics of discrete probability spaces, conditional probability, Bayes' theorem, the law of large numbers, some facts about the normal distribution, and the central limit theorem.

About five years earlier I had taught the (then still combined) one-semester course on the Fundamentals of Probability Theory and Mathematical Statistics, so I expanded the statistics part of my notes, which are handed out to students before each lesson. Now that RSUH has a student server at isdwiki.rsuh.ru, I also upload them there via FTP.

The question arose of which program to use for practical classes in the computer lab. The widely used Microsoft Excel was rejected both because it is proprietary and because some of its statistical procedures are implemented incorrectly. You can read about this, for example, in the book by A. Makarov and Yu. N. Tyurin, “Statistical analysis of data on a computer”. The Calc spreadsheet from the free OpenOffice.org office suite was localized into Russian in such a way that I could hardly find the functions I needed (their names were also awful).

The most commonly used package is SPSS Statistics. SPSS has since been acquired by IBM. Among the advantages of IBM SPSS Statistics I would highlight:

In my opinion, the disadvantages of IBM SPSS Statistics are:

As an alternative, I chose the R system. Its development was started by Robert Gentleman and Ross Ihaka at the Department of Statistics of the University of Auckland in 1995. The first letters of the authors' names gave the system its name. Later, leading statisticians joined in developing and extending it.

I consider the following to be the advantages of the system under discussion:

For the first lesson, I prepared CDs containing the installation files, documentation, and manuals. I will say more about the manuals. CRAN has detailed guides on installation, the R language (a dialect of S), writing additional statistical procedures, and exporting and importing data. The Contributed Documentation section contains a large number of publications by statistics teachers who use this package in their courses. Unfortunately, there is nothing in Russian, although there is material even in Polish, for example. Among the English-language books, I would like to mention “Using R for Introductory Statistics” by Professor John Verzani of the City University of New York and “Introduction to the R Project for Statistical Computing” by Professor Rossiter (the Netherlands) of the International Institute for Geo-Information Science and Earth Observation.

The first lesson was devoted to installing the package, learning to use it, and getting familiar with the syntax of the R language. As a test problem, we computed integrals by the Monte Carlo method. Here is an example that estimates the probability that a random variable with an exponential distribution with parameter 3 takes a value less than 0.5 (10,000 is the number of trials).
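> x = runif(10000, 0, 0.5)      # x-coordinates, uniform on [0, 0.5]
> y = runif(10000, 0, 3)        # y-coordinates, uniform on [0, 3]
> t = y < 3*exp(-3*x)           # TRUE for points under the density curve
> u = x[t]                      # keep only the points under the curve
> v = y[t]
> plot(u, v)                    # draw the selected points
> i = 0.5*3*length(u)/10000     # rectangle area times the fraction selected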

[Figure: scatter plot of the selected points (u, v)]

The first two lines generate points uniformly distributed in the rectangle [0, 0.5] x [0, 3]. The next lines select the points that fall under the graph of the exponential density 3 * exp(-3 * x), and the plot function displays these points in the graphics window. Finally, the desired integral is estimated as the area of the rectangle (0.5 * 3) multiplied by the fraction of selected points.
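The same calculation can be wrapped in a function and compared with the exact value, which for the exponential distribution is 1 - exp(-1.5). This is only a sketch: the function name mc_exp_prob and the seed are my own choices for illustration and do not come from the course notes.

```r
# Hit-or-miss Monte Carlo estimate of P(X < b) for X ~ Exp(rate).
# mc_exp_prob is a hypothetical helper name, not part of R itself.
mc_exp_prob <- function(n, rate = 3, b = 0.5) {
  x <- runif(n, 0, b)          # uniform points in the rectangle [0, b] x [0, rate]
  y <- runif(n, 0, rate)
  hits <- y < dexp(x, rate)    # points under the density rate * exp(-rate * x)
  b * rate * mean(hits)        # rectangle area times the fraction of hits
}

set.seed(1)                    # arbitrary seed, only for reproducibility
estimate <- mc_exp_prob(10000)
exact <- pexp(0.5, rate = 3)   # exact value, equal to 1 - exp(-1.5)
print(c(estimate = estimate, exact = exact, error = estimate - exact))
```

With 10,000 trials the estimate should agree with the exact value to about two decimal places, and increasing n shrinks the error roughly as 1 / sqrt(n).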
The second lesson was devoted to calculating descriptive statistics (quantiles, median, mean, variance, correlation, and covariance) and to plotting (histograms, box-and-whisker plots).
In subsequent sessions, the “Rcmdr” package was used. It is a graphical user interface (GUI) for R, developed by Professor John Fox of McMaster University in Canada.
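To give an idea of what such a lesson looks like at the console, here is a small sketch using base R functions. The data are simulated and the variable names height and weight are invented for illustration; the actual classroom data sets are not reproduced here.

```r
set.seed(2)                                  # arbitrary seed, for reproducibility
height <- rnorm(100, mean = 170, sd = 10)    # simulated sample, purely illustrative
weight <- 0.5 * height + rnorm(100, sd = 5)  # second sample correlated with the first

print(quantile(height, c(0.25, 0.5, 0.75)))  # sample quartiles
print(median(height))
print(mean(height))
print(var(height))                           # sample variance (divisor n - 1)
print(cov(height, weight))
print(cor(height, weight))

hist(height)                                 # histogram
boxplot(height, weight, names = c("height", "weight"))  # box-and-whisker plots
```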
[Figure: the Rcmdr window]

This package is installed by running the command install.packages("Rcmdr", dependencies = TRUE) in the R environment. While the environment itself is an R language interpreter, the “Rcmdr” add-on is an additional window with a menu system containing a large number of commands that correspond to standard statistical procedures. This is especially convenient for courses where the main goal is to teach students to press buttons (to my regret, there are more and more such courses now).

The seminar notes from my previous course were also expanded. They too are available via FTP from isdwiki.rsuh.ru. These notes contained tables of critical values that had been used for calculations at the blackboard. This year, students were asked to solve these problems on a computer and also to check the tables using the (normal) approximations given in the notes.
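A check of this kind might look as follows. The notes themselves are not reproduced here, so the choice of the Wilcoxon signed-rank statistic, the sample size n = 20, and the level 0.05 are my assumptions for illustration. The exact quantile comes from the base R function qsignrank, and the approximation uses the mean n(n + 1)/4 and variance n(n + 1)(2n + 1)/24 of the statistic under the null hypothesis.

```r
n <- 20          # sample size (illustrative assumption)
alpha <- 0.05    # two-sided significance level (illustrative assumption)

# Exact lower quantile of the signed-rank statistic
exact_q <- qsignrank(alpha / 2, n)

# Normal approximation with the null mean and standard deviation
mu <- n * (n + 1) / 4
sigma <- sqrt(n * (n + 1) * (2 * n + 1) / 24)
normal_q <- mu + qnorm(alpha / 2) * sigma

print(c(exact = exact_q, normal = normal_q, difference = exact_q - normal_q))
```

Repeating the comparison for several values of n shows how quickly the approximation becomes usable in place of the table.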

I made some blunders. For example, I realized too late that Rcmdr can import data from loaded packages, so relatively large samples were processed only in the regression analysis classes. When I presented nonparametric tests, the students typed the data in by hand from my notes. Another shortcoming, as I now understand, was that I assigned too little homework that required writing fairly complex programs in R.
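The same thing can be done at the R console without the GUI. The sketch below loads a data set shipped with R and fits a simple regression. The cars data set from the standard datasets package is my choice for illustration, not necessarily the one used in class.

```r
data(cars)                            # 'cars' from the standard 'datasets' package
str(cars)                             # two columns: speed and stopping distance (dist)

fit <- lm(dist ~ speed, data = cars)  # simple linear regression of dist on speed
print(summary(fit))                   # coefficients, standard errors, R-squared

plot(cars)                            # scatter plot of the data
abline(fit)                           # add the fitted regression line
```

Calling data(package = "datasets") lists the other data sets available in that package.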

It should be noted that several senior students attended my classes, and some downloaded the lecture and seminar materials. Students of the intelligent systems department at RSUH receive fundamental training in mathematics and programming, so using the R environment (instead of spreadsheets and statistical packages with a fixed set of procedures) seems very useful to me.

If you need to study statistics, and especially to write non-standard procedures for statistical data processing, I recommend turning your attention to R.

Source: https://habr.com/ru/post/92135/

