
My attempt at an introduction to R, or "I Love R"

I am a scientist [ more on this here ]. "A proletarian of intellectual labor." A physicist by education. I have been working in medical and biological information processing for 30+ years.
I have been working in R for exactly 10 years, having migrated to it after 15 years of close cooperation with Matlab. The main reason for changing platforms was my own physical migration to the opposite edge of the Earth: Auckland, New Zealand. Here, from the very first days, life pushed me into the arms of R, which I have never had cause to regret.

Increasingly, I see flashes of interest in R on the professional Internet. Articles about it appear on this respected resource as well. Below the cut is my first attempt at a Russian-language introduction to R: the first (verbal) part of a presentation I gave to my colleagues at the Faculty of Animal Science, Iowa State University, three years ago.
(Aside: how difficult it turns out to be to translate yourself...)


What is R


First of all, R is a system for statistical and other scientific calculations, using the programming language S.

S is a language written by statisticians for statisticians, as its author John Chambers defines it. Since its introduction, the language has been very well received and has been tested by generations of very picky statistical users. It is safe to say that it is widely known and accepted in the global statistical community. A number of critical epidemiological, ecological, and financial models have been implemented in S and are still in use around the world and in many industries. From my point of view as a “writing user,” S is a very pleasant alternative to the SAS language.
From my own experience: I took my first lessons in S in the early 90s from WHO statistical experts with whom I crossed paths in the research of that time.

By many estimates (and, in my view, without much exaggeration), R is one of the most successful open source projects, distributed freely from dozens of mirrors around the world under the GNU license.
The authors categorically refuse all proposals to commercialize the project, although today there is reason to believe that the number of installed copies of R in the world exceeds the total number of copies of all other statistical analysis systems.

From the very beginning and to this day, the project has inspired my deepest respect (bordering on admiration) for its stability, user support, code compatibility, and so on, all of which I would sum up in the concept of culture.
However, that last point belongs more to the subsections that follow.

Where did S come from and what does this have to do with R


Wikipedia will undoubtedly give you far more detail.
I will note only what I consider important for understanding the place of S and R in this world.

Bell Labs (also known as Bell Laboratories or AT&T Bell Laboratories) is quite famous in the history of science and technology, and of IT and related fields in particular. Statistical research there was always taken very seriously and was just as seriously supported by all available computing tools (read: tons of Fortran and Lisp code).

What later became the S language emerged in the 70s on the initiative and under the guidance of John Chambers, as a set of scripts that made it easier to “feed” data to Fortran code. That is, the priorities were interactive data manipulation, compact code that is pleasant to write and easy to read, and decent output of tables and graphs to various devices.

The syntax of the language allows the construction of almost arbitrarily complex data structures and provides means for describing specific statistical tasks and objects: statistical tests, models, and so on.

From 1984 on, the language had a name and its own “Bible” (the book by Becker and Chambers, S: An Interactive Environment for Data Analysis and Graphics, was published). By default it contained an almost complete “gentleman's set” of statistics and probability tools: distributions, random number generators, statistical tests, many standard statistical analyses, matrix operations, and so on, not to mention an advanced system of scientific graphics. Most importantly, it became available to users around the world at a very reasonable price.

In 1988 (when another book, The New S Language, was published) it was reworked using OOP: everything became an object with very reasonable default values, open to modification, with elements of self-documentation, and so on.

At the same time, the laboratory published the source code, and the Bell Labs S became free for students and for scientific use. It was all somehow related to AT&T's "dispossession of the kulaks", but those details did not interest me much.

There were, and probably still are, commercial implementations of the S language. I came across S-Plus and S2000. They were supported at different times by different companies, which mostly lived (still live?) off the support of previously created S applications. In these post-Bell versions of S a new version of the OOP engine appeared, but for a pure user this was almost painless in terms of compatibility with historical code.

R is the only non-commercial, fully independent (of the original Bell Labs one) implementation of the S language.

And by an agreement that is rare in our time, and in a way I find hard to imagine, the developers of the current versions of commercial S and non-commercial R maintain almost complete compatibility and continuity between them.

And now R



Behind any significant phenomenon in this life stands some charismatic personality. Perhaps that is even the very definition of a phenomenon's significance.

In the case of R, there are three such people.
I have already mentioned John Chambers.

Ross Ihaka, a student and later a researcher at the Department of Statistics of Auckland University, chose as the topic of his dissertation (which he carried out at MIT, USA) the possibility of building a virtual machine (VM) for statistical programming languages. Lisp (Common Lisp, CL) was chosen as the intermediate language, and a VM prototype that “understands” small subsets of SAS and S was implemented in it.
While finishing his thesis, Ross returned to Auckland, where he soon met Robert Gentleman and got carried away with the R project.
Ross never defended his thesis, but he already holds degrees from several universities “for the aggregate of his merits.” Last year he was awarded the title of Associate Professor at his home university.

Robert Gentleman, another statistician with a passion for programming, originally from Canada, was on an internship at Auckland University (he was working in Australia at the time) when he suggested to Ross that they “write some language.”
According to the legend, which I heard from these “founding fathers” themselves, in about a month, in a rush of insane enthusiasm, they rewrote almost all S commands in CL, including a powerful linear modeling library.

Following the traditions of the prototype, the well-known, universally recognized, and free BLAS library was chosen as R's computational engine (with the option of using ATLAS and others that share the same interface).
Paul Murrell, one of Ross's closest friends and also a member of staff at Auckland University, wrote (in C, it seems) a graphics engine from scratch that fully reproduces the functionality of the one in S.

The result was a free, full-featured package that instantly won a place in teaching at Auckland University. It fully matched the descriptions in Chambers' very detailed and high-quality books, which were traditionally published in paperback with mediocre print quality but were cheap and accessible.
Several activist groups within the GNU movement (for example, GIS) adopted R as a platform for scientific computing.

But R gained truly wide popularity in bioinformatics when one of the “fathers,” Robert Gentleman, who was then involved in work with Affymetrix, duplicated all the functionality of the company's commercial software and launched (not alone, of course) the open source project Bioconductor. Today Bioconductor is the undisputed leader of open source bioinformatics for all the “-omics” (genomics, proteomics, metabolomics, etc.).

A single interface language for this riot of bioinformatic fantasies was, of course, R.

The circle closed when the retired Chambers, the creator of the S language, joined the R core development group as a full member.

Why I love it (list)


  1. Interactivity. “Data programming” is my favorite style of work.
  2. An elegant language (for those who like that sort of thing): I like lists, data frames, functional programming, and lambda-like functions. Freedom of expression: the same problem can be solved in ten ways (which softens the feeling of routine).
  3. It “looks at this world soberly”: it rarely crashes or hangs, and it offers logical operations with missing data, run-time error handling (try-error), easy exchange with the system through standard I/O, and so on.
  4. A complete set of ready-to-use statistical procedures.
  5. Well documented and well maintained: compatibility, continuity, and so on.
  6. A humanly pleasant professional community has gathered around it (forums, user conferences, etc.).
  7. A well-documented interface to external libraries and functions written in anything: Fortran, C, Java. Hence a sea of well-documented libraries covering all aspects of statistics and data processing in almost every field of science, with a primary focus on bioinformatics and biostatistics; everything is updated regularly and correctly, as long as the author is willing.
  8. No mandatory GUI in the “basic configuration.” Well, I am a man, not a “mouse”!
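
For readers who have never seen the language, here is a minimal sketch of what items 2 and 3 look like in practice. The data frame, its column names, and its values are invented purely for illustration; only base R functions are used.

```r
# A small data frame with one missing weight (NA). All values are made up.
animals <- data.frame(
  id     = c("a1", "a2", "a3", "a4"),
  weight = c(410, NA, 385, 450),
  stringsAsFactors = FALSE
)

# Logical operations with missing data: comparing NA gives NA, not an error.
heavy <- animals$weight > 400

# Three-valued logic: the answer is known whenever NA cannot change it.
na_and_false <- NA & FALSE
na_or_true   <- NA | TRUE

# Lists plus anonymous ("lambda-like") functions: apply several summaries
# to the same column, skipping the missing value.
summaries <- lapply(
  list(mean = mean, max = max),
  function(f) f(animals$weight, na.rm = TRUE)
)

# Run-time error handling: try() returns an object of class "try-error"
# instead of stopping the script.
result <- try(log("not a number"), silent = TRUE)
failed <- inherits(result, "try-error")
```

Here `heavy` keeps an `NA` in the second position, `summaries` is a named list with the elements `mean` and `max`, and `failed` is `TRUE` because `log()` rejects a character argument.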

Off the list: I am simply pleased that my main work tool has... a soul.
Which is, in fact, what I am trying to show in this article.

Why and how do I use it (examples)


I began writing this section, but stopped.
Otherwise, I would never have finished.
Perhaps something will follow later.

Myths and truth


R is slow

R is “thin”: it uses the BLAS/LAPACK/ATLAS libraries for calculations. Try writing something faster than these good old (often Fortran) “workhorses.” All critical functions, as a rule, use vector operations and are implemented in C.
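
A rough sketch of the difference between interpreted loops and vector operations, assuming nothing beyond base R. The function names `loop_sum_sq` and `vec_sum_sq` and the vector and matrix sizes are my own choices for illustration; actual timings depend on the machine and on the BLAS build, so none are quoted here.

```r
set.seed(1)  # arbitrary seed so the run is repeatable
x <- runif(1e5)

# Explicit loop: every iteration is executed by the R interpreter.
loop_sum_sq <- function(v) {
  total <- 0
  for (value in v) total <- total + value^2
  total
}

# Vectorized form: the arithmetic runs inside compiled code.
vec_sum_sq <- function(v) sum(v^2)

# Both forms should agree up to floating-point rounding.
same <- isTRUE(all.equal(loop_sum_sq(x), vec_sum_sq(x)))

# Compare the two on your own machine.
loop_time <- system.time(loop_sum_sq(x))
vec_time  <- system.time(vec_sum_sq(x))

# Matrix products of numeric matrices are handed to the BLAS library
# that R was built against.
a <- matrix(runif(200 * 200), nrow = 200)
b <- matrix(runif(200 * 200), nrow = 200)
product <- a %*% b
```

Printing `loop_time` and `vec_time` shows how much of the "slowness" comes from writing R as if it were Fortran rather than from R itself.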

R uses computing resources, memory in particular, irrationally

Yes, the developers admit to this sin. But a specialist's working time is now more expensive than hardware. Clear the games off a modern work computer, and with most real data sets you will have no problems with R.

Free software cannot be reliable

It can: Fortran, Linux, C, Lisp, Java, etc.

Instead of an epilogue

As mentioned above, this post is actually a translation of my presentation for a fairly specific target audience, which I will briefly describe.

Many “pure” IT people will have to deal with such people, since food production has long competed with oil and other energy sources in attracting capital and generating profit. And the capacity of the bioinformatics market in medicine and pharmacology is limited, whatever one may say.

So, my audience is people with a basic education in genetics and breeding, veterinary medicine, and less often biology (mainly molecular). Men and women (the latter in the majority), 20-30 years ... of programming (!) in Fortran or VB, deftly handling Excel tables of 100k rows/columns, periodically bringing down, with their tasks (and their programming), their Linux computing cluster of 500+ cores and 12 TB of shared memory, and from time to time demanding that disk storage be expanded by another ten terabytes.

The methodological base is an explosive mixture of analysis of variance, as old as the world, with mixed models solved by nothing other than maximum likelihood, “brain-melting” Bayesian networks, and so on.

The data are tables of anywhere from a few to tens of thousands of rows, sometimes with 1-5 columns of phenotypes, but more often with tens or hundreds of thousands (“K”) of columns of variables that are weakly correlated with each other and with the phenotypes.

Well, yes, they also have a “good tradition” of considering everything in terms of kinship (genetics, after all). Family ties are traditionally represented as a pedigree matrix of size, for example, 40,000 x 40,000 (if there are 40,000 animals). Or (so far, fortunately, only as a project) 20,000,000 x 20,000,000, to “cover” with a single model all 20 million historical animals available in the database (DB2, if you are interested; and Cobol has still not been “sawn out” everywhere...).
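
To put these matrix sizes in perspective, here is a back-of-the-envelope estimate. It rests on an assumption of mine that is not stated in the text: the matrix is stored densely, with one 8-byte double-precision number per entry. The helper name `dense_matrix_bytes` is hypothetical.

```r
# Memory needed for a dense n x n matrix, in bytes.
# Assumption: 8 bytes per entry (double precision), no sparsity exploited.
dense_matrix_bytes <- function(n, bytes_per_entry = 8) {
  n * n * bytes_per_entry
}

# 40,000 animals, expressed in gigabytes (10^9 bytes).
gb_40k <- dense_matrix_bytes(40000) / 1e9

# 20 million animals, expressed in petabytes (10^15 bytes).
pb_20m <- dense_matrix_bytes(2e7) / 1e15
```

Under that assumption the first matrix takes about 12.8 GB and the second about 3.2 PB, which is why the second exists only as a project.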

At desks littered with books on (all at once) Fortran, Java, C#, Scala, Octave, and Linux for Dummies, you can find recent bioinformatics graduates. But somehow many of them quickly leave science to become “coders.”

However, I also know of cases of movement in the opposite direction. So R is still useful to many.

Source: https://habr.com/ru/post/168817/

