
A practical follow-up to "I Love R". Unplanned

I joined the Habr community relatively recently.

I can't count myself as mainstream IT, but I too have been pressing buttons for, if you count from the Minsk-22 in my first school elective, something like ... 40 years ... (Good grief, people don't even live that long!)

Oh yeah, I digress ...
I wrote my first post about R with trembling fingers (from excitement) and was very pleased with the feedback I received. It reinforced my desire to keep my promise and show something from the practical application of R, in particular from the field of bioinformatics, where R is most popular and where we work with it.
At the same time, this is not the first time I have seen R used merely as a "prototyping" language, with the intention of later rewriting everything in something "real": C++ or, at worst, Python.

This particular post was provoked by an article about the index method for generating discrete random variables. For now I will set aside my quibbles with that article's statements on probability theory (which is not easy for me, with 20+ years of teaching behind me, probability theory included).
Under the cut is an example of bringing the R code from that article to a "working" (in my subjective opinion) state.



In short:
The respected kokorins wrote what is, in my opinion, a useful article about the index method of constructing a discrete random variable generator.
The article used this code to illustrate the “naive” version of the algorithm:
# Naive approach
len <- 10
ps <- seq(len, 2, by = -1)
ps <- 1/ps^2
ps <- ps/sum(ps)
ss <- cumsum(ps)
gen_naiv <- function() {
  alpha <- runif(1)
  return(min(which(alpha < ss)))
}
# sample
val <- c()
len <- 10000
for (i in 1:len) {
  val <- append(val, gen_naiv())
}
vals <- factor(val)
plot(vals)

I was interested in the algorithm, but the keywords I "fished out" of the quoted post turned out to be too unspecific, and the surname of the method's author, Chen, is nearly the most common one in China, and therefore one of the most numerous on the Internet ...

The author's "style" of programming in R is far from perfect (to put it mildly), but then again, the article was not about R. At first I tried not to pay attention, but while trying to puzzle out the algorithm from the author's R code, I lost patience and decided to use that code as an instructive example.

So, as I already said, the code works. But it is very bad. I will try to show you why, and how to do better.

My first edit simply wraps the final step, generating the random sequence, in a function that, in addition to returning a vector of random numbers, also prints the time taken to execute.
len <- 10
ps <- seq(len, 2, by = -1)
ps <- 1/ps^2
ps <- ps/sum(ps)
ss <- cumsum(ps)
gen_naiv <- function() {
  alpha <- runif(1)
  return(min(which(alpha < ss)))
}
# sample
get_naive <- function(len = 10000) {
  val <- c()
  st <- system.time(
    for (i in 1:len) {
      val <- append(val, gen_naiv())
    }
  )
  print(st)
  return(val)
}

The last two lines of the author's code are omitted because they are intended only for graphics.

Having run this code in the R environment, I can now generate sequences of different lengths and, at the same time, see how much time the generation takes.
> get_naive(10)
   user  system elapsed
  0.006   0.000   0.007
 [1] 8 8 8 8 9 5 8 9 9 7
> get_naive(100)
   user  system elapsed
  0.008   0.000   0.008
  [1] 8 1 7 8 8 4 9 8 7 8 9 9 9 9 5 7 9 9 8 3 7 1 1 9 7 7 9 9 5 8 6 6 4 9 9 6 9 8 9 8 6 8 6 9 2 8 9 9 9 9 7 8 5 9 6 6 5 9 9 8 9 9 7 8 6 8 9 1 8 9
 [71] 9 3 5 9 8 9 9 9 9 6 9 9 5 9 8 7 8 9 9 9 8 9 7 9 8 7 8 5 5 8
>

After admiring the thousandths of a second spent on execution, and the pseudo-random sequence of numbers from 1 to 9, I "hide" the function's output in the variable val and continue the experiments, increasing the sample length:
> val<-get_naive(1000)
   user  system elapsed
  0.021   0.000   0.021
> val<-get_naive(10000)
   user  system elapsed
  0.342   0.011   0.354
> val<-get_naive(100000)
   user  system elapsed
  24.86    8.55   33.48

Ay-ay-ay!
This does not look good: the execution time is growing alarmingly fast.
Could this really be a problem of the naive approach? Does the algorithm named after our Chinese colleague work faster?

I take the author's final version of the program, supplementing it only in order to measure time, and run it with different lengths of the desired sample ...
> get_chen(1000)
   user  system elapsed
  0.031   0.001   0.032
> get_chen(10000)
   user  system elapsed
  0.508   0.063   0.567
> get_chen(100000)
   user  system elapsed
 46.303   5.946  53.088

Oops! How unexpected ... you disappoint me, Comrade Chen. Somehow the "improved" algorithm refuses to work better than the "naive" one. Or perhaps the author I quoted did not manage to convey all of its "brilliance"? I deliberately do not show that next portion of, in my opinion, bad code. Better not to get used to it.

My initial plan was to go through my colleague's mistakes step by step and show how it should be done. But I decided to simply rewrite the code (the "naive" version) with comments, the way I would have written it relying on my experience with R, and to leave kokorins to struggle on his own with the shortcomings of his own or the Chinese code. Forgive me, I simply could not resist.

Here is the code that I got:
M <- 10                  # same role as len in the original code
ps <- 1/(M:2)^2          # unnormalized probabilities: 1/10^2, 1/9^2, ..., 1/2^2
ps <- ps/sum(ps)         # normalize so that the probabilities sum to one
ss <- c(0, cumsum(ps))   # cumulative sums, padded with 0: the interval breaks for cut()
lbs <- 1:(M-1)           # interval "labels": the values the generator returns (1 to 9)
get_vladob <- function(n, brks = ss, labs = lbs) {
  st <- system.time(
    val <- as.numeric(cut(runif(n), breaks = brks, labels = labs))
  )
  print(st)
  return(val)
}

And the results of the "run"

> val<-get_vladob(1000)
   user  system elapsed
  0.010   0.000   0.012
> val<-get_vladob(10000)
   user  system elapsed
  0.004   0.000   0.004
> val<-get_vladob(100000)
   user  system elapsed
  0.036   0.001   0.038
> val<-get_vladob(1000000)
   user  system elapsed
  0.581   0.029   0.606
> val<-get_vladob(10000000)
   user  system elapsed
  4.015   0.311   4.318
> val<-get_vladob(100000000)
   user  system elapsed
 38.405   3.092  41.578

As you can see - not bad at all for a “slow” language.

I am now looking with interest towards other languages. I would gladly see the results of a similar test implemented in other environments. In particular, I am terribly curious whether colleague kokorins will get results in C++, and what those results will be.

What has changed in the code?
First of all, I used R's "vector" nature. For example, if I need 1000 values from the runif function, I call runif(1000) once and process the result with vector expressions, rather than calling runif(1) a thousand times inside a for loop. Believe me, for R this is significant.
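
Here is a small sketch of the difference, my own illustration rather than code from either article; slow_runif is a hypothetical helper that mimics the loop-and-append pattern of the naive version:

# Element-by-element generation, growing the result with append():
slow_runif <- function(n) {
  val <- c()
  for (i in 1:n) {
    val <- append(val, runif(1))  # append() copies the whole vector on every pass
  }
  val
}

system.time(slow_runif(1e5))  # the repeated copying makes the total cost grow roughly quadratically
system.time(runif(1e5))       # one vectorized call: a single allocation, a single pass

The point is not runif itself but the pattern: a per-element loop that grows its result with append() will choke long before a single vectorized call does.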

Second, R is a functional programming language. When you work with really large arrays of values, you come to appreciate being able to manage, without any special effort, the placement of intermediate values in memory. In most cases the system, written by really strong programmers, does this faster and more reliably than (no offense) not always "straight-handed" users, even ones who know the basics of programming.
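
As a tiny sketch of that idea (again mine, not from the original post; brks and labs here are toy values, not the ss and lbs defined above):

brks <- c(0, 0.3, 0.6, 1)   # toy interval boundaries
labs <- 1:3                 # toy labels

# Step by step, keeping every intermediate in a named variable:
u    <- runif(5)
f    <- cut(u, breaks = brks, labels = labs)
out1 <- as.numeric(f)

# The same thing as one composed expression: the interpreter creates the
# temporaries itself and frees them as soon as they are no longer needed.
out2 <- as.numeric(cut(runif(5), breaks = brks, labels = labs))

Both variants produce the same kind of result; the second simply leaves the housekeeping to the interpreter.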

Third, I mentioned in my first post that R was written by "statisticians for statisticians". The cut function assigns labels to the values of a vector depending on which interval each value falls into; the vector of interval boundaries is passed to cut as the breaks parameter. This operation is common enough in data processing (grouping, for example) to have been implemented (and optimized!) in R as a separate function.
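
To make the behaviour of cut concrete, here is a minimal example of my own (it is not from the original article): four numbers are binned into quarters of the unit interval and receive the corresponding labels.

x <- c(0.15, 0.42, 0.87, 0.61)
cut(x, breaks = c(0, 0.25, 0.5, 0.75, 1), labels = c("Q1", "Q2", "Q3", "Q4"))
# [1] Q1 Q2 Q4 Q3
# Levels: Q1 Q2 Q3 Q4

The result is a factor; in get_vladob, where the labels are simply 1 through M-1, wrapping it in as.numeric turns that factor back into plain numbers.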

In general, R.
Please love and respect it.

Source: https://habr.com/ru/post/180155/

