Introduction
These days it is almost impossible to imagine a world without parallel computing: even mobile phones now have several cores, which means... well, you get the idea. But let's talk not about mobile applications but about something more useful and interesting: machine learning. The topic is fashionable and much hyped; even housewives have heard of machine learning by now, and only the lazy have not tried it hands-on. For machine learning, or more precisely for statistical computing, there are many frameworks; to my taste the best of them is R (forgive me, Octave fans). And that is what this article is about.
Disclaimer: I do not pretend to be particularly rigorous; my goal is to convey the general idea to the reader.
R is good and useful, but it has two limitations that hardly matter when working with small amounts of data and thoroughly spoil your life when working with large ones:
- all code is executed in one process
- all data is stored in memory
That is almost always how things stand if you don't give it any thought. What you can do if you do give it some thought is what I am going to tell you.
Working without loops vs. foreach
Those who have written in R know that using loops is generally frowned upon in the language. Instead, programs rely on operations over lists and vectors (apply and its relatives), which in practice are more efficient than a plain for, because all the magic can and does happen inside them. This approach also fits the philosophy of the language much better. Look:
a <- c(); for (i in 1:4) { a <- c(a, i^2) }  # ugly, isn't it?
a <- 1:4 * 1:4                               # the same result, the vectorized way
If you need to do something less trivial, you can apply apply. For example, the same result can be obtained like this:
a <- sapply(1:4, function(x) {x^2})
The apply family of functions offers many goodies that run fast and are easy on the eyes. Of course, this approach takes some getting used to, but the practice pays off with a vengeance.
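To make this more concrete, here is a small illustration (nothing beyond base R is assumed): apply works along the rows or columns of a matrix, and mapply walks several vectors in step.

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                        # row sums: 9 12
apply(m, 2, max)                        # column maxima: 2 4 6
mapply(function(x, y) x * y, 1:3, 4:6)  # element-wise products: 4 10 18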
So where is the parallelism, you ask? This is where the most interesting part begins.

Then ordinary people came along and thought: "All these apply calls, lists and so on are fine. But I don't want to relearn. Give me a familiar tool." And such a tool appeared. You can find it in the foreach package.

We are actually interested in the function of the same name, whose beauty is that it can combine the results obtained at each step of the loop (which the original for cannot; see the contortions above). And it can combine them in more than a few fixed ways: you can feed foreach your own combiner, or any suitable function. The most frequent and useful ones are c, cbind and rbind.
a <- foreach(i=1:4, .combine='c') %do% {i^2}
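A couple of sketches with other combiners: rbind stacks each iteration's result as a row, and any binary function (here a plain +) can serve as a custom combiner.

m <- foreach(i = 1:3, .combine = 'rbind') %do% { c(i, i^2) }  # a 3x2 matrix
s <- foreach(i = 1:4, .combine = '+') %do% { i^2 }            # 1 + 4 + 9 + 16 = 30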
Go!
What is the most effective way to organize a parallel computation? Break the task into small independent pieces. That is exactly how parallelization works in R, with the foreach and doSNOW packages serving as the foundation.
Let's create a “cluster” for our calculations and run a simple test on it.
library(foreach)
library(doSNOW)

cl <- makeCluster(4)  # four local worker processes
registerDoSNOW(cl)    # register the cluster as the foreach backend
a <- mean(foreach(i = 1:10^6, .combine = 'c') %dopar% { mean(rnorm(i)) })
stopCluster(cl)       # release the workers
For debugging you can use %do%, and in real life, %dopar%.
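A minimal sketch of that workflow: the two loops below differ only in the operator, so switching between the serial and the parallel version is a one-word change.

a_debug <- foreach(i = 1:4, .combine = 'c') %do%    { i^2 }  # serial, runs in the master process
a_fast  <- foreach(i = 1:4, .combine = 'c') %dopar% { i^2 }  # parallel, needs a registered cluster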
Important: the code that runs inside a parallel loop should be as independent as possible. For example, it will not see functions defined in your workspace, so they are often either defined directly inside %dopar%, or all the necessary definitions are moved into a separate source file, which is then loaded via source('trololo.R').
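As a sketch (my_transform is a made-up helper), here are two common ways to make such a function visible to the workers: foreach's .export argument, or a source() call executed on each worker.

my_transform <- function(x) sqrt(abs(x))  # hypothetical helper in the master workspace

# explicitly ship the function to the workers
a <- foreach(i = 1:4, .combine = 'c', .export = 'my_transform') %dopar% {
  my_transform(i)
}

# or load the definitions on each worker from a file
b <- foreach(i = 1:4, .combine = 'c') %dopar% {
  source('trololo.R')  # assumed to define my_transform
  my_transform(i)
}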
doSNOW clusters come in several varieties. In essence, a cluster is a collection of R instances that execute the code inside %dopar%. They can all live on the same machine, but they can just as well be spread across different ones.
cl1 <- makeCluster(c("localhost","remotehost"), type = "SOCK")  # one worker here, one on a remote host
Four cluster types are available: PVM (http://www.csm.ornl.gov/pvm/), MPI (http://cran.r-project.org/web/packages/Rmpi/index.html), NWS (http://nws-r.sourceforge.net/) and SOCK. The first three require additional libraries (implementations, in fact); by default a socket cluster is launched.
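For example (assuming the corresponding back-ends are installed), the type argument selects the implementation:

cl_mpi  <- makeCluster(4, type = "MPI")   # requires Rmpi
cl_nws  <- makeCluster(4, type = "NWS")   # requires nws
cl_sock <- makeCluster(4, type = "SOCK")  # the default; no extra libraries needed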
For debugging and gathering all kinds of statistics, you can use the snow.time function, wrapping all the calls to the cluster in it:
cl <- makeCluster(4)
registerDoSNOW(cl)
tm <- snow.time(a <- mean(foreach(i = 1:10^6, .combine = 'c') %dopar% { mean(rnorm(i)) }))
plot(tm)        # timeline of each node's activity
stopCluster(cl)
You get a picture something like this:

[cluster usage timeline produced by plot(tm)]
What is all this good for?
A lot more could probably be written, but after these examples I think it is clear where to dig next.
Why might all this be needed? For plenty of reasons: at the very least you spend your time more efficiently, and at best you can actually live to see heavy computations finish. I used this approach mainly when testing hypotheses, when I needed either
- to run a calculation over a very long time period (an initial assessment of a trading strategy's performance), or
- to run a calculation on a set of similar datasets (RxQ cross-validation; see the sketch below).
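To give the cross-validation case some flesh, here is a minimal sketch (the fold count and model are chosen arbitrarily; mtcars is a built-in dataset): each fold is an independent piece of work, so it maps naturally onto %dopar%.

library(foreach)
library(doSNOW)

cl <- makeCluster(4)
registerDoSNOW(cl)

# split the row indices into 5 random folds
folds <- split(sample(nrow(mtcars)), rep(1:5, length.out = nrow(mtcars)))

# each iteration trains on four folds and measures the error on the fifth
errors <- foreach(idx = folds, .combine = 'c') %dopar% {
  fit <- lm(mpg ~ wt, data = mtcars[-idx, ])
  mean((predict(fit, mtcars[idx, ]) - mtcars$mpg[idx])^2)
}
mean(errors)  # cross-validated mean squared error

stopCluster(cl)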
I think you will find uses of your own for this approach. Good luck!