
A short introduction to parallel programming in R

Let's talk about the use and benefits of parallel computing in R.

Why it is worth thinking about: by making the computer do more work (performing many calculations at the same time), we spend less time waiting for the results of our experiments and can get more done. This is especially important for data analysis (the purpose R is most often used for), since we frequently need to repeat variations of the same approach in order to learn something, estimate parameter values, or assess the stability of a model.

Usually, before the computer can do more work, someone has to do some work first: the analyst, the programmer, or the library author must organize the calculation in a form convenient for parallelization. In the best case someone has already done that for you.

In addition to a task prepared for parallelization, you need hardware that supports it: for example, several cores on one machine, or several machines on a common network.

Obviously, parallel computing in R is a vast and highly specialized topic. It may seem impossible to quickly learn this magic — how to make your calculations faster.

In this article, we will show how you can speed up your calculations using the basic capabilities of R.

First of all, there must be a task that can be parallelized. The most obvious tasks of this kind consist of many repetitions of the same simple action (the intuitive term is “naturally parallel”).


We will assume that our problem has already been organized as a large number of simple repetitions. Note that getting a problem into this form is not always easy, but it is a necessary step before the process can begin.

Here is the task we will use as an example: building predictive models on a small data set. Load the data set and some definitions into the workspace:

    d <- iris  # let "d" refer to one of R's built-in example data sets
    vars <- c('Sepal.Length','Sepal.Width','Petal.Length')
    yName <- 'Species'
    yLevels <- sort(unique(as.character(d[[yName]])))
    print(yLevels)

 ## [1] "setosa" "versicolor" "virginica" 

(We will use the convention that any line starting with " ## " is output produced by the preceding R command.)

We are faced with a small modeling problem: the variable we are trying to predict has three levels. The modeling technique we are going to use ( glm(family='binomial') ) does not know how to predict “multinomial outcomes” (although there are libraries designed for that). We decide to handle this with a one-versus-rest strategy and build a set of classifiers, each of which separates one target level from the rest. This task is an obvious candidate for parallelization. For readability, let's wrap the building of a single model into a function:

    fitOneTargetModel <- function(yName,yLevel,vars,data) {
      formula <- paste('(',yName,'=="',yLevel,'") ~ ',
                       paste(vars,collapse=' + '),sep='')
      glm(as.formula(formula),family=binomial,data=data)
    }

Then the usual "serial" way to build all the models will look like this:

    for(yLevel in yLevels) {
      print("*****")
      print(yLevel)
      print(fitOneTargetModel(yName,yLevel,vars,d))
    }

Or we can wrap our procedure in a function of a single argument (this pattern is called currying) and use the elegant R lapply() notation:

    worker <- function(yLevel) {
      fitOneTargetModel(yName,yLevel,vars,d)
    }
    models <- lapply(yLevels,worker)
    names(models) <- yLevels
    print(models)

The advantage of the lapply() notation is that it emphasizes the independence of each computation, exactly the kind of isolation we need in order to parallelize our computations. Think of the for loop as over-specifying the computation: it imposes an order of operations that is not actually necessary.

Reorganizing the computation in this functional style has prepared us to use the parallel library and run the computation in parallel. First, we start a parallel cluster:

    # start the parallel cluster
    parallelCluster <- parallel::makeCluster(parallel::detectCores())
    print(parallelCluster)

 ## socket cluster with 4 nodes on host 'localhost' 

Note that we have created a “socket cluster”. A socket cluster is a remarkably flexible, rough first approximation of a parallel or distributed cluster. It is rough in the sense that it is relatively slow (work is distributed somewhat coarsely), but it is very flexible in how it can be deployed: many cores on one machine, cores spread over several machines on the same network, or on top of other systems, for example an MPI (message passing interface) cluster.
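For example, a socket cluster spanning several machines can be started simply by listing host names. This is only a minimal sketch, not part of the original article: 'machine1' and 'machine2' are placeholder host names, and each host must be reachable (for example over SSH) with R and the needed packages installed.

    # hedged sketch: a socket cluster with two local workers plus one worker
    # on each of two hypothetical remote hosts; each name becomes one node
    parallelCluster <- parallel::makeCluster(c('localhost','localhost',
                                               'machine1','machine2'))
    print(parallelCluster)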

At this point we would expect the code below to just work (details on tryCatch can be found here).

    tryCatch(
      models <- parallel::parLapply(parallelCluster,
                                    yLevels,worker),
      error = function(e) print(e)
    )

 ## <simpleError in checkForRemoteErrors(val):
 ##   3 nodes produced errors; first error:
 ##   could not find function "fitOneTargetModel">

Instead of results, we got the error “could not find function "fitOneTargetModel"”.

The problem: with a socket cluster, the arguments to parallel::parLapply are copied to each processing node over a communication socket. However, the rest of the current execution environment (in our case the so-called “global environment”) is not copied (only values are sent back). So when our worker() function migrates to the parallel nodes, it must be given a different closure (since it can no longer point to our execution environment), and that new closure no longer contains references to the values of yName, vars, d and fitOneTargetModel that it needs. This is unfortunate, but it makes sense: R uses environments to implement closures, and R cannot know which values in a given environment a function will actually require.
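The dependence on the global environment is easy to see directly, and a common alternative fix, which this article does not pursue, is to copy the required objects to every node explicitly with parallel::clusterExport(). A small illustrative sketch, not part of the original article:

    # worker()'s lexical scope is the global environment, which parLapply()
    # does not ship to the worker nodes
    print(environment(worker))   # <environment: R_GlobalEnv>

    # alternative fix (not the one pursued below): explicitly export the
    # needed objects to every node, then call parLapply() as before
    parallel::clusterExport(parallelCluster,
                            c('yName','vars','d','fitOneTargetModel'))
    models <- parallel::parLapply(parallelCluster,yLevels,worker)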

So now we know what is wrong; how do we fix it? We fix it by using an environment other than the global one to carry the values we need. The easiest way to do that is to build our own closure: we wrap the whole process into a function, so that it executes in a controlled environment. The code below works:

    # build the single-argument function we are going to pass to parallel
    mkWorker <- function(yName,vars,d) {
      # make sure each of the values we need
      # is available in this environment
      force(yName)
      force(vars)
      force(d)
      # define, right here in this environment, every function
      # that our worker function needs
      fitOneTargetModel <- function(yName,yLevel,vars,data) {
        formula <- paste('(',yName,'=="',yLevel,'") ~ ',
                         paste(vars,collapse=' + '),sep='')
        glm(as.formula(formula),family=binomial,data=data)
      }
      # Finally: define and return the worker function.
      # The "lexical closure" of worker
      # (where it looks for unbound variables)
      # is mkWorker's execution environment,
      # not the global environment as usual.
      # The parallel library is willing to transport
      # this environment (which it does not do
      # for the global environment).
      worker <- function(yLevel) {
        fitOneTargetModel(yName,yLevel,vars,d)
      }
      return(worker)
    }

    models <- parallel::parLapply(parallelCluster,yLevels,
                                  mkWorker(yName,vars,d))
    names(models) <- yLevels
    print(models)

The code above works because we moved the values we need into a new execution environment and defined the function we were going to use directly in that environment. Obviously, re-defining every function each time we need it is cumbersome and wasteful (although we could also have passed it into the wrapper, as we did with the other values). A more flexible pattern is to use a helper function, " bindToEnv ", to do some of the work. With bindToEnv the code looks like this:

    source('bindToEnv.R') # details: http://winvector.imtqy.com/Parallel/bindToEnv.R
    # build the single-argument function we are going to pass to parallel
    mkWorker <- function() {
      bindToEnv(objNames=c('yName','vars','d','fitOneTargetModel'))
      function(yLevel) {
        fitOneTargetModel(yName,yLevel,vars,d)
      }
    }

    models <- parallel::parLapply(parallelCluster,yLevels,
                                  mkWorker())
    names(models) <- yLevels
    print(models)

The pattern above is concise and works well. A few reservations need to be kept in mind.


These points are worth thinking about. Still, I think you will decide that adding eight lines of wrapper/boilerplate code is well worth a four-fold or greater speedup.
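If you want to check the speedup on your own hardware, you can time the serial and parallel variants of the same call. This is only a minimal sketch: the iris example used here is far too small to show any real benefit, so treat it as a template for larger workloads.

    # compare serial and parallel execution of the same computation;
    # timings depend entirely on data size and the number of cores
    system.time(serialModels   <- lapply(yLevels, mkWorker()))
    system.time(parallelModels <- parallel::parLapply(parallelCluster,
                                                      yLevels, mkWorker()))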

Also: when you are done, do not forget to shut down the cluster and remove the reference to it:

    # shut down the cluster
    if(!is.null(parallelCluster)) {
      parallel::stopCluster(parallelCluster)
      parallelCluster <- c()
    }
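If the cluster is created inside a function, on.exit() is a convenient way to guarantee the shutdown happens even when an error interrupts the computation. A hedged sketch, not from the article; runExperiment is a hypothetical wrapper name.

    # hypothetical wrapper: on.exit() stops the cluster even if
    # the parallel computation fails with an error
    runExperiment <- function() {
      parallelCluster <- parallel::makeCluster(parallel::detectCores())
      on.exit(parallel::stopCluster(parallelCluster))
      parallel::parLapply(parallelCluster, yLevels, mkWorker())
    }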

That is it for now. The next article will discuss how to build socket clusters spanning multiple machines and on Amazon EC2.

The bindToEnv function itself is fairly simple:

    #' Copy the named objects into bindTargetEnv and re-bind any function's
    #' lexical scope to bindTargetEnv.
    #'
    #' See http://winvector.imtqy.com/Parallel/PExample.html for an example of use.
    #'
    #' Used to send values along with a function in situations such as parallel
    #' execution (where the global environment is not available). Typically
    #' called inside the function that builds the worker function to be passed
    #' to the parallel processes (so there is a convenient lexical closure to bind to).
    #'
    #' @param bindTargetEnv environment to bind the values into
    #' @param objNames names of the objects to look up in the calling environment and bind
    #' @param doNotRebind names of functions whose lexical environments should not be re-bound
    bindToEnv <- function(bindTargetEnv=parent.frame(),objNames,doNotRebind=c()) {
      # bind the values into the target environment
      # and switch any functions over to that environment
      for(var in objNames) {
        val <- get(var,envir=parent.frame())
        if(is.function(val) && (!(var %in% doNotRebind))) {
          # replace the function's lexical environment with the target one (risky)
          environment(val) <- bindTargetEnv
        }
        # assign the object into the target environment, only after any alteration
        assign(var,val,envir=bindTargetEnv)
      }
    }

It can also be downloaded from here .

One of the inconveniences of parallelizing this way is that there always seems to be one more function or data item the workers need. One way around this is to use R's ls() command to build the list of names to pass. It is especially effective to save the result of ls() right after sourcing the files that define your functions and important global variables. Without some such strategy, maintaining the list by hand is a pain.
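The idea might look like the following sketch. It is only an illustration: 'myFunctions.R' is a hypothetical file of helper definitions, and the ls() snapshot is taken right after sourcing it, before any large temporary objects exist.

    source('myFunctions.R')   # hypothetical file defining fitOneTargetModel() etc.
    namesToShip <- ls()       # snapshot of the function names defined so far

    # later, once the data and settings (d, vars, yName, ...) exist:
    mkWorker <- function() {
      bindToEnv(objNames=c(namesToShip,'yName','vars','d'))
      function(yLevel) {
        fitOneTargetModel(yName,yLevel,vars,d)
      }
    }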

For larger scale: brief (not very detailed) instructions for running multiple R servers on EC2 machines can be found here.

Source: https://habr.com/ru/post/307708/

