d <- iris  # let "d" refer to one of R's built-in data sets
vars <- c('Sepal.Length','Sepal.Width','Petal.Length')
yName <- 'Species'
yLevels <- sort(unique(as.character(d[[yName]])))
print(yLevels)
## [1] "setosa" "versicolor" "virginica"
Note that the target variable Species takes three levels, while a standard logistic regression (glm(family='binomial')) does not know how to predict multi-category ("multinomial") outcomes (although there are packages designed for this). We decided to approach the task with a one-versus-rest strategy and build a set of classifiers, each separating one target level from all the others. This task is an obvious candidate for parallelization. For readability, let's wrap the fitting of a single one-versus-rest model into a function:

fitOneTargetModel <- function(yName,yLevel,vars,data) {
  formula <- paste('(',yName,'=="',yLevel,'") ~ ',
                   paste(vars,collapse=' + '),sep='')
  glm(as.formula(formula),family=binomial,data=data)
}
The usual for-loop solution then looks like this:

for(yLevel in yLevels) {
  print("*****")
  print(yLevel)
  print(fitOneTargetModel(yName,yLevel,vars,d))
}
Or we can express the same calculation in lapply() notation:

worker <- function(yLevel) {
  fitOneTargetModel(yName,yLevel,vars,d)
}
models <- lapply(yLevels,worker)
names(models) <- yLevels
print(models)
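As an aside (my illustration, not part of the original article), the resulting one-versus-rest models can be combined to score rows by asking each model for a probability and picking the level whose model is most confident:

# Illustration (not from the article): score rows with the one-versus-rest models.
scores <- sapply(models, function(m) predict(m, newdata = d, type = 'response'))
predictedClass <- colnames(scores)[max.col(scores)]   # level with the highest probability
table(predicted = predictedClass, actual = d[[yName]])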
The advantage of the lapply() notation is that it emphasizes the independence of each computation, exactly the kind of isolation we need in order to parallelize the work. A for loop, by contrast, over-specifies the calculation, imposing an unnecessary order, or sequence, on the operations.

# Start up a local parallel cluster with one worker per detected core.
parallelCluster <- parallel::makeCluster(parallel::detectCores())
print(parallelCluster)
## socket cluster with 4 nodes on host 'localhost'
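One practical aside (my suggestion, not from the article): on a machine you are also using interactively, it is common to leave one core free for the operating system and the master R session when sizing the cluster.

# Variant (assumption, not the article's code): use all but one of the detected cores.
nWorkers <- max(1L, parallel::detectCores() - 1L)
# parallelCluster <- parallel::makeCluster(nWorkers)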
An obvious attempt at a parallel version looks like the following (we wrap the call in tryCatch so that any error is captured and printed):

tryCatch(
  models <- parallel::parLapply(parallelCluster,
                                yLevels,worker),
  error = function(e) print(e)
)
## <simpleError in checkForRemoteErrors(val):
##   3 nodes produced errors; first error:
##   could not find function "fitOneTargetModel">
Our parallel attempt failed with the error could not find function "fitOneTargetModel".

What happened is this: parallel::parLapply copies the values yLevels and worker to each processing node over a communication socket. The whole of our current execution environment (here the so-called "global environment"), however, is not copied over; only the values we explicitly pass are shipped. So when our worker() function migrates to the parallel nodes it is given a new closure (it can no longer point at our execution environment), and that new closure no longer contains references to the values it needs: yName, vars, d and fitOneTargetModel. This is unfortunate, but it makes sense: R implements closures with entire environments, and R cannot know which values in a given environment a particular function will actually need.

The fix is to build the worker inside another function, so that everything it needs is captured in an environment the parallel library is willing to transport:

# Build the single-argument function we are going to pass to parLapply.
mkWorker <- function(yName,vars,d) {
  # Force the arguments so their values are captured
  # in this function's execution environment.
  force(yName)
  force(vars)
  force(d)
  # Define every function the worker needs in this same environment.
  fitOneTargetModel <- function(yName,yLevel,vars,data) {
    formula <- paste('(',yName,'=="',yLevel,'") ~ ',
                     paste(vars,collapse=' + '),sep='')
    glm(as.formula(formula),family=binomial,data=data)
  }
  # Define and return the worker. Its lexical closure (where it looks up
  # free variables) is mkWorker's execution environment rather than the
  # global environment, and the parallel library does transport that
  # environment to the remote nodes.
  worker <- function(yLevel) {
    fitOneTargetModel(yName,yLevel,vars,d)
  }
  return(worker)
}

models <- parallel::parLapply(parallelCluster,yLevels,
                              mkWorker(yName,vars,d))
names(models) <- yLevels
print(models)
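For completeness, a different route (my sketch, not the approach the article takes) is to copy the needed objects into each worker's global environment with parallel::clusterExport() and then reuse the original worker unchanged; the name models2 is just a hypothetical placeholder:

# Sketch (not the article's approach): export the needed objects to every node,
# then the original worker() defined above can be shipped as-is.
parallel::clusterExport(parallelCluster,
                        c('yName','vars','d','fitOneTargetModel'))
models2 <- parallel::parLapply(parallelCluster, yLevels, worker)
names(models2) <- yLevels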
The mkWorker() approach above works, but writing out such a function by hand each time is tedious. We can automate part of the work with a small helper function, bindToEnv. With bindToEnv the code looks like this:

source('bindToEnv.R')  # download from: http://winvector.imtqy.com/Parallel/bindToEnv.R
# Build the single-argument worker, binding the needed values into its environment.
mkWorker <- function() {
  bindToEnv(objNames=c('yName','vars','d','fitOneTargetModel'))
  function(yLevel) {
    fitOneTargetModel(yName,yLevel,vars,d)
  }
}

models <- parallel::parLapply(parallelCluster,yLevels,
                              mkWorker())
names(models) <- yLevels
print(models)
A couple of final notes. Remember to use fully qualified names such as stats::glm() when calling functions from packages, so that calling library(...) on each remote node becomes unnecessary (if you do want a package attached on every worker, see the sketch below).
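If you would rather attach a package on every worker than fully qualify each call, one option (my sketch, not from the article) is parallel::clusterEvalQ(), which evaluates an expression on each node:

# Sketch (assumption, not the article's code): attach a package on every worker node.
parallel::clusterEvalQ(parallelCluster, library(stats))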
Also, the bindToEnv function itself changes the environment of the functions passed to it (so that they can refer to the values we transfer). This can cause additional problems for functions that carry their own environments, for example functions produced by currying or partial application; there are ways to work around this. When you are done, shut the cluster down cleanly:

# Shut down the cluster when finished.
if(!is.null(parallelCluster)) {
  parallel::stopCluster(parallelCluster)
  parallelCluster <- c()
}
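A small related pattern (my sketch, not from the article): if the cluster is created inside a function, on.exit() can register the shutdown so that it also runs when an error interrupts the work.

# Sketch (assumption, not the article's code): create a cluster and guarantee cleanup.
withCluster <- function(f) {
  cl <- parallel::makeCluster(parallel::detectCores())
  on.exit(parallel::stopCluster(cl), add = TRUE)  # runs even if f() throws an error
  f(cl)
}

# Hypothetical usage:
# models <- withCluster(function(cl) parallel::parLapply(cl, yLevels, mkWorker()))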
The bindToEnv function itself is fairly simple:

#' Copy the named objects into bindTargetEnv and re-point the lexical scope
#' of any copied functions at that environment.
#'
#' See http://winvector.imtqy.com/Parallel/PExample.html for an example of use.
#'
#' Used to ship values along with a worker function in situations such as
#' parallel execution, where the global environment is not transported.
#' Typically called inside the function that builds the worker passed to the
#' parallel nodes (so there is a convenient lexical closure to bind into).
#'
#' @param bindTargetEnv environment to bind the values into
#' @param objNames names of objects to look up in the calling environment and bind
#' @param doNotRebind names of functions that should not have their environments re-bound
bindToEnv <- function(bindTargetEnv=parent.frame(),objNames,doNotRebind=c()) {
  # Copy each named value into the target environment, and re-point the
  # environment of any copied function at that same target environment.
  for(var in objNames) {
    val <- get(var,envir=parent.frame())
    if(is.function(val) && (!(var %in% doNotRebind))) {
      # Replace the function's lexical environment (this is the risky part).
      environment(val) <- bindTargetEnv
    }
    # Assign the (possibly altered) value into the target environment.
    assign(var,val,envir=bindTargetEnv)
  }
}
One more handy trick: use ls() to build the list of names that need to be passed. It is especially convenient to capture the result of ls() immediately after sourcing the files that define your functions and important global variables; without some such strategy, maintaining the list of names by hand is a pain.
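A minimal sketch of that trick; the file name here is hypothetical, my illustration rather than the article's code:

source('modelingFunctions.R')     # hypothetical file defining fitOneTargetModel() etc.
varsToShip <- ls()                # capture the names defined so far
mkWorker <- function() {
  bindToEnv(objNames=varsToShip)  # bind every captured name into the worker's closure
  function(yLevel) {
    fitOneTargetModel(yName,yLevel,vars,d)
  }
}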
Source: https://habr.com/ru/post/307708/