Less than a week ago, Hacker magazine published the author's version of this material on the features of using loops when developing in R. By agreement with Hacker, we are sharing the full version of the first article. You will learn how to write loops when processing large amounts of data.
Note: the author's spelling and punctuation are preserved.
In many programming languages, loops are basic building blocks used for any repetitive task. In R, however, excessive or incorrect use of loops can lead to a noticeable drop in performance, even though the language offers an unusually large number of ways to write them.
Today we will look at the features of regular loops in R and get acquainted with the foreach function from the package of the same name, which offers an alternative approach to this seemingly basic task. On the one hand, foreach combines the best of the standard functionality; on the other, it makes it easy to switch from sequential to parallel computation with minimal changes to the code.
To begin with, those switching to R from classical programming languages are often in for an unpleasant surprise: before writing a loop, we should stop and think. The point is that in languages designed for working with large amounts of data, loops tend to be less efficient than specialized functions for querying, filtering, aggregating, and transforming data. This is easy to remember by analogy with databases, where most operations are performed with the SQL query language rather than with loops.
To understand how important this rule is, let's turn to the numbers. Suppose we have a very simple table of two columns, a and b. The first grows from 1 to 100,000, the second decreases from 100,000 to 1:
testDF <- data.frame(a = 1:100000, b = 100000:1)
If we want to compute a third column as the sum of the first two, you would be surprised how many novice R developers write code like this:
for(row in 1:nrow(testDF)) testDF[row, 3] <- testDF[row, 1] + testDF[row, 2] # don't do this!
On my laptop this calculation takes 39 seconds, although the same result can be achieved in 0.009 seconds using the table-manipulation function mutate from the dplyr package:
testDF <- testDF %>% mutate(c = a + b)
The main reason for such a dramatic difference in speed is the time lost reading and writing individual cells of the table. It is thanks to optimization at exactly these stages that the specialized functions win. But there is no need to write off the good old loops, because without them it is still impossible to build a complete program. Let's see how loops work in R.
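The timing gap is easy to reproduce with system.time. The sketch below is self-contained: it uses a smaller table so it runs quickly and base R vectorization instead of dplyr's mutate, so absolute numbers will differ from the ones above, but the gap between cell-by-cell writes and a single vectorized operation is just as dramatic.

```r
testDF <- data.frame(a = 1:5000, b = 5000:1)

# slow: read and write individual cells on every iteration
t_loop <- system.time({
  for (row in 1:nrow(testDF)) testDF[row, 3] <- testDF[row, 1] + testDF[row, 2]
})

# fast: one vectorized operation over whole columns
t_vec <- system.time({
  testDF$c2 <- testDF$a + testDF$b
})

print(t_loop["elapsed"])
print(t_vec["elapsed"])
```

Both variants produce the same column of sums; only the time spent differs.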
R supports the basic classic ways of writing loops:
for is the most common type of loop. Its syntax is simple and familiar to developers in many programming languages; we already used it at the very beginning of the article. for executes its body once for each element of the sequence passed to it.
# print the numbers from 1 to 10
for(i in 1:10) print(i)

# iterate over a vector of strings
strings <- c("a", "b", "c")
for(str in strings) print(str)
Slightly less common are while and repeat, which are also found in other programming languages. In while, a logical condition is checked before each iteration: if it holds, the loop body executes; if not, the loop ends:
while(cond) expr
A repeat loop repeats until the break statement is called explicitly:
repeat expr
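A minimal sketch of both loop types: while counts down as long as its condition holds, and repeat runs until break is called explicitly.

```r
# while: repeat the body as long as the condition is TRUE
i <- 3
while (i > 0) {
  print(i)
  i <- i - 1
}

# repeat: no condition of its own, the only exit is break
total <- 0
n <- 1
repeat {
  total <- total + n
  n <- n + 1
  if (n > 10) break  # stop once 1..10 have been summed
}
print(total)  # 55
```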
It is worth noting that for, while, and repeat always return NULL, and this is where they differ from the next group of loops.
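It is easy to check this: the value of a for loop really is NULL, whereas lapply (covered below) returns the collected results of the iterations.

```r
res_for <- for (i in 1:3) i * 2   # the loop runs, but its value is NULL
print(is.null(res_for))           # TRUE

res_lapply <- lapply(1:3, function(i) i * 2)
print(res_lapply)                 # list of 2, 4, 6
```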
apply, eapply, lapply, mapply, rapply, sapply, tapply, vapply: a fairly long list of loop functions united by one idea. They differ in what the loop is applied to and what it returns.
Let's start with the basic apply, which works on matrices:
apply(X, MARGIN, FUN, ...)
The first parameter (X) is the source matrix; the second (MARGIN) specifies how to traverse it (1 by rows, 2 by columns, c(1, 2) by rows and columns); the third is the function FUN, which is called for each element. The results of all the calls are combined into a single vector or matrix, which apply returns as its result.
For example, let's create a 3 x 3 matrix m:
m <- matrix(1:9, nrow = 3, ncol = 3)
print(m)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
Let's try apply in action.
apply(m, MARGIN = 1, FUN = sum)
# [1] 12 15 18
apply(m, MARGIN = 2, FUN = sum)
# [1] 6 15 24
For simplicity I passed the built-in function sum to apply, but you can use your own functions; this is exactly why apply is a full-fledged loop implementation. For example, let's replace sum with our own function, which first performs the summation and, if the sum equals 15, replaces the return value with 100.
apply(m, MARGIN = 1,
      FUN = function(x)  # our own function instead of sum
      {
        s <- sum(x)
        if (s == 15)     # if the sum is 15, return 100 instead
          s <- 100
        (s)
      })
[1]  12 100  18
Another common function in this family is lapply:
lapply(X, FUN, ...)
The first parameter is a list or vector, and the second is the function to call for each element. The sapply and vapply functions are wrappers around lapply. The first tries to simplify the result to a vector, matrix, or array; the second adds a check on the return types.
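A quick sketch of the difference: sapply simplifies the result when it can, while vapply makes you declare the expected shape of each result via FUN.VALUE and fails loudly on a mismatch.

```r
# sapply: the list of results is simplified to a numeric vector
squares_s <- sapply(1:3, function(x) x^2)
print(squares_s)  # 1 4 9

# vapply: each call must return exactly one numeric value,
# as declared by the FUN.VALUE template
squares_v <- vapply(1:3, function(x) x^2, FUN.VALUE = numeric(1))
print(squares_v)  # same result, but with a type guarantee
```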
Using sapply to work with table columns is quite common. For example, suppose we have the table
data <- data.frame(co1_num = 1, col2_num = 2, col3_char = "a", col4_char = "b")
When a table is passed to sapply, it is treated as a list of columns (vectors). So, by applying sapply to our data.frame and specifying is.numeric as the function to call, we find out which columns are numeric.
sapply(data, is.numeric)
 co1_num col2_num col3_char col4_char
    TRUE     TRUE     FALSE     FALSE
We display only columns with numeric values:
data[, sapply(data, is.numeric)]
  co1_num col2_num
1       1        2
Loops based on apply differ from the classic ones in that they return the loop's result, assembled from the results of each iteration.
Remember the slow loop we wrote at the very beginning with for? Most of the time was lost because the results were written into the table at every iteration. Let's write an optimized version using apply.
We apply apply to the original table, choose row-by-row processing, and specify the basic summation function sum as the function to use. As a result, apply returns a vector in which each element is the sum of the columns of the corresponding row. We add this vector as a new column to the original table and get the desired result:
a_plus_b <- apply(testDF, 1, sum)
testDF$c <- a_plus_b
Measuring the execution time gives 0.248 seconds: a hundred times faster than the first variant, but still ten times slower than the specialized table operations.
foreach is not a base R function. The corresponding package must be installed and, before the first call, loaded:
install.packages("foreach")  # install the package (needed once)
library(foreach)             # load the package before use
Although foreach comes from a third-party package, today it is a very popular approach to writing loops. foreach was developed by one of the most respected R companies in the world, Revolution Analytics, which created its own commercial distribution of R. In 2015 the company was bought by Microsoft, and now all its developments are included in Microsoft SQL Server R Services. Nevertheless, foreach remains an ordinary open source project under the Apache License 2.0.
The main reasons for the popularity of foreach:

- its syntax is close to for, which, as I said, is the most popular type of loop;
- foreach returns a value assembled from the results of each iteration, and you can define your own function implementing any logic for building the final value of the loop from the iteration results;
- it makes switching to parallel execution easy.

Let's start with something simple. For each number from 1 to 10, the iteration multiplies the number by 2. The results of all iterations are collected into the result variable as a list:
result <- foreach(i = 1:10) %do% (i*2)
If we want the result to be a vector rather than a list, we need to specify the function c for combining the results:
result <- foreach(i = 1:10, .combine = "c") %do% (i*2)
You can even sum up all the results by combining them with the + operator; then the number 110 will simply be written into the result variable:
result <- foreach(i = 1:10, .combine = "+") %do% (i*2)
You can also specify several loop variables in foreach at once. Let variable a grow from 1 to 10 while b decreases from 10 to 1. Then result will be a vector of ten elements, each equal to 11:
result <- foreach(a = 1:10, b = 10:1, .combine = "c") %do% (a+b)
Loop iterations can return more than simple values. Suppose we have a function that returns a data.frame:
customFun <- function(param) {
  data.frame(param = param,
             result1 = sample(1:100, 1),
             result2 = sample(1:100, 1))
}
If we want to call this function a hundred times and merge the results into a single data.frame, we can specify the rbind function in .combine:
result <- foreach(param = 1:100, .combine = "rbind") %do% customFun(param)
As a result, the result variable contains a single table of results.
In .combine you can also use a function of your own, and additional parameters let you optimize performance if your function can accept more than two arguments at once (the foreach documentation describes the .multicombine and .maxcombine parameters).
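As an illustration of a custom combiner, here is a hypothetical keepMax function (the name is mine, not from the article) that keeps only the largest value produced so far; it assumes the foreach package is installed:

```r
library(foreach)

# custom .combine: called with the accumulated value and the next result
keepMax <- function(acc, x) max(acc, x)

result <- foreach(i = 1:10, .combine = keepMax) %do% (i * 2)
print(result)  # 20, the maximum of 2, 4, ..., 20
```

The same idea extends to any reduction: concatenation, running sums, merging tables, and so on.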
One of the main advantages of foreach is how easy it makes the transition from sequential to parallel processing. In essence, the transition amounts to replacing %do% with %dopar%, but there are a few nuances:
Before calling foreach, you must already have a parallel backend registered. R has several popular parallel backend implementations: doParallel, doSNOW, doMC. Each has its own characteristics, but for simplicity I suggest taking the first one and writing a few lines of code to connect it:
library(doParallel)     # load the backend package
cl <- makeCluster(8)    # create a cluster of eight workers
registerDoParallel(cl)  # register the cluster as the parallel backend
If we now run a loop of eight iterations, each of which simply waits one second, we will see that the whole loop finishes in about one second, since all the iterations run in parallel:
system.time({
  foreach(i = 1:8) %dopar% Sys.sleep(1)  # eight one-second waits in parallel
})
   user  system elapsed
  0.008   0.005   1.014
When you are finished with the parallel backend, you can stop it:
stopCluster(cl)
There is no need to create and delete the parallel backend before every foreach call. As a rule, it is created once per program and is used by all the functions that can work with it.
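A common pattern when creating the backend once per program (an assumption on my part, not something the article prescribes) is to size the cluster from the number of available cores, using detectCores from the base parallel package:

```r
library(parallel)

# leave one core free for the rest of the system;
# na.rm guards against platforms where detectCores() returns NA
n_workers <- max(1, detectCores() - 1, na.rm = TRUE)
print(n_workers)
```

The resulting n_workers can then be passed to makeCluster in place of the hard-coded 8 above.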
Packages loaded in the main session are not automatically available inside the parallel workers; they must be listed in the .packages parameter. For example, suppose at each iteration you want to create a file using the readr package, which was loaded into memory before the foreach call. With a sequential loop (%do%) everything works without errors:
library(readr)
foreach(i = 1:8) %do%
  write_csv(data.frame(id = 1), paste0("file", i, ".csv"))

But with the parallel version (%dopar%) we get an error:

foreach(i = 1:8) %dopar%
  write_csv(data.frame(id = 1), paste0("file", i, ".csv"))
Error in write_csv(data.frame(id = 1), paste0("file", i, ".csv")) :
  task 1 failed - "could not find function "write_csv""
The error occurs because the readr package is not loaded inside the parallel worker. We fix the error with the .packages parameter:
foreach(i=1:8, .packages = "readr") %dopar% write_csv(data.frame(id = 1), paste0("file", i, ".csv"))
Console output from the iterations is not visible when running in parallel. For debugging, you can temporarily replace %dopar% with %do%, or redirect the output of each iteration to its own file using the sink function.

When working with large amounts of data, loops are not always the best choice. Specialized functions for selecting, aggregating, and transforming data are always more efficient than loops.
R offers many ways to write loops. The main difference between the classic for, while, and repeat and the apply-based group of functions is that the latter return a value.
foreach, from the external package of the same name, simplifies writing loops, gives flexible control over the values returned by iterations, and, when multi-threaded processing is enabled, can also greatly increase a solution's performance.

Source: https://habr.com/ru/post/320232/