Less than a week ago, Hacker magazine published the author's version of this material on the features of using loops when developing in R. By agreement with Hacker, we are sharing the full version of the first article. You will learn how to write loops when processing large amounts of data.
Note: the author's spelling and punctuation are preserved.
In many programming languages, loops are basic building blocks used for any repetitive task. In R, however, excessive or incorrect use of loops can lead to a noticeable drop in performance, even though the language offers an unusually large number of ways to write them.
Today we will look at the features of regular loops in R and get acquainted with the foreach function from the package of the same name, which offers an alternative approach to this seemingly basic task. On the one hand, foreach combines the best of the standard functionality; on the other, it makes it easy to switch from sequential to parallel computation with minimal changes to the code.
To begin with, those switching to R from classical programming languages are often in for an unpleasant surprise: before writing a loop, we should stop and think. The point is that in languages designed for working with large amounts of data, loops tend to be less efficient than specialized functions for querying, filtering, aggregating, and transforming data. This is easy to remember by analogy with databases, where most operations are performed with the SQL query language rather than with loops.
To understand how important this rule is, let's turn to the numbers. Suppose we have a very simple table of two columns, a and b. The first grows from 1 to 100,000, the second decreases from 100,000 to 1:
testDF <- data.frame(a = 1:100000, b = 100000:1)
If we want to compute a third column as the sum of the first two, you would be surprised how many novice R developers write code like this:
for(row in 1:nrow(testDF)) testDF[row, 3] <- testDF[row, 1] + testDF[row, 2] # don't do this!
On my laptop this calculation takes 39 seconds, although the same result can be achieved in 0.009 seconds using the table-manipulation function mutate from the dplyr package:
testDF <- testDF %>% mutate(c = a + b)
The main reason for such a dramatic difference in speed is the time lost reading and writing individual cells of the table. It is thanks to optimization at exactly these stages that the specialized functions win. But there is no need to write off the good old loops, because without them it is still impossible to build a complete program. Let's see how loops work in R.
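The timing gap is easy to reproduce with system.time. The sketch below is self-contained: it uses a smaller table so it runs quickly and base R vectorization instead of dplyr's mutate, so absolute numbers will differ from the ones above, but the gap between cell-by-cell writes and a single vectorized operation is just as dramatic.

```r
testDF <- data.frame(a = 1:5000, b = 5000:1)

# slow: read and write individual cells on every iteration
t_loop <- system.time({
  for (row in 1:nrow(testDF)) testDF[row, 3] <- testDF[row, 1] + testDF[row, 2]
})

# fast: one vectorized operation over whole columns
t_vec <- system.time({
  testDF$c2 <- testDF$a + testDF$b
})

print(t_loop["elapsed"])
print(t_vec["elapsed"])
```

Both variants produce the same column of sums; only the time spent differs.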
R supports the basic classic ways of writing loops:
for is the most common type of loop. Its syntax is simple and familiar to developers in many programming languages; we already used it at the very beginning of the article. for executes its body once for each element of the sequence passed to it.
# print the numbers from 1 to 10
for(i in 1:10) print(i)

# iterate over a vector of strings
strings <- c("a", "b", "c")
for(str in strings) print(str)
Slightly less common are while and repeat, which are also found in other programming languages. In while, a logical condition is checked before each iteration: if it holds, the loop body executes; if not, the loop ends:
while(cond) expr
A repeat loop repeats until the break statement is called explicitly:
repeat expr
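A minimal sketch of both loop types: while counts down as long as its condition holds, and repeat runs until break is called explicitly.

```r
# while: repeat the body as long as the condition is TRUE
i <- 3
while (i > 0) {
  print(i)
  i <- i - 1
}

# repeat: no condition of its own, the only exit is break
total <- 0
n <- 1
repeat {
  total <- total + n
  n <- n + 1
  if (n > 10) break  # stop once 1..10 have been summed
}
print(total)  # 55
```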
It is worth noting that for, while, and repeat always return NULL, and this is where they differ from the next group of loops.
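It is easy to check this: the value of a for loop really is NULL, whereas lapply (covered below) returns the collected results of the iterations.

```r
res_for <- for (i in 1:3) i * 2   # the loop runs, but its value is NULL
print(is.null(res_for))           # TRUE

res_lapply <- lapply(1:3, function(i) i * 2)
print(res_lapply)                 # list of 2, 4, 6
```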
apply, eapply, lapply, mapply, rapply, sapply, tapply, vapply: a fairly long list of loop functions united by one idea. They differ in what the loop is applied to and what it returns.
Let's start with the basic apply, which works on matrices:
apply(X, MARGIN, FUN, ...)
The first parameter (X) is the source matrix; the second (MARGIN) specifies how to traverse it (1 by rows, 2 by columns, c(1, 2) by rows and columns); the third is the function FUN, which is called for each element. The results of all the calls are combined into a single vector or matrix, which apply returns as its result.
For example, let's create a 3 x 3 matrix m:
m <- matrix(1:9, nrow = 3, ncol = 3)
print(m)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
Let's try apply in action.
apply(m, MARGIN = 1, FUN = sum)
# [1] 12 15 18
apply(m, MARGIN = 2, FUN = sum)
# [1] 6 15 24
For simplicity I passed the built-in function sum to apply, but you can use your own functions; this is exactly why apply is a full-fledged loop implementation. For example, let's replace sum with our own function, which first performs the summation and, if the sum equals 15, replaces the return value with 100.
apply(m, MARGIN = 1,
      FUN = function(x)  # our own function instead of sum
      {
        s <- sum(x)
        if (s == 15)     # if the sum is 15, return 100 instead
          s <- 100
        (s)
      })
[1]  12 100  18
Another common function in this family is lapply:
lapply(X, FUN, ...)
The first parameter is a list or vector, and the second is the function to call for each element. The sapply and vapply functions are wrappers around lapply. The first tries to simplify the result to a vector, matrix, or array; the second adds a check on the return types.
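A quick sketch of the difference: sapply simplifies the result when it can, while vapply makes you declare the expected shape of each result via FUN.VALUE and fails loudly on a mismatch.

```r
# sapply: the list of results is simplified to a numeric vector
squares_s <- sapply(1:3, function(x) x^2)
print(squares_s)  # 1 4 9

# vapply: each call must return exactly one numeric value,
# as declared by the FUN.VALUE template
squares_v <- vapply(1:3, function(x) x^2, FUN.VALUE = numeric(1))
print(squares_v)  # same result, but with a type guarantee
```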
Using sapply to work with table columns is quite common. For example, suppose we have the table
data <- data.frame(co1_num = 1, col2_num = 2, col3_char = "a", col4_char = "b")
When a table is passed to sapply, it is treated as a list of columns (vectors). So, by applying sapply to our data.frame and specifying is.numeric as the function to call, we find out which columns are numeric.
sapply(data, is.numeric)
 co1_num col2_num col3_char col4_char
    TRUE     TRUE     FALSE     FALSE
We display only columns with numeric values:
data[, sapply(data, is.numeric)]
  co1_num col2_num
1       1        2
Loops based on apply differ from the classic ones in that they return the loop's result, assembled from the results of each iteration.
Remember the slow loop we wrote at the very beginning with for? Most of the time was lost because the results were written into the table at every iteration. Let's write an optimized version using apply.
We apply apply to the original table, choose row-by-row processing, and specify the basic summation function sum as the function to use. As a result, apply returns a vector in which each element is the sum of the columns of the corresponding row. We add this vector as a new column to the original table and get the desired result:
a_plus_b <- apply(testDF, 1, sum)
testDF$c <- a_plus_b
Measuring the execution time gives 0.248 seconds: a hundred times faster than the first variant, but still ten times slower than the specialized table operations.
foreach is not a base R function. The corresponding package must be installed and, before the first call, loaded:
install.packages("foreach")  # install the package (needed once)
library(foreach)             # load the package before use
Although foreach comes from a third-party package, today it is a very popular approach to writing loops. foreach was developed by one of the most respected R companies in the world, Revolution Analytics, which created its own commercial distribution of R. In 2015 the company was bought by Microsoft, and now all its developments are included in Microsoft SQL Server R Services. Nevertheless, foreach remains an ordinary open source project under the Apache License 2.0.
The main reasons for the popularity of foreach:

- its syntax is close to for, which, as I said, is the most popular type of loop;
- foreach returns a value assembled from the results of each iteration, and you can define your own function implementing any logic for building the final value of the loop from the iteration results;
- it makes switching to parallel execution easy.

Let's start with something simple. For each number from 1 to 10, the iteration multiplies the number by 2. The results of all iterations are collected into the result variable as a list:
result <- foreach(i = 1:10) %do% (i*2)
If we want the result to be a vector rather than a list, we need to specify the function c for combining the results:
result <- foreach(i = 1:10, .combine = "c") %do% (i*2)
You can even sum up all the results by combining them with the + operator; then the number 110 will simply be written into the result variable:
result <- foreach(i = 1:10, .combine = "+") %do% (i*2)
You can also specify several loop variables in foreach at once. Let variable a grow from 1 to 10 while b decreases from 10 to 1. Then result will be a vector of ten elements, each equal to 11:
result <- foreach(a = 1:10, b = 10:1, .combine = "c") %do% (a+b)
Loop iterations can return more than simple values. Suppose we have a function that returns a data.frame:
customFun <- function(param) {
  data.frame(param = param,
             result1 = sample(1:100, 1),
             result2 = sample(1:100, 1))
}
If we want to call this function a hundred times and merge the results into a single data.frame, we can specify the rbind function in .combine:
result <- foreach(param = 1:100, .combine = "rbind") %do% customFun(param)
As a result, the result variable contains a single table of results.
In .combine you can also use a function of your own, and additional parameters let you optimize performance if your function can accept more than two arguments at once (the foreach documentation describes the .multicombine and .maxcombine parameters).
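As an illustration of a custom combiner, here is a hypothetical keepMax function (the name is mine, not from the article) that keeps only the largest value produced so far; it assumes the foreach package is installed:

```r
library(foreach)

# custom .combine: called with the accumulated value and the next result
keepMax <- function(acc, x) max(acc, x)

result <- foreach(i = 1:10, .combine = keepMax) %do% (i * 2)
print(result)  # 20, the maximum of 2, 4, ..., 20
```

The same idea extends to any reduction: concatenation, running sums, merging tables, and so on.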
One of the main advantages of foreach is how easy it makes the transition from sequential to parallel processing. In essence, the transition amounts to replacing %do% with %dopar%, but there are a few nuances:
Before calling foreach, you must already have a parallel backend registered. R has several popular parallel backend implementations: doParallel, doSNOW, doMC. Each has its own characteristics, but for simplicity I suggest taking the first one and writing a few lines of code to connect it:
library(doParallel)     # load the backend package
cl <- makeCluster(8)    # create a cluster of eight workers
registerDoParallel(cl)  # register the cluster as the parallel backend
If we now run a loop of eight iterations, each of which simply waits one second, we will see that the whole loop finishes in about one second, since all the iterations run in parallel:
system.time({
  foreach(i = 1:8) %dopar% Sys.sleep(1)  # eight one-second waits in parallel
})
   user  system elapsed
  0.008   0.005   1.014
When you are finished with the parallel backend, you can stop it:
stopCluster(cl)
There is no need to create and delete the parallel backend before every foreach call. As a rule, it is created once per program and is used by all the functions that can work with it.
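A common pattern when creating the backend once per program (an assumption on my part, not something the article prescribes) is to size the cluster from the number of available cores, using detectCores from the base parallel package:

```r
library(parallel)

# leave one core free for the rest of the system;
# na.rm guards against platforms where detectCores() returns NA
n_workers <- max(1, detectCores() - 1, na.rm = TRUE)
print(n_workers)
```

The resulting n_workers can then be passed to makeCluster in place of the hard-coded 8 above.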
Packages loaded in the main session are not automatically available inside the parallel workers; they must be listed in the .packages parameter. For example, suppose at each iteration you want to create a file using the readr package, which was loaded into memory before the foreach call. With a sequential loop (%do%) everything works without errors:
library(readr)
foreach(i = 1:8) %do%
  write_csv(data.frame(id = 1), paste0("file", i, ".csv"))

But with the parallel version (%dopar%) we get an error:

foreach(i = 1:8) %dopar%
  write_csv(data.frame(id = 1), paste0("file", i, ".csv"))
Error in write_csv(data.frame(id = 1), paste0("file", i, ".csv")) :
  task 1 failed - "could not find function "write_csv""
The error occurs because the readr package is not loaded inside the parallel worker. We fix the error with the .packages parameter:
foreach(i=1:8, .packages = "readr") %dopar% write_csv(data.frame(id = 1), paste0("file", i, ".csv"))
Console output from the iterations is not visible when running in parallel. For debugging, you can temporarily replace %dopar% with %do%, or redirect the output of each iteration to its own file using the sink function.

When working with large amounts of data, loops are not always the best choice. Specialized functions for selecting, aggregating, and transforming data are always more efficient than loops.
R offers many ways to write loops. The main difference between the classic for, while, and repeat and the apply-based group of functions is that the latter return a value.
foreach, from the external package of the same name, simplifies writing loops, gives flexible control over the values returned by iterations, and, when multi-threaded processing is enabled, can also greatly increase a solution's performance.

Source: https://habr.com/ru/post/320232/