Strategies to speed up code in R, part 1

The for loop in R can be very slow when used in its plain form, without optimization, especially on large data sets. There are a number of ways to make your code faster, and you may be surprised by how much.

This article describes several approaches, including simple changes in logic, parallel processing, and Rcpp, that increase speed by several orders of magnitude, so that 100 million rows of data or even more can be processed comfortably.

Let's try to speed up code that uses a for loop and a conditional (if-else) to create a column that is added to a data set (data frame, df). The code below creates the initial data set.
    # Create the data frame
    col1 <- runif(12^5, 0, 2)
    col2 <- rnorm(12^5, 0, 2)
    col3 <- rpois(12^5, 3)
    col4 <- rchisq(12^5, 2)
    df <- data.frame(col1, col2, col3, col4)

In this part: vectorization, only true conditions, ifelse.
In the next part: which, apply, byte-code compilation, Rcpp, data.table.

Logic we are going to optimize


For each row in this data set (df), check whether the sum of the values exceeds 4. If so, the new fifth variable is set to "greater_than_4", otherwise to "lesser_than_4".
    # Original R code: before vectorization and pre-allocation
    system.time({
      for (i in 1:nrow(df)) {  # for every row
        if ((df[i, 'col1'] + df[i, 'col2'] + df[i, 'col3'] + df[i, 'col4']) > 4) {  # check if > 4
          df[i, 5] <- "greater_than_4"  # assign the 5th column
        } else {
          df[i, 5] <- "lesser_than_4"  # assign the 5th column
        }
      }
    })
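
To make the rule concrete, here is a small sketch on a hand-made three-row data frame. The name `toy` and its values are made up for illustration and are not part of the article's data. Note that a sum of exactly 4 would fall into the "lesser_than_4" branch, because the check is strictly greater than 4.

```r
# Hypothetical toy data, chosen by hand so the row sums are 5, 3 and 8
toy <- data.frame(col1 = c(1, 0.5, 2),
                  col2 = c(1, 0.5, 2),
                  col3 = c(1, 1, 2),
                  col4 = c(2, 1, 2))

# Same row-by-row logic as in the article, applied to the toy data
for (i in seq_len(nrow(toy))) {
  row_sum <- toy[i, "col1"] + toy[i, "col2"] + toy[i, "col3"] + toy[i, "col4"]
  if (row_sum > 4) {
    toy[i, 5] <- "greater_than_4"
  } else {
    toy[i, 5] <- "lesser_than_4"
  }
}

print(toy)
```

The first and third rows should come out as "greater_than_4" and the second as "lesser_than_4".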

All subsequent timings were measured on Mac OS X with a 2.6 GHz processor and 8 GB of RAM.

Vectorize and pre-allocate data structures.


Always initialize your data structures and output variables with the required length and data type before starting the loop. Avoid growing your data incrementally inside the loop. Let's compare how vectorization improves speed on data sizes from 1,000 to 100,000 rows.
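    # After vectorization and pre-allocation
    output <- character(nrow(df))  # initialize the output vector
    system.time({
      for (i in 1:nrow(df)) {
        if ((df[i, 'col1'] + df[i, 'col2'] + df[i, 'col3'] + df[i, 'col4']) > 4) {
          output[i] <- "greater_than_4"
        } else {
          output[i] <- "lesser_than_4"
        }
      }
      df$output <- output  # attach the result to the data frame once, after the loop
    })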


[Figure: original code vs. vectorized code]
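
As a minimal sketch of why growing data inside a loop is discouraged, the snippet below builds the same character vector twice: once by appending with c() on every iteration, and once by filling a pre-allocated vector. The size `n` and the names `grown`, `filled`, `grow_time` and `prealloc_time` are illustrative assumptions, not taken from the article, and the timings will vary by machine.

```r
n <- 1e4  # illustrative size, much smaller than the article's data

# Growing: c() builds a new, longer vector on every iteration
grow_time <- system.time({
  grown <- character(0)
  for (i in seq_len(n)) {
    grown <- c(grown, "lesser_than_4")
  }
})

# Pre-allocating: the vector has its final length before the loop starts
prealloc_time <- system.time({
  filled <- character(n)
  for (i in seq_len(n)) {
    filled[i] <- "lesser_than_4"
  }
})

print(grow_time)
print(prealloc_time)
identical(grown, filled)  # both approaches produce the same vector
```

Both loops produce the same result; only the way memory is obtained differs, which is where the time goes as `n` grows.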

Move the condition check outside the loop.


Moving the condition check outside the loop yields a gain comparable to that of vectorization itself. The tests were run on data from 100,000 to 1,000,000 rows. The speed gain is again huge.
    # After vectorization and pre-allocation, with the condition computed outside the loop
    output <- character(nrow(df))
    condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4  # condition check outside the loop
    system.time({
      for (i in 1:nrow(df)) {
        if (condition[i]) {
          output[i] <- "greater_than_4"
        } else {
          output[i] <- "lesser_than_4"
        }
      }
      df$output <- output
    })


[Figure: condition checked outside the loop]
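
The following sketch checks, on a small sample, that the condition computed once on whole columns matches the condition computed row by row. The data frame `small`, the seed and the sample size are assumptions made for this illustration; the column definitions mirror the article's setup.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
small <- data.frame(col1 = runif(100, 0, 2),
                    col2 = rnorm(100, 0, 2),
                    col3 = rpois(100, 3),
                    col4 = rchisq(100, 2))

# Vectorized: one expression over whole columns, evaluated once
condition <- (small$col1 + small$col2 + small$col3 + small$col4) > 4

# Row by row: the same check repeated inside a loop
row_by_row <- logical(nrow(small))
for (i in seq_len(nrow(small))) {
  row_by_row[i] <- (small[i, "col1"] + small[i, "col2"] +
                    small[i, "col3"] + small[i, "col4"]) > 4
}

identical(condition, row_by_row)  # the two logical vectors should match
```

Since the results agree, the loop only needs to look up `condition[i]` instead of indexing the data frame four times per row.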

Run the loop for true conditions only.


Another optimization is to run the loop only over the rows where the condition is true, after first initializing the output vector with the value for the false case. The speed-up here depends heavily on the share of TRUE cases in your data.

The tests compare the performance of this and the previous improvement on data from 1,000,000 to 10,000,000 rows. Note the extra zeros in the row counts. As expected, there is a consistent and noticeable improvement.
    # Initialize the output with the value for the false case
    output <- rep("lesser_than_4", nrow(df))
    condition <- (df$col1 + df$col2 + df$col3 + df$col4) > 4
    system.time({
      for (i in (1:nrow(df))[condition]) {  # run the loop only for true conditions
        output[i] <- "greater_than_4"
      }
      df$output <- output
    })


[Figure: loop run only on true conditions]
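
Because the gain depends on how many rows satisfy the condition, it can help to measure that share before choosing this approach. The sketch below does so on a small sample; the seed, the size `n` and the names `true_share` and `loop_index` are illustrative assumptions.

```r
set.seed(42)  # arbitrary seed, only for reproducibility of this sketch
n <- 1000     # illustrative size, much smaller than in the article

# Same four column definitions as in the article, on a smaller scale
col1 <- runif(n, 0, 2)
col2 <- rnorm(n, 0, 2)
col3 <- rpois(n, 3)
col4 <- rchisq(n, 2)

condition <- (col1 + col2 + col3 + col4) > 4

true_share <- mean(condition)        # fraction of rows where the condition is TRUE
loop_index <- seq_len(n)[condition]  # the indices the loop would actually visit

cat("Share of TRUE rows:", true_share, "\n")
cat("Iterations needed:", length(loop_index), "out of", n, "\n")
```

The lower the share of TRUE rows, the fewer iterations remain, and the more this optimization should pay off.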

Use ifelse() where possible


This logic can be written much faster and more simply with ifelse(). The syntax is similar to the IF function in MS Excel, but the speed-up is phenomenal, especially considering that there is no pre-allocation here and the condition is evaluated inside the timed block rather than beforehand. This looks like a very worthwhile way to speed up simple loops.
    system.time({
      output <- ifelse((df$col1 + df$col2 + df$col3 + df$col4) > 4,
                       "greater_than_4", "lesser_than_4")
      df$output <- output
    })


[Figure: true conditions only vs. ifelse]
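
As a quick sanity check, the sketch below confirms on a small sample that ifelse() gives the same labels as the loop that runs over true conditions only. The data frame `small`, the seed and the names `from_ifelse` and `from_loop` are assumptions introduced for this illustration.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible
small <- data.frame(col1 = runif(200, 0, 2),
                    col2 = rnorm(200, 0, 2),
                    col3 = rpois(200, 3),
                    col4 = rchisq(200, 2))

condition <- (small$col1 + small$col2 + small$col3 + small$col4) > 4

# Vectorized version: one call, no explicit loop
from_ifelse <- ifelse(condition, "greater_than_4", "lesser_than_4")

# Loop version: default to the false label, overwrite only the TRUE rows
from_loop <- rep("lesser_than_4", nrow(small))
for (i in seq_len(nrow(small))[condition]) {
  from_loop[i] <- "greater_than_4"
}

identical(from_ifelse, from_loop)  # both approaches should agree
table(from_ifelse)                 # counts per label
```

ifelse() evaluates the test for the whole vector at once and picks the label element by element, so no index bookkeeping is needed.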

Source: https://habr.com/ru/post/277681/

