
Fast loading data from files in R

Recently we wrote a Shiny application that needed to work with a very large block of data (a dataframe). This directly affected the launch time of the application, so we had to consider a number of ways to read data from files in R (in our case, csv files provided by the customer) and determine the best one.

The purpose of this note is to compare:

  1. read.csv from utils , the standard way to read csv files in R,
  2. read_csv from readr , the RStudio (tidyverse) replacement for the previous method,
  3. load and readRDS from base , and
  4. read_feather from feather and fread from data.table .
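Only utils and base ship with R itself; the remaining packages come from CRAN. A one-time setup along these lines is assumed (package names as listed above, plus stringi and microbenchmark used later):

```r
# One-time setup: install the non-base packages used in this note.
# utils and base ship with R; the rest are installed from CRAN.
install.packages(c("readr", "data.table", "feather",
                   "stringi", "microbenchmark"))
```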

Data


First, let's generate some random data:
    set.seed(123)
    df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, rep = TRUE)),
                     replicate(10, stringi::stri_rand_strings(1000, 5)))

and save the files to disk to estimate their load times. In addition to the csv format, we also need feather , RDS , and RData files.

    path_csv <- '../assets/data/fast_load/df.csv'
    path_feather <- '../assets/data/fast_load/df.feather'
    path_rdata <- '../assets/data/fast_load/df.RData'
    path_rds <- '../assets/data/fast_load/df.rds'

    library(feather)
    library(data.table)

    write.csv(df, file = path_csv, row.names = F)
    write_feather(df, path_feather)
    save(df, file = path_rdata)
    saveRDS(df, path_rds)

Now check the file sizes:

    files <- c('../assets/data/fast_load/df.csv',
               '../assets/data/fast_load/df.feather',
               '../assets/data/fast_load/df.RData',
               '../assets/data/fast_load/df.rds')
    info <- file.info(files)
    info$size_mb <- info$size / (1024 * 1024)
    print(subset(info, select = c("size_mb")))
    ##                                       size_mb
    ## ../assets/data/fast_load/df.csv     1780.3005
    ## ../assets/data/fast_load/df.feather 1145.2881
    ## ../assets/data/fast_load/df.RData    285.4836
    ## ../assets/data/fast_load/df.rds      285.4837

As you can see, both the csv and feather files take up much more disk space: csv about 6 times, and feather more than 4 times as much as RDS and RData .
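Those ratios follow directly from the sizes reported above; a quick sanity check in base R:

```r
# File sizes in MB, taken from the output above.
sizes <- c(csv = 1780.3005, feather = 1145.2881, rds = 285.4837)

# How many times larger each format is compared to RDS.
ratios <- round(sizes / sizes["rds"], 1)
print(ratios)  # csv ~6.2x, feather ~4x, rds 1x
```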

Performance test


The microbenchmark library was used to compare the reading times of the methods over 10 rounds each:


    library(microbenchmark)
    benchmark <- microbenchmark(
      readCSV = utils::read.csv(path_csv),
      readrCSV = readr::read_csv(path_csv, progress = F),
      fread = data.table::fread(path_csv, showProgress = F),
      loadRdata = base::load(path_rdata),
      readRds = base::readRDS(path_rds),
      readFeather = feather::read_feather(path_feather),
      times = 10
    )
    print(benchmark, signif = 2)
    ## Unit: seconds
    ##         expr   min    lq       mean median    uq   max neval
    ##      readCSV 200.0 200.0 211.187125  210.0 220.0 240.0    10
    ##     readrCSV  27.0  28.0  29.770890   29.0  32.0  33.0    10
    ##        fread  15.0  16.0  17.250016   17.0  17.0  22.0    10
    ##    loadRdata   4.4   4.7   5.018918    4.8   5.5   5.9    10
    ##      readRds   4.6   4.7   5.053674    5.1   5.3   5.6    10
    ##  readFeather   1.5   1.8   2.988021    3.4   3.6   4.1    10

And the winner is ... feather ! However, using feather requires first converting your files into this format.

Using load or readRDS also improves performance (second and third fastest), and storing a small, compressed file is an additional advantage. In both cases, you first need to convert your file to the appropriate format.

As for reading the csv format directly, fread significantly outperforms read_csv and read.csv , making it the best option for reading a csv file.

In our case, we decided to work with the feather file, since the conversion from csv to this format was a one-time step and we had no strict limit on file size, so we did not consider RDS or RData .

The final sequence of actions was:

  1. reading the csv file provided by the customer with fread ,
  2. writing it to a feather file with write_feather , and
  3. loading the feather file at application startup with read_feather .

The first two steps were performed once, outside the context of the Shiny application.
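Put together, the one-time preparation and the in-app load can be sketched as follows. The function names prepare_feather and load_app_data are ours, not part of any package, and the paths are placeholders:

```r
# One-time preparation, run outside the Shiny app
# (requires the data.table and feather packages):
# 1. read the customer's csv quickly with fread,
# 2. persist it in the feather format.
prepare_feather <- function(csv_path, feather_path) {
  df <- data.table::fread(csv_path, showProgress = FALSE)
  feather::write_feather(df, feather_path)
}

# Inside the Shiny app, at startup: load the pre-converted file.
load_app_data <- function(feather_path) {
  feather::read_feather(feather_path)
}
```

prepare_feather corresponds to steps 1 and 2 above, load_app_data to step 3.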

There is another interesting benchmark of file reading in R. Unfortunately, with the functions mentioned there you get objects with string-type columns, and you have to do some processing of the string data before you can work with the most widely and frequently used dataframe .

Source: https://habr.com/ru/post/326616/