As an exclusive, we present the full version of an article from Hacker magazine devoted to development in R. Below you will learn how to squeeze maximum speed out of working with tabular data in R.
Note: the author's spelling and punctuation are preserved.
Why waste words? You are reading an article about speed, so let's get straight to the point! If your project works with a large amount of data and more time is spent transforming tables than you would like, data.table will help solve this problem. The article will be interesting to those who are already a little familiar with the R language, as well as to developers who use it actively but have not yet discovered the data.table package.
Everything you need for today's article can be installed with the appropriate functions:
install.packages("data.table") install.packages("dplyr") install.packages("readr")
In recent years, the R language has been deservedly gaining popularity in the machine learning community. As a rule, work in this subfield of artificial intelligence requires loading data from several sources, transforming it to obtain a training sample, building a model on that sample, and then using the model for predictions.
In words everything is simple, but in real life forming a "good" and stable model takes many attempts, most of which can be complete dead ends. The R language helps simplify the process of creating such models, since it is an effective tool for analyzing tabular data. For working with them, R has the built-in data.frame type and a huge number of algorithms and models that actively use it. Besides, the whole power of R lies in the ability to extend the basic functionality with third-party packages; at the time of writing, their number in the official repository had reached 8914.
But, as they say, there is no limit to perfection. A large number of packages make it easier to work with the data.frame type. Usually their goal is to simplify the syntax of the most common tasks. One cannot fail to mention the dplyr package, which has already become the de facto standard for working with data.frame, since thanks to it the readability and usability of code for working with tables have increased significantly.
Let's move from theory to practice and create a data.frame named DF with columns a, b and c.
DF <- data.frame(a = sample(1:10, 100, replace = TRUE),  # random numbers from 1 to 10
                 b = sample(1:5, 100, replace = TRUE),   # random numbers from 1 to 5
                 c = 100:1)                              # numbers from 100 down to 1
Suppose we want to:
- select only the columns a and c;
- keep only the rows where a = 2 and c > 10;
- create a column ac equal to the sum of a and c;
- write the result to DF2.

The basic syntax on a pure data.frame will be:
DF2 <- DF[DF$a == 2 & DF$c > 10, c("a", "c")]  # filter the rows and select the columns
DF2$ac <- DF2$a + DF2$c                        # add the ac column
With dplyr everything is much clearer:
library(dplyr)  # load dplyr
DF2 <- DF %>%
  select(a, c) %>%
  filter(a == 2, c > 10) %>%
  mutate(ac = a + c)
The same steps, but with comments:
DF2 <-                        # the result will be written to DF2
  DF %>%                      # take DF and send it down the pipeline (%>%)
  select(a, c) %>%            # select the columns a and c and pass them on (%>%)
  filter(a == 2, c > 10) %>%  # filter the rows and pass them on (%>%)
  mutate(ac = a + c)          # create the column ac equal to the sum of a and c
There is an alternative approach to working with tables: data.table. Formally, a data.table is also a data.frame, and it can be used with existing functions and packages that often know nothing about data.table and work exclusively with data.frame. This "improved" data.frame performs many typical tasks several times faster than its progenitor. A legitimate question arises: where is the catch? The catch in data.table is its syntax, which differs greatly from the original. Moreover, while dplyr makes code easier to understand from the very first seconds of use, data.table turns code into black magic, and only years of studying books of witchcraft (in fact, just a few days of practice with data.table) will allow you to fully understand the idea of the new syntax and the way it simplifies code.
To work with data.table you need to load its package.

library(data.table)  # load data.table
In further examples, these calls will be omitted and it will be assumed that the package is already loaded.
Since data is very often loaded from CSV files, data.table can surprise you already at this stage. To get more measurable estimates, let's take a fairly large CSV file. As an example, we can use the data from one of the recent competitions on Kaggle, where you will find a 1.27 GB training CSV file. The file structure is very simple:
- row_id: event identifier;
- x, y: coordinates;
- accuracy: accuracy;
- time: time;
- place_id: the organization identifier.

Let's try base R's read.csv and measure how long it takes to load this file (for this we turn to the system.time function):
system.time(
  train_DF <- read.csv("train.csv")
)
Runtime: 461.349 seconds. Enough time to go get a coffee... Even if you don't plan to use data.table in the future, still try to use the built-in CSV reading functions less often. There is a good readr library in which everything is implemented much more efficiently than in the base functions. Let's see it in action and load the package.
library(readr)
Next, we use its function for loading data from CSV:
system.time(
  train_DF <- read_csv("train.csv")
)
Runtime: 38.067 seconds, much faster than the previous result! Now let's see what data.table is capable of:
system.time(
  train_DT <- fread("train.csv")
)
The execution time is 20.906 seconds, almost two times faster than readr and more than twenty times faster than the base method.
In our example, the difference in loading speed between the methods turned out to be quite large. Within each method, the time depends linearly on the file size, but the difference in speed between the methods depends strongly on the file structure (the number and types of columns). Below are test measurements of file loading times.
For a file of three text columns, fread shows a clear advantage. If, however, the columns being read are numeric rather than textual, the difference between fread and read_csv is less noticeable.
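To get a feel for these numbers on your own machine, here is a minimal, self-contained benchmark sketch. The row count, the column mix and the file name bench.csv are arbitrary assumptions for illustration, not the setup used for the measurements above.

library(data.table)
library(readr)

# generate a synthetic CSV with two numeric columns and one text column
n <- 1e6
bench_df <- data.frame(x = runif(n),
                       y = runif(n),
                       z = sample(letters, n, replace = TRUE))
write.csv(bench_df, "bench.csv", row.names = FALSE)

system.time(read.csv("bench.csv"))   # base R
system.time(read_csv("bench.csv"))   # readr
system.time(fread("bench.csv"))      # data.table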
If after loading data from a file you intend to keep working with data.table, then fread returns one right away. With other methods of loading data, you will need to make a data.table out of the data.frame, though that is simple:
train_DF                          # the loaded data.frame
train_DT <- data.table(train_DF)  # create a data.table from the data.frame
Most of the speed optimizations in data.table are achieved by working with objects by reference: no additional copies of objects are created in memory, which saves both time and resources.
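A quick way to see this for yourself is data.table's address() helper, which returns the memory address of an object. A minimal sketch (the object name DT_demo is made up for illustration):

DT_demo <- data.table(x = 1:5)
address(DT_demo)       # e.g. "0x55d2c0a3b2e0"
DT_demo[, y := x * 2]  # add a column by reference
address(DT_demo)       # the same address: no copy was created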
For example, the same task of creating a data.table from a data.frame can be solved with a single command that "upgrades" the existing object in place, but remember that the initial value of the variable is lost.
train_DF         # a data.frame
setDT(train_DF)  # train_DF is now a data.table
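A small sketch (with a made-up DF_demo) showing that setDT() really converts in place: the class of the original variable itself changes, and no new copy appears.

DF_demo <- data.frame(x = 1:3)
class(DF_demo)   # "data.frame"
setDT(DF_demo)   # convert by reference
class(DF_demo)   # "data.table" "data.frame"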
So, we have loaded the data; now it's time to work with it. We will assume that the variable DT already holds a loaded data.table. The authors of the package describe a call in terms of three main blocks, DT[i, j, by]:
- block i filters the rows of the data.table;
- block j processes its contents: selecting, computing and modifying columns;
- block by groups the data.
Let's recall the very first example, where we used the data.frame DF, and test the various blocks on it. We start by creating a data.table from the data.frame:
DT <- data.table(DF)  # create a data.table from the data.frame
Block i is the most understandable of the three. It serves to filter the rows of the data.table, and if nothing more is required, the remaining blocks can simply be omitted.
DT[a == 2]           # rows where a == 2
DT[a == 2 & c > 10]  # rows where a == 2 and c > 10
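Block i is not limited to logical conditions. For example, it also accepts ordering expressions, and calls can be chained; a short sketch on the same DT:

DT[order(-c)]    # rows of DT sorted by column c in descending order
DT[a == 2][1:3]  # chaining: filter first, then take the first three rows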
Block j processes the contents of the data.table after the rows have been filtered. You can simply ask it to return the necessary columns by specifying them in a list. For convenience, a synonym for list was introduced in the form of a dot (that is, list(a, b) is equivalent to .(a, b)). All columns existing in the data.table are available as "variables": you do not need to refer to them as strings, and editor autocompletion works.
DT[, list(a, c)]  # return the columns a and c
DT[, .(a, c)]     # the same, written with the dot synonym
You can also specify additional columns you want to create and assign the required values to them:
DT[, .(a, c, ac = a+c)]
Combining all of this, we can perform the first task, which we earlier solved in different ways:

DT2 <- DT[a == 2 & c > 10, .(a, c, ac = a + c)]
Selecting columns is only part of what block j can do. It can also modify the existing data.table. For example, if we want to add a new column to the existing data.table rather than to a new copy (as in the previous example), we can do it with the special := operator.
DT2[, ac_mult2 := ac * 2]  # add the column ac_mult2 = ac * 2 to DT2 by reference
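The := operator can also create several columns in one call, either with a character vector of names on the left or in its functional form. A sketch; the columns ac_half and ac_sq are made up for illustration:

DT2[, c("ac_half", "ac_sq") := .(ac / 2, ac^2)]  # two columns at once
DT2[, `:=`(ac_half = ac / 2, ac_sq = ac^2)]      # equivalent functional form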
Using the same operator, you can delete columns by assigning NULL to them.
DT2[, ac_mult2 := NULL]  # delete the ac_mult2 column from DT2
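Several columns can be removed in one call in the same way; this sketch deletes the two illustration columns created above:

DT2[, c("ac_half", "ac_sq") := NULL]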
Working with objects by reference is great: it saves resources and is much faster, since we avoid creating copies of the same tables with different columns. But we must understand that a change by reference changes the object itself. If you need a copy of the data in another variable, you must explicitly state that it is a separate copy and not a reference to the same object.
Consider an example:
DT3 <- DT2
DT3[, ac_mult2 := ac * 2]  # add the ac_mult2 column again
It may seem that we changed only DT3, but DT2 and DT3 are one and the same object, and if we look at DT2 we will see the new column there too. This concerns not only creating and deleting columns: data.table also uses references for sorting, so a call to setorder(DT3, "a") will affect DT2 as well.
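A minimal sketch of this pitfall (the names DT_a and DT_b are made up for illustration):

DT_a <- data.table(x = c(3, 1, 2))
DT_b <- DT_a        # DT_b is just another name for the same object
setorder(DT_a, x)   # sort DT_a by reference
DT_b                # DT_b is sorted too: x is now 1, 2, 3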
To create a real copy, use the copy function:
DT3 <- copy(DT2)
DT3[, ac_mult2 := NULL]
Now DT2 and DT3 are different objects, and we deleted the column only from DT3.
Block by groups data, like group_by from the dplyr package or GROUP BY in the SQL query language. The logic of a data.table call with grouping is as follows: the rows are first filtered by block i, then split into groups according to block by, block j is computed within each group, and the results are combined into a single data.table.
The block is filled in as by=list(grouping columns), but, as in block j, list can be replaced by a dot, that is, by=list(a, b) is equivalent to by=.(a, b). If you need to group by just one field, the list can be omitted and you can write by=a directly:
DT[, .(max = max(c)), by = .(a, b)]  # group by a and b, compute the column max as the maximum of c
DT[, .(max = max(c)), by = a]        # group by a only, compute the column max as the maximum of c
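All three blocks also work together: rows are filtered in i, grouped by by, and aggregated in j. A small sketch on the same DT; the column names mean_c and n are made up, and .N (the number of rows in the current group) is covered again below:

DT[c > 10,                       # filter: only rows where c > 10
   .(mean_c = mean(c), n = .N),  # per group: mean of c and the row count
   by = a]                       # group by column a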
The most common mistake among those learning data.table is to use familiar data.frame constructs on a data.table. This is a very sore point, and you can spend a lot of time hunting for errors. If the variables DF2 (a data.frame) and DT2 (a data.table) contain exactly the same data, these calls return completely different results:
DF2[1:5, 1:2]
##   a  c
## 1 2 95
## 2 2 94
## 3 2 92
## 4 2 80
## 5 2 65

DT2[1:5, 1:2]
## [1] 1 2
The reason is very simple:
- in data.frame logic, DF2[1:5, 1:2] means: take the first five rows and return the values of the first two columns for them;
- in data.table logic, DT2[1:5, 1:2] means: take the first five rows and pass 1:2 into block j, and block j will simply return the vector 1 2.

If you need a data.table to behave in the data.frame manner here, you must say so explicitly with an additional parameter:
DT2[1:5, 1:2, with = FALSE]
##    a  c
## 1: 2 95
## 2: 2 94
## 3: 2 92
## 4: 2 80
## 5: 2 65
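Alternatively, and more idiomatically, you can simply name the desired columns in block j, which needs no extra parameter; a small sketch:

DT2[1:5, .(a, c)]  # the same five rows, with columns selected the data.table way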
Let's make sure that learning this syntax is worth it, and go back to the example with the large CSV file. The data.frame was loaded into train_DF, and the data.table into train_DT, respectively.
In the example used, place_id is a long integer (integer64), but only fread "guessed" this. The other methods loaded this field as a floating-point number, so to compare speeds we need to explicitly convert the place_id field inside train_DF.
install.packages("bit64") # integer64 library(bit64) train_DF$place_id <- as.integer64(train_DF$place_id)
Suppose our task is to count the number of occurrences of each place_id in the data.
With dplyr on a plain data.frame it took 13.751 seconds:
system.time(
  count <- train_DF %>%          # take train_DF
    group_by(place_id) %>%       # group by place_id
    summarise(length(place_id))  # count the rows in each group
)
At the same time, data.table
does the same in 2.578 seconds:
system.time(
  count2 <- train_DT[, .(.N), by = place_id]  # .N is the number of rows in the group
)
Let's complicate the task: for every place_id, count the number of rows and the medians of x and y, and then sort by the count in descending order. data.frame with dplyr did it in 27.386 seconds:
system.time(
  count <- train_DF %>%                  # take train_DF
    group_by(place_id) %>%               # group by place_id
    summarise(count = length(place_id),  # count the rows
              mx = median(x),            # median of x
              my = median(y)) %>%        # median of y
    arrange(-count)                      # sort by count in descending order
)
data.table managed it much faster, in 12.414 seconds:
system.time(
  count2 <- train_DT[, .(count = .N,
                         mx = median(x),
                         my = median(y)),
                     by = place_id][order(-count)]
)
Test run times for simple data grouping with dplyr and data.table:
This is only a superficial description of data.table's functionality, but it is enough to start using the package. The dtplyr package is under development; it is positioned as a dplyr implementation for data.table, but so far it is still very young (version 0.0.1). In any case, understanding how data.table itself works is necessary before using additional "wrappers".
Stanislav Chistyakov is an expert on cloud technologies and machine learning.
I strongly advise reading the articles (vignettes) included with the package:
This is not the first time the R language has come up in our magazine. Here are a couple of links to related articles:
Source: https://habr.com/ru/post/316032/