data.table is an extension of R's data.frame. Those who use R for fast aggregation of large data sets (up to about 100 GB of data in RAM) can hardly do without this package. data.table is flexible and fast: it is easy and convenient to use, and programs that rely on it are written fairly quickly. The package is widely known among R programmers: it is downloaded more than 400 thousand times a month and is used in almost 650 CRAN and Bioconductor packages (source).
datatable is a Python package that is an analogue of data.table from the R world. Like its namesake, the datatable package is clearly focused on processing large data sets. It offers high performance both when working with data that fits entirely in RAM and with data that is larger than the available RAM, and it supports multithreaded data processing. In general, the datatable package may well be called the younger brother of data.table.

The datatable Python module was created to solve exactly this problem. It is a set of tools for performing operations on large (up to 100 GB) volumes of data on a single computer at the highest possible speed. Development of datatable is sponsored by H2O.ai, and the first user of the package is Driverless.ai.
The developers of datatable also strive to make the package convenient to work with; in particular, it offers a powerful API and well-thought-out error messages. In this article we will look at how to use datatable and how it compares with pandas when processing large data sets.
datatable can be easily installed using pip:

pip install datatable

# Python 3.5
pip install https://s3.amazonaws.com/h2o-release/datatable/stable/datatable-0.8.0/datatable-0.8.0-cp35-cp35m-linux_x86_64.whl

# Python 3.6
pip install https://s3.amazonaws.com/h2o-release/datatable/stable/datatable-0.8.0/datatable-0.8.0-cp36-cp36m-linux_x86_64.whl
At the moment datatable does not work under Windows, but work is being done in this direction, so Windows support is only a matter of time. The datatable documentation can be found here.

Let's import the libraries we will need:

import numpy as np
import pandas as pd
import datatable as dt
The basic unit of analysis in datatable is the Frame object. It is the same idea as a DataFrame from pandas or an SQL table: data organized as a two-dimensional array in which rows and columns can be distinguished. Let's load a large CSV file with fread():

%%time
datatable_df = dt.fread("data.csv")
____________________________________________________________________
CPU times: user 30 s, sys: 3.39 s, total: 33.4 s
Wall time: 23.6 s
The fread() function is a powerful and very fast mechanism. It can automatically detect and handle the parsing parameters for the vast majority of text files, load data from .zip archives and from Excel files, retrieve data from URLs, and do much more.
Now let's use pandas to read the same file:

%%time
pandas_df = pd.read_csv("data.csv")
___________________________________________________________
CPU times: user 47.5 s, sys: 12.1 s, total: 59.6 s
Wall time: 1min 4s
As you can see, datatable is clearly faster than pandas when reading large data sets: pandas in our experiment takes more than a minute, while the time required by datatable is measured in seconds.

A Frame object from the datatable package can be converted to a numpy array or to a pandas DataFrame. This is done like this:

numpy_df = datatable_df.to_numpy()
pandas_df = datatable_df.to_pandas()
Let's convert our datatable Frame object to a pandas DataFrame and see how long it takes:

%%time
datatable_pandas = datatable_df.to_pandas()
___________________________________________________________________
CPU times: user 17.1 s, sys: 4 s, total: 21.1 s
Wall time: 21.4 s
It turns out that reading the data into a datatable Frame object and then converting it to a pandas DataFrame takes less time than loading the data into a DataFrame with pandas directly. Therefore, if you plan to process a large data set with pandas, it may be better to load it with datatable tools and then convert it to a DataFrame:

type(datatable_pandas)
___________________________________________________________________
pandas.core.frame.DataFrame
Let's look at the basic properties of the datatable Frame object. They are very similar to the corresponding properties of the pandas DataFrame object:

print(datatable_df.shape)       # (number of rows, number of columns)
print(datatable_df.names[:5])   # names of the first 5 columns
print(datatable_df.stypes[:5])  # types of the first 5 columns
______________________________________________________________
(2260668, 145)
('id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv')
(stype.bool8, stype.bool8, stype.int32, stype.int32, stype.float64)
The head() method is also available, which returns the first n rows of the frame:

datatable_df.head(10)
Calculating summary statistics in pandas is an operation that requires a lot of memory. In the case of datatable this is not so. Here are the commands you can use to calculate various indicators in datatable:

datatable_df.sum()      datatable_df.nunique()
datatable_df.sd()       datatable_df.max()
datatable_df.mode()     datatable_df.min()
datatable_df.nmodal()   datatable_df.mean()
Let's find the mean values with both datatable and pandas and compare how long the operation takes:

%%time
datatable_df.mean()
_______________________________________________________________
CPU times: user 5.11 s, sys: 51.8 ms, total: 5.16 s
Wall time: 1.43 s

pandas_df.mean()
__________________________________________________________________
Throws memory error.

With pandas we could not get a result at all: a memory error was raised.
Frame and DataFrame are data structures representing tables. In datatable, square brackets are used to perform data manipulations. This is reminiscent of how ordinary matrices are handled, but the square brackets here offer additional capabilities. The DT[i, j] notation is used; similar constructs can be found in C, C++ and R, in the pandas and numpy packages, and in many other technologies. Let's consider how common data manipulations are performed in datatable.
Selecting the funded_amnt column:

datatable_df[:, 'funded_amnt']

Selecting the first 5 rows and 3 columns:

datatable_df[:5, :3]
Sorting a frame with datatable:

%%time
datatable_df.sort('funded_amnt_inv')
_________________________________________________________________
CPU times: user 534 ms, sys: 67.9 ms, total: 602 ms
Wall time: 179 ms

The same with pandas:

%%time
pandas_df.sort_values(by='funded_amnt_inv')
___________________________________________________________________
CPU times: user 8.76 s, sys: 2.87 s, total: 11.6 s
Wall time: 12.4 s

Note the significant difference in timings between datatable and pandas.

Here is how to delete the member_id column:

del datatable_df[:, 'member_id']
datatable, like pandas, supports grouping data. Let's look at how to get the sum of the funded_amnt column with the data grouped by the grade column:

%%time
for i in range(100):
    datatable_df[:, dt.sum(dt.f.funded_amnt), dt.by(dt.f.grade)]
____________________________________________________________________
CPU times: user 6.41 s, sys: 1.34 s, total: 7.76 s
Wall time: 2.42 s

Note the .f construct here. It is the so-called frame proxy: a simple mechanism that allows you to refer to the Frame object on which the operation is currently being performed. In our case, dt.f is the same as datatable_df.

Now the same operation with pandas:

%%time
for i in range(100):
    pandas_df.groupby("grade")["funded_amnt"].sum()
____________________________________________________________________
CPU times: user 12.9 s, sys: 859 ms, total: 13.7 s
Wall time: 13.9 s
Filtering rows looks similar to slicing. Let's select from the loan_amnt column the rows for which the value of loan_amnt is greater than funded_amnt:

datatable_df[dt.f.loan_amnt > dt.f.funded_amnt, "loan_amnt"]
The contents of a Frame object can be written to a CSV file, which allows you to use the data in the future. This is done like this:

datatable_df.to_csv('output.csv')
You can learn about other datatable methods for working with data here.

The datatable Python module is definitely faster than pandas on most operations. It is also a real find for those who need to process very large data sets. So far, the only minus of datatable in comparison with pandas is the amount of functionality. However, active work is underway on datatable, so it is quite possible that in the future it will surpass pandas in all directions.

Dear readers! Do you plan to use the datatable package in your projects?

Source: https://habr.com/ru/post/455507/