
The data.table package is an extension of R's data.frame. Those who use R for fast aggregation of large data sets (on the order of 100 GB in RAM) can hardly do without it. data.table is very flexible and fast; it is easy and convenient to use, and programs that rely on it are written quickly. The package is widely known among R programmers: it is downloaded more than 400 thousand times a month and is used in almost 650 CRAN and Bioconductor packages (source). This article is about the datatable Python package, an analogue of data.table from the R world. datatable is clearly focused on processing large data sets. It offers high performance both when the data fits entirely in RAM and when the data is larger than the available RAM, and it supports multithreaded data processing. In general, datatable may well be called the younger brother of data.table.
The datatable Python module was created to solve this problem. It is a set of tools for performing operations on large (up to 100 GB) volumes of data on a single computer at the highest possible speed. Development of datatable is sponsored by H2O.ai, and the first user of the package is Driverless.ai. The authors of datatable also strive to make the package convenient to work with: in particular, it has a powerful API and well-thought-out error messages. In this article we will talk about how to use datatable and how it compares with pandas when processing large data sets.

datatable can be easily installed using pip:

pip install datatable

# Python 3.5
pip install https://s3.amazonaws.com/h2o-release/datatable/stable/datatable-0.8.0/datatable-0.8.0-cp35-cp35m-linux_x86_64.whl
# Python 3.6
pip install https://s3.amazonaws.com/h2o-release/datatable/stable/datatable-0.8.0/datatable-0.8.0-cp36-cp36m-linux_x86_64.whl

At the moment datatable does not work under Windows, but work is being done in this direction, so Windows support is only a matter of time. More details about datatable can be found here.

First, import the datatable library:

import numpy as np
import pandas as pd
import datatable as dt

The basic unit of analysis in datatable is the Frame object. It is the analogue of a pandas DataFrame or an SQL table: data organized as a two-dimensional array with rows and columns. Let's load a large CSV file and time the operation:

%%time
datatable_df = dt.fread("data.csv")
____________________________________________________________________
CPU times: user 30 s, sys: 3.39 s, total: 33.4 s
Wall time: 23.6 s

The fread() function is a powerful and very fast mechanism. It can automatically detect and handle parsing parameters for the vast majority of text files, load data from .zip archives and from Excel files, retrieve data from URLs, and do much more.

Now let's use pandas to read the same file.
%%time
pandas_df = pd.read_csv("data.csv")
___________________________________________________________
CPU times: user 47.5 s, sys: 12.1 s, total: 59.6 s
Wall time: 1min 4s

datatable is clearly faster than pandas when reading large data sets: pandas in our experiment takes more than a minute, while the time required for datatable is measured in seconds.

A Frame object from datatable can be converted to a numpy array or a pandas DataFrame. This is done like this:

numpy_df = datatable_df.to_numpy()
pandas_df = datatable_df.to_pandas()

Let's convert our Frame to a pandas DataFrame and see how long it takes.

%%time
datatable_pandas = datatable_df.to_pandas()
___________________________________________________________________
CPU times: user 17.1 s, sys: 4 s, total: 21.1 s
Wall time: 21.4 s

Loading data into a Frame and then converting it to a pandas DataFrame takes less time than loading the data directly into a DataFrame with pandas. Therefore, if you plan to process a large data set with pandas, it may be better to load it with datatable and then convert it to a DataFrame.

type(datatable_pandas)
___________________________________________________________________
pandas.core.frame.DataFrame

Let's look at some basic properties of the Frame object. They are very similar to the corresponding properties of a pandas DataFrame:

print(datatable_df.shape)       # (rows, columns)
print(datatable_df.names[:5])   # first 5 column names
print(datatable_df.stypes[:5])  # storage types of the first 5 columns
______________________________________________________________
(2260668, 145)
('id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv')
(stype.bool8, stype.bool8, stype.int32, stype.int32, stype.float64)

The head() method is also available; it returns the first n rows of the frame:

datatable_df.head(10)
Computing aggregate statistics in pandas is an operation that requires a lot of memory. With datatable this is not the case. Here are the methods you can use to calculate various summary statistics in datatable:

datatable_df.sum()      datatable_df.nunique()
datatable_df.sd()       datatable_df.max()
datatable_df.mode()     datatable_df.min()
datatable_df.nmodal()   datatable_df.mean()

Let's compute the column means with both datatable and pandas and compare the time required.

%%time
datatable_df.mean()
_______________________________________________________________
CPU times: user 5.11 s, sys: 51.8 ms, total: 5.16 s
Wall time: 1.43 s

pandas_df.mean()
__________________________________________________________________
Throws memory error.

With pandas we could not get a result: a memory error was raised.

Both Frame and DataFrame are data structures representing tables. In datatable, square brackets are used to perform data manipulations. This is reminiscent of how ordinary matrices are indexed, but the square brackets here offer additional capabilities.
The DT[i, j] notation is used. Similar constructs can be found in C, C++, and R, in the pandas and numpy packages, and in many other technologies. Let's consider performing common data manipulations in datatable. Here is how to select the funded_amnt column:

datatable_df[:, 'funded_amnt']
And here is how to select the first 5 rows and 3 columns:

datatable_df[:5, :3]
Let's sort the frame by the funded_amnt_inv column and compare the timing with pandas:

%%time
datatable_df.sort('funded_amnt_inv')
_________________________________________________________________
CPU times: user 534 ms, sys: 67.9 ms, total: 602 ms
Wall time: 179 ms

%%time
pandas_df.sort_values(by='funded_amnt_inv')
___________________________________________________________________
CPU times: user 8.76 s, sys: 2.87 s, total: 11.6 s
Wall time: 12.4 s

Note the significant difference in sorting speed between datatable and pandas.

Here is how to delete the column named member_id:

del datatable_df[:, 'member_id']

datatable, like pandas, supports grouping data. Let's look at how to get the sum of the funded_amnt column with the data grouped by the grade column.

%%time
for i in range(100):
    datatable_df[:, dt.sum(dt.f.funded_amnt), dt.by(dt.f.grade)]
____________________________________________________________________
CPU times: user 6.41 s, sys: 1.34 s, total: 7.76 s
Wall time: 2.42 s

Note the .f construct here. This is the so-called frame proxy, a simple mechanism that allows you to refer to the Frame object on which some action is currently being performed. In our case, dt.f stands for datatable_df.

%%time
for i in range(100):
    pandas_df.groupby("grade")["funded_amnt"].sum()
____________________________________________________________________
CPU times: user 12.9 s, sys: 859 ms, total: 13.7 s
Wall time: 13.9 s

Here is how to filter the rows for which the value of loan_amnt is greater than funded_amnt:

datatable_df[dt.f.loan_amnt > dt.f.funded_amnt, "loan_amnt"]

A Frame object can be written to a CSV file, which allows you to use the data in the future. This is done like this:

datatable_df.to_csv('output.csv')

You can read about other datatable methods for working with data here.

The datatable Python module is definitely faster than pandas in most operations, and it is a real find for those who need to process very large data sets. So far, the only minus of datatable in comparison with pandas is its smaller amount of functionality. However, active work is underway on datatable, so it is quite possible that in the future it will surpass pandas in all directions.

Do you plan to use the datatable package in your projects?

Source: https://habr.com/ru/post/455507/