The best packages for working with data in R, part 1

There are two excellent packages for working with data in R - dplyr and data.table . Each package has its strengths. dplyr elegant and similar to natural language, while data.table concise, with it you can do a lot of things in just one line. Moreover, in some cases, data.table faster (comparative analysis is available here ), and this can determine the choice if there are memory or performance limitations. Comparing dplyr and data.table can also be read on Stack Overflow and Quora .

Here you can find a guide and a short description of data.table , and here for dplyr . You can also read tutorials on dplyr on DataScience +.

Context

I have been using dplyr and data.table for a long time to work with data. If someone is familiar with only one of the packages, it may be useful to look at the code that does the same thing, in both, to study the second.

dplyr

There are five verbs in dplyr that are designed to perform most data operations. Select - to select one or more columns. Filter - to select rows based on any criteria. Arrange - to sort data by one or more columns in ascending or descending order. Mutate - to add new columns to the data. Summarise - to select part of the data.
')

data.table

The data.table very short common format - DT [ i, j, by ], which can be interpreted as follows: take DT, select lines using i , and calculate j by grouping by .

Work with data

First, install some packages for our project.

 library(dplyr) library(data.table) library(lubridate) library(jsonlite) library(tidyr) library(ggplot2) library(compare)

We will use data from DATA.GOV . This is data on payments for claims of state medical insurance, they can be downloaded from here . Load the data in JSON format using the fromJSON package jsonlite . Since JSON is the standard data format for asynchronous interaction between the browser and the server, it is helpful to understand the code below, through which data is received. An introduction to working with JSON data with the jsonlite package can be found here and here . However, if you want to focus only on the dplyr and data.table , you can safely run the code below in two different windows and don’t go into details.

 spending=fromJSON("https://data.medicare.gov/api/views/nrth-mfg3/rows.json?accessType=DOWNLOAD") names(spending)

 "meta" "data"

 meta=spending$meta hospital_spending=data.frame(spending$data) colnames(hospital_spending)=make.names(meta$view$columns$name) hospital_spending=select(hospital_spending,-c(sid:meta)) glimpse(hospital_spending)

 Observations: 70598 Variables: $ Hospital.Name (fctr) SOUTHEAST ALABAMA MEDICAL CENT... $ Provider.Number. (fctr) 010001, 010001, 010001, 010001... $ State (fctr) AL, AL, AL, AL, AL, AL, AL, AL... $ Period (fctr) 1 to 3 days Prior to Index Hos... $ Claim.Type (fctr) Home Health Agency, Hospice, I... $ Avg.Spending.Per.Episode..Hospital. (fctr) 12, 1, 6, 160, 1, 6, 462, 0, 0... $ Avg.Spending.Per.Episode..State. (fctr) 14, 1, 6, 85, 2, 9, 492, 0, 0,... $ Avg.Spending.Per.Episode..Nation. (fctr) 13, 1, 5, 117, 2, 9, 532, 0, 0... $ Percent.of.Spending..Hospital. (fctr) 0.06, 0.01, 0.03, 0.84, 0.01, ... $ Percent.of.Spending..State. (fctr) 0.07, 0.01, 0.03, 0.46, 0.01, ... $ Percent.of.Spending..Nation. (fctr) 0.07, 0.00, 0.03, 0.58, 0.01, ... $ Measure.Start.Date (fctr) 2014-01-01T00:00:00, 2014-01-0... $ Measure.End.Date (fctr) 2014-12-31T00:00:00, 2014-12-3...

As shown above, all columns are imported as factor variables. Let's make numeric data in numbers.

 cols = 6:11; #  ,     hospital_spending[,cols] <- lapply(hospital_spending[,cols], as.numeric)

The last two columns indicate the beginning and end of the measurement. Let's use the lubridate package to fix them.

 cols = 12:13; #       hospital_spending[,cols] <- lapply(hospital_spending[,cols], ymd_hms)

Now let's make sure the columns are of the correct type.

 sapply(hospital_spending, class)

 $Hospital.Name "factor" $Provider.Number. "factor" $State "factor" $Period "factor" $Claim.Type "factor" $Avg.Spending.Per.Episode..Hospital. "numeric" $Avg.Spending.Per.Episode..State. "numeric" $Avg.Spending.Per.Episode..Nation. "numeric" $Percent.of.Spending..Hospital. "numeric" $Percent.of.Spending..State. "numeric" $Percent.of.Spending..Nation. "numeric" $Measure.Start.Date "POSIXct" "POSIXt" $Measure.End.Date "POSIXct" "POSIXt"

Create a table with data

You can create a table with data (data.table) using the data.table() function:

 hospital_spending_DT = data.table(hospital_spending) class(hospital_spending_DT)

 "data.table" "data.frame"

Select some columns

To select columns in dplyr , use the select verb. In data.table , in turn, you can set the column names.

Single variable selection

Choose the variable "Hospital Name".

 from_dplyr = select(hospital_spending, Hospital.Name) from_data_table = hospital_spending_DT[,.(Hospital.Name)]

Now you need to make sure that the results of dplyr and data.table same.

 compare(from_dplyr,from_data_table, allowAll=TRUE)

 TRUE dropped attributes

Delete one variable

 from_dplyr = select(hospital_spending, -Hospital.Name) from_data_table = hospital_spending_DT[,!c("Hospital.Name"),with=FALSE] compare(from_dplyr,from_data_table, allowAll=TRUE)

 TRUE dropped attributes

You can also take the function := , which changes the input data table (data.table) by reference.
We also use the copy() function, which creates a copy of the original object, i.e. any following operation on the data by reference to the copy will not affect the initial object.

 DT=copy(hospital_spending_DT) DT=DT[,Hospital.Name:=NULL] "Hospital.Name"%in%names(DT)

 FALSE

Similarly, you can simultaneously delete several variables:

 DT=copy(hospital_spending_DT) DT=DT[,c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date"):=NULL] c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date")%in%names(DT)

 FALSE FALSE FALSE FALSE

Selection of several variables

Let's select the variables Hospital.Name, State, Measure.Start.Date and Measure.End.Date.

 from_dplyr = select(hospital_spending, Hospital.Name,State,Measure.Start.Date,Measure.End.Date) from_data_table = hospital_spending_DT[,.(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)] compare(from_dplyr,from_data_table, allowAll=TRUE)

 TRUE dropped attributes

Delete multiple variables

Let's now remove the variables Hospital.Name, State, Measure.Start.Date and Measure.End.Date from the hospital_spending dataset and the data.table hospital_spending_DT data set.

 from_dplyr = select(hospital_spending, -c(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)) from_data_table = hospital_spending_DT[,!c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date"),with=FALSE] compare(from_dplyr,from_data_table, allowAll=TRUE)

 TRUE dropped attributes

dplyr has the contains() , starts_with() and ends_with() that can be used with the select verb. Regular expressions are allowed in data.table . As an example, select the columns containing the word "Date" in the name.

 from_dplyr = select(hospital_spending,contains("Date")) from_data_table = subset(hospital_spending_DT,select=grep("Date",names(hospital_spending_DT))) compare(from_dplyr,from_data_table, allowAll=TRUE)

 TRUE dropped attributes

 names(from_dplyr)

 "Measure.Start.Date" "Measure.End.Date"

Rename columns

 setnames(hospital_spending_DT,c("Hospital.Name", "Measure.Start.Date","Measure.End.Date"), c("Hospital","Start_Date","End_Date")) names(hospital_spending_DT)

 "Hospital" "Provider.Number." "State" "Period" "Claim.Type" "Avg.Spending.Per.Episode..Hospital." "Avg.Spending.Per.Episode..State." "Avg.Spending.Per.Episode..Nation." "Percent.of.Spending..Hospital." "Percent.of.Spending..State." "Percent.of.Spending..Nation." "Start_Date" "End_Date"

 hospital_spending = rename(hospital_spending,Hospital= Hospital.Name, Start_Date=Measure.Start.Date,End_Date=Measure.End.Date) compare(hospital_spending,hospital_spending_DT, allowAll=TRUE)

 TRUE dropped attributes

Source: https://habr.com/ru/post/301914/

All Articles