dplyr
and data.table
. Each package has its strengths. dplyr
elegant and similar to natural language, while data.table
concise, with it you can do a lot of things in just one line. Moreover, in some cases, data.table
faster (comparative analysis is available here ), and this can determine the choice if there are memory or performance limitations. Comparing dplyr
and data.table
can also be read on Stack Overflow and Quora .data.table
, and here for dplyr
. You can also read tutorials on dplyr
on DataScience +.dplyr
and data.table
for a long time to work with data. If someone is familiar with only one of the packages, it may be useful to look at the code that does the same thing, in both, to study the second.dplyr
that are designed to perform most data operations. Select - to select one or more columns. Filter - to select rows based on any criteria. Arrange - to sort data by one or more columns in ascending or descending order. Mutate - to add new columns to the data. Summarise - to select part of the data.data.table
very short common format - DT [ i, j, by ], which can be interpreted as follows: take DT, select lines using i , and calculate j by grouping by . library(dplyr) library(data.table) library(lubridate) library(jsonlite) library(tidyr) library(ggplot2) library(compare)
fromJSON
package jsonlite
. Since JSON is the standard data format for asynchronous interaction between the browser and the server, it is helpful to understand the code below, through which data is received. An introduction to working with JSON data with the jsonlite
package can be found here and here . However, if you want to focus only on the dplyr
and data.table
, you can safely run the code below in two different windows and don’t go into details. spending=fromJSON("https://data.medicare.gov/api/views/nrth-mfg3/rows.json?accessType=DOWNLOAD") names(spending)
"meta" "data"
meta=spending$meta hospital_spending=data.frame(spending$data) colnames(hospital_spending)=make.names(meta$view$columns$name) hospital_spending=select(hospital_spending,-c(sid:meta)) glimpse(hospital_spending)
Observations: 70598 Variables: $ Hospital.Name (fctr) SOUTHEAST ALABAMA MEDICAL CENT... $ Provider.Number. (fctr) 010001, 010001, 010001, 010001... $ State (fctr) AL, AL, AL, AL, AL, AL, AL, AL... $ Period (fctr) 1 to 3 days Prior to Index Hos... $ Claim.Type (fctr) Home Health Agency, Hospice, I... $ Avg.Spending.Per.Episode..Hospital. (fctr) 12, 1, 6, 160, 1, 6, 462, 0, 0... $ Avg.Spending.Per.Episode..State. (fctr) 14, 1, 6, 85, 2, 9, 492, 0, 0,... $ Avg.Spending.Per.Episode..Nation. (fctr) 13, 1, 5, 117, 2, 9, 532, 0, 0... $ Percent.of.Spending..Hospital. (fctr) 0.06, 0.01, 0.03, 0.84, 0.01, ... $ Percent.of.Spending..State. (fctr) 0.07, 0.01, 0.03, 0.46, 0.01, ... $ Percent.of.Spending..Nation. (fctr) 0.07, 0.00, 0.03, 0.58, 0.01, ... $ Measure.Start.Date (fctr) 2014-01-01T00:00:00, 2014-01-0... $ Measure.End.Date (fctr) 2014-12-31T00:00:00, 2014-12-3...
cols = 6:11; # , hospital_spending[,cols] <- lapply(hospital_spending[,cols], as.numeric)
lubridate
package to fix them. cols = 12:13; # hospital_spending[,cols] <- lapply(hospital_spending[,cols], ymd_hms)
sapply(hospital_spending, class)
$Hospital.Name "factor" $Provider.Number. "factor" $State "factor" $Period "factor" $Claim.Type "factor" $Avg.Spending.Per.Episode..Hospital. "numeric" $Avg.Spending.Per.Episode..State. "numeric" $Avg.Spending.Per.Episode..Nation. "numeric" $Percent.of.Spending..Hospital. "numeric" $Percent.of.Spending..State. "numeric" $Percent.of.Spending..Nation. "numeric" $Measure.Start.Date "POSIXct" "POSIXt" $Measure.End.Date "POSIXct" "POSIXt"
data.table()
function: hospital_spending_DT = data.table(hospital_spending) class(hospital_spending_DT)
"data.table" "data.frame"
dplyr
, use the select
verb. In data.table
, in turn, you can set the column names. from_dplyr = select(hospital_spending, Hospital.Name) from_data_table = hospital_spending_DT[,.(Hospital.Name)]
dplyr
and data.table
same. compare(from_dplyr,from_data_table, allowAll=TRUE)
TRUE dropped attributes
from_dplyr = select(hospital_spending, -Hospital.Name) from_data_table = hospital_spending_DT[,!c("Hospital.Name"),with=FALSE] compare(from_dplyr,from_data_table, allowAll=TRUE)
TRUE dropped attributes
:=
, which changes the input data table (data.table) by reference.copy()
function, which creates a copy of the original object, i.e. any following operation on the data by reference to the copy will not affect the initial object. DT=copy(hospital_spending_DT) DT=DT[,Hospital.Name:=NULL] "Hospital.Name"%in%names(DT)
FALSE
DT=copy(hospital_spending_DT) DT=DT[,c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date"):=NULL] c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date")%in%names(DT)
FALSE FALSE FALSE FALSE
from_dplyr = select(hospital_spending, Hospital.Name,State,Measure.Start.Date,Measure.End.Date) from_data_table = hospital_spending_DT[,.(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)] compare(from_dplyr,from_data_table, allowAll=TRUE)
TRUE dropped attributes
from_dplyr = select(hospital_spending, -c(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)) from_data_table = hospital_spending_DT[,!c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date"),with=FALSE] compare(from_dplyr,from_data_table, allowAll=TRUE)
TRUE dropped attributes
dplyr
has the contains()
, starts_with()
and ends_with()
that can be used with the select
verb. Regular expressions are allowed in data.table
. As an example, select the columns containing the word "Date" in the name. from_dplyr = select(hospital_spending,contains("Date")) from_data_table = subset(hospital_spending_DT,select=grep("Date",names(hospital_spending_DT))) compare(from_dplyr,from_data_table, allowAll=TRUE)
TRUE dropped attributes
names(from_dplyr)
"Measure.Start.Date" "Measure.End.Date"
setnames(hospital_spending_DT,c("Hospital.Name", "Measure.Start.Date","Measure.End.Date"), c("Hospital","Start_Date","End_Date")) names(hospital_spending_DT)
"Hospital" "Provider.Number." "State" "Period" "Claim.Type" "Avg.Spending.Per.Episode..Hospital." "Avg.Spending.Per.Episode..State." "Avg.Spending.Per.Episode..Nation." "Percent.of.Spending..Hospital." "Percent.of.Spending..State." "Percent.of.Spending..Nation." "Start_Date" "End_Date"
hospital_spending = rename(hospital_spending,Hospital= Hospital.Name, Start_Date=Measure.Start.Date,End_Date=Measure.End.Date) compare(hospital_spending,hospital_spending_DT, allowAll=TRUE)
TRUE dropped attributes
Source: https://habr.com/ru/post/301914/
All Articles