dplyr and data.table . Each package has its strengths. dplyr elegant and similar to natural language, while data.table concise, with it you can do a lot of things in just one line. Moreover, in some cases, data.table faster (comparative analysis is available here ), and this can determine the choice if there are memory or performance limitations. Comparing dplyr and data.table can also be read on Stack Overflow and Quora .data.table , and here for dplyr . You can also read tutorials on dplyr on DataScience +.dplyr and data.table for a long time to work with data. If someone is familiar with only one of the packages, it may be useful to look at the code that does the same thing, in both, to study the second.dplyr that are designed to perform most data operations. Select - to select one or more columns. Filter - to select rows based on any criteria. Arrange - to sort data by one or more columns in ascending or descending order. Mutate - to add new columns to the data. Summarise - to select part of the data.data.table very short common format - DT [ i, j, by ], which can be interpreted as follows: take DT, select lines using i , and calculate j by grouping by . library(dplyr) library(data.table) library(lubridate) library(jsonlite) library(tidyr) library(ggplot2) library(compare) fromJSON package jsonlite . Since JSON is the standard data format for asynchronous interaction between the browser and the server, it is helpful to understand the code below, through which data is received. An introduction to working with JSON data with the jsonlite package can be found here and here . However, if you want to focus only on the dplyr and data.table , you can safely run the code below in two different windows and don’t go into details. spending=fromJSON("https://data.medicare.gov/api/views/nrth-mfg3/rows.json?accessType=DOWNLOAD") names(spending) "meta" "data" meta=spending$meta hospital_spending=data.frame(spending$data) colnames(hospital_spending)=make.names(meta$view$columns$name) hospital_spending=select(hospital_spending,-c(sid:meta)) glimpse(hospital_spending) Observations: 70598 Variables: $ Hospital.Name (fctr) SOUTHEAST ALABAMA MEDICAL CENT... $ Provider.Number. (fctr) 010001, 010001, 010001, 010001... $ State (fctr) AL, AL, AL, AL, AL, AL, AL, AL... $ Period (fctr) 1 to 3 days Prior to Index Hos... $ Claim.Type (fctr) Home Health Agency, Hospice, I... $ Avg.Spending.Per.Episode..Hospital. (fctr) 12, 1, 6, 160, 1, 6, 462, 0, 0... $ Avg.Spending.Per.Episode..State. (fctr) 14, 1, 6, 85, 2, 9, 492, 0, 0,... $ Avg.Spending.Per.Episode..Nation. (fctr) 13, 1, 5, 117, 2, 9, 532, 0, 0... $ Percent.of.Spending..Hospital. (fctr) 0.06, 0.01, 0.03, 0.84, 0.01, ... $ Percent.of.Spending..State. (fctr) 0.07, 0.01, 0.03, 0.46, 0.01, ... $ Percent.of.Spending..Nation. (fctr) 0.07, 0.00, 0.03, 0.58, 0.01, ... $ Measure.Start.Date (fctr) 2014-01-01T00:00:00, 2014-01-0... $ Measure.End.Date (fctr) 2014-12-31T00:00:00, 2014-12-3... cols = 6:11; # , hospital_spending[,cols] <- lapply(hospital_spending[,cols], as.numeric) lubridate package to fix them. cols = 12:13; # hospital_spending[,cols] <- lapply(hospital_spending[,cols], ymd_hms) sapply(hospital_spending, class) $Hospital.Name "factor" $Provider.Number. "factor" $State "factor" $Period "factor" $Claim.Type "factor" $Avg.Spending.Per.Episode..Hospital. "numeric" $Avg.Spending.Per.Episode..State. "numeric" $Avg.Spending.Per.Episode..Nation. "numeric" $Percent.of.Spending..Hospital. "numeric" $Percent.of.Spending..State. "numeric" $Percent.of.Spending..Nation. "numeric" $Measure.Start.Date "POSIXct" "POSIXt" $Measure.End.Date "POSIXct" "POSIXt" data.table() function: hospital_spending_DT = data.table(hospital_spending) class(hospital_spending_DT) "data.table" "data.frame" dplyr , use the select verb. In data.table , in turn, you can set the column names. from_dplyr = select(hospital_spending, Hospital.Name) from_data_table = hospital_spending_DT[,.(Hospital.Name)] dplyr and data.table same. compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes from_dplyr = select(hospital_spending, -Hospital.Name) from_data_table = hospital_spending_DT[,!c("Hospital.Name"),with=FALSE] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes := , which changes the input data table (data.table) by reference.copy() function, which creates a copy of the original object, i.e. any following operation on the data by reference to the copy will not affect the initial object. DT=copy(hospital_spending_DT) DT=DT[,Hospital.Name:=NULL] "Hospital.Name"%in%names(DT) FALSE DT=copy(hospital_spending_DT) DT=DT[,c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date"):=NULL] c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date")%in%names(DT) FALSE FALSE FALSE FALSE from_dplyr = select(hospital_spending, Hospital.Name,State,Measure.Start.Date,Measure.End.Date) from_data_table = hospital_spending_DT[,.(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes from_dplyr = select(hospital_spending, -c(Hospital.Name,State,Measure.Start.Date,Measure.End.Date)) from_data_table = hospital_spending_DT[,!c("Hospital.Name","State","Measure.Start.Date","Measure.End.Date"),with=FALSE] compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes dplyr has the contains() , starts_with() and ends_with() that can be used with the select verb. Regular expressions are allowed in data.table . As an example, select the columns containing the word "Date" in the name. from_dplyr = select(hospital_spending,contains("Date")) from_data_table = subset(hospital_spending_DT,select=grep("Date",names(hospital_spending_DT))) compare(from_dplyr,from_data_table, allowAll=TRUE) TRUE dropped attributes names(from_dplyr) "Measure.Start.Date" "Measure.End.Date" setnames(hospital_spending_DT,c("Hospital.Name", "Measure.Start.Date","Measure.End.Date"), c("Hospital","Start_Date","End_Date")) names(hospital_spending_DT) "Hospital" "Provider.Number." "State" "Period" "Claim.Type" "Avg.Spending.Per.Episode..Hospital." "Avg.Spending.Per.Episode..State." "Avg.Spending.Per.Episode..Nation." "Percent.of.Spending..Hospital." "Percent.of.Spending..State." "Percent.of.Spending..Nation." "Start_Date" "End_Date" hospital_spending = rename(hospital_spending,Hospital= Hospital.Name, Start_Date=Measure.Start.Date,End_Date=Measure.End.Date) compare(hospital_spending,hospital_spending_DT, allowAll=TRUE) TRUE dropped attributes Source: https://habr.com/ru/post/301914/
All Articles