Aggregate functions in dplyr

summarise() is used with aggregate functions, which take a vector of values as input and return a single value. The summarise_each() function offers a different approach from summarise() that produces the same results.

The purpose of this article is to compare the behavior of summarise() and summarise_each() with respect to two factors that we can control:

1. How many variables to operate on

2. How many functions to apply to each variable

This gives four options:

1. One function applied to one variable

2. Many functions applied to one variable

3. One function applied to many variables

4. Many functions applied to many variables

We also check each of these four cases with and without group_by().

The mtcars data


For this article, we use the well-known mtcars data set.

First we convert it to a tbl_df object. The data behave exactly as they would in a standard data.frame, but the printed output is much more readable.

Finally, to keep things easy to follow, we select only the three variables we will work with:
 library(dplyr)
 mtcars <- mtcars %>% tbl_df() %>% select(cyl, mpg, disp)
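The conversion above uses tbl_df(), which has been deprecated in newer dplyr releases. As a minimal sketch, assuming a recent dplyr (1.0 series or later), the same preparation can be written with as_tibble(). The name `cars` is hypothetical and is used only so that the built-in mtcars is not overwritten.

```r
library(dplyr)

# Same preparation as in the article, with as_tibble() in place of tbl_df().
# `cars` is a hypothetical name chosen for this sketch.
cars <- mtcars %>%
  as_tibble() %>%
  select(cyl, mpg, disp)

print(cars)
```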

Option 1: apply one function to one variable


In this case, summarise() will produce a simple result:
 # without grouping
 mtcars %>% summarise(mean_mpg = mean(mpg))

 ## Source: local data frame [1 x 1]
 ##
 ##   mean_mpg
 ##      (dbl)
 ## 1 20.09062

 # grouped by cyl
 mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg))

 ## Source: local data frame [3 x 2]
 ##
 ##     cyl mean_mpg
 ##   (dbl)    (dbl)
 ## 1     4 26.66364
 ## 2     6 19.74286
 ## 3     8 15.10000

The summarise_each() function could also be used here, but the resulting code is less clear.
 # without grouping
 mtcars %>% summarise_each(funs(mean), mean_mpg = mpg)

 ## Source: local data frame [1 x 1]
 ##
 ##   mean_mpg
 ##      (dbl)
 ## 1 20.09062

 # grouped by cyl
 mtcars %>% group_by(cyl) %>% summarise_each(funs(mean), mean_mpg = mpg)

 ## Source: local data frame [3 x 2]
 ##
 ##     cyl mean_mpg
 ##   (dbl)    (dbl)
 ## 1     4 26.66364
 ## 2     6 19.74286
 ## 3     8 15.10000

Option 2: apply many functions to one variable


In this case, both summarise() and summarise_each() can be used.

The summarise() function has a more intuitive syntax:
 # without grouping
 mtcars %>% summarise(min_mpg = min(mpg), max_mpg = max(mpg))

 ## Source: local data frame [1 x 2]
 ##
 ##   min_mpg max_mpg
 ##     (dbl)   (dbl)
 ## 1    10.4    33.9

 # grouped by cyl
 mtcars %>% group_by(cyl) %>% summarise(min_mpg = min(mpg), max_mpg = max(mpg))

 ## Source: local data frame [3 x 3]
 ##
 ##     cyl min_mpg max_mpg
 ##   (dbl)   (dbl)   (dbl)
 ## 1     4    21.4    33.9
 ## 2     6    17.8    21.4
 ## 3     8    10.4    19.2

The names of the output variables are set directly in the call:
 max_mpg = max(mpg) 

When many functions are applied to one variable, summarise_each() offers a more compact syntax:
 # without grouping
 mtcars %>% summarise_each(funs(min, max), mpg)

 ## Source: local data frame [1 x 2]
 ##
 ##     min   max
 ##   (dbl) (dbl)
 ## 1  10.4  33.9

 # grouped by cyl
 mtcars %>% group_by(cyl) %>% summarise_each(funs(min, max), mpg)

 ## Source: local data frame [3 x 3]
 ##
 ##     cyl   min   max
 ##   (dbl) (dbl) (dbl)
 ## 1     4  21.4  33.9
 ## 2     6  17.8  21.4
 ## 3     8  10.4  19.2

The names of the output variables are taken from the names of the functions: min and max. As a result, we lose the name of the variable to which the functions are applied. If you need names such as min_mpg and max_mpg, rename the functions inside funs():
 # without grouping
 mtcars %>% summarise_each(funs(min_mpg = min, max_mpg = max), mpg)

 ## Source: local data frame [1 x 2]
 ##
 ##   min_mpg max_mpg
 ##     (dbl)   (dbl)
 ## 1    10.4    33.9

 # grouped by cyl
 mtcars %>% group_by(cyl) %>% summarise_each(funs(min_mpg = min, max_mpg = max), mpg)

 ## Source: local data frame [3 x 3]
 ##
 ##     cyl min_mpg max_mpg
 ##   (dbl)   (dbl)   (dbl)
 ## 1     4    21.4    33.9
 ## 2     6    17.8    21.4
 ## 3     8    10.4    19.2
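For readers on a newer dplyr, where summarise_each() and funs() are deprecated, the same "many functions, one variable" case can be sketched with across(). This is not from the original article; it assumes a recent dplyr (1.0 series or later), and `range_mpg` is a hypothetical name used only for illustration.

```r
library(dplyr)

# Many functions, one variable: a named list of functions plays the role of funs().
# The .names pattern "{.fn}_{.col}" builds names of the form function_variable.
range_mpg <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(mpg, list(min = min, max = max), .names = "{.fn}_{.col}"))

print(range_mpg)
```

The naming pattern is given once in .names, so the functions do not have to be renamed individually as they are inside funs().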

Option 3: apply the same function to many variables


This option is very similar to the previous one. Either function can be used: summarise() or summarise_each().

The summarise() function again has a more intuitive syntax, and the names of the output variables can be set in the usual simple form:
 max_mpg = max(mpg)

 # without grouping
 mtcars %>% summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))

 ## Source: local data frame [1 x 2]
 ##
 ##   mean_mpg mean_disp
 ##      (dbl)     (dbl)
 ## 1 20.09062  230.7219

 # grouped by cyl
 mtcars %>% group_by(cyl) %>% summarise(mean_mpg = mean(mpg), mean_disp = mean(disp))

 ## Source: local data frame [3 x 3]
 ##
 ##     cyl mean_mpg mean_disp
 ##   (dbl)    (dbl)     (dbl)
 ## 1     4 26.66364  105.1364
 ## 2     6 19.74286  183.3143
 ## 3     8 15.10000  353.1000

When one function is applied to many variables, summarise_each() offers a more compact syntax:
 # without grouping
 mtcars %>% summarise_each(funs(mean), mpg, disp)

 ## Source: local data frame [1 x 2]
 ##
 ##        mpg     disp
 ##      (dbl)    (dbl)
 ## 1 20.09062 230.7219

 # grouped by cyl
 mtcars %>% group_by(cyl) %>% summarise_each(funs(mean), mpg, disp)

 ## Source: local data frame [3 x 3]
 ##
 ##     cyl      mpg     disp
 ##   (dbl)    (dbl)    (dbl)
 ## 1     4 26.66364 105.1364
 ## 2     6 19.74286 183.3143
 ## 3     8 15.10000 353.1000

The names of the output variables are taken from the names of the variables: mpg and disp. As a result, we lose the name of the function applied to the variables, mean(). You would probably prefer names such as mean_mpg and mean_disp. To get them, rename the variables passed to the "..." argument of summarise_each():
 # without grouping
 mtcars %>% summarise_each(funs(mean), mean_mpg = mpg, mean_disp = disp)

 ## Source: local data frame [1 x 2]
 ##
 ##   mean_mpg mean_disp
 ##      (dbl)     (dbl)
 ## 1 20.09062  230.7219

 # grouped by cyl
 mtcars %>% group_by(cyl) %>% summarise_each(funs(mean), mean_mpg = mpg, mean_disp = disp)

 ## Source: local data frame [3 x 3]
 ##
 ##     cyl mean_mpg mean_disp
 ##   (dbl)    (dbl)     (dbl)
 ## 1     4 26.66364  105.1364
 ## 2     6 19.74286  183.3143
 ## 3     8 15.10000  353.1000
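As a hedged sketch of the same "one function, many variables" case in a newer dplyr (assumed: 1.0 series or later, not the version used in the article), across() can apply mean() to several columns and add the function name through .names. The name `means` is hypothetical.

```r
library(dplyr)

# One function, many variables: across() applies mean() to mpg and disp.
# The .names pattern prefixes each column name with "mean_".
means <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, disp), mean, .names = "mean_{.col}"))

print(means)
```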

Option 4: apply many functions to many variables


As in the previous cases, both summarise() and summarise_each() have their advantages.

The summarise() function again has a more intuitive syntax, and the names of the output variables can be set in the usual simple form:
 max_mpg = max(mpg)

 # without grouping
 mtcars %>% summarise(min_mpg = min(mpg), min_disp = min(disp), max_mpg = max(mpg), max_disp = max(disp))

 ## Source: local data frame [1 x 4]
 ##
 ##   min_mpg min_disp max_mpg max_disp
 ##     (dbl)    (dbl)   (dbl)    (dbl)
 ## 1    10.4     71.1    33.9      472

 # grouped by cyl
 mtcars %>% group_by(cyl) %>% summarise(min_mpg = min(mpg), min_disp = min(disp), max_mpg = max(mpg), max_disp = max(disp))

 ## Source: local data frame [3 x 5]
 ##
 ##     cyl min_mpg min_disp max_mpg max_disp
 ##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
 ## 1     4    21.4     71.1    33.9    146.7
 ## 2     6    17.8    145.0    21.4    258.0
 ## 3     8    10.4    275.8    19.2    472.0

When many functions are applied to many variables, summarise_each() offers a more compact syntax:
 # without grouping
 mtcars %>% summarise_each(funs(min, max), mpg, disp)

 ## Source: local data frame [1 x 4]
 ##
 ##   mpg_min disp_min mpg_max disp_max
 ##     (dbl)    (dbl)   (dbl)    (dbl)
 ## 1    10.4     71.1    33.9      472

 # grouped by cyl
 mtcars %>% group_by(cyl) %>% summarise_each(funs(min, max), mpg, disp)

 ## Source: local data frame [3 x 5]
 ##
 ##     cyl mpg_min disp_min mpg_max disp_max
 ##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
 ## 1     4    21.4     71.1    33.9    146.7
 ## 2     6    17.8    145.0    21.4    258.0
 ## 3     8    10.4    275.8    19.2    472.0

The names of the output variables follow the pattern variable_function, i.e. mpg_min, disp_min, etc.

The reverse naming, i.e. function_variable, is not possible inside the summarise_each() call. It can be achieved with a separate command:
 # without grouping
 mtcars %>% summarise_each(funs(min, max), mpg, disp) %>% setNames(c("min_mpg", "min_disp", "max_mpg", "max_disp"))

 ## Source: local data frame [1 x 4]
 ##
 ##   min_mpg min_disp max_mpg max_disp
 ##     (dbl)    (dbl)   (dbl)    (dbl)
 ## 1    10.4     71.1    33.9      472

 # grouped by cyl; the first name belongs to the grouping column
 mtcars %>% group_by(cyl) %>% summarise_each(funs(min, max), mpg, disp) %>% setNames(c("cyl", "min_mpg", "min_disp", "max_mpg", "max_disp"))

 ## Source: local data frame [3 x 5]
 ##
 ##     cyl min_mpg min_disp max_mpg max_disp
 ##   (dbl)   (dbl)    (dbl)   (dbl)    (dbl)
 ## 1     4    21.4     71.1    33.9    146.7
 ## 2     6    17.8    145.0    21.4    258.0
 ## 3     8    10.4    275.8    19.2    472.0
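In a newer dplyr (assumed: 1.0 series or later, not the version used in the article), the function_variable naming can be requested directly rather than patched on afterwards. The following is a sketch with across(), not the article's method; `ranges` is a hypothetical name.

```r
library(dplyr)

# Many functions, many variables, with function_variable names.
# across() emits the columns variable by variable:
# min_mpg, max_mpg, min_disp, max_disp.
ranges <- mtcars %>%
  summarise(across(c(mpg, disp), list(min = min, max = max), .names = "{.fn}_{.col}"))

print(ranges)
```

Note that the column order differs from the summarise_each() output above, which lists all minima first and then all maxima.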

Conclusions


For functions that return a single value, there are two main candidates: summarise() and summarise_each().

The summarise() function has a simpler syntax, while summarise_each() has a more compact one.

Because of this, summarise() is better suited to applying a single function to a single variable. The more variables or functions are involved, the more justified the use of summarise_each() becomes.

The summarise_each() function has its own way of naming output variables:

Option 2: apply many functions to one variable

The names of the output variables are determined by the function names, so the name of the variable to which the functions are applied is lost.

Option 3: apply the same function to many variables

The names of the output variables are determined by the variable names, so the name of the function applied to the variables is lost.

Option 4: apply many functions to many variables

The names of the output variables follow the variable_function pattern. No other naming scheme is possible inside the summarise_each() call.

Source: https://habr.com/ru/post/281747/

