With group_by(), dplyr makes it easy to apply operations over different groups of rows:
starwars %>%
group_by(gender) %>%
summarise(mean(height, na.rm = TRUE))
#> # A tibble: 5 x 2
#> gender `mean(height, na.rm = TRUE)`
#> <chr> <dbl>
#> 1 <NA> 120
#> 2 female 165.
#> 3 hermaphrodite 175
#> 4 male 179.
#> # … with 1 more row
Sometimes it is also useful to apply the same operation over different columns. That case calls for a different approach: applying, or mapping, functions over the columns. This requires a little more programming knowledge because you need to wrap your operation in a function and pass that function to dplyr verbs. However, this knowledge is widely reusable in base R (the apply family) and the tidyverse (the map family). To brush up on iteration with functions, read the iteration chapter of the online book R4DS.
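As a rough illustration of wrapping an operation in a function and mapping it over columns, here is a base R sketch. The helper name mean_no_na and the choice of the built-in airquality data are assumptions made for this example only.

```r
# Wrap the operation in a function (hypothetical helper name).
mean_no_na <- function(x) mean(x, na.rm = TRUE)

# Map that function over the first four columns of a built-in data frame.
col_means <- sapply(airquality[1:4], mean_no_na)
col_means
```

The same function could be passed unchanged to purrr::map_dbl() or to a scoped dplyr verb, which is why the pattern carries over between base R and the tidyverse.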
dplyr provides variants of the main data manipulation verbs that map functions over a selection of columns. These verbs are known as the scoped variants and are recognizable from their _at, _if and _all suffixes.
Scoped verbs support three sorts of selection:
_all verbs operate on all columns of the data frame. You can summarise all columns of a data frame within groups with summarise_all():
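A minimal sketch of such a call, assuming dplyr is attached. The built-in iris data and the Species grouping variable are illustrative choices, not taken from the text above.

```r
library(dplyr)

# Summarise every non-grouping column within each group of Species.
by_species <- iris %>%
  group_by(Species) %>%
  summarise_all(mean)

by_species
```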
_if verbs operate conditionally, on all columns for which a predicate returns TRUE. If you are familiar with purrr, the idea is similar to the conditional mapper purrr::map_if(). Promoting all character columns of a data frame to grouping variables is as simple as:
starwars %>% group_by_if(is.character)
#> # A tibble: 87 x 13
#> # Groups: name, hair_color, skin_color, eye_color, gender, homeworld,
#> # species [87]
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Luke… 172 77 blond fair blue 19 male
#> 2 C-3PO 167 75 <NA> gold yellow 112 <NA>
#> 3 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> 4 Dart… 202 136 none white yellow 41.9 male
#> # … with 83 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
_at verbs operate on a selection of columns. You can supply integer vectors of column positions or character vectors of column names.
mtcars %>% summarise_at(1:2, mean)
#> mpg cyl
#> 1 20.09062 6.1875
mtcars %>% summarise_at(c("disp", "drat"), median)
#> disp drat
#> 1 196.3 3.695
More interestingly, you can use vars()[^fn:vars] to supply the same sort of expressions you would pass to select()! The selection helpers make it very convenient to craft a selection of columns to map over.
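For instance, the following sketch uses vars() with a selection helper and with a column range. The particular columns are illustrative choices on the built-in mtcars data.

```r
library(dplyr)

# vars() accepts the same selection expressions as select():
# a helper such as starts_with() ...
d_means <- mtcars %>% summarise_at(vars(starts_with("d")), mean)

# ... or a range of adjacent columns.
range_medians <- mtcars %>% summarise_at(vars(mpg:disp), median)

d_means
range_medians
```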
The scoped variants of mutate() and summarise() are the closest analogue to base::lapply() and purrr::map(). Unlike pure list mappers, the scoped verbs fully implement the dplyr semantics, such as groupwise vectorisation or the summary constraints:
# map() returns a simple list with the results
mtcars[1:5] %>% purrr::map(mean)
#> $mpg
#> [1] 20.09062
#>
#> $cyl
#> [1] 6.1875
#>
#> $disp
#> [1] 230.7219
#>
#> $hp
#> [1] 146.6875
#>
#> $drat
#> [1] 3.596563
# `mutate_` variants recycle to group size
mtcars[1:5] %>% mutate_all(mean)
#> # A tibble: 32 x 5
#> mpg cyl disp hp drat
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 20.1 6.19 231. 147. 3.60
#> 2 20.1 6.19 231. 147. 3.60
#> 3 20.1 6.19 231. 147. 3.60
#> 4 20.1 6.19 231. 147. 3.60
#> # … with 28 more rows
# `summarise_` variants enforce a size 1 constraint
mtcars[1:5] %>% summarise_all(mean)
#> # A tibble: 1 x 5
#> mpg cyl disp hp drat
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 20.1 6.19 231. 147. 3.60
# All scoped verbs know about groups
mtcars[1:5] %>% group_by(cyl) %>% summarise_all(mean)
#> # A tibble: 3 x 5
#> cyl mpg disp hp drat
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 26.7 105. 82.6 4.07
#> 2 6 19.7 183. 122. 3.59
#> 3 8 15.1 353. 209. 3.23
The other scoped variants also accept optional functions to map over the selection of columns. For instance, you could group by a selection of variables and transform them on the fly:
iris %>% group_by_if(is.factor, as.character)
#> # A tibble: 150 x 5
#> # Groups: Species [3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> # … with 146 more rows
or transform the column names of selected variables:
storms %>% select_at(vars(name:hour), toupper)
#> # A tibble: 10,010 x 5
#> NAME YEAR MONTH DAY HOUR
#> <chr> <dbl> <dbl> <int> <dbl>
#> 1 Amy 1975 6 27 0
#> 2 Amy 1975 6 27 6
#> 3 Amy 1975 6 27 12
#> 4 Amy 1975 6 27 18
#> # … with 10,006 more rows
The scoped variants lie at the intersection of purrr and dplyr and combine the rowwise looping mechanisms of dplyr with the columnwise mapping of purrr. This is a powerful combination.
From funs() to purrr-style formulas
Historically, mapping functions was done through a special syntax that was unique to dplyr. The funs() helper would take quoted expressions in which the dot (.) represented the function input. In order to make the tidyverse more consistent and to reduce the variety of syntax to learn, we have deprecated this idiosyncratic interface in favour of purrr-style formulas and ordinary functions.
The purrr syntax makes it convenient to define functions on the fly:
library("purrr")
# Mapping an ordinary function:
map_dbl(mtcars, function(input) mean(input, na.rm = TRUE))
# Mapping with the formula syntax:
map_dbl(mtcars, ~ mean(., na.rm = TRUE))
Like funs(), formulas create functions where the dot (.) represents the input. Another pronoun for the input is .x. It is especially useful for mapping two arguments at the same time because the second one can be referred to as .y. The scoped verbs in dplyr only support mapping one input so it is fine to keep using the dot pronoun.
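To illustrate the two-argument case, here is a small sketch with purrr::map2_dbl(). The input vectors are made up for the example.

```r
library(purrr)

# .x refers to the element of the first vector, .y to the element of the second.
sums <- map2_dbl(c(1, 2, 3), c(10, 20, 30), ~ .x + .y)
sums
#> [1] 11 22 33
```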
Changing your code to use purrr lambdas should be straightforward in most cases.
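The sketch below shows what such a conversion might look like. The deprecated call appears only in a comment for comparison, and the names avg and med are hypothetical labels chosen for this example.

```r
library(dplyr)

# Deprecated style, shown for comparison only:
#   mtcars %>% summarise_all(funs(mean(., na.rm = TRUE)))

# The same summary written with a purrr-style formula:
means <- mtcars %>% summarise_all(~ mean(., na.rm = TRUE))

# Several lambdas can be supplied as an ordinary named list;
# the names are combined with the column names in the output.
stats <- mtcars %>%
  summarise_at(vars(mpg, hp), list(avg = ~ mean(.), med = ~ median(.)))

means
stats
```

Passing a plain function, as in summarise_all(mean, na.rm = TRUE), remains an option when no lambda is needed.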