Applying operations over different columns

With group_by(), dplyr makes it easy to apply operations over different groups of rows:

starwars %>%
  group_by(gender) %>%
  summarise(mean(height, na.rm = TRUE))
#> # A tibble: 5 x 2
#>   gender        `mean(height, na.rm = TRUE)`
#>   <chr>                                <dbl>
#> 1 <NA>                                  120 
#> 2 female                                165.
#> 3 hermaphrodite                         175 
#> 4 male                                  179.
#> # … with 1 more row

Sometimes it is also useful to apply the same operation over different columns. That calls for a different approach: applying, or mapping, functions over the columns. This requires a little more programming knowledge because you need to wrap your operation in a function and pass that function to dplyr verbs. However, this knowledge is widely reusable in base R (the apply family) and the tidyverse (the map family). To brush up on iteration with functions, read the iteration chapter of the online book R4DS.
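
As a minimal sketch of what wrapping an operation in a function looks like, the example below defines a small helper and passes it to a base R mapper and to a purrr mapper. The helper name mean_no_na and the use of the built-in airquality dataset are assumptions made for illustration; they are not part of the original example.

```r
library(purrr)

# Hypothetical helper: wrap the operation in a function so that it can be
# passed to other functions like any other value
mean_no_na <- function(x) mean(x, na.rm = TRUE)

# Map the helper over the first four columns of a built-in dataset
cols <- airquality[1:4]

# Base R (apply family): returns a named numeric vector
base_means <- sapply(cols, mean_no_na)

# tidyverse (map family): same values, always returned as a double vector
purrr_means <- map_dbl(cols, mean_no_na)
```

Both calls take the function itself as an argument, which is the same pattern the dplyr verbs discussed in this post rely on.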

dplyr provides variants of the main data manipulation verbs that map functions over a selection of columns. These verbs are known as the scoped variants and are recognizable from their _at, _if and _all suffixes.

Selecting columns

Scoped verbs support three sorts of selection:

  1. _all verbs operate on all columns of the data frame. For example, summarise_all() summarises all columns of a data frame within groups.

  2. _if verbs operate conditionally, on all columns for which a predicate returns TRUE. If you are familiar with purrr, the idea is similar to the conditional mapper purrr::map_if(). For example, group_by_if(is.character) promotes all character columns of a data frame to grouping variables.

  3. _at verbs operate on a selection of columns. You can supply integer vectors of column positions or character vectors of column names.

    More interestingly, you can use vars() to supply the same sort of expressions you would pass to select()! The selection helpers make it very convenient to craft a selection of columns to map over.
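
The three sorts of selection can be sketched as follows. This is an illustrative example rather than the original one: the choice of datasets (iris, starwars, mtcars) and of columns is an assumption, and the result names are made up for this sketch.

```r
library(dplyr)

# 1. _all: summarise every non-grouping column within each group
by_species <- iris %>%
  group_by(Species) %>%
  summarise_all(mean)

# 2. _if: promote every character column to a grouping variable
by_chr <- starwars %>% group_by_if(is.character)

# 3. _at: select columns by name, by position, or with vars()
by_name <- mtcars %>% summarise_at(c("mpg", "hp"), mean)
by_position <- mtcars %>% summarise_at(c(1, 4), mean)
by_helper <- mtcars %>% summarise_at(vars(starts_with("d")), mean)
```

In mtcars, positions 1 and 4 are the mpg and hp columns, so the selections by name and by position pick the same variables.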

The scoped variants of mutate() and summarise() are the closest analogues to base::lapply() and purrr::map(). Unlike pure list mappers, the scoped verbs fully implement the dplyr semantics, such as groupwise vectorisation or the summary constraints:

# map() returns a simple list with the results
mtcars[1:5] %>% purrr::map(mean)
#> $mpg
#> [1] 20.09062
#> 
#> $cyl
#> [1] 6.1875
#> 
#> $disp
#> [1] 230.7219
#> 
#> $hp
#> [1] 146.6875
#> 
#> $drat
#> [1] 3.596563

# `mutate_` variants recycle to group size
mtcars[1:5] %>% mutate_all(mean)
#> # A tibble: 32 x 5
#>     mpg   cyl  disp    hp  drat
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  20.1  6.19  231.  147.  3.60
#> 2  20.1  6.19  231.  147.  3.60
#> 3  20.1  6.19  231.  147.  3.60
#> 4  20.1  6.19  231.  147.  3.60
#> # … with 28 more rows

# `summarise_` variants enforce a size 1 constraint
mtcars[1:5] %>% summarise_all(mean)
#> # A tibble: 1 x 5
#>     mpg   cyl  disp    hp  drat
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  20.1  6.19  231.  147.  3.60

# All scoped verbs know about groups
mtcars[1:5] %>% group_by(cyl) %>% summarise_all(mean)
#> # A tibble: 3 x 5
#>     cyl   mpg  disp    hp  drat
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  26.7  105.  82.6  4.07
#> 2     6  19.7  183. 122.   3.59
#> 3     8  15.1  353. 209.   3.23

The other scoped variants also accept optional functions to map over the selection of columns. For instance, you could group by a selection of variables and transform them on the fly:

iris %>% group_by_if(is.factor, as.character)
#> # A tibble: 150 x 5
#> # Groups:   Species [3]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> 4          4.6         3.1          1.5         0.2 setosa 
#> # … with 146 more rows

or transform the column names of selected variables:

storms %>% select_at(vars(name:hour), toupper)
#> # A tibble: 10,010 x 5
#>   NAME   YEAR MONTH   DAY  HOUR
#>   <chr> <dbl> <dbl> <int> <dbl>
#> 1 Amy    1975     6    27     0
#> 2 Amy    1975     6    27     6
#> 3 Amy    1975     6    27    12
#> 4 Amy    1975     6    27    18
#> # … with 10,006 more rows

The scoped variants lie at the intersection of purrr and dplyr and combine the rowwise looping mechanisms of dplyr with the columnwise mapping of purrr. This is a powerful combination.

Transitioning from funs() to purrr-style formulas

Historically, mapping functions has been done through a special syntax that was unique to dplyr. The funs() helper would take quoted expressions in which . represented the function input. In order to make the tidyverse more consistent and to reduce the variety of syntax to learn, we have deprecated this idiosyncratic interface in favour of purrr-style formulas and ordinary functions.

The purrr syntax makes it convenient to define functions on the fly:

library("purrr")

# Mapping an ordinary function:
map_dbl(mtcars, function(input) mean(input, na.rm = TRUE))

# Mapping with the formula syntax:
map_dbl(mtcars, ~ mean(., na.rm = TRUE))

Like funs(), formulas create functions where . represents the input. Another pronoun for the input is .x. It is especially useful for mapping two arguments at the same time because the second one can be referred to as .y. The scoped verbs in dplyr only support mapping one input, so it is fine to keep using the . pronoun.
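
To make the pronouns concrete, here is a small sketch, assuming the mtcars dataset. It shows that . and .x are interchangeable in a scoped verb, and uses purrr's map2_dbl() to show where .y comes into play. The result names are chosen for illustration only.

```r
library(dplyr)
library(purrr)

# With a single input, `.` and `.x` refer to the same thing
with_dot <- mtcars %>% summarise_all(~ mean(., na.rm = TRUE))
with_x <- mtcars %>% summarise_all(~ mean(.x, na.rm = TRUE))

# With two inputs, the map2() family exposes the second one as `.y`
power_to_weight <- map2_dbl(mtcars$hp, mtcars$wt, ~ .x / .y)
```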

Changing your code to use purrr lambdas should be straightforward in most cases.
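
A hedged sketch of such a migration, assuming the mtcars dataset and an illustrative function name avg, might look like this. The deprecated funs() call is shown only as a comment so that the block runs without deprecation warnings.

```r
library(dplyr)

# Before (deprecated interface, shown for comparison only):
# mtcars %>% summarise_at(vars(mpg, hp), funs(avg = mean(., na.rm = TRUE)))

# After: a single formula replaces a single funs() expression
plain <- mtcars %>%
  summarise_at(vars(mpg, hp), ~ mean(., na.rm = TRUE))

# After: a named list of formulas replaces a named funs() call;
# output columns are named <column>_<function name>
named <- mtcars %>%
  summarise_at(vars(mpg, hp), list(avg = ~ mean(., na.rm = TRUE)))
```

In both forms, the body of the old funs() expression carries over unchanged; only the wrapper around it differs.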
