Patterns of dplyr operations

The main design goal of dplyr is to be intuitive. Ideally, it should be obvious at a glance what a piece of dplyr code is about and what it does. But despite this apparent simplicity, many different things happen under the hood. This vignette explains how each dplyr verb behaves, to help you become more proficient in writing R and tidyverse code.

The dplyr verbs can be categorised along three main properties:

Actions and selections

Actions and selections are two kinds of operations in dplyr and the tidyverse. We call operations the arguments supplied by the user that either create new columns or vectors, or select them from a data frame. An operation may be very simple. For instance, each of the following examples features two operations, mass and hair_color:

starwars %>% arrange(mass, hair_color)
#> # A tibble: 87 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 Ratt…     79    15 none       grey, blue unknown           NA male  
#> 2 Yoda      66    17 white      green      brown            896 male  
#> 3 Wick…     88    20 brown      brown      brown              8 male  
#> 4 R2-D2     96    32 <NA>       white, bl… red               33 <NA>  
#> # … with 83 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

starwars %>% select(mass, hair_color)
#> # A tibble: 87 x 2
#>    mass hair_color
#>   <dbl> <chr>     
#> 1    77 blond     
#> 2    75 <NA>      
#> 3    32 <NA>      
#> 4   136 none      
#> # … with 83 more rows

But really, the operations can become as complex as you like:

starwars %>% arrange(-mass, desc(paste(hair_color, eye_color)))
#> # A tibble: 87 x 13
#>   name  height  mass hair_color skin_color eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> 
#> 1 Jabb…    175  1358 <NA>       green-tan… orange         600   herma…
#> 2 Grie…    216   159 none       brown, wh… green, y…       NA   male  
#> 3 IG-88    200   140 none       metal      red             15   none  
#> 4 Dart…    202   136 none       white      yellow          41.9 male  
#> # … with 83 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

starwars %>% select(1 + 2, intersect(starts_with("hair"), ends_with("color")))
#> # A tibble: 87 x 2
#>    mass hair_color
#>   <dbl> <chr>     
#> 1    77 blond     
#> 2    75 <NA>      
#> 3    32 <NA>      
#> 4   136 none      
#> # … with 83 more rows

While all operations look like R code, they are actually interpreted a little differently, depending on the dplyr verb at the receiving end. The key to being effective at writing more complex dplyr operations is to understand the difference between the two kinds of operations, actions and selections.

Actions

Actions are the most common flavour of dplyr operations. They behave just like any ordinary R code, with a few added features.

Data creation

All actions create new data. To illustrate this, we use transmute(), a dplyr verb that takes actions and returns the result of those actions, preserving the shape of the data frame:
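A minimal sketch of such a call, assuming dplyr is attached and using the expression 1 + 2 as the only action (the variable name result is illustrative):

```r
library(dplyr)

# transmute() keeps only the columns created by its arguments.
# The single value 3 is recycled to one value per row of starwars.
result <- starwars %>% transmute(1 + 2)

dim(result)    # same number of rows as starwars, one column
names(result)  # the automatic name is derived from the expression
```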

We get a one-column data frame containing the result of 1 + 2, recycled to the total data frame size and given an automatic name as a reminder of how the data was created.

Data masking

Actions are computed in the context of your data frame. This means you can refer to your columns as if they were actual objects in your workspace. We call this data masking because the columns have precedence over other objects in your workspace, i.e. they mask these objects.
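As a small illustration of masking, the sketch below defines a workspace object that shares its name with a starwars column. The names height_m and cm_per_m are made up for this example:

```r
library(dplyr)

# A workspace object with the same name as a starwars column.
height <- "not a number"

# Inside the action, `height` refers to the column, so the division works.
masked <- starwars %>% transmute(height_m = height / 100)

# Objects that are not columns are still looked up in the workspace.
cm_per_m <- 100
unmasked <- starwars %>% transmute(height_m = height / cm_per_m)

identical(masked$height_m, unmasked$height_m)
```

If the workspace object had taken precedence, the first call would have failed by trying to divide a character string.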

Selections

The usage of dplyr::select() looks similar to other verbs. In some cases, you’ll get exactly the same results:
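One such case, sketched here under the assumption that a bare column name is passed to both verbs:

```r
library(dplyr)

# A bare column name passed to a selecting verb...
by_selection <- starwars %>% select(height)

# ...and the same name passed to an action verb.
by_action <- starwars %>% transmute(height)

# Both results hold the same single column.
names(by_selection)
names(by_action)
identical(by_selection$height, by_action$height)
```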

However, select() actually uses a different mechanism under the hood. The arguments to select() are selections, not actions. Whereas action verbs take entire vectors as arguments, selecting verbs take column names or column positions. Observe the difference:
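A sketch of the contrast, assuming the literal 2 is passed to each verb:

```r
library(dplyr)

# As a selection, 2 is a column position: this picks the second column.
second_col <- starwars %>% select(2)
names(second_col)

# As an action, 2 is a value: this creates a new column filled with 2.
constant <- starwars %>% transmute(2)
unique(constant[[1]])
```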

The key feature of selecting verbs is that column names represent their own position inside the data frame. From the perspective of select(), height is the same as 2 because it is the second column in the starwars data frame. On the other hand, transmute() takes actions and interprets 2 as a new column recycled to the full data frame size.

There are two reasons for this difference.

  1. It makes it possible to select ranges of columns with start:end. For instance, name:mass expands to 1:3. This couldn’t work if name and mass represented the entire columns rather than just their positions.

  2. It allows overlapping selections. Take the sets of columns starting with the letter "h" and of columns ending with the suffix "_color":

    It is easy to select the union of those overlapping sets:

    To achieve this, select() needs unique identifiers for the columns, which the column positions provide.
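Both points can be sketched as follows. The union is written here by passing the two helpers as separate arguments to select(), which is one of several possible spellings; the variable names are illustrative:

```r
library(dplyr)

# 1. A range of columns: name:mass stands for positions 1 to 3.
range_cols <- starwars %>% select(name:mass)
names(range_cols)

# 2. Two overlapping sets of columns; hair_color belongs to both.
starts_h <- starwars %>% select(starts_with("h"))
ends_color <- starwars %>% select(ends_with("_color"))

# The union of the two sets: each column appears only once.
both <- starwars %>% select(starts_with("h"), ends_with("_color"))
names(both)
```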

Data expressions versus context expressions

The origins of actions and selections

Data masking

Actions are operations evaluated with a data mask. The notion of actions is embedded deep in the R language:
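A few base R functions that evaluate their arguments with a data mask, shown on the built-in mtcars data set. These are illustrative picks, not an exhaustive list:

```r
# with() evaluates an expression with the columns of a data frame in scope.
avg_cyl <- with(mtcars, mean(cyl))

# subset() evaluates its filtering condition in the data frame.
four_cyl <- subset(mtcars, cyl == 4)

# Modelling functions look up formula variables in `data` first.
fit <- lm(mpg ~ wt, data = mtcars)

avg_cyl
nrow(four_cyl)
coef(fit)
```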

The low-level function eval() makes it easy to implement actions:
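A minimal sketch of such an implementation. The helper with_data() and the variable multiplier are hypothetical names introduced for this example:

```r
# Hypothetical helper: evaluate `expr` with the columns of `data` in scope.
with_data <- function(data, expr) {
  # Capture the expression typed by the user without evaluating it.
  expr <- substitute(expr)
  # Look up symbols in `data` first, then in the caller's environment.
  eval(expr, data, parent.frame())
}

multiplier <- 10

# `cyl` is found in mtcars, `multiplier` in the workspace.
with_data(mtcars, mean(cyl) * multiplier)
```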

Tidyverse packages use tidy eval, an extension of base R data masking provided by the rlang package.
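A rough tidy eval counterpart to a substitute() and eval() pair might look like the sketch below. It assumes the rlang package is installed; with_data_tidy() is a hypothetical helper, not a function provided by rlang or dplyr:

```r
library(rlang)

# Hypothetical helper built on tidy eval.
with_data_tidy <- function(data, expr) {
  # enquo() captures the expression together with its environment.
  expr <- enquo(expr)
  # eval_tidy() evaluates it with `data` acting as the data mask.
  eval_tidy(expr, data)
}

# A workspace object that shares its name with a column of mtcars.
cyl <- "workspace object"

with_data_tidy(mtcars, mean(cyl))        # the column masks the object
with_data_tidy(mtcars, mean(.data$cyl))  # explicitly the column
with_data_tidy(mtcars, .env$cyl)         # explicitly the workspace object
```

The .data and .env pronouns are part of what tidy eval adds on top of base R masking: they let you state which of the two a name should come from.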