The main design goal of dplyr is to be intuitive. Ideally, it should be obvious at a glance what a piece of dplyr code is about and what it does. But despite this apparent simplicity, many different things happen under the hood. This vignette explains how each dplyr verb behaves, to help you become more proficient in writing R and tidyverse code.
The dplyr verbs can be categorised along three main properties:
How they reshape the data frame. Some verbs create columns, others remove them. Some reduce the number of rows, others increase it. Quite often, the shape does not change.
The type of data passed in by the user. Is the data created on the spot by computing new columns or is the data purely selected within the data frame? In the former case, the verb takes actions, in the latter case, it takes selections.
How the verb handles groups. Most of the time, grouped data frames produce different results than ungrouped data frames because the computations occur within group levels.
Actions and selections are two kinds of operations in dplyr and the tidyverse. We call operations the arguments supplied by the user that either create new columns or vectors, or select them from a data frame. An operation may be very simple. For instance, the following examples feature two operations, mass and hair_color:
starwars %>% arrange(mass, hair_color)
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Ratt… 79 15 none grey, blue unknown NA male
#> 2 Yoda 66 17 white green brown 896 male
#> 3 Wick… 88 20 brown brown brown 8 male
#> 4 R2-D2 96 32 <NA> white, bl… red 33 <NA>
#> # … with 83 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
starwars %>% select(mass, hair_color)
#> # A tibble: 87 x 2
#> mass hair_color
#> <dbl> <chr>
#> 1 77 blond
#> 2 75 <NA>
#> 3 32 <NA>
#> 4 136 none
#> # … with 83 more rows
But really, the operations can become as complex as you like:
starwars %>% arrange(-mass, desc(paste(hair_color, eye_color)))
#> # A tibble: 87 x 13
#> name height mass hair_color skin_color eye_color birth_year gender
#> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
#> 1 Jabb… 175 1358 <NA> green-tan… orange 600 herma…
#> 2 Grie… 216 159 none brown, wh… green, y… NA male
#> 3 IG-88 200 140 none metal red 15 none
#> 4 Dart… 202 136 none white yellow 41.9 male
#> # … with 83 more rows, and 5 more variables: homeworld <chr>,
#> # species <chr>, films <list>, vehicles <list>, starships <list>
starwars %>% select(1 + 2, intersect(starts_with("hair"), ends_with("color")))
#> # A tibble: 87 x 2
#> mass hair_color
#> <dbl> <chr>
#> 1 77 blond
#> 2 75 <NA>
#> 3 32 <NA>
#> 4 136 none
#> # … with 83 more rows
While all operations look like R code, they are actually interpreted a little differently, depending on the dplyr verb at the receiving end. The key to writing more complex dplyr operations effectively is to understand the difference between the two kinds of operations, actions and selections.
Actions are the most common flavour of dplyr operations. They behave just like any ordinary R code, with a few added features.
All actions create new data. To illustrate this, we use transmute(), a dplyr verb that takes actions and returns the result of those actions, preserving the shape of the data frame:
starwars %>% transmute(1 + 2)
#> # A tibble: 87 x 1
#> `1 + 2`
#> <dbl>
#> 1 3
#> 2 3
#> 3 3
#> 4 3
#> # … with 83 more rows
We get a one-column data frame containing the result of 1 + 2, recycled to the total data frame size, and given an automatic name to remind you how the data was created.
Actions are computed in the context of your data frame. This means you can refer to your columns as if they were actual objects in your workspace. We call this data masking because the columns have precedence over other objects in your workspace, i.e. they mask these objects.
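As a minimal sketch of data masking, the example below defines a workspace object named height (a hypothetical value chosen only for illustration) and shows that the column of the same name takes precedence inside transmute(). The column name height_m is also an assumption made for this example.

```r
library(dplyr)

# Hypothetical workspace object that shares its name with a starwars column.
height <- 1000

# Inside transmute(), `height` refers to the column, not to the object above.
masked <- starwars %>% transmute(height_m = height / 100)

# The first value comes from the column (172 for the first row).
masked$height_m[1]
#> [1] 1.72
```

Had the workspace object been used instead, every row would contain 10.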
When you supply a grouped tibble, actions are automatically computed within groups. This is a dplyr feature that isn’t available in all data-masking APIs. For instance, actions supplied to ggplot2::aes() are not computed within groups.
# Standardising by the overall standard deviation (computed across all rows)
starwars %>% transmute(height = height / sd(height, na.rm = TRUE))
#> # A tibble: 87 x 1
#> height
#> <dbl>
#> 1 4.95
#> 2 4.80
#> 3 2.76
#> 4 5.81
#> # … with 83 more rows
# Standardising by within-group standard deviations
starwars %>% group_by(eye_color) %>% transmute(height = height / sd(height, na.rm = TRUE))
#> # A tibble: 87 x 2
#> # Groups: eye_color [15]
#> eye_color height
#> <chr> <dbl>
#> 1 blue 8.46
#> 2 yellow 3.96
#> 3 red 1.80
#> 4 yellow 4.78
#> # … with 83 more rows
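To make the grouped behaviour concrete, the sketch below recomputes the standardised heights for a single group by hand and compares them with the grouped result. The choice of the "blue" group and the names grouped, blue and manual are illustrative assumptions.

```r
library(dplyr)

# Grouped computation: sd() is evaluated separately within each eye colour.
grouped <- starwars %>%
  group_by(eye_color) %>%
  transmute(height = height / sd(height, na.rm = TRUE)) %>%
  ungroup()

# Manual computation for one group only.
blue <- starwars %>% filter(eye_color == "blue")
manual <- blue$height / sd(blue$height, na.rm = TRUE)

# The rows of the grouped result that belong to the "blue" group
# should match the manual computation.
all.equal(grouped$height[grouped$eye_color == "blue"], manual)
```

The comparison relies on transmute() and filter() both preserving the original row order.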
The usage of dplyr::select() looks similar to other verbs. In some cases, you’ll get exactly the same results:
starwars %>% select(height)
#> # A tibble: 87 x 1
#> height
#> <int>
#> 1 172
#> 2 167
#> 3 96
#> 4 202
#> # … with 83 more rows
starwars %>% transmute(height)
#> # A tibble: 87 x 1
#> height
#> <int>
#> 1 172
#> 2 167
#> 3 96
#> 4 202
#> # … with 83 more rows
However, select() actually uses a different mechanism under the hood. The arguments to select() are selections, not actions. Whereas action verbs take entire vectors as arguments, selecting verbs take column names or column positions. Observe the difference:
starwars %>% select(2)
#> # A tibble: 87 x 1
#> height
#> <int>
#> 1 172
#> 2 167
#> 3 96
#> 4 202
#> # … with 83 more rows
starwars %>% transmute(2)
#> # A tibble: 87 x 1
#> `2`
#> <dbl>
#> 1 2
#> 2 2
#> 3 2
#> 4 2
#> # … with 83 more rows
The key feature of selecting verbs is that column names represent their own position inside the data frame. From the perspective of select(), height is the same as 2 because it is the second column in the starwars data frame. On the other hand, transmute() takes actions and interprets 2 as a new column recycled to the full data frame size.
There are two reasons for this difference.
It makes it possible to select ranges of columns with start:end. For instance, name:mass expands to 1:3. This couldn’t work if name and mass represented the entire columns rather than just their positions.
It allows overlapping selections. Take the sets of columns starting with the letter "h" and of columns ending with the suffix "_color":
starwars %>% select(starts_with("h")) %>% names()
#> [1] "height" "hair_color" "homeworld"
starwars %>% select(ends_with("_color")) %>% names()
#> [1] "hair_color" "skin_color" "eye_color"
It is easy to select the union of those overlapping sets by supplying both helpers to the same select() call.
To achieve this, select() needs unique identifiers for the columns, which the column positions provide.
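Both reasons can be checked directly. The sketch below assumes that the three leading columns of starwars are name, height and mass, as in the output shown earlier; the names by_name, by_position and both are introduced only for this example.

```r
library(dplyr)

# A range of names selects the same columns as the matching range of positions.
by_name <- starwars %>% select(name:mass) %>% names()
by_position <- starwars %>% select(1:3) %>% names()
identical(by_name, by_position)
#> [1] TRUE

# Overlapping selections: hair_color belongs to both sets but is returned once.
both <- starwars %>% select(starts_with("h"), ends_with("_color")) %>% names()
both
#> [1] "height"     "hair_color" "homeworld"  "skin_color" "eye_color"
```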
Actions are operations evaluated with a data mask. The notion of actions is embedded deep in the R language.
The low-level function eval() makes it easy to implement actions:
expr <- quote(mass / 100)
result <- eval(expr, starwars)
head(result)
#> [1] 0.77 0.75 0.32 1.36 0.49 1.20
Tidyverse packages use tidy eval, an extension of base R data masking provided by the rlang package.
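For comparison, a roughly equivalent sketch with rlang uses eval_tidy(), which accepts a data frame as a data mask. The small data frame df and the workspace object mass are hypothetical, introduced only to show that the column takes precedence over the workspace object.

```r
library(rlang)

# Hypothetical data and a workspace object with the same name as a column.
df <- data.frame(mass = c(77, 75, 32))
mass <- 1000

# The column in the data mask takes precedence over the workspace object.
result <- eval_tidy(quote(mass / 100), data = df)
result
#> [1] 0.77 0.75 0.32
```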
Selections were introduced in base R with the select argument of base::subset():
subset(starwars, select = name:mass)
#> # A tibble: 87 x 3
#> name height mass
#> <chr> <int> <dbl>
#> 1 Luke Skywalker 172 77
#> 2 C-3PO 167 75
#> 3 R2-D2 96 32
#> 4 Darth Vader 202 136
#> # … with 83 more rows
Selections are implemented with a special data mask that contains column positions:
mask <- as.list(seq_along(starwars))
names(mask) <- names(starwars)
str(mask)
#> List of 13
#> $ name : int 1
#> $ height : int 2
#> $ mass : int 3
#> $ hair_color: int 4
#> $ skin_color: int 5
#> $ eye_color : int 6
#> $ birth_year: int 7
#> $ gender : int 8
#> $ homeworld : int 9
#> $ species : int 10
#> $ films : int 11
#> $ vehicles : int 12
#> $ starships : int 13
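Continuing the sketch, a selection expression can be evaluated against this mask with eval() to obtain column positions, which then index the data frame. This illustrates the idea only and is not meant to describe how select() is actually implemented; the name pos is an assumption made for this example.

```r
library(dplyr)  # provides the starwars data set

# Build the mask of column positions, as above.
mask <- as.list(seq_along(starwars))
names(mask) <- names(starwars)

# In this mask, `name` is 1 and `mass` is 3, so `name:mass` evaluates to 1:3.
pos <- eval(quote(name:mass), mask)
pos
#> [1] 1 2 3

# Use the positions to subset the columns.
names(starwars[pos])
#> [1] "name"   "height" "mass"
```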