class: title-slide

# An introduction to the tidyverse
## tidyr, dplyr and purrr

### Aitor Ameztegui .blue[@multivac42] .font70[University of Lleida]
### Víctor Granda .blue[@MalditoBarbudo] .font70[Joint Research Unit CTFC-CREAF]

#### **I SIBECOL meeting** (Barcelona, 4 February 2019)

---
layout: true

<div class="tweaked-header" style="background-image: url(resources/images/tidyverse.png)"></div>

---

# Who are we?

---
background-image: url(resources/images/hadley.png)
background-position: right bottom

# The tidyverse

The **tidyverse** is a collection of R packages designed for data science: a suite that aims to ease every step of data analysis.

It was created by Hadley Wickham, chief scientist at RStudio and author of more than 30 R packages (`readr`, `ggplot2`, `plyr`, `devtools`, `roxygen2`, `rmarkdown`...).

All packages share an underlying design philosophy, grammar, and data structures.

---

# *tidyverse*: tidy data

--

- Data in **tidy** format is easier to process and analyse, particularly in vectorized languages such as R.

---

# So what exactly is *in* the tidyverse?

.pull-extleft[]

.pull-extleft_right[

* `ggplot2` a system for creating graphics, based on the Grammar of Graphics
* `readr` a fast and friendly way to read rectangular data (csv, txt...)
* `tibble` a tibble is a reimagining of the data frame, keeping what time has proven to be effective and throwing out what has not
* `stringr` provides a cohesive set of functions designed to make working with strings as easy as possible
* `forcats` provides a suite of useful tools that solve common problems with factors
* `dplyr` provides a grammar of data manipulation, with a consistent set of verbs that solve the most common data manipulation challenges
* `tidyr` provides a set of functions that help you get to tidy data
* `purrr` enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors

]

---

# Installation and use

- Install all the packages in the tidyverse by running `install.packages("tidyverse")`
- Run `library(tidyverse)` to load the core tidyverse and make it available in your current R session.
- Learn more about the tidyverse package at http://tidyverse.tidyverse.org.
- Or check the cheatsheets

---

# Before we start...

- Neither `tidyr`, `dplyr` nor `purrr` does anything that can't be done with base R code, `apply` family functions, `for` loops or other packages.
- They are designed to be more time-efficient, easier to read and easier to use. They are more intuitive, especially for beginners (some adaptation may be needed if you are used to base R code).
- They work mostly with data frames. For other formats (matrices, arrays), `plyr` can be used.

---

# Our data

1. `plots [11858 x 15]`: all plots from the Third Spanish Forest Inventory (IFN3) in Catalonia
2. `trees [111756 x 10]`: all trees with dbh > 7.5 cm measured in both IFN2 and IFN3
3. `species [14778 x 15]`: number of trees per hectare in each plot, by species and size class
4. `coordinates [11858 x 6]`: X and Y UTM coordinates of each plot.
5.
`leaf [10447 x 3]`: leaf biomass and carbon content for those IFN 3 plots where they were available --- # let's have a look at the data ```r trees ``` ``` ## # A tibble: 111,756 x 10 ## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 ## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 080001 08 022 7 8.3 31.8 20 20.3 18.9 9 ## 2 080002 08 476 38 9.1 31.8 35 34 32.4 9 ## 3 080003 08 021 25 7 31.8 25 24.8 17.6 11 ## 4 080004 08 021 28 8.89 31.8 15 16.8 12.6 9.5 ## 5 080006 08 021 19 11.2 14.1 35 34.0 30.9 13 ## 6 080007 08 021 32 12 14.1 35 33.1 28.2 10 ## 7 080008 08 243 40 7.8 31.8 15 15 13.2 6 ## 8 080009 08 045 16 5.09 31.8 20 17.5 15.3 7 ## 9 080010 08 243 47 26.9 5.09 65 67.4 66.8 16.5 ## 10 080013 08 022 44 2.7 127. 15 15.1 12.6 9.5 ## # … with 111,746 more rows ``` --- # let's have a look at the data ```r plots ``` ``` ## # A tibble: 11,858 x 15 ## Codi Provincia Cla Subclase FccTot FccArb FechaIni HoraIni ## <fct> <chr> <fct> <fct> <int> <int> <date> <dttm> ## 1 0800… 08 A 1 80 70 2001-07-09 2017-11-26 09:44:00 ## 2 0800… 08 A 1 80 70 2001-08-06 2017-11-26 09:18:58 ## 3 0800… 08 A 1 90 80 2001-08-06 2017-11-26 12:08:09 ## 4 0800… 08 A 1 90 50 2001-07-09 2017-11-26 13:23:23 ## 5 0800… 08 A 1 70 60 2001-08-03 2017-11-26 09:11:28 ## 6 0800… 08 A 1 90 90 2001-08-01 2017-11-26 13:00:33 ## 7 0800… 08 A 1 90 90 2001-08-07 2017-11-26 10:08:15 ## 8 0800… 08 A 1 70 60 2001-08-03 2017-11-26 12:12:03 ## 9 0800… 08 A 1 80 70 2001-08-02 2017-11-26 09:00:16 ## 10 0800… 08 A 1 80 80 2001-06-14 2017-11-26 12:34:21 ## # … with 11,848 more rows, and 7 more variables: FechaFin <date>, HoraFin <dttm>, ## # Rocosid <int>, Textura <int>, MatOrg <int>, PhSuelo <int>, FechaPh <date> ``` --- # let's have a look at the data ```r species ``` ``` ## # A tibble: 14,778 x 15 ## # Groups: Codi, Especie [14,778] ## Codi Especie CD_10 CD_15 CD_20 CD_25 CD_30 CD_35 CD_40 CD_45 CD_50 CD_55 CD_60 ## * <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 
<dbl> <dbl> <dbl>
##  1 080001 022         0  159.  31.8  111.  60.1  19.2  5.09     0     0     0     0
##  2 080002 021         0     0     0     0     0  74.2  28.3  63.7     0     0     0
##  3 080002 022         0     0     0  173.  31.8     0     0     0     0     0     0
##  4 080002 476         0     0     0     0     0  31.8     0     0     0     0     0
##  5 080003 021         0     0     0  31.8     0     0     0     0  5.09     0     0
##  6 080003 022         0  127.     0     0  46.0  127.     0  14.1     0     0     0
##  7 080004 021         0  31.8     0     0  31.8     0     0     0     0     0     0
##  8 080005 071         0     0     0  14.1  5.09  31.8     0     0     0     0  31.8
##  9 080005 243         0     0  14.1     0  14.1     0  5.09     0     0     0     0
## 10 080005 278         0  286.  31.8     0     0  31.8     0     0     0     0     0
## # … with 14,768 more rows, and 2 more variables: CD_65 <dbl>, CD_70 <dbl>
```

---

# let's have a look at the data

```r
class(trees)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

Tibbles are not the usual data.frames:

- they have class `tbl_df`
- they print only 10 rows by default
- they report the type of each variable
- otherwise, a tibble behaves like a data.frame (it *is* a data.frame)

---

# let's have a look at the data

```r
glimpse(trees)
```

```
## Observations: 111,756
## Variables: 10
## $ Codi      <fct> 080001, 080002, 080003, 080004, 080006, 080007, 080008, 080009, …
## $ Provincia <chr> "08", "08", "08", "08", "08", "08", "08", "08", "08", "08", "08"…
## $ Especie   <fct> 022, 476, 021, 021, 021, 021, 243, 045, 243, 022, 021, 021, 071,…
## $ Rumbo     <dbl> 7, 38, 25, 28, 19, 32, 40, 16, 47, 44, 13, 9, 9, 25, 199, 6, 43,…
## $ Dist      <dbl> 8.30, 9.10, 7.00, 8.89, 11.19, 12.00, 7.80, 5.09, 26.89, 2.70, 8…
## $ N         <dbl> 31.83, 31.83, 31.83, 31.83, 14.14, 14.14, 31.83, 31.83, 5.09, 12…
## $ CD        <dbl> 20, 35, 25, 15, 35, 35, 15, 20, 65, 15, 20, 30, 45, 20, 35, 30, …
## $ DiamIf3   <dbl> 20.30, 34.00, 24.80, 16.85, 34.05, 33.10, 15.00, 17.50, 67.40, 1…
## $ DiamIf2   <dbl> 18.90, 32.45, 17.55, 12.65, 30.90, 28.15, 13.25, 15.30, 66.80, 1…
## $ HeiIf3    <dbl> 9.00, 9.00, 11.00, 9.50, 13.00, 10.00, 6.00, 7.00, 16.50, 9.50, …
```

---
layout: false
class: inverse
background-image: url(resources/images/dplyr.png)

# dplyr

---
layout: true

<div class="tweaked-header" style="background-image: url(resources/images/dplyr.png)"></div>

---

# 5 main verbs of dplyr

- `filter`: keep the rows that match a condition
- `select`: keep columns by name
- `arrange`: sort rows
- `mutate`: transform existing variables or create new ones
- `summarise`: compute summary statistics, reducing the data

---

# common structure
## (for most of the tidyverse)

```r
verb(data, ...)
```

- first argument: the data (a data.frame or tbl_df)
- the remaining arguments specify what to do with the data frame
- the output is always another data frame (tbl_df or data.frame)
- the original data frame is never modified unless we assign the result back to it (`<-`)

---

.middle[.center[.font200[
`filter`
]]]

.center[]

---

# Selecting rows (`filter`)

```r
filter(trees, Dist < 3)
```

```
## # A tibble: 11,601 x 10
##    Codi   Provincia Especie Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080013 08        022        44   2.7  127.    15    15.1    12.6    9.5
##  2 080015 08        021         9     1  127.    30    28.1    25.1   15.2
##  3 080027 08        021         2   0.6  127.    25    25.5    23.1     10
##  4 080034 08        021        17  2.59  127.    15    13.4     9.4   6.09
##  5 080051 08        022         5  2.29  127.    20    18.4    13.2    8.5
##  6 080065 08        022        10  2.09  127.    15    15.6    14.5     12
##  7 080118 08        476         3  2.29  127.    10    12.2     8.9   10.4
##  8 080188 08        243        24   2.7  127.    10    12.2    10.8    6.4
##  9 080197 08        243       103  2.79  127.    10    10.7     8.2   6.59
## 10 080198 08        021         2     2  127.    25    26.5    25.1   11.2
## # … with 11,591 more rows
```

---

# Selecting rows (`filter`)

```r
filter(trees, Provincia == '25')
```

```
## # A tibble: 35,665 x 10
##    Codi   Provincia Especie Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 250001 25        071        59  29.3  5.09    70      71    69.4   21.1
##  2 250002 25        071        18  25.5  5.09    50    51.4    46.4   19.8
##  3 250004 25        042        19  22.4  5.09    70    74.8      71   12.7
##  4 250005 25        255        26   7.3  31.8    30      32    27.0     17
##  5 250007 25        071        40  8.89  31.8    15    16.0    12.9   9.69
##  6 250008 25        042       203  3.79  127.
15 14.0 8.8 5.59
##  7 250010 25        071        34   7.4  31.8    50    50.3    47.3   23.3
##  8 250011 25        031        51  21.5  5.09    55    55.0    48.2   26.2
##  9 250012 25        071         4  13.5  14.1    50    52.4    50.8   16.8
## 10 250013 25        073        40  10.3  14.1    20    22.4    15.8     11
## # … with 35,655 more rows
```

---

# Selecting rows (`filter`)

```r
filter(trees, CD %in% c(45, 70))
```

```
## # A tibble: 2,552 x 10
##    Codi   Provincia Especie Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080016 08        071         9  26.9  5.09    45    46.0      45     19
##  2 080113 08        021        19  11.1  14.1    70    72.6    70.7   20.9
##  3 080686 08        042        43  7.09  31.8    45    44.3    42.5   12.5
##  4 080721 08        042        16  15.8  5.09    45    46.2    39.8   17.5
##  5 080743 08        042         3  9.69  31.8    45    46.1    40.7   25.5
##  6 081271 08        024        93  10.8  14.1    45    44.0    39.0   17.3
##  7 081278 08        024        87  10.1  14.1    45    44.6    38.9   15.1
##  8 081354 08        026        16   8.6  31.8    45    43.4    30.8   16.9
##  9 081402 08        054        51  5.19  31.8    70    102.    95.5   17.7
## 10 081943 08        024         1  10.3  14.1    45    43.2    33.8   11.7
## # … with 2,542 more rows
```

---

# Selecting rows (`filter`)

---

# Selecting rows (`filter`)
## Exercise 1

Let's find the plots in IFN3 (`plots` data frame) that:

1.1 Are located in either Barcelona (08) or Girona (17)

1.2 Were measured **in** January 2001

1.3 Took **more** than 2 hours to measure (7200 s)

---

.middle[.center[.font200[
`select`
]]]

.center[]

---

# Selecting columns (`select`)

```r
trees
```

```
## # A tibble: 111,756 x 10
##    Codi   Provincia Especie Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        022         7   8.3  31.8    20    20.3    18.9      9
##  2 080002 08        476        38   9.1  31.8    35      34    32.4      9
##  3 080003 08        021        25     7  31.8    25    24.8    17.6     11
##  4 080004 08        021        28  8.89  31.8    15    16.8    12.6    9.5
##  5 080006 08        021        19  11.2  14.1    35    34.0    30.9     13
##  6 080007 08        021        32    12  14.1    35    33.1    28.2     10
##  7 080008 08        243        40   7.8  31.8    15      15    13.2      6
##  8 080009 08        045        16  5.09  31.8    20    17.5    15.3      7
##  9 080010 08        243        47  26.9  5.09    65    67.4    66.8   16.5
## 10 080013 08        022        44   2.7
127. 15 15.1 12.6 9.5 ## # … with 111,746 more rows ``` --- # Selecting columns (`select`) ```r select(trees, DiamIf3) ``` ``` ## # A tibble: 111,756 x 1 ## DiamIf3 ## <dbl> ## 1 20.3 ## 2 34 ## 3 24.8 ## 4 16.8 ## 5 34.0 ## 6 33.1 ## 7 15 ## 8 17.5 ## 9 67.4 ## 10 15.1 ## # … with 111,746 more rows ``` --- # Selecting columns (`select`) ```r select(trees, -Codi) ``` ``` ## # A tibble: 111,756 x 9 ## Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 ## <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 08 022 7 8.3 31.8 20 20.3 18.9 9 ## 2 08 476 38 9.1 31.8 35 34 32.4 9 ## 3 08 021 25 7 31.8 25 24.8 17.6 11 ## 4 08 021 28 8.89 31.8 15 16.8 12.6 9.5 ## 5 08 021 19 11.2 14.1 35 34.0 30.9 13 ## 6 08 021 32 12 14.1 35 33.1 28.2 10 ## 7 08 243 40 7.8 31.8 15 15 13.2 6 ## 8 08 045 16 5.09 31.8 20 17.5 15.3 7 ## 9 08 243 47 26.9 5.09 65 67.4 66.8 16.5 ## 10 08 022 44 2.7 127. 15 15.1 12.6 9.5 ## # … with 111,746 more rows ``` --- # Selecting columns (`select`) ```r select(trees, DiamIf2, DiamIf3) ``` ``` ## # A tibble: 111,756 x 2 ## DiamIf2 DiamIf3 ## <dbl> <dbl> ## 1 18.9 20.3 ## 2 32.4 34 ## 3 17.6 24.8 ## 4 12.6 16.8 ## 5 30.9 34.0 ## 6 28.2 33.1 ## 7 13.2 15 ## 8 15.3 17.5 ## 9 66.8 67.4 ## 10 12.6 15.1 ## # … with 111,746 more rows ``` --- # Selecting columns (`select`) ```r select(trees, Codi:Dist) ``` ``` ## # A tibble: 111,756 x 5 ## Codi Provincia Especie Rumbo Dist ## <fct> <chr> <fct> <dbl> <dbl> ## 1 080001 08 022 7 8.3 ## 2 080002 08 476 38 9.1 ## 3 080003 08 021 25 7 ## 4 080004 08 021 28 8.89 ## 5 080006 08 021 19 11.2 ## 6 080007 08 021 32 12 ## 7 080008 08 243 40 7.8 ## 8 080009 08 045 16 5.09 ## 9 080010 08 243 47 26.9 ## 10 080013 08 022 44 2.7 ## # … with 111,746 more rows ``` --- # Selecting columns (`select`) ## Special functions: - `starts_with(x)`: names that start with x - `ends_with(x)`: names that end with x - `contains(x)`: selects all variables whose name contains x - `matches(x)`: selects all variables whose name contains the 
regular expression x - `num_range("x", 1:5, width = 2)`: selects all variables (numerically) from x01 to x05 - `one_of ("x", "y", "z")`: selects variables provided in a character vector - `everything()`: selects all variables --- # Selecting columns (`select`) ```r select(trees, starts_with('Diam')) ``` ``` ## # A tibble: 111,756 x 2 ## DiamIf3 DiamIf2 ## <dbl> <dbl> ## 1 20.3 18.9 ## 2 34 32.4 ## 3 24.8 17.6 ## 4 16.8 12.6 ## 5 34.0 30.9 ## 6 33.1 28.2 ## 7 15 13.2 ## 8 17.5 15.3 ## 9 67.4 66.8 ## 10 15.1 12.6 ## # … with 111,746 more rows ``` --- # Selecting columns (`select`) ## Exercise 2 Think of three or four ways to select the variables that define the start and finish date of plot measuring --- .middle[.center[.font200[ `arrange` ]]] .center[] --- # Sorting rows (`arrange`) ```r arrange(trees, Dist) ``` ``` ## # A tibble: 111,756 x 10 ## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 ## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 170672 17 021 73 0 127. 15 17.2 15.2 13.5 ## 2 251882 25 025 174 0.3 127. 15 13.5 10.4 6 ## 3 170640 17 045 290 0.3 127. 10 11.4 11.2 6.5 ## 4 081544 08 025 176 0.3 127. 20 17.6 14.4 7 ## 5 081080 08 025 107 0.4 127. 15 13.8 10.2 8 ## 6 430170 43 024 0 0.4 127. 25 23.4 15.0 6.5 ## 7 430631 43 021 17 0.4 127. 15 15.6 14 10.7 ## 8 171941 17 079 150 0.4 127. 20 20.0 10.6 20 ## 9 171990 17 045 34 0.4 127. 25 23.2 19.6 8 ## 10 172056 17 045 160 0.4 127. 
15 12.7 11.3 8 ## # … with 111,746 more rows ``` --- # Sorting rows (`arrange`) ```r arrange(trees, desc(Dist)) ``` ``` ## # A tibble: 111,756 x 10 ## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 ## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 250008 25 042 398 36.5 5.09 70 74.8 72.3 11.1 ## 2 250025 25 031 350 34.8 5.09 70 75.1 71.3 35.7 ## 3 250135 25 022 158 33.9 5.09 50 51.8 54.0 9 ## 4 250051 25 021 110 33.6 5.09 60 58.2 56.8 6.5 ## 5 250025 25 031 238 33.5 5.09 50 51 45.5 20.3 ## 6 250010 25 071 214 32.9 5.09 70 85.3 71.6 21 ## 7 251788 25 045 163 32.8 5.09 60 61.6 51.7 10.5 ## 8 250039 25 031 119 32.5 5.09 65 63.5 62.6 30 ## 9 250025 25 031 390 32.1 5.09 45 45.6 43.8 24.6 ## 10 171289 17 071 38 31.4 5.09 60 60.8 50.4 26.9 ## # … with 111,746 more rows ``` --- # Sorting rows (`arrange`) ## Exercise 3 3.1 Sort plots by date and hour of measurement 3.2 Which plots were started to be measured later in the day? 3.3 Which plots took longer to be measured? --- .middle[.center[.font200[ `mutate` ]]] .center[] --- # Transforming variables (`mutate`) ```r mutate( trees, Dist = Dist * 100 ) ``` ``` ## # A tibble: 111,756 x 10 ## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 ## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 080001 08 022 7 830. 31.8 20 20.3 18.9 9 ## 2 080002 08 476 38 910 31.8 35 34 32.4 9 ## 3 080003 08 021 25 700 31.8 25 24.8 17.6 11 ## 4 080004 08 021 28 889 31.8 15 16.8 12.6 9.5 ## 5 080006 08 021 19 1119 14.1 35 34.0 30.9 13 ## 6 080007 08 021 32 1200 14.1 35 33.1 28.2 10 ## 7 080008 08 243 40 780 31.8 15 15 13.2 6 ## 8 080009 08 045 16 509 31.8 20 17.5 15.3 7 ## 9 080010 08 243 47 2689 5.09 65 67.4 66.8 16.5 ## 10 080013 08 022 44 270 127. 
15 15.1 12.6 9.5 ## # … with 111,746 more rows ``` --- # Transforming variables (`mutate`) ```r mutate( trees, Alometria = DiamIf3 / HeiIf3, Alometria2 = Alometria * DiamIf2 ) ``` ``` ## # A tibble: 111,756 x 12 ## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 Alometria ## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0800… 08 022 7 8.3 31.8 20 20.3 18.9 9 2.26 ## 2 0800… 08 476 38 9.1 31.8 35 34 32.4 9 3.78 ## 3 0800… 08 021 25 7 31.8 25 24.8 17.6 11 2.25 ## 4 0800… 08 021 28 8.89 31.8 15 16.8 12.6 9.5 1.77 ## 5 0800… 08 021 19 11.2 14.1 35 34.0 30.9 13 2.62 ## 6 0800… 08 021 32 12 14.1 35 33.1 28.2 10 3.31 ## 7 0800… 08 243 40 7.8 31.8 15 15 13.2 6 2.5 ## 8 0800… 08 045 16 5.09 31.8 20 17.5 15.3 7 2.5 ## 9 0800… 08 243 47 26.9 5.09 65 67.4 66.8 16.5 4.08 ## 10 0800… 08 022 44 2.7 127. 15 15.1 12.6 9.5 1.59 ## # … with 111,746 more rows, and 1 more variable: Alometria2 <dbl> ``` --- # Transforming variables (`mutate`) ## Special functions: - `if_else` ```r mutate( trees, Especie = if_else(Especie == '021', 'Pinus sylvestris', 'Other') ) ``` ``` ## # A tibble: 111,756 x 10 ## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 ## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 080001 08 Other 7 8.3 31.8 20 20.3 18.9 9 ## 2 080002 08 Other 38 9.1 31.8 35 34 32.4 9 ## 3 080003 08 Pinus sylvestris 25 7 31.8 25 24.8 17.6 11 ## 4 080004 08 Pinus sylvestris 28 8.89 31.8 15 16.8 12.6 9.5 ## 5 080006 08 Pinus sylvestris 19 11.2 14.1 35 34.0 30.9 13 ## 6 080007 08 Pinus sylvestris 32 12 14.1 35 33.1 28.2 10 ## 7 080008 08 Other 40 7.8 31.8 15 15 13.2 6 ## 8 080009 08 Other 16 5.09 31.8 20 17.5 15.3 7 ## 9 080010 08 Other 47 26.9 5.09 65 67.4 66.8 16.5 ## 10 080013 08 Other 44 2.7 127. 
15 15.1 12.6 9.5
## # … with 111,746 more rows
```

---

# Transforming variables (`mutate`)
## Exercise 4

4.1 Compute the growth (in cm) of each tree between IFN2 and IFN3

4.2 Create two new variables with the basal area of each tree (in `\(m^2\)` per hectare), one for IFN2 and one for IFN3. Which species does the tree with the fastest basal-area growth belong to?

<br>

.center[Clue:]

$$ AB = \frac{\pi}{4} \cdot Diam^{2} \cdot N $$

---

.middle[.center[.font200[
`summarise`
]]]

.center[]

---

# Reducing variables (`summarise`)

```r
summarise(trees, mean_if3 = mean(DiamIf3))
```

```
## # A tibble: 1 x 1
##   mean_if3
##      <dbl>
## 1     23.4
```

---

# Reducing variables (`summarise`)
## Summary functions

- `min(x)`, `max(x)`, `quantile(x, p)`
- `mean(x)`, `median(x)`
- `sd(x)`, `var(x)`, `IQR(x)`
- `n()`, `n_distinct(x)`
- `sum(x > 10)`, `mean(x > 10)`

---

.middle[.center[.font200[
`grouped summarise`
]]]

.center[]

---

# Reducing variables (`summarise`)
## Grouped summarise

```r
by_province <- group_by(trees, Provincia)
by_province
```

```
## # A tibble: 111,756 x 10
## # Groups:   Provincia [4]
##    Codi   Provincia Especie Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##    <fct>  <chr>     <fct>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        022         7   8.3  31.8    20    20.3    18.9      9
##  2 080002 08        476        38   9.1  31.8    35      34    32.4      9
##  3 080003 08        021        25     7  31.8    25    24.8    17.6     11
##  4 080004 08        021        28  8.89  31.8    15    16.8    12.6    9.5
##  5 080006 08        021        19  11.2  14.1    35    34.0    30.9     13
##  6 080007 08        021        32    12  14.1    35    33.1    28.2     10
##  7 080008 08        243        40   7.8  31.8    15      15    13.2      6
##  8 080009 08        045        16  5.09  31.8    20    17.5    15.3      7
##  9 080010 08        243        47  26.9  5.09    65    67.4    66.8   16.5
## 10 080013 08        022        44   2.7  127.
15 15.1 12.6 9.5
## # … with 111,746 more rows
```

---

# Reducing variables (`summarise`)
## Grouped summarise

```r
summarise(
  by_province,
  mean_height_ifn3 = mean(HeiIf3, na.rm = TRUE),
  max_height_ifn3 = max(HeiIf3, na.rm = TRUE),
  min_height_ifn3 = min(HeiIf3, na.rm = TRUE)
)
```

```
## # A tibble: 4 x 4
##   Provincia mean_height_ifn3 max_height_ifn3 min_height_ifn3
##   <chr>                <dbl>           <dbl>           <dbl>
## 1 08                    11.2              35             1.6
## 2 17                    11.4              38             1.5
## 3 25                    11.4            35.8               2
## 4 43                    9.21              33               2
```

---

# Reducing variables (`summarise`)
## Exercise 5

Which statistics would you calculate to characterize diameter values in each plot?

---
class: code60

# Targeted transformations
## `summarise_if` and `summarise_at`

We can apply a summarising function to a group of variables...

- that share some common characteristic that can be tested (e.g. numeric variables)

```r
summarise_if(trees, is.numeric, mean)
```

```
## # A tibble: 1 x 7
##   Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3
##   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
## 1  198.
7.78 55.6 23.4 23.4 20.3 11.1 ``` - by name or using the select helpers (`starts_with`, `ends_with`, `one_of`) ```r summarise_at(trees, vars(starts_with('Diam')), mean) ``` ``` ## # A tibble: 1 x 2 ## DiamIf3 DiamIf2 ## <dbl> <dbl> ## 1 23.4 20.3 ``` --- class: code60 # Targeted transformations ## `mutate_if` and `mutate_at` The same can be done with `mutate`: .pull-left[ ```r mutate_if(trees, is.numeric, log) ``` ``` ## # A tibble: 111,756 x 10 ## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 ## <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 080001 08 022 1.95 2.12 3.46 3.00 3.01 2.94 2.20 ## 2 080002 08 476 3.64 2.21 3.46 3.56 3.53 3.48 2.20 ## 3 080003 08 021 3.22 1.95 3.46 3.22 3.21 2.87 2.40 ## 4 080004 08 021 3.33 2.18 3.46 2.71 2.82 2.54 2.25 ## 5 080006 08 021 2.94 2.42 2.65 3.56 3.53 3.43 2.56 ## 6 080007 08 021 3.47 2.48 2.65 3.56 3.50 3.34 2.30 ## 7 080008 08 243 3.69 2.05 3.46 2.71 2.71 2.58 1.79 ## 8 080009 08 045 2.77 1.63 3.46 3.00 2.86 2.73 1.95 ## 9 080010 08 243 3.85 3.29 1.63 4.17 4.21 4.20 2.80 ## 10 080013 08 022 3.78 0.993 4.85 2.71 2.71 2.53 2.25 ## # … with 111,746 more rows ``` ] .pull-right[ ```r mutate_at( trees, vars(one_of(c('Especie', 'Species'))), ~ paste0('sp_', .x) ) ``` ``` ## Warning: Unknown columns: `Species` ``` ``` ## # A tibble: 111,756 x 10 ## Codi Provincia Especie Rumbo Dist N CD DiamIf3 DiamIf2 HeiIf3 ## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 080001 08 sp_022 7 8.3 31.8 20 20.3 18.9 9 ## 2 080002 08 sp_476 38 9.1 31.8 35 34 32.4 9 ## 3 080003 08 sp_021 25 7 31.8 25 24.8 17.6 11 ## 4 080004 08 sp_021 28 8.89 31.8 15 16.8 12.6 9.5 ## 5 080006 08 sp_021 19 11.2 14.1 35 34.0 30.9 13 ## 6 080007 08 sp_021 32 12 14.1 35 33.1 28.2 10 ## 7 080008 08 sp_243 40 7.8 31.8 15 15 13.2 6 ## 8 080009 08 sp_045 16 5.09 31.8 20 17.5 15.3 7 ## 9 080010 08 sp_243 47 26.9 5.09 65 67.4 66.8 16.5 ## 10 080013 08 sp_022 44 2.7 127. 
15 15.1 12.6 9.5
## # … with 111,746 more rows
```
]

---
layout: false
class: inverse
background-image: url(resources/images/pipe.png)

# pipes

---
layout: true

<div class="tweaked-header" style="background-image: url(resources/images/pipe.png)"></div>

---
class: code80

# Data pipelines (`%>%`)

- Often we want to chain several verbs (`filter`, `arrange`, `group_by`, `summarise`...)
- Combining several operations either produces nested calls that are hard to read or requires creating several intermediate objects:

.pull-left[
```r
diam_especie <- filter(
  summarise(
    group_by(
      filter(
        trees,
        !is.na(DiamIf3)
      ),
      Codi, Especie
    ),
    diam = mean(DiamIf3),
    n = n()
  ),
  n > 5
)
```
]

.pull-right[
```r
no_na_trees <- filter(
  trees,
  !is.na(DiamIf3)
)

no_na_trees_grouped <- group_by(
  no_na_trees,
  Codi, Especie
)

summarised_no_na_trees <- summarise(
  no_na_trees_grouped,
  diam = mean(DiamIf3),
  n = n()
)

final_data <- filter(
  summarised_no_na_trees,
  n > 5
)
```
]

---

# Data pipelines (`%>%`)

- Alternative (cleaner and easier to read): the *pipe* operator (`%>%`) from the `magrittr` package

- The result of the left-hand side is passed to the function on the right as its first argument:

  `f(x, y)` is the same as `x %>% f(y)`

  `f(x, y, z)` is the same as `x %>% f(y, z)`

- In the tidyverse, `%>%` applies each function to the data frame produced by the previous step:

  `filter(df, color == 'blue')` is the same as `df %>% filter(color == 'blue')`

  `mutate(df, double = 2*value)` is the same as `df %>% mutate(double = 2*value)`

---
class: code80

# Data pipelines (`%>%`)

.pull-left[
Nested functions
```r
diam_especie <- filter(
  summarise(
    group_by(
      filter(
        trees,
        !is.na(DiamIf3)
      ),
      Codi, Especie
    ),
    diam = mean(DiamIf3),
    n = n()
  ),
  n > 5
)
```
]

--

.pull-right[
Pipeline
```r
diam_especie <- trees %>%
  filter(!is.na(DiamIf3)) %>%
  group_by(Codi, Especie) %>%
  summarise(
    diam = mean(DiamIf3),
    n = n()
  ) %>%
  filter(n > 5)
```
]

---

# Data pipelines (`%>%`)
## Exercise 6

Create pipelines to answer the following questions:

6.1 Which **plots** have the fastest
average growth rate?

6.2 Which is the plot with the **most species**?

6.3 Is there any **relationship** between the two variables? <br> *(Optional; some knowledge of `ggplot` is required)*

---
layout: true

<div class="tweaked-header" style="background-image: url(resources/images/dplyr.png)"></div>

---

.middle[.center[.font200[
`grouped mutate`
]]]

.center[]

---
class: code60

# Grouped `mutate`/`filter`

We will commonly use groups (`group_by`) when summarising variables (*n* inputs, one output):

```r
trees %>%
  group_by(Especie) %>%
  summarise(mean = mean(DiamIf3))
```

.center[]

Sometimes, however, we may be interested in calculating new variables by group, but without reducing the dimensions:

.center[]

---
class: code60

# Grouped `mutate`/`filter`

Sometimes, however, we may be interested in calculating new variables by group, but without reducing the dimensions:

```r
trees %>%
  group_by(Especie) %>%
  mutate(
    std_diam = DiamIf3 - mean(DiamIf3)
  )
```

```
## # A tibble: 111,756 x 11
## # Groups:   Especie [91]
##    Codi   Provincia Especie Rumbo  Dist     N    CD DiamIf3 DiamIf2 HeiIf3 std_diam
##    <fct>  <chr>     <fct>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>    <dbl>
##  1 080001 08        022         7   8.3  31.8    20    20.3    18.9      9    -6.71
##  2 080002 08        476        38   9.1  31.8    35      34    32.4      9     13.5
##  3 080003 08        021        25     7  31.8    25    24.8    17.6     11   -0.555
##  4 080004 08        021        28  8.89  31.8    15    16.8    12.6    9.5    -8.50
##  5 080006 08        021        19  11.2  14.1    35    34.0    30.9     13     8.70
##  6 080007 08        021        32    12  14.1    35    33.1    28.2     10     7.75
##  7 080008 08        243        40   7.8  31.8    15      15    13.2      6    -5.86
##  8 080009 08        045        16  5.09  31.8    20    17.5    15.3      7     1.40
##  9 080010 08        243        47  26.9  5.09    65    67.4    66.8   16.5     46.5
## 10 080013 08        022        44   2.7  127.
15 15.1 12.6 9.5 -11.9
## # … with 111,746 more rows
```

---

# Grouped `mutate`/`filter`
## Exercise 7

7.1 Identify the trees that grew the most compared with the average of their plot
<br>
.font80[(Hint: calculate growth, *then* mean growth by plot, and *then* the difference)]

7.2 Identify the plots where a species grows much more than the average for that species

<br>
<br>
**Extra (in case you get bored):**

7.3 Select IFN plots with pure *Pinus nigra* stands (Especie = 025). Note: we consider a forest monospecific when more than 80% of its basal area (BA) corresponds to a single species

---

# Working with two tables

.middle[##`*_join`]

.middle[.center[]]

---

# Joining data
## Mutating joins

.pull-left[

]

.pull-right[

- `left_join(x, y)`: adds the information in y to the observations that also appear in x. All original observations (x) are kept

<br>
<br>
<br>

- `right_join(x, y)`: adds the information in x to the observations that also appear in y. All original observations (y) are kept

]

---

# Joining data
## Mutating joins

.pull-left[

]

.pull-right[

- `full_join(x, y)`: keeps all observations from both x and y

<br>
<br>
<br>
<br>

- `inner_join(x, y)`: keeps only the observations present in **both** x and y

]

---

# Joining data
## Filtering joins

They match observations like the ‘mutating joins’, but they only **affect the observations**

THEY DO NOT ADD COLUMNS!

.pull-left[

]

.pull-right[

- `semi_join(x, y)`: keeps the observations in x that are present in y

<br>
<br>
<br>
<br>

- `anti_join(x, y)`: removes the observations in x that are present in y

]

---

# Joining data
## Exercise 8

Add the X and Y coordinates included in the `coordinates` data frame to each plot in the `plots` data frame

---
layout: false
class: inverse
background-image: url(resources/images/tidyr.png)

# tidyr

---
layout: true

<div class="tweaked-header" style="background-image: url(resources/images/tidyr.png)"></div>

---

# tidy data

.center[

]

--

- Data in **tidy** format is easier to process and analyse, particularly in vectorized languages such as R.

---

# tidyr

Data is not always organized...
``` ## # A tibble: 5,769 x 22 ## iso2 year m04 m514 m014 m1524 m2534 m3544 m4554 m5564 m65 mu f04 f514 f014 f1524 ## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> ## 1 AD 1989 NA NA NA NA NA NA NA NA NA NA NA NA NA NA ## 2 AD 1990 NA NA NA NA NA NA NA NA NA NA NA NA NA NA ## 3 AD 1991 NA NA NA NA NA NA NA NA NA NA NA NA NA NA ## 4 AD 1992 NA NA NA NA NA NA NA NA NA NA NA NA NA NA ## 5 AD 1993 NA NA NA NA NA NA NA NA NA NA NA NA NA NA ## 6 AD 1994 NA NA NA NA NA NA NA NA NA NA NA NA NA NA ## 7 AD 1996 NA NA 0 0 0 4 1 0 0 NA NA NA 0 1 ## 8 AD 1997 NA NA 0 0 1 2 2 1 6 NA NA NA 0 1 ## 9 AD 1998 NA NA 0 0 0 1 0 0 0 NA NA NA NA NA ## 10 AD 1999 NA NA 0 0 0 1 1 0 0 NA NA NA 0 0 ## # … with 5,759 more rows, and 6 more variables: f2534 <int>, f3544 <int>, f4554 <int>, f5564 <int>, ## # f65 <int>, fu <int> ``` --- # tidyr that's better!! ``` ## # A tibble: 35,750 x 5 ## iso2 year sex age_group n ## <chr> <int> <chr> <chr> <int> ## 1 AD 1996 f 014 0 ## 2 AD 1996 f 1524 1 ## 3 AD 1996 f 2534 1 ## 4 AD 1996 f 3544 0 ## 5 AD 1996 f 4554 0 ## 6 AD 1996 f 5564 1 ## 7 AD 1996 f 65 0 ## 8 AD 1996 m 014 0 ## 9 AD 1996 m 1524 0 ## 10 AD 1996 m 2534 0 ## # … with 35,740 more rows ``` --- # tidyr ```r species ``` ``` ## # A tibble: 14,778 x 15 ## # Groups: Codi, Especie [14,778] ## Codi Especie CD_10 CD_15 CD_20 CD_25 CD_30 CD_35 CD_40 CD_45 CD_50 CD_55 CD_60 CD_65 CD_70 ## * <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 080001 022 0 159. 31.8 111. 60.1 19.2 5.09 0 0 0 0 0 0 ## 2 080002 021 0 0 0 0 0 74.2 28.3 63.7 0 0 0 0 0 ## 3 080002 022 0 0 0 173. 31.8 0 0 0 0 0 0 0 0 ## 4 080002 476 0 0 0 0 0 31.8 0 0 0 0 0 0 0 ## 5 080003 021 0 0 0 31.8 0 0 0 0 5.09 0 0 0 0 ## 6 080003 022 0 127. 0 0 46.0 127. 0 14.1 0 0 0 0 0 ## 7 080004 021 0 31.8 0 0 31.8 0 0 0 0 0 0 0 0 ## 8 080005 071 0 0 0 14.1 5.09 31.8 0 0 0 0 31.8 0 0 ## 9 080005 243 0 0 14.1 0 14.1 0 5.09 0 0 0 0 0 0 ## 10 080005 278 0 286. 
31.8 0 0 31.8 0 0 0 0 0 0 0
## # … with 14,768 more rows
```

---

# tidyr
## The verbs of tidyr

- `gather`: convert data from *wide* to *long* format (columns to id-value pairs)
- `spread`: convert data from *long* to *wide* format (id-value pairs to columns)
- `separate`: split one column into several
- `unite`: join several columns into one

---

# tidyr
## common structure (for most of the tidyverse)

```r
verb(data, ...)
```

- first argument: the data (a data.frame or tbl_df)
- the remaining arguments specify what to do with the data frame
- the output is always another data frame (tbl_df or data.frame)
- the original data frame is never modified unless we assign the result back to it (`<-`)

---

.middle[.center[.font200[
`gather` & `separate`
]]]

.center[]

.center[]

---

# tidyr

```r
n_parcelas <- tibble(
  Prov = c('Lleida', 'Girona', 'Barcelona', 'Tarragona'),
  IFN_2 = c(16, 78, 60, 34),
  IFN_3 = c(18, 79, 67, 36)
)
n_parcelas
```

```
## # A tibble: 4 x 3
##   Prov      IFN_2 IFN_3
##   <chr>     <dbl> <dbl>
## 1 Lleida       16    18
## 2 Girona       78    79
## 3 Barcelona    60    67
## 4 Tarragona    34    36
```

---

# tidyr: gather
## `gather(df, key, value, vars)`

```r
n_parcelas_tidy <- gather(n_parcelas, IFN, n, IFN_2, IFN_3)
```

.pull-left[
```r
n_parcelas
```

```
## # A tibble: 4 x 3
##   Prov      IFN_2 IFN_3
##   <chr>     <dbl> <dbl>
## 1 Lleida       16    18
## 2 Girona       78    79
## 3 Barcelona    60    67
## 4 Tarragona    34    36
```
]

.pull-right[
```r
n_parcelas_tidy
```

```
## # A tibble: 8 x 3
##   Prov      IFN       n
##   <chr>     <chr> <dbl>
## 1 Lleida    IFN_2    16
## 2 Girona    IFN_2    78
## 3 Barcelona IFN_2    60
## 4 Tarragona IFN_2    34
## 5 Lleida    IFN_3    18
## 6 Girona    IFN_3    79
## 7 Barcelona IFN_3    67
## 8 Tarragona IFN_3    36
```
]

---

# tidyr: separate
## `separate(df, col, into, sep)`

```r
n_parcelas_sep <- separate(n_parcelas_tidy, IFN, c('source', 'version'), sep = '_')
n_parcelas_sep
```

```
## # A tibble: 8 x 4
##   Prov      source version     n
##   <chr>     <chr>  <chr>   <dbl>
## 1 Lleida    IFN    2          16
## 2 Girona    IFN    2          78
## 3 Barcelona IFN    2          60
## 4 Tarragona IFN    2          34
## 5 Lleida    IFN    3
18 ## 6 Girona IFN 3 79 ## 7 Barcelona IFN 3 67 ## 8 Tarragona IFN 3 36 ``` --- # `gather` and `separate` ## Exercise 9 Use `gather` and `separate` to transform the data frame species into a **tidy** format, where each column is a variable and each row an observation ``` ## # A tibble: 14,778 x 15 ## # Groups: Codi, Especie [14,778] ## Codi Especie CD_10 CD_15 CD_20 CD_25 CD_30 CD_35 CD_40 CD_45 CD_50 CD_55 CD_60 CD_65 CD_70 ## * <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 080001 022 0 159. 31.8 111. 60.1 19.2 5.09 0 0 0 0 0 0 ## 2 080002 021 0 0 0 0 0 74.2 28.3 63.7 0 0 0 0 0 ## 3 080002 022 0 0 0 173. 31.8 0 0 0 0 0 0 0 0 ## 4 080002 476 0 0 0 0 0 31.8 0 0 0 0 0 0 0 ## 5 080003 021 0 0 0 31.8 0 0 0 0 5.09 0 0 0 0 ## 6 080003 022 0 127. 0 0 46.0 127. 0 14.1 0 0 0 0 0 ## 7 080004 021 0 31.8 0 0 31.8 0 0 0 0 0 0 0 0 ## 8 080005 071 0 0 0 14.1 5.09 31.8 0 0 0 0 31.8 0 0 ## 9 080005 243 0 0 14.1 0 14.1 0 5.09 0 0 0 0 0 0 ## 10 080005 278 0 286. 
31.8 0 0 31.8 0 0 0 0 0 0 0 ## # … with 14,768 more rows ``` --- .middle[.center[.font200[ `spread` & `unite` ]]] .center[] .center[] --- # tidyr: unite ## `unite(data, col, vars, sep)` ```r n_parcelas_unite <- unite(n_parcelas_sep, IFN, source, version, sep = '_') n_parcelas_unite ``` ``` ## # A tibble: 8 x 3 ## Prov IFN n ## <chr> <chr> <dbl> ## 1 Lleida IFN_2 16 ## 2 Girona IFN_2 78 ## 3 Barcelona IFN_2 60 ## 4 Tarragona IFN_2 34 ## 5 Lleida IFN_3 18 ## 6 Girona IFN_3 79 ## 7 Barcelona IFN_3 67 ## 8 Tarragona IFN_3 36 ``` --- # tidyr: spread ## `spread(df, key, value, sep)` ```r n_parcelas2 <- spread(n_parcelas_unite, IFN, n) ``` .pull-left[ ```r n_parcelas ``` ``` ## # A tibble: 4 x 3 ## Prov IFN_2 IFN_3 ## <chr> <dbl> <dbl> ## 1 Lleida 16 18 ## 2 Girona 78 79 ## 3 Barcelona 60 67 ## 4 Tarragona 34 36 ``` ] .pull-right[ ```r n_parcelas2 ``` ``` ## # A tibble: 4 x 3 ## Prov IFN_2 IFN_3 ## <chr> <dbl> <dbl> ## 1 Barcelona 60 67 ## 2 Girona 78 79 ## 3 Lleida 16 18 ## 4 Tarragona 34 36 ``` ] --- # `spread` and `unite` ## Exercise 10 Use `unite` and `spread` to transform the data from exercise 9 into its original format --- layout: false class: inverse background-image: url(resources/images/purrr.png) # purrr --- layout: true <div class="tweaked-header" style="background-image: url(resources/images/purrr.png)"></div> --- # purrr Allows to do functional programming with R, making loops and repetitive tasks pipe-friendly and easier to read. ```r c(1, 2, 3) %>% map_dbl(~ .*2) ``` ``` ## [1] 2 4 6 ``` ```r c('a', 'b', 'c') %>% map_chr(~ paste0('treatment_', .)) ``` ``` ## [1] "treatment_a" "treatment_b" "treatment_c" ``` Can be used as alternatives to `apply` family functions (`lapply`, `vapply`...) --- # purrr ## common structure ```r verb(.x, .f, ...) 
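# Hedged illustration, not part of the original slides: map_dbl() plays the
# role of `verb`, c(1.26, 2.34, 3.78) is `.x`, round is `.f`, and the extra
# argument `digits = 1` travels through `...` to round().
rounded <- purrr::map_dbl(c(1.26, 2.34, 3.78), round, digits = 1)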
```

- first argument: vector or list (including data frames)

- second argument: function to apply to each element of `.x`

- output is a vector or a list of the same length as `.x`

---

# purrr

## The verbs of purrr

- `map`: Transforms the input by applying a function to each element and returning a vector the same length as the input

- `walk`: Like `map`, but called only for its side effects; it returns the original input

---

# map

## Flavours

- `map`: Always returns a list

- `map_chr`, `map_dbl`, `map_int`, `map_lgl`: Return a vector of the corresponding type

- `map_dfr`, `map_dfc`: Return a data frame created by row- or column-binding

---

# map

## functions

Functions to apply to each element can be supplied in different forms:

- lambda function: starts with `~` and uses `.` as the element placeholder

```r
list.files('data', '.csv', full.names = TRUE) %>%
  map_dfr(~ read_csv(., col_types = 'ccccc???????'))
```

- function name: any extra arguments can be supplied after the function name

```r
list.files('data', '.csv', full.names = TRUE) %>%
  map_dfr(read_csv, col_types = 'ccccc???????')
```

---

# map

## Modelling workflows become easier

```r
trees %>%
  split(.$Provincia) %>%                     # one data frame per province
  map(~ lm(HeiIf3 ~ DiamIf3, data = .)) %>%  # fit one model per province
  map_dfr(~ broom::tidy(.))                  # bind the coefficient tables by row
```

```
##          term  estimate   std.error statistic p.value
## 1 (Intercept) 5.0440093 0.046359811 108.80134       0
## 2     DiamIf3 0.2754189 0.001917213 143.65588       0
## 3 (Intercept) 4.7398904 0.052552661  90.19316       0
## 4     DiamIf3 0.2896370 0.002110413 137.24189       0
## 5 (Intercept) 5.5570947 0.040415357 137.49958       0
## 6     DiamIf3 0.2343659 0.001467420 159.71291       0
## 7 (Intercept) 3.7542965 0.061363682  61.18108       0
## 8     DiamIf3 0.2379400 0.002468751  96.38074       0
```

---
class: code50

# walk

## Side-effects

Sometimes we want to perform a pipe step only for its side effect (printing, plotting...) but continue with the pipe afterwards.
We could write an intermediate step, or we can use `walk`:

```r
list.files('data', '.csv', full.names = TRUE) %>%
  walk(print) %>%                                # print each path, pass the input on
  map_dfr(read_csv, col_types = 'ccccc???????')  # read and row-bind all files
```

```
## [1] "data/trees_Barcelona.csv"
## [1] "data/trees_Girona.csv"
## [1] "data/trees_Lleida.csv"
## [1] "data/trees_Tarragona.csv"
```

```
## # A tibble: 111,756 x 12
##    Codi   Provincia Cla   Subclase Especie Rumbo  Dist   Fac    CD DiamIf3 DiamIf2 HeiIf3
##    <chr>  <chr>     <chr> <chr>    <chr>   <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>  <dbl>
##  1 080001 08        A     1        022         7   8.3  31.8    20    20.3    18.9      9
##  2 080002 08        A     1        476        38   9.1  31.8    35      34    32.4      9
##  3 080003 08        A     1        021        25     7  31.8    25    24.8    17.6     11
##  4 080004 08        A     1        021        28  8.89  31.8    15    16.8    12.6    9.5
##  5 080006 08        A     1        021        19  11.2  14.1    35    34.0    30.9     13
##  6 080007 08        A     1        021        32    12  14.1    35    33.1    28.2     10
##  7 080008 08        A     1        243        40   7.8  31.8    15      15    13.2      6
##  8 080009 08        A     1        045        16  5.09  31.8    20    17.5    15.3      7
##  9 080010 08        A     1        243        47  26.9  5.09    65    67.4    66.8   16.5
## 10 080013 08        A     1        022        44   2.7  127.    15    15.1    12.6    9.5
## # … with 111,746 more rows
```

---

# purrr

## Exercise 11

In this exercise we will use some of the knowledge acquired in this workshop.

11.1 Let's see if there is a relationship between leaf biomass (in the `leaf` dataset) and tree height (in the `trees` dataset), for each province at the plot level. For that you will need to summarise tree values, join with the leaf data, split by province and fit the model for each province.

---
layout: false
class: thanks clear middle

.font160[Thank you!]

<br>
@multivac42
https://github.com/ameztegui/
aitor.ameztegui@eagrof.udl.cat
@MalditoBarbudo
https://github.com/malditobarbudo/
v.granda@creaf.uab.cat

<br>

Presentation repository:
https://github.com/ameztegui/tidyverse_workshop