Quickstart demo

Author: John R Little

Affiliation: Duke University

Published: January 6, 2023

Outline

  1. Make a data folder

  2. Drag fav.csv into the data folder

  3. Make an RStudio project from the existing folder

  4. Open an R Markdown Notebook

  5. library(tidyverse) plus other libraries

  6. IMPORT data

    • See also: the RStudio data import wizard
    • ATTACH data
  7. EDA: Visualize ggplot(data = starwars, aes(hair_color)) + geom_bar()

  8. EDA: skimr::skim(starwars)

  9. EDA: summary(fav_rating)

  10. left_join(starwars, fivethirtyeight)

  11. Transform data: five dplyr verbs …

    • filter, select, arrange, mutate
    • count / group_by & summarize
  12. Interactive visualization: ggplotly

  13. linear regression / models (quick syntax introduction)

  14. Reports: notebooks, slides, dashboards, word document, PDF, book, etc.


5. library(tidyverse) plus other libraries

library(tidyverse)
library(skimr)
library(plotly)
library(moderndive)
library(broom)

6. read_csv(file_name.csv)

See also: the RStudio data import wizard

## fav_data <- read_csv("data/fav.csv")
favorability <- read_csv("https://raw.githubusercontent.com/libjohn/intro2r-code/master/data/538_favorability_popularity.csv", skip = 11)
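
The skip argument tells read_csv() to ignore that many lines before the header row, which helps when a file opens with notes or metadata. A minimal sketch, assuming readr 2.0 or later (for the I() literal-data syntax) and a made-up two-line preamble rather than the real favorability file:

```r
library(readr)

# Hypothetical CSV text: two note lines precede the real header row.
raw_text <- paste(
  "Source: made-up example",
  "Notes: not the real favorability data",
  "name,fav_rating",
  "Alpha,10",
  "Beta,20",
  sep = "\n"
)

# skip = 2 drops the two note lines so "name,fav_rating" becomes the header.
demo <- read_csv(I(raw_text), skip = 2, show_col_types = FALSE)
demo
```

If the skip count is off by one, the column names usually look wrong right away, so printing names(demo) is a quick check.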

7. Attach on-board data

  • dplyr::starwars

dplyr::starwars

data("starwars")

8. Quick visualization

Visualize with the ggplot2 library.

plot <- ggplot(data = starwars, 
               aes(x = hair_color)) + 
  geom_bar()
plot

One improvement

Arrange bars by frequency using forcats::fct_infreq()

plot1 <- ggplot(starwars, 
                aes(fct_infreq(hair_color))) + 
  geom_bar()
plot1
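
To see what fct_infreq() changes, it can help to look at factor levels directly, since ggplot2 draws discrete bars in level order. A small sketch with made-up hair colors (not the starwars values):

```r
library(forcats)

# Made-up values: "brown" appears three times, "black" twice, "blond" once.
hair <- factor(c("blond", "brown", "black", "brown", "black", "brown"))

# Default factor levels are alphabetical.
levels(hair)

# fct_infreq() reorders the levels so the most frequent comes first.
levels(fct_infreq(hair))
```

Wrapping the result in fct_rev() would reverse the order, which is handy for horizontal bar charts.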

9. skimr::skim(starwars)

The skimr library presents summary EDA results using the skim() function.

skim(starwars)
Data summary
Name                     starwars
Number of rows           87
Number of columns        14
_______________________
Column type frequency:
  character              8
  list                   3
  numeric                3
________________________
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
name                   0           1.00    3   21      0        87           0
hair_color             5           0.94    4   13      0        12           0
skin_color             0           1.00    3   19      0        31           0
eye_color              0           1.00    3   13      0        15           0
sex                    4           0.95    4   14      0         4           0
gender                 4           0.95    8    9      0         2           0
homeworld             10           0.89    4   14      0        48           0
species                4           0.95    3   14      0        37           0

Variable type: list

skim_variable  n_missing  complete_rate  n_unique  min_length  max_length
films                  0              1        24           1           7
vehicles               0              1        11           0           2
starships              0              1        17           0           5

Variable type: numeric

skim_variable  n_missing  complete_rate    mean      sd  p0    p25  p50    p75  p100  hist
height                 6           0.93  174.36   34.77  66  167.0  180  191.0   264  ▁▁▇▅▁
mass                  28           0.68   97.31  169.46  15   55.6   79   84.5  1358  ▇▁▁▁▁
birth_year            44           0.49   87.57  154.69   8   35.0   52   72.0   896  ▇▁▁▁▁

10. summary

summary(favorability)
     name             fav_rating   
 Length:14          Min.   :110.0  
 Class :character   1st Qu.:148.5  
 Mode  :character   Median :392.0  
                    Mean   :369.0  
                    3rd Qu.:559.5  
                    Max.   :610.0  

11. left_join(starwars, fivethirtyeight)

Joins or merges are part of the dplyr library.

starwars %>% 
  left_join(favorability, by = "name") %>% 
  select(name, fav_rating, everything()) %>% 
  arrange(-fav_rating)
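
A left join keeps every row of the left-hand table and fills in NA where the right-hand table has no match. A minimal sketch with made-up tables standing in for starwars and the favorability data:

```r
library(dplyr)

# Made-up stand-ins for the two real tables.
characters <- tibble(name = c("Alpha", "Beta", "Gamma"),
                     height = c(170, 96, 202))
ratings <- tibble(name = c("Alpha", "Gamma"),
                  fav_rating = c(610, 110))

# All three characters are kept; Beta has no rating, so fav_rating is NA.
joined <- left_join(characters, ratings, by = "name")
joined

# anti_join() lists the left-hand rows that found no match.
unmatched <- anti_join(characters, ratings, by = "name")
unmatched
```

Checking the unmatched rows is a quick way to spot names that differ only by spelling between the two sources.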

12. Transform data:

From the dplyr library, use the five verbs …

select to subset data by columns

starwars %>% 
  select(name, gender, hair_color)

filter to subset data rows

starwars %>% 
  filter(gender == "feminine")

arrange to sort data

starwars %>% 
  arrange(desc(height), desc(name))

mutate to add a new variable or transform an existing one

starwars %>%
  drop_na(mass) %>% 
  select(name, mass) %>% 
  mutate(big_mass = mass * 2)

count / group_by & summarize

Subtotals of variables

starwars %>% 
  count(gender)

Variable totals (summarise can also perform other calculations, though none are shown here)

starwars %>% 
  drop_na(mass) %>% 
  summarise(sum(mass))

Variable subtotals and calculations

group_by(gender, species) %>% summarise(mean_height = mean(height), total = n())

starwars %>% 
  drop_na(height) %>% 
  group_by(gender, species) %>% 
  summarise(mean_height = mean(height), total = n()) %>% 
  arrange(desc(total)) %>%
  drop_na(species) %>%
  filter(total > 1) %>% 
  select(species, gender, total, everything())
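
Two details of the grouped summaries above may be worth seeing in isolation: count() is shorthand for group_by() plus summarise(n = n()), and na.rm = TRUE is an alternative to dropping rows with drop_na() first. A sketch on made-up data (the toy table is hypothetical):

```r
library(dplyr)

# Made-up data: two groups, with one missing height.
toy <- tibble(
  gender = c("feminine", "feminine", "masculine", "masculine", "masculine"),
  height = c(150, 170, 180, NA, 200)
)

# These two pipelines produce the same counts.
by_count <- toy %>% count(gender)
by_summary <- toy %>% group_by(gender) %>% summarise(n = n())

# na.rm = TRUE ignores the missing height when averaging,
# but n() still counts that row in the group total.
means <- toy %>%
  group_by(gender) %>%
  summarise(mean_height = mean(height, na.rm = TRUE), total = n())
means
```

Note the difference from drop_na(height): with drop_na() the masculine group would report a total of 2, while here it reports 3.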

13. Interactive visualization

Use ggplotly() from the plotly library.

ggplotly(plot1)

14. Regression / models

Predict mass from height after removing Jabba, an extreme outlier in mass, from the data set. The model itself is fit with base R's lm(); moderndive formats the model outputs, and the tidyverse supplies the pipe %>% and dplyr for data transformations. Alternatively, the broom library can be used to manipulate models.

model <- lm(mass ~ height, data = starwars %>% filter(mass < 500))
model

Call:
lm(formula = mass ~ height, data = starwars %>% filter(mass < 
    500))

Coefficients:
(Intercept)       height  
   -32.5408       0.6214  

summary(model)

Call:
lm(formula = mass ~ height, data = starwars %>% filter(mass < 
    500))

Residuals:
    Min      1Q  Median      3Q     Max 
-39.382  -8.212   0.211   3.846  57.327 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -32.54076   12.56053  -2.591   0.0122 *  
height        0.62136    0.07073   8.785 4.02e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.14 on 56 degrees of freedom
Multiple R-squared:  0.5795,    Adjusted R-squared:  0.572 
F-statistic: 77.18 on 1 and 56 DF,  p-value: 4.018e-12

A nice explanation of basic regression can be found in chapter 5 of the book Statistical Inference via Data Science. You can also use the moderndive package to access helpful functions such as get_correlation(), get_regression_table(), etc.

You may also appreciate or prefer the broom package for the very nice tidy(), glance(), and augment() functions.

starwars %>% 
  filter(mass < 500) %>% 
  get_correlation(mass ~ height)
# tidy(model)
get_regression_table(model)
# broom::glance(model)
get_regression_summaries(model)
# broom::augment(model)
get_regression_points(model)
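
For readers who prefer broom, the commented-out calls above can be run on the same model. A minimal sketch; the 180 cm character passed to predict() is a made-up input for illustration:

```r
library(dplyr)
library(broom)

# Same model as above; filter(mass < 500) also drops rows where mass is NA.
model <- lm(mass ~ height, data = starwars %>% filter(mass < 500))

# One row per coefficient: term, estimate, std.error, statistic, p.value.
coefs <- tidy(model)
coefs

# One row of model-level statistics such as r.squared.
fit_stats <- glance(model)
fit_stats

# Predicted mass for a hypothetical character who is 180 cm tall.
pred <- predict(model, newdata = tibble(height = 180))
pred
```

Because tidy() and glance() return tibbles, their results can be piped straight into dplyr verbs such as filter() or select().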

Visualize regression

Plot mass over height with a fitted linear regression line and confidence interval using geom_smooth().

starwars %>% 
  filter(mass < 500) %>%
  ggplot(aes(height, mass)) +
  geom_jitter() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

15. Render reports

By changing the argument in the YAML header, you can render many report styles. A few popular examples include HTML, PDF, MS Word, and PowerPoint documents; websites; slide-deck presentations; books; and interactivity. See more in the comprehensive guide to report outputs via Quarto.
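
As a rough illustration of the YAML header's role, the sketch below writes a tiny, hypothetical R Markdown file to a temporary location, reads its header back, and then overrides the format at render time. It assumes the rmarkdown package; Quarto documents use an analogous format field rather than output.

```r
library(rmarkdown)

# Hypothetical notebook written to a temporary file for illustration.
path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Quickstart demo\"",
  "output: html_document",
  "---",
  "",
  "Some text."
), path)

# The output field in the YAML header sets the default report format.
header <- yaml_front_matter(path)
header$output

# output_format overrides the header; rendering requires pandoc.
if (pandoc_available()) {
  render(path, output_format = "word_document", quiet = TRUE)
}
```

Editing the output line in the header and re-rendering has the same effect as passing output_format, and is the more common workflow inside RStudio.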