library(tidyverse)
library(skimr)
library(plotly)
library(moderndive)
library(broom)
Quickstart demo
Outline
1. Make a data folder
2. Drag fav.csv into the data folder
3. Make the existing folder an RStudio project
4. Open an R Markdown Notebook
5. library(tidyverse) plus other libraries
6. IMPORT data
   - See also: RStudio data import wizard
7. ATTACH data
8. EDA: Visualize: ggplot(data = starwars, aes(hair_color)) + geom_bar()
9. EDA: skimr::skim(starwars)
10. EDA: summary(fav_rating)
11. left_join(starwars, fivethirtyeight)
12. Transform data: five dplyr verbs …
    - filter, select, arrange, mutate
    - count / group_by & summarize
13. Interactive visualization: ggplotly
14. Linear regression / models (quick syntax introduction)
15. Reports: notebooks, slides, dashboards, Word documents, PDF, books, etc.
5. library(tidyverse) plus other libraries
6. read_csv("file_name.csv")
See also the RStudio data import wizard.
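The import call in this section passes skip = 11 to read_csv(). As a minimal sketch of what skip does, the example below writes a small file with two note lines ahead of the header and then skips them. The file contents, column values, and the object names demo_path and demo are made up for illustration.

```r
library(readr)

# Hypothetical file: two note lines, then the real header and data
demo_path <- tempfile(fileext = ".csv")
writeLines(
  c("exported from a hypothetical survey tool",
    "second note line",
    "name,fav_rating",
    "Yoda,10",
    "Chewbacca,8"),
  demo_path
)

# skip = 2 drops the two note lines, so the third line becomes the header
demo <- read_csv(demo_path, skip = 2)
demo
```

The result should have two rows and the columns name and fav_rating; without skip, the first note line would be read as the header.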
## fav_data <- read_csv("data/fav.csv")
favorability <- read_csv("https://raw.githubusercontent.com/libjohn/intro2r-code/master/data/538_favorability_popularity.csv", skip = 11)
7. attached on-board data
- dplyr::starwars
dplyr::starwars
data("starwars")
8. Quick visualization
Visualize with the ggplot2 library.
plot <- ggplot(data = starwars,
aes(x = hair_color)) +
geom_bar()
plot
One improvement
Arrange bars by frequency using forcats::fct_infreq()
plot1 <- ggplot(starwars,
aes(fct_infreq(hair_color))) +
geom_bar()
plot1
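To see what fct_infreq() changes, here is a small sketch on a made-up factor (the hair vector is hypothetical, not taken from starwars). It only reorders the factor levels, which is what ggplot2 uses to decide bar order.

```r
library(forcats)

# Hypothetical values: brown appears 3 times, black 2, none 1
hair <- factor(c("brown", "black", "brown", "none", "brown", "black"))

levels(hair)              # default order is alphabetical
levels(fct_infreq(hair))  # reordered from most to least frequent
```

The underlying values are unchanged; only the level order differs, so the most common category is drawn first.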
9. skimr::skim(starwars)
The skimr library presents summary EDA results using the skim() function.
skim(starwars)
| Name | starwars |
|---|---|
| Number of rows | 87 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 8 |
| list | 3 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1.00 | 3 | 21 | 0 | 87 | 0 |
| hair_color | 5 | 0.94 | 4 | 13 | 0 | 12 | 0 |
| skin_color | 0 | 1.00 | 3 | 19 | 0 | 31 | 0 |
| eye_color | 0 | 1.00 | 3 | 13 | 0 | 15 | 0 |
| sex | 4 | 0.95 | 4 | 14 | 0 | 4 | 0 |
| gender | 4 | 0.95 | 8 | 9 | 0 | 2 | 0 |
| homeworld | 10 | 0.89 | 4 | 14 | 0 | 48 | 0 |
| species | 4 | 0.95 | 3 | 14 | 0 | 37 | 0 |
Variable type: list
| skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
|---|---|---|---|---|---|
| films | 0 | 1 | 24 | 1 | 7 |
| vehicles | 0 | 1 | 11 | 0 | 2 |
| starships | 0 | 1 | 17 | 0 | 5 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| height | 6 | 0.93 | 174.36 | 34.77 | 66 | 167.0 | 180 | 191.0 | 264 | ▁▁▇▅▁ |
| mass | 28 | 0.68 | 97.31 | 169.46 | 15 | 55.6 | 79 | 84.5 | 1358 | ▇▁▁▁▁ |
| birth_year | 44 | 0.49 | 87.57 | 154.69 | 8 | 35.0 | 52 | 72.0 | 896 | ▇▁▁▁▁ |
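The n_missing and complete_rate columns above can be read as a count of NA values and the share of non-missing values. A rough base R sketch of the same two quantities, on a made-up vector (x is hypothetical):

```r
# Hypothetical measurements with two missing values
x <- c(172, NA, 96, 167, NA)

n_missing <- sum(is.na(x))        # how many values are NA
complete_rate <- mean(!is.na(x))  # fraction of values that are present

n_missing
complete_rate
```

This is only meant to show how to interpret the columns, not how skimr computes them internally.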
10. summary
summary(favorability)
     name             fav_rating
 Length:14          Min.   :110.0
 Class :character   1st Qu.:148.5
 Mode  :character   Median :392.0
                    Mean   :369.0
                    3rd Qu.:559.5
                    Max.   :610.0
11. left_join(starwars, fivethirtyeight)
Joins or merges are part of the dplyr library.
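As a minimal sketch of how left_join() behaves, the example below joins two tiny made-up tables (characters, ratings, and the demo_rating values are hypothetical). Every row of the left table is kept, and rows with no match on the right get NA.

```r
library(dplyr)

# Hypothetical left table: three characters
characters <- tibble(
  name    = c("Yoda", "Chewbacca", "R2-D2"),
  species = c("Yoda's species", "Wookiee", "Droid")
)

# Hypothetical right table: ratings for only two of them
ratings <- tibble(
  name        = c("Yoda", "R2-D2"),
  demo_rating = c(5, 4)
)

# All three characters survive; Chewbacca has no rating, so demo_rating is NA
joined <- left_join(characters, ratings, by = "name")
joined
```

The same thing happens when starwars is joined to a shorter ratings table: characters without a rating stay in the result with a missing value.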
starwars %>%
  left_join(favorability, by = "name") %>%
  select(name, fav_rating, everything()) %>%
  arrange(-fav_rating)
12. Transform data:
From the dplyr library, use the five verbs …
select to subset data by columns
starwars %>%
  select(name, gender, hair_color)
filter to subset data by rows
starwars %>%
  filter(gender == "feminine")
arrange to sort data
starwars %>%
  arrange(desc(height), desc(name))
mutate to add a new variable or transform an existing one
starwars %>%
  drop_na(mass) %>%
  select(name, mass) %>%
  mutate(big_mass = mass * 2)
count / group_by & summarize
Subtotals of variables
starwars %>%
  count(gender)
Variable totals (and also, but not here, calculations)
starwars %>%
  drop_na(mass) %>%
  summarise(sum(mass))
Variable subtotals and calculations
group_by(gender, species) %>% summarise(mean_height = mean(height), total = n())
starwars %>%
  drop_na(height) %>%
  group_by(gender, species) %>%
  summarise(mean_height = mean(height), total = n()) %>%
  arrange(desc(total)) %>%
  drop_na(species) %>%
  filter(total > 1) %>%
  select(species, gender, total, everything())
13. Interactive visualization
from the plotly library
ggplotly(plot1)
14. Regression / models
Predict mass from height after removing Jabba, the extreme mass outlier, from the data set with filter(mass < 500). This section uses mostly base R, with moderndive for model outputs, the tidyverse pipe %>%, and dplyr for data transformations. The broom library is an alternative for working with model objects.
model <- lm(mass ~ height, data = starwars %>% filter(mass < 500))
model
Call:
lm(formula = mass ~ height, data = starwars %>% filter(mass <
500))
Coefficients:
(Intercept)       height
   -32.5408       0.6214
summary(model)
Call:
lm(formula = mass ~ height, data = starwars %>% filter(mass <
500))
Residuals:
    Min      1Q  Median      3Q     Max
-39.382  -8.212   0.211   3.846  57.327
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -32.54076   12.56053  -2.591   0.0122 *
height        0.62136    0.07073   8.785 4.02e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.14 on 56 degrees of freedom
Multiple R-squared: 0.5795, Adjusted R-squared: 0.572
F-statistic: 77.18 on 1 and 56 DF, p-value: 4.018e-12
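To make the coefficients concrete, here is a small sketch that plugs them into the fitted line, mass = intercept + slope * height. The two numbers are copied from the summary output above; the height of 180 is just an illustrative value.

```r
# Coefficients copied from the summary(model) output above
intercept <- -32.54076
slope     <- 0.62136

# Illustrative height in the same units as starwars$height
example_height <- 180

# Predicted mass on the fitted line
predicted_mass <- intercept + slope * example_height
predicted_mass
```

With the model object available, predict(model, newdata = data.frame(height = 180)) should give essentially the same value, up to rounding of the printed coefficients.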
A nice explanation of basic regression can be found in chapter 5 of the book Statistical Inference via Data Science. You can also use the moderndive package to access helpful functions such as get_correlation(), get_regression_table(), etc.
You may also appreciate or prefer the broom package for the very nice tidy(), glance(), and augment() functions.
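A minimal sketch of the three broom functions on the same model, assuming dplyr and broom are installed. The object names coef_table, fit_stats, and per_row are made up for this example.

```r
library(dplyr)
library(broom)

# Same model as above: mass predicted from height, Jabba filtered out
model <- lm(mass ~ height, data = starwars %>% filter(mass < 500))

coef_table <- tidy(model)     # one row per term: estimate, std.error, statistic, p.value
fit_stats  <- glance(model)   # one row of model-level statistics, e.g. r.squared
per_row    <- augment(model)  # one row per observation used, with .fitted and .resid

coef_table
fit_stats
head(per_row)
```

Each function returns a tibble, so the results can be passed straight into dplyr verbs or ggplot2.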
starwars %>%
  filter(mass < 500) %>%
  get_correlation(mass ~ height)
# tidy(model)
get_regression_table(model)
# broom::glance(model)
get_regression_summaries(model)
# broom::augment(model)
get_regression_points(model)
Visualize regression
Mass over height, with a fitted linear regression line and confidence interval drawn by geom_smooth().
starwars %>%
  filter(mass < 500) %>%
  ggplot(aes(height, mass)) +
  geom_jitter() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

15. Render reports
By changing the argument in the YAML header, you can render many report styles. Popular examples include HTML, PDF, MS Word, and PowerPoint documents; websites; slide-deck presentations; books; and interactive documents. See more in the comprehensive guide to report outputs via Quarto.
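As a rough sketch of switching output styles from R, the example below writes a tiny R Markdown file to a temporary location and renders it in two formats with rmarkdown::render(). It assumes the rmarkdown package and pandoc are available; the file contents and the names rmd_path, out_html, and out_word are made up for illustration.

```r
library(rmarkdown)

# Hypothetical minimal report with a YAML header that asks for HTML
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(
  c("---",
    "title: \"Demo report\"",
    "output: html_document",
    "---",
    "",
    "Hello from a rendered report."),
  rmd_path
)

# Rendering needs pandoc, so only attempt it when pandoc can be found
if (pandoc_available()) {
  out_html <- render(rmd_path, output_format = "html_document", quiet = TRUE)
  out_word <- render(rmd_path, output_format = "word_document", quiet = TRUE)
}
```

Editing the output line in the YAML header and re-knitting achieves the same switch interactively in RStudio.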