vignettes/articles/use_with_tidyverse.Rmd
use_with_tidyverse.Rmd
This vignette is adapted very heavily from Hadley Wickham’s incredible R for Data Science book. You should support Hadley and the work he does by buying it.
In addition to weathercan, you’ll need several packages from the tidyverse to complete the following analysis.
Your first decision that you need to make when analyzing data from weather stations across canada is to determine for which stations you’d like to query from Environment and Climate Change Canada. In this example, to keep processing time low, we will query two stations with very long records that happen to be far apart. To make that choice we can use (tidyverse)[http://tidyverse.org/] tools and the included stations
data frame in this package:
stations %>% filter(station_id %in% c(707, 4859, 6693,5397, 2315), interval == "day") %>% select(prov, station_name, station_id, start, end)
## # A tibble: 5 x 5
## prov station_name station_id start end
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 AB TABER 2315 1907 2021
## 2 BC AGASSIZ CDA 707 1889 2020
## 3 NL PORT AUX BASQUES 6693 1909 2017
## 4 ON BELLEVILLE 4859 1866 2021
## 5 QC LENNOXVILLE 5397 1888 2021
These two weather stations will be our test data for this vignette. You can broaden or expand your analysis by choosing different or more station. Our next step is to use the weather_dl()
function to load in the data.
The following will take quite some time to download as it is downloading over 100 years of daily data for 5 stations.
pancan_df <- weather_dl(station_ids = c(707, 4859, 6693,5397, 2315), interval = "day") %>% filter(year >= 1920) %>% select(station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id, TC_id, mean_temp, date)
ggplot(pancan_df, aes(x = date, y = mean_temp, colour = station_name)) + geom_point() + geom_line()
## Warning: Removed 28037 rows containing missing values (geom_point).
## Warning: Removed 8392 row(s) containing missing values (geom_path).
This is quite a large dataset.
pancan_df_nest <- pancan_df %>% group_by(station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id, TC_id) %>% nest() pancan_df_nest
## # A tibble: 5 x 10
## # Groups: station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## # TC_id [5]
## station_name station_id prov lat lon elev climate_id WMO_id TC_id data
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <lis>
## 1 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 2 BELLEVILLE 4859 ON 44.2 -77.4 76.2 6150689 <NA> <NA> <tib…
## 3 PORT AUX BA… 6693 NL 47.6 -59.2 39.7 8402975 71197 WZB <tib…
## 4 LENNOXVILLE 5397 QC 45.4 -71.8 181 7024280 71611 WQH <tib…
## 5 TABER 2315 AB 49.8 -112. 811 3036360 <NA> <NA> <tib…
Define the model
clim_model <- function(df) { lm(mean_temp ~ date, data = df) }
Run the model with the existing data
## # A tibble: 5 x 11
## # Groups: station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## # TC_id [5]
## station_name station_id prov lat lon elev climate_id WMO_id TC_id data
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <lis>
## 1 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 2 BELLEVILLE 4859 ON 44.2 -77.4 76.2 6150689 <NA> <NA> <tib…
## 3 PORT AUX BA… 6693 NL 47.6 -59.2 39.7 8402975 71197 WZB <tib…
## 4 LENNOXVILLE 5397 QC 45.4 -71.8 181 7024280 71611 WQH <tib…
## 5 TABER 2315 AB 49.8 -112. 811 3036360 <NA> <NA> <tib…
## # … with 1 more variable: model <list>
Then add the residuals to the model
pancan_df_nest <- pancan_df_nest %>% mutate(model = map(data, clim_model), resids = map2(data, model, add_residuals)) pancan_df_nest
## # A tibble: 5 x 12
## # Groups: station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## # TC_id [5]
## station_name station_id prov lat lon elev climate_id WMO_id TC_id data
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <lis>
## 1 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 2 BELLEVILLE 4859 ON 44.2 -77.4 76.2 6150689 <NA> <NA> <tib…
## 3 PORT AUX BA… 6693 NL 47.6 -59.2 39.7 8402975 71197 WZB <tib…
## 4 LENNOXVILLE 5397 QC 45.4 -71.8 181 7024280 71611 WQH <tib…
## 5 TABER 2315 AB 49.8 -112. 811 3036360 <NA> <NA> <tib…
## # … with 2 more variables: model <list>, resids <list>
We can unnest the results then plot them
unnest()
resids <- unnest(pancan_df_nest, resids) resids
## # A tibble: 184,520 x 14
## # Groups: station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## # TC_id [5]
## station_name station_id prov lat lon elev climate_id WMO_id TC_id data
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <lis>
## 1 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 2 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 3 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 4 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 5 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 6 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 7 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 8 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 9 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 10 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## # … with 184,510 more rows, and 4 more variables: model <list>,
## # mean_temp <dbl>, date <date>, resid <dbl>
ggplot(data = resids, aes(date, resid)) + geom_line(aes(group = station_name), alpha = 1 / 3) + geom_point() + geom_hline(yintercept = 0) + facet_wrap(~ station_name, ncol = 1)
## Warning: Removed 8392 row(s) containing missing values (geom_path).
## Warning: Removed 28037 rows containing missing values (geom_point).
glance_df <- pancan_df_nest %>% mutate(glance = map(model, broom::glance)) %>% unnest(glance, .drop = TRUE) %>% select(station_name, prov, r.squared, p.value, AIC)
## Warning: The `.drop` argument of `unnest()` is deprecated as of tidyr 1.0.0.
## All list-columns are now preserved.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
station_id | lat | lon | elev | climate_id | WMO_id | TC_id | station_name | prov | r.squared | p.value | AIC |
---|---|---|---|---|---|---|---|---|---|---|---|
707 | 49.24 | -121.76 | 15.0 | 1100120 | NA | NA | AGASSIZ CDA | BC | 0.0030139 | 0.0000000 | 234152.1 |
4859 | 44.15 | -77.39 | 76.2 | 6150689 | NA | NA | BELLEVILLE | ON | 0.0028008 | 0.0000000 | 269056.9 |
6693 | 47.57 | -59.15 | 39.7 | 8402975 | 71197 | WZB | PORT AUX BASQUES | NL | 0.0041296 | 0.0000000 | 178541.2 |
5397 | 45.37 | -71.82 | 181.0 | 7024280 | 71611 | WQH | LENNOXVILLE | QC | 0.0011577 | 0.0000000 | 282705.3 |
2315 | 49.79 | -112.12 | 811.0 | 3036360 | NA | NA | TABER | AB | 0.0003432 | 0.0044037 | 182693.3 |
preds <- pancan_df_nest %>% mutate(model = map(data, clim_model), preds = map2(data, model, add_predictions)) %>% unnest(preds) preds
## # A tibble: 184,520 x 15
## # Groups: station_name, station_id, prov, lat, lon, elev, climate_id, WMO_id,
## # TC_id [5]
## station_name station_id prov lat lon elev climate_id WMO_id TC_id data
## <chr> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <lis>
## 1 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 2 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 3 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 4 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 5 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 6 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 7 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 8 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 9 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## 10 AGASSIZ CDA 707 BC 49.2 -122. 15 1100120 <NA> <NA> <tib…
## # … with 184,510 more rows, and 5 more variables: model <list>, resids <list>,
## # mean_temp <dbl>, date <date>, pred <dbl>
ggplot(data = preds, aes(x = date, y = mean_temp, colour = station_name)) + geom_point() + geom_line(aes(y = pred)) + facet_wrap(~ station_name, scales = "free_y", ncol = 1)
## Warning: Removed 28037 rows containing missing values (geom_point).