vignettes/example-packages.Rmd
This vignette explores R package download trends using the cranlogs
package.
Write the code files to your workspace.
drake_example("packages")
The new packages folder now includes the file structure of a serious drake project, plus an interactive-tutorial.R script to narrate the example. The code is also online here.
This small data analysis project explores some trends in R package downloads over time. The datasets are downloaded using the cranlogs package.
library(cranlogs)
cran_downloads(packages = "dplyr", when = "last-week")
## date count package
## 1 2018-03-09 12965 dplyr
## 2 2018-03-10 7093 dplyr
## 3 2018-03-11 6941 dplyr
## 4 2018-03-12 15044 dplyr
## 5 2018-03-13 15207 dplyr
## 6 2018-03-14 15459 dplyr
## 7 2018-03-15 15344 dplyr
Above, each count is the number of times dplyr was downloaded from the RStudio CRAN mirror on the given day. To stay up to date with the latest download statistics, we need to refresh the data frequently. With drake, we can bring all our work up to date without restarting everything from scratch.
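To make the structure of this output concrete, the sketch below rebuilds the data frame printed above by hand (so it runs without network access) and computes two simple summaries in base R. The names `last_week`, `weekly_mean`, and `lowest_days` are illustrative and not part of the original example.

```r
# Rebuild the output shown above by hand; no download is performed here.
last_week <- data.frame(
  date = as.Date("2018-03-09") + 0:6,
  count = c(12965, 7093, 6941, 15044, 15207, 15459, 15344),
  package = "dplyr",
  stringsAsFactors = FALSE
)

# Average daily downloads over the week.
weekly_mean <- mean(last_week$count)

# Dates of the two smallest daily counts.
lowest_days <- last_week$date[order(last_week$count)[1:2]]
```

Each row is one day for one package, which is why the later summaries group by the `package` column.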
First, we load the required packages. Drake knows about the packages you install and load.
library(drake)
library(cranlogs)
library(ggplot2)
library(knitr)
library(plyr)
We want to explore the daily downloads from these packages.
package_list <- c(
  "knitr",
  "Rcpp",
  "ggplot2"
)
We plan to use the cranlogs package. The data frames older and recent will contain the number of daily downloads for each package from the RStudio CRAN mirror.
data_plan <- drake_plan(
  recent = cran_downloads(packages = package_list, when = "last-month"),
  older = cran_downloads(
    packages = package_list,
    from = "2016-11-01",
    to = "2016-12-01"
  ),
  strings_in_dots = "literals"
)
data_plan
## # A tibble: 2 x 2
## target command
## <chr> <chr>
## 1 recent "cran_downloads(packages = package_list, when = \"last-month\")"
## 2 older "cran_downloads(packages = package_list, from = \"2016-11-01\", …
We want to summarize each set of download statistics in a couple of different ways.
output_types <- drake_plan(
  averages = make_my_table(dataset__),
  plot = make_my_plot(dataset__)
)
output_types
## # A tibble: 2 x 2
## target command
## <chr> <chr>
## 1 averages make_my_table(dataset__)
## 2 plot make_my_plot(dataset__)
We need to define functions to summarize and plot the data.
# Compute the mean daily download count for each package.
make_my_table <- function(downloads){
  ddply(downloads, "package", function(package_downloads){
    data.frame(mean_downloads = mean(package_downloads$count))
  })
}
# Plot daily downloads over time, one colored line per package.
make_my_plot <- function(downloads){
  ggplot(downloads) +
    geom_line(aes(x = date, y = count, group = package, color = package))
}
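For readers unfamiliar with plyr, the following sketch shows the kind of per-package average that make_my_table() computes, using base R's `aggregate()` on a small made-up data frame. The toy data and the names `toy_downloads` and `toy_means` are assumptions for illustration; the vignette itself uses ddply().

```r
# Made-up download counts for two packages (illustration only).
toy_downloads <- data.frame(
  package = c("knitr", "knitr", "Rcpp", "Rcpp"),
  count = c(10, 20, 30, 50),
  stringsAsFactors = FALSE
)

# Mean count per package, analogous to the ddply() call in make_my_table().
toy_means <- aggregate(count ~ package, data = toy_downloads, FUN = mean)

# Rename the summary column to match the name used in make_my_table().
names(toy_means)[names(toy_means) == "count"] <- "mean_downloads"
toy_means
```

The result has one row per package, the same shape as the averages tables produced later in the workflow.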
Below, the targets recent and older each take turns substituting the dataset__ wildcard. Thus, output_plan has four rows.
output_plan <- plan_analyses(
  plan = output_types,
  datasets = data_plan
)
output_plan
## # A tibble: 4 x 2
## target command
## <chr> <chr>
## 1 averages_recent make_my_table(recent)
## 2 averages_older make_my_table(older)
## 3 plot_recent make_my_plot(recent)
## 4 plot_older make_my_plot(older)
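The wildcard substitution can be imitated in a few lines of base R. The sketch below is only an illustration of the idea, not how plan_analyses() is implemented; the helper `expand_wildcard()` and its arguments are hypothetical.

```r
# Hypothetical helper: substitute each dataset name for the dataset__
# wildcard in every command and build matching target names.
expand_wildcard <- function(targets, commands, datasets) {
  # One row per (dataset, analysis) pair; dataset varies fastest.
  grid <- expand.grid(
    dataset = datasets,
    i = seq_along(targets),
    stringsAsFactors = FALSE
  )
  data.frame(
    target = paste(targets[grid$i], grid$dataset, sep = "_"),
    command = mapply(
      function(i, dataset) {
        gsub("dataset__", dataset, commands[i], fixed = TRUE)
      },
      grid$i, grid$dataset,
      USE.NAMES = FALSE
    ),
    stringsAsFactors = FALSE
  )
}

expanded <- expand_wildcard(
  targets = c("averages", "plot"),
  commands = c("make_my_table(dataset__)", "make_my_plot(dataset__)"),
  datasets = c("recent", "older")
)
expanded
```

Two analyses crossed with two datasets give four rows, which is where the four rows of output_plan come from.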
We plan to weave the results together in a dynamic knitr report.
report_plan <- drake_plan(
  knit(knitr_in("report.Rmd"), file_out("report.md"), quiet = TRUE)
)
report_plan
## # A tibble: 1 x 2
## target command
## <chr> <chr>
## 1 "\"report.md\"" "knit(knitr_in(\"report.Rmd\"), file_out(\"report.md\")…
Because of the mention of knitr_in() above, make() will look for dependencies inside report.Rmd (targets mentioned with loadd() or readd() in active code chunks). That way, whenever a dependency changes, drake will rebuild report.md when you call make(). For that to happen, report.Rmd must exist before the call to make(). For this example, you can find report.Rmd here.
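As a rough illustration of what scanning a report for loadd() and readd() calls involves, here is a minimal base R sketch. It is not drake's actual implementation: the helper `find_report_targets()` is hypothetical, it only catches the first call on each line, and it does not check whether a line sits inside an active code chunk.

```r
# Hypothetical helper, not drake's implementation: find symbols passed as
# the first argument to loadd() or readd() in the lines of a report.
find_report_targets <- function(lines) {
  pattern <- "(loadd|readd)\\([[:space:]]*([A-Za-z.][A-Za-z0-9._]*)"
  hits <- regmatches(lines, regexec(pattern, lines))
  # Lines without a match yield zero-length vectors; drop them.
  hits <- Filter(length, hits)
  # Element 3 of each hit is the second capture group (the target name).
  targets <- vapply(
    hits,
    function(hit) hit[3],
    character(1),
    USE.NAMES = FALSE
  )
  unique(targets)
}

# Made-up report lines for illustration.
report_lines <- c(
  "Some narrative text.",
  "readd(averages_recent)",
  "loadd(plot_older)",
  "print(1)"
)
found_targets <- find_report_targets(report_lines)
found_targets
```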
Now, we complete the workflow plan data frame by concatenating the results together. Drake analyzes the plan to figure out the dependency network, so row order does not matter.
whole_plan <- rbind(
  data_plan,
  output_plan,
  report_plan
)
whole_plan
## # A tibble: 7 x 2
## target command
## <chr> <chr>
## 1 recent "cran_downloads(packages = package_list, when = \"last-…
## 2 older "cran_downloads(packages = package_list, from = \"2016-…
## 3 averages_recent make_my_table(recent)
## 4 averages_older make_my_table(older)
## 5 plot_recent make_my_plot(recent)
## 6 plot_older make_my_plot(older)
## 7 "\"report.md\"" "knit(knitr_in(\"report.Rmd\"), file_out(\"report.md\")…
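To see why row order does not matter, consider this simplified sketch of dependency detection in base R. It is an approximation for illustration and not drake's real code analysis: the toy plan is deliberately listed out of order, dependencies are found with `all.vars()`, and a build order is derived from them. The names `toy_plan`, `toy_deps`, and `build_order` are made up, and the loop assumes there are no circular dependencies.

```r
# Illustrative plan with rows deliberately out of order.
toy_plan <- data.frame(
  target = c("plot_recent", "averages_recent", "recent"),
  command = c(
    "make_my_plot(recent)",
    "make_my_table(recent)",
    "cran_downloads(packages = package_list, when = \"last-month\")"
  ),
  stringsAsFactors = FALSE
)

# For each command, keep the symbols that are also targets in the plan.
toy_deps <- lapply(toy_plan$command, function(command) {
  symbols <- all.vars(parse(text = command))
  intersect(symbols, toy_plan$target)
})
names(toy_deps) <- toy_plan$target

# Repeatedly pick targets whose dependencies are already in the build order.
# Assumes no circular dependencies; a cycle would make this loop run forever.
build_order <- character(0)
while (length(build_order) < nrow(toy_plan)) {
  ready <- vapply(
    toy_deps,
    function(deps) all(deps %in% build_order),
    logical(1)
  )
  build_order <- c(build_order, setdiff(names(toy_deps)[ready], build_order))
}
build_order
```

Even though recent is the last row of the toy plan, it comes first in the build order because the other two commands mention it.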
The latest download data need to be refreshed every day, so we use triggers to force recent to always build. For more on triggers, see the vignette on debugging and testing. Instead of triggers, we could have made recent a global variable like package_list rather than a formal target in whole_plan.
whole_plan$trigger <- "any" # default trigger
whole_plan$trigger[whole_plan$target == "recent"] <- "always"
whole_plan
## # A tibble: 7 x 3
## target command trigger
## <chr> <chr> <chr>
## 1 recent "cran_downloads(packages = package_list, when =… always
## 2 older "cran_downloads(packages = package_list, from =… any
## 3 averages_recent make_my_table(recent) any
## 4 averages_older make_my_table(older) any
## 5 plot_recent make_my_plot(recent) any
## 6 plot_older make_my_plot(older) any
## 7 "\"report.md\"" "knit(knitr_in(\"report.Rmd\"), file_out(\"repo… any
Now, we run the project to download the data and analyze it. The results will be summarized in the knitted report, report.md, but you can also read the results directly from the cache.
make(whole_plan)
## Loading required package: methods
## target older
## target recent: trigger "always"
## target averages_older
## target averages_recent
## target plot_older
## target plot_recent
## target file "report.md"
## Used non-default triggers. Some targets may be not be up to date.
readd(averages_recent)
## package mean_downloads
## 1 ggplot2 15114.53
## 2 knitr 10463.70
## 3 Rcpp 19534.67
readd(averages_older)
## package mean_downloads
## 1 ggplot2 14641.29
## 2 knitr 9068.71
## 3 Rcpp 14408.06
readd(plot_recent)
readd(plot_older)
Because we used triggers, each make() rebuilds the recent target to get the latest download numbers for today. If the newly-downloaded data are the same as last time and nothing else changes, drake skips all the other targets.
make(whole_plan)
## Unloading targets from environment:
## averages_recent
## plot_older
## plot_recent
## averages_older
## target recent: trigger "always"
## Used non-default triggers. Some targets may be not be up to date.
To visualize the build behavior, plot the dependency network. Target recent and everything depending on it are always out of date because of the "always" trigger. If you rerun the project tomorrow, the recent dataset will have shifted one day forward, so make() will refresh averages_recent, plot_recent, and report.md. Targets averages_older and plot_older should be unaffected, so drake will skip them.
config <- drake_config(whole_plan)
vis_drake_graph(config)
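The idea that a change in recent propagates to everything downstream can be sketched without drake. In the base R example below, the edge list is written out by hand and the helper `downstream()` is hypothetical. The edges into report.md are an assumption, since they depend on which targets report.Rmd actually loads.

```r
# Edges of the workflow, written out by hand (from -> to).
# The two edges into report.md are assumed for illustration.
edges <- data.frame(
  from = c("recent", "recent", "older", "older",
           "averages_recent", "plot_recent"),
  to = c("averages_recent", "plot_recent", "averages_older", "plot_older",
         "report.md", "report.md"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collect every node reachable from a starting node.
downstream <- function(start, edges) {
  found <- character(0)
  frontier <- start
  while (length(frontier) > 0) {
    # Direct successors of the frontier that have not been seen yet.
    nxt <- setdiff(edges$to[edges$from %in% frontier], found)
    found <- c(found, nxt)
    frontier <- nxt
  }
  found
}

affected <- downstream("recent", edges)
affected
```

Under these assumed edges, averages_older and plot_older are not reachable from recent, which matches the expectation that they are skipped.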
When you rely on data from the internet, you should trigger a new download when the data change remotely. This section of the best practices guide explains how to automatically refresh the data when the online timestamp changes.