Process raw data into packaged, analysis-ready data sets, reproducibly.
A data set may consist of multiple sources of raw data. These need to be cleaned, standardized, QCd, and any number of other manipulations applied before they are ready for analysis.
This package is designed to simplify and centralize tasks associated with tidying a variety of data sets associated with a single project.
It provides a mechanism to collect, organize and version the data munging scripts used to process incoming data into analytical data sets. The package runs these scripts to perform the data munging writes out analytical data sets, which are combined with documentation, and built into a new R package. The package and the included data are versioned. If updated data arrive, the package can be rebuilt, and the data version incremented to reflect the changes. If the data munging code is in the form of Rmarkdown documents, these are processed automatically into package vignettes that are included in the final pacakge.
No other restrictions are placed on the data munging code.
Set up a new data package. Assume we have data munging code in MungeDataset1.Rmd
, and MungeDatast2.Rmd
, and each of these produce R objects dataset1
and dataset2
.
library(DataPackageR)
setwd("/tmp")
DataPackageR::datapackage.skeleton("MyNewStudy",force=TRUE,code_files = c("/tmp/MungeDataset1.Rmd","/tmp/MungeDataset2.Rmd"),r_object_names = c("dataset1","dataset2"))
Creating directories ...
Creating DESCRIPTION ...
Creating NAMESPACE ...
Creating Read-and-delete-me ...
Saving functions and data ...
Making help files ...
Done.
Further steps are described in './MyNewStudy/Read-and-delete-me'.
Adding DataVersion string to DESCRIPTION
Creating data and data-raw directories
The above code creates a directory “MyNewStudy” with the skeleton of a data package.
The DESCRIPTION
file should be filled out to describe your package. A new DataVersion
string now appears in that file. The revision is automatically incremented if the package data changes.
Read-and-delete-me
has some helpful instructions on how to proceed.
The data-raw
directory is where the data cleaning code (Rmd
) files reside. The contents of this directory are:
MyNewStudy/data-raw
└── datasets.R
└── MungeDataset1.Rmd
└── MungeDataset2.Rmd
datasets.R
can be edited as necessary (see below). This “master” file sources your data munging scripts. Data munging scripts can read data from anywhere, but it is good practice to have your “raw” data live under /inst/extdata
. It should be copied into that path and the data munging scripts edited appropriately.
Here are the contents on datasets.R
:
pkgName <- roxygen2:::read.description("../DESCRIPTION")$Package
# ------------------------------------------------------------
# Source additional R scripts to preprocess assay data
library(rmarkdown)
render('MungeDataset1.Rmd', envir=topenv(), output_dir='../inst/extdata/Logfiles', clean=FALSE)
render('MungeDataset2.Rmd', envir=topenv(), output_dir='../inst/extdata/Logfiles', clean=FALSE)
# for a systematically-named sequence of scripts, one could do something like this:
# for(fn in list.files(path="./", pattern="^preprocess_.*\\.Rmd$")){
# render(fn, envir=topenv(),output_dir="../inst/extdata/Logfiles",clean=FALSE)
# }
# Or a full path to each Rmd file can be passed to datapacakge.skeleton via code_files.
# ------------------------------------------------------------
# Define data objects to keep in the package
# (defining here because the list is useful when building roxygen documentation)
objectsToKeep <- c('dataset1', 'dataset2', 'etc.') # if it's a collection of unsystematically-named objects
# objectsToKeep <- ls(pattern=pkgName) # if you can define a rule that describes the naming of objects to be available in the package
# Or these can be passed into datapackage.skeleton via the r_object_names parameter
# ------------------------------------------------------------
# Auto build roxygen documentation
# On first build, we generate boilerplate roxygen documentation using DataPackageR:::.autoDoc()
# User then manually edits the output file edit_and_rename_to_'documentation.R'.R and renames it to documentation.R.
# The documentation.R file is then used for all subsequent builds.
if(file.exists("documentation.R")){
sys.source('documentation.R', envir=topenv())
} else {
DataPackageR:::.autoDoc(pkgName, objectsToKeep, topenv())
}
# keep only objects labeled for retention
keepDataObjects(objectsToKeep)
We look at this piece by piece.
First, we load the rmarkdown package and then render the user-provided data processing code MungeDataset1.Rmd
, and MungeDataset2.Rmd
.
r_object_names
of datapackage.skeleton
.The product of this particular script will be an html document that serves as a log of how the data were processed.
vignette
in the final package.The most important product of processing script is one or more R objects.
keepDataObjects()
tells the build process which objects should be retained and stored as part of the data package.dataset1
and dataset2
.keepDataObjects('dataset1','dataset2')
tells the build process the name of the object to store in the package.All this is taken care of via arguments to datapackage.skeleton
.
data
. The build process will handle this for you.If everything is in order, they will be included in the built package.
There is a call to .autoDoc
, which generates documentation for the package and the objects on the first run of the build.
It produces a file that the user needs to rename and edit by hand.
The contents of this file are roxygen blocks that are parsed into object and package documentation.
Once your scripts are in place and the data objects are documented, you build the package.
To run the build process:
# Within the package directory
DataPackageR:::buildDataSetPackage(".") #note for a first build this needs to be run twice and the
#documentation edited.
If there are errors, the script will notify you of any problems.
If everything goes smoothly, you will have a new package built in the parent directory.
This can be distributed, installed using R CMD INSTALL
, and data sets loaded using R’s standard data()
call. Vignettes can be interrogated via vignette(package="mypackage")
The DataPackageR package calculates an md5 checksum of each data object it stores, and keeps track of them in a file called DATADIGEST
.
DataVersion
string has been incremented in the DESCRIPTION
file.Your downstream data analysis can depend on a specific version of your data package (for example by tesing the packageVersion()
string);
if(packageVersion("MyNewStudy") != "1.0.0")
stop("The expected version of MyNewStudy is 1.0.0, but ",packageVersion("MyNewStudy")," is installed! Analysis results may differ!")
The DataPackageR packge also provides datasetVersion()
to extract the data set version information.
You should also place the data package source directory under git
version control. This allows you to version control your data processing code.