Over the course of working with the Neotoma Paleoecology Database (Williams et al., 2018) and the age modeling software Bacon (Blaauw and Christen, 2011) it has often become necessary to run a large number of Bacon models in an iterative process (for example the Stepps-Baconizing GitHub repository). Building and re-building a large number of age models is often complicated and can result in a large amount of code: fine-tuning results for single sites, using information from multiple runs, generating multiple data files, and R code that can span hundreds of lines. In addition, the complexities of the underlying data (for example, issues with existing chronologies and data structures within Neotoma) mean that users may unintentionally introduce errors into their workflows. Data repositories serve a unique position within the research community (Goring et al., 2018), and one service they can provide is the distillation and distribution of best-practices documents such as this resource for generating a large number of Bacon age models from Neotoma data.
This repository draws on experience gained over several projects (Dawson et al., 2016; Kujawa et al., 2016), and uses data from those projects as the basis for establishing certain prior values. We make an effort in this document to clearly explain the steps necessary to undertake such a project in a fairly straightforward manner.
The general workflow is encoded in the Rmd document, which can be rendered to an HTML file. Rmd documents knit together descriptive text and R commands (read more about RMarkdown documents). The Rmd document is intended to provide context and guide the user through some best practices for generating Bacon outputs in batch. It is also intended to operate as a stand-alone tool that generates a number of age records and outputs the PDF ghost plots, age files, and posterior draws for the cores of interest. The key steps of the workflow are:

1. Define the run settings in the settings.yaml file. See the text below for information about the settings that can be used for these runs.
2. Obtain datasets and downloads from Neotoma using the neotoma package.
3. After the first run of this workflow, modify the settings file and adjust the parameters in the parameters file generated by the run, to ensure that the Bacon runs for each core reflect the best possible age models.

When the Rmd is re-run with the settings.yaml and parameters files adjusted (see the Rmd for details), the batch processing is faster, since the script is set to run only cores that have not yet had successful runs. Because "good" chronologies are excluded from being re-run, it is possible to efficiently modify the settings for one or a few cores while leaving the rest unchanged.
This code is provided as an RMarkdown document to help guide the user. It can support extensive customization if a user chooses to make changes anywhere in the code, and individuals are welcome to make or suggest changes as they wish.
Many people working with R may choose to use RStudio. If you are using RStudio you can customize elements of this document (for example, removing this header information) and then use the knit button to compile the code into an HTML document, while generating the necessary output files (see the README file in the GitHub repository).
Users who use the R console without any other editor can move to the appropriate working directory (using setwd()) and then run the entire document using the following command:
library(rmarkdown)
render('bulk_baconizing.Rmd')
This can be shortened to rmarkdown::render('bulk_baconizing.Rmd').
Similarly, if you wish to execute the code from a console or terminal you can navigate to the working directory and execute the following:
Rscript -e "rmarkdown::render('bulk_baconizing.Rmd')"
Note that with the terminal you need to use two different quotation marks when executing: the inner and outer quotations must be different (e.g., double on the outside, single on the inside, or vice versa). Whether you customize the code or choose to run it as-is (for just the US state of Utah), these three methods of executing the code are your main options.
If at any point the instructions are not clear, or more detail is required, please feel free to contact us, either by raising a GitHub issue on this repository or by emailing me directly at goring@wisc.edu.
This document is intended to be used as a template for users, not as a solution in and of itself. The process of generating chronologies is iterative; as such, the use of this Rmd script is intended to be iterative as well, whereby you select sites, run Bacon, revise parameters and run the script again. Each iteration will involve modifying the parameters file (created in the data/params folder during your first run) and the settings.yaml file. Please be sure to check these carefully as you do so. Errors may result in long wait times, or in runs that provide no new useful information.
Because there are a number of elements to this project, the main settings have been placed in a single, easy-to-use file. The project stores these settings in the parent directory, in a file called settings.yaml. By default these settings are:
# Should the program remove any existing files in the Cores directory and start again?
clean_run: true
bacon: true
bchron: true
# What version of the run is this? This will create a new parameters file if none exists.
version: 1
# If a parameters file exists with the current version number should that file be overwritten?
reset_parameters: true
# Where will the Bacon core files go?
core_path: 'Cores'
# For any records with assigned dates, what date should be used (either 'YYYY-MM-DD' or 'today')
date: 'today'
# Core top error (if unspecified)
core_top_err: 2
# Use modern hiatus break:
modern: true
# Prior thickness
thickness: data/paleon_thick.csv
# Settlement indicators
settlement: data/expert_assessment.csv
# In some work we have assignments for accumulation rates set as defaults. Other projects may not.
accumulation: false
# How many processing cores should be used.
parallel: 3
A yaml file is a special kind of markup that uses key: value pairs, with # used to indicate a comment. We assign values to the various keys here so that we can define the behaviour of the model runs. In the following section we explain each parameter and its behaviour:
- clean_run: (default true) Should we clear all the files and parameters and start from scratch? Takes either true or false.
- version: (default 1) The use of a version key allows you to keep track of your Bacon parameters if, over the course of a project, you make changes in your choices about how you want to run your analysis. If the version number changes a new parameter file will be generated. You can assign any value to this parameter.
- reset_parameters: (default true) Bulk generation of chronologies is an iterative process. In general, default parameters will be set first, and then, over the course of running Bacon, parameters for individual cores might be changed. In the initial stages it may be worth modifying the default values in the code below and setting reset_parameters to true so that the full table is re-written. Otherwise, set the value to false so that the table is not overwritten each time the scripts are run.
- core_path: (default Cores/) Bacon needs to output files to a particular location; this setting tells Bacon where core files will be stored. You can use any directory in your working folder, but that directory must exist.
- date: (default today) In cases where files might be overwritten, the default is to overwrite but store the old file with a date-stamp. The default is to use the current date, but you may want to associate a date stamp with a particular version. In that case, change the setting from today (which will use the current date) to a date in the format YYYY-MM-DD.
- core_top_err: (default 2) The uncertainty to assign to the core top age. While in principle we often know the core top date with absolute certainty, Bacon requires an uncertainty for all age samples.
- modern: (default true) Goring et al. (2012) and others show that modern sediment often has a higher accumulation rate than earlier sediment. One solution is to place an instantaneous hiatus at an identified "settlement" horizon, and then set the accumulation rate of the upper sediment to a higher acc.mean rate. If modern is set to true, the program will place a hiatus at the modern horizon. If false, it will not.
- thickness: (default false) A thickness file, data/paleon_thick.csv, is included in the repository; however, it may be the case that in previous runs you have adjusted thickness values. For example, in version 2 of your runs you could point to the params file generated in version 1, data/params/bacon_params_v1.csv. If you modify thicknesses you must also modify the content of the chunk prior_thickness below, so that you point to the correct thickness and dataset ID columns.
- settlement: (default false) In Kujawa et al. (2016) and Dawson et al. (2016) an expert-assessment exercise was undertaken to evaluate the presence and location of biostratigraphic markers for Euro-American settlement in the Upper Midwestern United States. This file is included for reference, with columns indicating datasetid and depth. If the parameter is set to a file location, such as data/expert_assessment.csv, then any indicator of settlement will be revised; otherwise (e.g., settlement: false) the chronological controls for the records will remain as-is.
- accumulation: (default false) Similar to thickness: if modifications have been made to accumulation rates, a file can be defined here.
- parallel: (default true) Should the code try to run Bacon in parallel? This is either the value false or the maximum number of cores to use (e.g., 3), which makes it possible to run the script on fewer cores than your system has available so that you can continue working on other projects. The code in this script uses the mclapply() function; because the parallel operations rely on forking in R, they are not available on Windows systems.
Throughout your workflow you will likely change some of these settings. For example, if you are changing the search parameters for your dataset search, you may want to keep some prior information but increment the version, or set clean_run to true. However, as you begin fine-tuning your results, you will want to set clean_run back to false, and reset_parameters to false as well. This will prevent unnecessary errors and reduce the time required for the runs to complete, since the script will accept completed runs and not attempt to re-run them.
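For reference, these settings are read into R as a named list when R/setup_runs.R is sourced below. A minimal sketch of that step, assuming the yaml package, looks like this:

library(yaml)
# Read the key: value pairs of settings.yaml into a named list, so that values
# are available as settings$clean_run, settings$version, and so on.
settings <- yaml.load_file('settings.yaml')

You do not need to run this yourself; sourcing R/setup_runs.R takes care of it.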
source('R/setup_runs.R', echo=FALSE, verbose=FALSE)
There are a number of libraries used in this project, and you may be missing one or several. If this is the case you might be interested in this bash script for Mac or Linux, or in the autoinst package for R. Otherwise, take some time to look at R/setup_runs.R and ensure you install all packages listed.
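If you would rather check from within R, a short snippet along these lines will install anything that is missing. The package names here are an illustrative subset, not the full list; check R/setup_runs.R for everything the workflow loads:

# An illustrative subset of the packages used by this workflow:
pkgs <- c('neotoma', 'rbacon', 'readr', 'lubridate', 'yaml', 'rmarkdown')
# Install any of these that are not yet present:
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) {
  install.packages(missing)
}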
Here we download the sedimentary data. The key component is the get_dataset() function. The newest version of neotoma (version >= 1.7.3) allows a user to pass in a vector of dataset IDs, so you can either get datasets using a set of search parameters in the get_dataset() function (see the help for get_dataset()), or pass in a vector of dataset IDs (for example, if you have already picked a set of sites to be examined). In this example I am using a state, Utah, as my search parameter:
# Define using state names here. Could also use altitudes, locations,
# dataset types. . .
dataset_list <- get_dataset(datasettype='pollen',
gpid = 'Utah',
ageyoung = 0)
# The load_downloads() function will download the datasets if either:
# there is not currently a downloaded object saved with the same dataset version
# or the setup parameter is TRUE, which calls for the whole data file to be re-written.
downloads <- suppressWarnings(load_downloads(dataset_list,
version = settings$version,
setup = settings$clean_run))
It might be the case that you want to do further validation on samples based on parameters in the downloads object (e.g., only sites with a certain taxon). The downloads are required for the generation of the core depths file used by Bacon. If you want to further adjust the records, do so with the dataset object, not the downloads. For example, you can remove elements of a list by setting them to NULL: if you know you want to remove the dataset in the 5th position, you could assign dataset_list[[5]] <- NULL. You would then need to re-run load_downloads() and ensure that setup was TRUE in load_downloads():
# This code is not run:
dataset_list <- get_dataset(gpid = 'Michigan')
# Only accept sites with a site description:
for (i in length(dataset_list):1) {
if (is.na(dataset_list[[i]]$site$description)) {
dataset_list[[i]] <- NULL
}
}
# We would go from approximately 810 datasets to 502.
downloads <- load_downloads(dataset_list,
version = settings$version,
setup = TRUE)
Using the neotoma R package's plot_leaflet() function we can generate an interactive map of the downloaded sites that can be used to explore the data further.
plot_leaflet(dataset_list)
Click on any point to obtain more information about the site. Note that this rendering is only interactive when the document is knit to an HTML file.
The Baconizing tool records parameters in a parameter data.frame called params. This data.frame is generated and then saved to file, to ensure that errors in the Bacon run do not affect any information stored in the parameters. The data.frame includes a number of columns.
One unique element in this workflow is the use of two accumulation rates, one identified as mod and one identified as old. This comes from work by Goring et al. (2012) and Dawson et al. (2016), who identify a break in accumulation rates at or around the European settlement horizon. This treatment of modern sediments is the default; however, the modern flag in the settings.yaml file can be set to false if you do not wish to use these modern/historical settings.
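To make the mod/old distinction concrete, the sketch below shows how these values would map onto a single rbacon::Bacon() call. The handle EXAMPLE and the hiatus depth of 25 cm are invented for illustration; in the batch workflow these values are assembled from the params table described below rather than hard-coded:

library(rbacon)
# Hypothetical core handle and settlement-horizon depth, for illustration only.
# With a hiatus at 25 cm, the first acc.mean/acc.shape values apply to the
# sediment above the hiatus (modern) and the second to the sediment below (old).
Bacon(core = 'EXAMPLE',
      coredir = 'Cores',
      thick = 5,
      acc.mean = c(3.02, 15),
      acc.shape = c(0.53, 0.9),
      mem.strength = 2,
      mem.mean = 0.5,
      hiatus.depths = 25,
      suggest = FALSE,
      ask = FALSE)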
Column | Default | Purpose |
---|---|---|
handle | none | The unique text identifier for the dataset collection unit. |
datasetid | none | The unique dataset identifier. |
acc.mean.mod | 3.02 | Accumulation rate for modern sediment (above an identified settlement horizon). If the settings.yaml flag modern is set to false this will be the same as acc.mean.old. |
acc.mean.old | 15 | Accumulation rate for sediment if no horizon is identified, or if the modern flag is set to false. |
acc.shape.mod | 0.53 | Shape parameter for modern sediment (as above). |
acc.shape.old | 0.9 | Shape parameter for older sediment (as above). |
mem.strength | 2 | Memory (accumulation autocorrelation) parameter. |
mem.mean | 0.5 | Mean value for memory. |
hiatus | NA | Location of a hiatus, either set at settlement or at a known hiatus. |
thick | 5 | Section thickness (cm) to be used by Bacon. |
age.type | NA | From the dataset, the original age type for the core chronology. |
run | NA | Has the model been run by Bacon? |
suitable | NA | Is the model suitable for a Bacon run? |
ndates | NA | How many dates are used in the chronology? |
success | NA | Did Bacon run and return a result successfully? |
notes | . | Any notes associated with the run. |
Any of these parameters can be changed by altering the values in the code chunk below (in the Rmd file) before running the document. If you have already run the document using a set of parameters and decide to change defaults, you can either change the version in settings.yaml or set reset_parameters to true. Once you have run the full script and produced Bacon files, you can manually edit individual sites by opening the versioned parameters file in data/params (e.g., data/params/bacon_params_v1.csv).
After your first run of the script you can open the file at data/params/bacon_params_v1.csv to change the settings if the Bacon model appears poorly suited to the default settings. In this case you would change the appropriate setting (e.g., thick), and also change run to NA and success to NA. This will flag the record to be re-run the next time the script is executed.
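For example, here is a minimal sketch of that manual edit, done from R rather than in a spreadsheet (the handle EXAMPLE and the new thickness are invented for illustration):

library(readr)
params <- read_csv('data/params/bacon_params_v1.csv')
# Pick out the core that needs a different section thickness:
fix <- params$handle == 'EXAMPLE'
params$thick[fix] <- 10
# Resetting run and success to NA flags the record for a re-run:
params$run[fix] <- NA
params$success[fix] <- NA
write_csv(params, 'data/params/bacon_params_v1.csv')

Editing the csv directly in a spreadsheet program works just as well.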
existing_params <- file.exists(paste0('data/params/bacon_params_v', settings$version, '.csv'))
if ((!existing_params) | settings$reset_parameters == TRUE) {
message("Writing a new parameters file.")
if (existing_params) {
file.copy(paste0('data/params/bacon_params_v', settings$version, '.csv'),
paste0('data/params/bacon_params_v', settings$version, '_', lubridate::round_date(lubridate::now("UTC"), unit="day"),'.csv'))
}
ds_handles <- sapply(dataset_list, function(x) { x$dataset.meta$collection.handle })
ds_ids <- as.integer(sapply(dataset_list, function(x) { x$dataset.meta$dataset.id }))
params <- data.frame(handle = ds_handles,
datasetid = ds_ids,
acc.mean.mod = 3.02,
acc.mean.old = 15.,
acc.shape.mod = 0.53,
acc.shape.old = 0.9,
mem.strength = 2.,
mem.mean = 0.5,
hiatus = as.numeric(NA),
thick = 5.,
age.type = as.character(NA),
run = FALSE,
suitable = NA,
ndates = as.integer(NA),
success = NA,
reliableold = NA,
reliableyoung = NA,
notes = ".",
stringsAsFactors = FALSE)
default <- params
readr::write_csv(x = params,
path = paste0('data/params/bacon_params_v', settings$version, '.csv'))
} else {
params <- suppressMessages(readr::read_csv(paste0('data/params/bacon_params_v', settings$version, '.csv'),
col_types = paste0(c('c','i', rep('n',8),'c', 'l','l','i','l','c', 'c','c'), collapse='')))
}
## Writing a new parameters file.
In the case that there has been a prior run, or some prior assessment of thicknesses, the user should set the thickness value of settings.yaml to point to the correct file location. Otherwise the value should be set to false.
if(is.null(settings$thickness) | settings$thickness == FALSE | "character" %in% class(settings$thickness)) {
if ("character" %in% class(settings$thickness)) {
if (file.exists(settings$thickness)) {
params <- add_thickness(file = settings$thickness,
id_col = 1,
thick_col = 4,
parameters = params)
readr::write_csv(x = params,
path = paste0('data/params/bacon_params_v', settings$version, '.csv'))
} else {
stop("The file defined in `settings.yaml` does not exist.")
}
} else {
message("No thickness file is defined.")
}
} else {
stop("The setting for thickness in `settings.yaml` is not set correctly.")
}
## None of the dataset ids in the thickness file are in the parameter file.
## Modifying 0 records to update thicknesses.
In the case that there has been a prior run, or some prior assessment of accumulation rates, the user should set the accumulation value of settings.yaml to point to the correct file location. Otherwise the value should be set to false.
if(is.null(settings$accumulation) | settings$accumulation == FALSE | "character" %in% class(settings$accumulation)) {
if ("character" %in% class(settings$accumulation)) {
if (file.exists(settings$accumulation)) {
params <- add_accumulation(file = settings$accumulation,
id_col = 1,
accum_col = 4,
parameters = params)
} else {
stop("The file defined in `settings.yaml` does not exist.")
}
} else {
message("No accumulation file is defined.")
}
} else {
stop("The setting for accumulations in `settings.yaml` is not set correctly.")
}
## No accumulation file is defined.
Here we begin to generate the csv files for the cores so that Bacon can run. This requires calling Neotoma's API and then resolving the returned data into a csv. There are several decisions that go into the construction of the CSV file. These decisions are documented both in the notes column of the parameter file and in comments within the file R/build_agefiles.R; each decision applied to a record is recorded in that record's notes field.
ageorder <- get_table('agetypes')
for (i in 1:nrow(params)) {
params[i, ] <- build_agefiles(param = params[i, ],
ageorder = ageorder,
datasets = dataset_list,
downloads = downloads,
settings = settings)
readr::write_csv(x = params,
path = paste0("data/params/bacon_params_v", settings$version, ".csv"))
}
The script is set up so that wherever there is an objective choice, a clear notation of that choice is made and added to the parameter file's notes column as a semicolon-separated list. In addition to notes regarding the choice of age file, notes are also raised if:

- a core-top age or uncertainty has been assigned from defaults (e.g., the core_top_err value set in settings.yaml);
- a missing core-top age has been assigned a default age (-60, to be adjusted by the user);
- an age has a reported uncertainty of zero (0);
- an age has an age type of Guess, or Deglaciation (the age is dropped);
- there is no Cores directory associated with the record.

The final step, once all the files have been written, is to run Bacon. Using the rbacon package we can simplify everything:
params <- run_batch(params, settings = settings)
readr::write_csv(x = params,
path = paste0("data/params/bacon_params_v", settings$version, ".csv"))
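For reference, run_batch() is defined in the repository's R code; conceptually it behaves something like the simplified sketch below (not the actual implementation), using parallel::mclapply() when the parallel setting is a core count:

library(parallel)
# Simplified, hypothetical sketch of the batch driver: run Bacon for each
# suitable record that has not yet succeeded, in parallel where possible.
run_batch_sketch <- function(params, settings) {
  todo <- which(params$suitable %in% TRUE & !params$success %in% TRUE)
  cores <- if (isFALSE(settings$parallel)) 1 else settings$parallel
  results <- mclapply(todo, function(i) {
    out <- try(rbacon::Bacon(core = params$handle[i],
                             coredir = settings$core_path,
                             thick = params$thick[i],
                             acc.mean = c(params$acc.mean.mod[i], params$acc.mean.old[i]),
                             acc.shape = c(params$acc.shape.mod[i], params$acc.shape.old[i]),
                             mem.strength = params$mem.strength[i],
                             mem.mean = params$mem.mean[i],
                             suggest = FALSE, ask = FALSE))
    !inherits(out, 'try-error')
  }, mc.cores = cores)
  params$run[todo] <- TRUE
  params$success[todo] <- unlist(results)
  params
}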
The ultimate product is a Cores directory containing a folder for each dataset that generated a suitable chronology for Bacon, along with a parameters file (in data/params/) that records the parameters used in each Bacon model and any notes about elements that were changed or were of note in the Bacon runs.
One file has been added to the standard Bacon output: in each core directory, the file handle_XXX_posteriorout.csv contains draws from the posterior provided by Bacon. You will see NAs in this result (but not in the Bacon ages file).
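As a hedged example of working with those draws (the path is hypothetical; substitute your core's handle and dataset ID, and check the file's layout before summarizing):

library(readr)
# Hypothetical file name, for illustration only:
post <- read_csv('Cores/EXAMPLE/EXAMPLE_12345_posteriorout.csv')
# Inspect the draws; as noted above, expect some NA values here.
summary(post)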
Figure. Success (blue; n = 360) and failure (pink; n = 40) rate of the Bacon runs.
Here we summarize some elements of the run that was conducted with this document:
This interactive table allows you to page through your records for each site to examine the notes associated with each record. You can use the filters at the bottom of the table to search for individual records, or certain fields. The dataset IDs link to Neotoma Explorer and can be used to investigate sites further.
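A table like this can be generated with the DT package; the sketch below is one hypothetical way to do it, and the Neotoma Explorer URL pattern is an assumption to verify:

library(DT)
# Turn dataset IDs into (assumed) Neotoma Explorer links, then render an
# interactive table with column filters along the bottom:
params$datasetid <- paste0('<a href="https://apps.neotomadb.org/explorer/?datasetid=',
                           params$datasetid, '">', params$datasetid, '</a>')
datatable(params, filter = 'bottom', escape = FALSE)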
Blaauw, M., Christen, J.A., 2011. Flexible paleoclimate age-depth models using an autoregressive gamma process. Bayesian Analysis 6, 457–474.
Dawson, A., Paciorek, C.J., McLachlan, J.S., Goring, S., Williams, J.W., Jackson, S.T., 2016. Quantifying pollen-vegetation relationships to reconstruct ancient forests using 19th-century forest composition and pollen data. Quaternary Science Reviews 137, 156–175.
Goring, S.J., Graham, R., Loeffler, S., Myrbo, A., Oliver, J.S., Ormond, C., Williams, J.W., 2018. The Neotoma Paleoecology Database: A research outreach nexus. Elements of Paleontology. Cambridge University Press. doi:10.1017/9781108681582
Goring, S., Williams, J., Blois, J., Jackson, S., Paciorek, C., Booth, R., Marlon, J., Blaauw, M., Christen, J., 2012. Deposition times in the northeastern United States during the Holocene: Establishing valid priors for Bayesian age models. Quaternary Science Reviews 48, 54–60.
Kujawa, E.R., Goring, S.J., Dawson, A., Calcote, R., Grimm, E.C., Hotchkiss, S.C., Jackson, S.T., Lynch, E.A., McLachlan, J., St-Jacques, J.-M., Umbanhowar, C., Jr, Williams, J.W., 2016. The effect of anthropogenic land cover change on pollen-vegetation relationships in the American Midwest. The Anthropocene 15, 60–71.
Williams, J.W., Grimm, E.C., Blois, J., Charles, D.F., Davis, E., Goring, S.J., Graham, R.W., Smith, A.J., Anderson, M., Arroyo-Cabrales, J., Ashworth, A.C., Betancourt, J.L., Bills, B.W., Booth, R.K., Buckland, P., Curry, B.B., Giesecke, T., Jackson, S.T., Latorre, C., Nichols, J., Purdum, T., Roth, R.E., Stryker, M., Takahara, H., 2018. The Neotoma Paleoecology Database: A multi-proxy, international community-curated data resource. Quaternary Research 89, 156–177.