Abstract—In this paper we describe our experience adopting the Research Object Bundle (RO-Bundle) format with BagIt serialization (BagIt-RO) for the design and implementation of “tales” in the Whole Tale platform. A tale is an executable research object intended for the dissemination of computational scientific findings that captures information needed to facilitate understanding, transparency, and re-execution for review and computational reproducibility at the time of publication. We describe the Whole Tale platform and requirements that led to our adoption of BagIt-RO, specifics of our implementation, and discuss migrating to the emerging Research Object Crate (RO-Crate) standard.
Index Terms—Reproducibility of results, Standards, Packaging, Interoperability, Software, Digital preservation
I. Introduction
Whole Tale (http://wholetale.org) is a web-based, open-source platform for reproducible research supporting the creation, sharing, execution, and verification of “tales” [2, 5]. Tales are executable research objects that capture the code, data, and environment along with narrative and workflow information needed to re-create computational results from scientific studies. A goal of the Whole Tale platform (WT) is to produce an archival package that is exportable, publishable, and can be used for verification of computational reproducibility, for example as part of the peer-review process.
Since its inception, the Whole Tale platform has been designed to bring together existing open science infrastructure. Researchers can ingest existing data from various scientific archival repositories; launch popular analytical tools (such as Jupyter and RStudio); create and customize computational environments (using repo2docker1); conduct analyses; create/upload code and data; and publish the resulting package back to an archival repository. Tales are also downloadable and re-executable locally, including the ability to retrieve remotely published data.
With the May 2019 release of version 0.7 of the platform we adopted the Research Object Bundle BagIt serialization (BagIt-RO) format [13]. By combining the BagIt-RO serialization with our repo2docker-based execution framework and the BDBag tools [4], we were able to define and implement a standards-compliant, self-describing, portable, re-executable research object with the ability to retrieve remotely published data.
In this paper we describe the Whole Tale platform and the requirements that led to our adoption of the BagIt-RO format. The paper is organized as follows. In section II, we present a motivating example of the use of the Whole Tale platform, followed by a brief description of the system architecture in section III. In section IV we outline the requirements that led to our adoption of the BagIt-RO format. In section V we describe our implementation in more detail, followed by a discussion and conclusions.
II. Example scenario: Analyzing seal migration patterns
We begin with a motivating example to illustrate the end-to-end Whole Tale workflow for creating, exporting, and publishing a tale based on existing data archived using the Research Workspace2, a DataONE member node. This example is based on tutorial material described in [12].
A research team is preparing to publish a manuscript describing a computational model for estimating animal movement paths from telemetry data. The source data for their analysis, tracking data for juvenile seals in Alaska [3], has been published in Research Workspace, a DataONE network member. Using the Whole Tale platform, the researchers register the external dataset. They then create a new tale by launching an RStudio environment based on images maintained by the Rocker Project [1]. Using the interactive environment, they clone a GitHub repository, modify an R Markdown document, customize the environment by specifying OS and R packages via repo2docker configuration files, and execute their code to generate outputs. They download the package in the compressed BagIt-RO format and run it locally to verify their tale. Finally, they enter descriptive metadata and publish the final package back to DataONE to archive it and obtain a persistent identifier to include in the publication.
This scenario is further illustrated in Figure 1.
III. System Architecture
This section provides a brief overview of the Whole Tale system architecture, illustrated in Figure 2. Whole Tale is a scalable platform based on the Docker Swarm container orchestration system that exposes a set of core services via REST APIs and a single-page application (SPA). Key components include the web-based dashboard, an API layer built on the Girder3 framework, and a data management system (DMS) that registers and provides access to externally published data.
Several aspects of the Whole Tale system relate to the BagIt-RO serialization format, including filesystem organization, user-defined environments, and metadata, as well as the export and publication functions. We describe these in more detail below.
Each tale has a workspace (folder) that contains user-created code, data, workflow, documentation, and narrative information. The workspace also contains repo2docker-compatible configuration files defining the tale environment, described below. The workspace folder is mounted into the running tale environment.
Optionally, each tale can include references to externally published data. The data is then registered with the Whole Tale system and managed by the DMS. Externally referenced data appears in the data folder, a sibling to the workspace.
Users can optionally customize the tale environment using repo2docker-compatible configuration files. Whole Tale extends repo2docker via the repo2docker_wholetale4 package, which adds buildpacks to support Rocker, Spark, and OpenRefine images.
Tales have basic descriptive metadata including creator, authors, title, description, and keywords, as well as information about the selected environment, licenses, and associated persistent identifiers. The tale metadata is included in the metadata directory, in both the manifest.json and environment.json files. The license is included in the BagIt payload directory, but not as part of the tale workspace.
Tales can be exported as a BagIt-RO serialized archive that contains the contents of the tale workspace (code, local data, narrative, workflow, and repo2docker configuration files) as well as references to external data, tale metadata, and a script to run the tale locally. BDBag [4] is used to materialize “holey” bags by downloading the files specified in fetch.txt, initially via HTTP and eventually via DOI, Globus, and Agave schemes. The local run script is stored at the root of the exported BagIt archive.
Table 1 describes the contents of an exported tale in the BagIt-RO format. A complete example is available at https://doi.org/10.5281/zenodo.2641314.
| File | Description |
|------|-------------|
| bag-info.txt | Bag metadata using the bdbag-ro-profile5 |
| bagit.txt | Bag declaration |
| data/ (LICENSE; workspace/ with apt.txt, postBuild, requirements.txt, wt_quickstart.ipynb) | Payload directory containing the tale license and workspace contents, including repo2docker-compatible configuration files |
| fetch.txt | Fetch file listing remote data to retrieve |
| manifest-[md5, sha256].txt | Payload manifests (checksums) |
| metadata/ (manifest.json, environment.json) | Tag directory containing the RO manifest.json and Whole Tale environment metadata (required by repo2docker_wholetale) |
| tagmanifest-[md5, sha256].txt | Tag manifests (checksums) |
| README.md | Tale top-level README |
| run-local.sh | Tale local execution script |
IV. Requirements
The scenario described in section II highlights several key requirements of the Whole Tale platform that led to our selection of the BagIt-RO serialization. These include the need for a standards-compliant, self-describing package; support for local re-execution; the ability to reference and later retrieve externally published data; and the ability to publish tales to archival repositories.
V. Adopting the BagIt-RO Model
Whole Tale uses the RDF data model to encode tale information for export and exchange. We selected a JSON-LD representation for human readability, extensibility, compatibility with the Whole Tale APIs, and potential interoperability with search engines and third-party publishers. After developing an ad hoc internal format, we explored emerging standards in the research object space and settled on BagIt-RO for serialization. Using the RO-Bundle specification with BagIt serialization, in conjunction with the BDBag tools, met many of our initial requirements. Additional tale metadata attributes that were not included in the BagIt-RO model could be added using vocabularies such as schema.org. Throughout this section, we use the manifest.json from the example above; a complete listing is included in Appendix A.
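For orientation, the sketch below shows the general shape of a tale manifest.json: the standard RO-Bundle JSON-LD context extended with a schema.org prefix, creation metadata, and an aggregates list. The values and paths are placeholders rather than an excerpt from the published example.

```json
{
  "@context": [
    "https://w3id.org/bundle/context",
    { "schema": "http://schema.org/" }
  ],
  "createdOn": "2019-04-15T10:00:00Z",
  "createdBy": { "name": "A. Researcher" },
  "aggregates": [
    { "uri": "../data/workspace/wt_quickstart.ipynb" },
    { "uri": "../data/LICENSE" }
  ]
}
```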
One strong point of RO-Bundle is that it treats filesystem artifacts as aggregates of the manifest. Doing so satisfies our requirement of being able to track where files belong, enabling us to both export and re-import tales even in the case where we must publish a hierarchical structure to a repository that can only represent a flat structure. In the case of Whole Tale, artifacts include data that were retrieved from external repositories as well as files that the user created or uploaded into the tale workspace. The tale workspace contents are included in the payload “data/workspace” directory and the external data are fetched into the payload “data/data” directory, mirroring the filesystem organization on the web-based platform.
Workspace artifacts are easily described with a single URI entry. Some files, such as the system generated README.md, are tagged with additional metadata as shown below. In this case the additional metadata specifies the “type” of the file as a “HowTo”.
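A hypothetical entry of this kind is sketched below; the use of a schema.org type annotation is an assumption for illustration, and the actual property used in the Whole Tale manifest may differ.

```json
{
  "uri": "../README.md",
  "@type": "schema:HowTo"
}
```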
Whole Tale supports two types of external data: data that reside in a repository identified by persistent identifier (e.g., DOI) and data that exists at a generic HTTP address. In addition to including information about external data in the manifest.json, the URL for each remote file, regardless of type, is included in the fetch.txt for retrieval using BDBag tools.
Generic HTTP Data: For data that does not belong to a remote repository, a simple bundle is created in the aggregation section. The URI points to the HTTP address where the file may be retrieved and the bundle object holds the filesystem relevant information. The combination of information allows us to retrieve the file and place it in the correct folder (i.e., data/data).
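The sketch below illustrates such an entry using the RO-Bundle bundledAs construct; the URL and filename are hypothetical.

```json
{
  "uri": "https://example.org/files/seal_tracks.csv",
  "bundledAs": {
    "folder": "../data/data/",
    "filename": "seal_tracks.csv"
  }
}
```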
Repository Data: For datasets that have been published to research repositories, additional metadata can be ingested when files are registered with the system. The individual files are described with a single bundle object, and linked to an additional structure that describes the dataset in more detail.
The following snippet describes a remote dataset that resides in DataONE and the aggregation recording the relationship between a file in that dataset and its ultimate location after retrieval in the payload “data” directory.
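The complete listing appears in the published example; the abbreviated sketch below conveys the idea, with the dataset block, identifiers, and the property linking a file to its parent dataset shown as assumed names and placeholder values rather than verbatim excerpts.

```json
{
  "Datasets": [
    {
      "@id": "doi:10.xxxx/example",
      "@type": "Dataset",
      "name": "Satellite telemetry dataset (raw): Juvenile bearded and spotted seals",
      "identifier": "doi:10.xxxx/example"
    }
  ],
  "aggregates": [
    {
      "uri": "https://cn.dataone.org/cn/v2/resolve/<pid>",
      "schema:isPartOf": "doi:10.xxxx/example",
      "bundledAs": {
        "folder": "../data/data/",
        "filename": "telemetry.csv"
      }
    }
  ]
}
```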
C. Describing the Computing Environment
Whole Tale uses a customized version of the Binder repo2docker package. In addition to including configuration files in the workspace, Whole Tale exports information about the environment including runtime information in the tale. One shortcoming of the BagIt-RO model is that there is not a well-defined place for this metadata. To address this need, we define an additional tag file, environment.json, which encodes sufficient information about the environment so that it can be re-created. The metadata contained in this file is represented as JSON and is not described using standard vocabularies.
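Because environment.json is internal to Whole Tale and not governed by a standard vocabulary, its exact schema is implementation-defined; the following is a hypothetical sketch of the kind of information it carries, with key names and values assumed for illustration.

```json
{
  "name": "RStudio",
  "config": {
    "buildpack": "RockerBuildPack",
    "memLimit": "2048m",
    "port": 8787,
    "user": "rstudio",
    "targetMount": "/home/rstudio/work"
  }
}
```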
D. Describing Additional Attributes
A number of properties that describe additional tale attributes (e.g., authors, keywords, description, license) are defined at the manifest root. Schema.org’s vocabulary sufficed for describing these general metadata fields.
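A sketch of these root-level properties using schema.org terms follows; the values are placeholders drawn from the scenario in section II.

```json
{
  "schema:name": "Estimating seal movement paths from telemetry data",
  "schema:description": "A tale demonstrating a computational model for estimating animal movement paths.",
  "schema:keywords": [ "telemetry", "animal movement", "seals" ],
  "schema:license": "CC-BY-4.0"
}
```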
Attributing authorship to a tale is a requirement for tracking researcher contributions and is also used during metadata generation with publishers. The Provenance, Authoring, and Versioning (PAV) vocabulary is used instead of schema.org because it is already included in the RO-Bundle context.
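In the RO-Bundle JSON-LD context these PAV terms appear as createdBy and authoredBy; a minimal sketch, with placeholder names and ORCIDs, is:

```json
{
  "createdBy": {
    "name": "A. Researcher",
    "orcid": "https://orcid.org/0000-0000-0000-0000"
  },
  "authoredBy": [
    { "name": "A. Researcher", "orcid": "https://orcid.org/0000-0000-0000-0000" },
    { "name": "B. Collaborator", "orcid": "https://orcid.org/0000-0000-0000-0001" }
  ]
}
```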
A planned feature of Whole Tale is the ability to track executions and steps in researchers’ workflows, which we call computational provenance. The BagIt-RO model includes the ability to provide provenance information through the inclusion of the provenance.json file. However, this file is intended to capture archival provenance information, and it is unclear whether computational provenance should be included there. Whole Tale plans to use the ProvONE ontology6, an extension to W3C PROV7.
The URI of each file in the manifest can be referenced inside the provenance.json file, enabling rich linking of information. This information can also be transcribed to publisher-specific formats, provided that they support PROV. Figure 3 illustrates how provenance information is rendered in DataONE.
VI. Discussion
In this section, we highlight and discuss several issues related to our implementation of BagIt-RO that we hope will be of interest to workshop participants and serve as possible input to current work on the RO-Crate specification. We discuss the importance of re-executability; the ability to reference and retrieve external data; the relationship between tales and source control repositories; and our ongoing work on computational provenance and verification workflows.
A. Executable research objects
Tales are executable research objects. By this we mean that the research object itself may be built and re-executed for exploration, re-use, reproducibility, and verification. This is not a unique capability, as many systems have recently been developed to support the creation of similar artifacts (for example, Binder and Code Ocean). Executable research objects contain not only data, code, and documentation, but also information about the computational environment. This executability leads to additional capabilities, such as the generation and comparison of computational provenance or methods of automated verification.
The FAIRDOM infrastructure initiative has used the Research Object framework as a standards-based method for grouping components into container formats, including BagIt [14]. We extend this approach in the Whole Tale framework and add the capability for externally referenced data and general research pipelines. Our efforts are more general than ReproZip, which gathers and bundles dependencies for command-line executions [6]. The Collective Knowledge (CK) framework gathers research objects with unique IDs and metadata in JSON format but does not ensure re-executability [11]. Sciunits, on the other hand, are self-contained bundles targeted at scientific experiments and guaranteed to re-execute regardless of deployment [21, 22].
In the Whole Tale platform, users are presented with a fixed filesystem hierarchy that includes “workspace” and “data” directories. The workspace directory contains code, local data, and additional files (e.g., documentation) and the sibling “data” directory contains externally referenced data files (read-only).
In our v0.7 release, the BagIt payload directory of an exported tale similarly contains “workspace” and “data” directories. The manifest.json contains information about remotely registered datasets that is also included in the BagIt fetch.txt. When the BDBag tools are used to fetch remote datasets, they are downloaded to the payload data directory, matching the online filesystem organization and system capabilities. The concept of the fetch.txt, while primitive, is surprisingly effective when used with BDBag. We also foresee taking advantage of other BDBag capabilities, such as transferring Globus data or using DOI resolution. However, there is redundancy in tracking external data information in both the BagIt fetch.txt and the RO manifest.json.
Many researchers use source control repositories (e.g., GitHub) to organize and collaborate on research projects. Repositories can be released and published via external tools such as Zenodo or Whole Tale. In the Whole Tale platform, the “workspace” directory can be mapped to a version-controlled repository. This raises the question of whether the workspace (or repository) should contain everything, including information currently stored in the manifest.json or environment.json. This information is essential to the understandability and re-executability of the tale, but is currently modeled as external to the primary tale contents (as is common with descriptive metadata). During the local execution process, for technical reasons we bind-mount files from the “metadata” directory into the workspace to support building the tale image. In future releases, we are considering exposing the manifest information along with computational provenance information (below) as part of the workspace instead of external to it. This means that even simple metadata would be in the workspace and easily added to version control.
D. Reproducibility and computational provenance information
Computational provenance refers to methods of capturing provenance (“the source or origin of an object”) for computational tasks [10] and is a subset of the larger notion of reproducibility of data- and computationally-enabled results [15, 19, 17, 20, 18]. We are beginning to explore methods of capturing and storing computational provenance information to enable reproducibility of computational findings in tales. In the RO-Bundle specification, provenance information is defined as “describing creators, dates, and sources” and is more concerned with the provenance of the research object itself, which we term archival provenance. Computational provenance information is internal to the tale and could be generated by the user or by the Whole Tale system directly. We view computational provenance information as a key component of transparency for the evaluation and verification of tales and as part of enabling reproducibility.
E. Supporting reproducibility via verification workflows
Research communities and journals are increasingly adopting artifact review processes that include re-execution of computational analyses in support of reproducibility [16]. Examples include the workflow implemented by the Odum Institute for the American Journal of Political Science [7], the Journal of the American Statistical Association8, Biostatistics [9], and the ACM Transactions on Mathematical Software (TOMS) Replicated Computational Results9 program. We see tales and related research objects being used to simplify and possibly automate aspects of the verification process. Having a standard format for the exchange of research objects that fits into these enhanced curatorial and verification workflows may significantly reduce the burden on research communities.
One drawback of the BagIt serialization is that the BagIt configuration is foregrounded and difficult for the average researcher or user to understand, while the “payload” directory is less apparent and confusingly named “data”. Although out of scope for the RO discussion, we are supportive of the idea of a “.bagit” directory that contains the relevant configuration information and is largely hidden from the average user.
VII. Migrating to RO-Crate
Since our adoption of the BagIt-RO model, the community has moved forward on the Research Object Crate (RO-Crate) specification10. In this section, we report the results of a preliminary analysis of the changes needed to migrate to the new format. Doing so will require versioning the tale export format, and we are unlikely to make changes until the community settles on a near-final version of the specification.
RO-Crate 0.2-DRAFT introduces a number of changes relative to RO-Bundle 1.0. RO-Crate promises many benefits that align with Whole Tale, namely the adoption of schema.org as the primary vocabulary and its ability to be used alongside a variety of serialization formats.
VIII. Conclusion
By implementing an extension to RO-Bundle with BagIt serialization and leveraging existing open science infrastructure tools, including repo2docker and BDBag, we were able to create an exportable, publishable, and executable research object package, in short, taking a step toward the publication of “really reproducible research” [8]. While not a perfect fit, BagIt-RO met many of our platform requirements. We expect to continue work in this area as we add support for computational provenance information and automated verification, and we hope to contribute to the use cases and discussions that inform the development of a broader community standard.
Acknowledgment
This work is supported by National Science Foundation Award OAC-1541450.
References
[1] C. Boettiger and D. Eddelbuettel. An introduction to Rocker: Docker containers for R. CoRR, abs/1710.03675, 2017.
[2] A. Brinckman, K. Chard, N. Gaffney, M. Hategan, M. B. Jones, K. Kowalik, S. Kulasekaran, B. Ludäscher, B. D. Mecum, J. Nabrzyski, et al. Computing environments for reproducibility: Capturing the “whole tale”. Future Generation Computer Systems, 94:854–867, 2019.
[3] M. Cameron, J. London, K. Frost, A. Whiting, and P. Boveng. Satellite telemetry dataset (raw): Juvenile bearded and spotted seals, 2004–2006, Kotzebue, Alaska, 2018.
[4] K. Chard, M. D’Arcy, B. Heavner, I. Foster, C. Kesselman, R. Madduri, A. Rodriguez, S. Soiland-Reyes, C. Goble, K. Clark, E. W. Deutsch, I. Dinov, N. Price, and A. Toga. I’ll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets. In 2016 IEEE International Conference on Big Data (Big Data), pages 319–328, Dec 2016.
[5] K. Chard, N. Gaffney, M. B. Jones, K. Kowalik, B. Ludäscher, J. Nabrzyski, V. Stodden, I. Taylor, M. J. Turk, and C. Willis. Implementing computational reproducibility in the Whole Tale environment. In Proceedings of the 2nd International Workshop on Practical Reproducible Evaluation of Computer Systems, P-RECS ’19, pages 17–22, New York, NY, USA, 2019. ACM.
[6] F. Chirigati, R. Rampin, D. Shasha, and J. Freire. ReproZip: Computational reproducibility with ease. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD ’16, pages 2085–2088, New York, NY, USA, 2016. ACM.
[7] T.-M. Christian, S. Lafferty-Hess, W. G. Jacoby, and T. Carsey. Operationalizing the replication standard. IJDC, 13(1):114–124, 2018.
[8] J. F. Claerbout and M. Karrenbach. Electronic documents give reproducible research a new meaning. In SEG Technical Program Expanded Abstracts 1992, pages 601–604. Society of Exploration Geophysicists, 1992. doi:10.1190/1.1822162.
[9] D. L. Donoho. An invitation to reproducible computational research. Biostatistics, 11(3):385–388, 2010.
[10] J. Freire, D. Koop, E. Santos, and C. T. Silva. Provenance for computational tasks: A survey. Computing in Science and Engineering, 10(3):11–21, May 2008.
[11] G. Fursin, A. Lokhmotov, D. Savenko, and E. Upton. A collective knowledge workflow for collaborative research into multi-objective autotuning and machine learning techniques. CoRR, abs/1801.08024, 2018.
[12] J. M. London and D. S. Johnson. Alaska bearded and spotted seal example dataset and analysis. https://github.com/jmlondon/crwexampleakbs, 2019.
[13] S. Soiland-Reyes, M. Gamble, and R. Haines. Research Object Bundle 1.0, researchobject.org recommendation. https://w3id.org/bundle/2014-11-05/, 2014.
[14] N. Stanford, F. Bacall, M. Golebiewski, O. Krebs, R. Kuzyakiv, Q. Nguyen, S. Owen, S. Soiland-Reyes, J. Straszewski, D. van Niekerk, A. Williams, K. Wolstencroft, L. Malmström, B. Rinn, J. Snoep, W. Müller, and C. Goble. FAIRDOM: Reproducible systems biology through FAIR asset management. In Reproducibility, Standards and SOP in Bioinformatics: Combined CHARME–EMBnet and NETTAB 2016 Workshop, 2016.
[15] V. Stodden. Reproducible research: Tools and strategies for scientific computing. Computing in Science and Engineering, 14:11–12, 2012.
[16] V. Stodden, P. Guo, and Z. Ma. Toward reproducible computational research: An empirical analysis of data and code policy adoption by journals. PLOS ONE, 8(6):e67111, June 2013.
[17] V. Stodden, F. Leisch, and R. D. Peng. Implementing Reproducible Research. CRC Press, Apr. 2014.
[18] V. Stodden, M. McNutt, D. H. Bailey, E. Deelman, Y. Gil, B. Hanson, M. A. Heroux, J. P. Ioannidis, and M. Taufer. Enhancing reproducibility for computational methods. Science, 354(6317):1240–1241, 2016.
[19] V. Stodden and S. Miguez. Best practices for computational science: Software infrastructure and environments for reproducible and extensible research. Journal of Open Research Software, 2, 2014.
[20] V. Stodden, S. Miguez, and J. Seiler. ResearchCompendia.org: Cyberinfrastructure for reproducibility and collaboration in computational science. Computing in Science and Engineering, 17(1):12–19, 2015.
[21] D. H. T. That, G. Fils, Z. Yuan, and T. Malik. Sciunits: Reusable research objects. CoRR, abs/1707.05731, 2017.
[22] Z. Yuan, D. H. T. That, S. Kothari, G. Fils, and T. Malik. Utilizing provenance in reusable research objects. Informatics, 5:14, 2018.
1 https://repo2docker.readthedocs.io/
2 https://www.researchworkspace.com
3 https://girder.readthedocs.io
4 https://github.com/whole-tale/repo2docker_wholetale
5 https://raw.githubusercontent.com/fair-research/bdbag/master/profiles/bdbag-ro-profile.json
6 https://purl.dataone.org/provone-v1-dev
7 https://www.w3.org/TR/prov-overview/
8 https://magazine.amstat.org/blog/2016/07/01/jasa-reproducible16/
9 http://toms.acm.org/replicated-computational-results.cfm
10 https://researchobject.github.io/ro-crate/