Published February 25, 2023 | Version 1.1
Dataset Open

Packing provenance using CPM RO-Crate profile

  • 1. Masaryk University
  • 2. CRS4
  • 3. The University of Manchester

Description

This dataset is an RO-Crate that bundles artifacts of an AI-based computational pipeline execution. It is an example of application of the CPM RO-Crate profile,  which integrates  the Common Provenance Model (CPM), and the Process Run Crate profile.

As the CPM is a groundwork for the ISO 23494 Biotechnology — Provenance information model for biological material and data provenance standards series development, the resulting profile and the example is intended to be presented at one of the ISO TC275 WG5 regular meetings, and will become an input for the ISO 23494-5 Biotechnology — Provenance information model for biological material and data — Part 5: Provenance of Data Processing standard development.

Description of the AI pipeline

The goal of the AI pipeline whose execution is described in the dataset is to train an AI model to detect the presence of carcinoma cells in high resolution human prostate images. The pipeline is implemented as a set of python scripts that work over a filesystem, where the datasets, intermediate results, configurations, logs, and other artifacts are stored. In particular, the AI pipeline consists of the following three general parts:

  • Image data preprocessing. Goal of this step is to prepare the input dataset – whole slide images (WSIs) and their annotations – for the AI model. As the model is not able to process the entire high resolution images, the preprocessing step of the pipeline splits the WSIs into groups (training and testing). Furthermore, each WSI is broken down into smaller overlapping parts called patches. The background patches are filtered out and the remaining tissue patches are labeled according to the provided pathologists’ annotations.

  • AI model training. Goal of this step is to train the AI model using the training dataset generated in the previous step of the pipeline. Result of this step is a trained AI model.

  • AI model evaluation. Goal of this step is to evaluate the trained model performance on a dataset which was not provided to the model during the training. Results of this step are statistics describing the AI model performance.

In addition to the above, execution of the steps results in generation of log files. The log files contain detailed traces of the AI pipeline execution, such as file paths, model weight parameters, timestamps, etc. As suggested by the CPM, the logfiles and additional metadata present on the filesystem are then used by a provenance generation step that transforms available information into the CPM compliant data structures, and serializes them into files. 

Finally, all these artifacts are packed together in an RO-Crate.

For the purpose of the example, we have included only a small fragment of the input image dataset in the resulting crate, as this has no effect on how the Process Run Crate and CPM RO-Crate profiles are applied to the use case. In real world execution, the input dataset would consist of terabytes of data. In this example, we have selected a representative image for each of the input dataset parts. As a result, the only difference between the real world application and this example would be that the resulting real world crate would contain more input files. 

Description of the RO-Crate

Process Run Crate related aspects

The Process Run Crate profile can be used to pack artifacts of a computational workflow of which individual steps are not controlled centrally. Since the pipeline presented in this example consists of steps that are executed individually, and that the pipeline execution is not managed centrally by a workflow engine, the process run crate can be applied. 

Each of the computational steps is expressed within the crate’s ro-crate-metadata.json file as a pair of elements: 1) SW used to create files; 2) specific execution of that SW. In particular, we use the SoftwareSourceCode type to indicate the executed python scripts and the CreateAction type to indicate actual executions. 

As a result, the crate consists the seven following “executables”:

  • Three python scripts, each corresponding to a part of the pipeline: preprocessing, training, and evaluation.

  • Four provenance generation scripts, three of which implement the transformation of the proprietary log files generated by the AI pipeline scripts into CPM compliant provenance files. The fourth one is a meta provenance generation script.

For each of the executables, their execution is expressed in the resulting ro-crate-metadata.json using the CreateAction type. As a result, seven create-actions are present in the resulting crate.

Input dataset, intermediate results, configuration files and resulting provenance files are expressed according to the underlying RO Crate specification.

CPM RO-Crate related aspects

The main purpose of the CPM RO-Crate profile is to enable identification of the CPM compliant provenance files within a crate. To achieve this, the CPM RO-Crate profile specification prescribes specific file types for such files: CPMProvenanceFile, and CPMMetaProvenanceFile.

In this case, the RO Crate contains three CPM Compliant files, each documenting a step of the pipeline, and a single meta-provenance file. These files are generated as a result of the three provenance generation scripts that use available log files and additional information to generate the CPM compliant files. In terms of the CPM, the provenance generation scripts are implementing the concept of provenance finalization event. The three provenance generation scripts are assigned SoftwareSourceCode type, and have corresponding executions expressed in the crate using the CreateAction type.

Remarks

The resulting RO Crate packs artifacts of an execution of the AI pipeline. The scripts that implement individual steps of the pipeline and provenance generation are not included in the crate directly. The implementation scripts are hosted on github and just referenced from the crate’s ro-crate-metadata.json file to their remote location.

The input image files included in this RO-Crate are coming from the Camelyon16 dataset.

Files

rocrate.zip

Files (5.7 GB)

Name Size Download all
md5:3654836e51147c59b8072caa7be14406
5.7 GB Preview Download

Additional details

References

  • Litjens G; Bandi P; Bejnordi BE; Geessink O; Balkenhol M; Bult P; Halilovic A; Hermsen M; de Loo Rv; Vogels R; Manson Q; Stathonikos N; Baidoshvili A; Diest Pv; Wauters C; Dijk Mv; Laak Jv (2018): Supporting data for "1399 H&E-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset" GigaScience Database. http://dx.doi.org/10.5524/100439
  • Stian Soiland-Reyes, Peter Sefton, Mercè Crosas, Leyla Jael Castro, Frederik Coppens, José M. Fernández, Daniel Garijo, Björn Grüning, Marco La Rosa, Simone Leo, Eoghan Ó Carragáin, Marc Portier, Ana Trisovic, RO-Crate Community, Paul Groth, Carole Goble (2022): Packaging research artefacts with RO-Crate. Data Science 5(2) https://doi.org/10.3233/DS-210053
  • Rudolf Wittner, Cecilia Mascia, Matej Gallo, Francesca Frexia, Heimo Müller, Markus Plass, Jörg Geiger & Petr Holub (2022): Lightweight Distributed Provenance Model for Complex Real–world Environments. Scientific Data 9:503 (2022). https://doi.org/10.1038/s41597-022-01537-6