Working paper Open Access
Carole Goble; Stian Soiland-Reyes; Finn Bacall; Stuart Owen; Alan Williams; Ignacio Eguinoa; Bert Droesbeke; Simone Leo; Luca Pireddu; Laura Rodríguez-Navas; José Mª Fernández; Salvador Capella-Gutierrez; Hervé Ménager; Björn Grüning; Beatriz Serrano-Solano; Philip Ewels; Frederik Coppens
The practice of performing computational processes using workflows has taken hold in the biosciences as the discipline becomes increasingly computational. The COVID-19 pandemic has spotlighted the importance of systematic and shared analysis of SARS-CoV-2 and its data processing pipelines. This is coupled with a drive in the community towards adopting FAIR practices (Findable, Accessible, Interoperable, and Reusable) not just for data, but also for workflows, and to improve the reproducibility of processes, both manual and computational.
EOSC-Life brings together 13 of the Life Science ‘ESFRI’ research infrastructures (RIs) to create an open, digital and collaborative space for biological and medical research. The project is developing a cloud-based workflow collaboratory to drive implementation of FAIR workflows across disciplines and RI boundaries, and to foster tool-focused collaborations and reuse between communities via the sharing of data analysis workflows. The collaboratory aims to provide a framework for researchers and workflow specialists to use and reuse workflows. As such, it is an example of the Canonical Workflow Frameworks for Research (CWFR) vision in practice.
EOSC-Life is made up of established research infrastructures ranging from biobanking and clinical trial management, through to coordinating biomedical imaging and plant phenotyping to multi-omic and systems-based data analysis. The heterogeneity of the disciplines is reflected in the diversity of their data analysis needs and practices and the variety of workflow management systems they use. Many have specialist platforms developed over years. Workflow management systems in common use include Galaxy, Snakemake, and Nextflow, and more specialist, domain-specific systems such as SCIPION.
To serve the needs of this established and diverse community, EOSC-Life has developed WorkflowHub as an inclusive workflow registry, agnostic to any Workflow Management System (WfMS). WorkflowHub aims to incorporate the communities' workflows in partnership with each WfMS, and to embed the registration of workflows in community processes, e.g. by building on pre-existing workflow repositories. The registry adopts common practices, e.g. use of GitHub repositories, and supports integration with the ecosystem of tool packages, assisted by registries (bio.tools, biocontainers), and with services for testing and benchmarking workflows (OpenEBench, LifeMonitor).
As an umbrella registry, the Hub makes workflows Findable and Accessible by indexing workflows across workflow management systems and their native repositories, while providing rich standardized metadata. Interoperability and Reusability are supported by standardized descriptions of workflows and packaging of workflow components, developed in close collaboration with the communities. The WorkflowHub creates a place for registering and discovering libraries of workflows developed by collaborating teams, with suitable features for versioning, credit, analytics, and import/export needed to support the reuse of workflows, the development of sub-workflows as canonical steps and, ultimately, the identification of common patterns in the workflows.
At the heart of the collaboratory is a Digital Object framework for documenting and exchanging workflows annotated with machine processable metadata produced and consumed by the participating platforms. The Digital Object framework is founded on several needs:
Describing a workflow and its steps in a canonical, normalised and WfMS-independent way: we use the Common Workflow Language (CWL), more specifically the Abstract CWL (non-executable) description variant, to accompany the native workflow definitions. This presents the structure, constituent tools and external interface in a way that is interoperable across workflow languages. A WfMS can generate abstract CWL alongside its ‘native’ workflow description; this has already been demonstrated for Galaxy. This language duality matters for the long-term retention that reproducibility requires: the structure and metadata of the workflow remain accessible as CWL, independent of the native format, even if the workflow itself may no longer be executable, so the canonical workflow is captured in a FAIR format. The co-presence of the native format enables direct reuse in the specific WfMS, benefitting from all its features.
Metadata about a workflow and its tools using a minimal information model: we use the Bioschemas profiles Computational Tool, Computational Workflow and Formal Parameter, which are discipline-independent, opinionated conventions for using schema.org annotations. Bioschemas enables us to capture and publish workflow registrations and their metadata as FAIR Digital Objects. The EDAM Ontology is further used to add bioinformatics-specific metadata, such as strong typing of inputs and outputs, within both Abstract CWL and Bioschemas annotations.
Organising and packaging the definitions and components of a workflow with their associated objects, such as test data: we use a Workflow profile specialisation of RO-Crate, a community-developed, standardised approach for packaging research outputs with rich metadata. RO-Crate lets us package executable workflows together with their components, such as example and test data, abstract CWL, diagrams and documentation. This makes workflows more readily reusable. RO-Crate is the base unit of upload and download at the WorkflowHub. As CWFR Digital Objects of workflows, RO-Crates are activation-ready and are circulated between the different services for execution and testing.
Identifiers for all the components: like FAIR Digital Objects, RO-Crates can be metadata-rich bags of identifiers and can themselves be assigned permanent identifiers. This enables the full description of a computational analysis, from input data, through tools and workflows, to final results.
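The packaging and identifier needs above can be illustrated with a minimal sketch. It builds, in plain Python, a metadata document shaped like the ro-crate-metadata.json of a Workflow RO-Crate: a native workflow file typed as a Bioschemas ComputationalWorkflow, an accompanying abstract CWL file, and a test data file, each with its own identifier. The function names, file names and the workflow title are hypothetical, and the property layout is an assumption based on the published RO-Crate 1.1 and Workflow RO-Crate conventions; it is not taken from the WorkflowHub implementation.

```python
import json


def build_workflow_crate(name, native_file, abstract_cwl_file, test_data_file):
    """Return a dict shaped like an ro-crate-metadata.json document.

    Illustrative only: the layout follows the RO-Crate 1.1 conventions as
    we understand them and is not a validated Workflow RO-Crate.
    """
    parts = (native_file, abstract_cwl_file, test_data_file)
    graph = [
        {   # metadata descriptor: says which entity is the crate root
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # root dataset: the "bag" that lists every packaged component
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "mainEntity": {"@id": native_file},
            "hasPart": [{"@id": part} for part in parts],
        },
        {   # native, executable workflow definition
            "@id": native_file,
            "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
            "name": name,
            "programmingLanguage": {"@id": "#galaxy"},
            "subjectOf": {"@id": abstract_cwl_file},
        },
        {   # abstract (non-executable) CWL description of the same workflow
            "@id": abstract_cwl_file,
            "@type": ["File", "SoftwareSourceCode", "HowTo"],
            "programmingLanguage": {"@id": "#cwl"},
        },
        {"@id": test_data_file, "@type": "File", "name": "Example input"},
        {"@id": "#galaxy", "@type": "ComputerLanguage", "name": "Galaxy"},
        {"@id": "#cwl", "@type": "ComputerLanguage",
         "name": "Common Workflow Language"},
    ]
    return {"@context": "https://w3id.org/ro/crate/1.1/context",
            "@graph": graph}


def find_dangling_ids(crate):
    """Return local identifiers that are referenced but never described."""
    defined = {entity["@id"] for entity in crate["@graph"]}
    dangling = set()

    def walk(value):
        if isinstance(value, dict):
            if set(value) == {"@id"}:  # a bare reference to another entity
                ref = value["@id"]
                is_external = ref.startswith(("http://", "https://"))
                if not is_external and ref not in defined:
                    dangling.add(ref)
            else:
                for inner in value.values():
                    walk(inner)
        elif isinstance(value, list):
            for inner in value:
                walk(inner)

    walk(crate["@graph"])
    return dangling


if __name__ == "__main__":
    crate = build_workflow_crate(
        "Hypothetical variant-calling workflow",  # made-up title
        "workflow.ga", "workflow.abstract.cwl", "test/input.fastq",
    )
    print(json.dumps(crate, indent=2))
    print("dangling references:", find_dangling_ids(crate))
```

The second function reflects the "bag of identifiers" view: every local reference in the crate should resolve to a described entity, so a consuming service can follow identifiers from the root dataset to the workflow, its abstract description and its test data.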
Using these components, we have built an environment that supports the Workflow Life Cycle: from abstract description, through a specific rendering in a WfMS, to its execution and the documentation of its run provenance, results and continued testing.