Published January 23, 2026 | Version v2
Data paper Open

A metadata managed FAIR end-to-end workflow for microbial community Omics data analysis

  • 1. Wageningen University & Research

Description

Background: Molecular profiling using high-throughput ’omics technologies has greatly increased our ability to interrogate complex microbial communities at the molecular level. In the context of data reuse, the FAIRification of these extensive datasets is frequently perceived as a secondary administrative task, addressed only after data analysis has been completed. However, this approach overlooks the potential benefits of early metadata integration, as the procedures for processing and analyzing raw data are primarily dictated by the underlying research design and experimental conditions. Gathering interoperable research metadata at the earliest stages creates a standardized basis for managing, processing, and analyzing data, enabling more efficient and reproducible FAIR workflows.
Results: The single containment principle was used to develop modular, containerized, reproducible workflows that support the FAIR principles for research software by systematically capturing standardized metadata for each data processing step along with the resulting data products. Using defined mock metagenomic datasets as an example, we show that interoperable research metadata can be used to drive such computational workflows. Processing raw data in this way creates machine-actionable provenance chains that enhance the reproducibility and reusability of the resulting data products.
Conclusions: A seamless integration of wet lab experiments with computational investigations is essential for a FAIR end-to-end research process. Metadata-managed workflows remove the need for unnecessary data manipulation. Workflow provenance registration makes explicit the complex multi-step methods employed for data processing and analysis. Combining FAIR principles with data provenance registration enhances the reusability of omics datasets by promoting transparency and reproducibility.

 

Data Availability

The datasets supporting the results of this article are available in the following repositories:

  • Test datasets:
    The mock community datasets (BMOCK12 and ZYMO) used for validation are available from their original publications [30,31].

  • Supplementary data files:
    The following supplementary files are deposited in the Zenodo repository [48] and are also included with this article:

    • Supplementary File S1: FAIR-DS experimental metadata in RDF/Turtle format, including ISA model structure and MIxS-compliant metadata for all mock communities.

    • Supplementary File S2: FAIR-DS experimental metadata in Excel format for human-readable access.

    • Supplementary File S3: MIMAG/MIxS-compliant metadata reports for all MAGs, including completeness, contamination, and taxonomic classification.

    • Supplementary File S4: CWL tool definition configuration files (YAML format) for all workflow runs.

    • Supplementary File S5: SPARQL query templates for extracting operational and quality metrics from GraphDB.

    • Supplementary File S6: Complete operational metadata for all workflow runs, including runtime statistics and tool execution times.

    • Supplementary File S7: Raw ANI matrices (pairwise values) for all three datasets.

    • Supplementary File S8: Complete workflow provenance data in RDF/Turtle format (PROV-O/CWLProv compliant). Filenames: ZYMO_EVEN_PROVENANCE.trig.gz, ZYMO_LOG_PROVENANCE.trig.gz, BMOCK12_PROVENANCE.trig.gz

    • Supplementary File S9: Functional annotation data in RDF/Turtle format (GBOL ontology). Filenames: ZYMO_LOG_FUNCTIONAL_ANALYSIS.trig.gz, ZYMO_EVEN_FUNCTIONAL_ANALYSIS.trig.gz, BMOCK12_FUNCTIONAL_ANALYSIS.trig.gz

    • Supplementary File S10: GBOL data model schema in Mermaid format.

    • Supplementary File S11: GBOL data model schema in ShEx (Shape Expressions) format.

    • Supplementary File S12: Binning reproducibility analysis for Bacillus subtilis in the ZYMO-LOG dataset, showing contig count variations across assembly strategies and replicate runs with SemiBin2.

    • Supplementary Figures S1–S3: ANI heatmaps for the ZYMO-EVEN, ZYMO-LOG, and BMOCK12 datasets.

    • Supplementary Figure S4: GBOL schema class diagram illustrating the structure of functional annotation data.

    The RDF datasets (Supplementary Files S8 and S9) can be loaded into any RDF-compatible triple store and queried using standard SPARQL tools. Example SPARQL queries are provided in Supplementary File S5. The RDF data use standard ontologies (PROV-O [42], CWLProv [29], and GBOL), ensuring interoperability and enabling integration with other FAIR-compliant datasets. The complete GBOL data model schema is provided in Supplementary Files S10 and S11 and visualized in Supplementary Figure S4.

  • Workflow code and analysis notebooks:
    The workflow source code and Jupyter notebooks used for data analysis, figure generation, and table preparation are available on GitLab at:
    https://git.wur.nl/unlock/projects/FAIRwf4MicrobialCommunity

  • Workflows:
    The workflow definitions are available on WorkflowHub [49], and their source code is hosted on GitLab at:
    https://gitlab.com/m-unlock/cwl
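
As noted above, the RDF datasets in Supplementary Files S8 and S9 can be loaded into any RDF-compatible triple store and queried with standard SPARQL tools. The following is a minimal sketch of such a query from Python, not the authors' code and not one of the templates in Supplementary File S5. It assumes the provenance data have already been loaded into a store that exposes a SPARQL 1.1 Protocol endpoint and that serves the union of all named graphs as its default graph. The endpoint URL, the repository name, and all function names are hypothetical; only the PROV-O terms (prov:Activity, prov:startedAtTime, prov:endedAtTime) are taken from the standard ontology.

```python
import json
import urllib.parse
import urllib.request

PROV = "http://www.w3.org/ns/prov#"

# Hypothetical endpoint: adjust host, port, and repository name to your store.
EXAMPLE_ENDPOINT = "http://localhost:7200/repositories/example"


def build_activity_query(limit=10):
    """Return a SPARQL query listing PROV-O activities with their time span."""
    if type(limit) is not int or limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        f"PREFIX prov: <{PROV}>\n"
        "SELECT ?activity ?started ?ended WHERE {\n"
        "  ?activity a prov:Activity ;\n"
        "            prov:startedAtTime ?started ;\n"
        "            prov:endedAtTime ?ended .\n"
        "}\n"
        "ORDER BY ?started\n"
        f"LIMIT {limit}"
    )


def build_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol POST request with a form-encoded query."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def parse_bindings(results):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in results["results"]["bindings"]
    ]


def run_query(endpoint, query):
    """Send the query to the endpoint and return the flattened result rows."""
    with urllib.request.urlopen(build_request(endpoint, query)) as response:
        return parse_bindings(json.load(response))
```

With a running store, run_query(EXAMPLE_ENDPOINT, build_activity_query(5)) would return up to five activities ordered by start time. Nothing in the sketch contacts the network until run_query is called, so the query and request can be inspected beforehand.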

Files

Supplementary_File_S12_Binning_Reproducibility.csv

Files (2.3 kB)

Additional details

Software

Repository URL
https://git.wur.nl/unlock/projects/FAIRwf4MicrobialCommunity
Programming language
Python, Shell, SPARQL
Development Status
Active