Published August 4, 2025 | Version v1
Conference paper Open

Enabling Peta-Scale Federated Repositories through Cloud-Native Formats: Lessons from a fast-paced challenge in the bioimaging community

  • 1. German BioImaging e.V., Society for Microscopy and Image Analysis
  • 2. scalable minds GmbH
  • 3. The Jackson Laboratory for Genomic Medicine
  • 4. Multiscale Bioimaging Cluster of Excellence "MBExC", University of Goettingen
  • 5. Center for Molecular Bioengineering, Technische Universität Dresden
  • 6. School of Life Sciences, University of Dundee
  • 7. The Francis Crick Institute
  • 8. RIKEN Center for Biosystems Dynamics Research
  • 9. European Molecular Biology Laboratory, European Bioinformatics Institute
  • 10. Universität Münster
  • 11. Yikes LLC
  • 12. Leibniz Institute for Neurobiology
  • 13. Cluster of Excellence "Physics of Life", Technische Universität Dresden
  • 14. Institute of Neurosciences and Medicine (INM-1), Forschungszentrum Jülich
  • 15. Glencoe Software, Inc.
  • 16. Heinrich Heine Universität Düsseldorf

Contributors

  • 1. Nationale Forschungsdateninfrastruktur (NFDI) e.V.
  • 2. University of Amsterdam

Description

As research disciplines increasingly generate large-scale imaging data, the need for robust, scalable, and interoperable data infrastructure has become paramount. Cloud-native data formats, specifically Zarr, are emerging as critical enablers for the creation of distributed, federated repositories that adhere to the FAIR data principles. This proposal presents the outcomes of the OME2024 NGFF Challenge, an international community effort that demonstrated the viability of constructing such infrastructure for bioimaging data using OME-Zarr. The Open Microscopy Environment (OME) is an open-source, community-driven initiative that develops interoperable data formats, tools, and standards for biological imaging. As part of its commitment to open and FAIR research data, NFDI4BIOIMAGE actively contributes to OME, particularly to the specification of OME-Zarr for cloud-native image storage. The Challenge was launched at the 2024 OME Annual Meeting in Dundee, Scotland, and was designed to advance the maturity of the OME-Zarr format, particularly in conjunction with Zarr v3, the new major version of the Zarr specification, which improves scalability through the use of sharding. Coordinated by NFDI4BIOIMAGE, international participants contributed converted datasets hosted on their own infrastructure. Submissions were indexed using a lightweight CSV-based mechanism, with each row corresponding to one Zarr-formatted dataset hosted at a participating institution. Participants agreed to complete the Challenge in time for the next major gathering of the bioimaging community, the 2024 Global BioImaging Meeting in Okazaki, Japan. During the four months of the Challenge, the community accumulated over 500 TB of OME-Zarr data spanning multiple imaging modalities, all publicly accessible via HTTP. Importantly, these data were not centrally stored or managed; rather, each participating institution hosted its own data, forming a nascent federated repository.
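
The CSV-based indexing described above can be illustrated with a minimal sketch. The column names (`source`, `url`, `bytes`), the URLs, and the sizes below are hypothetical and chosen only for illustration; the actual schema used in the Challenge is not given in this abstract.

```python
import csv
import io
from collections import defaultdict

# Hypothetical index contents: column names, URLs, and sizes are
# illustrative assumptions, not the Challenge's actual schema or data.
SAMPLE_INDEX = """source,url,bytes
site_a,https://example.org/a/image1.zarr,1500000000000
site_a,https://example.org/a/image2.zarr,500000000000
site_b,https://example.com/b/volume.zarr,3000000000000
"""


def summarize_index(csv_text):
    """Return {source: (dataset count, total bytes)} for a CSV index."""
    totals = defaultdict(lambda: [0, 0])
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row["source"]]
        entry[0] += 1                  # one row corresponds to one dataset
        entry[1] += int(row["bytes"])  # accumulate the reported size
    return {source: tuple(values) for source, values in totals.items()}


def total_terabytes(summary):
    """Sum the reported sizes across all sources, in decimal terabytes."""
    return sum(size for _, size in summary.values()) / 1e12


summary = summarize_index(SAMPLE_INDEX)
print(summary)
print(total_terabytes(summary), "TB")
```

Because each row only points at a dataset that stays on its host's own storage, an aggregator of this kind needs nothing more than read access to the index files, which is one plausible reason such a mechanism can remain lightweight.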
A centralized viewer was developed to aggregate and present the metadata from all submissions, providing search, filtering, and thumbnail browsing functionality, alongside integration with the OME-NGFF Validator for metadata validation and data preview. The success of the 2024 Challenge provides a compelling proof of concept for federated research data infrastructures underpinned by cloud-native data formats. With minimal centralized coordination and modest investments in tooling, the community effectively prototyped what is, to date, the largest known open, federated bioimage data system. This effort has demonstrated that the key technical and social building blocks for such infrastructures already exist and are operational. This presentation will reflect on the architectural and organizational lessons of the Challenge, particularly how time-boxed, cross-cutting activities can motivate development. We will explore how similar initiatives might be expanded and institutionalized through public investment. In particular, we will argue that future research data infrastructure strategies will increasingly depend on open, cloud-native formats to support distributed data sharing at scale. By adopting formats such as Zarr, which support efficient storage, access, and metadata representation of N-dimensional tensors in cloud environments, it is possible to quickly construct interoperable repositories that are both scalable and sustainable. As we roll out these formats across the community and take the next steps toward a truly scalable, federated bioimaging infrastructure, we invite others across disciplines to join this growing effort to help shape the future of open, interoperable research data.
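
To make the scalability benefit of sharding concrete, the following sketch counts how many blocks are needed to tile an N-dimensional array. It assumes the common layout in which, without sharding, every chunk is stored as a separate object, while with sharding many chunks are packed into one shard object. The array, chunk, and shard shapes are illustrative assumptions, not values taken from the Challenge.

```python
import math


def count_blocks(shape, block_shape):
    """Number of blocks needed to tile an array, rounding up along each axis."""
    return math.prod(math.ceil(s / b) for s, b in zip(shape, block_shape))


# Illustrative 5-D image (t, c, z, y, x); all shapes are assumptions.
shape = (1, 3, 512, 16384, 16384)
chunk_shape = (1, 1, 64, 256, 256)      # unit of reading and compression
shard_shape = (1, 1, 512, 4096, 4096)   # unit of storage when sharding is used

n_chunks = count_blocks(shape, chunk_shape)  # objects stored without sharding
n_shards = count_blocks(shape, shard_shape)  # objects stored with sharding

print(f"{n_chunks} chunks, {n_shards} shards, "
      f"{n_chunks // n_shards} chunks per shard")
```

Under these assumed shapes, the same array is stored as far fewer objects when sharded, while the chunk remains the unit a reader fetches. Fewer objects generally eases listing, copying, and hosting on file systems and object stores.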

Files

CoRDI_2025_paper_133.pdf (186.7 kB)
md5:66e98a8a43cfdeb72b4a10eb67342709