Published October 8, 2024 | Version v2
Conference paper Open

Dataplug: Unlocking extreme data analytics with on-the-fly dynamic partitioning of unstructured data

  • 1. Universitat Rovira i Virgili
  • 2. Universidad Rovira i Virgili

Description

The elasticity of the Cloud is very appealing for processing large scientific data. However, enormous volumes of unstructured research data, totaling petabytes, remain untapped in data repositories due to the lack of efficient parallel data access. Partitioning these data into even-sized chunks to enable parallel processing requires a complete rewrite to storage, which becomes prohibitively expensive at high volumes. In this article we present Dataplug, an extensible framework that enables fine-grained parallel data access to unstructured scientific data in object storage. Dataplug employs read-only, format-aware indexing, which allows dynamically sized partitions to be defined using various partitioning strategies. This approach avoids writing the partitioned dataset back to storage, enabling distributed workers to fetch data partitions on the fly directly from large data blobs, efficiently leveraging the high bandwidth capability of object storage. Validations on genomic (FASTQGZip) and geospatial (LiDAR) data formats demonstrate that Dataplug considerably lowers pre-processing compute costs (between 65.5% and 71.31% less) without imposing significant overheads.
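
The idea of read-only indexing followed by on-the-fly partitioning can be illustrated with a minimal Python sketch. This is not Dataplug's API: the function names (build_line_index, plan_partitions, fetch_partition) are hypothetical, the records are assumed to be newline-delimited, and an in-memory bytes object stands in for a blob in object storage. Real formats such as compressed FASTQ or LiDAR would need format-specific indexing, which this sketch does not attempt.

```python
from bisect import bisect_left
from typing import List, Tuple


def build_line_index(blob: bytes) -> List[int]:
    """Read-only pass: record the byte offset at which each record (line) starts."""
    offsets = [0]
    pos = blob.find(b"\n")
    while pos != -1 and pos + 1 < len(blob):
        offsets.append(pos + 1)
        pos = blob.find(b"\n", pos + 1)
    return offsets


def plan_partitions(
    offsets: List[int], total_size: int, num_partitions: int
) -> List[Tuple[int, int]]:
    """Choose up to num_partitions byte ranges whose edges fall on record starts.

    The partition count is picked at call time, so the same index can serve
    different degrees of parallelism without touching the stored data.
    """
    target = total_size / num_partitions
    cuts = [0]
    for i in range(1, num_partitions):
        # First record boundary at or after the ideal even-sized cut point.
        j = bisect_left(offsets, i * target)
        if j < len(offsets) and offsets[j] > cuts[-1]:
            cuts.append(offsets[j])
    cuts.append(total_size)
    return list(zip(cuts, cuts[1:]))


def fetch_partition(blob: bytes, byte_range: Tuple[int, int]) -> bytes:
    """Stand-in for a ranged read; a worker would issue a byte-range GET instead."""
    start, end = byte_range
    return blob[start:end]


if __name__ == "__main__":
    data = b"aa\nbbbb\nc\ndddddd\nee\n"
    index = build_line_index(data)
    for rng in plan_partitions(index, len(data), 3):
        print(rng, fetch_partition(data, rng))
```

In this sketch the index is built once and the data are never rewritten; each worker would receive only a (start, end) pair and read its slice directly. Because the cuts snap to record boundaries, the partitions are approximately rather than exactly even in size.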

Files

article.pdf (738.1 kB)
md5:d460db3f29836d4f11a85bc7de858faa

Additional details

Funding

European Commission
NEARDATA – Extreme Near-Data Processing Platform 101092644
European Commission
CLOUDSTARS – Cloud Open Source Research Mobility Network 101086248
European Commission
EXTRACT – A distributed data-mining software platform for extreme data across the compute continuum 101093110
Ministerio de Asuntos Económicos y Transformación Digital
Cloudless Unico I+D Cloud 2022
Universidad Rovira i Virgili
Martí Franquès 2021 2021PMF-PIPF-17