High-Performance Access to Archival Data Stored in HDF4 and HDF5 on Cloud Object Stores Without Reformatting the Files
Authors/Creators
Description
Cloud computing offers several advantages to users of large Earth science data collections: direct online access to data files and granules from any location, scalable access that supports parallel computing workflows, and flexible computing tools that enable experimentation with processing techniques. However, older archival file formats, designed for different computing systems, hinder efficient access to decade-long time series data when the files reside on Web Object Stores (WOS), such as Amazon Web Services’ Simple Storage Service (S3), compared with data stored in modern cloud-optimized formats.

We describe DMR++ (Dataset Metadata Response plus plus), a technology that provides efficient access to HDF5 (Hierarchical Data Format, version 5) and HDF4 files stored on WOS systems without reformatting the data. DMR++ achieves performance comparable to technologies like Zarr while preserving the original file structure, a substantial benefit given the vast number of archival files held by organizations such as NASA. Moreover, DMR++ typically outperforms cloud-optimized versions of HDF5. A DMR++ is essentially an XML (Extensible Markup Language) document, usually stored alongside the data it describes. It can be generated on the fly but is generally created when the data are staged to the WOS.

Archival files that use HDF4/5 often store large arrays of numerical data. The data in these files are often compressed, typically reducing their size by a factor of four or more. To allow efficient access to portions of those arrays, they are 'chunked' into smaller sub-arrays, each compressed individually. The chunk size is a compromise: spinning disks can efficiently access data in smaller chunks, while S3 favors larger chunks. A simple optimization improves performance: aggregate smaller chunks that are stored adjacently, transfer them in a single access, and then decompress them individually.
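The chunk-aggregation idea can be illustrated with a minimal Python sketch. It is not the DMR++ implementation: the `Chunk` record, the `coalesce` and `fetch_chunks` helpers, and the use of zlib compression are assumptions made for illustration, and `read_range` is a hypothetical stand-in for one ranged read against an object store.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record: where one compressed chunk lives in the file."""
    offset: int  # byte offset of the compressed chunk
    nbytes: int  # compressed length in bytes


def coalesce(chunks):
    """Group chunks whose byte ranges are exactly contiguous in the file."""
    groups = []
    for chunk in sorted(chunks, key=lambda c: c.offset):
        last = groups[-1][-1] if groups else None
        if last is not None and last.offset + last.nbytes == chunk.offset:
            groups[-1].append(chunk)   # adjacent: extend the current group
        else:
            groups.append([chunk])     # gap: start a new group
    return groups


def read_range(blob, start, length):
    """Stand-in for a single ranged read (one request to the object store)."""
    return blob[start:start + length]


def fetch_chunks(blob, chunks):
    """Read each contiguous group in one access, then decompress per chunk.

    Returns the decompressed bytes keyed by chunk and the number of reads issued.
    """
    data = {}
    requests = 0
    for group in coalesce(chunks):
        start = group[0].offset
        length = sum(c.nbytes for c in group)
        buffer = read_range(blob, start, length)
        requests += 1
        for c in group:
            rel = c.offset - start
            # Each chunk was compressed on its own, so it is decompressed on its own.
            data[c] = zlib.decompress(buffer[rel:rel + c.nbytes])
    return data, requests
```

With three adjacent chunks followed by a gap and a fourth chunk, this sketch issues two reads instead of four. A real client might also merge ranges separated by small gaps, trading a few wasted bytes for fewer requests.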
NASA data pose an additional challenge: special Application Programming Interface (API) libraries are often needed to compute some variables, and these libraries are incompatible with WOS environments. Our solution stores the computed values in the DMR++ document or a companion file, which makes them accessible like any other variable and eliminates the need for specialized APIs. We outline specific optimizations for both satellite grid and swath data stored in HDF4-EOS2 (Earth Observing System).
[Poster IN51F-243]
Files

| Name | Size | MD5 |
|---|---|---|
| Gallagher et al 2024_DMRpp-HDF-EOS2-Poster-AGU-2024-v2.1_KY_JL_JR_JG_Final (1).pdf | 1.7 MB | e437b6b0fabd9aa29a6a0973d67c5248 |