CloudCatalog: an API plus Tools for Lazy Indexing of Millions of Cloud-Stored Data Files
Contributors
Description
Indexing millions of files for easy, searchable yet serverless and decentralized access is hard. CloudCatalog is a lightweight CSV- and JSON-based indexing schema that enables HAPI-like "data ID + time range" queries on massive cloud datasets, and it includes a Python implementation of the API and support tools. The key goals are that (1) data owners control their own indices, (2) indices are static files, which avoids server costs, (3) searching is efficient, and (4) indices are easy for scientists and data owners to construct and maintain (the 'lazy' part). In addition to following the FAIR principles of findability, accessibility, interoperability, and reusability, CloudCatalog is serverless and decentralized, so contributors can publish and update their open science data without worrying about external gatekeeping or server maintenance.
We illustrate how to access the 1.5 petabytes of HelioCloud AWS-cloud-stored data (a curated set from SPDF, VSO, and individual contributors) using CloudCatalog, both within-cloud and externally (egress). We also show how to use it in non-HelioCloud contexts for serving large collections of files, potentially by entities such as ESA DataLabs or Space Environment Canada as well as by individuals contributing science data via their own cloud storage. We will discuss performance issues and extensibility into richer search and query engines. We also solicit help on how best to enforce data ID uniqueness as CloudCatalog-indexed holdings grow in scope.
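As a rough illustration of the "data ID + time range" idea, the sketch below filters a static CSV index for files that overlap a requested time window. It is a minimal sketch and not the CloudCatalog API: the column names (`start`, `stop`, `datakey`, `filesize`), the bucket path, and the helper names `parse_time` and `query_index` are all assumptions made for this example. The data ID is assumed to have already selected which index file to read.

```python
import csv
import io
from datetime import datetime

# Hypothetical per-dataset index: one row per file, sorted by start time.
# Column names and S3 keys are illustrative assumptions, not the published schema.
SAMPLE_INDEX = """start,stop,datakey,filesize
2020-01-01T00:00:00Z,2020-01-02T00:00:00Z,s3://example-bucket/mission/2020/file_20200101.cdf,1048576
2020-01-02T00:00:00Z,2020-01-03T00:00:00Z,s3://example-bucket/mission/2020/file_20200102.cdf,1048577
2020-01-03T00:00:00Z,2020-01-04T00:00:00Z,s3://example-bucket/mission/2020/file_20200103.cdf,1048578
"""


def parse_time(text):
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def query_index(index_csv, start, stop):
    """Return index rows whose time interval overlaps the requested window."""
    t0, t1 = parse_time(start), parse_time(stop)
    matches = []
    for row in csv.DictReader(io.StringIO(index_csv)):
        # Two intervals overlap when each one starts before the other ends.
        if parse_time(row["start"]) < t1 and parse_time(row["stop"]) > t0:
            matches.append(row)
    return matches


hits = query_index(SAMPLE_INDEX, "2020-01-02T06:00:00Z", "2020-01-03T06:00:00Z")
for row in hits:
    print(row["datakey"], row["filesize"])
```

Because the index is a plain static file, a query of this kind needs no server: a client fetches the CSV and filters it locally, which is what keeps the approach serverless.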
Files
DASH2025_CloudCatalog_Antunes_l.pdf
| Name | Size |
|---|---|
| DASH2025_CloudCatalog_Antunes_l.pdf (md5:c0979f2fdf36abe8a13e21c2ba5f08cb) | 1.7 MB |
Additional details
Dates
- Accepted: 2025-10-05 (DASH poster)
Software
- Repository URL: https://github.com/heliocloud-data/cloudcatalog
- Programming language: Python, JSON
- Development Status: Active