Published October 20, 2025 | Version v1
Poster Open

CloudCatalog: an API plus Tools for Lazy Indexing of Millions of Cloud-Stored Data Files

  • 1. Johns Hopkins University Applied Physics Laboratory
  • 2. Johns Hopkins University Applied Physics Laboratory
  • 3. NASA

Description

Indexing millions of files so that they are easy to search, yet served without a server and without central control, is hard. CloudCatalog is a lightweight CSV- and JSON-based indexing schema that enables HAPI-like "data ID + time range" queries on massive cloud datasets; it includes a Python implementation of the API and support tools. Its key goals are that (1) data owners control their own indices, (2) indices are static files, avoiding server costs, (3) searching is efficient, and (4) indices are easy for the scientists and data owners to construct and maintain (the 'lazy' part). Beyond supporting the FAIR principles of findability, accessibility, interoperability, and reusability, CloudCatalog is serverless and decentralized, so contributors can publish and update their open science data without worrying about external gatekeeping or server maintenance.
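To make the "data ID + time range" idea concrete, here is a minimal sketch in plain Python of how a static JSON catalog plus a static CSV index could answer such a query with no server involved. It is not the CloudCatalog API: the JSON layout, the CSV column names (start, stop, key, filesize), the data ID, the bucket paths, and the function names are all assumptions made for illustration.

```python
import csv
import io
import json
from datetime import datetime

# Hypothetical static catalog (JSON) listing the data IDs an owner publishes.
CATALOG_JSON = '{"catalog": [{"id": "example_mission_l2", "title": "Example dataset"}]}'

# Hypothetical static index (CSV) for that data ID; column names are assumed.
INDEXES = {
    "example_mission_l2": (
        "start,stop,key,filesize\n"
        "2024-01-01T00:00:00Z,2024-01-02T00:00:00Z,s3://example-bucket/2024/file_20240101.cdf,1048576\n"
        "2024-01-02T00:00:00Z,2024-01-03T00:00:00Z,s3://example-bucket/2024/file_20240102.cdf,1048576\n"
        "2024-01-03T00:00:00Z,2024-01-04T00:00:00Z,s3://example-bucket/2024/file_20240103.cdf,2097152\n"
    )
}


def parse_time(text):
    """Parse an ISO 8601 UTC timestamp of the form YYYY-MM-DDThh:mm:ssZ."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


def find_files(data_id, start, stop):
    """Return index rows for data_id whose [start, stop) span overlaps the query."""
    known_ids = {entry["id"] for entry in json.loads(CATALOG_JSON)["catalog"]}
    if data_id not in known_ids:
        raise KeyError(f"unknown data ID: {data_id}")
    q0, q1 = parse_time(start), parse_time(stop)
    hits = []
    for row in csv.DictReader(io.StringIO(INDEXES[data_id])):
        # Two half-open intervals overlap when each starts before the other ends.
        if parse_time(row["start"]) < q1 and parse_time(row["stop"]) > q0:
            hits.append(row)
    return hits


matches = find_files("example_mission_l2", "2024-01-02T00:00:00Z", "2024-01-03T12:00:00Z")
match_keys = [row["key"] for row in matches]
```

Because both the catalog and the index are ordinary files, the same lookup works whether they are read from local disk or fetched from cloud object storage; only the file-reading step would change.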

We illustrate how to access the 1.5 petabytes of HelioCloud data stored in the AWS cloud (a curated set from SPDF, VSO, and individual contributors) using CloudCatalog, both from within the cloud and externally (egress). We also show how to use it in non-HelioCloud contexts for serving large collections of files, potentially by entities such as ESA DataLabs or Space Environment Canada as well as by individuals contributing science data via their own cloud storage. We will discuss performance issues and extensibility into richer search and query engines. We also solicit help on how best to enforce data ID uniqueness as CloudCatalog-indexed holdings grow in scope.
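As one possible starting point for the data ID uniqueness question, the sketch below detects data IDs that appear in more than one independently maintained catalog. It is an illustration only, not part of CloudCatalog: the catalog structure, the bucket names, and the function name are hypothetical, and detecting collisions after the fact is only one of several strategies (namespacing IDs by owner would be another).

```python
from collections import defaultdict


def find_duplicate_ids(catalogs):
    """Map each data ID claimed by more than one catalog to the catalogs claiming it.

    `catalogs` maps a catalog location (hypothetical here) to its list of
    entries, where each entry is a dict with at least an "id" field.
    """
    owners = defaultdict(list)
    for source, entries in catalogs.items():
        for entry in entries:
            owners[entry["id"]].append(source)
    return {data_id: srcs for data_id, srcs in owners.items() if len(srcs) > 1}


# Hypothetical catalogs published by two independent data owners.
catalogs = {
    "s3://bucket-a/": [{"id": "mission_x_l2"}, {"id": "mission_y_l1"}],
    "s3://bucket-b/": [{"id": "mission_x_l2"}],
}
duplicates = find_duplicate_ids(catalogs)
```

A check like this could be run by anyone aggregating several catalogs, which keeps it compatible with the decentralized model: no central registry has to approve an ID in advance.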

Files

DASH2025_CloudCatalog_Antunes_l.pdf (1.7 MB)
md5:c0979f2fdf36abe8a13e21c2ba5f08cb

Additional details

Dates

Accepted
2025-10-05
DASH poster

Software

Repository URL
https://github.com/heliocloud-data/cloudcatalog
Programming language
Python, JSON
Development Status
Active