Published December 1, 2023 | Version v1
Conference paper Open

STAC for CEDA - Developing a scalable, standards-based search system

  • 1. Science and Technology Facilities Council

Description

The Centre for Environmental Data Analysis (CEDA) stores over 20 Petabytes of atmospheric and Earth observation data. Sources for the CEDA Archive include aircraft campaigns, satellites, automatic weather stations and climate models, amongst many others. Most of the data is held in well-described formats such as netCDF, but we also hold historical data whose format cannot easily be determined from the file name and extension.

CEDA aims to implement a SpatioTemporal Asset Catalog (STAC) to enhance our user interfaces and search services and to facilitate interoperability with user tools and our partners. We are working to create a full-stack software implementation including an indexing framework, API server, web and programmatic clients, and vocabulary management. All components are open-source so that they can be adopted by, and co-developed with, other organisations working in the same space.

We have built the "stac-generator", a tool for creating a STAC catalog. It uses a plugin architecture to make it configurable: a range of input, output, and extraction methods can be selected, which allows metadata to be extracted from the diverse data in the archive and allows the tool to be used by other organisations. Elasticsearch was chosen to host the indexed data because it is performant, highly scalable and supports semi-structured data, in our case the faceted search values associated with different data collections. Because STAC's existing API was backed by an SQL database, this called for the development of a new Elasticsearch-backed STAC API. We have also developed several extensions to the STAC framework to meet requirements that were not met by the core and community functionality. These include an endpoint for interrogating the facet values, exposed as queryables, and a free-text search capability across all properties held in the index.
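As an illustration only, a plugin architecture of the kind described above could be sketched in Python as follows. This is not the stac-generator's actual API: the registry `EXTRACTORS`, the decorator `register`, the two example extraction methods and the example path are all hypothetical names chosen for this sketch.

```python
from pathlib import PurePosixPath
from typing import Callable, Dict, List

# Hypothetical registry: maps a plugin name to an extraction function.
EXTRACTORS: Dict[str, Callable[[str], dict]] = {}


def register(name: str):
    """Decorator that adds an extraction method to the registry."""
    def decorator(func: Callable[[str], dict]):
        EXTRACTORS[name] = func
        return func
    return decorator


@register("file_format")
def file_format(path: str) -> dict:
    # Guess the format from the extension; unknown extensions stay "unknown".
    known = {".nc": "netCDF"}
    suffix = PurePosixPath(path).suffix
    return {"format": known.get(suffix, "unknown")}


@register("filename_prefix")
def filename_prefix(path: str) -> dict:
    # Assumes (for this sketch) that the file name starts with "<prefix>_".
    stem = PurePosixPath(path).stem
    return {"prefix": stem.split("_")[0]}


def extract(path: str, plugin_names: List[str]) -> dict:
    """Run the selected extraction plugins in order and merge their output."""
    record = {"path": path}
    for name in plugin_names:
        record.update(EXTRACTORS[name](path))
    return record


record = extract("/archive/example/tas_2000.nc", ["file_format", "filename_prefix"])
```

The point of the design is that which extraction methods run is a configuration choice (the list of plugin names) rather than a code change, so a different archive layout only needs a different selection or an additional registered function.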

The development of our search system also includes a pilot for a future version of the Earth System Grid Federation (ESGF) search service, for which we have created an initial index of CMIP6 data to investigate performance and functionality.
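To make the free-text and facet search described above more concrete, the sketch below builds an Elasticsearch request body that combines a free-text query across all properties, exact-match facet filters, and terms aggregations that return the available facet values. It is a minimal sketch under stated assumptions, not the query our API issues: the function `build_search_body`, the `properties.*` field layout, the keyword mapping of facet fields, and the CMIP6-style facet names used in the example are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional


def build_search_body(
    text: Optional[str] = None,
    facets: Optional[Dict[str, str]] = None,
    facet_fields: Iterable[str] = (),
    size: int = 10,
) -> dict:
    """Build an Elasticsearch query DSL body (hypothetical index layout)."""
    must = []
    if text:
        # Free-text search across every field under "properties".
        must.append(
            {"simple_query_string": {"query": text, "fields": ["properties.*"]}}
        )
    # Exact-match filters; assumes facet fields are mapped as keywords.
    filters = [
        {"term": {f"properties.{name}": value}}
        for name, value in (facets or {}).items()
    ]
    body = {
        "size": size,
        "query": {
            "bool": {
                "must": must or [{"match_all": {}}],
                "filter": filters,
            }
        },
    }
    fields = list(facet_fields)
    if fields:
        # One terms aggregation per facet returns the values present in the hits.
        body["aggs"] = {
            name: {"terms": {"field": f"properties.{name}"}} for name in fields
        }
    return body


body = build_search_body(
    text="temperature",
    facets={"variable_id": "tas"},        # illustrative facet name and value
    facet_fields=["source_id", "experiment_id"],
)
```

The resulting dictionary could be passed as the request body of a search against the index; the aggregation results are the kind of facet values that a queryables endpoint could expose.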

Files

PV2023_paper_4283 (2).pdf
