Published July 20, 2025 | Version 1.0
Report | Open Access

2025 U.S. NSF CI Compass Virtual Workshop Report - Data Management: From Instrument to First Storage

  • 1. University of North Carolina at Chapel Hill
  • 2. Renaissance Computing Institute
  • 3. NSF ZEUS Laser Facility
  • 4. University of Michigan–Ann Arbor
  • 5. Ocean Networks Canada
  • 6. NSF Network for Advanced NMR
  • 7. UConn Health
  • 8. NSF MagLab
  • 9. Florida State University
  • 10. EarthScope Consortium
  • 11. University of Notre Dame
  • 12. NSF Natural Hazards Engineering Research Infrastructure (NHERI)
  • 13. Oregon State University
  • 14. University of Southern California
  • 15. Woods Hole Oceanographic Institution
  • 16. Ocean Observatories Initiative
  • 17. Metadata Game Changers (United States)
  • 18. Texas Tech University
  • 19. National Ecological Observatory Network
  • 20. Indiana University Indianapolis
  • 21. University of Wisconsin–Madison
  • 22. NSF IceCube Neutrino Observatory
  • 23. Vera C. Rubin Observatory
  • 24. LIGO Scientific Collaboration
  • 25. National Solar Observatory
  • 26. Argonne National Laboratory

Description

The 2025 Virtual Workshop on "Data Management: From Instrument to First Storage", organized by the U.S. National Science Foundation (NSF) CI Compass [1], the NSF Cyberinfrastructure Center of Excellence, brought together cyberinfrastructure (CI) professionals from the NSF Major and Mid-scale research facilities along with participants from the broader CI ecosystem to discuss issues of critical importance to the success of these facilities. The workshop focused on the crucial initial step of the data lifecycle for the NSF Major and Mid-scale Facilities: data acquisition and capture from scientific instruments. Speakers from a diverse range of facilities presented their unique challenges, best practices, and innovative solutions, which were then analyzed to identify common trends and future directions. This executive summary encapsulates the key insights and findings from the invited talks and subsequent discussions at the virtual workshop.

The presentations at the workshop emphasized the immense volume and velocity of the data coming off the different kinds of instruments, and described the challenges and evolving best practices of managing petabyte-scale scientific data. Most of the scientific facilities represented at the workshop, e.g., the NSF/DOE Vera C. Rubin Observatory [2], NSF NEON [3], NSF NSO [4], NSF OOI [5], NSF LIGO [6], and NSF EarthScope [7], are generating and managing petabytes of data, with daily ingest rates often reaching terabytes and billions of data points. User Facilities such as NSF MagLab [8], NSF NAN [9], and NSF ZEUS [10] must not only collect and curate data from their own instruments, but also help their users manage their specific experimental data. This immense scale and complexity demands robust, scalable infrastructure and sophisticated data management strategies.

Initial data acquisition and storage is often the first critical stage, with data collection typically beginning with direct instrument readout and on-site buffering. Facilities use both real-time streaming (for time-critical data, e.g., OOI, LIGO) and batch transfers (for larger, less urgent datasets). In some remote or high-volume cases, physical shipment of storage media is still used. Data formats are highly diverse, ranging from specialized scientific formats (e.g., FITS [11]) to common standards (e.g., NetCDF [12], CSV, MP4). The high volume and velocity of data acquisition often require automation and rapid processing. Automation is essential for data movement/streaming, pipeline orchestration, and ingestion into appropriate data stores. Real-time processing and low-latency alerts are critical for many facilities (e.g., the Rubin Observatory’s sub-2-minute alerts, LIGO’s transient event discovery). Automated quality control (QC), quality assurance (QA), and validation are often integrated early, close to the data capture time. Middleware (e.g., Kafka [13]) and orchestration (e.g., Kubernetes [14]) are widely used for managing streams, queues, and scalable services. Facilities balance on-premise and cloud storage, often using multi-tiered strategies to optimize cost, performance, and accessibility. There is a strong trend toward adoption of cloud technologies (e.g., EarthScope on Amazon AWS [15], NEON on Google GCP [16]) for scalability and managed services. A variety of database technologies are used: PostgreSQL [17] for metadata, and Cassandra [18] and MongoDB [19] for raw data and inventory.
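To make the streaming pattern above concrete, the sketch below publishes simulated instrument readouts to a message broker and then consumes them into a local "first storage" file. It is a minimal, hypothetical example using the kafka-python client; the broker address, topic name, and payload fields are assumptions for illustration, not details of any facility's actual pipeline.

```python
# Minimal sketch: publish simulated instrument readouts to a Kafka topic,
# then consume them and append to a local "first storage" file.
# Broker, topic, and payload fields are hypothetical.
import json
import time
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

BROKER = "localhost:9092"          # assumed broker address
TOPIC = "instrument.readout.raw"   # hypothetical topic name

def publish_readouts(n=10):
    """Send n simulated readouts (timestamp + value) as JSON messages."""
    producer = KafkaProducer(
        bootstrap_servers=BROKER,
        value_serializer=lambda m: json.dumps(m).encode("utf-8"),
    )
    for i in range(n):
        msg = {"instrument_id": "sensor-01", "seq": i,
               "timestamp": time.time(), "value": 42.0 + i}
        producer.send(TOPIC, value=msg)
    producer.flush()  # ensure all buffered messages are delivered

def consume_to_first_storage(path="first_storage.jsonl", max_messages=10):
    """Read messages from the topic and append them to a JSON-lines file."""
    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKER,
        auto_offset_reset="earliest",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
        consumer_timeout_ms=5000,   # stop iterating if no new messages arrive
    )
    with open(path, "a") as out:
        for count, record in enumerate(consumer, start=1):
            out.write(json.dumps(record.value) + "\n")
            if count >= max_messages:
                break

if __name__ == "__main__":
    publish_readouts()
    consume_to_first_storage()
```

In a production pipeline, the same consume step would typically feed automated QC/validation and database ingestion rather than a flat file.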

A majority of the presentations emphasized the importance of metadata annotation, data curation, and effective data dissemination, while preserving data security. Rich, high-quality metadata is essential for findability, usability, and adherence to the FAIR (Findable, Accessible, Interoperable, Reusable) [20] principles. Comprehensive QA/QC processes span the data lifecycle. Cybersecurity is a major concern, addressed via zero-trust architectures, VPN/SSO, and infrastructure-as-code. It was also noted that system-wide observability and monitoring (e.g., Grafana [21]) are critical for operational continuity. Facilities provide data through web portals, JupyterLab environments, and robust APIs (REST, GraphQL [22]). Persistent identifiers (DOIs) ensure data can be reliably found and cited. The facilities strongly promote open data policies, making certain datasets and alert streams immediately available to the public.
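As a small illustration of how persistent identifiers and REST-style access fit together, the sketch below resolves a dataset DOI to machine-readable metadata via standard DOI content negotiation and prints a few FAIR-relevant fields. The DOI is supplied by the user; facility portals would normally expose richer, domain-specific metadata and search endpoints on top of this kind of identifier resolution.

```python
# Minimal sketch: resolve a dataset DOI to machine-readable (CSL JSON)
# metadata via DOI content negotiation, then print a few key fields.
import sys
import requests

def fetch_doi_metadata(doi: str) -> dict:
    """Ask the DOI resolver for citation metadata in CSL JSON form."""
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python doi_metadata.py <dataset-DOI>")
    meta = fetch_doi_metadata(sys.argv[1])
    print("Title: ", meta.get("title"))
    print("Issued:", meta.get("issued"))
    print("Type:  ", meta.get("type"))
    print("URL:   ", meta.get("URL"))
```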

The workshop talks and discussions illuminated some key current challenges, which include managing ever-increasing data volumes and complexity, meeting low-latency demands, generating and curating metadata, enabling increased FAIR-ness of the data, addressing technical debt, and coping with staffing and funding limitations. Ensuring cybersecurity and data integrity as access expands must be an ongoing effort. Looking ahead, the community is embracing cloud-native solutions, expanding automation and machine learning for data processing, developing advanced data portals and APIs, and improving data discoverability across federated systems. The overarching goal for the facilities is to build a connected, FAIR-aligned information ecosystem where scientific data is not only stored, but is discoverable, accessible, interoperable, and reusable for the global research community.

Files

2025 CI Compass Virtual Workshop Report.pdf (8.5 MB)
md5:ab9e19c0eda573b0570a586df671c215

Additional details

Funding

U.S. National Science Foundation
CI CoE: CI Compass: An NSF Cyberinfrastructure (CI) Center of Excellence for Navigating the Major Facilities Data Lifecycle (Award 2127548)