Towards a sustainable astronomical data infrastructure: Optimising linking data from the Rucio datalake to the users areas within the SKA Regional Centres Network
Authors/Creators
- 1. Extragalactic Astronomy, Instituto de Astrofisica de Andalucia, Granada, Andalusia, 18008, Spain
- 2. Battcock Centre for Experimental Astrophysics, University of Cambridge Department of Applied Mathematics and Theoretical Physics, Cambridge, England, CB3 0HE, UK
- 3. Institute for Data Science, Hochschule fur Technik FHNW, Windisch, Aargau, 5210, Switzerland
- 4. SKA Organisation, Macclesfield, England, SK11 9FT, UK
Description
The distributed architecture of the SKA Regional Centre Network (SRCNet) aims to provide scientific communities worldwide with efficient computational and storage resources to exploit the massive data volumes produced by the SKA Observatory (SKAO). Given the amount of SKAO data, traditional data management paradigms, in which data are transferred to computational resources, are no longer feasible. Instead, computational workflows must increasingly be relocated closer to data storage locations, emphasising efficient data access strategies and avoiding unnecessary duplication or redundancy. In this context, we present PrepareData, a modular and extensible data delivery service developed within SRCNet prototyping activities. The service addresses the critical challenge of redundant data transfers and duplication at both node and user levels by enabling seamless delivery of requested datasets from local Rucio Storage Elements (RSEs) directly into users' working environments. PrepareData operates as a local service within each SRCNet node and is integrated into a broader ecosystem of federated services. Specifically, we designed and evaluated two distinct yet complementary implementations that avoid unnecessary data duplication and provide a dynamic data bridge between the RSEs and the user storage areas: (1) a filesystem-based solution leveraging CephFS, which uses shared filesystem mount points and bind mounts to ensure consistent and immediate availability of the data across computational nodes, and (2) a Kubernetes model using Persistent Volumes and Persistent Volume Claims, which dynamically injects data into a user's area. We detail the architectural design and development, the technical implementation, and the integration of both solutions with science-enabling tools such as JupyterHub, CARTA or virtually any other application, and we provide a performance evaluation.
This contribution provides a scalable and sustainable blueprint for data delivery in federated scientific infrastructures, supporting the broader goals of green computing and efficient resource utilisation.
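As an illustration of the Kubernetes model described above, the following minimal sketch builds a Persistent Volume and a matching Persistent Volume Claim as plain Python dictionaries. It is not the PrepareData implementation: the function name, the naming scheme, the `jupyterhub` namespace and the use of a `hostPath` volume (which assumes the RSE filesystem is already mounted on the node) are all assumptions made for the example. Only standard Kubernetes manifest fields are used.

```python
import json


def build_pv_and_pvc(user, dataset_path, size="1Gi", namespace="jupyterhub"):
    """Return (pv, pvc) manifests exposing one dataset directory read-only.

    Hypothetical helper: the naming scheme, namespace and volume source
    are illustrative assumptions, not taken from the PrepareData service.
    """
    name = f"preparedata-{user}"
    pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            # Read-only for every pod: the RSE copy stays the single source.
            "accessModes": ["ReadOnlyMany"],
            # Retain: deleting the claim must never delete the dataset.
            "persistentVolumeReclaimPolicy": "Retain",
            # Empty class disables dynamic provisioning for this volume.
            "storageClassName": "",
            # Assumption: the RSE filesystem is mounted on the node at this path.
            "hostPath": {"path": dataset_path, "type": "Directory"},
            # Reserve the volume for the claim defined below.
            "claimRef": {"namespace": namespace, "name": name},
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadOnlyMany"],
            "storageClassName": "",
            # Bind explicitly to the volume created above.
            "volumeName": name,
            "resources": {"requests": {"storage": size}},
        },
    }
    return pv, pvc


if __name__ == "__main__":
    pv, pvc = build_pv_and_pvc("alice", "/mnt/rse/example-dataset")
    print(json.dumps(pv, indent=2))
    print(json.dumps(pvc, indent=2))
```

The two manifests reference each other (`claimRef` and `volumeName`), so the claim binds to exactly this volume; a user session would then mount the claim without any data being copied.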
Modern scientific projects such as the Square Kilometre Array Observatory (SKAO) produce extremely large amounts of data, far more than traditional research systems have handled before. Scientists around the world need efficient ways to access and work with these vast datasets without unnecessary delays or wasteful copying of information. In the SKA Regional Centre Network (SRCNet), data are stored in distributed storage systems. However, these storage systems are not directly visible to the software tools scientists use for analysis, such as interactive notebooks, visualisation platforms, or specialised processing services. To make the data usable, they must be moved or linked into the scientists' working environments. Our research focuses on a service called PrepareData, designed to manage this delivery process in a way that is both fast and resource-efficient. We describe and test different technical methods for exposing data to users, including techniques that avoid repeatedly copying large files. We ran experiments to measure how fast and scalable each method is under realistic conditions. The results show that by linking data instead of copying them, PrepareData can reduce delays and lower the burden on storage systems. This leads to better performance for scientists and less wasted computing and storage resources. These improvements are especially important for major international science projects, where efficient data delivery can accelerate research and reduce environmental impact.
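The difference between copying and linking can be sketched in a few lines of Python. This is only an analogy: the implementations described in this record rely on bind mounts and Kubernetes volumes, whereas the example uses symbolic links because they need no special privileges. The function names and the 1 MiB test file are hypothetical.

```python
import os
import shutil
import tempfile


def deliver_by_copy(src, user_dir):
    """Duplicate the file into the user area (costs space and transfer time)."""
    dst = os.path.join(user_dir, os.path.basename(src))
    shutil.copy2(src, dst)
    return dst


def deliver_by_link(src, user_dir):
    """Expose the file in the user area through a symbolic link (no duplication)."""
    dst = os.path.join(user_dir, os.path.basename(src))
    os.symlink(src, dst)
    return dst


def demo(size_bytes=1024 * 1024):
    """Deliver one file both ways and report the bytes each entry occupies."""
    with tempfile.TemporaryDirectory() as root:
        storage = os.path.join(root, "storage")    # stands in for the RSE
        copy_area = os.path.join(root, "user_copy")
        link_area = os.path.join(root, "user_link")
        for path in (storage, copy_area, link_area):
            os.makedirs(path)

        src = os.path.join(storage, "cube.fits")
        with open(src, "wb") as handle:
            handle.write(b"\0" * size_bytes)

        copied = deliver_by_copy(src, copy_area)
        linked = deliver_by_link(src, link_area)

        # lstat() describes the directory entry itself, not the link target.
        copied_bytes = os.lstat(copied).st_size
        linked_bytes = os.lstat(linked).st_size
        # Reading through the link still yields the full dataset.
        visible_bytes = os.path.getsize(linked)
        return copied_bytes, linked_bytes, visible_bytes


if __name__ == "__main__":
    copied_bytes, linked_bytes, visible_bytes = demo()
    print(f"copy occupies {copied_bytes} bytes in the user area")
    print(f"link occupies {linked_bytes} bytes, exposes {visible_bytes} bytes")
```

The copy occupies as much space as the original file, while the link occupies only a few bytes yet exposes the same content, which is the effect the linking approach aims for at scale.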
Files
openreseurope-6-25562.pdf (1.1 MB)
md5:7aee2e42d8cab80a809f9eeef1cbcfd6
Additional details
References
- (2013). SKAO foundational document.
- Scaife AMM (2020). Big telescope, big data: Towards exascale with the Square Kilometre Array. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. doi:10.1098/rsta.2019.0060
- Barisits M, Beermann T, Berghaus F (2019). Rucio: Scientific data management. Computing and Software for Big Science. doi:10.1007/s41781-019-0026-3
- Fuhrmann P, Gülzow V (2006). European Conference on Parallel Processing. doi:10.1007/11823285_116
- Peters AJ, Sindrilaru EA, Adde G (2015). EOS as the present and future solution for data storage at CERN. J Phys Conf Ser. doi:10.1088/1742-6596/664/4/042042
- Bonaldi A, Brüggen M, Burkutean S (2020). Square Kilometre Array science data challenge 1: Analysis and results. Mon Not R Astron Soc. doi:10.1093/mnras/staa3023
- (2025). SRCNet v0.1 implementation plan (Technical Report SRC-0000009-01).
- (2025). SRCNet use cases (Technical Report SRC-0000004-01).
- Gaudet S, Hill N, Armstrong P (2010). Software and Cyberinfrastructure for Astronomy (SPIE).
- Hartley P, Bonaldi A, Braun R (2023). SKA science data challenge 2: Analysis and results. Monthly Notices of the Royal Astronomical Society. doi:10.1093/mnras/stad1375
- van Haarlem MP, Wise MW, Gunst AW (2013). LOFAR: The low-frequency array. Astronomy & Astrophysics. doi:10.1051/0004-6361/201220873
- Villard E (2025). Advanced data products for radio observatories. Astron Comput. doi:10.1016/j.ascom.2025.101003
- Elmsheuser J, Di Girolamo A (2019). Overview of the ATLAS distributed computing system. EPJ Web Conf. doi:10.1051/epjconf/201921403010
- Mkrtchyan T, Chitrapu K, Garonne V (2021). dCache: Inter-disciplinary storage system. EPJ Web of Conferences. doi:10.1051/epjconf/202125102010
- Wozniak JM, Sharma H, Armstrong TG, Wilde M, Almer JD, Foster I (2014). 2014 IEEE/ACM International Symposium on Big Data Computing. doi:10.1109/BDC.2014.18
- Abbasi H, Wolf M, Eisenhauer G (2010). DataStager: Scalable data staging services for petascale applications. Cluster Computing. doi:10.1007/s10586-010-0135-6
- Kluyver T, Ragan-Kelley B, Pérez F (2016). Jupyter Notebooks—a publishing format for reproducible computational workflows. doi:10.3233/978-1-61499-649-1-87
- Zhou Q (2020). 2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID). doi:10.1109/CCGrid49817.2020.00-44
- Khan A, Lee C-G, Hamandawana P, Park S, Kim Y (2018). 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). doi:10.1109/MASCOTS.2018.00016
- Li H, Dong M, Liao X (2015). Deduplication-based energy efficient storage system in cloud environment. The Computer Journal. doi:10.1093/comjnl/bxu122
- Lin B, Zhu F, Zhang J (2019). A time-driven data placement strategy for a scientific workflow combining edge computing and cloud computing. IEEE Transactions on Industrial Informatics. doi:10.1109/TII.2019.2905659
- Schneider J, Seidel S, Basalla M (2023). Reuse, reduce, support: Design principles for green data mining. Business & Information Systems Engineering. doi:10.1007/s12599-022-00780-w
- Garrido J, Darriba L, Sánchez-Expósito S (2021). Toward a Spanish SKA Regional Centre fully engaged with open science. J Astron Telesc Instrum Syst. doi:10.1117/1.JATIS.8.1.011004
- Bonnarel F, Salgado J, Allen M, Barnsley R, Baumann M, Boch T, Jacques A, Seaman R, Gandilo N (2025). Astronomical Data Analysis Software and Systems XXXIII. doi:10.26624/XPTJ8476
- Comrie A, Wang K-S, Hwang Y-H (2024). CARTA: The Cube Analysis and Rendering Tool for Astronomy (4.1.0) [Software]. Zenodo. doi:10.5281/zenodo.15172686
- Ponelat JS, Rosenstock LL (2022). Designing APIs with swagger and OpenAPI (424).
- Voron F (2023). Building data science applications with FastAPI.
- Parra-Royón M (2025). manuparra/preparedata-tests: SRCNet preparedata tests v1.0.2 [Software]. Zenodo. doi:10.5281/zenodo.17909872