Published August 4, 2025 | Version v1
Conference paper
Open
Supporting the AI Research Lifecycle with Kubernetes- and HPC-Backed Infrastructure in NFDI4DataScience Landscape
Authors/Creators
- 1. Center for Interdisciplinary Digital Sciences (CIDS), Dresden University of Technology (TUD)
Contributors
Editor (2):
- 1. Nationale Forschungsdateninfrastruktur (NFDI) e.V.
- 2. University of Amsterdam
Description
The NFDI (Nationale Forschungsdateninfrastruktur) project emerged to manage, as a common good, the massive amounts of data produced daily across various disciplines. Without proper management layers, this data would not be findable, accessible, interoperable, and reusable (FAIR), leading to a waste of resources. The project consists of 26 consortia covering research data across diverse scientific disciplines, from cultural and social sciences to engineering, life sciences, and natural sciences. As part of this initiative, NFDI4DataScience's vision is to support the entire lifecycle of interdisciplinary research data in Data Science and Artificial Intelligence. It will promote FAIR and open research data infrastructures that support all involved digital artifacts, such as code, models, data, and publications, through an integrated approach. Given the recent paradigm shift toward computational methods and deep learning-based approaches, this consortium is of particular importance. Without a robust, orchestrated infrastructure, ensuring reproducibility and accessibility across multiple institutions becomes nearly impossible.
More than 10 institutions are actively contributing to the 6 task areas of the NFDI4DataScience consortium. As part of Task Area 3, TU Dresden provides critical computing infrastructure to support consortium partners. This infrastructure includes both Kubernetes and High-Performance Computing (HPC) clusters, enabling efficient computation for various research needs. Our Kubernetes infrastructure includes GPU-enabled nodes for training large AI models, automated certificate handling with cert-manager, monitoring with Grafana/Prometheus, Network File System (NFS) and S3 object storage as storage backends, and many other components. In our poster, we present an overview of what has been done in this regard and of our future plans. Access to the infrastructure is provided through different solutions: GitLab Runners, the RYAX platform, the GitOps solution ArgoCD, and the command line.
Through ArgoCD, we enable a GitOps-based deployment workflow that ensures reproducibility, traceability, and self-healing of deployed services by frequently synchronizing applications with the user's repository. For authentication and access control in ArgoCD, we integrated RegApp, a Community AAI solution provided by IAM4NFDI, a core service from Base4NFDI, enabling federated identity management for consortium partners. Our infrastructure integrates with continuous integration/continuous deployment (CI/CD) pipelines, using GitLab Runners to automate the build, testing, and deployment of services. Partners can also use our HPC infrastructure via GitLab Runner, which lets researchers deploy new versions of models and tools quickly and consistently, thus accelerating development cycles. Another way to access the computational infrastructure is RYAX, a DataOps platform that enables data teams to deploy, run, and scale their models on our Kubernetes infrastructure while offloading heavy computations to the HPC cluster. Our infrastructure has already been adopted by different use cases within the consortium and beyond.
Moving forward, we are building a solution for migrating workloads between our Kubernetes infrastructure and the HPC cluster, leveraging live migration techniques to improve interoperability and create a seamless environment for distributed computing. This improvement will enable workloads to transition smoothly between the Kubernetes and HPC environments, optimizing resource allocation and computational efficiency and allowing more flexible use of computational resources.
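As an illustration of the GitOps workflow described above, the following minimal sketch builds an Argo CD `Application` manifest with automated synchronization and self-healing enabled. It is not the configuration used in the described infrastructure: the application name, repository URL, path, and namespaces are hypothetical placeholders, and the helper `build_argocd_application` is introduced here only for illustration. The manifest fields themselves (`syncPolicy.automated.prune`, `syncPolicy.automated.selfHeal`) follow the public Argo CD `Application` resource.

```python
import json


def build_argocd_application(name, repo_url, path, namespace, revision="HEAD"):
    """Return an Argo CD Application manifest as a plain dict.

    All argument values are supplied by the caller; nothing here reflects
    the actual setup of the infrastructure described in the abstract.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumption: Argo CD is installed in the conventional "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            # The Git repository is the single source of truth for the deployment.
            "source": {
                "repoURL": repo_url,
                "targetRevision": revision,
                "path": path,
            },
            # Deploy into the same cluster that Argo CD itself runs in.
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            "syncPolicy": {
                "automated": {
                    "prune": True,     # remove resources deleted from the repository
                    "selfHeal": True,  # revert manual changes made in the cluster
                }
            },
        },
    }


# Hypothetical example values.
manifest = build_argocd_application(
    name="demo-service",
    repo_url="https://gitlab.example.org/partner/demo-service.git",
    path="deploy",
    namespace="demo",
)

# JSON is valid input for `kubectl apply -f`, so no YAML library is needed.
print(json.dumps(manifest, indent=2))
```

With `selfHeal` enabled, Argo CD reverts drift between the cluster state and the repository, and `prune` removes resources that no longer exist in the repository. Together these correspond to the self-healing and reproducibility properties mentioned in the description.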
Files
| Name | Size | Checksum |
|---|---|---|
| CoRDI_2025_paper_34.pdf | 61.0 kB | md5:2c1487fdb13717f0cd3a3f1c5003ff3a |