Published October 7, 2022 | Version 1
Presentation | Open Access

A Software Framework for the Orchestration of Materials Science Workflows at Large Scale

Affiliations

  • 1. The University of Tennessee
  • 2. The University of Utah
  • 3. Idaho National Laboratory
  • 4. MicroTesting Solutions LLC
  • 5. The Johns Hopkins University

Description

In the era of big data, materials science workflows must handle large-scale data distribution, storage, and computation, and any of these tasks can become a performance bottleneck. To mitigate these bottlenecks, we present a software framework for complex analyses of internal material structures (e.g., cracks). We study the benefits of our framework for a workflow that performs synchrotron X-ray computed tomography reconstruction and segmentation of silica. The workflow comprises five stages: downloading images from X-ray tomography scans (6.0 GB per image); performing image segmentation to remove noise; performing image reconstruction to obtain a clear view of the material structure; converting the data into OpenVisus formats; and serving the results through the OpenVisus streaming services for user analysis.

During the first deployment of our framework, we learned that intermediate and output data can be at least 30 times larger than the inputs, and that the computation during image reconstruction and segmentation can be heavily resource-demanding. We also observed a set of challenges in data storage and management, as well as in scaling up computational resources. Our framework aims to offer an out-of-the-box solution to these challenges through a multi-layer software structure comprising three layers. The low layer provides resource management and job scheduling on heterogeneous nodes (i.e., GPU and CPU nodes); at its core, Kubernetes provides resource management and Dask enables large-scale job scheduling across heterogeneous resources. The middle layer uses Ansible to manage the execution environment. The high layer serves as the Jupyter interface.

The contributions of our work are four-fold: through our framework, we hide the complexity of the software stack from users who would otherwise need HPC/cloud expertise; we manage job scheduling efficiently; we enable resource elasticity and workflow orchestration at large scale; and we enable moving HPC applications to the cloud. Preliminary results show that reconstruction running on 4 GPU nodes improves performance by 72%.
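To make the five-stage structure concrete, the following is a minimal sketch of how such a pipeline could be chained as a Dask task graph. Every stage function here (download_scan, segment, reconstruct, to_openvisus, publish) is a hypothetical placeholder, not the framework's published API.

```python
# Minimal sketch of the abstract's five-stage workflow as a Dask task graph.
# All stage functions are hypothetical placeholders.
import dask
from dask.distributed import Client

@dask.delayed
def download_scan(url):
    # Stage 1: fetch one ~6.0 GB X-ray tomography image.
    return url

@dask.delayed
def segment(image):
    # Stage 2: image segmentation to remove noise.
    return image

@dask.delayed
def reconstruct(image):
    # Stage 3: tomographic reconstruction (the GPU-heavy stage).
    return image

@dask.delayed
def to_openvisus(volume):
    # Stage 4: convert the result into the OpenVisus format.
    return volume

@dask.delayed
def publish(dataset):
    # Stage 5: hand the dataset to the OpenVisus streaming service.
    return dataset

if __name__ == "__main__":
    client = Client()  # local cluster here; Kubernetes-backed in production
    scans = ["scan-0001", "scan-0002"]  # stand-ins for real scan URLs
    results = [publish(to_openvisus(reconstruct(segment(download_scan(s)))))
               for s in scans]
    dask.compute(*results)
```

Expressing the stages as delayed tasks lets the scheduler overlap the download of one scan with the reconstruction of another, which matters when a single image is 6.0 GB and intermediates grow 30-fold.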
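The low layer pairs Kubernetes (resource management) with Dask (job scheduling), and the contributions claim resource elasticity. One plausible way to wire the two together, assumed here rather than taken from the talk, is dask_kubernetes' classic KubeCluster with adaptive scaling; "worker-spec.yaml" is an assumed pod specification.

```python
# Sketch: an elastic, Kubernetes-backed Dask cluster (classic
# dask_kubernetes API). One plausible realization of the low layer,
# not the authors' published configuration.
from dask.distributed import Client
from dask_kubernetes import KubeCluster

# "worker-spec.yaml" is an assumed pod spec describing the worker
# image, CPU/GPU requests, and memory limits.
cluster = KubeCluster.from_yaml("worker-spec.yaml")
cluster.adapt(minimum=2, maximum=20)  # grow/shrink workers with load
client = Client(cluster)              # submit work from, e.g., Jupyter
```

Because each Dask worker is a Kubernetes pod, adaptive scaling gives the elasticity claimed in the contributions: pods are created as the task queue grows and reclaimed when it drains.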
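For scheduling across heterogeneous GPU and CPU nodes, Dask's abstract worker resources are a natural mechanism. The sketch below assumes GPU workers were launched with --resources "GPU=1"; this is my assumption about the mechanism, not a detail stated in the abstract, and the stage functions remain placeholders.

```python
# Sketch: routing the GPU-heavy stage to GPU workers via Dask resources.
# Assumes GPU workers were started with:
#   dask-worker tcp://scheduler:8786 --resources "GPU=1"
from dask.distributed import Client

def segment(image):
    # CPU-only noise removal (placeholder).
    return image

def reconstruct(image):
    # GPU-heavy reconstruction (placeholder).
    return image

client = Client("tcp://scheduler:8786")    # assumed scheduler address
seg = client.submit(segment, "scan-0001")  # may run on any worker
recon = client.submit(reconstruct, seg,    # pinned to a GPU worker
                      resources={"GPU": 1})
print(recon.result())
```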

Files

eScience_NZhou.pdf (1.8 MB; md5:30caa83f5856a8ac90201763e1abb26c)