Published August 5, 2024 | Version v1
Poster Open

Optimizing Workflow Performance by Elucidating Semantic Data Flow

  • 1. Illinois Institute of Technology
  • 1. Pacific Northwest National Laboratory
  • 2. Illinois Institute of Technology

Description

The combination of ever-growing scientific datasets and the complexity of distributed workflows creates I/O performance bottlenecks due to data volume, velocity, and variety. While the increasing use of descriptive data formats (e.g., HDF5) helps organize these datasets, it also introduces obscure bottlenecks by requiring the translation of high-level operations into file addresses and subsequent low-level I/O operations. To address this challenge, we propose using Semantic Dataflow Graphs to analyze (a) relationships between logical datasets and file addresses, (b) how dataset operations translate into different I/O behaviors and their performance, and (c) the time-ordered relationship of tasks and data across entire workflows. Our analysis and visualization enable the identification of performance bottlenecks and reasoning about performance improvements in workflows.
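To make the three analyses above concrete, the following is a minimal sketch of how I/O events might be aggregated into a graph linking tasks, logical datasets, and file address ranges. It is an illustration under stated assumptions, not the authors' implementation: the `IOEvent` record, its fields, the task names, and the dataset path are all hypothetical, and a real tool would obtain such events from I/O tracing rather than a hand-written list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IOEvent:
    """One low-level I/O operation attributed to a logical dataset (hypothetical record)."""
    task: str      # workflow task that issued the operation
    dataset: str   # logical dataset name, e.g. an HDF5 dataset path
    op: str        # "read" or "write"
    offset: int    # starting file address in bytes
    size: int      # bytes transferred
    start: float   # start time in seconds
    end: float     # end time in seconds


def merge_ranges(ranges):
    """Coalesce overlapping or adjacent [start, end) byte ranges."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def build_graph(events):
    """Aggregate events into directed task/dataset edges and dataset address ranges."""
    edges = defaultdict(
        lambda: {"ops": 0, "bytes": 0, "time": 0.0, "first": float("inf")}
    )
    regions = defaultdict(list)
    for e in events:
        # Writes point task -> dataset; reads point dataset -> task.
        key = (e.task, e.dataset) if e.op == "write" else (e.dataset, e.task)
        edge = edges[key]
        edge["ops"] += 1                              # (b) I/O behavior: operation count
        edge["bytes"] += e.size                       # (b) I/O behavior: data volume
        edge["time"] += e.end - e.start               # (b) time spent in I/O
        edge["first"] = min(edge["first"], e.start)   # (c) time ordering
        regions[e.dataset].append((e.offset, e.offset + e.size))  # (a) addresses
    return dict(edges), {d: merge_ranges(r) for d, r in regions.items()}


def time_ordered_edges(edges):
    """Return edges sorted by the time of their first I/O operation."""
    return sorted(edges.items(), key=lambda kv: kv[1]["first"])


# Hypothetical trace: one producer task writes a dataset, one consumer reads part of it.
events = [
    IOEvent("simulate", "/fields/temperature", "write", 4096, 1024, 0.0, 0.25),
    IOEvent("simulate", "/fields/temperature", "write", 5120, 1024, 0.25, 0.5),
    IOEvent("analyze", "/fields/temperature", "read", 4096, 512, 1.0, 1.5),
]
edges, regions = build_graph(events)
```

In this sketch, edges with many operations but few bytes would point to small, fragmented accesses behind a single high-level dataset operation, which is the kind of hidden bottleneck the description refers to.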

Files

HUG_24_Semantic_Data_Flow_Poster__Horizontal_.pdf (2.1 MB)

Additional details

Related works

Is supplemented by
Video/Audio: https://youtu.be/3dY-V4O3Mf8

Funding

Orchestration for Distributed & Data-Intensive Scientific Exploration (Office of Advanced Scientific Computing Research)
United States Department of Energy
A High-Performance Storage Infrastructure for Activity and Log Workloads (CSSI-2104013)
U.S. National Science Foundation