Published April 15, 2023 | Version v1
Dataset
Open
HotCloudPerf'23 ML Job Failures Analysis
Creators
Description
A cleaned dataset of job records collected from the SLURM workload manager.
Files
(450.8 MB)
Name | Size | MD5 |
---|---|---|
slurm_data_2022_cleaned.csv | 450.8 MB | 1f6f8ebc856cbf9602169b977a44c7fe |
Additional details
Related works
- Is compiled by: Conference paper: https://doi.org/10.1145/3578245.3584726
Software
- Repository URL: https://github.com/chuxiaoyu/2023-hotcloudperf-ml-failures
References
- Xiaoyu Chu, Sacheendra Talluri, Laurens Versluis, and Alexandru Iosup. 2023. How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster. In Companion of the 2023 ACM/SPEC International Conference on Performance Engineering (ICPE '23 Companion). Association for Computing Machinery, New York, NY, USA, 263–268. https://doi.org/10.1145/3578245.3584726