Dataset for Investigating Anomalies in Compute Clusters
Authors/Creators
- McSpadden, Diana (Researcher)1
- Yasir, Alanazi (Researcher)1
- Hess, Bryan (Project leader)1
- Hild, Laura (Data collector)1
- Jones, Mark (Data collector)1
- Lu, Yiyang (Researcher)2
- Mohammed, Ahmed (Researcher)1
- Moore, Wesley (Data collector)1
- Ren, Jie (Researcher)2
- Schram, Malachi (Researcher)1
- Smirni, Evgenia (Researcher)2
Description
Abstract
The dataset was collected from 332 compute nodes from May 19 to May 23, 2023. The data from May 19 - 22 characterizes normal compute cluster behavior, while May 23 includes an anomalous event. The dataset includes 8 CPU, 11 disk, 47 memory, and 22 Slurm metrics. It represents five distinct hardware configurations and contains over one million records, totaling more than 180 GB of raw data.
Background
Motivated by the goal of developing a digital twin of a compute cluster, the dataset was collected using a Prometheus server (1) scraping the Thomas Jefferson National Accelerator Facility (JLab) batch cluster. The cluster runs an assortment of physics analysis and simulation jobs: analysis workloads leverage data generated from the laboratory's electron accelerator, and simulation workloads generate large amounts of flat data that is then carved to verify amplitudes. Metrics were scraped from the cluster from May 19 to May 23, 2023. Data from May 19 to May 22 primarily reflects normal system behavior, while May 23 recorded a notable anomaly that was severe enough to require intervention by JLab IT Operations staff.
The metrics were collected from CPU, disk, memory, and Slurm. The CPU, disk, and memory metrics provide insight into the status of individual compute nodes. The Slurm metrics, collected from the network, can additionally help detect anomalies that may propagate to compute nodes executing the same job.
Usage Notes
While the data from May 19 - 22 characterizes normal compute cluster behavior, and May 23 includes anomalous observations, the dataset cannot be considered labeled data. Both the set of affected nodes and the exact start and end times of their abnormal behavior are unclear. Thus, the dataset could be used to develop unsupervised machine-learning algorithms to detect anomalous events in a batch cluster.
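As a minimal sketch of the unsupervised approach suggested above, one could build a baseline for a single metric from the normal days and score later readings against it. The example below uses a robust z-score (median and median absolute deviation), which is one common choice and not necessarily the method used in the data paper. The readings, function names, and the 3.5 threshold are all illustrative assumptions; the values are synthetic and not taken from the dataset.

```python
from statistics import median


def robust_baseline(values):
    """Return (median, MAD) for a list of baseline readings."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med, mad


def anomaly_scores(baseline, candidates, eps=1e-9):
    """Robust z-score of each candidate reading against the baseline."""
    med, mad = robust_baseline(baseline)
    # 1.4826 rescales the MAD to match the standard deviation for
    # normally distributed data; eps guards against a zero MAD.
    scale = 1.4826 * mad + eps
    return [abs(v - med) / scale for v in candidates]


def flag_anomalies(baseline, candidates, threshold=3.5):
    """Indices of candidate readings whose score exceeds the threshold."""
    scores = anomaly_scores(baseline, candidates)
    return [i for i, s in enumerate(scores) if s > threshold]


# Synthetic readings standing in for one metric on one node (hypothetical).
normal_days = [0.42, 0.40, 0.45, 0.43, 0.41, 0.44, 0.39, 0.46]
test_day = [0.43, 0.41, 0.97, 0.95, 0.44]

print(flag_anomalies(normal_days, test_day))  # indices of suspicious readings
```

Because no ground-truth labels exist, any threshold chosen this way can only be sanity-checked against the known date of the event, not validated per node.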
Files
Total size: 23.3 GB
| Name | Size |
|---|---|
| md5:ec9faef349a62c565c45fcbe8a1e9a6d | 23.3 GB |
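The md5 checksum listed for the file can be used to confirm that a 23.3 GB download completed without corruption. The following is a minimal sketch using Python's standard `hashlib`; the function names are illustrative, and the local file path must be supplied by the user, since the archive name is not assumed here.

```python
import hashlib

# Checksum as published on this record.
EXPECTED_MD5 = "ec9faef349a62c565c45fcbe8a1e9a6d"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Chunked reads keep memory use flat even for very large files.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

Calling `verify_download` with the path of the downloaded file returns `False` for a truncated or altered download, in which case the file should be fetched again.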
Additional details
Related works
- Is documented by: Data paper, arXiv:2311.16129 (arXiv)