Dataset Open Access

Antarex HPC Fault Dataset

Alessio Netti; Zeynep Kiziltan; Ozalp Babaoglu; Alina Sirbu; Andrea Bartolini; Andrea Borghesi

Citation Style Language JSON Export

{
"publisher": "Zenodo",
"DOI": "10.5281/zenodo.2553224",
"language": "eng",
"title": "Antarex HPC Fault Dataset",
"issued": {
"date-parts": [
[
2018,
10,
10
]
]
},
"abstract": "<p>The Antarex dataset contains trace data collected from the homonymous experimental HPC system located at ETH Zurich while it was subjected to fault injection, for the purpose of conducting machine learning-based fault detection studies for HPC systems. We had to acquire our own dataset because commercial HPC system operators are very reluctant to share trace data containing information about faults in their systems.</p>\n\n<p>In order to acquire data, we executed benchmark applications and at the same time injected faults in the system at specific times via dedicated programs, so as to trigger anomalies in the behaviour of the applications. The dataset covers a wide range of faults, from hardware faults to misconfiguration faults to performance anomalies caused by interference from other processes. This was achieved through the FINJ fault injection tool, developed by the authors.</p>\n\n<p>The dataset contains two types of data. The first is a series of CSV files, each containing a set of system performance metrics sampled through the LDMS HPC monitoring framework. The second is a set of log files detailing the status of the system (i.e., currently running benchmark applications or injected fault programs) at each time point in the dataset. Such a structure enables researchers to perform a wide range of studies on the dataset. Moreover, since we collected the dataset by streaming continuous data, any study based on it should be easy to reproduce online on a real HPC system. The dataset is divided into two parts: the first includes only the CPU and memory-related benchmark applications and fault programs, while the second is strictly hard drive-related. We executed each part in both single-core and multi-core variants, resulting in a total of 4 dataset blocks, 32 days of data acquisition, and 20 GB of uncompressed data.</p>\n\n<p>For a detailed analysis of the structure and features of the Antarex dataset, please refer to the research paper &quot;Online Fault Classification in HPC Systems through Machine Learning&quot;, by Netti et al. Additional details can be found in the research paper &quot;FINJ: A Fault Injection Tool for HPC Systems&quot; by Netti et al., whereas all source code can be found in the GitHub repository of the FINJ tool.</p>\n\n<p>When using this dataset, please cite the two reference papers above as follows:</p>\n\n<p>&quot;Netti A., Kiziltan Z., Babaoglu O., S&icirc;rbu A., Bartolini A., Borghesi A. (2019) FINJ: A Fault Injection Tool for HPC Systems. In: Mencagli G. et al. (eds) Euro-Par 2018: Parallel Processing Workshops. Euro-Par 2018. Lecture Notes in Computer Science, vol 11339. Springer, Cham&quot;</p>\n\n<p>&quot;Netti A., Kiziltan Z., Babaoglu O., S&icirc;rbu A., Bartolini A., Borghesi A. (2019) Online Fault Classification in HPC Systems through Machine Learning. arXiv:1810.11208&quot;</p>",
"author": [
{
"family": "Netti",
"given": "Alessio"
},
{
"family": "Kiziltan",
"given": "Zeynep"
},
{
"family": "Babaoglu",
"given": "Ozalp"
},
{
"family": "Sirbu",
"given": "Alina"
},
{
"family": "Bartolini",
"given": "Andrea"
},
{
"family": "Borghesi",
"given": "Andrea"
}
],
"id": "2553224",
"note": "The archive contains 4 directories, one for each block of the dataset: CPU/Memory and HDD, each in single-core and multi-core variants. Each directory contains the following: a 7z archive with the LDMS CSV files for each of the 7 plugins used; FINJ workloads and execution logs; histograms of the durations and inter-arrival times of fault tasks, in PDF format; launch scripts, if any. Source code for all of the injected fault programs and additional details can be found in the GitHub repository of the FINJ tool.",
"version": "1.0",
"type": "dataset",
"event": "International European Conference on Parallel Processing (Euro-Par 2019)"
}
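As a rough illustration of how the record above might be consumed, the following Python sketch parses a trimmed copy of the CSL-JSON export, converts the nested "date-parts" list into a calendar date, and builds a resolvable link from the DOI. The variable names (record_text, issued, doi_url) are hypothetical and chosen only for this example; the abstract, author, and note fields are omitted for brevity.

```python
import json
from datetime import date

# Trimmed copy of the CSL-JSON record shown above
# (abstract, author, and note fields omitted for brevity).
record_text = """
{
  "publisher": "Zenodo",
  "DOI": "10.5281/zenodo.2553224",
  "title": "Antarex HPC Fault Dataset",
  "issued": {"date-parts": [[2018, 10, 10]]},
  "version": "1.0",
  "type": "dataset"
}
"""

record = json.loads(record_text)

# CSL stores dates as nested lists; the first inner list is [year, month, day].
year, month, day = record["issued"]["date-parts"][0]
issued = date(year, month, day)

# A DOI can be turned into a link by prefixing the doi.org resolver.
doi_url = "https://doi.org/" + record["DOI"]

print(f'{record["title"]} (v{record["version"]}), issued {issued.isoformat()}')
print(doi_url)
```

The same approach should carry over to the full record, provided the abstract string is stored on a single line so that the JSON remains valid.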